Facebook, Hadoop, LinkedIn discussed on Software Engineering Daily

…MapReduce-style jobs. LinkedIn has been developing its data infrastructure since the early days of the Hadoop ecosystem, and LinkedIn started using Hadoop early on. You know, it's ninety-five percent of what the right thing would have been had you gone for that initially, and in the meantime it spreads like a virus. And I think Yahoo then introduced Pig, which was sort of a new take on SQL. It's often described as an imperative language, but that's not true at all; it really is a declarative language with some nice escape hatches for you to insert imperative logic through UDFs and things like that. So that helped to provide people with higher-level abstractions that they could reason about.

Similarly, at Facebook they introduced Hadoop, HDFS, and MapReduce, and very quickly realized that they needed a higher-level programming interface for the majority of people at Facebook, and that's where Hive came from. It's interesting to consider why Facebook produced Hive and Yahoo produced Pig. I know for a fact that a lot of the people who worked on Hive at Facebook were ex-Oracle engineers, so I think they definitely looked at Hadoop and thought, well, what we really want here is the interface that a database provides but the scalability that Hadoop provides. And I think the people who wrote Pig were coming at it from more of a scripting-language approach, and that's why, superficially, Pig looks a lot more like an imperative scripting language than it does a declarative language like SQL.

If we take the lineage of query interfaces forward a little bit further, I think that lineage goes from Pig to Hive, and the next thing along that lineage might arguably be Presto. Would you say, when you think about these higher-level interfaces for querying large data sets in Hadoop, that Presto is the next thing in that series, in that lineage?

In a way, the lineage of Presto is a little bit different. Presto is very much an MPP database running on top of HDFS, or one that's able to access data in HDFS, so comparing it to something like Vertica or Greenplum, I think those are its ancestors in a way.

MPP as in massively parallel processing?

Correct, yeah. And I think a big distinction to make between Presto and things like Spark, Hive, and Pig is that with those latter systems, the job a user triggers runs using their identity. So whether they're able to read data on S3 or from a blob store depends on whether they, using their account, have access to that data; the code basically runs with their privileges. Whereas in a system like Presto, the whole service runs as the Presto user. It's understood that that Presto user has access to all of the underlying data, and then Presto is able to superimpose its own authorization rules on top of that. One implication of this, though, is that if you need to write your own user-defined function, you can't just add that to Presto, because that's a code-injection vector: I could very easily sneak some code in and use it to access data that I'm not supposed to be able to have access to. So when people do add UDFs to Presto, they need to be vetted to make sure that they're OK, whereas with the Hadoop and Spark models, since everything runs as the user who's actually triggering that job and inherits their permissions, you can allow them to run whatever code they want, inject whatever code they want. So in terms of velocity, of iteration speed, it's a lot faster in those other systems. If, however, you're coming at this from a database background, you're comfortable using SQL, and you can solve your problem using SQL and the standard built-in UDFs, there's no reason not to use Presto.
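To make that contrast a bit more concrete, here is a minimal PySpark sketch of the Spark/Hive-style model described above, where arbitrary user code runs with the submitting user's own permissions. The input path and column name (hdfs:///data/events, member_id) are hypothetical, and the Presto side of the comparison is only described in comments, since Presto deployments vary.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-escape-hatch").getOrCreate()

# Hypothetical dataset: the read happens with *my* credentials, so whether it
# succeeds depends entirely on whether my account can see this path.
events = spark.read.parquet("hdfs:///data/events")
events.createOrReplaceTempView("events")

def mask_member_id(member_id):
    # Arbitrary user code. In the Spark/Hive/Pig model this runs as the user
    # who submitted the job and inherits their permissions, so no separate
    # vetting step is required before shipping it with the job.
    return "member-" + str(hash(member_id) % 10000)

spark.udf.register("mask_member_id", mask_member_id, StringType())

# Declarative SQL with the imperative "escape hatch" dropped in as a UDF.
spark.sql("""
    SELECT mask_member_id(member_id) AS masked_id, COUNT(*) AS n
    FROM events
    GROUP BY mask_member_id(member_id)
""").show()

# In a Presto-style service, by contrast, the engine itself runs as a single
# service user with broad access to the data and superimposes its own
# authorization rules, so an ad hoc function like the one above would be a
# code-injection risk: UDFs there are typically vetted and deployed by
# operators, and day-to-day work stays within SQL and the built-in functions.
```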
I think I probably jumped the gun here in the evolution of Hadoop usage. So in the early days of Hadoop usage you've got an HDFS cluster, the Hadoop Distributed File System; that's the place where you're storing all your data, the quote-unquote data lake. And that HDFS usage is still pretty prevalent today; people are still largely running HDFS clusters, although the large distributed bucket storage systems have displaced some of that HDFS usage if somebody is on AWS S3 or Azure Blob Storage. But the query layer has changed over time, whether we're talking about Hadoop MapReduce being abstracted into Pig or Hive, or being displaced by something like Apache Spark. And, if I understand the usage of Apache Spark, people will pull distributed working sets into a distributed memory system and then they'll query those working sets that are in memory, and now you can have this ad hoc data science process, a little bit more of an interactive workflow than the batch Hadoop workflow. My sense is that it's the Spark usage that really changed how people saw big data processing. Can you give me your perspective on how Spark changed the data landscape?

Yes. So for starters, Spark is more efficient by caching data in memory, as opposed to materializing it to disk between every stage like MapReduce does. But I'd like to just say that as soon as MapReduce was released as an open-source project, and my guess would be even before it was released as an open-source project, the people working on it knew about all of these tricks. They knew that forced materialization between MapReduce stages and different jobs was a performance problem, but they erred on the side of making sure, you know, that they could recover if a job failed, because they were optimizing for these very long-running jobs.

Just to clarify what you're talking about here: this is the fact that in MapReduce you're often doing these three operations, map, shuffle, reduce, and you're checkpointing data at each of these operations, which makes it a costly series of operations, right? So if you have a really long-running map task, it makes sense to checkpoint before you hit reduce.

That's right, you don't want one of those transfers to fail and then have to go and repeat that work. But if you're talking about, let's say, a fast query which looks at a little bit of data, that forced materialization actually accounts for most of the time; it's something that slows you down, definitely, at every stage. Another issue was that I think everyone understood immediately that only allowing one reduce stage per map job was a problem. If you wanted to group by multiple keys, for example, you'd have to chain together multiple MapReduce jobs, even though you really didn't need those map stages in between. So what people really wanted was an M-R* model, where you could have multiple reduce stages per map stage.
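As a rough sketch of both of those points, the in-memory caching and the ability to hang several reduce-style stages off one job, here is a hedged PySpark example; the dataset path and column names (page_views, day, page, member_id) are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-stage-aggregation").getOrCreate()

# Hypothetical page-view data.
views = spark.read.parquet("hdfs:///data/page_views")

# Keep the filtered working set in executor memory instead of materializing
# it to disk between stages the way chained MapReduce jobs would.
recent = views.filter(F.col("day") >= "2019-01-01").cache()

# Two different group-bys over the same cached working set. With classic
# MapReduce this would be separate jobs, each with its own map stage and a
# single reduce stage; here it is one application whose DAG contains several
# shuffle (reduce-like) stages reusing the in-memory data.
by_page = recent.groupBy("page").agg(F.count("*").alias("views"))
by_member = recent.groupBy("member_id").agg(
    F.countDistinct("page").alias("pages_viewed"))

by_page.show()
by_member.show()
```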
I think, paradoxically though, since Hadoop and MapReduce were such a success right off the bat, it meant that the project instantaneously had a very large community, both of users as well as developers, and finding consensus within that community became a much bigger problem. So while everyone was united in understanding that these were the next steps for the project, exactly how to accomplish that became a problem. Similarly, users definitely validate a project, but they also make it very hard for you to evolve APIs, and I think with APIs you're always going to make a handful of mistakes that you want to eliminate later on, but the more users you have, the harder it is to do that. So I think the people who started the Spark project had the advantage of looking at Hadoop, recognizing the problems that Hadoop had, and then starting from a clean slate. Without the baggage of a large user base and a large community, they were able to iterate very quickly and produce something, so in a way you can view Spark as, like, a version two. Another nice thing about Spark is that the APIs are very elegant, very clean; it's very easy to build application verticals on top of Spark in a manner that isn't really possible with just MapReduce.

Another thing that I think has really benefited Spark, at least in terms of that project's ability to iterate quickly, is the fact that it's still located in a single code repository and it's a single project that governs it. Whereas with the Hadoop model, it started as a single project, but then you had projects spinning out of it; Hive and Pig both started as Hadoop projects, but then they spun out into their own projects. And what that really ended up doing was creating a problem that vendors then had to solve. All of these projects had their own release cycle; they were not necessarily tested, or integration-tested, against the version that you would want. So if you weren't using a vendor distribution, your first task was to figure out, well, for this version of Hive, what version of Hadoop do I need, what version of Pig do I need, and how does this all fit together? Because you could count on the fact that no one from these individual projects had done any integration testing for you. So that really was the value that vendors added, at least in the early days; they were just cleaning up this mess that had been created. I don't really know what the motivations for doing that were, but I think that Spark has avoided those problems just by keeping everything in a single repository.

Let's fast forward to today. You work on managing Hadoop infrastructure at LinkedIn; is that your day-to-day job?
Yeah, I would say big data infrastructure in general, even though I don't really like that term.

Why don't you like that term?

Because, you know, it's a buzzword; I think people use it too frequently. But the sad thing is I can't think of a better term to use.

What does your day-to-day consist of these days?

I'm focusing most of my time on a project called Dali that we've been working on for several years now. Our goal with Dali is really to combine the best aspects of a relational database with the best aspects of the big data ecosystem. So we want to leverage abstractions from the database world, things like tables and views, the decoupling between your logical view of data and the underlying physical details, and combine these with the freedom and flexibility that people are used to in the big data ecosystem. So with big data I have the option of using more than just SQL to analyze data. I have the ability to leverage different file formats, depending on which format has the best support for the engine that I'm using or the best performance for the query that I'm running. Similarly, I'm able to take advantage of different storage layers and swap those out. So in effect we're trying to introduce a level of indirection at each one of these layers, but also provide the abstractions necessary to really decouple implementation details from APIs, with the goal both of making things simpler for users, so that they don't have to worry about physical details that they shouldn't really need to know about in the first place, and also of making it easier for the people who provide the infrastructure to make changes underneath without disrupting what's happening above.

When you say "user," you're talking about an internal application developer at LinkedIn, correct? Give an example of an application. Let's say maybe I'm building a dashboard, or I'm building some kind of reporting system within LinkedIn; I'm sure you have a prototypical example in your head. Tell me the problems that Dali would solve for a prototypical application.

So I think one thing it solves is the discoverability problem. We have a data set catalog that's searchable, and it's typically easier, I think, to discover things in that catalog than it would be just looking at paths in HDFS. It also allows you to search by column names and annotations and things like that. So just the process of discovering data sets, as well as understanding who produces a data set and what the contract is between you, the consumer, and the producer, is another thing that is aided there. Another thing is that we are able, over time, to improve the performance of a data set by looking at the queries that people are running against it and using those to inform how we partition the data set, or the format that we encode the data set in, and things like that. So I think efficiency is one thing. Another thing that Dali provides, and I don't think any other system at this point has yet provided it, is the ability for a data set owner to decouple the API of their data set, in other words the schema of the data set, from the actual implementation of the data set. So over time I can evolve the schema without requiring that consumers migrate in lockstep with me, because they have this level of indirection between the schema that they're consuming, which is provided by the view on top of the data set, and the schema of the data that I'm actually materializing on HDFS. Otherwise, when I make a change to the schema, it becomes instantaneously visible to everyone who's consuming it. I think probably the best way of explaining this is to say that if you think of data sets as services which have an API, this allows us to support multiple APIs on top of the same data set, in a manner similar to how a service can have a v1, v2, v3 of the API that it presents to its clients.
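As a hedged illustration of that "multiple APIs over one data set" idea, and not of Dali's actual implementation, here is a sketch in Spark SQL where two versioned views present different schemas over the same physical table; the table name profiles_raw and its columns are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("versioned-dataset-views").getOrCreate()

# Hypothetical physical table: the materialized schema the producer controls.
spark.sql("""
    CREATE TABLE IF NOT EXISTS profiles_raw (
        member_id BIGINT,
        first_nm  STRING,
        last_nm   STRING,
        country   STRING
    ) USING parquet
""")

# v1 of the data set's "API": the schema that early consumers coded against.
spark.sql("""
    CREATE OR REPLACE VIEW profiles_v1 AS
    SELECT member_id, first_nm AS first_name, last_nm AS last_name
    FROM profiles_raw
""")

# v2 reshapes and extends the schema without forcing v1 consumers to migrate.
spark.sql("""
    CREATE OR REPLACE VIEW profiles_v2 AS
    SELECT member_id,
           concat(first_nm, ' ', last_nm) AS full_name,
           country
    FROM profiles_raw
""")

# The physical layout (format, partitioning, even column names) can now change
# underneath, as long as both views can still be expressed over it.
```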
And by the way, is it Dali or Dolly? D-O-L-L-Y or D-A-L-I?

D-A-L-I, like the artist. So there's a history of using "li" as a suffix on the names of projects at LinkedIn. It originally started as data access at LinkedIn, but now it doesn't stand for anything, because we've gone beyond just data access to other things like data management, data catalog, discovery, things like that.

OK, what was the motivation for starting this project?

Well, I think one big motivation was simply to make things easier for people developing applications on top of Hadoop and Spark, by hiding details from them that they shouldn't need to worry about anyway, things like what file format a data set is using, or which cluster a data set is stored on, or how that data set is partitioned, but also to give the people who are running that cluster and managing that data the ability…
