LinkedIn Data Platform with Carl Steinbach

Automatic TRANSCRIPT

LinkedIn is a social network with petabytes of data. In order to store that data, LinkedIn distributes and replicates it across a large cluster of machines running the Hadoop Distributed File System, and in order to run calculations across that large data set, LinkedIn needs to split the computation up, whether for model training for machine learning or for analytics.

LinkedIn has been developing its data infrastructure since the early days of the Hadoop ecosystem. The company started using Hadoop back in 2008, near the beginning of the age of modern data infrastructure, and in the last eleven years it has adopted streaming frameworks, distributed databases, and newer execution runtimes like Apache Spark. With the popularization of machine learning there are more applications for data engineering than ever before, but the state of the tooling means that it is still hard for developers to find data sets, clean their data, and build reliable models.

Carl Steinbach is an engineer at LinkedIn working on tools for data engineering. In today's episode, Carl discusses the data platform inside LinkedIn and the strategies the company has developed around storing and computing large amounts of data. Full disclosure: LinkedIn is a sponsor of Software Engineering Daily.
Carl Steinbach, welcome to Software Engineering Daily.

Thank you very much, it's great to be here.

One interesting thing about LinkedIn is that there's this division between online serving systems, nearline streaming systems, and offline batch systems. Give me a brief history of how data infrastructure at LinkedIn has evolved.

Yeah, well, let's start with the introduction of Hadoop back in 2008. LinkedIn had a big problem at that point in time. The site had a feature called People You May Know, which suggested to members, you know, here's someone that you should consider connecting with, and it was very clear based on metrics that this was the engine of growth for LinkedIn at that point in time. When they first introduced this feature they were using Oracle to actually build the index for People You May Know, using a process called triangle closing: if I know you, and you know Stephen sitting over here, chances are I also may know Stephen, so that would be a good suggestion. When the feature was first introduced it took about a day to build this index on Oracle, but it was such a successful feature and drove so much growth towards the site that pretty soon it was taking upwards of a week to complete a single run, and at that point the results were becoming increasingly stale. Scaling Oracle wasn't really an option, so people started looking around for a better approach, something that could be horizontally scaled. Hadoop was a huge success: they were able to bring the time required to generate this index down to just a couple of hours, and they found that by adding more machines they could bring it down even lower. That first big success then inspired other people to start leveraging Hadoop for other problems — we started using it to build search indexes, we started using it for feature generation. Incidentally, Jay Kreps was heavily involved in this; he's the guy who later went on to start the Kafka project. So the application of Hadoop to the People You May Know problem, and to things like that, meant that by the time I arrived Hadoop had really cemented its position within LinkedIn's broader data infrastructure, and pretty soon it got to a point where, within the company, Hadoop was just known as the offline batch analytics solution.
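To make the triangle-closing heuristic concrete, here is a minimal Python sketch. It is only an illustration of the idea described above, not LinkedIn's actual People You May Know pipeline, which ran as batch jobs over the full graph on Oracle and later on Hadoop; the function and variable names are hypothetical.

```python
from collections import defaultdict

def triangle_closing_suggestions(connections, member, top_n=5):
    """Suggest second-degree connections for `member`, ranked by the
    number of shared first-degree connections (closed triangles).

    `connections` maps a member id to the set of ids they are directly
    connected to.
    """
    scores = defaultdict(int)
    for friend in connections.get(member, set()):
        for friend_of_friend in connections.get(friend, set()):
            # Skip the member themselves and people they already know.
            if friend_of_friend == member or friend_of_friend in connections[member]:
                continue
            scores[friend_of_friend] += 1  # one more common connection
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Tiny example: Alice knows Bob and Carol; both know Stephen,
# so Stephen is the strongest "people you may know" candidate for Alice.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "stephen"},
    "carol": {"alice", "stephen", "dave"},
    "stephen": {"bob", "carol"},
    "dave": {"carol"},
}
print(triangle_closing_suggestions(graph, "alice"))  # ['stephen', 'dave']
```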
When you joined, was there a standardized process by which somebody stood up a Hadoop cluster or was given access to Hadoop resources, or were people just spinning up clusters?

When I joined LinkedIn in 2013, one of the first things I did was to go to Jira and search for the first mention of Hadoop that I could find, and I found a couple of tickets where people were discussing how to set up the first cluster and discussing the first use case, which was People You May Know. By the time I joined it was well established that we had one large development cluster and one large production cluster, and Hadoop was where all of the data was stored: all of the tracking data, all of the derived data sets, and things like that. There was a process for getting your job promoted from development over to production, and since at the time it was hard to isolate one job from another, it was important to make sure that the jobs running on the production cluster were well behaved and had been vetted. A Hadoop operations team had been built out, and we inherited the model that Yahoo had been following for running these clusters: big clusters as opposed to many small ones.

Was there a distinct point at which it went from being a single large cluster to — either that single large cluster getting so big that you had to instantiate additional clusters, or many smaller clusters?

The model at LinkedIn has always been fewer, larger clusters over many smaller clusters. That simplifies management, because you have fewer things to manage, fewer services to manage at least, and it's also good from a data standpoint, because you can just go to one cluster and know that all of your data is there, as opposed to having to think, okay, this data set may only exist on that one small cluster. Many small clusters also cause other problems: discoverability suffers, it becomes very hard to join data sets that are located on different clusters, and you can't leverage data locality anymore. I don't know the exact timeline for when a second or third cluster was introduced, but this would be around the 2008 to 2009 timeframe, at a time when there were no vendors out there that you could consult — I think Tom White published his Hadoop book through O'Reilly maybe around 2010 — so the only place that you could go for help about how to deploy a large Hadoop cluster, or how to manage it, would be the Apache mailing lists or meetups. I think the easiest thing then for the people at LinkedIn, if they could figure out a way to do so, was to recruit some of the people from Yahoo who had been working on Hadoop, and that's how the original team here came together. I shouldn't say "we"; the people at LinkedIn at that time recruited folks from Yahoo.

That sounds like a perennial problem for people building data infrastructure, and for people working as data analysts or data scientists, in those nascent days of Hadoop infrastructure. In an interview I did a while ago with Tomasz Tunguz, who wrote a book called Winning with Data — I've alluded to this book a couple of times — he talks about what he termed data breadlines, where to get data, to get the nightly report back, you had to have a data scientist go and write a custom Hadoop job just for you.

Yeah, that sounds very familiar, and I think it also points to a larger issue, which we created Dr. Elephant to help address: this inherent tension between developer productivity and infrastructure efficiency. LinkedIn was not alone in this problem — even before the open source Hadoop ecosystem, the same problem manifested at Google. There are data breadlines, basically the queue that you're referring to, where you have to get into some kind of queue just to get your job to run. In our case, the process of vetting individual flows became a major problem: people would sometimes wait for up to a month for someone to sit down and review their job, and in some cases they were just told, okay, well, you have to fix this and then get back in the queue and wait another month. So one of the things that I did after joining LinkedIn was to try to figure out if we could automate that process. That resulted in Dr. Elephant, which is a service that we run that looks at the exhaust, basically the logs, from the Hadoop jobs. It applies heuristics to those logs and uses them to diagnose performance pathologies: things like skew in terms of tasks, too much memory, too little memory, and stuff like that. Very quickly we got to a point where we could take the human reviewer completely out of the loop, and we were able to, let's say, promote to production probably eighty percent of things; for the cases that we couldn't promote immediately, we were able to offer actionable advice to the owner of the workflow — you know, these are the things that you need to change in order to get a green signal from Dr. Elephant. These are people who have better things to worry about, right? They're machine learning experts or they're analytics experts, and to require that they spend a month or maybe even a year getting up to speed with all of the intricacies of these systems would severely impact their productivity. We care about infrastructure efficiency, but we don't want to require that everyone who uses Hadoop or Spark needs to be an expert in these systems.
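As a rough illustration of the kind of heuristics Carl describes, here is a toy skew and memory check in Python. The record shapes and thresholds are invented for the example; the real Dr. Elephant (which LinkedIn open sourced) pulls task-level counters from the Hadoop and Spark history servers and runs a pluggable set of heuristics over them.

```python
from statistics import median

def check_task_skew(task_runtimes_sec, skew_ratio=3.0):
    """Toy heuristic: flag a job whose slowest task runs far longer than
    the median task, a common sign of skewed input partitions."""
    med = median(task_runtimes_sec)
    worst = max(task_runtimes_sec)
    if med > 0 and worst / med >= skew_ratio:
        return ("RED",
                f"Slowest task took {worst}s vs median {med}s; "
                "consider repartitioning or a skew-aware join.")
    return ("GREEN", "Task runtimes look balanced.")

def check_memory(requested_mb, peak_used_mb, waste_threshold=0.5):
    """Toy heuristic: flag containers that request far more memory than
    they ever use, which wastes cluster capacity."""
    if peak_used_mb < requested_mb * waste_threshold:
        return ("YELLOW",
                f"Peak usage {peak_used_mb}MB is well under the "
                f"{requested_mb}MB requested; lower the container size.")
    return ("GREEN", "Memory request looks reasonable.")

# Example "exhaust" from one hypothetical MapReduce job.
print(check_task_skew([40, 42, 38, 41, 300]))
print(check_memory(requested_mb=8192, peak_used_mb=1500))
```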
It sounds like Dr. Elephant is a prime example of that. When did the Hadoop infrastructure problems at LinkedIn get alleviated to the point where it was much easier for people to run their jobs in an autonomous fashion without being blocked? Was Dr. Elephant the thing that just solved this, or was it more of a progression of additional solutions that alleviated this issue?

So I think Dr. Elephant helped us solve that tension between personal productivity and infrastructure efficiency, but there were other things that also helped people become more productive. When Hadoop was first introduced, your programming API was MapReduce, which is, I think, the assembly language for data processing, and writing it is essentially handwriting assembly. If you're writing MapReduce it means you have to compile your code, you have to deploy the jar files, all of that stuff, so it wasn't very productive for people who were coming from more of a database background, or for people who wanted to do ad hoc queries. So while Hadoop was still incubating at Yahoo, they introduced Pig, which was sort of a new take on SQL. It often gets described as an imperative language, but that's not true at all — Pig really is a declarative language with some nice escape hatches for you to insert imperative logic through UDFs and things like that. The people who wrote Pig were coming at it from more of a scripting-language approach, and that's why, superficially, Pig looks a lot more like an imperative scripting language than it does a declarative language like SQL. Similarly, at Facebook, they introduced Hive on top of Hadoop, HDFS, and MapReduce. So that helped to provide people with higher-level abstractions that they could reason about.

I think this is another interesting example of the concept of "worse is better." Richard Gabriel wrote this really interesting essay in the late eighties — I think it became well known in the early nineties — describing this concept of worse is better, to explain why Lisp had sort of failed in the market whereas C and C++ were doing really well. He identified what he described as the MIT school, where everything has to be, in a sense, perfect — the API should be simple, and it's acceptable to push complexity over to the implementation in order to keep the API simple — versus the New Jersey school, where New Jersey was a stand-in for Bell Labs, which is the approach of saying, well, actually, we're willing to push complexity and responsibilities over to the user for the sake of keeping the implementation very simple. The benefit of worse-is-better is that you're able to get the thing out very quickly, you're able to innovate on it and make it better over time to a point where it's ninety-five percent of what the right thing would have been had you gone for that initially, and in the meantime it spreads like a virus. I think Hadoop is a really good example of the worse-is-better philosophy.
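One small way to see the "assembly language" point: the same word count written as explicit map, shuffle, and reduce phases versus the one-line declarative query you would hand to Hive. The Python below only simulates the phases in memory; on a real cluster each phase would be a distributed stage, and the SQL in the final comment is ordinary HiveQL over an assumed `words` table.

```python
from collections import defaultdict

# --- "Assembly": hand-written map / shuffle / reduce over a few lines ---
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)          # emit (key, value) pairs

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:                 # group values by key,
        groups[key].append(value)            # as the framework would
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

# --- Declarative equivalent in HiveQL (what Hive compiles down to):
# SELECT word, COUNT(*) FROM words GROUP BY word;
```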
If we take the lineage of query interfaces forward a little bit further — and I may be jumping the gun here in the evolution of Hadoop usage — I think that lineage goes from Pig to Hive, and the next thing along it might arguably be Presto. Thinking about these higher-level interfaces for querying large data sets in Hadoop, is Presto the next thing in that series, in that lineage?

The lineage of Presto is a little bit different. With Pig and Hive, you've got an HDFS cluster — the Hadoop Distributed File System — and that's the place where you're storing all your data, the quote-unquote data lake, and Facebook very quickly realized that they needed a higher-level programming interface for the majority of people there, and that's where Hive came from. Why did Facebook produce Hive and Yahoo produce Pig? I know for a fact that a lot of the people who worked on Hive at Facebook were ex-Oracle engineers, so they definitely looked at Hadoop and thought, what we really want here is the interface that a database provides with the scalability that Hadoop provides — otherwise it's like saying, I don't have a database, so instead I have to write the query plan by hand, and if I want to optimize it I have to go and retrieve the statistics myself. Presto, on the other hand, is very much an MPP database running on top of HDFS, or at least able to access data in HDFS — it's interesting to consider its ancestors, and comparing it to something like Vertica or Greenplum, I think those are its ancestors in a way, massively parallel databases. If you're coming at this from a database background and you're comfortable using SQL, and you can solve your problem using SQL and the standard built-in UDFs, there's no reason not to use Presto.

A big distinction to make between Presto and things like Spark, Hive, and Pig, though, is the security model. With those latter systems, the job runs as the user who triggers it and inherits their permissions — a user triggers runs using their own identity, so whether they're able to read data on S3 or from a blob store depends on whether their account has access to that data. The code basically runs with their privileges, and you can allow them to run whatever code they want, inject whatever code they want, so in terms of velocity and iteration speed it's a lot faster. Whereas in a system like Presto, the whole service runs as the Presto user. It's understood that the Presto user has access to all of the underlying data, and then Presto is able to superimpose its own authorization rules on top of that. One implication of this is that if you need to write your own user-defined function, you can't just add it to Presto, because that's a code-injection vector — I could very easily sneak some code in and use it to access data that I'm not supposed to have access to. So UDFs added to Presto need to be vetted to make sure that they're OK.
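To make the two execution models concrete, here is a toy Python sketch, not the actual Presto implementation: in the "service-owns-the-data" model a single principal reads everything and superimposes its own ACL check before running anything on behalf of a caller, whereas in the "run-as-the-user" model the submitted code simply executes with whatever permissions the caller already has. All names here are hypothetical.

```python
class QueryService:
    """Toy 'Presto-style' service: it owns all data access itself and
    enforces its own table-level ACLs, so user code never runs inside it."""

    def __init__(self, tables, acls):
        self.tables = tables   # table name -> rows (the service can read everything)
        self.acls = acls       # user -> set of tables they may query

    def run_query(self, user, table):
        if table not in self.acls.get(user, set()):
            raise PermissionError(f"{user} may not read {table}")
        return self.tables[table]   # executed as the service, after the ACL check


def run_as_user(user_permissions, user, job):
    """Toy 'Hadoop/Spark-style' model: arbitrary user code runs with the
    submitting user's own credentials, so it sees exactly what they can see."""
    readable = user_permissions.get(user, {})
    return job(readable)   # the job can do anything with data the user can read


tables = {"page_views": ["row1", "row2"], "salaries": ["secret"]}
service = QueryService(tables, acls={"ana": {"page_views"}})
print(service.run_query("ana", "page_views"))   # allowed by the service's ACL
# service.run_query("ana", "salaries") would raise PermissionError

print(run_as_user({"ana": {"page_views": tables["page_views"]}},
                  "ana", lambda data: len(data["page_views"])))
```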
And that HDFS usage is still pretty prevalent today — people are still largely running HDFS clusters, although the large distributed bucket storage systems have displaced some of that HDFS usage if somebody is on AWS S3 or Azure Blob Storage, even if the query layer they're using stays the same. The other shift is MapReduce being displaced by something like Apache Spark. If I understand the usage of Apache Spark, people will pull working sets into a distributed memory system and then query those working sets that are in memory, so you can have this ad hoc data science process on big data, and my sense is that Spark really changed how people saw big data processing. Can you give me your perspective on how Spark changed the data landscape?

Yes. For starters, I'd like to say that as soon as MapReduce was released as an open source project — and my guess would be even before it was released as an open source project — the people working on it knew about all of these tricks. They knew that the forced materialization between MapReduce stages and between different jobs was a performance problem, but they erred on the side of making sure that they could recover if a job failed, because they were optimizing for these very long-running jobs.

Just to clarify what you're talking about here: this is the fact that in MapReduce you're often doing these three operations — map, shuffle, reduce — and you're checkpointing data at each of these operations, which makes it a costly series of operations. So if you have a really long-running map task, it makes sense to checkpoint before you hit reduce.

That's right — you don't want one of those transfers to fail and then have to go and repeat that work.

But if you're talking about, let's say, a fast query which looks at a little bit of data — a little bit more of an interactive workflow than the batch Hadoop workflow — did you find that that forced materialization actually accounts for most of the time, or is it just something that slows you down?

Definitely — at every stage. It is more efficient to cache data in memory, as opposed to materializing it to disk between every stage the way MapReduce does. Another issue was that only allowing one reduce stage per map job was a problem: if you wanted to group by multiple keys, for example, you'd have to chain together multiple MapReduce jobs even though you really didn't need those map stages in between. What people really wanted was an M-R* model, where you could have multiple reduce stages per map stage.
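A hedged PySpark sketch of the pattern described here: pull a working set into memory once, then run several aggregations against it without re-reading from disk between stages the way chained MapReduce jobs would. The path and column names are placeholders, not LinkedIn's actual data sets.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("working-set-demo").getOrCreate()

# Read once from the data lake (path and schema are hypothetical).
events = spark.read.parquet("/data/tracking/page_views")

# Keep the filtered working set in cluster memory; MapReduce would instead
# write intermediate results to HDFS between every map/reduce stage.
working_set = events.filter(F.col("country") == "US").cache()

# Several "reduce" steps over the same cached data, with no re-scan of the raw files.
by_page = working_set.groupBy("page").count()
by_member = working_set.groupBy("member_id").agg(F.countDistinct("page"))

by_page.show()
by_member.show()
```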
Paradoxically, though, since Hadoop and MapReduce were such a success, right off the bat the project instantaneously had a very large community, both of users and of developers, and finding consensus within that community became a much bigger problem. Users definitely validate a project, but they also make it very hard for you to evolve APIs — with APIs you're always going to make a handful of mistakes that you want to eliminate later on, and the more users you have, the harder it is to do that. So while everyone was united in understanding that these were the next steps for the project, exactly how to accomplish them became a problem. The people who started the Spark project had the advantage of looking at Hadoop, recognizing the problems that Hadoop had, and then starting from a clean slate without the baggage of a large user base and a large community, so they were able to iterate very quickly and produce something.

So do you view Spark as, like, a version two?

Another nice thing about Spark is that the APIs are very elegant, they're very clean, and it's very easy to express MapReduce-style jobs, so people have built whole application verticals on top of Spark in a manner that just isn't really possible with MapReduce alone. Another thing that I think has really benefited Spark, at least in terms of its ability to integrate quickly, is the fact that it's still located in a single code repository and it's a single project that governs it. With the Hadoop model, it started as a single project but then you had projects spinning out of it — Hive and Pig both started as Hadoop projects, but then they spun out into their own projects — and all of these projects had their own release cycle; they were not necessarily tested, or integration tested, against the versions that you would want. So if you weren't using a vendor distribution, your first task was to figure out, well, for this version of Hive, what version of Hadoop do I need, what version of Pig do I need, how does this all fit together — because you could count on the fact that no one from these individual projects had done any integration testing for you. That really was the value that the vendors provided: the ecosystem was creating a problem that vendors then had to solve, and at least in the early days they were just cleaning up this mess that had been created. I don't really know what the motivations were for splitting things up that way, but I think that Spark has avoided those problems just by keeping everything in a single repository.

Let's fast forward to today. You work on managing Hadoop infrastructure at LinkedIn — is that your day-to-day job?
Yeah, I would say big data infrastructure in general, even though I don't really like that term.

Why don't you like the term?

Because, you know, it's a buzzword; I think people use it too frequently.

I'm sure, but the sad thing is I can't think of a better term to use. What does your day-to-day consist of these days?

I'm focusing most of my time on a project called Dali, which is a really rich project we've been working on for several years now.

By the way, is it Dali or Dolly — D-O-L-L-Y or D-A-L-I?

D-A-L-I, like the artist. There's a history of using "-li" as a suffix on the names of projects at LinkedIn, and it originally started as "Data Access at LinkedIn." At this point it doesn't really stand for anything, because we've gone beyond just data access to other things like data management, data catalog, discovery, things like that.

What was the motivation for starting this project?

Well, I think one big motivation was simply to make things easier for people developing applications on top of Hadoop and Spark, by hiding details from them that they shouldn't need to worry about anyway — things like what file format a data set is using, which cluster a data set is stored on, or how that data set is partitioned — but also to give the people who are running that cluster and managing that data the ability to make changes behind the scenes without impacting them. One interesting thing is that usually, when you find a situation where an API is more complicated than it needs to be, or where developers have to manage more details than they should, it's because the people who are providing that API are slacking off — they're trying to make things easier for themselves. But in this situation it was bad for both sides: it makes it very hard for infrastructure providers to make any change without talking to every user who's going to be impacted and helping them to migrate. So we really want to improve the velocity for both sides.

When you say "user," you're talking about an internal application developer at LinkedIn?

Correct.

Give me an example of an application — let's say maybe I'm building a dashboard, or I'm building some kind of reporting system within LinkedIn. If you have a prototypical example in your head, tell me the problems that Dali would solve for a prototypical application.

Our goal with Dali is really to combine the best aspects of a relational database with the best aspects of the big data ecosystem. As far as inspirations go, what we were really looking at was a conventional database, where you have this nice separation between the view that someone writing a query sees and the underlying physical details: if I'm using MySQL or Postgres, it's irrelevant to me how MySQL or Postgres encodes a table or how they write it to disk. With big data, on the other hand, I have the option of using more than just SQL to analyze data, I have the ability to leverage different file formats depending on which format has the best support for the engine that I'm using or the best performance for the query that I'm running, and similarly I'm able to take advantage of different storage layers and swap those out. So we want to leverage abstractions from the database world — things like tables and views — to decouple your logical view of data from the underlying physical details, and combine this with the freedom and flexibility that people are used to in the big data ecosystem. In effect we're trying to introduce a level of indirection at each one of these layers, decoupling implementations from APIs, with the goal both of making things simpler for users, so that they don't have to worry about physical details that they shouldn't really need to know about in the first place, and of making it easier for the people who provide the infrastructure to make changes underneath without disrupting what's happening above.
I think probably the best way of explaining this is to say that if you think of data sets as services which have an API, Dali allows us to support multiple APIs on top of the same data set, in a manner similar to how a service can have a v1, v2, and v3 of the API that it presents to clients. I don't think any other system at this point has yet provided the ability for a data set owner to decouple the API of their data set — in other words, the schema of the data set — from the actual implementation of the data set. Over time I can evolve the schema without requiring that consumers migrate in lockstep with me, because they have this level of indirection between the schema that they're consuming, which is provided by the view on top of the data set, and the schema of the data set that I'm actually materializing on HDFS. Otherwise, when I make a change to the schema, it becomes instantaneously visible to everyone who's consuming it. The same goes for location: if you reference a path directly in your client, then you have a hard dependency on where that data set physically sits in your data center, whereas by adding the additional level of indirection we have instead a data set view. Over the next couple of years we're enforcing a policy where all public data must only be accessible using Dali APIs, just to make sure that we don't find those hard dependencies creeping back in.

Dali also helps because it solves the discoverability problem. We have a data set catalog that's searchable, and it's typically easier to discover things in that catalog than it would be by just looking at paths in HDFS — it also allows you to search by column names and annotations and things like that — so it helps with the process of discovering data sets, as well as understanding who produces a data set and what the contract is between you, the consumer, and the producer. And then there's efficiency: we're able to improve the performance of a data set over time by looking at the queries that people are running against it and using those to inform how we partition the data set, the format that we encode the data set in, and things like that. So efficiency is one thing Dali provides, and the ability to make changes without disrupting consumers is another.
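Dali's actual APIs aren't shown in this episode, but the decoupling Carl describes maps closely onto ordinary SQL views, which the hedged Spark SQL sketch below illustrates: consumers query a versioned view rather than the physical files, so the owner can re-partition the data or evolve the schema while keeping the old view's contract intact. All paths, table names, and columns here are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-views-demo").getOrCreate()

# The physical data set: its location, format, and partitioning are details
# the owner is free to change later without breaking consumers of the views.
spark.read.parquet("/data/derived/member_profile/daily").createOrReplaceTempView(
    "member_profile_raw")

# v1 of the "API": the schema consumers originally coded against.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW member_profile_v1 AS
    SELECT member_id, headline, industry
    FROM member_profile_raw
""")

# v2 adds a derived column and renames a field, but v1 keeps working,
# so consumers migrate when they choose rather than in lockstep.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW member_profile_v2 AS
    SELECT member_id, headline AS title, industry,
           CASE WHEN industry = 'Software' THEN TRUE ELSE FALSE END AS is_tech
    FROM member_profile_raw
""")

spark.sql("SELECT * FROM member_profile_v1 LIMIT 5").show()
```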
