LinkedIn, Carl Steinbach, Engineer, discussed on Software Engineering Daily
LinkedIn is a social network with petabytes of data. In order to store that data, LinkedIn distributes and replicates it across a large cluster of machines running the Hadoop Distributed File System. In order to run calculations across its large data set, LinkedIn needs to split the computation up using MapReduce. LinkedIn started using Hadoop in 2008, and in the last eleven years the company has adopted streaming frameworks, distributed databases, and newer execution runtimes like Apache Spark. With the popularization of machine learning, there are more applications for data engineering than ever before, but the tooling around data engineering means that it is still hard for developers to find data sets, clean their data, and build reliable models. Carl Steinbach is an engineer at LinkedIn working on tools for data engineering. In today's episode, Carl discusses the data platform inside LinkedIn and the strategies that the company has developed around storing and computing large amounts of data. Full disclosure: LinkedIn is a sponsor of Software Engineering Daily.

So, Carl Steinbach, welcome to Software Engineering Daily.

Thank you very much, it's great to be here.

So LinkedIn started using Hadoop back in 2008. This was near the beginning of the age of modern data infrastructure. Give me a brief history of how data infrastructure at LinkedIn has evolved.

Yeah, well, let's start with the introduction of Hadoop back in 2008. So LinkedIn had a big problem at that point in time. The site had a feature called People You May Know, which suggested to members, you know, here's someone that you should consider connecting with, and it was very clear based on metrics that this was the engine of growth for LinkedIn at that point in time. When they first introduced this feature, they were using Oracle to actually build the index for People You May Know, using a process called triangle closing. So if I know you, and you know Stephen sitting over here, chances are I also may know Stephen, so that would be a good suggestion to, you know, show
in my feed. So when the feature was first introduced, it took a day to build this index on Oracle, but it was such a successful feature, it drove so much growth towards the site, that pretty soon it was taking upwards of a week to complete a single run. At that point the results were becoming increasingly stale, and scaling it wasn't really an option. So people started looking around for a better approach, something that could be horizontally scaled. When I joined LinkedIn in 2013, one of the first things I did was to go to Jira and search for the first mention of Hadoop that I could find, and I found a couple of tickets where people were discussing how to set up the first cluster and discussing the first use case, which was PYMK. Incidentally, Jay Kreps was heavily involved in this, you know, and he's the guy who later went on to start the Kafka project. So the application of Hadoop to the PYMK problem was a huge success: they were able to bring the time required to generate this index down to just a couple of hours, and they found that by adding more machines they could bring it down even lower. That first big success then inspired other people to start leveraging Hadoop for other problems, and pretty soon it got to a point where within the company it was just known that Hadoop was where all of the data was stored: all of the tracking data, all of the derived data sets, and things like that. So we started using it to build search indexes, we started using it for feature generation and model training for machine learning, and for analytics as well. So by the time I arrived, Hadoop had really cemented its position within LinkedIn's broader data infrastructure.
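
As a rough illustration of the triangle-closing idea described above for People You May Know (if I know you and you know Stephen, Stephen is a candidate suggestion for me), here is a minimal single-machine sketch in Python. It is not LinkedIn's implementation: the function name `suggest_connections`, the sample graph, and the choice to rank candidates by number of mutual connections are all assumptions made for the example.

```python
from collections import Counter

def suggest_connections(connections, member, top_n=3):
    """Suggest second-degree connections for `member` (hypothetical helper).

    `connections` maps each member to the set of members they know.
    Candidates are ranked by how many mutual connections they share
    with `member`; ties are broken alphabetically for determinism.
    """
    direct = connections.get(member, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in connections.get(friend, set()):
            # Skip the member themselves and anyone they already know.
            if candidate != member and candidate not in direct:
                mutual_counts[candidate] += 1  # one more "open triangle"
    ranked = sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

# Illustrative data only; the names are made up for this sketch.
graph = {
    "carl": {"jeff", "ana"},
    "jeff": {"carl", "stephen", "ana"},
    "ana": {"carl", "stephen", "li"},
    "stephen": {"jeff", "ana"},
    "li": {"ana"},
}

suggestions = suggest_connections(graph, "carl")
print(suggestions)
```

The nested loop visits every friend-of-a-friend pair, which hints at why a single-database run became slow as the graph grew, and why splitting the work per member across many machines was attractive.
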
You know, one interesting thing about LinkedIn is there's this division between online serving systems, nearline streaming systems, and offline batch analytics systems. So Hadoop was basically the offline batch analytics solution at LinkedIn.

When you joined, was there a standardized process by which somebody stood up a Hadoop cluster or was given access to Hadoop resources, or were people just spinning up clusters however they could figure out a way to do so?

So we actually, well, I shouldn't say we, but the people at LinkedIn at that time recruited folks from Yahoo. This would be around the 2008, 2009 timeframe, at a time when there were no vendors out there that you could consult. I think Tom White published a Hadoop book through O'Reilly maybe around 2010. So the only place that you could go for help about how to deploy a large Hadoop cluster, how to manage it, would be the Apache mailing lists or meetups related to this. So I think the easiest thing then for people at LinkedIn was to recruit some of the people from Yahoo who had been working on Hadoop, and that's how the original Hadoop operations team was built out. We then inherited the model that Yahoo had been following for running these clusters: big clusters as opposed to many smaller clusters. That simplifies management, because you have fewer things to manage, fewer services to manage at least, and it's also good from a data standpoint, because you can just go to one cluster and know that all of your data is there, as opposed to having to think, okay, this data set may only exist on the small cluster. That also causes other problems, like discoverability. It also would make it very hard to join data sets together which are located on different clusters: you wouldn't be able to leverage data locality anymore, you'd get slowed down by the network, and things like that. So the model at LinkedIn has always been fewer, larger clusters over many smaller clusters.

Was there a distinct point at which it went from being a single large cluster to
that single large cluster getting so big that you had to instantiate additional clusters?

I don't know the exact timeline for when a second or third cluster was introduced, but by the time I joined LinkedIn in 2013, it was well established that we had one large development cluster and one large production cluster, and there was a process then for getting your job promoted from development over to production. Since at the time it was hard to isolate one job from another job, it was important to make sure that the jobs that were running on the production cluster were well behaved and had been vetted. Later on, that process of vetting individual flows actually became a major problem. People would sometimes wait for up to a month for someone to sit down and review their job, and in some cases they were just told, okay, well, you have to fix this and then get back in the queue and wait another month. So one of the things that I did after joining LinkedIn was to try to figure out if we could automate that process, and that resulted in Dr. Elephant, which is a service that we run which looks at the exhaust, basically, from the Hadoop jobs. It applies heuristics to those logs and uses those heuristics to diagnose performance pathologies, things like skew in terms of tasks, too much memory, too little memory, and stuff like that. Very quickly we got to a point where we could take the human reviewer completely out of the loop, and we were able to, I might say, promote to production probably eighty percent of things. As for the cases where we couldn't promote immediately, we were able to offer actionable advice to the owner of the workflow: you know, these are the things that you need to change in order to get a green signal from Dr. Elephant.

LinkedIn was not alone in this problem. Even before the open source Hadoop ecosystem, this problem manifested at Google. There's an interview I did a while ago with a guy named Tomasz Tunguz, who wrote a book called Winning with Data. I've alluded to this book a
couple times because he talks about what he termed data breadlines, basically the queue that you're referring to, where you have to get into some kind of queue either to get your job to run or to have the data scientists go and write a custom Hadoop job just for you to get data, to get the nightly report back to you. It was a perennial problem for people building data infrastructure, and for people who were working as data analysts or data scientists in those nascent days of Hadoop infrastructure.

Yeah, that sounds very familiar, and I think it also points to a larger issue, which we created Dr. Elephant to help address, which is this inherent tension between developer productivity and infrastructure efficiency. We don't want to require that everyone who uses Hadoop or Spark needs to be an expert in these systems. They have better things to worry about, right? They are machine learning experts or they're analytics experts, and to require that they spend a month or maybe even a year coming up to speed with all of the intricacies of these systems would severely impact their productivity. I think this is kind of another interesting example of this concept of worse is better. Richard Gabriel wrote this really interesting essay in the late eighties, and I think it became well known in the early nineties, describing this concept of worse is better. He was trying to explain why Lisp had sort of failed in the market, whereas C and C++ were doing really well. He sort of identified what he described as the MIT school and the New Jersey school, where New Jersey was like a stand-in for Bell Labs. And he compared the MIT school, where everything has to be in a sense perfect, right, the API should be simple, and it's acceptable to push complexity over to the implementation in order to keep the API simple, versus the other approach of saying, well, actually we're willing to push responsibilities over to the user for the sake of keeping the implementation very simple. And I think Hadoop is a really good example of
the worse-is-better philosophy in action, and a benefit of worse is better is that you're able to get this thing out very quickly, you're able to innovate on it and make it better over time.

To your point, it's a prime example of that. When did the Hadoop infrastructure problems at LinkedIn get alleviated to the point where it was much easier for people to actually just write their jobs in an autonomous fashion without being blocked? Was Dr. Elephant the thing that just solved this problem, or was it more of a progression of additional solutions that alleviated this issue?

So I think Dr. Elephant helped us solve this tension, right, between personal productivity and infrastructure efficiency, but there were other things that also helped people become more productive. I mean, when Hadoop was first introduced, the programming API was MapReduce, which is the assembly language for data processing, and you're handwriting assembly, essentially, right? Or you could also liken it to: I don't have a database, so instead I have to write the query plan by hand, and if I want to optimize it, I have to go and retrieve the statistics myself. So it wasn't very productive for people who were coming from more of a database background or people who wanted to do ad hoc queries, right? If you are writing MapReduce, it means you have to compile your code, you have to deploy the jar files, all of that stuff. So while Hadoop was still incubating at Yahoo, the