3 Burst results for "Carl Steinbach"
"carl steinbach" Discussed on Software Engineering Daily
"To make changes behind the scenes without impacting them. One interesting thing is that usually, when you find a situation where the API is more complicated than it needs to be, or where developers have to manage more details than they should, it's because the people who are providing that API are slacking off, right? They're trying to make things easier for themselves. But in this situation it was bad for both sides. It makes it very hard for infrastructure providers to make any change without talking to every user who's going to be impacted and helping them to migrate off. So we really want to improve the velocity for both sides. And then, I think, as far as inspirations, what we were really looking at was a conventional database, where you have this nice separation between the view that someone writing a query has and the underlying details. If I'm using MySQL or Postgres, it's irrelevant to me how MySQL and Postgres encode a table or how they write it out. So by adding the additional level of indirection, where we have instead a dataset view of... baked in your client, because then it has a hard dependency on where that dataset is physically in your data center... Over the next couple of years we're enforcing a policy where all public data must only be accessible using Dali APIs, just to make sure that we don't find.
"carl steinbach" Discussed on Software Engineering Daily
"MapReduce-style jobs. LinkedIn has been developing its data infrastructure since the early days of the Hadoop ecosystem. LinkedIn started using... you know, it's ninety-five percent of what the right thing would have been had you gone for that initially, and in the meantime it spreads like a virus. And I think Yahoo introduced Pig, which was sort of a new take on SQL. Pig is often described as an imperative language, but that's not true at all. It really is a declarative language with some nice escape hatches for you to insert imperative logic through UDFs and things like that. So that helped to provide people with higher-level abstractions that they could reason about. Similarly, at Facebook, they introduced Hadoop, HDFS, MapReduce, and very quickly realized that they needed a higher-level programming interface for the majority of people at Facebook, and that's where Hive came from. It's interesting to consider why Facebook produced Hive and Yahoo produced Pig. I know for a fact that a lot of the people who worked on Hive at Facebook were ex-Oracle engineers, so they, I think, definitely looked at Hadoop and thought: well, what we really want here is the interface that a database provides but the scalability that Hadoop provides. And I think the people who wrote Pig were coming at it from more of a scripting language approach, and that's why superficially Pig looks a lot more like an imperative scripting language than it does a declarative language like SQL.

If we take the lineage of query interfaces forward a little bit further, I think that lineage goes from Pig to Hive, and the next thing along that lineage might arguably be Presto, would you say? I mean, when you think about these higher-level interfaces for querying large datasets in Hadoop, is Presto the next thing in that series, that lineage?

In a way. The lineage of Presto is a little bit different. So Presto is very much an MPP database running on top of HDFS, or that's able to access data in HDFS, so comparing it to
something like Vertica or Greenplum, I think those are its ancestors in a way, whereas something like...

MPP is massively parallel, correct?

Yeah. And I think a big distinction to make between Presto and things like Spark, Hive, and Pig is that with those latter systems, the job that a user triggers runs using their identity. So whether they're able to read data on HDFS or from a blob store depends on whether they, using their account, have access to that data. The code basically runs with their privileges. Whereas in a system like Presto, the whole service runs as the Presto user. It's understood that that Presto user has access to all of the underlying data, and then Presto is able to superimpose its own authorization rules on top of that. One implication of this, though, is that if you need to write your own user-defined function, you can't just add that to Presto, because that's a code injection vector. I could very easily sneak some code in and use it to access data that I'm not supposed to be able to have access to. So when people do add UDFs to Presto, they need to be vetted to make sure that they're OK. Whereas with the Hadoop and Spark models, since everything runs as the user who's actually triggering that job and inherits their permissions, you can allow them to run whatever code they want, inject whatever code they want. So in terms of velocity, iteration speed, it's a lot faster in those other systems. If, however, you're coming at this from a database background and you're comfortable using SQL, and you can solve your problem using SQL and the standard built-in UDFs, there's no reason not to use Presto.

I think I probably jumped the gun here in the evolution of Hadoop usage. So, early days of Hadoop usage, you've got an HDFS cluster, the Hadoop Distributed File System. That's the place where you're storing all your data, the quote-unquote data lake. And that HDFS usage is still pretty prevalent today. People are still largely running HDFS clusters, although the large distributed bucket storage systems have displaced some of that HDFS usage, if somebody is on AWS S3 or Azure Blob Storage. But the query layer has changed over time, whether we're talking about Hadoop MapReduce being abstracted into Pig or Hive, or being displaced by something like Apache Spark. If I understand the usage of Apache Spark, people will pull distributed working sets into a distributed in-memory system, and then they'll query those working sets that are in memory. Now you can have this ad hoc data science process, a little bit more of an interactive workflow than the batch Hadoop workflow. And my sense is that the Spark usage really changed how people saw big data processing. Can you give me your perspective on how Spark changed the data landscape?

Yes. So for starters, Spark is more efficient by caching data in memory, as opposed to materializing it to disk between every stage like MapReduce does. But one thing I'd just like to say is that as soon as MapReduce was released as an open source project, and my guess would be even before it was released as an open source project, the people working on it knew about all of these tricks. They knew that forced materialization between MapReduce stages and different jobs was a performance problem, but they erred on the side of making sure that they could recover if a job failed, right? Because they were optimizing for these very long-running jobs.

Just to clarify what you're talking about here: this is the fact that in MapReduce you're often doing these three operations, map, shuffle, reduce, and you're checkpointing data at each of these operations, which makes it a costly series of operations. Right. So if you have a really long-running map task, it makes sense to checkpoint before you hit reduce.

That's right. You don't want one of those transfers to fail and then have to go and repeat that work. But if you're talking about, let's say, a fast query, which looks at a little bit of data, you may find that that forced
materialization actually accounts for most of the time, or it's something that slows you down, definitely, at every stage. Another issue was that, I think, everyone understood immediately that only allowing one reduce stage per map job was an issue. If you wanted to group by multiple keys, for example, you'd have to chain together multiple MapReduce jobs, even though you really didn't need those map stages in between. So what people really wanted was an MR* model, where you could have multiple reduce stages per map stage. I think, paradoxically though, since Hadoop and MapReduce were such a success right off the bat, it meant that the project instantaneously had a very large community, both of users as well as developers, and finding consensus within that community became a much bigger problem. So while everyone was united in understanding that these are the next steps for the project, exactly how to accomplish that became a problem. Similarly, users definitely validate a project, but they also make it very hard for you to evolve APIs. And I think with APIs you're always going to make a handful of mistakes you want to eliminate later on, but the more users you have, the harder it is to do that. So I think that the people who started the Spark project had the advantage of looking at Hadoop, recognizing the problems that Hadoop had, and then starting from a clean slate. Without the baggage of a large user base and a large community, they were able to iterate very quickly and produce something... so I view Spark as like a version two. Another nice thing about Spark is that the APIs are very elegant, they're very clean. It's very easy to build application verticals on top of Spark in a manner that isn't really possible with just MapReduce. Another thing that I think has really benefited Spark, at least in terms of that project's ability to iterate quickly, is the fact that it's all located in a single code repository and it's a single project that governs it,
whereas with the Hadoop model, it started as a single project, but then you had projects spinning out of it. So Hive and Pig both started as Hadoop projects, but then they spun out into their own projects, and what that really ended up doing was creating a problem that vendors then had to solve. All of these projects had their own release cycle. They were not necessarily tested or integration tested against the version that you would want. So if you weren't using a vendor's distribution, your first task was to figure out: well, for this version of Hive, what version of Hadoop do I need, what version of Pig do I need, how does this all fit together? Because you could count on the fact that no one from these individual projects had done any integration testing for you. So that really was the value that vendors added, at least in the early days. They were just cleaning up this mess that had been created. I don't really know what the motivations were for doing that, but I think that Spark has avoided those problems just by keeping everything in a single repository.

Let's fast forward to today. You work on managing Hadoop infrastructure at LinkedIn. Is that your day-to-day job?
Yeah, I would say, even though I don't really like this term, big data infrastructure in general.

Yeah? Why don't you like that term?

Because, you know, it's a buzzword. People, I think, use it too frequently. But the sad thing is I can't think of a better term to use.

What does your day to day consist of these days?

I'm focusing most of my time on a project called Dali, which we've been working on for several years now. Our goal with Dali is really to combine the best aspects of a relational database with the best aspects of the big data ecosystem. So we want to leverage abstractions from the database world, things like tables and views, the decoupling between your logical view of data and the underlying physical details, and combine these with the freedom and flexibility that people are used to in the big data ecosystem. So with big data, I have the option of using more than just SQL to analyze data. I have the ability to leverage different file formats, depending on which format has the best support for the engine that I'm using or the best performance for the query that I'm running. Similarly, I'm able to take advantage of different storage layers and swap those out. So in effect we're trying to introduce a level of indirection at each one of these layers, but also provide the abstractions necessary to really decouple implementation details from APIs, with the goal both of making things simpler for users, so that they don't have to worry about physical details that they shouldn't really need to know about in the first place, and of making it easier for the people who provide the infrastructure to make changes underneath without disrupting what's happening above.

When you say user, you're talking about an internal application developer at LinkedIn, correct? Give an example of an application. Let's say I'm building a dashboard, or I'm building some kind of reporting system within LinkedIn. I'm sure you have a prototypical example in your head. Tell me the problems
that Dali would solve for a prototypical application.

So I think one thing it solves is the discoverability problem. We have a dataset catalog that's searchable, and it's typically, I think, easier to discover things in that catalog than it would be just looking at paths in HDFS. It also allows you to search by column names and annotations and things like that. So just the process of discovering datasets, as well as understanding who produces the dataset and what the contract is between you, the consumer, and the producer, is another thing that is aided there. The other thing is that we are able, over time, to improve the performance of a dataset by looking at the queries that people are running against it and using those to inform how we partition the dataset, or the format that we encode the dataset in, and things like that. So I think efficiency is one thing. Another thing that Dali provides, which I don't think any other system at this point has yet provided, is the ability for a dataset owner to decouple the API of their dataset, in other words the schema of the dataset, from the actual implementation of the dataset. So over time I can evolve the schema without requiring that consumers migrate in lockstep with me, because they have this level of indirection between the schema that they're consuming, which is provided by the view on top of the dataset, and the schema of the data that I'm actually materializing on HDFS. Otherwise, when I make a change to the schema, it becomes instantaneously visible to everyone who's consuming it. I think probably the best way of explaining this is to say that if you think of datasets as services which have an API, this allows us to support multiple APIs on top of the same dataset, in a manner similar to how a service can have a v1, v2, v3 of the API that it presents to clients.

And by the way, is it Dali or Dolly? D-O-L-L-Y?

D-A-L-I. D-A-L-I.
The artist, right. So there's a history of using LI as a suffix on names of projects at LinkedIn. I think it originally stood for Data Access at LinkedIn. Now it doesn't stand for anything, because we've gone beyond just data access to other things like data management, data catalog, discovery, issues like that.

Okay. What was the motivation for starting this project?

Well, I think that one big motivation was simply to make things easier for people developing applications on top of Hadoop and Spark, by hiding details from them that they shouldn't need to worry about anyway, things like what file format a dataset is using, or which cluster the dataset is stored on, or how that dataset is partitioned. But also to give the people who are running that cluster and managing that data the ability.
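The schema decoupling described in this excerpt, where consumers read through a view while the owner is free to change what is materialized underneath, can be sketched roughly as follows. This is only an illustration in Python and not Dali's actual API: the field names and the `view_v1`, `view_v2`, and `read_dataset` helpers are hypothetical, invented for the example.

```python
# Physical rows as the producer currently materializes them.
# The schema here is hypothetical; the producer may rename or add fields over time.
PHYSICAL_ROWS = [
    {"member_id": 1, "full_name": "Ada", "country_code": "GB"},
    {"member_id": 2, "full_name": "Grace", "country_code": "US"},
]


def view_v1(row):
    """Original contract: consumers expect only 'id' and 'name'."""
    return {"id": row["member_id"], "name": row["full_name"]}


def view_v2(row):
    """Newer contract: additionally exposes the country."""
    return {
        "id": row["member_id"],
        "name": row["full_name"],
        "country": row["country_code"],
    }


# Each view plays the role of one API version on top of the same dataset.
VIEWS = {"v1": view_v1, "v2": view_v2}


def read_dataset(version):
    """Read the dataset through the requested view, never the physical layout."""
    view = VIEWS[version]
    return [view(row) for row in PHYSICAL_ROWS]
```

A consumer pinned to `"v1"` keeps getting the same shape of record even if the owner later changes the physical fields, as long as the owner updates `view_v1` to match. That is the sense in which consumers do not have to migrate in lockstep.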
"carl steinbach" Discussed on Software Engineering Daily
"Linked in is a social network with Peta bytes of data in order to store that data linked in distributes replicates that data across a large cluster of me Sheen's running the dupe distributed file system in order to run calculations across its large data set Lincoln needs to split the computation up using in two thousand eight and in the last eleven years the company has adopted streaming frameworks distributed databases and newer execution run times like patchy spark with the popularization of machine learning there are more applications for data engineering than ever before but the tooling around engineering means that it is still hard for developers to find data sets and clean their data and build reliable models. Carl Steinbach is an engineer at least it in working on tools for data engineering in today's episode Carl Discusses The data platform inside linked in and the strategies that the company has done developed around storing and computing large amounts of data full disclosure linked is a sponsor of soften engineering daily so call Steinbach welcome to suffering daily the very much it's great to be here so linked in started using Hud oop back in two thousand eight this was near the beginning of the age of modern data infrastructure give me brief history of how data infrastructure at Lincoln has evolved yeah well let's start with the introduction of back in two thousand eight so linked in had a big problem with that point in time the site had future called people you may know which suggested to members you know here's someone that you should consider acting with and it was very clear based on metrics that this was the engine of growth for Lincoln at that point in time when they first introduced this feature they were using Oracle to actually build the index for people you may know using a process called triangle closing so if I know you and you know Stephen Sitting over here chances are I also may know Stephen so that would be a good suggestion can you know 
in my feed. So when the feature was first introduced, it would take a day to build this index on Oracle. But it was such a successful feature, and it drove so much growth towards the site, that pretty soon it was taking upwards of a week to complete a single run. At that point the results were becoming increasingly stale, and scaling it wasn't really an option, so people started looking around for a better approach, something that could be horizontally scaled. When I joined LinkedIn in 2013, one of the first things I did was to go to Jira and search for the first mention of Hadoop that I could find, and I found a couple of tickets where people were discussing how to set up the first cluster and discussing the first use case, which was PYMK. Incidentally, Jay Kreps was heavily involved in this, you know, and he's the guy who later went on to start the Kafka project. So the application of Hadoop to the PYMK problem was a huge success. They were able to bring the time required to generate this index down to just a couple of hours, and they found that by adding more machines they could bring it down even lower. That first big success then inspired other people to start leveraging Hadoop for other problems, and pretty soon it got to a point where, within the company, it was just known that that was where all of the data was stored: all of the tracking data, all of the derived datasets, and things like that. So we started using it to build search indexes, we started using it for feature generation and model training for machine learning, and for analytics as well. So by the time I arrived, Hadoop had really cemented its position within LinkedIn's broader data infrastructure.
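The triangle-closing idea described above (if I know you and you know Stephen, Stephen is a candidate suggestion for me) can be illustrated with a small single-machine sketch. It is not LinkedIn's implementation, which the transcript says ran first on Oracle and later on Hadoop. The function name, the dictionary representation of the graph, and the ranking by number of shared connections are assumptions made for the example.

```python
from collections import Counter


def people_you_may_know(connections, member):
    """Suggest second-degree connections, ranked by number of shared connections.

    `connections` maps each member to the set of members they are connected to.
    """
    friends = connections.get(member, set())
    counts = Counter()
    for friend in friends:
        for candidate in connections.get(friend, set()):
            # Skip the member themselves and anyone they already know.
            if candidate != member and candidate not in friends:
                counts[candidate] += 1
    # Most shared connections first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


connections = {
    "me": {"you"},
    "you": {"me", "stephen"},
    "stephen": {"you"},
}
suggestions = people_you_may_know(connections, "me")
```

Here `suggestions` contains Stephen with one shared connection. Running this for every member visits each member's friends-of-friends, which hints at why the index build became slow on a single database as the network grew and why a horizontally scalable approach was attractive.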
You know, one interesting thing about LinkedIn is there's this division between online serving systems, nearline streaming systems, and offline batch analytics systems. So Hadoop was basically the offline batch analytics solution at LinkedIn.

When you joined, was there a standardized process by which somebody stood up a Hadoop cluster or was given access to Hadoop resources, or were people just spinning up clusters however they could figure out a way to do so?

So we actually, well, I shouldn't say we, but the people at LinkedIn at that time recruited folks from Yahoo. This would be around the 2008, 2009 timeframe, at a time when there were no vendors out there that you could consult. I think Tom White published the Hadoop book through O'Reilly maybe around 2010. So the only place that you could go for help about how to deploy a large Hadoop cluster and how to manage it would be the Apache mailing lists or meetups related to this. So I think the easiest thing then for people at LinkedIn was to recruit some of the people from Yahoo who had been working on Hadoop, and that's how the original Hadoop operations team was built out. And we then inherited the model that Yahoo had been following for running these clusters: big clusters as opposed to many smaller clusters. That simplifies management, because you have fewer things to manage, fewer services to manage at least, and it's also good from a data standpoint, because you can just go to one cluster and know that all of your data is there, as opposed to having to think, okay, this dataset may only exist on the small cluster. That also causes other problems, like discoverability. It also would make it very hard to join datasets together which are located on different clusters. You wouldn't be able to leverage data locality anymore, you'd get slowed down by the network, and things like that. So the model at LinkedIn has always been fewer larger clusters over many smaller clusters.

Was there a distinct point at which it went from being a single large cluster to
that single large cluster got so big that you had to instantiate additional clusters?

I don't know the exact timeline for when a second or third cluster was introduced, but by the time I joined LinkedIn in 2013 it was well established that we had one large development cluster and one large production cluster, and there was a process then for getting your job promoted from development over to production. Since at the time it was hard to isolate one job from another job, it was important to make sure that the jobs that were running on the production cluster were well behaved and had been vetted. Later on, that process of vetting individual flows actually became a major problem. People would sometimes wait for up to a month for someone to sit down and review their job, and in some cases they were just told, okay, well, you have to fix this and then get back in the queue and wait another month. So one of the things that I did after joining LinkedIn was to try to figure out if we could automate that process, and that resulted in Dr. Elephant, which is a service that we run which looks at the exhaust, basically, from the Hadoop jobs. It applies heuristics to those logs and uses those heuristics to diagnose performance pathologies, things like skew in terms of tasks, too much memory, too little memory, and stuff like that. And very quickly we got to a point where we could take the human reviewer completely out of the loop, and we were able to, I'd say, promote to production probably eighty percent of things. As for the cases where we couldn't promote immediately, we were able to offer actionable advice to the owner of the workflow: you know, these are the things that you need to change in order to get a green signal from Dr. Elephant.

LinkedIn was not alone in this problem. Even before the open source Hadoop ecosystem, this problem manifested at Google. There's an interview I did a while ago with a guy named Tomasz Tunguz, who wrote a book called Winning with Data. I've alluded to this book a
couple of times, because he talks about what he termed data breadlines, basically the queue that you're referring to, where you have to get into some kind of queue either to get your job to run or to have the data scientists go and write a custom Hadoop job just for you, to get the nightly report back to you. It was a perennial problem for people building data infrastructure, people who were working as data analysts or data scientists in those nascent days of Hadoop infrastructure.

Yeah, that sounds very familiar, and I think it also points to a larger issue, which we created Dr. Elephant to help address, which is this inherent tension between developer productivity and infrastructure efficiency. We don't want to require that everyone who uses Hadoop or Spark needs to be an expert in these systems. They have better things to worry about, right? They are machine learning experts or they're analytics experts, and to require that they spend a month or maybe even a year coming up to speed with all of the intricacies of these systems would severely impact their productivity. I think this is kind of another interesting example of this concept of worse is better. Richard Gabriel wrote this really interesting essay in the late eighties, and I think it became well known in the early nineties, describing this concept of worse is better. He was trying to explain why Lisp had sort of failed in the market whereas C and C++ were doing really well. He identified what he described as the MIT school and the New Jersey school, where New Jersey was like a stand-in for Bell Labs. And he compared the MIT school, where everything has to be in a sense perfect, right, the API should be simple, and it's acceptable to push complexity onto the implementation in order to keep the API simple, versus the other approach of saying, well, actually, we're willing to push responsibilities over to the user for the sake of keeping the implementation very simple. And I think Hadoop is a really good example of
the worse-is-better philosophy in action. And the benefit of worse is better is that you're able to get this thing out very quickly, and you're able to innovate on it and make it better over time, to a point.

It's a prime example of that. When did the Hadoop infrastructure problems at LinkedIn get alleviated to the point where it was much easier for people to actually just write their jobs in an autonomous fashion without being blocked? Was Dr. Elephant the thing that just solved this problem, or was it more of a progression of additional solutions that alleviated this issue?

So I think that Dr. Elephant helped us solve this tension, right, between personal productivity and infrastructure efficiency, but there were other things that also helped people become more productive. I mean, when Hadoop was first introduced, your programming API was MapReduce, which is the assembly language for data processing, and you're hand-writing assembly, essentially, right? Or you could also liken it to: I don't have a database, so instead I have to write the query plan by hand, and if I want to optimize it, I have to go and retrieve the statistics myself. So it wasn't very productive for people who were coming from more of a database background, or people who wanted to do ad hoc queries, right? If you are writing MapReduce, it means you have to compile your code, you have to deploy the jar files, all of that stuff. So while it was still incubating at Yahoo, the.
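To make the "assembly language" comparison concrete, here is a toy, single-process sketch of what writing the query plan by hand looks like for something a database would express as `SELECT country, COUNT(*) ... GROUP BY country`. It only imitates the map, shuffle, and reduce phases in plain Python and is not Hadoop's API. The function names and the sample records are invented for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, 1) pair for every input record."""
    for record in records:
        yield record["country"], 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


records = [
    {"member": "a", "country": "US"},
    {"member": "b", "country": "IN"},
    {"member": "c", "country": "US"},
]
counts = reduce_phase(shuffle_phase(map_phase(records)))
```

Even for this one-line query the author has to decide what the key is, what gets emitted, and how values are combined. That is the sense in which higher-level interfaces such as Pig and Hive, mentioned elsewhere in the conversation, gave people abstractions they could reason about more easily.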