3 episode results for "cuDF"
#269 HoloViz - a suite of tools for Python visualization
cuDF, cuML & RAPIDS: GPU Accelerated Data Science with Paul Mahler - TWiML Talk #254
"Hello. And welcome to another episode of twin? We'll talk the podcast y into interesting people doing interesting things in machine learning and artificial intelligence. I'm your host Sam Sherrington. This week shows drawn from some of the great conversations I had. Media GP technology conference. And no brought to you by Dell. If you call my tweets from GTC, you may already know that one of the announcement this year was a new reference architecture for data science workstations, powered by high end GPU's and accelerated software such as in videos Rapids, Dell was among the key partner. Showcase during the launch and offers a line of workstations designed for modern machine learning, and I work loads to learn more about Dell precision workstations and some of the ways they're being used by customers in industries like media, and entertainment engineering, manufacturing healthcare and life sciences, oil and gas and financial services. Visit Dell EMC dot com slash precision. All right, everyone. I am on the line with Paul Moller, Paul as a senior data scientist and technical product manager for machine learning at Invidia. Paul welcome to this week in machine learning a out. Thanks for having me. Absolutely. I'm looking forward to jumping into our conversation, which will be focused on what envy is doing around rapid and kumail all of interesting stuff in that area. But before we do that you were a philosophy major. How did you make your way to working in machine learning? I was a philosophy major and two years before I was set to graduate. I added economics is a major because I read the economist magazine and thought that it was a fascinating collection of bunch of different articles about different aspects of the world. So I figured if that's what economists read I wanted to be an economist. I went on to do a master's degree in economics where I mostly focused on quantitative methods and in an earlier life. 
I went out to Washington, DC, where I worked as an economist. I began my career at the World Bank, working on health and human welfare issues in sub-Saharan Africa, and then I worked for the office of the chief economist at Fannie Mae. Now, one day, while waiting for the bus to get home from Fannie Mae, there was an article in The New York Times about a gentleman from SUNY Buffalo who had written an algorithm offering notes on screenplays, and I had always had a hobby interest in the creative arts side of our culture. When I saw that somebody had written, essentially, a big block of math in code that was telling writers how to write their screenplays better, I decided that I needed to switch into data science, or "big data," which was the more popular term at the time, because that was where some of the most interesting things in the world were going on. That's a great story. You remember the moment, literally the bus ride, that set you off down this path. Yeah, it was raining at the time. Awesome. And so what do you do now? For the last year, after a couple of startups, I've been at NVIDIA working on what we've been calling the RAPIDS project, or the RAPIDS ecosystem. How that began is that our director of engineering, who I had worked with previously at Accenture for years, had been saying that we see this great acceleration in neural network methods as a result of getting them onto GPUs, but you're not seeing any of those advantages for more of the bread-and-butter data science that happens in a lot of places, like Fannie Mae or the World Bank, where they may have large data sets and questions they want to address through machine learning, but we aren't talking about convolutional neural networks to understand images. So, a lot of the time when I was working as a data scientist at a couple of startups,
I like to joke that I was really a bar trivia champion, because while I was waiting for my code to finish running and spit out my results, I had plenty of time to read all the news of the world on the internet. Fortunately for my compatriots in data science, what we've done with cuML and cuDF in particular is, if somebody knows pandas, or they know the PyData ecosystem, they can immediately jump right in and start seeing just crazy speedups, like 50x, sometimes more than that, on their end-to-end workflows. And that includes reading from disk to GPU memory, doing all your data munging and merging and variable creation, through actually executing your algorithm and making inferences. Now, some of this is not obviously tractable on GPUs. We are able to process strings in the latest iteration of cuDF, which to me seems like a miracle. I like to joke that before the team I work with delivered these big pieces of cuDF, it's like I could drive a car, and now suddenly I can fly a plane, and I don't need to be an expert in CUDA or parallel algorithms or anything except the tools that I've worked with most of my career. Let's take a step back. When I introduced you, I mentioned cuML; you've mentioned cuDF, and you've mentioned RAPIDS. Can you paint a picture of the broader ecosystem of libraries and tools that comprise, or make up, RAPIDS, and how they all fit together? Yeah. RAPIDS is the overall name of the project, and it's made up of smaller sub-libraries that all start with "cu," because that's inherited from CUDA. RAPIDS is built on NVIDIA's CUDA. For anyone who doesn't know, CUDA is the underlying library, or API, for doing things on NVIDIA GPUs. Right, it's the general-purpose computing library for NVIDIA GPUs. Okay.
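The "jump right in if you know pandas" idea can be sketched in a few lines. This is an illustration rather than an official RAPIDS pattern: it assumes the cuDF API mirrors pandas for these calls (as described in the episode) and falls back to plain pandas when no GPU library is installed, so the same code runs either way.

```python
# A minimal sketch of the drop-in idea: cuDF mirrors the pandas API,
# so the same DataFrame code can run on GPU or CPU. The try/except
# fallback here is illustrative, not an official RAPIDS recipe.
try:
    import cudf as xdf       # GPU-backed DataFrame library, if available
except ImportError:
    import pandas as xdf     # same API surface, single-threaded on CPU

df = xdf.DataFrame({"city": ["NYC", "NYC", "SF"], "sales": [10, 20, 5]})
totals = df.groupby("city").sales.sum()  # identical call in both libraries
```

With cuDF, the groupby runs on the GPU; with pandas, it runs single-threaded on the CPU, which is exactly the gap the conversation is about.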
And if you think about the broader PyData ecosystem, I think a lot of people do a lot of their initial data cleaning and exploration in pandas, and that's what cuDF is meant to replace for people who are moving their workloads onto GPUs. The API is very, very close, and you're able to, in some cases, just change the import statements at the top of your program, and it will just work. So pandas has this kind of core abstraction of a DataFrame, and cuDF is just, you can think of it as a CUDA-powered DataFrame? Yeah, I think that's the best way to think about it. Okay. cuML is our machine learning toolkit, and we aspire to one day have almost all the functionality that exists in scikit-learn. Scikit-learn is an eminent package built by some of the world's greatest developers, so we've got a ways to go there, but we've been rapidly adding algorithms. In the last release, for example, we have stochastic gradient descent regression, ordinary linear regression, ridge regression, principal component analysis, and some other things like Kalman filtering. What we're trying to do is start with the things that are the real workhorses of day-to-day machine learning in business and other parts of industry, and it's been exciting to watch the package grow. In fact, when we launched back in October, we had, I think, four algorithms in cuML, and we've doubled that over the last couple of months. It was very exciting to present at the GPU Technology Conference in San Jose, California, a couple of weeks ago, to the wider community, all the things we have been able to deliver in such a short amount of time. You mentioned that you aspire to scikit-learn. Does that mean that cuML replaces scikit-learn? It sounds like it does, for folks who are trying to take advantage of the GPU. And was there an opportunity to, rather than replacing scikit-learn, fit in underneath it,
so folks who use it, or who have existing work that uses scikit-learn, could take advantage of the GPU acceleration without having to rewrite their apps? At least with the algorithms we've delivered, we've tried to keep the API one-to-one. Okay. For any of your listeners, I would encourage them to take a look at the API and just see how close it is to scikit-learn. I'd also like to add that we've partnered with Inria, the French research institution that does a lot of work on scikit-learn, and over the next few months and few years, we're going to be building out that collaboration with them. I don't think we'll ever replace scikit-learn, because there are still problems where the data isn't big enough, or the use case isn't right, to necessarily go to full GPUs. I think of certain analyses I did as an economist, which would look like machine learning but were maybe a few thousand rows, and this was much more traditional frequentist statistics. I think there's always going to be a lot of that work being done, and with any data science work, it's about finding the right tool for the job. But I will tell you, when I was testing out our code earlier in the summer, our demo workflow involved reading in, I think, around a gigabyte of CSV data. With pandas, on my MacBook Pro, it took like five and a half minutes, and on a single GPU it took like fifteen or twenty seconds. Yeah. When you talk about it, data analysis is an iterative, interactive process, and the faster you can move, the more fluid your conversation with the data will feel to you as a user. There won't be the long wait times to see results, or to see if you made a coding error in my case, or any opportunity to become a bar trivia master. So before we dig into that, because that's an interesting point, we're kind of talking about the landscape here. You mentioned cuDF and cuML. Are there
other major pieces that we should be keeping track of in this conversation? Yeah, we are working on a graph analytics package called cuGraph, and you could have guessed that. Yeah, our minds are so fixated on accelerating the algorithms, we're totally out of bandwidth for fancy names. But everyone knows that Jensen does all the naming at NVIDIA; why would anyone else spend any time thinking about that? cuGraph is embarrassing in the sense that we compare it to a graph analytics package in Python, NetworkX, and it's one of those things where you see the numbers and you just really want to double-check them, like ten-thousand-x speedups over NetworkX for certain algorithms. That was my reaction to the loading of the DataFrame, and I still want to get through the broad landscape before we dig deep into that. That's the first place I'm going to come back to once we do. So you've got a graph analytics piece in cuGraph. Any other major components here? Some of the things that began as major components are now under the hood. So we put a bunch of effort into building a string reader, so you can directly parse data sets with strings, a very common thing in data science. GPUs do not like strings, but now you can do things like easily create your dummy variables from strings on the GPU. It sounds kind of humdrum, but it's actually a pretty major part of the whole speeding-things-up story. I don't think our case would be as compelling if we said there could only be numerical data in the cuDF DataFrame; that simply would not work for many use cases. We also have a package called cuCrossfilter, written as cuxfilter, and we're going to be building out some GPU-accelerated visualizations. So if you think of the workflow from munging to analysis to insight through visualizations, we want to be able to offer every piece of that puzzle. The other thing is, we're heavily using a
software called Dask. Dask is a package that handles distributed computing; it has been used to scale out pandas, for example, across multiple cores. We were lucky enough that the creator of Dask has joined our team, and he's helping us use it as a way to distribute workloads when we're talking about moving beyond a single node of GPUs. This may go back to the initial example you gave of loading the pandas DataFrame, or loading the DataFrame. You said it was a terabyte? It was gigabytes. It's pretty easy to choke pandas, and I'm sure a lot of your listeners have experienced this before. The workflow was: Fannie Mae makes loan delinquency data going back, I think, sixteen years available for free, and this is the whole payment history for a subset of all the loans that Fannie Mae has acquired. As a demo workflow, what we wanted to do is read in however many quarters of data we could fit, or were relevant, and then apply XGBoost to predict default. Which reminds me of another sort of under-the-hood improvement that we've made. It's not really under the hood, but we have made contributions to the DMLC XGBoost library, and we'll continue to do so. That has appeared in a lot of our early presentations and webinars. I think XGBoost is almost like magic, and it's a good, broad workhorse, so it was the first thing we were going to introduce. We are working with the community to make certain changes to XGBoost that make it more amenable to the RAPIDS ecosystem, but then giving those back to the community. So, pardon that sidebar. But before any real data processing had happened, just bringing in this large data set, as a rule of thumb, we couldn't do more than a couple of quarters of data, and then you would really see the time to load and go through the data preparation and execute the algorithm increase substantially.
So I think the largest we could do with pandas, quarter-wise, was like two or three quarters of data, and they were small, so I think the biggest we tried was about one and a half gigabytes. And that's where you saw those really kind of frustrating load times of more than five minutes. And so what's happening here in these two different scenarios: on both sides, you've got a gig and a half of information on a disk, probably a CSV file, perhaps. Right. And you're loading it into a DataFrame. On the pandas side, it's a DataFrame that's located in RAM, and on the RAPIDS side, it's a cuDF frame that's located on the GPU itself. Is that right? Yep, in GPU memory. Okay. And that's where the five-minutes-versus-fifteen-seconds differentiation comes from? Yeah. And this is where, this really is more for our hardcore CS geeks, but there are some problems with pandas that have been known for a while. It's great, I can't thank Wes McKinney and the community enough for open-sourcing it, because it's put food on my plate for five or six years, but it's also single-threaded on the CPU. So even in the world of CPUs, you started to see people look for things like Dask to help better leverage even the multiple cores of CPU you might have on your MacBook Pro. I think it's kind of funny: every data science gig I've had, they've given me a shiny MacBook Pro, and I mostly work in Jupyter notebooks, and most of that stuff is only taking advantage of a single core of the processor. Right, right. I mean, I don't want to exaggerate too much, but it's almost like, you still want the next MacBook Pro. I want to see what they do with that touch bar. I've got some thoughts on the touch bar, but let's not even go there.
But, you know, the other thing is that when we move past the massive parallelism that you can get from using GPUs, when you get to the ML side, which I know better, this is, for the most part, matrix algebra, and GPUs love matrix algebra. They are designed to do it. In our algorithms, like ridge regression, which can take some time to run in the conventional PyData ecosystem, when I gave a tutorial at the GPU Technology Conference, I said, we're going to just do a hyperparameter search and run through like a thousand ridge regressions on this Black Friday data set, because it's just fast enough that we can brute-force a hyperparameter search on certain data set sizes. And do you have comparative results for that particular scenario? We aren't baking off every algorithm these days. Ridge regression is fast. Doing a ridge regression on, I think, like eight hundred thousand rows took me less than a second, a couple of seconds. I remember it being fast enough that I could do a live demo and run through a hundred iterations of it. Normally, doing two or three would have taken quite some time. Let's jump into this part of the cuML library. Can you maybe talk us through the technical underpinnings of this? I mean, is it as simple as, hey, these things love matrix multiplication, and we're just doing matrix multiplication using CUDA and it's just faster? Or are there kind of interesting nuances to the way some of these algorithms work that might be worth chatting about? We'll start with a little bit about the architecture of cuML. So cuML is built on top of what we're calling ML Prims. These are primitive functions that are composed of even lower-level math libraries, or various things that have been developed at NVIDIA for certain linear algebra purposes. We take these primitives, and they are delivered in C++. So then, when we need something new,
like, I have a colleague working on doing massively parallel ARIMA regressions, and when he began working on that, we already had a Kalman filter primitive and an OLS primitive, so the amount of new work he needed to begin composing a prototype was dramatically reduced. Someday in the future, I actually want to see these ML Prims wrapped in Python, so that different machine learning researchers and graduate students who aren't experts in parallel programming would be able to mock up the new algorithms they're inventing and be able to leverage the advantages of the GPU. Sounds like a no-brainer. Are the ML Prims open source, or are they locked up in a binary or something? It's a mix. Some of the stuff is NVIDIA-proprietary, and we've wrapped it in a binary, but other pieces are more open source, and it's a discussion that we're going to continue to have. But, oh, sorry, go ahead. Is it a well-documented set of primitives, or is it kind of internal, where nobody really knows about them outside of the company? It's something that I've tried to mention publicly whenever we speak about it. It's not super well documented right now, so you need to be able to go into our Git repo and look at the primitives folder, and be able to read a little bit of C++. Got it. Okay. But yeah, hopefully we'll wrap these in Python and introduce them to the greater development community to see what else they can do with them. Now, are there some algorithms that are more or less tractable than that? Right now, we're working on building a lot of different solvers for more exotic kinds of regressions, and one of the challenges in developing those has been, and this really is where I'm riffing off the thinking of my colleagues, that these are essentially sequential algorithms, the way they're originally designed, right?
If you look at the most basic version of, like, a gradient descent, you start someplace and you keep taking little steps until you're satisfied, and then you stop. Now, that's a sequential operation. We've been doing some early inspections on different solvers. This morning, in fact, a colleague told me that he was disappointed that we were only getting a three-x speedup, because he was still trying to think around how to make the algorithm less sequential. So there will be things, just by the way the underlying math works, that aren't going to necessarily be another ten-thousand-x speedup, but a two- or three-x speedup, I think, is still pretty great. The other really heavy intellectual work that is going on right now, which we hope to wrap up by the end of the summer or the fall, is going to be on multi-node, multi-GPU algorithms using Dask. In some cases it'll use Dask, and we're currently working our way through what other kinds of communication layers could be helpful in trying to block up this data and distribute it across a cluster of GPUs in a way that creates an "aha" moment for the user. I know it's a little marketing, but that's what I'm looking at with what we're doing; that's what I'd really like. We've been lucky so far that we have gotten some help with what we deliver, but the underlying algebra and algorithms of breaking some of these things into parallel jobs is very far from trivial. You mentioned that some of the things you're doing are allowing you to load strings onto the GPU. Are you able to utilize the GPU for kind of heavy NLP types of algorithms? I guess for a lot of those, you're kind of numericalizing the textual data anyway, but are there any limitations there, one way or the other? There are some limitations, and we do want to do NLP.
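The sequential-by-design point about gradient descent can be seen in a bare-bones sketch. This is plain NumPy, not the cuML solver, and the toy data is invented: each iteration needs the previous iteration's weights, even though the matrix math inside each step parallelizes well on a GPU.

```python
import numpy as np

def gd_least_squares(X, y, lr=0.1, steps=500):
    """Plain-NumPy gradient descent for min ||Xw - y||^2 (illustrative only)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):                 # inherently sequential loop:
        grad = X.T @ (X @ w - y) / len(y)  # step t depends on step t-1,
        w = w - lr * grad                  # but each step's gradient is
    return w                               # GPU-friendly matrix algebra

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, 3.0, 5.0])
w = gd_least_squares(X, y)  # converges toward the exact solution [2, 3]
```

The outer loop is the part "you can't just parallelize away"; reworking it is exactly the kind of thinking the colleague in the anecdote was doing.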
We're not quite there yet. We're working on an implementation of word2vec. In terms of preparing or understanding your data, we have a lot of ordinary string functions, like tokenizers, and we have a regular expression engine, so you can search for regular expressions on strings and substrings and use that to create variables. I think probably closer towards the end of the year would be the soonest I expect us to deliver that, but it's certainly something that we're interested in and are currently working on. In terms of the preprocessing for NLP, that's where we started, and so now that we have released, and continue to iterate on, our string manipulation package, that's going to lay the foundation for NLP practitioners to be able to work in a way that they're used to, and have algorithms like word2vec or LDA or other things like that available to them. One of the things I'm curious about is, in the cuML library, are all of these algorithms, and everything you're doing with that library, only dependent on the GPU? In other words, are you doing all of the compute on the GPU, or are you using the CPU as needed, or where appropriate? And I'm wondering, more broadly, is it kind of an all-or-nothing thing, or is the focus really on doing a given operation in the best place for that operation? I think it's doing a given operation where it's best suited, but we are really just looking at things that are suited to the GPU. So, doing everything where it's best suited, but everything is best suited on the GPU? Right. Our work has been focused on, especially on the cuDF side, once the data is in GPU memory, we don't want to move it around. That's one of the big advantages that we have by doing end-to-end data science: we're dramatically cutting down on these reads and writes. So the data will come in, and it'll sit as a cuDF DataFrame.
I can immediately pass it into my algorithm, and it all stays in one place. So we're able to cut down some of the overhead by thinking real hard about reducing the copies that happen in the course of doing this analysis. You mentioned that there's a bit of a dichotomy, or decision point, around, do you kind of optimize around these end-to-end workflows, and then everyone who wants to do anything that utilizes the GPU and takes advantage of what you're doing needs to wait for you to build their algorithm? Or, is there some way, if I want to do something that cuML offers an optimized version of, but I also need to do stuff for which there's not a cuML-optimized version, do I need to do the five-minute pandas load into RAM and then load into the GPU and kind of go back and forth each time? Or is there some kind of scenario where I can load into the GPU and keep the data there, but also do CPU-based operations against it? I don't even know, technically, whether that makes any sense or is feasible. We have a bunch of different formats you can export data from cuDF into. So if you have an algorithm that we haven't built, and I'd say, anybody who's willing to join in, this is an open source project, we'd love to have any help we can get building these algorithms out. But if you have something that you need to do for work today, and you're like, cuDF sounds like a great way to speed up the pandas part of my workflow, but I am going to do a bunch of algorithms, for example, I want to do a bunch of Bayesian stuff, we can export data from the GPU DataFrame into a pandas DataFrame, if you'd like to work with it that way. We support the Apache Arrow data format, and we also just introduced support for DLPack, which plays nicely with a handful of the deep learning packages, like PyTorch.
So while we're working as hard as we can to add more algorithms for the user community, you can pick and choose what's most useful to you at this time. I'd also like to add that cuML can take NumPy arrays. These packages were designed to be used closely together, but we know that's not going to cover all users, and it's not necessarily a requirement. The other piece of this is cuGraph, and I gather that's a newer, kind of more emerging part of the RAPIDS ecosystem. I don't have the deepest knowledge of cuGraph on the team, I'm not really a graph guy, but I know that their benchmarks have been fabulous, and we're hoping to make more graph algorithms available to people who heavily rely on graph theory in their day-to-day work. Maybe switching gears a little bit from the work that's happening in RAPIDS: one of the things that was mentioned at this recent GTC was some announcements and partnerships around creating a reference architecture for data science workstations. What can you tell us about that initiative? Yeah, the reference architecture for data science workstations is very new, but what I think is exciting about it is that, for people who are able to get an NVIDIA data science workstation, we are going to have the software, heavily based on RAPIDS, laid out. Ideally, once you get the data science workstation, it's loaded with the software that you need. That reference architecture will also refer to what we think the best hardware layout is, and we're just trying to, in another way, make GPU data science more accessible to people. Sometimes it's a common project in data science circles to try to build your own deep learning rig. I think that's a great exercise, but it's not for everybody, and I've been in some very serious corporate environments where IT is not going to let you bring in the computer that you built to start working on their proprietary data. Right, right.
And the data science workstation initiative is really about making it as easy as possible for an organization that wants to dive into GPU data science to get started. Cool. Any parting thoughts from you on RAPIDS or cuML, or advice for folks who haven't really been exposed to what NVIDIA is doing on the software side and want to explore more? I would just really encourage everybody to take a look. If you go to rapids.ai, that's a portal landing page that will lead you to everything I've mentioned here. It has links to documentation, links to GitHub, and we've got a Google Group. We encourage everybody who touches RAPIDS and finds something they don't like, or that doesn't work, to file a ticket. You can see our roadmaps and our current work on GitHub, and we really want the community involved. As I think about the machine learning algorithms that I'm going to roadmap next for the team to develop, a lot of that has been informed by customer and community feedback, and it's going to continue to be informed by customer and community feedback. So I'd just ask anybody who is interested to take a look, and please try to get involved with us, because that's really what's going to measure the success of our project. This is a real open source project, and we've done a great job of building community so far. We've got lots of stars and forks, but we want to see more of those, and we're always happy to see issues opened on GitHub. Awesome. Well, Paul, thanks for taking the time to share with us what you're up to. Well, thanks for having me. All right, everyone, that's our show for today. For more information on any of the shows in our GTC 2019 series, visit twimlai.com/gtc19. Thanks again to Dell for sponsoring this series. Be sure to check them out at DellEMC.com/precision. As always, thanks so much for listening, and catch you next time."
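The brute-force ridge-regression hyperparameter search from the GTC demo in this episode can be sketched with the closed-form ridge solution. This is just the underlying math in plain NumPy, not the cuML implementation, and the data and alpha grid are made up for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Synthetic stand-in for the real data set
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

# Sweep many penalty values; a fast solver makes this brute force cheap,
# which is the whole point of running "a thousand ridge regressions".
alphas = np.logspace(-4, 2, 50)
fits = {a: ridge_fit(X, y, a) for a in alphas}
w_small = fits[alphas[0]]  # near-zero penalty recovers true_w closely
```

Each fit is one linear solve, so sweeping fifty (or a thousand) penalties is exactly the kind of embarrassingly repeatable workload a GPU solver accelerates.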
#275 Beautiful Pythonic Refactorings
"Do you obsess about writing your code just the right way before you get started maybe some ugly code on your hands and you need to make it better. Either way re factoring could be your ticket to happier days. On this episode, we'll walk through a powerful example of iterative Lee re factoring some code until we eventually turn our ugly duckling into a python ick beauty on our hope is our guest in this episode to talk us through re factoring some web scraping Python Code. This is Talk Python Emmy Episode Two, hundred, seventy, five, recorded July, ninth, two, thousand, twenty. Nine. Welcome to talk by a weekly podcast on python the language, the libraries, the ecosystem in the personalities. This is your host Michael Kennedy. Follow me on twitter where I'm at in Kennedy, keep up with the show and listen to past episodes at Talk Python FM and all the show on twitter via at talked by on this episode is brought to you by us over at Talk Python. Training. Pythons acing imperilled program and support is highly underrated. Have you shied away from the amazing new eysenck keywords because you've heard, it's way too complicated or that it's just not worth the effort. But. The right workloads one hundred times speedup is totally possible with minor changes here code. But you do need to understand the internals and that's why our course acing techniques in examples and python show you how to write acing code successfully as well as how it works. Get started the AC can wait today with our course at talk by dot com slash ASE INC honor welcomed. Oh, talk by Sunday me. Thanks for having me on. Excited to be here. I'm excited to. It's going to be beautiful man. Hopefully. Hopefully. Yeah it's beautiful re factories. 
So I am a huge fan of refactoring. I've seen so many people really overthink the code that they're writing, like, well, I've got to get it right, and I've got to think about the algorithms the right way, and all this stuff. And what I've found is, you don't really end up with what you want in the end a lot of the time anyway. If you just go in with an attitude of "this code is plastic, it is malleable, and I can just keep changing it," and you're always on the lookout for making it better, you end up in a good place. Yeah, I completely agree. Refactoring is not a one-time thing, or something that happens only two years from when you initially write the code. I heard once, actually, that it goes hand in hand with legacy code, and there's a number of different definitions for legacy code, but one definition is that legacy code is code that isn't actively being written. So if you write something, and then you consider it done, and then the next week no one's working on it, that, technically, according to that person's definition, is legacy code, and that can be refactored. You can refactor something you wrote earlier in the day; it doesn't have to be a year or ten later. Yeah, absolutely. I mean, you just get it working, you learn a little bit more, and you apply that learning back to it. And the tooling these days is really good. It's not just a matter of, you know, if you go back to 1999 and read Martin Fowler's refactoring book, he talks about these are the steps that you take by hand to make sure you don't make a mistake. And now the steps are: highlight, right-click, apply refactoring. I mean, that's not one hundred percent true, and the example we're going to talk through is not like that exactly, but there are steps along the way where it is. And linters and static analyzers are heavily underutilized, I feel, and so many of them will just automatically apply the changes that you want to make. It's fantastic for huge code bases; it would be almost impossible to do it by hand.
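A toy version of the step-by-step approach described here (the function is invented purely for illustration): an index-based loop becomes an idiomatic generator expression, with an assertion acting as the safety net Fowler's hand-applied steps were meant to provide.

```python
def total_even_before(numbers):
    # Starting point: C-style index loop with a manual accumulator
    total = 0
    for i in range(len(numbers)):
        if numbers[i] % 2 == 0:
            total = total + numbers[i]
    return total

def total_even_after(numbers):
    # Refactored: iterate directly and let sum() do the accumulating
    return sum(n for n in numbers if n % 2 == 0)

# The safety net: both versions must agree before the old one is deleted
assert total_even_before([1, 2, 3, 4]) == total_even_after([1, 2, 3, 4]) == 6
```

Each small rewrite is checked against the previous behavior, which is what makes code feel "plastic" and safe to keep changing.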
Yeah, absolutely. It would definitely be risky, so maybe that's why people sometimes avoid it. Now, before we get into that, though, let's start with your story. How'd you get into Python? I know you're into languages — we'll talk about that — but Python to begin with. Yes. So — it's a long story, but the shorter version of it is — my degree in university, which wasn't computer science, required at least two introductory courses. The first intro course was in Python, the second one was in Java, and I ended up really enjoying the classes. I ended up taking a couple more, but ultimately stuck with the career that I had entered into, which was actuarial science. That's like insurance statistics. So you were in some form of math program, I'm guessing? Yeah, yeah. It's very, very boring to explain, but if you like math, it's a great career. Yeah, awesome. And so for my first job out of university, I ended up working at a software company that, very simply explained, created the insurance calculator that many insurance companies use. After working there for about four or five years, I had just fallen in love with the software engineering side of my job and decided that I wanted to transition full-time to a purely technical company. So fast-forward a couple of years, and now I work for NVIDIA as a senior library software engineer, and that's how I got into programming. The code base that we work on is completely open source and primarily uses C++14 and Python 3. That sounds like a dream job. Sounds awesome. Yeah, I absolutely love it. Yeah. So you're working on the RAPIDS team, right, which works on doing a lot of the computation that might be in pandas, but over on GPUs — is that roughly right? Yeah, that's a great description. So yeah, within NVIDIA, I work for an organization called RAPIDS. We have a number of different projects. Specifically, I work on cuDF — that is, C-U-D-F.
So the "cu" is the two letters from CUDA, which is the parallel programming language that NVIDIA has made, and the "DF" stands for data frame. So this is basically a very similar library to pandas, the difference being that it runs on the GPU. The one-liner for RAPIDS is: it's a completely open-source, end-to-end data science pipeline that runs on the GPU. So if you're using pandas and it works great for you, there's no reason to switch. But if you run into a situation where you have a performance bottleneck, cuDF can be a great drop-in replacement. We don't have one hundred percent parity with the pandas library, but we have enough that a lot of Fortune 500 companies that pick up and use us are able to very easily transition their existing pandas code — change an import line and it goes much faster, something incredible like that. That's the goal. That's the dream. Yeah. I just recently got a new high-end Alienware desktop, and it's the first GeForce I've had in a long time that's not some AMD Radeon in a notebook or something like that. So I'm pretty excited to have a machine that I can test some of these things out on at some point. Yep, acceleration on different devices is very exciting. All right, well, let's start by introducing briefly a little bit about refactoring — we talked a tiny bit about it in general — and then we're going to dive into a cool example that you put together that really brings a lot together. And what I love about your example is that it's something you just went and grabbed off the Internet. It's not contrived, like, "well, let's do this and then unwind the refactorings until it does X." It's like: we just found it, Michael, let's see what this thing does. That's gonna be fun. Let's just start with a quick definition of refactoring. Maybe, how do you know when you need it? How do you know when you need refactoring? For me, I have a number of anti-patterns in my
head that, when I recognize them in the code — some people might refer to them as sort of technical debt. This idea that the first time you write things, or maybe initially when you write things, you don't have the full picture in mind, and then as time goes on you start to build up technical debt in your code. And refactoring can be reorganizing, restructuring your code, or rewriting little bits of it to basically reduce tech debt — to make it more readable, maintainable, scalable, and just better in general. That's sort of the way I think of it. Yeah, and it is in the pure sense, right? It should not change the behavior, at least in terms of inputs and outputs. Exactly. And the easiest code to refactor is code with tests — that's unit tests or regression tests or any of the other number of tests that there are. If you have a code base that has zero tests, refactoring is very, very dangerous, because you can refactor something and completely change the behavior and not know about it, which is not ideal at all — somewhat suboptimal. And indeed, when Martin Fowler came up with the idea of refactoring — or at least he popularized it; sure, the ideas were basically there before — one of the things that struck me most was not the refactorings but this idea of code smells. It's this aesthetic: I look at the code and it works, but your nose kind of turns up, you know? It still works, right? It's not broken, but it's not nice. And there are all sorts, because code smells are things like too many parameters, long method, and so on. But they rarely have clear cutoffs, right? It's not like over twelve lines the function is too large but under that is totally fine. It's never really super clear-cut. So I think this whole idea of refactoring, much like refactoring itself, requires going over it again and again, sort of throughout your career, to refine what the right aesthetic to achieve is. It probably varies by language as well, a little. Yeah.
If you start to do it consciously when you're looking at code, asking yourself — when you have that code-smell feeling, like something's not right here — if you are conscientiously paying attention to what it is, then slowly, over time, you will start to pick up on exactly what it is about it. A very, very small one for me — and I think this is mentioned in maybe Clean Code, or it might have been Martin Fowler's book — is declaring a variable earlier than it needs to be declared. So you might declare all your variables at the top of a function, but then two of them you use immediately, while the other three you don't use until the last four lines of the function. Small things like that. It seems simple, but I've made the change where I've moved a declaration closer to where it gets used, and then you realize: oh, wait a second, this isn't actually referenced. It's set to something, but it's not actually used later on, so I can just delete it. Because it was at the top of the function, you couldn't see where it was being declared, or whether it was used somewhere else — you actually have a phantom unused variable that can be deleted. It's simple things that lead to better changes later on. Well, and just mental overhead, like you said — the technical debt side of things. For example, there's this variable that was at the top; surely when the code was written it was being used, but the code has been modified over the years and now it's no longer used. But because it's separated from where it's declared to where it's used, you don't wanna mess with it. If you start messing with that, you're earning more work, you're asking for more: "I'm just going to make the minor change, I don't wanna break anything." And then the next person that comes to try to understand it has to figure out: well, why is there that, like, set_count variable? I don't feel like it's being used, but it's there.
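The habit Conor describes — moving declarations next to their first use so dead variables become visible — might be sketched like this (a minimal illustration; the function names and data here are invented for the example, not taken from the talk):

```python
# Before: everything declared up front; it's easy to miss
# that `set_count` is assigned but never actually read
def summarize(times):
    total = 0
    set_count = 0  # phantom variable: set, never used
    header = "submission times"
    for t in times:
        total += t
    return header, total / len(times)

# After: declarations sit next to their first use, so the
# unused variable stands out and can simply be deleted
def summarize_refactored(times):
    header = "submission times"
    return header, sum(times) / len(times)
```

The behavior is unchanged — both return the same result for the same input — which is the whole point of a refactoring.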
You know, you've got another thing to think about that's in the way, for sure. Yeah. So certainly, I think it's findable — there are fantastic tools that will highlight "this variable is unused" or "this assignment is meaningless" or something like that. So there are options, but still, it's better to not let that stuff live in the code. Yep, agreed. Let's talk about this example that you've got here, and maybe you should give a little background on your language enthusiasm and programming competition interests, and so on. Your interest in those competitions, I think, is probably worth touching on first. But then this example is from you trying to reach out and understand and do some analysis of those environments, those ecosystems, right — the background with these different languages and coding competitions? Yeah. So I initially got into "competitive programming," quote-unquote. The one-sentence description is: there's a number of websites online — HackerRank, LeetCode, Codeforces — that host these one-, two-, three-hour contests where they have three to five problems that start out easy and then get harder as you progress through them, and you can choose any language you want to solve them in. The goal is just to get a solution that passes, as quickly as possible. So it's not necessarily about how efficient your code is — it has to run within a certain time limit, but if you can get it to run and pass, Python versus C++ versus Java, any correct solution works. I started doing these to prepare for technical interviews. If you're interviewing for companies like Google, Facebook, etc., a lot of their interview questions are very similar to the questions on these websites. And at one point I was looking for a resource online, like YouTube videos, that just explained this stuff, but at the time I couldn't really find any.
So I started a YouTube channel covering the solutions to these problems, and I thought it would be better to solve them in a number of languages, as opposed to just C++. So I started solving them in C++, Python, and Java, and that's sorta what led to my interest in competitive programming. Even though I'm not interviewing actively anymore, I just find these super fun. It keeps you on your toes in terms of your data structures and algorithms knowledge, and you can treat them like code katas — I'm not sure if you're familiar with the concept — of just sort of writing one little small program and trying it a couple of times in different languages, where you learn different ways of solving the problem that you might not have initially thought of. For this example, I decided to just figure out what the top languages are that people use to solve these competitive programming problems on a given website, and the site that I chose was Codeforces. Yeah, and you're like, hey, I'm working on this new data frame library that's like pandas — let me see how I can use pandas to solve this problem and get some practice, or something, right? Yeah, yeah. So when I had just started at NVIDIA, I knew that the pandas library existed, but I had zero experience with it, and I knew that it had this sort of group-by reduction functionality — that if you had a big table of elements, you could get these sorts of statistics, you know, what's the top language, or what's the average time it takes people to submit, very easily with this kind of library. So I thought, what better way to learn pandas than by trying to build a simple example that uses this library for something that I'm interested in? And so the first thing that I did was google, you know, how to scrape HTML tables using pandas, and it brought me to this blog that, at the end of the day, has about sixty lines of code, and it's a tutorial walkthrough.
So it walks you through how to get this data off of an HTML table, and basically the PyCon talk that I gave came out of doing this. I had no plans of giving a talk on this; it's just that, after having gone through it and refactoring it one step at a time, I realized that I could give a pretty simple talk to, like, Conor from five years ago, who didn't know about any of this — I didn't know about comprehensions, I didn't know about enumerate, I didn't know about all the different techniques I was using. And I figured that for at least some individuals out there it would be a useful talk, highlighting the things that I didn't know when I first started coding in Python but that are now second nature for me. That's where that came from. Yeah, and it's a really interesting example; it's cool. I do think that a lot of the refactorings were "let's try to make a more Pythonic, more idiomatic version of this" — like, misunderstanding the for-in loop, for example, and treating it wrong. So in a lot of ways it's a core refactoring, but it's also kind of leveraging more of the native bits of the language, if you will. Absolutely. Yeah. So you went and grabbed this code, and it does two basic things: it goes and downloads some HTML and then pulls it apart using, I think, the lxml HTML parser, and then it loops over the results that it gets from the HTML parser and turns this into basically a list or a dictionary. Then you're gonna feed that over to pandas and ask pandas pretty interesting questions, and most of the challenge, most of the messy code, lived in the HTML-parsing side of things, right? Yeah, that's a pretty good description of what's happening. So let's go and just talk through some of the issues you identified, and then the fixes — exactly how did you identify that as a problem, and then what fix did you apply to it? Now, there's a lot of code, and it's hard to talk about code in audio.
So we'll maybe try to stay at as high a level as possible and talk about the general patterns and what we fixed. The first part of the code would go through and it would create an empty list, create an index to keep track of where it was, and then do a loop over the elements, increment the index, and append things to the list, printing out information as it went. Right, yep. And I think the first thing that you talked about was a code comment. Actually, you're like, what is this code comment here? It just says we're looping over these things — but what do you think a loop is, and why do we have this comment? Yeah, even worse — the second comment, some might argue, adds some value, but the first comment, above the line that creates an empty list, says "create empty list." And it's only, what, six characters if you don't include the spaces? I think that's definitely one of the things that's called out in a number of refactoring books: comments should add value that is not explicitly clear from the code. I think even beginners are able to tell that you're creating an empty list there. There's no reason to basically state what the code is doing. Typically, comments should say why — if it's not clear why something is being done a certain way, or something that's implicit and not explicitly clear from what the code is doing. Yeah. In terms of refactoring, I love this idea that these comments are almost warning signs, because if I find myself writing one of these comments to make stuff more clear, I'm like: wait a minute, wait a minute — if this is just describing what's here, something about what I'm doing is wrong.
Maybe the variable name is not at all clear about what the heck it is, or maybe it could use a type annotation to say what types come in — like, here's a list of strings, so how about `list[str]` goes there? It's Python 3, after all. And, you know, from the code smells book, Fowler had this great description of calling these types of comments "deodorant for code smells" — something's wrong, it smells a little, and we're trying to cover it up. But every time I see one of those, it's like: I just need to rename this function, and a short version of this comment would be the name. Or rename this variable, or restructure it, or break these things apart — because if it needs a comment, there's probably a problem. There's an individual in the C++ community, his name's Tony Van Eerd, and he has — not a rule, but a recommendation — that you grep your code base for "step one," "step two," "step three," and guaranteed you're going to get one or two matches. A lot of times it's these step comments on top of pieces of code inside a larger function, and odds are you could make that code better by refactoring each of those steps into its own small function, named after whatever the step is. Like, if you put "step one" and a description, you've already given that piece a name — you just need to take the next step, put it in a function, and give that function the name. Yes, exactly. Exactly what you said. I think there was even some tool, way way back in the early days of C#, where if you highlighted code to refactor and the highlight included a comment, it would try to name the extracted function by turning the comment into something that would work as an identifier in the language. Anyway, it was totally a good idea. So there's a couple of things going on here. One is, like, why is there a print statement? Nobody needs this. Once you take that out, though, you were able to identify this.
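The "grep for step one" advice can be sketched like this (a hypothetical example — the steps and function names are invented for illustration, not taken from the scraping code under discussion):

```python
# Before: one long function whose phases are marked with step comments
def process(raw_rows):
    # step 1: clean up whitespace
    rows = [r.strip() for r in raw_rows]
    # step 2: convert numeric strings to ints
    return [int(r) if r.isnumeric() else r for r in rows]

# After: each step becomes a named function; the comments disappear
# because the function names now carry the same information
def clean(raw_rows):
    return [r.strip() for r in raw_rows]

def convert(rows):
    return [int(r) if r.isnumeric() else r for r in rows]

def process_refactored(raw_rows):
    return convert(clean(raw_rows))
```

Behavior is unchanged — `process([" 12 ", "abc"])` and `process_refactored([" 12 ", "abc"])` both give `[12, "abc"]` — but each step is now independently testable and named.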
Well, let's take a step back. If you have an integer and you increment it every time through the loop so that it stays in sync with the index of the element you're looping over — that's probably not the best way to do it, right? Python has a built-in: enumerate. Yeah, this is probably one of the most common things I see in Python. Sadly, certain languages don't have this function, but in Python it's right there, built into the language. As you mentioned, it's called enumerate. You can pass whatever thing you're looping over to enumerate, and it's going to bundle each element with an index that you can then destructure in-line into the index and the element you were getting from your range-based for loop before. So anytime you see an `idx` or an `i` or something that's keeping track of the index — and sometimes it's `j`, sometimes it's `k`, or `x`, `y` if you're being really creative — yeah, there is a built-in pattern for avoiding that, and it makes me extremely happy. It happens, actually, not just once in this piece of code but twice, where you can make use of enumerate, and once you see it, it's very hard to unsee it. But like I said, enumerate was something that I learned from Python; this was not something that I knew of, and I didn't learn it in school. So there are a lot of Python developers — developers in many languages — out there that I think just aren't aware of it, and as soon as you tell them, I think they'll agree: oh yeah, this is way better than what I was doing before. Yeah, you just need to be aware of it. Without it, you run into these issues: you've got to create the variable — and then, why is this variable here? Then you've gotta make sure you increment it. Do you increment before you work with the value or after? Is it zero-based or one-based? All of these things are just complexities that make you go: what is happening here?
Like, what if you have a `continue` that skips the rest of the loop, but you forget the increment? There are all these little edge cases, and with enumerate you can just say, you know, it's always gonna work. You can even set the start position to be one if you want it to go one, two, three. Beautiful. Yeah, that's a great point. There are use cases where you're gonna run into bugs, whereas with enumerate you know at least you're not going to have a bug with that index, right? It's always gonna be tied to the position, with the starting place the way you want it. So yeah, that's really nice. But it's not super discoverable, right? There's nothing in the language that screams and waves its hands and says: hey, you're in a for loop, we don't have this concept of a numerical for loop, this is actually better — this is what you wanted, you just didn't know you wanted it. Yeah, it has to be something you stumble across. Interestingly, some languages — Go is the one that comes to mind — actually build the enumerate into their range-based for loop. So in Go they have built in, basically, the destructuring, and if you don't want the index — if you just want a range-based for loop and you want to ignore the index — then you're supposed to use the underscore to say "I don't need the index for this loop." It's interesting that Go, a more recently created language than Python, decided when they designed it that this was such a common use case that most people would need it more often than not, so they built it into their for loop. So with that language you can't avoid learning about it, 'cause it's right there in the for-loop syntax — you at least have to explicitly ignore it. Yup. Interesting — I didn't know that about Go until now.
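The manual-index pattern and its enumerate replacement can be sketched like this (a minimal illustration; the list contents are made up for the example):

```python
languages = ["Python", "C++", "Java"]

# Before: a hand-maintained counter that must stay in sync with the loop
i = 0
positions = []
for lang in languages:
    positions.append((i, lang))
    i += 1

# After: enumerate bundles the index with each element automatically
positions2 = [(i, lang) for i, lang in enumerate(languages)]

# enumerate can also start counting at 1, as mentioned on the show
ranked = list(enumerate(languages, start=1))
```

Both `positions` and `positions2` come out identical, but the second version has no counter to forget to increment and no off-by-one to get wrong.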
Now you've got this a little cleaner. You look at it again and you say: well, now what we're doing is creating an empty list — which we commented "create empty list," which was very helpful in the beginning to help you understand — and then we loop over these items and append something to that list. That works, but this is one of your anti-patterns that you like to find and get rid of, right? This is an anti-pattern that I call "initialize then modify," and actually the enumerate example previously also falls into this anti-pattern. Anytime you have a variable before a for loop — many, many times it is before a for loop — where, inside each iteration of that loop, you're then modifying what you just initialized outside: that is initializing then modifying, and my assertion is that you should try to avoid this as much as possible. When it comes to the pattern of initializing an empty list and then calling append in each iteration of your loop, that is built into the Python language as something that can be expressed as a list comprehension, which is so much more beautiful, in my opinion, compared to a raw for loop with an append on each iteration. Yeah, every now and then there's a complicated enough set of tests or conditionals or something going on in there that maybe not — but I agree with you: most of the time it just means what I really wanted to write was a list comprehension. It is, though, literally `[item for item in such_and_such if test]` — if you need that test, that's what you've got. Yeah. List comprehensions — once you start to use them, moving to a language that doesn't have them makes you very sad, because it's gorgeous, and its absence totally makes you sad. And I really, really wish list comprehensions had some form of sorting clause, because at that point you're almost into in-memory-database types of behaviors, right?
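The initialize-then-modify pattern versus a list comprehension can be sketched like this (the data is invented for the example):

```python
cells = ["120", "45", "abc"]

# Initialize-then-modify: the anti-pattern being described —
# an empty list created outside the loop, appended to inside it
numbers = []
for c in cells:
    if c.isnumeric():
        numbers.append(int(c))

# The same logic as a single list comprehension
numbers2 = [int(c) for c in cells if c.isnumeric()]
```

Both produce `[120, 45]`, but the comprehension states the projection, the source, and the filter in one expression.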
Like, I would love to say "projection-of-thing for thing in collection where test, order by whatever." I mean, you can always put a sorted() around it, but it would be lovely — it's already got those nice steps. I like to write it on three lines, right? The projection, the source, and the conditional — just one more line with the order-by in there. Maybe I should put a PEP in. Who knows. I was gonna say, that sounds like a future PEP. Definitely. I mean, it would be easy to implement — the transform would turn into a sorted() with a key passed in, or something like that. But anyway, it would be really cool. They're very, very nice even without that, and once you have it as a list comprehension, it unlocks the ability to do some other interesting stuff — which you didn't cover in yours because it didn't really matter. But if you have square brackets there, and those brackets are turning a large data collection into a list, then if you put rounded brackets instead, all of a sudden you have a much more efficient generator. Yup. That is something I don't call out at that point, but at the end of the talk I allude to an article that was mentioned on your other podcast, Python Bytes. Yeah, thanks for the shout-out, by the way. Yeah, no, it was a great article. It mentions generator expressions right after it mentions list comprehensions, and I mention that these things go hand in hand and that you should familiarize yourself with both — because if at any point you're passing a list comprehension to an algorithm like any() or all() or something, you can drop the square brackets and just pass it the generator, and it'll become much more efficient. So — and there's no way to go from a for loop really quickly and easily to the generator, yield-style of programming, right? There's not, like, a "for-yield i in whatever." But with comprehensions, it's square brackets versus rounded parentheses, right?
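Dropping the square brackets when feeding `any()` can be sketched like this (a small illustrative example):

```python
cells = ["1", "2", "x", "3"]

# List comprehension: builds the whole intermediate list first
found_list = any([not c.isnumeric() for c in cells])

# Generator expression: same thing with the brackets dropped —
# no intermediate list, and any() can stop at the first True
found_gen = any(not c.isnumeric() for c in cells)
```

Both are `True` here, but the generator version never materializes the full list, which matters when the collection is large.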
It's so close that, if that makes sense, it's basically no effort to make it happen. Okay. So we've gotten to a list comprehension, which is beautiful, and then you say: all right, it's time to turn our attention to this doubly nested for loop. It goes over a bunch of the items, pulls out an index, and then goes and works with the index. So that's another enumerate. And then, I think, another thing that's pretty interesting that you talk about — I don't remember exactly where it came in the talk — is you're like, look: what you're doing in this loop is actually looping from the second element onward over all the items, and that really is just a slice. Yeah. Yeah. So in this nested for loop, the outer for loop basically reads `for j in range(1, len(your_list))`. You're creating a range of numbers from one to the length of your list, and then right inside that for loop you're creating a variable that's the j-th element of your list. So all you're doing is skipping the very first element of your list, but the way you're doing it is by generating explicit indices based on the range function and the length function. I thought at first that they must be doing this because they needed access to the index later, or access to other elements later, but that wasn't the case. It seemed like the only reason they were doing all of this was to skip over the first element. And so, very nicely — once again, Python has very, very many nice features — there's something called slicing, where you use the syntax of a square bracket, something in the middle, then a closing square bracket, and in order to skip the first element you just go `[1:]`, from one to the end. And that's beautiful, because you don't even have to check the length of the items.
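The range-and-index version versus the slice — plus the negative-index trick that comes up next — can be sketched like this (sample data invented for the example):

```python
rows = ["header", "a", "b", "c"]

# Before: explicit indices generated just to skip the first element
skipped = []
for j in range(1, len(rows)):
    skipped.append(rows[j])

# After: a slice says "from index 1 to the end" directly,
# with no call to len() or range()
skipped2 = rows[1:]

# Negative indices count from the end, so no len() arithmetic is needed
last = rows[-1]
last_three = rows[-3:]
```

Both skipped versions are `["a", "b", "c"]`, and `last_three` happens to be the same list here — "I want the last three, I don't care how long it is."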
You just go to the end, which avoids errors like: do I have to plus-one here, do I not, is it minus one? — fudging the ending piece. Yep, it's so convenient. You avoid making a call to len, you avoid making a call to range, and you avoid the local assignment on the first line of your loop. You can basically remove all of that, just use slicing, and you're good to go. And slicing is a really, really awesome feature — it actually comes from a super old language that was created in the sixties, called APL — and Python is one of the languages that has negative-index slicing, where you can pass in a negative one so that it wraps around to the last element, which is super convenient. It sorta looks weird, but once you use it, it's so much more convenient than doing a len-minus-one or something like that. It is a little bit unreal, but once you know what it does, it's great. It's great. It's like: I want the last three, I don't wanna care how long it is, I just want the last three. Yeah, it's fantastic. Slicing, I think, is fairly underused by people who come from other languages, but it fits the bill, because there are so many of these little edge cases. You talk about errors in programming — off-by-one errors are a significant part of the problems with programming, right? — and slicing just skips that altogether. It's beautiful. Okay, the next thing: so you're parsing this stuff off the Internet, which means you're working with one hundred percent strings, but some of the time you need numerical data, so you can ask questions like "what's the sixth or seventh" or whatever. And so they have — this is gonna be fun to talk about — they have `try: value = int(data)`. So they pass the potentially-numeric data over to the
int initializer. Either that's going to work, or it's going to throw an exception, in which case it says except-pass — well, not quite: the original article had this try-parse-except pattern where otherwise it's going to be None, or it's going to be set to the string value, or something to that effect. So what do you think about this? How'd you feel when you saw that? Yes. So my initial reaction was that this is four lines of code that can potentially be done in a single line, using something called a conditional expression. In many other languages, there's something called a ternary operator — typically a question mark — where you do an assignment to a variable based on a conditional, a predicate, something that's a true-or-false question, and if it's true you assign one value, and if it's false you assign another value. In Python, there's something called a conditional expression, whose syntax is: assign a value using the equals sign, then ask your question — in this case, we just ask, is it an int or a string? The value that comes first is actually backwards from the ternary operator. So in this case the line reads `data = int(data) if ...`, then check your predicate — and in Python we can just call isnumeric, which returns true or false based on whether it's a number. If that returns true, it ends up assigning `int(data)`; otherwise, you just assign `data` to itself, and no transformation happens on that variable, because it's not numeric. It's one line of code, it's more expressive in my opinion, it avoids using try-except, and it's preferable from my point of view. I would say it's probably preferable from my point of view as well. I have mixed feelings about this, but I do think it's nice under certain circumstances.
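The two versions being compared can be sketched like this (the variable name `data` follows the discussion; the sample value is invented). One caveat worth noting: `str.isnumeric()` is not a perfect stand-in for everything `int()` accepts — it rejects signed strings like `"-3"` — which feeds into the trade-offs discussed next:

```python
data = "42"

# The original blog post's approach: try the conversion, catch the failure
try:
    value = int(data)
except ValueError:
    value = data

# The refactored one-liner: a conditional expression
value2 = int(data) if data.isnumeric() else data
```

Both give `42` for `"42"` and leave a non-numeric string like `"C++"` unchanged.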
For example, if you say try-the-thing-except-pass, a lot of linters and PyCharm and whatnot will go "this is too broad of an except clause, you're catching too much," and you're like, okay, well, now to make the little squiggly in the scroll bar go away, I have to put a hashtag-disable-check comment, whatever. Right, now it's five lines, one with a weird exception comment to say: no, no, this time it's fine. So that's not ideal. I definitely think the conditional one-liner is more explicit, more expressive. The one situation where I might step back and go, you know, let's just do the try, is if there's more variability in the data. This version assumes that the data is not None and that it's string-like, right? But if you potentially get objects back, or you get None some of the time, then you need a little bit more of a test. I mean, you could always do `if data and data.isnumeric()` — that's okay — but then it becomes `if data and isinstance(data, str) and ...`, and there's some level where there are enough tests that you kinda want it to just crash, right? We'll just catch it and go. But, as we were talking about before we recorded, there's also a performance consideration. Potentially, definitely. And it's interesting — I'll let you speak to what you found, but in the YouTube comments on the PyCon talk, probably the most discussed thing was whether or not the conditional expression was less performant than the original try-except, because a couple of individuals commented that it was more Pythonic to use the try-except, and that it might also be more performant. Can you share what you found? Sure. Well, I think in terms of the Pythonic side, certainly coming from other languages like, say, C++, there's more of this "easier to ask for forgiveness than permission" style of programming, rather than the alternative, "look before you leap," right? Because in C, it could be a page fault — the program just goes poof and goes away, and something's wrong. Where is this?
In Python it's just going to throw an exception and you're going to catch it, or something like that, so there's this tendency toward that style. But in terms of performance, I wrote a little program, because I wanted to know: maybe this is faster, maybe it's slower, let's think about that. I've linked to it in the show notes. It's simple: it builds a list with one million items. It sets the random seed so it's always the same, so there's no variability; even though it's random, it's predictable random. It builds this list of either strings or numbers, randomly, a million of them, about a third of them strings and the rest numbers. Then it goes through and just tries to convert as many of them as it can over to integers, doing it either with the try-int-except-pass or with the isnumeric test. I got about 6.5 times faster to do the test, the one-line test, than to let it crash and realize it didn't work.

Yeah, so there you go. You heard it here on Talk Python to Me: conditional expressions, faster than try/except.

Talk Python to Me is partially supported by our training courses. How does your team keep their Python skills sharp? How do you make sure new hires get started fast and learn the Pythonic way? If the answer is a series of boring videos that don't inspire, or a subscription service you pay way too much for and use way too little, listen up. At Talk Python Training, we have enterprise tiers for all of our courses. Get just the one course you need for your team, with full reporting and monitoring, or ditch that unused subscription for our course bundles, which include all the courses, and you pay about the same as a subscription once. For details, visit training.talkpython.fm/business or just email sales@talkpython.fm.

There's a lot of overhead to throwing an exception, catching it, and dealing with all that.
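The actual program is linked in the show notes; a rough reconstruction, with an assumed mix of numeric and word strings, looks like this (timings will vary by machine):

```python
import random
import timeit

random.seed(1)  # fixed seed: "predictable random"
# One million items: a random mix of numeric strings and word strings.
items = [str(random.randint(0, 99)) if random.random() < 2 / 3 else "spam"
         for _ in range(1_000_000)]

def convert_eafp(values):
    # Try the conversion and catch the failure for non-numeric items.
    out = []
    for v in values:
        try:
            out.append(int(v))
        except ValueError:
            out.append(v)
    return out

def convert_lbyl(values):
    # Test first with the one-line conditional expression.
    return [int(v) if v.isnumeric() else v for v in values]

t_eafp = timeit.timeit(lambda: convert_eafp(items), number=1)
t_lbyl = timeit.timeit(lambda: convert_lbyl(items), number=1)
print(f"try/except: {t_eafp:.3f}s  isnumeric: {t_lbyl:.3f}s")
```

On the episode's numbers, the one-line test came out about 6.5 times faster; the gap is the cost of raising and catching an exception for every non-numeric item.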
Right, and this is one particular use case. Like all these benchmarks, it might vary: if you've got ninety-five percent numbers and five percent strings, it might behave differently. So there are a lot of variations you can play with, but in what seems like a reasonable example to me, it's faster to do the isnumeric test. So much faster, right? Not like five percent faster; six hundred fifty percent faster is worth thinking about, for sure. Let's see. So, coming through to the end: there was a ton of stuff here, like twenty lines of code just for these two loops, and now you've got it down to four lines, basically an outer loop and an inner loop grabbing the data, plus this little test you've got. Much nicer.

I agree, yep.

So, I think if you look at the overall program at this point, you were doing some analysis, some reporting; it started at sixty lines of code and now it's down to twenty?

Roughly, depending on how you count lines and whatnot, but it was about sixty down to about ten or twenty lines. And at this point I had sort of pointed out that I had made a mistake. So this was fantastic, at least I had thought: I'd taken a code snippet from a blog and reduced it by roughly seventy-five percent, or sixty-seven percent, depending on how you measure it. But I had made an even bigger mistake than I had realized, and it was this: when I originally showed googling for, you know, how to scrape HTML using pandas, I had read the second result, and the third result was actually what I should have chosen. pandas actually has a read_html method in the library. So the point I go on to make is that if you use that, you go from ten or twenty lines down to like four lines of code. You're just invoking this one pandas API, read_html, and it's so much better. So refactoring is fantastic, but there's some quote about how the best code is no code.
If you don't have to write anything to do what you want to do, and you can just use an existing library, that's the best thing you can do, because that's going to be way more tested than the custom code you've written. It's going to save you a ton of time, and you're going to end up with ultimately less code to maintain yourself.

And it's better to have someone else maintain the code you're using, for you.

Exactly right. It gets better for no effort on your part. It might get faster, it might handle more cases of broken HTML, who knows, but you don't have to keep maintaining that; read_html just keeps getting maintained.

Yeah. One of the things that I've echoed in some of the other talks I've given is knowing your algorithms. In C++ there's a whole standard library; in Python there are a lot of built-in functions. I guess they're not so much called algorithms, they're called built-in functions, but there's a whole page, I was just looking at it the other day, and there's a ton of them that people aren't aware of. Everyone knows about map, filter, any, all, but I just saw one, I think it was divmod, a built-in function for giving you both the quotient and the remainder. There have definitely been a couple of times where I've needed both of those and done the operations separately, and it's like, ugh, if I'd just known about it: you can get both in a single line and destructure the result using iterable unpacking. Knowing your algorithms is great, but so is knowing your libraries, knowing your collections. The more familiar you get with what exists out there, the less you have to write, and the more readable your code is, because if everybody knows about it, we have a common knowledge base that's transferable to every project you work on. Your final version basically had two really meaningful lines: one was requests.get, the other was pandas.read_html.
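The final version described here boils down to something like this (a sketch assuming the requests and pandas packages; the function name is mine, the URL is a placeholder, and pandas.read_html also needs an HTML parser such as lxml installed):

```python
import pandas as pd
import requests

def scrape_tables(url):
    """Fetch a page and let pandas parse every <table> into a DataFrame."""
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_html(response.text)  # a list of DataFrames, one per table
```

Two meaningful lines of work, requests.get and pd.read_html, and the library handles all the parsing edge cases for you.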
You don't have to explain to anyone who has done almost anything with Python what requests.get means: oh yeah, okay, got it. We all know it works, we know it works well, and so on, and it's really nice. I think, though, what you've touched on here is really important, but it also shows why it's kind of hard to get really good at a language, and the reason is that there are so many packages. Let me pull up PyPI now; every time I go there, there are more, a couple hundred thousand packages. If you want to learn to be a good Python programmer, you need to at least have awareness of a lot of those, and probably some skill in some of them, because, like pandas was one of those, and requests another, the four-line solution you came up with was built on those two really cool libraries. So being a good, effective programmer means keeping your eye on all those things, and I think that's both amazing but also kind of tricky, because: "I'm really good with for loops and functions!" Great, now you've got two hundred thousand packages to study. Go.

There's some quote that I've heard before, that being a language expert is ten percent language, ninety percent ecosystem, and you can't be a guru in insert-any-language if you don't know the tools, if you don't know the libraries. It's so much more than just learning the syntax and the built-in functions that come with your language. It takes years, and it definitely doesn't happen overnight. It's a challenge for all of us, for sure.

You know, maybe it's worth a shoutout to awesome-python.com right now as well, which has different categories you might care about and then highlights some of the more popular libraries in each area.

That's a good one. Yeah, for sure. So, you did nine different steps?
You actually have those called out very clearly in your slides, and you can get the slides from the GitHub repo associated with your talk, which I've linked to in the show notes, of course. But all of this refactoring talk was really part of the journey to answer a totally different question, which was: what are the most popular languages for these coding competitions?

Yeah, the ultimate goal was to scrape the data and then use pandas to do that analysis. And at the end of the day, I believe, I definitely know, the number one language was C++, at about eighty-nine percent, and that typically is the case, because certain websites give the same time limit per language. Some, like HackerRank, vary it by language: for Python, the execution time you're allotted is ten times more, so even though Python is slower, they give you a fair amount of time. But most websites don't do that. The Codeforces website gives you, I think, two seconds of execution time regardless of the language you use, and due to that, most people choose the most performant language, which is C++. But in second place was Python, and I know a lot of competitive programmers who, for the problems where performance isn't the thing you're trying to solve for, always use Python, because it's a fraction of the number of lines of code to solve it in Python compared to any other language. Sometimes you can solve a problem in one line in Python and the next closest language is like five lines, which is a big deal when time matters.

Yeah, are you optimizing execution time or developer time in this competition, right?

Yeah, it definitely matters what you're trying to solve for. C++ was first, Python was second, Java was third, and then there was a bunch of fringe languages; the top three of those were C#, Pascal, and Kotlin. You can see the full list if you go watch the PyCon talk, but it was fun to find out what was used and what wasn't, and cool to see the evolution of what you created to answer that question. Pretty neat. All right, well, let's talk a little bit about RAPIDS, because I know the people out there are largely data scientists and they're probably interested in that project. We did mention a tiny bit that it basically takes the pandas DataFrame API, something like that, pretty close, not one hundred percent identical in everything, but pretty close, and it runs on GPUs. But why are GPUs better? Like, I have a really fast computer: I have a Core i9 with six cores I got a couple of years ago. That's a lot of cores, right?

So yeah, the first thing I should highlight is that RAPIDS is more than just cuDF, the library I work on. We also have cuIO, cuGraph, cuSignal, cuSpatial, and cuML, and each of those maps to a different part of the data science ecosystem. cuDF is definitely the analog of pandas; for cuML, the analog to think of is scikit-learn. But none of this is meant as a replacement; they're just meant as alternatives. If performance is not an issue for you, stick with what you have; there's no reason to switch.

Yeah, and for example, I couldn't run it on my MacBook anyway, right? Because I've got a Radeon.

Right. If you do want to try it out, though, go to rapids.ai; we have links to a couple of examples using Google Colab that are hooked up to free GPUs, so you can just take it for a spin. You do need the hardware eventually, but you can still go try it out. Our pitch is that this is useful for people that have issues with compute, and for different pieces you're going to want different projects.
So if you're doing pandas-like data manipulation, cuDF is what you want. But yeah, why are GPUs faster? It's just a completely different device with a completely different model. GPUs, the G in GPU, are known for being great at graphics processing, which is why it's called a GPU. But at some point someone, and he actually works on the RAPIDS team, Mark Harris, coined the term GPGPU, which stands for general-purpose computing on GPUs. It's now typically referred to as just GPU computing, but it's the idea that even though the GPU model is great for graphics processing, there are other applications GPUs are also amazing for. The next best one is matrix multiplication, which is why they sort of became huge in neural nets and deep learning. But since then we've basically discovered that there's not really any domain where we can't find a use for GPUs. So, there is a standard library in the CUDA world called Thrust. If you're familiar with C++, its standard library is called the STL, and it has a suite of algorithms and data structures you can use; Thrust is the analog of that for CUDA. It has reductions, it has scans, and it basically has all the algorithms you might find in your C++ STL, and if you can build a program that uses those algorithms, you've just GPU-accelerated your code. However, using Thrust isn't as easy as some might like, and a lot of data scientists are currently operating in Python and R; they don't want to go learn C++, and then CUDA, and then master the Thrust library, just in order to accelerate their data science code. The RAPIDS goal is basically to bring this GPU computing model, a sort of general-purpose acceleration of data science compute, or whatever compute you want, to the data scientist. So if they're familiar with the pandas API, let's just do all that work for them.
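Thrust itself is C++, but the reduction and scan vocabulary has direct analogs in Python's standard library, which may make the idea concrete (a CPU-only illustration, not RAPIDS code):

```python
from functools import reduce
from itertools import accumulate
import operator

data = [1, 2, 3, 4]

# A reduction collapses a sequence to a single value (Thrust: reduce).
total = reduce(operator.add, data)

# A scan keeps every running intermediate result (Thrust: inclusive_scan).
running = list(accumulate(data))

print(total, running)  # 10 [1, 3, 6, 10]
```

Both patterns parallelize well, which is why libraries built from them can be mapped onto thousands of GPU cores.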
So RAPIDS is built heavily on top of Thrust and CUDA, and we're basically doing all this work for the data scientists so that they can take their pandas code and, like you said, hopefully just replace the import, and you're off to the races. And the performance wins are pretty impressive. I'm not on the marketing side of things, but in the talk I mentioned that I just happened to be listening to a podcast, the NVIDIA AI Podcast, and they had a guest, I believe the name was Nicholson, and by swapping out pandas for cuDF in their model, they were able to get a hundred-x performance win and a thirty-x reduction in cost.

That's thirty times, not thirty percent.

Yes, thirty x is right, a multiplicative value, which is massive. If it's a hundred x in terms of performance, that's the difference between something running in sixty seconds versus an hour and forty minutes. And if you can also save thirty x: if that cost you a hundred bucks, now you only have to pay about three dollars. It seems like a no-brainer for those individuals that are impacted by a performance bottleneck. If you're hitting pandas and it runs in a super short number of seconds, it's probably not worth it to switch over.

Yeah. Well, and you tell me how realistic you think this is, but you could probably do some kind of conditional import: in the import, you could try to get the RAPIDS stuff working, and if that fails, you could just import pandas under the same name. One is pd, the other is pd. Maybe it just falls back to working on regular hardware, and it's faster when the GPU is there.

I think that is definitely possible. There are going to be limitations to it, though: obviously, if you have a sort of cuDF DataFrame, I don't think you can do it piecemeal.
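The conditional-import idea being floated could be sketched like this (a hypothetical helper of mine; whether any given pandas program runs unchanged on cuDF is a separate question):

```python
import importlib.util

def pick_backend(candidates=("cudf", "pandas")):
    """Return the first importable DataFrame library from candidates."""
    for name in candidates:
        # find_spec tells us whether the module exists without importing it.
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    raise ImportError(f"none of {candidates} is installed")

# pd = pick_backend()  # cuDF on a GPU box, plain pandas everywhere else
```

Because both libraries are bound to the same name, the rest of the program can stay backend-agnostic, which is the fallback behavior being discussed.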
But if you have a large project, what I'm thinking is: you write it for the RAPIDS version but then let it fall back to pandas, not the other way around. If you take arbitrary pandas code and try to RAPIDS-ify it, that might not work, but it seems like the other direction may well work, and that way, if somebody tries to run it and doesn't have the right setup, it's just slower.

There's definitely a way to make that work. It might require a little bit of boilerplate framework code that does some sort of checking, you know, "is this compatible, else...", but that definitely sounds automatable.

Yeah, that sounds cool, because it would be great to have it fall back to just not being as fast, rather than not working.

Right. The future of computing is headed to a place where we can dispatch compute to different devices without having to manually specify that I need this code to run on the CPU versus the GPU versus the TPU, and in the future, I'm sure there's going to be a QPU for quantum processing units.

Exactly. Most of us that don't work at NVIDIA still think serially, in terms of the way that we do compute, but I think in ten or twenty years we're all going to be learning about different devices, and it's going to be too much work to always keep track in our heads of which device things should go to. At some point there's going to be a programming model that comes out that just automatically handles when work can go to the fast device and when it should be sent to the CPU.

Yeah, absolutely. So, while you were talking, I pulled it up on that Alienware gaming machine I've got: a GeForce RTX 2070, which has two thousand three hundred and four cores. That's a lot, that's a lot of cores. And if you look it up, Google claims it achieves 7.5 teraflops, and the Super version increases that to around 9 teraflops, which is just insane compared to, like, a Core i7 doing a fraction of one. So anyway, the numbers boggle the mind when you think of how much computation graphics cards do these days.

I think top of the line, and I might get this wrong, but modern GPUs are capable of something like fifty teraflops. It's an immense amount of compute that's hard to fathom, especially when you're coming from the CPU sort of way of thinking.

Yeah, absolutely. The only reason I didn't get a higher graphics card is that every other version required water cooling, and I'm like, that sounds like more effort than I want for a computer; I'll just go with this one. All right, well, RAPIDS sounds like a super cool project, and maybe we should do another show with the RAPIDS team to talk about these things a little more deeply. But it sounds like a great project you're working on.

I work on the C++ lower-level engine of it, but I'd be happy to connect you with some of the Python folks that work on that side of things, and I'm sure they'd love to come on.

Yeah, that'd be fun. All right, now before you get out of here, I've got to ask you the questions. If you're going to write some Python code, what editor do you use?

So, I am a VS Code convert. That's what I typically use day to day.

Nice. Yeah, that's quite a popular one these days. And then, a notable PyPI package you've run across that people should know about?

Yeah. So I like to recommend this one. There's a module in the standard library, which I'm pretty sure most Python developers are familiar with, itertools, which has a ton of great functions. But less well known is a PyPI package called more-itertools, and I'm not sure if this one's been recommended on the show before, but if you like what's in itertools, you'll love what's in more-itertools. It has a ton of my favorite algorithms, chunked being one of them: you basically pass it a list and a number, and it gives you a list of lists, each consisting of that many things.

It's like paging, yeah.

Yeah, and there are tons of neat functions.
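chunked, mentioned here, behaves roughly like this standard-library sketch (the real implementation lives in the more-itertools package on PyPI):

```python
from itertools import islice

def chunked(iterable, n):
    """Yield successive lists of up to n items, like more_itertools.chunked."""
    it = iter(iterable)
    # islice pulls the next n items; an empty chunk means we're done.
    while chunk := list(islice(it, n)):
        yield chunk

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The "paging" comparison fits: each yielded list is one page of results, and the last page may be short.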
Another great one that's so simple but doesn't exist built in is all_equal: given a list, it just checks whether all the elements are the same. It's a simple thing to do, and you can do it with all, but you have to check that every element is equal to the first one. So there's just a ton of really convenient functions and algorithms in more-itertools.

Yeah, that's cool. And you can combine these with generator expressions and stuff: pull some element out of each object in a generated collection and ask whether all of those are equal. All these ideas go together well.

Yeah, they compose super nicely. Yeah, for sure. All right, final call to action: people are interested in doing refactoring and making their code better, maybe even checking out RAPIDS. What do you say?

I'd say, if you're interested in what you heard on the podcast, check out the PyCon talk. It's on YouTube; if you search for PyCon 2020, you'll find the YouTube channel. And if you're interested in RAPIDS, check us out at rapids.ai. I assume all this stuff will be in the show notes as well.

Yeah, it will. And also, you have your own YouTube channel. Maybe just tell people how to find that, and we'll put a link in the show notes as well, so they can watch you talk about some of these solutions and these competitions.

Yes. So my online alias is code_report. If you search for that on Twitter, YouTube, or Google, I'm sure all the links will come up, and you can find me that way.

Awesome. I'll link to that as well. Well, Conor, thank you so much for being on the show. It was a lot of fun to talk about these things with you.

Thanks for having me on. This was awesome.

You bet. Bye-bye.

This has been another episode of Talk Python to Me. Our guest on this episode was Conor Hoekstra, and it's been brought to you by us over at Talk Python Training. Want to level up your Python?
If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new async course, which digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle; it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcast app and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening; I really appreciate it. Now get out there and write some Python code.