3 episode results for "cuDF"

#269 HoloViz - a suite of tools for Python visualization

Talk Python To Me

55:57 min | 1 year ago

#269 HoloViz - a suite of tools for Python visualization

"The tool chain for modern data. Science can be intimidating. How do you choose between all the data visualization libraries out there? How about creating interactive Web APPs from those analysis on this episode we dive into a project that attempts to bring that whole story together. All of his of his coordinated effort to make browser-based data visualization in python easier to use easier to learn and more powerful. We have Philip Ruediger from hall of his here to guide us through it. This is talk, python absurd. Two hundred, sixty nine recorded June fifteenth two thousand twenty. Welcome to talk by a weekly podcast on Python, the language the libraries, the ecosystem in the personalities. This is your host Michael Kennedy. Follow me on twitter. I'm at in Kennedy. Keep up with the show and listen to past episodes at talk python. Dot Com, and follow the show on twitter at talked python. This episode is sponsored by Brilliant Dot Org and data dog. Please check out what they're offering during their segments. It really helps support the show. Bill book under attack by enemy. I'm excited to have you here now. We're GONNA talk about a bunch of libraries that are all brought together around this hall of his overall Meta Project. If you will to make yeah, exactly this umbrella to make working with data and exploring and visualizing python a little bit easier tonight, I think that's a great project in. It looks like it's getting a lot of traction and I'm happy to be here to chat about it. About the. Absolutely before we get to that, though let's start with your story. Had you get into programming in Python? So I got started with programming? Creating relates apart from like the. City's website and you hack some objects us. Really didn't get started with actual programming. Joined a autonomy engineering course monographs outside moved from. Germany to the to study electronic engineering music. Thinking is suppose better. Music was, but I took a liking to kind of programming. And See in bear log, pretty low level stuff, and then towards the end of that project kind of Undergrad degree. Simulator of BIPEDAL locomotion into us, which was far more complex than that envisioned it, but it was really exciting to me to actually get into a big projects of my own, and from there are then joined masters, course and stir going pricing in. His and we had the simulator called topographic. Ethical you talked about working on his bipedal locomotion stimulating. See plans like forget the language. That kind of stuff is trickier than it seems like it should be right. Oh, absolutely, because back naughty about more networks. Jumped in heard about no networks, genetic programming built achieved network with wage prompters and assumed like genetic. Make it work it didn't it didn't really work related works? Off around like my goal, bipedal humanoids around the but it never actually managed jared. Real emotion. Yeah, I, guess, there's something to train these models correctly and getting them set up right in the first place. It's not just magic that you can throw at a problem, right. Arkansas This thing wasn't complicated enough. solvable brains that's. Joined the lashes. A graduate. Program Moral Matic's. Actually Boring! Party It was at like trying to model the way brains and synapses work using neural networks. Yes, so it was pretty close to what? Controversial owner network nowadays or around back then they aren't as popular then, but we also have like recurring corrections. 
The idea was to model the human visual system, how the early visual system develops. And so, starting out, what kinds of problems were you trying to solve, what were you asking it to do and checking whether it was successful? It was really about self-organization, so that you didn't have to program a bunch of structure into the network by hand, or organize it like many of the networks nowadays are; we were trying to stay closer to the actual biology, so we had all these cell types that were interacting. And those models were tremendously complex, and it was just super hard. I ended up with this huge model with a huge number of parameters just to keep a working model running. And that's actually where I started talking with a colleague of mine, who was asking, how do we inspect all these things properly, instead of just looking through the raw output when a run finished? And then I started writing tools for this. Yeah, it's like trying to watch the green characters in The Matrix, right? Like, no, I can't see it this way; we've got to look at it better. Yes. So the idea was that we'd start building something, because you have these huge parameter spaces, right, like excitatory strengths, inhibitory strengths, and the model evolves over time, so we had very complex output. Basically, to explore that, we built this tool, HoloViews, and started digging in: what effect does a parameter actually have on the evolution of the model, so you can drag a slider and see the strength of the effect. And that's how it all started; that was a real breakthrough, actually. But it also meant that eventually I started spending more time building this visualization tool than working on my actual research, and one of the things I found out was that I found that more rewarding. Yeah, that's the real danger, right? Like, for me, I started programming while doing research on complex systems and math and whatnot, and after a while I realized the part of the project that really made me happy was when I was not doing the math. That was a sign that I should probably be doing something else. But it is really fun to build these things. I also do think it's a challenge of research projects and academic work in general: it's hard to get credit. They're not going to go, man, that's a killer contribution to data science you made there; it's, where's your PhD, where's the paper, right? Exactly, exactly, yeah. Actually, I did publish a paper on HoloViews, and it turned out to be the only paper I published; the rest, my models, didn't work until very late, like two weeks before the deadline the models were finally working and the results came together. Man, down to the wire. Down to the wire, yeah. Well, it's better late than never in that case. So that's how you got into programming, and Python is obviously a natural place to go if you're doing neural networks. What time frame was this, what year? So I joined, it was a really good program, the Doctoral Training Centre in Edinburgh, which actually doesn't exist anymore; it was part of the informatics department, but they had close collaboration with other departments. The first year was a master's program, and then I took way longer than I should have to finish my PhD, but I think 2015 was when it ended. Yeah, yeah, cool. I'm just wondering, since you started around 2010 with some of this stuff, if you started now, how much easier do you think it would be, or would it be basically the same, to work on both the visualization problems and the neural network problems? It just seems like that has come so far in the last five years.
Oh, absolutely. I remember we obviously had to interface with C code; we had something like scipy.weave, which I don't think is maintained anymore, and it was a really awkward interface for C extensions, whereas nowadays with Numba the kernels just run: you write something very close to Python and very often it just runs fast. Right, absolutely. Same with the visualization tools: there are so many interactive options now, whereas back then even the actual rendering was painful. Okay. All right, well, what do you do these days, day to day? So it's really nice; I have the freedom to switch between things. I joined Continuum, which is now Anaconda. Actually, before finishing my thesis I was running out of funding, so I joined Continuum as a job while writing my thesis, and I joined to do consulting, solving all kinds of problems for various clients and corporations. But from the very beginning we had this idea that we build tools that solve people's problems in general, and that kind of model has worked really well for us: the entire HoloViz suite was built that way, spending quite a bit of open-source time, some of it billable time set aside for it, but also, for example, Panel was built with funding from the US Army Corps of Engineers, so they funded this new framework for basically six months or so, and that turned into Panel, which is really cool; we'll talk about that for sure. So I go between consulting work most of the time, but as much as possible I contribute to the open-source stuff and work on it during my own time as well. Yeah, and you mostly do remote work, I would guess? Yes. So actually, I was in Edinburgh, then in Austin for a number of years, and last year I moved back to Berlin. Anaconda actually had an office there that had just opened this year, and I thought it would be nice to actually spend three days a week at the office, see people, have a more regular routine instead of working until three a.m. But then all of this happened. Yeah, well, it would be nice to be around people, though not for everyone. Yeah, for sure, it's a bit of a bummer. I mean, folks like us can work remotely and just carry on mostly with what we were doing, so it's a bit of a bummer for us, but for a lot of people it's a tragedy, right? It's a huge, huge problem. And especially, I don't know how some people manage; it seems really, really scary. Hopefully we get through that soon and we can go back to an office, and who knows what the desire to get back to working together will be like; some of these remote ideas I think are going to stick, and some people are going to be like, whoa, I'm so glad that's over. Yeah, I think so. Particularly, I mean, it's really not a good test of remote work: a forced scenario where people don't have child care at home. Yeah. I think you touched on the real challenge. It's one thing to say, well, let's all try to be remote for a while; it's another to say, let's work with your small children around you all the time. That is the real struggle, I think, as a parent, to find the time and the focus. So I think it's an unfair test, but if it's working under this scenario, this is like the worst-case scenario, so obviously... Exactly, exactly. It's interesting, and it's kind of nice, actually; I like that client meetings have become a bit more human. Yeah, it does humanize people a little bit, I think. You know, it's gone pretty far: you watch the news, or you watch the comedy shows that are still going, and it's just like, yep...
Everyone's at their couch or their dining table or a little home office. It's funny. So let's talk about the history of this project a little bit: you started with HoloViews, then PyViz, then HoloViz, and there's a slightly confusing history there. This portion of Talk Python To Me is brought to you by Brilliant.org. Brilliant has digestible courses in topics from the basics of scientific thinking all the way up to high-end science like quantum computing, and while quantum computing may sound complicated, Brilliant makes complex learning uncomplicated and fun. It's super easy to get started, and they've got so many science and math courses to choose from. I recently used Brilliant to get into rocket science for an upcoming episode, and it was a blast. The interactive courses are presented in a clean, accessible way, and you can go from knowing nothing about a topic to having a deep understanding. Put your spare time to good use and hugely improve your critical thinking skills. Go to talkpython.fm/brilliant and sign up for free. The first two hundred people that use that link get twenty percent off the premium subscription. That's talkpython.fm/brilliant, or just click the link in the show notes. How did you go from trying to create better visualizations to this larger project? Maybe tell people how you got there, and then give us a high-level view of what it is. So we started with HoloViews, which was actually built on this library called Param. Param is kind of like what you now have with data classes: it gives you typed, validated parameters, just general semantics for declaring them, and Param is kind of the foundation of everything, including the other HoloViz projects. It had been around for ages before any of this. We rebuilt HoloViews on top of that, and then one of our first projects at Continuum was to build an extension of HoloViews for geographic data, GeoViews, which adds projections and support for geographical data and whatnot. Then what we saw over and over in our consulting projects was that people were happily doing their analysis in notebooks, and then they would share those with someone who doesn't know about code and gets scared off by it. That's kind of how we started building our tools. Notebooks are pretty nice to show people, but at least in Jupyter, as far as I know, there's not a great way to say: please load this with every bit of code collapsed, right? I mean, there are templates, but they're kind of obscure and not everyone is familiar with what's going on, and if you just wanted to lay anything out nicely as a dashboard, there wasn't really a good way. That's what changed. And then we just needed a name for all of this stuff, and we decided on PyViz. And we had a little bit of pushback on that, because it sounds, by association, presumptuous, like you can't claim all of Python visualization, right? Bokeh and the others, yeah, and that was a totally fair criticism. We talked to various community members and agreed that PyViz would become a general resource and we would find a new name. It's been confusing, obviously; I think the rename was a year and a half ago, and we had run with the old name for a year and a half before that. So oftentimes out there PyViz still shows up, there are videos on YouTube of presentations you gave, and sometimes it was still called PyViz rather than HoloViz.
I wonder what the relationship is there, historically, where that comes from. So I think overall PyViz has become this general resource, and we're absolutely happy to have a listing of all the different visualization libraries on there, and we'd like to have more tutorial material pointing to them. PyViz is the general resource, and HoloViz is now our own effort: a coordinated set of tools that work together to make browser-based visualization and dashboarding easier. Yeah, very cool. You're basically making a bunch of choices for people, like here's a way that you can plot stuff, here's a way you can stream data or process large amounts of data that doesn't fit in RAM or whatever. Yeah, but you're not forcing them down that path, right? One of the things that was cool was, okay, if you need more control or you want to do something different, you can just do it yourself; if you want to do less work and accept the defaults, the default way of working, you can use what's built in. Exactly. The idea is that everything we provide is about shortcuts, not dead ends. It should be easy: you should just be able to get something on screen quickly, just plot your data. But then you shouldn't be stuck; you should be able to take that quick plot and customize it from there, and that's something we had some challenges with. HoloViews is pretty opinionated, and it doesn't fit the regular Matplotlib model, that imperative model where you get your figure, get your axes, modify the axes a little bit. It's about just wrapping your data and having it visualize itself, and then you can tweak the options on that. So we kind of decided we'd rather meet people where they are. There are already such big existing tools, and rather than tell people this is the way, we want them to be able to use what they already have. That's the philosophy behind hvPlot. If you use pandas, you'll know .plot, which takes your pandas data and says this goes on the x axis, this goes on the y axis, colored by this column, and that gives you the plot. We wanted to take that and say, well, hvPlot is meant to work not just with pandas but with Dask, xarray, GeoPandas, streaming data, and the most recent addition, cuDF, the GPU data frame. Yeah, I hadn't heard of cuDF; that sounds really cool. But Dask is very powerful; I had Matthew Rocklin on the show to talk about that, and it's like pandas, but across multiple machines if necessary, or running out of core locally, that sort of thing.
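As a rough illustration of the hvPlot API described above (the pandas .plot-style call: pick x, pick y, color by a column), here is a minimal sketch; the sample data and column names are made up for the example and are not from the episode.

```python
import pandas as pd
import hvplot.pandas  # noqa: registers the .hvplot accessor on pandas objects

# Hypothetical tabular data
df = pd.DataFrame({
    "bill_length": [39.1, 46.5, 49.3, 38.2],
    "bill_depth": [18.7, 17.9, 19.9, 18.1],
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
})

# Same spirit as df.plot(): say what goes on x, what goes on y, and what to color by.
# The result is an interactive Bokeh-backed HoloViews object rather than a static image.
scatter = df.hvplot.scatter(x="bill_length", y="bill_depth", by="species")
scatter  # renders in a notebook; from a script, hvplot.show(scatter) opens it in a browser
```

The same `.hvplot` call is intended to work when the data is a Dask, GeoPandas, xarray, or cuDF object instead of a plain pandas DataFrame, which is the "meet people where they are" idea Philipp describes.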
Yeah, and it's super fun, particularly because we also have Datashader. Datashader takes your data, and it's been described as like a fancy 2D histogram, and it's really fast: it's built on Numba and Dask, and it generates images from data points really quickly. It doesn't just support point data; it supports lines, regridding of rasters, quad meshes and trimeshes, and it basically just takes your data and renders it, so you can put something on screen really quickly, fast and accurate, for very large data sets. Yeah, that's cool. I think it was on your project where you've got a picture of, was it the US, showing every single person and where they were? Yeah, roughly three hundred million data points, with a second or less of interactivity as you zoom around. And particularly now that NVIDIA has this new initiative, RAPIDS, rebuilding the PyData ecosystem on top of GPUs, it's crazy fast. You could always use CUDA to write these kernels yourself, but that was obscure; now it just works. They had an initial prototype of Datashader on the GPU, our developers refined it and took it in, added the support that was needed, and now you can take those three hundred million points and render them in maybe twenty milliseconds. Incredible. And it scales across multiple GPUs as well. It never ceases to amaze me how powerful GPUs are: I think, wow, it can do that, that's amazing, and for the GPU it's no big deal. And it's come full circle: the GPU was built for graphics, everyone started using it for general-purpose compute, and now we're using that compute to render graphics again. That's super cool. All right, so we've got HoloViews, and related to that the library GeoViews; you talked about hvPlot; Datashader, which is what we were just talking about, quickly rendering three hundred million points on a map; and Param as the basis for the data-class-like functionality. We also have Colorcet. Yes. We all know that colormaps can be crucial; I think the most infamous one is jet, the rainbow colormap, and there's a reason it's considered terrible: it can give false impressions, it distorts things. There are actually scary studies about, for example, doctors drawing false conclusions because they were looking at jet-colored plots and misinterpreting the data. So Colorcet is a package which has a set of perceptually uniform colormaps. Basically, we took a set of colormaps that were proposed in a paper; I should look up the name, I feel bad for not crediting the author. It's really handy to have that put together and well thought through; choosing colors that look good and that are meaningful is not so easy. Yes, but thankfully there's now a package that takes care of it.
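A small hedged sketch of the Colorcet point above: swapping a perceptually uniform colormap in for jet is a one-argument change. The synthetic field here is invented purely for the example.

```python
import numpy as np
import colorcet as cc
import holoviews as hv
hv.extension("bokeh")

# A synthetic 2D field standing in for real data
xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
field = np.exp(-(xs**2 + ys**2)) + 0.3 * np.exp(-((xs - 1.5)**2 + ys**2))

# cc.fire is one of Colorcet's perceptually uniform colormaps; unlike jet it does not
# introduce artificial color bands that can be misread as features in the data.
img = hv.Image(field).opts(cmap=cc.fire, colorbar=True, width=400, height=350)
img
```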
The final star of the show here is Panel, which you talked about as being that months-long project you got to work on, the new dashboarding tool. Yes. So for a long time the R ecosystem had Shiny, and Shiny was great; it makes it easy to share your analysis in R. Early on there was this Jupyter Dashboards project where you could take your notebook and rearrange the cells a little bit, and we used that for a while, but then it was abandoned. But this was a problem our clients kept having: they wanted to share their analysis as an actual dashboard or web app. So we decided to build Panel. Just before then, Plotly actually came out with Dash, which is also a really nice library, but it requires a little bit more knowledge of HTML and such in some cases, and we wanted something where people could just take their analysis, drop it in, wrap it in a function, and annotate that the function depends on these things, so that when those things change, it updates. We wanted to meet people where they are: you've got some notebook, share it. And this is a frustration we'd seen over and over again; in fact, one client had a data visualization team, and the data scientists would do their analysis and then throw it over the wall to the visualization team, and it could take ages before anything came back, so the friction in going from finishing an analysis to an actual dashboard was terrible. Right, and that's how Panel came to be. Yeah, you can basically lay out different parts of a notebook without necessarily showing the code, and you get some little sliders and other widgets that you can interact with alongside the plot, and to me it feels like just a really nice way to quickly take what you're already doing in an exploratory mode and put it up in a user-interactive way without learning Flask, APIs, and JavaScript. This portion of Talk Python To Me is brought to you by Datadog. Are you having trouble visualizing bottlenecks and latency in your apps, and you're not sure where the issue is coming from or how to solve it? With Datadog's end-to-end monitoring platform, you can use their customizable built-in dashboards to collect metrics and visualize app performance in real time. Datadog automatically correlates logs and traces at the level of individual requests, allowing you to quickly troubleshoot your Python application. Plus, their service map automatically plots the requests across your application architecture, so you understand dependencies and can proactively monitor the performance of your apps. Be the hero that got your app back on track at your company. Get started today with a free trial at talkpython.fm/datadog. Exactly: you just take your components, drop them into this thing, put them in a bunch of rows and columns laid out on your screen, and then you put one little command at the end of the layout you built, called servable, and then you can panel serve the notebook and up pops your dashboard. Nice. How do you host it? So actually, it's just built on Bokeh, so it's just the Bokeh server; you can host it on any cloud provider, and we're thinking about what comes next there. Yes, on a cloud or whatever, it's something we're working on, but in the end it's basically a container. Absolutely. For our examples, each project wraps up the environment with a project file, and what we're hoping for, I think there's a PR that may be merged, is that you basically just give it this project spec, it builds the environment, and then it builds a Docker container for you, so I think that's going to be super useful. Yeah, everything: your environment and your code. Yeah, maybe even go to some hosted Kubernetes cluster service and say, go pick this up, run it there, make sure it keeps running; that would be great for me if I needed it. Yes, we're certainly interested in that. Yeah, but it's a bit of an orthogonal problem and skill set from solving visualization, right? It's like, one day you're doing the cool JavaScript visualizations, and now what, I do DevOps? I love it. Absolutely.
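A minimal sketch of the Panel workflow just described: wrap an analysis function, declare what it depends on, lay things out in a column, and mark it servable. The widget and function names here are illustrative only.

```python
import numpy as np
import panel as pn
import holoviews as hv
hv.extension("bokeh")
pn.extension()

freq = pn.widgets.FloatSlider(name="Frequency", start=0.5, end=5.0, value=1.0)

# Annotate the function with what it depends on; Panel re-runs it when the slider changes.
@pn.depends(freq=freq)
def sine_curve(freq):
    xs = np.linspace(0, 10, 500)
    return hv.Curve((xs, np.sin(freq * xs))).opts(width=500, height=300)

dashboard = pn.Column("# My analysis", freq, sine_curve)
dashboard.servable()  # then: `panel serve my_notebook.ipynb` (or this .py file)
```

Running `panel serve` on the notebook or script serves this layout as a standalone Bokeh-server app, which is what makes the container and hosting story above possible.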
So on your website, under the getting started section at holoviz.org, you've got a document that sort of talks about, given these different scenarios, more of a picture, I guess, that answers a couple of questions and helps you choose the subset of tools and how they flow together. Maybe talk us through some of these scenarios. It says: are you working with tabular data, or other types of arrays, n-dimensional or streaming data, and then there's this flow of, okay, here's how you piece these tools together to come up with something. Excellent. It's quite a flow diagram. The general idea is that it takes you from the type of data you have. Let's say you've got some tabular data; you decide which library to use to load it. Plain pandas is totally fine, and if you want geographic data, use GeoPandas, which gives you geometries. I see that your cutoff here, you say, is: do you have more than fifty thousand rows? I mean, obviously it varies a little bit with the computer you have. It's somewhat arbitrary, exactly, yeah, but it gives you a sense; it's not millions of rows, or billions or something like that, right? It's not that high a number at which to say, okay, maybe you want to consider something other than pandas for working with some of this. But yeah, okay: is it a huge amount of data, by some definition of huge, and if not, is it geospatial? If it's geospatial you might use GeoPandas; otherwise you might use Dask, for example. And Dask is a great tool, love it. If you've got millions or billions of rows, you can't load it into memory, just for the space. And then the whole point behind our ecosystem, particularly hvPlot, is that you shouldn't have to change any code. Whether you choose pandas, or now the cuDF library, you shouldn't have to change your code; you should just be able to drop it into this framework and call hvplot on it, and then you get your plot out. That's part of the philosophy here. The same applies to, say, n-dimensional arrays: we generally recommend, for example, that you go with xarray. Xarray is really underrated; that library should be more popular. I don't know, maybe not that many people have n-dimensional arrays, but it's kind of pandas for n-dimensional data, for beyond tabular. So you've got, say, satellite imagery over time by latitude and longitude, or microscope data, a z-stack over time plus coordinates, whatever it is. You might use that, or you might keep it simple and just use NumPy at a lower level. But in the end, again, you can just drop it in, call hvplot, and you get this HoloViews object out, and it already displays with sensible defaults. But then you might have the issue: this is a lot of data, and for various reasons you may not want to dump it straight into your browser; dumping gigabytes into browsers is a fabulous way to crash them. Even with the speed to download it quickly, that much JavaScript data is going to make it hurt. Okay, follow, follow. So that's when you have the option in hvplot to say datashade equals true, and that means you get server-side aggregation, aggregating the data before it's sent, and what you get out is a nice interactive Bokeh plot, or other output if you prefer, though not everything is directly supported there, or you can do both. Once you're there, you can save it, you can share notebooks, or you can turn it into a dashboard. Right, right, turn it into a dashboard. Okay, very cool. I'll link to this flow chart for people, so they can think about it when they're checking this out. And I think it also gives you a sense that even outside of HoloViz there's useful stuff: if you have streaming data, maybe check out streamz; for n-dimensional, check out xarray and Dask. It's just this idea of, how do I think about the right underlying library, rather than trying to jam everything into pandas or NumPy or something. A lot of people associate everything with pandas and try to force arrays into it, and actually pandas used to have a structure called Panel for that, and they deprecated it; the advice now is just to use xarray. I see, okay, interesting, I didn't realize that. Now, when you talked about visualizing millions and millions of data points quickly in the browser, you said, okay, that's when we use Datashader.
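The decision flow just described, choosing pandas, Dask, or cuDF and turning on Datashader once the data gets big, might look roughly like this in code; the file paths and column names are placeholders, and the cuDF branch assumes a RAPIDS-capable GPU.

```python
import dask.dataframe as dd
import hvplot.dask  # noqa: registers .hvplot on Dask objects

# Larger-than-memory tabular data: load lazily with Dask instead of pandas.
ddf = dd.read_parquet("trips-*.parquet")  # placeholder path

# The plotting call is unchanged from the pandas version; datashade=True asks
# hvPlot to aggregate server-side with Datashader instead of shipping every
# point to the browser.
plot = ddf.hvplot.scatter(x="pickup_x", y="pickup_y", datashade=True)

# On a GPU, the intent is that the same call works against a cuDF-backed frame:
#   import dask_cudf, hvplot.cudf
#   gdf = dask_cudf.read_parquet("trips-*.parquet")
#   plot = gdf.hvplot.scatter(x="pickup_x", y="pickup_y", datashade=True)
```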
I don't know that we necessarily dove into it enough to say exactly what it does. So basically, did I get right how it works? It can look at millions, hundreds of millions of data points and say, well, the gist of the graph is really this, and if you looked at these ten thousand at this scale it would be roughly the same. Is that how it works? Does it down-sample somehow, or does it actually make meaningful pictures out of processing all of it? Let's think about it. So it actually is that fast: it actually always looks at all of your data. Okay. It aggregates it into something. Obviously it throws out what's outside your viewport; anything outside the rectangle gets clipped. But it really does go over your entire data set. And that's happening on the server, right? Exactly. And then all it has to send is the image of the aggregated data. So a million points, or a billion points, when aggregated down to thousands of pixels, the image is much smaller. Yeah, absolutely. And that image is basically the same size no matter how much data you have. Exactly; the compression varies a little bit, but not much. And that works with most visual elements: we have Datashader aggregation for point data, line data, polygons, trimeshes and quad meshes, and for downsizing images that got huge, like a giant multi-gigapixel image you can't just dump into the browser. Great. And looking at it, it says that it has scalability with Dask or cuDF. How do you configure it to make it choose one or the other? On the data frame: it will just take what you give it. The way Dask works, you should think of it as a bunch of chunks of the underlying data frame. These might be distributed; it might be your machine, or it might be a cluster of machines, and Dask keeps the computation local. It means the aggregation for each of those chunks happens on that particular node of your cluster, and once it's done, each node only has to send its small aggregate on to be merged, so you can distribute the computation and still get the full result. Okay, so what you provide to Datashader tells it how the processing happens: you give it a pandas data frame, you give it a Dask data frame, you give it a cuDF data frame, and it just knows how to work with all of them, but that implies how it's computed, those three different ways. Got it, okay. And actually, if you go to datashader.org and look at one of the user guides, there's a table of which data structure works for which data type: for point data it might be pandas, Dask, or cuDF, and for images it might be xarray, and so on. Yeah. Well, it seems like a really nice way to fit these things together under an integrated API. Maybe we could talk about some of the projects or communities that are using HoloViz. One that's really taken off is Pangeo, which is basically an initiative by earth scientists to build this big platform. You used to have all these different data silos where, even if the data was in the cloud, you'd have to download it onto your local machine to explore it, and they've been building this platform where you can easily deploy a JupyterHub and keep your data in the cloud, but analyze it there as well. These people might be climatologists or oceanographers, and they have these huge data sets, huge meshes of data, but with Bokeh tools and our tools being fast, you don't have to move the data out of the cloud, and you can zoom around interactively without having to download it into your own memory. And they're super happy about that.
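The aggregate-on-the-server, send-only-an-image pipeline Philipp describes corresponds to Datashader's Canvas, aggregate, and shade steps. A minimal sketch with synthetic data:

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Millions of points; only a small rendered image ever reaches the browser.
n = 2_000_000
df = pd.DataFrame({"x": np.random.standard_normal(n),
                   "y": np.random.standard_normal(n)})

canvas = ds.Canvas(plot_width=600, plot_height=400)   # size of the output raster
agg = canvas.points(df, "x", "y", agg=ds.count())     # full scan: count per pixel
img = tf.shade(agg, how="eq_hist")                    # colorize the aggregate

# Swapping df for a Dask or cuDF DataFrame distributes the per-chunk aggregation,
# but the shaded image stays the same small size regardless of the input row count.
```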
Yes, solving some real problems. Awesome. Also Intake, yes, Intake, which is really a data-loading project; again, this is another one we work on. You have a bunch of data sources, and if you don't keep track of all your data you end up with lots of scripts to load this kind of data and that kind of data. Intake lets you write one file to specify: I've got CSV files here, there are thousands of them, and here's how to load them. It encapsulates all of that in a nice little catalog; you load the catalog, and then by a name in your catalog you load the data, thanks to that specification. And it has integration with our tools in various ways. First of all, you can put defaults into your data catalog: this is the plot I always look at for this data. So you can say this data source plots these points, the longitudes and the latitudes, and it will automatically generate a plot off your catalog. Okay, nice. So the typical thing you look at is already declared, and now you can just bring a graphical interface up around it. Exactly: you specify how to import it and whatnot, and then visualize it. Great. And then cuxfilter. So NVIDIA, with their initiative, has been playing around with visualization in various forms, and interestingly they built this cross-filtering library on top of Panel and Bokeh to build cross-filtering applications. Cross-filtering meaning brushing: you select something on one plot and you see that reflected across all the other plots in the dashboard. Cool. And then some space stuff as well? These are actually projects that we work with. There's the telescope project, one of the largest telescopes in the world, which recently actually got a proper name, and when they visited us they were saying the data that comes with that territory can be challenging, because they'll have something like fifty petabytes of image data they're going to be collecting, and they have to run the same analyses every day, so they want one tool that can handle all of it. And so they've opened a project and started building on top of ours; it's built on our tools and will have to handle that kind of volume. But still, that's pretty impressive. Awesome. And it's cool to see all these projects using the libraries you all put together; they're probably giving them some pretty serious testing. Serious testing, and they find the performance issues. If you're just building in isolation, completely divorced from any actual users of the libraries, it's really hard to tell whether you're just wasting your time, so it's super nice to go back and forth: consult on people's problems, then go back to the tool. Absolutely, yeah, you need that blend to keep it real, although if you're too focused on just solving problems for consulting, you don't get that spare time to develop new stuff as much.
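A hedged sketch of the Intake catalog workflow mentioned earlier in this exchange; the catalog contents, source name, and predefined plot are all hypothetical.

```python
# catalog.yml (hypothetical)
# sources:
#   taxi_trips:
#     driver: csv
#     args:
#       urlpath: "data/trips-*.csv"
#     metadata:
#       plots:
#         pickups:
#           kind: points
#           x: pickup_longitude
#           y: pickup_latitude

import intake

cat = intake.open_catalog("catalog.yml")
source = cat.taxi_trips          # look the data up by name, not by path
df = source.read()               # or source.to_dask() for larger-than-memory data
plot = source.plot.pickups()     # the default plot declared in the catalog metadata
```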
As we close out the conversation, two quick things. One, I really like to explore visualization and analysis libraries like this by just looking at examples, because usually you can look at some nice pictures and get a quick sense, and you all have a bunch of different tutorials and simple little examples and expositions showing them off. Are there a couple of favorites you want to point people to? Oh, absolutely. Most of our websites, GeoViews, HoloViews and the rest, have galleries, and then we also have the separate examples site, examples.pyviz.org, which we really want to grow; again, it's meant to be a general resource. So if you were steering people there, just grab a couple that are really cool, ones people would appreciate. Yes, let's go with the one that you mentioned, the census example. Go to the examples site, or just search for census; it'll be one of the top results. The census example explores: I've got this data set, how do I actually display it in a way that isn't boring? It starts off by loading the data; in fact it's just loaded with Dask, but we've got a small wrapper around Dask which does spatial indexing. Spatial indexing means it has built an index of the space, a tree structure, so it's super fast to say: show me the things that are near here, right here. Yes. So whereas, as I talked about earlier, by default Datashader is scanning the entire data set each time you zoom in, with the spatial index it can say, okay, this stuff is definitely not in view, I don't even consider it, and it becomes faster. Right. So we start by loading the data set, and we start with a simple linear aggregation. The thing is, if you think about the population of the US, there are a few hotspots, super dense, all the cities. Yeah. And so all you see with linear colormapping is New York, a little bit of LA, a little bit of Chicago, and that's it. The nice thing about Datashader is that you can do linear, log, or you can adjust the colormapping in all kinds of ways, but the default is actually what's called histogram equalization, which means it adjusts the mapping into the colormap in such a way that even the lowest counts are still visible; it spreads the colormap over the whole distribution of the data. The deal here is that what you see is not the exact values, so you shouldn't use it for reading out exact values; it's for getting an overall idea of the data, which is really nice. And that's kind of part of what makes a Datashader image recognizable: sometimes on Twitter I'll see an image and go, oh, that's Datashader. You see the overall shape of the values, and this example kind of goes through that, what it actually does. So with the census data you can see the shape, you can see the cities, but not much in this area in the west of the US, and it really reveals the population distribution of the US. And then it also demonstrates how to manipulate your colormapping to show the hot spots specifically, because with a fixed colormap you can say values above this density get this color, and so the hotspots really jump out. Yeah, very cool. I won't go into detail on all of it; people should check it out, but it really builds up, and I'll link to it. Each one of these steps builds up with a line or two of code; it's not super complicated, exactly, right? Then we kind of explore the racial distribution the census provides, and you can really see the segregation in the different cities. It's sad to think about; yeah, that's not good. Yeah, it reveals the patterns if you look in there. Yeah, I see. And then, finally, because all our tools work well together, it rounds out with a final example that demonstrates how to take this and use HoloViews to generate an interactive plot, where, if it's the static version rendered on the website, zooming in very far gets very pixelated, but if you're running it in your own notebook, or you deploy the servable, deployable example, then you can zoom around and pan all the way down to individual people.
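For the census-style example above, the colormapping choices (linear, log, histogram equalization) and the categorical aggregation are only a few lines of Datashader. This sketch uses placeholder file and column names and an invented color key.

```python
import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf

# Census-style data: one row per person, projected coordinates plus a category.
df = dd.read_parquet("census.parq").persist()          # placeholder path
df["race"] = df["race"].astype("category").cat.as_known()  # Dask needs known categories

canvas = ds.Canvas(plot_width=900, plot_height=525)

# Population density: compare colormapping strategies.
agg = canvas.points(df, "easting", "northing", agg=ds.count())
linear = tf.shade(agg, how="linear")    # only the biggest cities show up
logimg = tf.shade(agg, how="log")       # better, still compresses the mid range
eqhist = tf.shade(agg, how="eq_hist")   # default: whole distribution stays visible

# Categorical aggregation: count per pixel per category, then color-mix.
color_key = {"w": "aqua", "b": "lime", "a": "red", "h": "fuchsia", "o": "saddlebrown"}
agg_cat = canvas.points(df, "easting", "northing", agg=ds.count_cat("race"))
by_race = tf.shade(agg_cat, color_key=color_key, how="eq_hist")
```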
That's wild. So there's a whole bunch of examples over at the examples site, and each one is tagged with the various libraries that people might want to explore, so people can go there, dig into them and check them out; this is just one of them. Maybe you could just touch really quickly on awesome-panel.org, and then tell us what's next with the whole project, and we're about out of time. Yeah, I've been super happy to see this, because community building is hard, but there's been a lot of interest, and Marc Skov Madsen built this site, a showcase really, to show off what you can do with Panel. Our examples try to focus on the simple stuff; awesome-panel.org really shows you what you can do, and it's really impressive. It also has lots of resources for how to best leverage things; ideally we'd migrate some of that back to our own website. But yeah, he's built this complex multi-page site with lots of different example apps. And it's kind of meta, right? Panel is involved in it as well: the website about Panel is built entirely in Panel. And I'm trying to take that further as well: recently I did a talk where the entire presentation was built in Panel, a presentation tool as a Panel demo. That's very cool. All right, so what's next with the whole project, where are you going? So one thing we've been working on a lot recently in HoloViews is linked selections. In the spirit of the ease of use we aim for, you can generate your HoloViews plots, and if they're all using the same data, you can just say link the selections of these plots, and it automatically hooks up the linking between them, so when you select on one, all the other ones update. That's really nice; it lets you dive into the data. Yeah, and particularly with Datashader support now, you can have plots of tens of millions or billions of data points and explore them using selections. So that's one thing we're releasing; another thing, in terms of Panel, is that the next release is going to have templates. What that means is, Panel has always had the ability to say this thing goes in a row with this thing, a bunch of things going in a row or a column, as we talked about here, but if you wanted more control over the layout you had to write your own HTML and CSS templates, which is an ability you might not want to have to exercise. Yes, that's what we're trying to keep people from having to do, right. Right. So what we've done now is add default templates, where you say: I want this to go in the sidebar, this goes in the main area, and it looks like a polished app; it's not just a bunch of things stacked wherever they fit. A lot of demos want to go the whole way and have a little more control, and this gives them that. And then the last thing I'm really excited about in the next release of Panel is integrating the ecosystems. So you're familiar with Jupyter; you know about ipywidgets, and ipywidgets has lots of libraries built on top of it, for maps, for 3D volumes and 3D plots, just a whole bunch of them, right? And it's been kind of a shame that there are these two separate worlds; we don't want that division. In this next release you're going to be able to just put an ipywidget into your Panel, and whether it's in the notebook or being served on the Bokeh server, it will just work and load the widget. So now we don't have these two separate ecosystems anymore: you can just use ipywidgets inside Panel.
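The linked selections feature mentioned a moment ago is exposed in recent HoloViews releases as link_selections; a hedged sketch with toy data:

```python
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews.selection import link_selections
hv.extension("bokeh")

df = pd.DataFrame({
    "a": np.random.standard_normal(1000),
    "b": np.random.standard_normal(1000),
    "c": np.random.standard_normal(1000),
})

# Two plots built from the same underlying data.
scatter_ab = hv.Scatter(df, "a", "b")
scatter_ac = hv.Scatter(df, "a", "c")

# Box-selecting in either plot highlights the corresponding rows in the other.
ls = link_selections.instance()
linked = ls(scatter_ab + scatter_ac)
linked
```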
Or you can go the other way: the Jupyter world has this system that serves notebooks, Voilà, and you can now put your Panel objects into that; as long as the extension is there, you can use our tools in that ecosystem too. Nice, because then you don't have to have a separate set of widgets for your visualizations, with other people building the same things again for Jupyter; of course they would build them for their world, so you might as well bring them together and align the effort. Exactly. All right, well, those are great examples, and it looks like you've got a lot of momentum going forward as well, so thanks for bringing all your experience and talking about what you all have built there. Great, yeah, you bet. Before you get out of here, though, you've got to answer the final two questions. If you're going to write some Python code, what editor do you open? These days I dabble between a couple of editors, and probably Jupyter as well at some point, for building things and making demos. And then a notable PyPI package? You've got a bunch here already; it's worth saying you can just pip install holoviz. There's a bunch, but beyond the ones we've already covered, I really want to shout out some of the underlying libraries: xarray is awesome, and Dask is awesome. I agree. All right, cool. So people are excited about what they've heard; they want to solve this problem, or have your tools help them build some dashboards. What can they do? Come visit us at holoviz.org and take the initial steps of using our projects; there's a tutorial that builds a little dashboard, so go there and check that out. Also check out the examples site to see what you can do once you master this stuff and want to build more complex examples. And then message me on Twitter, or message our individual projects, like HoloViews or Panel, on Twitter, and if you've got any longer-form questions, join us on our Discourse forum. All right, well, it looks like a great project, and I think people will build cool things with it, so thanks for sharing it with us. Thank you, bye. This has been another episode of Talk Python To Me. Our guest on this episode was Philipp Rudiger, and it has been brought to you by Brilliant.org and Datadog. Brilliant.org encourages you to level up your analytical skills and knowledge; visit talkpython.fm/brilliant and get Brilliant Premium to learn something new every day. Datadog gives you visibility into the whole system running your code; visit talkpython.fm/datadog and see what you've been missing. They'll even throw in a free t-shirt with your free trial. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new async course, which digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle; it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcast app and search for Python; we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Get out there and write some Python code!

cuDF, cuML & RAPIDS: GPU Accelerated Data Science with Paul Mahler - TWiML Talk #254

This Week in Machine Learning & AI

38:12 min | 2 years ago

cuDF, cuML & RAPIDS: GPU Accelerated Data Science with Paul Mahler - TWiML Talk #254

"Hello. And welcome to another episode of twin? We'll talk the podcast y into interesting people doing interesting things in machine learning and artificial intelligence. I'm your host Sam Sherrington. This week shows drawn from some of the great conversations I had. Media GP technology conference. And no brought to you by Dell. If you call my tweets from GTC, you may already know that one of the announcement this year was a new reference architecture for data science workstations, powered by high end GPU's and accelerated software such as in videos Rapids, Dell was among the key partner. Showcase during the launch and offers a line of workstations designed for modern machine learning, and I work loads to learn more about Dell precision workstations and some of the ways they're being used by customers in industries like media, and entertainment engineering, manufacturing healthcare and life sciences, oil and gas and financial services. Visit Dell EMC dot com slash precision. All right, everyone. I am on the line with Paul Moller, Paul as a senior data scientist and technical product manager for machine learning at Invidia. Paul welcome to this week in machine learning a out. Thanks for having me. Absolutely. I'm looking forward to jumping into our conversation, which will be focused on what envy is doing around rapid and kumail all of interesting stuff in that area. But before we do that you were a philosophy major. How did you make your way to working in machine learning? I was a philosophy major and two years before I was set to graduate. I added economics is a major because I read the economist magazine and thought that it was a fascinating collection of bunch of different articles about different aspects of the world. So I figured if that's what economists read I wanted to be an economist. I went on to do a master's degree in economics where I mostly focused on quantitative methods and in an earlier life. I went out to Washington DC where I worked as an economist. I began my career at the World Bank, serving on health and human welfare issues in sub Saharan Africa. And then I worked for the office of the chief economist at Fannie Mae. Now one day. Well, waiting for the bus to get home from Fannie Mae. There was a article from New York Times about gentlemen, from SUNY buffalo that had written an algorithm, offering notes on screenplays, and I always had like a hobby interest in the different creative arts aspects of of our culture. And when I saw that somebody had written essentially, a big block of math in code was telling writers how to write their screenplays better. I mainly decided that I needed to switch into data Cy answer big data, which was the more popular term at the time. Because. That was where some of the most interesting things in the world. We're going on. That's a great story. You remember the moment literally the bus ride that triggered the that set you off down this path. Yeah. It was raining at the time. Awesome. And so what do you do? Now what I've been doing for the last year I had been at a couple of startups in joined Invidia to work on. What we've been calling the the Rapids project or the Rapids ecosystem. Now what that? 
began as is that our director of engineering, who I had worked with previously at Accenture for years, had been saying that we see this great acceleration in neural network methods as a result of getting them on GPUs, but you are not seeing any of those advantages for more of the bread-and-butter data science that happens in a lot of places, you know, like Fannie Mae or like the World Bank, where they may have large data sets and questions they want to address through machine learning, but we aren't talking about convolutional neural networks to understand images. So a lot of the time, when I was working as a data scientist at a couple of startups, I liked to joke that I was really a bar trivia champion, because while I was waiting for my code to finish running and spit out my result, I had plenty of time to read all the news of the world on the internet. And I guess, unfortunately for my bar-trivia-playing compatriots in data science, what we've done with cuML and cuDF in particular is: if somebody knows pandas, or they know the PyData ecosystem, they can immediately jump right in and start seeing just crazy speedups, like 50x, sometimes more than that, on their end-to-end workflows. And that includes reading from disk into GPU memory, doing all your data munging and merging and variable creation, through actually executing your algorithm and making inferences. So the idea was that, since... I mean, some of it is not obviously tractable to GPUs. We are able to process strings in the latest iteration of cuDF, which to me seems like a miracle. But I like to joke: it's kind of like, before the team that I work with had delivered these big pieces of cuDF, I could drive a car, and now suddenly I can fly a plane, and I don't need to be an expert in CUDA or parallel algorithms or anything except the tools that I've worked with most of my career. Now, let's step back. Maybe, when I introduced you, I mentioned cuML, you've mentioned cuDF, you mentioned RAPIDS. Can you paint a picture of the broader ecosystem of offerings and libraries and tools that comprise or make up RAPIDS, and how they all fit together? Yeah. So RAPIDS is the overall name of the project, and that's made up of smaller sub-libraries that all start with "cu", because that's inherited from CUDA. RAPIDS is built on NVIDIA CUDA. CUDA is, for anyone who doesn't know... yeah, the underlying library or API for doing things on NVIDIA GPUs. Right, it's the general-purpose computing library for NVIDIA GPUs. Okay. And if you think about the broader PyData ecosystem, I think a lot of people do a lot of their initial data cleaning and exploration in pandas, and so that's what cuDF is meant to replace for people that are moving their workloads onto GPUs. The API is very, very close, and you're able to, in some cases, just change the import statements at the top of your program, and it will just work. So pandas has this core abstraction of a data frame, and cuDF is just, kind of, you can think of it as a CUDA-powered data frame. Yeah, I think that's the best way to think about it. Okay, cool.
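To make the change-the-import idea concrete, here is a minimal hedged sketch of the pandas-style cuDF calls Paul describes; it assumes a RAPIDS install with an NVIDIA GPU, and the file and column names are made up.

```python
import cudf  # GPU DataFrame library from RAPIDS; its API closely mirrors pandas

# Same calls you would make with pandas, but the data lives in GPU memory
# and the operations run as CUDA kernels under the hood.
gdf = cudf.read_csv("loans.csv")                      # placeholder path
gdf["payment_ratio"] = gdf["paid"] / gdf["balance"]   # columnar arithmetic on the GPU
summary = gdf.groupby("state")["payment_ratio"].mean()

# Interop with the rest of the PyData world when needed:
pdf = summary.to_pandas()
print(pdf.head())
```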
cuML is our machine learning toolkit, and we aspire to one day have almost all the functionality that exists in scikit-learn. Scikit-learn is an eminent package built by some of the world's greatest developers, so we've got a ways to go there. But we've been rapidly adding algorithms. In the last release, for example, we have a stochastic gradient descent regression, ordinary least squares regression, ridge regression, a principal components analysis, and some other things like Kalman filtering. What we're trying to do is start with the things that are the real workhorses of day-to-day machine learning in business and other parts of industry, and it's been exciting to watch the package grow. In fact, when we launched back in October we had, I think, four algorithms in cuML, and we've doubled that over the last couple of months. It was very exciting to present at the GPU Technology Conference in San Jose, California a couple of weeks ago, to the wider community, all the things we've been able to deliver in such a short amount of time. You mentioned that you aspire to scikit-learn. Does that mean that cuML replaces scikit-learn? It sounds like it does, for folks that are trying to take advantage of the GPU. And was there an opportunity to, rather than replacing scikit-learn, fit in underneath it, so folks that use it, or that have existing work that uses scikit-learn, could take advantage of the GPU acceleration without having to rewrite their apps? I mean, at least with the algorithms we've delivered, we've tried to keep the API one to one. Okay. For any of your listeners, I would encourage them to take a look at the API and just see how close it is to scikit-learn. I'd also like to add that we've partnered with Inria, the French research institution that does a lot of work on scikit-learn, and over the next few months and few years we're going to be building that collaboration with them. I don't think we'll ever replace scikit-learn, because there are still problems where I don't think the data is big enough, or the use case is right, to necessarily go to full GPUs. I think of certain analyses I did as an economist, which would look like machine learning but were maybe a few thousand rows, and that was much more traditional frequentist statistics. I think there's always going to be a lot of that work being done, and with any data science work it's about finding the right tool for the job. But I will tell you, when I was testing out our code earlier in the summer, our demo workflow involved reading in, I think, around a gigabyte of CSV data: with pandas on my MacBook Pro it took like five and a half minutes, and on a single GPU it took like fifteen or twenty seconds. Yeah. When you talk about, you know... I like to say data analysis is an iterative, interactive process, and the faster you can move, the more fluid your conversation with the data will feel to you as a user. There won't be the long wait times to see results, or to see if you made a coding error, in my case, or any opportunity to become a bar trivia master. Yeah. So before we dig into that, because that's an interesting point, we're kind of talking about the landscape: you mentioned cuDF and cuML. Are there other major pieces we should be keeping track of in this conversation? Yeah, we are working on a graph analytics package called cuGraph, and you could've guessed that name. Yeah, our minds are so fixated on accelerating the algorithms, we're totally out of bandwidth for fancy names. But everyone knows that Jensen does all the naming at NVIDIA; why would anyone else spend any time thinking about that?
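Since cuML tries to keep its API one-to-one with scikit-learn, one of the regressions listed above might look like this; a sketch with synthetic data, assuming a working cuML install and a GPU.

```python
import numpy as np
import cudf
from cuml.linear_model import Ridge  # mirrors sklearn.linear_model.Ridge

# Synthetic data, loaded into GPU memory as cuDF objects.
n = 100_000
X = cudf.DataFrame({"x1": np.random.rand(n), "x2": np.random.rand(n)})
noise = cudf.Series(np.random.rand(n) * 0.1)
y = 3.0 * X["x1"] - 2.0 * X["x2"] + noise

model = Ridge(alpha=1.0)   # same constructor argument as scikit-learn
model.fit(X, y)
preds = model.predict(X)   # predictions come back as a GPU series
```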
But cuGraph is almost embarrassing, in the sense that when we compare it to a graph analytics package in Python, it's one of those things where you see the numbers and you just really want to double-check them: like ten-thousand-x speedups over NetworkX for certain algorithms. That was my reaction to the loading of the data frame, and I still... I want to get through the broad landscape before we dig deep into that; that's the first place I'm going to come back to once we do. So you've got a graph analytics piece in cuGraph. Any other major components here? Some of the things that began as major components are now under the hood. So we put a bunch of effort into building a string reader, so you can work directly with data that has strings. That's a very common thing in data science, and GPUs do not like strings, but now you can do things like easily create your dummy variables from strings on the GPU. It sounds kind of humble, but it's actually a pretty major part of even just the whole speeding-things-up story; I don't think our case would be as compelling if we said there could only be numerical data in a cuDF data frame. That simply will not work for many use cases. We also have a package called cuxfilter, written as "cuxfilter", and we're going to be building out some GPU-accelerated visualizations, so if you think of the workflow from munging to analysis to insight through visualizations, we want to be able to offer every piece of that puzzle. The other thing we're heavily using is software called Dask, and Dask is a package that handles distributed computing. It has been used to scale out pandas, for example, across multiple cores, and we were lucky enough that the creator of Dask has joined our team and is helping us use that as a way to distribute workloads when we're talking about moving beyond a single node of GPUs. This may go back to the initial example you gave of loading the pandas data frame, or loading the data frame. You said it was a terabyte? It was gigabytes. It's pretty easy to choke pandas, and I'm sure a lot of your listeners have experienced this before. The workflow was: Fannie Mae makes loan delinquency data going back, I think, sixteen years available for free, and this is the whole payment history for a subset of all the loans that Fannie Mae has acquired. As a demo workflow, what we wanted to do was read in however many quarters of data we could fit, or were relevant, and then apply XGBoost to predict default. Which reminds me of another sort of under-the-hood improvement we've made; it's not really under the hood, but we have made contributions to the DMLC XGBoost library and will continue to do so. That has appeared in a lot of our early presentations and webinars. I think XGBoost is almost like magic, and it's a good, broad workhorse for the first thing we were going to introduce. We are working with the community to make certain changes to XGBoost that make it more amenable to the RAPIDS ecosystem, but then giving those back to the community. So, pardon that sidebar. But before any real data processing had even happened, just bringing in this large data set, as a rule of thumb we couldn't do more than a couple of quarters of data, and then you would really see the time to load, go through the data preparation, and execute the algorithm increase substantially.
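A hedged sketch of the mortgage-demo hand-off described above, from cuDF into GPU-accelerated XGBoost; the path, columns, and parameters are placeholders, and the DMatrix-from-cuDF step assumes a RAPIDS-enabled XGBoost build with GPU support.

```python
import cudf
import xgboost as xgb

gdf = cudf.read_csv("mortgage_perf.csv")          # placeholder path
label = gdf["delinquent"]                          # hypothetical 0/1 target column
features = gdf.drop(columns=["delinquent"])

# GPU-enabled XGBoost builds can construct the DMatrix directly from cuDF objects,
# so the features never leave GPU memory.
dtrain = xgb.DMatrix(features, label=label)

params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",   # histogram-based tree construction on the GPU
    "max_depth": 8,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(dtrain)
```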
So I think the largest we could do on pandas, quarter-wise, was two or three quarters of data, and they were small quarters, so the biggest we tried was about one and a half gigabytes. That's where you saw those really frustrating load times of more than five minutes. And what's happening in these two scenarios is that you've got a gig and a half of information on disk, probably a CSV file, on both sides. On the pandas side you're loading it into a data frame that lives in RAM, and on the RAPIDS side it's a cuDF data frame that lives on the GPU itself — is that right? Yep, it's in GPU memory. Okay, and that's where the five-minutes-versus-fifteen-seconds difference comes from? Yeah. This is more for the hardcore folks, but there are some problems with pandas that have been known for a while. It's great — I can't thank Wes McKinney and the community enough for open-sourcing it, because it's put food on my plate for five or six years — but it's also single-threaded on the CPU. So even in the world of CPUs, you started to see people look to things like Dask to better leverage the multiple cores you might have on your MacBook Pro. I think it's kind of funny: every data science gig I've had, they've given me a shiny MacBook Pro, I mostly work in Jupyter notebooks, and most of that work only takes advantage of a single core of the processor. Right, right. I don't want to exaggerate too much, but you still want the next MacBook Pro; I want to see what they do with that touch bar. I've got some thoughts on the touch bar, but let's not even go there. The other thing is that when we move to the massive parallelism you can get from using GPUs — and on the machine learning side, which I know better, this is for the most part matrix algebra, and GPUs love matrix algebra; they're designed for it. Take an algorithm like ridge regression, which can take some time to run in the conventional PyData ecosystem: when I gave a tutorial at the GPU Technology Conference, I said we're just going to do a hyperparameter search and run through a thousand ridge regressions on this Black Friday data set, because it's fast enough that we can brute-force a hyperparameter search on certain data set sizes. Do you have comparative results for that particular scenario? We aren't benchmarking every algorithm these days, but ridge regression is fast. Doing a ridge regression on, I think, eight hundred thousand rows took less than a second, or a couple of seconds. I remember it being fast enough that I could do a live demo and run through a hundred iterations of it, where normally doing two or three would have taken quite some time. Let's jump into this part of the cuML library. Can you maybe talk us through the technical underpinnings? Is it as simple as, hey, these things love matrix multiplication and we're just doing matrix multiplication using CUDA so it's faster, or are there interesting nuances in the way some of these algorithms work that might be worth chatting about? We'll start with a little bit about the architecture of cuML. cuML is built on top of what we're calling ML Prims.
These are primitive functions composed of even lower-level math libraries and various things that have been developed at NVIDIA for certain linear algebra purposes. We take these primitives, which are delivered in C++, and then when we need something new — for example, I have a colleague working on massively parallel ARIMA regressions — when he began working on that, we already had a Kalman filter primitive and an OLS primitive, so the amount of new work he needed to compose a prototype was dramatically reduced. Someday in the future I'd actually like to see these ML Prims wrapped in Python, so that machine learning researchers and graduate students who aren't experts in parallel programming would be able to mock up the new algorithms they're inventing and still be able to leverage the advantages of the GPU. That sounds like a no-brainer. Are the ML Prims open source, or are they locked up in a binary or something? It's a mix. Some of the stuff is NVIDIA proprietary and we wrap it in a binary, but other pieces are more open source, and it's a discussion we're going to continue to have. Are they well documented, or are they kind of internal, where nobody outside the company really knows about them? It's something I've tried to mention publicly whenever we speak about it. It's not super well documented right now, so you'd need to go into our GitHub repo, look at the primitives folder, and be able to read a little bit of C++. Got it. But yeah, I hope to wrap these in Python and introduce them to the greater development community to see what else they can do with them. Now, are there some algorithms that are more or less tractable for this? Right now we're working on building a lot of different solvers for more exotic kinds of regressions, and one of the challenges in developing those — and this is really where I'm riffing off the thinking of my colleagues — is that they're essentially sequential algorithms the way they're originally designed. If you look at the most basic version of gradient descent, you start someplace and keep taking little steps until you're satisfied, and then you stop. That's a sequential operation. When we were doing some early inspections of different solvers this morning, a colleague told me he was disappointed that we were only getting a three-x speedup, because he was still trying to think through how to make the algorithm less sequential. So there will be things, just by the way the underlying math works, that aren't going to be another ten-thousand-x speedup, but a two- or three-x speedup, which I think is still pretty great. The other really heavy intellectual work going on right now, which we hope to wrap up by the end of the summer or the fall, is multi-node, multi-GPU algorithms using Dask. In some cases it'll use Dask, and we're currently working our way through what other kinds of communications layers could be helpful in blocking up this data and distributing it across a cluster of GPUs in a way that creates a wow moment for the user. I know that's a little marketing, but that's what I'm going for; that's what I'd really like.
We've been lucky so far that we've gotten some praise for what we deliver, but the underlying algebra, and the work of breaking some of these algorithms into parallel jobs, is very far from trivial. You mentioned that some of what you're doing allows you to load strings onto the GPU. Are you able to use the GPU for heavier NLP types of algorithms? I guess for a lot of those you're numericalizing the textual data anyway, but are there any limitations one way or the other? There are some limitations, and we do want to do NLP; we're not quite there yet. We're working on an implementation of word2vec. In terms of preparing or understanding your data, we have a lot of ordinary string functions, like tokenizers, and we have a regular expression engine, so you can search for regular expressions on strings and substrings and use that to create variables. Probably closer to the end of the year is the soonest I'd expect us to deliver the NLP side, but it's certainly something we're interested in and currently working on. The preprocessing that leads into NLP is where we started, and now that we've released, and continue to iterate on, our string manipulation package, that lays the foundation for NLP practitioners to work the way they're used to and have algorithms like word2vec or LDA available to them. One of the things I'm curious about: in the cuML library, are all of these algorithms dependent only on the GPU? In other words, are you doing all of the compute on the GPU, or are you using the CPU as needed or where appropriate? And more broadly, is it an all-or-nothing kind of thing, or is the focus really on doing a given operation in the best place for that operation? It's doing a given operation where it's best suited — but we really are only looking at things that are well suited to the GPU. So, doing everything where it's best suited, but everything happens to be best suited to the GPU. Right. Our work has been focused on the idea that once the data is in GPU memory, we don't want to move it around. That's one of the big advantages of doing end-to-end data science on the GPU: we're dramatically cutting down on reads and writes. The data comes in, it sits as a cuDF data frame, I can immediately pass it into my algorithm, and it all stays in one place, so we're able to cut down on overhead by thinking hard about reducing the copies that happen in the course of an analysis. You mentioned there's a bit of a dichotomy, or a decision point, there. Does everyone who wants to do anything that uses the GPU have to wait for you to build their algorithm? If I want to do something that cuML offers an optimized version of, but I also need to do something for which there's no cuML-optimized version, do I need to do the five-minute pandas load into RAM, then load into the GPU, and go back and forth each time? Or is there some scenario where I can load the data onto the GPU and keep it there, but also do CPU-based operations against it — I don't even know if that technically makes sense or is feasible. We have a bunch of different formats you can export data to from cuDF. If you have an algorithm we haven't built — and I'd say anybody who's willing to join in, this is an open source project, we'd love any help we can get building these algorithms out — but if you have something you need to do for work today, and cuDF sounds like a great way to speed up the pandas part of your workflow, but you're also going to do a bunch of other algorithms, say some Bayesian stuff, you can export data from the GPU data frame into a pandas data frame and work with it there, or into an array. We support the Arrow data format, and we also just introduced support for DLPack, which plays nicely with a handful of the deep learning packages, like PyTorch. So while we're working as hard as we can to add more algorithms for the user community, you can pick and choose what's most useful to you at this time. I'd also like to add that cuML can take NumPy arrays, so these packages were designed to be used closely together, but we know that won't cover all users, and it's not necessarily a requirement.
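A small sketch of those interchange options; the method names here match recent cuDF releases, so treat them as assumptions if your version differs.

```python
import cudf

gdf = cudf.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

pdf = gdf.to_pandas()  # back to a regular CPU pandas DataFrame
tbl = gdf.to_arrow()   # an Apache Arrow table
# A DLPack capsule for deep learning frameworks (e.g. PyTorch) is also
# available; the exact call varies by cuDF version.
```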
The other piece of this is cuGraph, and I gather that's a newer, more emerging part of the RAPIDS ecosystem. I don't have the deepest knowledge of cuGraph on the team — I'm not really a graph guy — but I know their benchmarks have been fabulous, and we're hoping to make more graph algorithms available to people who rely heavily on graph theory in their day-to-day work. Maybe switching gears a little bit on the work that's happening in RAPIDS: one of the things mentioned at this recent GTC was some announcements and partnerships around creating a reference architecture for data science workstations. What can you tell us about that initiative? The reference architecture for data science workstations is very new, but what I think is exciting about it is that for people who are able to get an NVIDIA data science workstation, we're going to have the software — heavily based on RAPIDS — laid out for them. Ideally, once you get the data science workstation, it's loaded with the software you need, and that reference architecture will also describe what we think the best hardware layout is. We're just trying, in another way, to make GPU data science more accessible to people. It's a common project in data science circles to try to build your own deep learning rig, and I think that's a great exercise, but it's not for everybody, and I've been in some very serious corporate environments where IT is not going to let you bring in a computer you built yourself and start working on their proprietary data. Right, right. The data science workstation initiative is really about making it as easy as possible for an organization that wants to dive into GPU data science to get started. Cool. Any parting thoughts from you on RAPIDS or cuML, or advice for folks who haven't really been exposed to what NVIDIA is doing on the software side and want to explore more? I'd just really encourage everybody to take a look. If you go to rapids.ai, that's a portal landing page that will get you to everything I've talked about here: links to documentation, links to GitHub, and we've got a Google group.
We encourage everybody who touches RAPIDS and finds something they don't like, or that doesn't work, to file a ticket. You can see our roadmaps and our current work on GitHub, and we really want the community involved. As I think about the machine learning algorithms I'm going to roadmap next for the team to develop, a lot of that has been informed by customer and community feedback, and it's going to continue to be informed by customer and community feedback. So I'd just ask anybody who's interested to take a look and please get involved with us, because that's really what's going to measure the success of our project. This is a real open source project, and we've done a great job of building community so far — we've got lots of stars and forks, but we want to see more of those, and we're always happy to see issues opened on GitHub. Awesome. Well, Paul, thanks for taking the time to share what you're up to. Thanks for having me. All right, everyone, that's our show for today. For more information on any of the shows in our GTC 2019 series, visit twimlai.com/gtc19. Thanks again to Dell for sponsoring this series; be sure to check them out at dellemc.com/precision. As always, thanks so much for listening, and catch you next time.

#275 Beautiful Pythonic Refactorings

Talk Python To Me

55:04 min | 1 year ago

#275 Beautiful Pythonic Refactorings

"Do you obsess about writing your code just the right way before you get started maybe some ugly code on your hands and you need to make it better. Either way re factoring could be your ticket to happier days. On this episode, we'll walk through a powerful example of iterative Lee re factoring some code until we eventually turn our ugly duckling into a python ick beauty on our hope is our guest in this episode to talk us through re factoring some web scraping Python Code. This is Talk Python Emmy Episode Two, hundred, seventy, five, recorded July, ninth, two, thousand, twenty. Nine. Welcome to talk by a weekly podcast on python the language, the libraries, the ecosystem in the personalities. This is your host Michael Kennedy. Follow me on twitter where I'm at in Kennedy, keep up with the show and listen to past episodes at Talk Python FM and all the show on twitter via at talked by on this episode is brought to you by us over at Talk Python. Training. Pythons acing imperilled program and support is highly underrated. Have you shied away from the amazing new eysenck keywords because you've heard, it's way too complicated or that it's just not worth the effort. But. The right workloads one hundred times speedup is totally possible with minor changes here code. But you do need to understand the internals and that's why our course acing techniques in examples and python show you how to write acing code successfully as well as how it works. Get started the AC can wait today with our course at talk by dot com slash ASE INC honor welcomed. Oh, talk by Sunday me. Thanks for having me on. Excited to be here. I'm excited to. It's going to be beautiful man. Hopefully. Hopefully. Yeah it's beautiful re factories. So I am a huge fan of re factoring I've seen so many people bri to overthink the code that they're writing like well, I gotta get it right and I gotta think about the Algorithms in the way and Ryan and all this stuff and what I found is you don't really end up with what you want in the end a lot of times anyway. If you just go in with an attitude of this code is plastic, it is malleable and I can just keep changing it and you always are on the lookout for making a better you end up in a good place. Yeah. I completely agree for is not a one time thing or something that happens only two years from when you initially write the code it's I heard once actually that it. Goes a lot in hand with legacy code and There's a number of different definitions for legacy co by one definition is like is he code is code that isn't actively being written. So if you write something and then you consider it done and then the next week like no one's working on it that technically according to that person's definition is legs Z code. So that can be re factored. You know you can refer something you wrote earlier in the day it doesn't have to be a year later or ten. Yeah absolutely. I mean you just you get it working you know A. Little bit more. You apply that learning back to it and with the tooling these days is really good. It's not just a matter of you know if you go back to nineteen, ninety nine, you read Martin Fowler's re in book he talks about these are the steps that you take by hand. Make sure you don't make a mistake and now the steps are highlight right click apply re factoring I mean that's not one hundred percent true. In the example we're GONNA talk through is not like that exactly but there are steps along the way where it is potentially definitely winters in static analyzers. Heavily. 
Linters and static analyzers are heavily underutilized, I feel, and so many of them will just automatically apply the changes you want to make, which is fantastic for huge code bases, where it would be almost impossible to do by hand. Yeah, absolutely — it would definitely be risky, so maybe that's why people sometimes avoid it. Now, before we get into that, let's start with your story. How did you get into Python? I know you're into languages, and we'll talk about that, but Python first. So the short version of a long story is that my degree at university, which wasn't computer science, required at least two introductory programming courses. The first intro course was in Python, the second one was in Java, and I ended up really enjoying the classes — I took a couple more — but ultimately stuck with the career I had entered into, which was actuarial science. That's insurance statistics. So you were in some form of math program, I'm guessing? Yeah, it's very boring to explain, but if you like math, it's a great career. Awesome. So for my first job out of university I ended up working at a software company that, very simply explained, created the insurance calculator that many insurance companies use, and after working there for about four or five years I had fallen in love with the software engineering side of my job and decided I wanted to transition full time to a purely technical company. So fast-forward a couple of years, and now I work for NVIDIA as a senior library software engineer. That's how I got into programming, and the code base we work on is completely open source and primarily uses C++14 and Python 3. That sounds like a dream job. Yeah, I absolutely love it. So you're working on the RAPIDS team, right, which does a lot of the computation that might otherwise be done in pandas, but over on GPUs — is that roughly right? Yeah, that's a great description. Within NVIDIA I work for an organization called RAPIDS, and we have a number of different projects. Specifically, I work on cuDF — that's c-u-d-f. The "cu" is the two letters from CUDA, which is the parallel programming language NVIDIA has made, and the F stands for data frame. So this is basically a library very similar to pandas, the difference being that it runs on the GPU. The one-liner for RAPIDS is that it's a completely open source, end-to-end data science pipeline that runs on the GPU. If you're using pandas and it works great for you, there's no reason to switch, but if you run into a situation where you have a performance bottleneck, cuDF can be a great drop-in replacement. We don't have one hundred percent parity with the pandas library, but we have enough that a lot of Fortune 500 companies who pick us up are able to very easily transition their existing pandas code. Just change an import line and have it go much faster — something incredible like that? That's the goal; that's the dream. Yeah. I just recently got a new high-end Alienware desktop, and it's the first GeForce I've had in a long time — not some AMD Radeon in a notebook or something like that — so I'm pretty excited to have a machine where I can test some of these things out at some point. Yep — GPU acceleration on different devices is very exciting.
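A minimal sketch of that drop-in idea, assuming RAPIDS is installed on a machine with an NVIDIA GPU: the same DataFrame calls run against pandas on the CPU and cuDF on the GPU.

```python
import pandas as pd
import cudf

pdf = pd.DataFrame({"lang": ["C++", "Python", "Java", "Python"],
                    "solved": [1, 1, 0, 1]})
gdf = cudf.from_pandas(pdf)  # copy the data into GPU memory

print(pdf.groupby("lang")["solved"].sum())  # pandas, CPU
print(gdf.groupby("lang")["solved"].sum())  # cuDF, GPU, same API
```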
All right. Well, let's start by briefly introducing refactoring — we've talked a tiny bit about it in general — and then we're going to dive into a cool example you put together that really brings a lot together. What I love about your example is that it's something you just grabbed off the internet; it's not contrived, like "let's write this and then unwind the refactorings." You just found it, like, let's see what this thing does. That's going to be fun. Let's start with a quick definition of refactoring. Maybe: how do you know when you need it? For me, I have a number of anti-patterns in my head, and when I recognize them in the code — some people might refer to them as technical debt — that's the signal. The idea is that when you initially write things, you don't have the full picture in mind, and as time goes on you start to build up technical debt in your code. Refactoring can be reorganizing, restructuring, or rewriting little bits of your code to reduce that debt and make it more readable, maintainable, scalable — just better code in general. That's the way I think of it. Yeah, and in its pure sense it should not change the behavior, at least in terms of inputs and outputs. Exactly, and the easiest code to refactor is code with tests, whether that's unit tests, regression tests, or any of the other kinds of tests there are. If you have a code base that has zero tests, refactoring is very, very dangerous, because you can refactor something, completely change the behavior, and not know about it, which is not ideal at all — somewhat suboptimal, indeed. When Martin Fowler came up with the idea of refactoring, or at least publicized it — sure, the ideas were basically there before — one of the things that struck me most was not the refactorings themselves but this idea of code smells. It's an aesthetic: you look at the code and it works, but your nose kind of turns up. It's not broken, but it's not nice. There are all sorts of code smells — too many parameters, long method, things like that — but they rarely have clear cutoffs. It's not that over twelve lines the function is too large but under that it's totally fine; it's never really super clear-cut. So this whole idea of refactoring, much like the aesthetic itself, requires going over it again and again throughout your career to refine what the right goal is, and it probably varies by language as well, a little. Yeah. If you start to do it consciously when you're looking at code and asking yourself, when you have that code-smell feeling, what exactly isn't right here, then slowly over time you'll start to pick up on what it is. A very, very small one for me — and I think this is mentioned in maybe Clean Code, or it might have been Martin Fowler's book — is declaring a variable earlier than it needs to be declared. You might declare all your variables at the top of a function, but two of them you use immediately and the other three you don't use until the last four lines of the function. Small things like that. It seems simple, but I've made the change of moving a declaration closer to where it gets used, and then you realize — oh, wait a second.
This thing is assigned a value, but it's not actually used later on, so I can just delete it. Because it was declared at the top of the function, you couldn't see whether it was being used somewhere else, and you actually have a phantom unused variable that can be deleted. It's simple things like that that lead to better changes later on. Well, and there's the mental overhead too — like you said, the technical debt side of things. For example, there's a variable at the top that surely was used when the code was written, but the code has been modified over the years and it's no longer used; because it's separated from where it's declared, you don't want to mess with it. If you start messing with it, you're asking for more work than the minor change you came to make — I don't want to break anything. And then the next person who comes along has to figure out: why is that set_count variable there? It doesn't feel like it's being used, but it's there. You've got another thing to think about that's in the way, for sure. Yeah. Certainly there are fantastic tools that will highlight "this variable is unused" or "this assignment is meaningless," so there are options, but it's still better not to let that stuff live in the code. Agreed. Let's talk about this example you've got here — and maybe give a little background on your language enthusiasm and your interest in programming competitions, because this example came from you trying to reach out and do some analysis of those ecosystems, right? Yeah. So I initially got into competitive programming, quote-unquote — the one-sentence description is that there are a number of websites online, HackerRank, LeetCode, Codeforces, that host these one-, two-, or three-hour contests with three to five problems that start out easy and get harder as you progress. You can choose any language you want to solve them in, and the goal is just to get a solution that passes as quickly as possible. It's not necessarily about how efficient your code is — it has to run within a certain time limit — but if you can get it to pass, Python versus C++ versus Java, any solution works. I started doing these to prepare for technical interviews: if you're interviewing at companies like Google, Facebook, and so on, a lot of their interview questions are very similar to the questions on these websites. At one point I was looking for a resource online, like YouTube videos, that just explained this stuff, but at the time I couldn't really find any, so I started a YouTube channel covering the solutions to these problems, and I thought it would be better to solve them in a number of languages rather than just C++. So I started solving them in C++, Python, and Java, and that's what led to my interest in competitive programming. Even though I'm not interviewing actively anymore, I just find these super fun — they keep you on your toes in terms of your data structures and algorithms knowledge, and you can treat them as code katas.
I'm not sure if you're familiar with the concept — it's just writing one little program and trying it a couple of times in different languages, and you learn different ways of solving the problem that you might not have used initially. For this example, I decided to figure out what the top languages are that people use to solve these competitive programming problems on a given website, and the site I chose was Codeforces. Yeah, and you're like, hey, I'm working on this new data frame library that's like pandas — let me see how I can use pandas to solve this problem and get some practice, right? Yeah. When I first started at NVIDIA, I knew the pandas library existed, but I had zero experience with it, and I knew it had the sort of group-by reduction functionality where, if you have a big table of elements, you can very easily get statistics like what's the top language, or what's the average time it takes people to submit. So I thought, what better way to learn pandas than by building a simple example that uses the library for something I'm interested in? The first thing I did was google "how to scrape HTML tables using pandas," and that brought me to a blog that, at the end of the day, has about sixty lines of code; it's a tutorial that walks you through how to get the data out of an HTML table. The PyCon talk I gave basically came out of doing this — I had no plans of giving a talk on it — but after going through it and refactoring it piece by piece, I realized I could give a pretty simple talk to the Conor of five years ago, who didn't know about any of this: I didn't know about comprehensions, I didn't know about enumerate, I didn't know about the different techniques I was using. I figured that for at least some individuals out there it would be a useful talk, highlighting the things I didn't know when I first started coding in Python but that are now second nature to me. That's where it came from. Yeah, and it's a really interesting example. A lot of the refactorings were "let's try to make a more Pythonic, more idiomatic version of this" — really understanding the for-in loop, for example. So in a lot of ways it's a core refactoring, but it's also about leveraging more of the native bits of the language, if you will. Absolutely. So the code does two basic things: it downloads some HTML, then pulls it apart using, I think, the lxml HTML parser, and then it loops over the results it gets from the HTML parser and turns them into basically a list or a dictionary. Then you feed that over to pandas and ask pandas some pretty interesting questions, and most of the challenge — most of the messy code — lived in the HTML parsing part, right? Yeah, that's a pretty good description of what's happening. So let's just talk through some of the issues you identified and the fixes: how did you identify each one as a problem, and what fix did you apply? There's a lot of code, and it's hard to talk about code in audio, so we'll try to stay as high-level as possible and talk about the general patterns and what we fixed. The first part of the code creates an empty list and an index to keep track of where it is, then loops over the elements, appending information to the list and incrementing the index as it goes.
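Here is a hedged reconstruction of the general shape of that scraped-table code — not the blog's actual code, and the URL is a placeholder — just enough to make the refactorings that follow concrete.

```python
import requests
import pandas as pd
from lxml import html

page = requests.get("https://example.com/standings")  # placeholder URL
tree = html.fromstring(page.content)

# Create empty list   <- the kind of comment discussed next
rows = []
index = 0
for tr in tree.xpath("//table//tr"):
    row = []
    for td in tr.xpath(".//td"):
        row.append(td.text_content().strip())
    rows.append(row)
    index += 1

df = pd.DataFrame(rows)
```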
Yep. And I think the first thing you called out was the code comments. You're like, what is this comment here? It just says we're looping over these things — but what do you think a loop is, and why is there a comment? Yeah, and arguably even worse — some might argue the second comment adds some value — the first comment, above the line that creates an empty list, says "create empty list." The code it describes is only, what, six characters if you don't include the spaces? That's definitely one of the things called out in a number of refactoring books: comments should add value that is not already clear from the code. Even beginners can tell you're creating an empty list there; there's no reason to state what the code is doing. Typically comments should say why, if it's not clear why something is being done a certain way, or something is implicit and not clear from the code itself. Yeah, and in terms of refactoring, I love this idea that these comments are almost warning signs, because if I find myself writing one of these comments to make things more clear, then wait a minute: if this is just describing what's here, something about what I'm doing is wrong. Maybe the variable name isn't at all clear about what the heck it is, or maybe it could use a type annotation to say what types come in — how about a list[str] annotation right there? We've had type hints since Python 3.5, after all. In the code smells discussion, Fowler had this great description of calling these kinds of comments deodorant for code smells: something's wrong, it smells a little, so we cover it up. Every time I see one of those, it means I need to rename this function — a short version of what the comment would say — or rename this variable, or break these things apart, because if it needs a comment, there's probably a better structure. There's an individual in the C++ community, Tony Van Eerd, who has a recommendation — not a rule — that you grep your code base for "step one, step two, step three," and guaranteed you're going to get one or two matches. A lot of the time it's these step comments sitting on top of pieces of code inside a larger function, and odds are you could make that code better by refactoring each of those steps into its own small function named after the step. Like you said, if you write "step one" and a description, you've already given it a name; you just need to take the next step, put it in a function, and give that function that name.
Yes, exactly — exactly what you said. I think there was even some tool, way back in the early days of C#, where if you highlighted code to refactor and it included a comment, it would try to guess the new function's name from the comment, turning it into something that would work as an identifier in the language. Anyway, it was a good idea. So there are a couple of things going on here. One is: why is there a print statement nobody needs? Once you take that out, though, you were able to identify this — let's take a step back first. If you have an integer and you increment it every time through the loop so that it stays in sync with the index of the element you're looping over, that's probably not the best way to do it, right? Python has a built-in: enumerate. Yeah, this is probably one of the most common things I see in Python. Sadly, certain languages don't have this function, but in Python it's right there, built into the language, and it's called enumerate. You pass whatever thing you're looping over to enumerate, and it bundles it with an index that you can then destructure inline into an index and the element you were getting from your range-based for loop before. So anytime you see an index variable — an idx or an i, or sometimes it's j, sometimes it's k, or x if you're being really creative — that's keeping track of the position, there's a built-in pattern for avoiding that, and it makes me extremely happy. It actually happens not just once in this piece of code but twice that you can make use of enumerate, and once you see it, it's very hard to unsee it. But like I said, enumerate was something I learned from Python, not something I learned in school, so there are a lot of developers, in many languages, who just aren't aware of it, and as soon as you tell them, they'll agree: oh yeah, this is way better than what I was doing before. Yeah, you just need to be aware of it. Otherwise you're always wrestling with these issues: you've got to create the variable — why is that variable there? — then you've got to make sure you increment it; do you increment before you work with the value or after; is it zero-based or one-based? All of these are little complexities that make you ask, what is happening here? What if you have a pass or a continue that skips the rest of the loop but you forget the increment? There are all these little edge cases, and with enumerate you know it's always going to work. You can even set the start position to one if you want to count one, two, three. Beautiful. Yeah, that's a great point.
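A tiny before-and-after of the pattern being described (the data here is made up):

```python
langs = ["GNU C++", "Python 3", "Java 8"]

# Before: track the index yourself and keep it in sync by hand
i = 0
for lang in langs:
    print(i, lang)
    i += 1

# After: enumerate bundles the index with each element,
# and start=1 gives one-based numbering if you want it
for i, lang in enumerate(langs, start=1):
    print(i, lang)
```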
You've got this little cleaner you look at it again and you say, well, now what we're doing is we're creating a list an empty list which we commented great emptiness that was cool book that coming out, but it was very helpful in the beginning to help you understand now, and then you say we're GonNa loop over these items and then a pen something to that. List well as possible. But this is one of your anti patterns that you like to like the finding get rid of right. This is an anti pattern that I call initialized then modify and actually enumerate example previously also a falls into this anti anti pattern. So anytime you have a variable that it doesn't need to be a for loop many many times it is that inside each iteration of that four loop, you're then modifying what you just initialized outside that is initialising and. Then modifying and my ass rotations is that you should try to avoid this as much as possible when it comes to the pattern of initialising an empty list, and then in each iteration of your loop, you're calling append that is built in to the Python language as something that can be used as a list comprehension, which is so much more beautiful in my opinion compared to just a a roth four loop and then appending for each iteration. Yeah. Every now, and then there's Like a complicated enough set of tests or conditionals or something going on in there that maybe not. But I, agree with you most the time at just means what I really wanted to write was a list comprehension. It is though meal bracket item four item in such and such if such as I tried that, that's what you gotta do. Yeah. Let's comprehension. Once you start to use it moving to a language that doesn't have it makes you very sad because it's Joe's needs. He's totally makes Zad and. I really really wish list comprehension had some form of sorting clause because at that point, you're almost into like in memory data base type of behaviors, right? Like I would love to say projection thing transformed thing for thing in collection where the test is order by whatever I mean, you can always put assorted around it, but it's it'd be lovely if they're like it's already got those nice steps I like to ride it on three lions, right? The projection, the set and the conditional like just one more line but the order by in there but maybe wonder maybe I should put a pep in there. Who knows I was gonna say that sounds like a future pep. Definitely I mean it would be easy to implement. Transform went to SORTA impasse at his key or something like that. But anyway, it would be really cool. But they're they're very, very nice even without that and once you have it as a list comprehension then unlocks the ability to do some other interesting stuff which you didn't cover in yours because it didn't really matter. But if you have square brackets there and those brackets are turning a large data collection into a list if you put rounded brackets all the sudden, you have a much more efficient generator. Yup that is something I don't call out at that point but at. The end of the talk I- Lewd to article that was mentioned on the other podcast that you coz Python Bites. Yeah. Thanks for the shout out by the way. Yeah. No, it was a great article. 
In the talk I mention generator expressions right after list comprehensions, and I mention that these things go hand in hand — you should familiarize yourself with both, because if at any point you're passing a list comprehension to something like any() or all(), you can drop the square brackets, pass the generator instead, and it becomes much more efficient. And there's no way to go quickly and easily from a for loop to a generator, yield-style of programming — there's no "for yield i in whatever" — but with a comprehension it's square brackets versus rounded parentheses. It's so close that, when it makes sense, it's basically no effort to make it happen. Okay, so we've got a list comprehension, which is beautiful, and then you say, all right, it's time to turn our attention to this doubly nested for loop. It goes over a bunch of the items, pulls out an index, and then works with the index — so that's another enumerate — and then, I think, another pretty interesting thing you talk about, I don't remember exactly where it came in the talk: what you're doing in this loop is actually looping from the second element onward over all the items, and that really is just a slice. Yeah. In this nested for loop, the outer loop basically reads "for j in range(1, len(your_list))" — you're creating a range of numbers from one to the length of your list, and then right inside the loop you're creating a variable that's the j-th element of the list. All you're doing is skipping the very first element, but the way you're doing it is by generating explicit indices with the range function and the length function. At first I thought they must be doing this because they needed access to the index later, or to the earlier elements, but that wasn't the case — the only reason for all of it was to skip over the first element. And, very nicely, Python has slicing, with the syntax of square brackets and something in the middle, and to skip the first element you just write [1:] and go to the end. That's beautiful, because you don't even have to check the length of the items — you just go to the end, which avoids the whole class of questions like, do I have to plus-one here, or minus-one, or what's the ending piece supposed to be? Yep, it's so convenient: you avoid a call to len, you avoid a call to range, and you avoid the local assignment on the first line of your loop. You can remove all of that, just use slicing, and you're good to go. Slicing is a really, really awesome feature — it actually comes from a super old language created in the sixties called APL — and Python is one of the languages that has negative index slicing, where you can pass a negative one so that it wraps around to the last element. It sort of looks weird at first, but once you use it, it's so much more convenient than doing a len-minus-one or something like that.
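A quick sketch of the slicing point, including the negative index that wraps around from the end (again, made-up data):

```python
cells = ["header", "GNU C++", "Python 3", "Java 8"]

# Before: generate explicit indices just to skip the first element
for j in range(1, len(cells)):
    print(cells[j])

# After: slice off the first element directly
for cell in cells[1:]:
    print(cell)

print(cells[-1])   # last element, no len() - 1 arithmetic
print(cells[-3:])  # last three elements
```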
It's a little unreal until you know what it does, and then it's great. It's like, I want the last three — I don't want to care how long the list is, I just want the last three — and that's fantastic. Slicing, I think, is fairly underused by people who come from other languages, but it fits the bill, because there are so many of these little edge cases. You talk about errors in programming — off-by-one errors are a significant share of the problems — and slicing just skips them altogether, which is beautiful. Okay, on to the next thing. So you're parsing this stuff off the internet, which means you're working with one hundred percent strings, but some of the time you need numerical data so you can ask numeric questions of it. And — this is going to be fun to talk about — they have "try: value = int(data)": pass the potentially numeric string over to the int initializer, and either that works or it throws an exception, in which case you say "except: pass." The original article had that, right? It's this try-parse-except, and otherwise it stays set to the original string value, or something to that effect. So what do you think about this — how did you feel when you saw it? My initial reaction was that this is four lines of code that can potentially be done in a single line using something called a conditional expression. In many other languages there's a ternary operator, typically a question mark, where you can assign to a variable based on a conditional predicate — something that evaluates to true or false — and if it's true you assign one value, and if it's false you assign another. In Python, the conditional expression reads, in this case: data = int(data) if data.isnumeric() else data — it's ordered a bit differently from the usual ternary operator. isnumeric() returns true or false based on whether the string is a number, so if it returns true you end up assigning int(data), and otherwise you just assign data back to itself, with no transformation, because it isn't numeric. It's one line of code, it's more expressive in my opinion, and it avoids using try/except, so it's preferable from my point of view. I'd say it's probably preferable from my point of view as well. I have mixed feelings about it, but I do think it's nice under certain circumstances. For one, if you write "try the thing, except pass," a lot of linters, and PyCharm and whatnot, will flag it as too broad an except clause — you're catching too much — and then, to make the little squiggly in the scroll bar go away, you have to put a disable-check comment on it. So now it's five lines, one of them a weird exception saying no, no, this time it's fine. That's not ideal. I definitely think the one-liner is more explicit and more expressive. The one situation where I might step back and just do the try is if there's more variability in the data. This assumes the data is not None and that it's string-like, right? If you potentially get objects back, or None some of the time, then you need a little bit more of a test — I mean, you could always write "if data and data.isnumeric()", and that's okay.
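Here are the two variants being compared, side by side; this assumes the scraped value is a string, as in the example discussed:

```python
raw = "42"

# Original style: try to parse, and keep the string if it fails
try:
    value = int(raw)
except ValueError:
    value = raw

# Conditional-expression style: one line, no exception machinery
value = int(raw) if raw.isnumeric() else raw
```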
But then it becomes "if data and isinstance(data, str) and data.isnumeric()", and at some level there are enough tests that you kind of want to just let it crash, right — we'll just catch it and move on. But, as we were talking about before we recorded, there's also potentially a performance consideration. Definitely, and it's interesting — I'll let you speak to what you found, but in the YouTube comments on the PyCon talk, one of the most discussed things was whether the conditional expression was less performant than the original try/except, because a couple of individuals commented that it was more Pythonic to use the try/except and that it therefore might be more performant. Can you share what you found? Sure. Well, in terms of the Pythonic side: certainly compared to other languages, say C++, Python leans more toward the "easier to ask forgiveness than permission" style of programming rather than the alternative, "look before you leap." In C it could be a page fault and the program just goes poof; here it throws an exception and you catch it. So there's a tendency toward that style. But in terms of performance, I wondered — maybe this is faster, maybe it's slower, let's actually check — so I wrote a little program, which I've linked in the show notes. It's simple: it creates a list with one million items, it uses a fixed random seed so there's no variability — it's random but predictable — and it builds a list of strings, a mix of ones that hold numbers and ones that don't. Then it goes through and tries both approaches to convert as many of them as possible to integers: either with the try/except-pass or with the isnumeric test. It was about six and a half times faster to do the test — the one-line test — than to let it crash and realize it didn't work. Yeah. So there you go — you heard it here on Talk Python to Me: conditional expressions are faster than try/except. Talk Python to Me is partially supported by our training courses. How does your team keep their Python skills sharp? How do you make sure new hires get started fast and learn the Pythonic way? If the answer is a series of boring videos that don't inspire, or a subscription service you pay way too much for and use way too little, listen up. At Talk Python Training we have enterprise tiers for all of our courses: get just the one course you need for your team, with full reporting and monitoring, or swap that barely-used subscription for our course bundle, which includes all the courses, and you pay about the same price as a subscription — once. For details, visit training.talkpython.fm/business or just email sales@talkpython.fm. There's a lot of overhead in throwing an exception, catching it, and dealing with all that. Now, this is one particular use case, and like all benchmarks it might vary — if you've got ninety-five percent numbers and five percent strings it might behave differently — so there's a lot of variation you can play with. But in what seems like a reasonable example to me, it's faster to do the isnumeric test, and not by a little: not five percent faster, six hundred fifty percent faster, which is worth thinking about, for sure.
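A rough re-creation of the benchmark described above — the mix of numeric and non-numeric strings here is a guess, so the exact speedup will differ from the episode's 6.5x, but the shape of the test is the same:

```python
import random
import timeit

random.seed(1)
items = [str(random.randint(0, 999)) if random.random() < 0.67 else "n/a"
         for _ in range(1_000_000)]

def with_try_except():
    out = []
    for x in items:
        try:
            out.append(int(x))
        except ValueError:
            out.append(x)
    return out

def with_isnumeric():
    return [int(x) if x.isnumeric() else x for x in items]

print("try/except:", timeit.timeit(with_try_except, number=3))
print("isnumeric :", timeit.timeit(with_isnumeric, number=3))
```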
So coming through to the end: there was a ton of stuff here — it was like twenty lines of code just for these two loops — and now you've got it down to four lines: basically an outer loop and an inner loop that grab the data and append it, with this little test. Much nicer. Agreed, yep. And if you look at the overall program at this point — you were doing some analysis, some reporting — it started at sixty lines of code and now it's down to, what, roughly? Depending on how you count lines, it went from about sixty down to about ten or twenty. And at this point I pointed out that I had made a mistake. So this was fantastic, or at least I thought so: I'd taken a code snippet from a blog and reduced it by roughly seventy-five percent, or sixty-seven percent depending on how you measure it. But I had made an even bigger mistake than I realized. When I originally showed googling "how to scrape HTML using pandas," I grabbed the second result, and the third result was actually what I should have chosen: pandas has a read_html method right in the library. The point I go on to make is that if you use it, you go from ten or twenty lines down to about four lines of code — you're just invoking this one pandas API, read_html — and it's so much better. So refactoring is fantastic, but there's a quote that the best code is no code. If you don't have to write anything to do what you want to do, and you can just use an existing library, that's the best thing you can do, because it's going to be way more tested than the custom code you've written, it's going to save you a ton of time, and you end up with less code to maintain yourself. And better still, someone else maintains the code you're using for you. Exactly right — it gets better for no effort on your part. It might get faster, it might handle more cases of broken HTML, who knows, but you don't have to keep maintaining it: it's just pandas' read_html, and pandas is actively being maintained. One of the things I've echoed in some of my other talks is knowing your algorithms. In C++ there's the whole standard library; in Python there are a lot of built-in functions — I guess they're not so much called algorithms, they're called built-in functions — and there's a whole page of them I was just looking at the other day, with a ton I wasn't aware of. Everyone knows about map, filter, any, all — but I just saw, I think it's called divmod, a built-in function that gives you both the quotient and the remainder. There have definitely been a couple of times where I've needed both of those and did the operations separately, and if I'd just known about it, I could get both in a single line and destructure them using iterable unpacking. Knowing your algorithms is great, but so is knowing your libraries and your collections: the more familiar you get with what exists out there, the less you have to write and the more readable your code is, because if everybody knows about it, we have a common knowledge base that's transferable to every project you work on. Your final version basically had two really meaningful lines: one was requests.get and the other was pandas.read_html.
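A hedged sketch of that final two-call version; the URL is a placeholder for a page that contains an HTML table, and pandas.read_html needs an HTML parser such as lxml installed:

```python
import pandas as pd
import requests

resp = requests.get("https://example.com/standings")  # placeholder URL
tables = pd.read_html(resp.text)  # one DataFrame per <table> on the page
df = tables[0]
print(df.head())
```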
You don't have to explain to anyone who has done almost anything with Python what requests.get means. Oh yeah, okay, got it, right? We all know how that works, we know it works well, and so on, and it's really nice. I think, though, what you've touched on here is really important, but it also shows why it's kind of hard to get really good at a language, and the reason is there are so many packages. You go to, let me try, PyPI.org, and every time I go there are more packages, well over two hundred thousand now. If you want to learn to be a good Python programmer, you need to at least have awareness of a lot of those and probably some skill in some of them, because pandas is one of those and requests is another one, right, the four-line solution that you came up with was built on those two really cool libraries. So to be a good, effective programmer you have to keep your eye on all those things, and I think that's both amazing and also kind of tricky, because it's like: I'm really good with for loops and functions, great, now you've got two hundred thousand packages to study, go. There's some quote I've heard before that being a language expert is ten percent language, ninety percent ecosystem, and you can't be a guru in insert-any-language if you don't know the tools, if you don't know the libraries. It's so much more than just learning the syntax and the built-in functions that come with your language. It takes years, and it definitely doesn't happen overnight. It's a challenge for all of us, for sure. You know, maybe it's worth a shout-out to awesome-python.com right now as well, which has the different categories you might care about and then highlights some of the more popular libraries in each area. Yeah, that's a good one, for sure. So, you did nine different steps. You actually have those called out very clearly in your slides, and you can get the slides from the GitHub repo associated with your talk, which I linked to in the show notes, of course. But all of this refactoring talk was really part of the journey to answer a totally different question, which was: what are the most popular languages for these coding competitions? Yeah, the ultimate goal was to scrape the data and then use pandas to do that analysis, and at the end of the day I believe, I definitely know, the number one language was C++, at about eighty-nine percent. That typically is the case because most websites give the same time limit regardless of language. Some, like HackerRank, vary it by language, so for Python the execution time you're allotted is ten times more; even though Python is slower, they give you a fair amount of time. But most websites don't do that. The Codeforces website gives you, I think, two seconds of execution time regardless of the language you use, and so due to that most people choose the most performant language, which is C++. But in second place was Python, and I know a lot of competitive programmers who, for the problems where performance isn't the issue they're trying to solve for, always use Python, because it's a fraction of the number of lines of code to solve it in Python compared to any other language. Sometimes you can solve a problem in one line in Python and the next closest language takes like five lines, which is a big deal when time matters. Yeah. Are you optimizing execution time or developer time in this competition, right? Yeah.
It definitely matters what you're trying to solve for. C++ was first, Python was second, Java was third, and then there was a bunch of fringe languages; the top three of those were C#, Pascal, and Kotlin, and you can see the full list if you go watch the PyCon talk. But it was fun to find out what was used and what wasn't, and cool to see the evolution of what you created to answer that question. It's just pretty neat. All right, well, let's talk a little bit about RAPIDS, because I know there are data scientists out there who are probably interested in that project. We did mention a tiny bit that it basically has the same DataFrame API as pandas, or something like that, pretty close, not one hundred percent identical in everything, but pretty close, and it runs on the GPU. Why are GPUs better? Like, I have a really fast computer, I have a Core i9 with six cores I got a couple of years ago. That's a lot of cores, right? So yeah, the first thing I should highlight is that RAPIDS is more than just cuDF. cuDF is the library I work on; we also have cuIO, cuGraph, cuSignal, cuSpatial, cuML, and each of those maps to a different thing in the data science ecosystem. So cuDF is definitely the analog of pandas; for cuML, I think the analog you can think of is scikit-learn. But also, none of this is meant as a replacement, they're just meant as alternatives. If performance is not an issue for you, stick with what you have, there's no reason to switch. Yeah, and don't do it because, for example, I couldn't run it on my MacBook, right? Because I've got a Radeon. Right, right. If you do want to try it out, if you go to rapids.ai we have links to a couple of examples using, like, Google Colab that are hooked up to a free GPU, so you can just take it for a spin; you don't need the hardware to go try it out. But our pitch is sort of: this is useful for people that have issues with compute, and for different pieces you're going to want different projects. So if you're doing pandas-like data manipulation, cuDF is what you want. But yeah, why are GPUs faster? It's just a completely different device and a completely different model. A GPU, that's the G in GPU, is known for being great for graphics processing, which is why it's called a GPU. But at some point someone coined the term, and he actually works on the RAPIDS team, Mark Harris, he coined the term GPGPU, which stands for general-purpose GPU computing. It's now typically referred to as just GPU computing, but it's the idea that even though the GPU model is great for graphics processing, there are other applications that GPUs are also amazing for. The next best one is matrix multiplication, which is why they became huge in neural nets and deep learning. But since then we've basically discovered that there's not really any domain that we can't find a use for GPUs in. So there is a standard library in the CUDA world called Thrust. If you're familiar with C++, the standard library there is called the STL, and it has a suite of algorithms and data structures that you can use; Thrust is the analog of that for CUDA, and it has reductions.
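As a rough illustration of that "analog of pandas" point, here is a minimal sketch of the cuDF flavor of a familiar pandas workflow. It assumes an NVIDIA GPU with the RAPIDS packages installed, and the file and column names are made up for the example.

    import cudf  # needs an NVIDIA GPU plus the RAPIDS cuDF package

    # The calls mirror pandas: read a CSV, group, and aggregate, all on the GPU.
    df = cudf.read_csv("trips.csv")  # placeholder file name
    summary = df.groupby("passenger_count")["fare_amount"].mean()
    print(summary)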
It has scans, and it basically has all the algorithms you might find in your C++ STL, and if you can build a program that uses those algorithms, you've just GPU-accelerated your code. However, using Thrust isn't as easy as some might like, and a lot of data scientists are currently operating in Python and R; they don't want to go learn C++, and then CUDA, and then master the Thrust library, just in order to accelerate their data science code. The RAPIDS goal is basically to bring this GPU computing model, this sort of general-purpose acceleration of data science compute, or whatever compute you want, to the data scientists, so if they're familiar with the pandas API, let's just do all that work for them. So RAPIDS is built heavily on top of Thrust and CUDA, and we're basically doing all this work for the data scientists so that they can take their pandas code, like you said, hopefully just replace the import, and you're off to the races. And the performance wins are pretty impressive. I'm not on the marketing side of things, but in the talk I mentioned that I just happened to be listening to a podcast, the NVIDIA AI Podcast, and they had, I believe his name was Nicholson, and by swapping out cuDF for pandas in their model they were able to get a hundred-x performance win and a thirty-x reduction in cost. That's thirty times, not thirty percent. Yes, thirty x, that's right, a multiplicative value, which is massive. If it's a hundred x in terms of performance, that's the difference between something running in sixty seconds versus an hour and forty minutes, and if you can also save thirty x, if that cost you a hundred bucks, now you only have to pay about three dollars. It seems like a no-brainer for those individuals who are impacted by a performance bottleneck. If you're hitting pandas and it runs in a super short number of seconds, it's probably not worth it to switch over. Yeah. Well, and you probably, you tell me how realistic you think this is, but you could probably do some kind of conditional import. In the import you could try to get the RAPIDS stuff working, and if that fails, you could just import pandas as the same name: one is pd, the other is pd. Maybe it just falls back to working on regular hardware, but it's faster when it works. I think that is definitely possible. There are going to be limitations to it, though; obviously if you have a cuDF DataFrame, I don't think you can do it piecemeal. But if you have a large project, what I'm thinking is: if you wrote it for the RAPIDS version but then let it fall back to pandas, not the other way around. If you take arbitrary pandas code and try to RAPIDS-ify it, that might not work, but it seems like the other direction may well work, and that way, if somebody tries to run it and they don't have the right setup, it's just slower. There's definitely a way to make that work. It might require a little bit of boilerplate framework code that does some sort of checking, you know, "is this compatible, else...", but that definitely sounds automatable. Yeah, that sounds cool, because it would be great to have it fall back to, not "not working", just not as fast.
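A minimal sketch of that fallback idea, assuming the code only touches the part of the pandas API that cuDF also supports:

    # Prefer the GPU-backed library when it is available, otherwise run on the CPU.
    try:
        import cudf as pd  # RAPIDS installed and a suitable NVIDIA GPU present
    except ImportError:
        import pandas as pd  # plain pandas: the same code runs, just slower

    df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print(df["x"].sum())

Both libraries accept this constructor and aggregation, so the script runs either way; anything outside the shared API surface would still need the kind of compatibility checks discussed above.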
Right. The future of computing is headed to a place where we can dispatch compute to different devices without having to manually specify that I need this code to run on the CPU versus the GPU versus the TPU versus, in the future, I'm sure there's going to be a QPU, a quantum processing unit. Exactly. We all think serially, most of us who don't work at NVIDIA, in terms of the way we do compute, but I think in ten or twenty years we're all going to be learning about different devices, and it's going to be too much work to always keep track in our heads of which device things should go to. At some point a programming model is going to come out that just automatically handles when work can go to the fast device and when it should be sent to the CPU. Yeah, absolutely. So, while you were talking, I pulled it up on that Alienware gaming machine I've got. It has a GeForce RTX 2070, which has 2,304 cores. That's a lot, that's a lot of cores, and if you look around, Google claims it achieves 7.5 teraflops, and the Super version increases that to nine teraflops, which is just insane; a Core i7 is doing, what, a fraction of that, point two eight or something like that. So anyway, the numbers boggle the mind when you think of how much computation graphics cards do these days. I think top of the line, and I might get this wrong, but modern GPUs are capable of fifty teraflops. It's an immense amount of compute that's hard to fathom, especially coming from the CPU way of thinking. Yeah, absolutely. The only reason I didn't get a higher-end graphics card is that every other version required water cooling, and I'm like, that sounds like more effort than I want for a computer, so I just went with this one. All right, well, RAPIDS sounds like a super cool project, and maybe we should do another show with the RAPIDS team to talk about these things a little more deeply. It sounds like a great project. I work on the C++ lower-level engine of it, but I'd be happy to connect you with some of the Python folks who work on that side of things, and I'm sure they'd love to come on. Yeah, that'd be fun. All right, now before you get out of here, I've got to ask you the questions: if you're going to write some Python code, what editor do you use? So, I am a VS Code convert; that's what I typically use day to day. Nice, yeah, that's quite a popular one these days. And then a notable PyPI package you've run across that you think people should know about? Yeah, so I like to recommend one. There's a built-in standard library module which I'm pretty sure most Python developers are familiar with, itertools, which has a ton of great functions, but less well known is a PyPI package called more-itertools, and I'm not sure if this one's been recommended on the show before, but if you like what's in itertools, you'll love what's in more-itertools. It has a ton of my favorite algorithms, chunked being one of them: you basically pass it a list and a number, and it gives you a list of lists consisting of that many things. It's like paging, yeah. Yeah, and there are tons of neat functions. Another great one that's so simple but doesn't exist built in is all_equal; it just checks, given a list, whether all the elements are the same. It's a simple thing to do, you can do it with all(), but then you have to check whether every element is equal to the first one or the last one.
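A quick sketch of those two helpers from more-itertools as described, chunked for the paging idea and all_equal for the equality check:

    from more_itertools import all_equal, chunked  # pip install more-itertools

    # chunked: break an iterable into lists of at most n items.
    print(list(chunked(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

    # all_equal: True when every element of the iterable is the same.
    print(all_equal([3, 3, 3]))  # True
    print(all_equal("abca"))     # False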
So there's just a ton of really convenient functions and algorithms in more-itertools. Yeah, that's cool, and you can combine these with, like, generator expressions and stuff: pull some element out of each object in a generated collection and ask whether all of those are equal. All these ideas go together well. Yeah, they compose super nicely. Yeah, for sure. All right, final call to action: people are interested in doing refactoring and making their code better, maybe even checking out RAPIDS. What do you say? I'd say, if you're interested in what you heard on the podcast, check out the PyCon talk, it's on YouTube. If you search for PyCon 2020, you'll find the YouTube channel. And if you're interested in RAPIDS, go to rapids.ai and check us out there. I assume all this stuff will be in the show notes as well, so maybe that's easiest. Yeah, it will. Then also, you talked about your YouTube channel a little bit, so maybe just tell people how to find that; we'll put a link in the show notes as well, so they can watch you talk about some of these solutions and these competitions. Yes, so my online alias is code_report. If you search for that on Twitter, YouTube, or Google, I'm sure all the links will come up, and you can find me that way. Awesome, we'll link to that as well. All right, Conor, thank you so much for being on the show. It was a lot of fun to talk about these things with you. Thanks for having me on, this was awesome. You bet, bye bye. This has been another episode of Talk Python To Me. Our guest on this episode was Conor Hoekstra, and it has been brought to you by us over at Talk Python Training. Want to level up your Python? If you're just getting started, try my Python Jumpstart by Building 10 Apps course, or if you're looking for something more advanced, check out our new async course, which digs into all the different types of async programming you can do in Python. And of course, if you're interested in more than one of these, be sure to check out our Everything Bundle; it's like a subscription that never expires. Be sure to subscribe to the show: open your favorite podcast app and search for Python, we should be right at the top. You can also find the iTunes feed at /itunes, the Google Play feed at /play, and the direct RSS feed at /rss on talkpython.fm. This is your host, Michael Kennedy. Thanks so much for listening, I really appreciate it. Now get out there and write some Python code.
