Procella: YouTube's super-system for analytics data storage

Linear Digressions


Hey Katie. Hey Ben. What have you been up to lately?

I've been spending a lot of time on YouTube, actually. There are a lot of good YouTube videos on different topics: music, also data science. Have you been spending any time on YouTube recently?

I wouldn't say I've been watching more YouTube than usual or anything, but I have been reading about it a little bit, and that's kind of what we'll talk about today.

All right. You're listening to Linear Digressions.

So is reading YouTube a thing, then?

I think reading papers about YouTube is a thing; it's fair to say that.

I'm not aware of anyone who does that. I know people who listen to music on YouTube and people who learn how to do stuff on YouTube. I'm not sure I know very many people who just read on YouTube, though. So what is it that you were reading?

Right, so I was reading about a system that they have built over there called Procella. I believe I am pronouncing that correctly. It's spelled slightly differently than it sounds: P-R-O-C-E-L-L-A. This is the data system that they have set up to power four different types of analytics use cases, which I think is really interesting, because usually, if you have four different types of analytics use cases, you would have different systems and you'd be shuttling data back and forth. Instead, what they've done at YouTube is build one unified system, and Procella is the name of that system. I was reading a sort of white paper that describes how it works and why they made some of the design choices they did.

One system to rule them all.

That is the goal.

So let's start with: what are the different use cases that you want to enable?

So this is all under the bucket of analytics data serving. If you've been listening to some of our other more recent episodes, you know that there's also another high-profile use for databases, which is transaction handling for running applications. We're not talking about that stuff, by and large.
This is just data that's being used for various analytics needs, and there are four of those needs. Number one is the data that powers reporting and dashboarding. So imagine that you are, for example, a creator or a video author on YouTube. You have some reports and dashboards where you can see how many views your videos are getting and that sort of thing.

Right, so I guess that data has to come from somewhere, and normally you would build a system just to handle that data.

Right. So let's talk about the needs of each of these systems as we go through them. For reporting and dashboarding, because we are doing that on roughly the scale of the number of creators or the number of videos, there is a very high volume of data and of queries that you have to be able to serve, because potentially the entire user base of YouTube might be interacting with that data. These people probably have an expectation that the data is pretty fresh, so if someone watched a video of mine ten minutes ago, I want to see that reflected in my report or my dashboard. So you need to be able to see things, maybe not necessarily in real time, but pretty fast after they happen.

Yeah, I mean, you launch a video and you want to get a pulse on how well it's doing. Let's say you get roughly ten million views on a video. When you launch it, you're going to get a lot of views really quickly, and you don't want your dashboard to say zero views for five hours.

Right, so you want to be able to access fresh data, and you want to do it with a fast response time. I don't want to log into my dashboard, run a query, and then have to wait thirty seconds or a minute, or who knows, multiple minutes, for it to return something back to me.

I definitely have had that experience on some websites, where most of the pages on the website or web app or service are fast, and then you go to the charts view or the analytics view, and you see a lot of spinners as these things happen in the background.

Yeah, so that's not a good user experience.

So the second use case is embedded statistics. This is when you're actually in the application and there are those little tickers: how many times a video has been viewed, how many thumbs up it's gotten, how many thumbs down.

So the statistics that are actually embedded in the user experience. You're not going to a dashboard view; you're clicking around YouTube and you see these statistics, which are maybe a little more general or rougher, in the actual user interface.

Yeah. I guess that's starting to flirt with something that looks more application-based, but it's analytics because it's looking at potentially all of the data in the database to compute those numbers. So this is also a pretty large volume of queries that you have to be able to handle. They say on the order of millions of queries per second, because I guess that's roughly the frequency with which videos are viewed, which is probably what's driving most of that.
