Audioburst Search

Power Up Your PostgreSQL Analytics With Swarm64 - Episode 133

Automatic TRANSCRIPT

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you get everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

You monitor your website to make sure that you're the first to know when something goes wrong. But what about your data? Tidy Data is the DataOps monitoring platform that you've been missing. With real-time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, PagerDuty, and custom webhooks, you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.

Your host is Tobias Macey, and today I'm interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs. So Thomas, can you start by introducing yourself?

Yeah, hi, my name is Thomas. I'm CEO and co-founder of Swarm64, and I'm a strange beast, because I live at the intersection of business and data, data management, and programming. So that's what I do and enjoy very much.

And do you remember how you first got involved in the area of data management?

So probably the first real exposure to enterprise-grade data management and data
wrangling was as an intern almost twenty years ago, when I was working at Lufthansa, the German national airline. In the department they did something that today you would probably call data science; back then they called it sales steering. I basically pulled data out of the large IBM Cognos-based data warehouse, with all the beauty of OLAP cubes and the like. So that was my first exposure to that space. And being always at this kind of intersection point, as I mentioned, I very much enjoy basing business decisions on a vision of the truth, and I think the most objective vision you can obtain is really looking at the data, at the underlying facts. Then you can make much smarter decisions, because you're basically looking to prove a hypothesis as opposed to just arguing opinions. So throughout my career I've always been at that kind of cross section, and when I had the opportunity to found something in the data space I was very excited about it. And so we basically built Swarm64.

So can you describe a bit more about what Swarm64 is and some of the work that you're doing there?

So Swarm64 is an extension for the hugely popular Postgres database. I think to your listeners Postgres won't be a new concept, right? It's very widely adopted and hugely popular. What we do is we extend Postgres, and we are basically accelerating it for reporting, analytics, time series, and spatial workloads, and also hybrid workloads that include transactional and analytical components. So that's what we do.

Can you give me some of the backstory of how the business got started, and what it is about it that keeps you motivated and keeps you continuing to invest your time and energy in it?

That's a very good question, and it's also quite an interesting journey. So when we started this, and that sounds horribly stereotypical, this was actually started in a Berlin coworking space.
And so really, Michael and I met at a coworking space, and we started to go at it initially very much from the hardware angle. My co-founder had developed some of the earliest mobile GPUs, and we were basically looking at data processing from a hardware angle. As we evolved, we learned from interacting with our customers that everybody wants a full solution. You don't want to have some kind of piece that you have to puzzle together; you really enjoy having a full solution. For us, Postgres then came naturally as a system we could accelerate, and not only with hardware, which was our original take; we also built a stronger and stronger software component to it. So, as I will be expanding on later, you now have the choice between software and hardware components, which you can add as options. So yeah, that's how we started.

And I think the part that I particularly enjoy about where we've come since we started is that we're now in a situation where we can really challenge some market players. I'm talking about the big proprietary databases, which are really good products, but they are also very expensive, especially in the area of data warehousing. We can now lift Postgres, which is already fantastic, to a level of performance where it can suddenly compete. And this act of moving open source into spaces where previously only proprietary solutions could address the business problems, that's something I find very rewarding, because it's a little bit like playing Rocky Balboa. You know, you're the small guy, and you're going in and fighting to win the title against some of those really heavyweight champions. I find that quite rewarding. It's a big challenge, but that's kind of where the fun is.

And in terms of the bottlenecks that exist in open source Postgres,
what are some of the common ones that users run into that prevent them from being able to use it for the full range of use cases that they might be intending to? And what would lead them to maybe move their analytical workloads into a data warehouse or a data lake, versus continuing to run them in process with the Postgres engine and the Postgres database that they've been building up for their applications and their business?

Yes, I think this is actually already very well framed, because Postgres in itself, as we all know, has been around for thirty-plus years and it's a really mature and powerful product. However, I would say it has a blind spot in the area of parallelism and some of the things that hang together with it. When I talk about parallelism here, I talk about the ability to use those modern multi-core systems and deploy many, many cores, like tens or hundreds, to a single problem: the kind of MPP-style processing that those proprietary products already master. Postgres kind of got it as an afterthought. If you look at the feature called query parallelism, that was added in Postgres 9.6, which is already approximately twenty years down Postgres history lane, right? So it's something that has been added very late in the development cycle of the database, and whereas it's a great extension, and we love it being there, it is really not going as far as we personally believe it should. And that's why we are extending it. So query parallelism is one of the bottlenecks: when you're finding it difficult to deploy a lot of the cores in your multi-core system to your Postgres queries, then Swarm64 can probably help you.

Secondly, there is scanning large amounts of data that do not lend themselves to indexing, quite simply because an index is great if you try to find the needle in the haystack. But what if you're not trying to find the needle in the haystack?
What if you're trying to scan a range that is effectively a third of your table? Again, Postgres isn't very fast at scanning, so it really hurts when you try to run these kinds of queries.

Then another area is of course the concurrency of complex queries. So you have queries that fall into the first or the second category I've just been describing, and then you try to run multiple of them in parallel, and you will see how your individual Postgres workers are kind of scrambling for I/O and competing with each other. This is something that we also address. So complex query concurrency is also a challenge we see in the field.

And finally, and this is true for any database, certain query patterns are difficult to process, and there is always the question: should the user rewrite the query, or should you provide some additional intelligence, for example to execute certain anti-joins smarter, and things like that? This is really a kind of never-ending debate. Now, the default choice would usually be to rewrite the queries, but there is often a scenario where this is not desirable or where this is just not an option, because the queries could, for example, come from an application that the user can't touch. So query patterns that are difficult to process are kind of the fourth element.

So in summary: parallelism, scanning large amounts of data, many concurrent complex queries, and query patterns that are difficult to process. Those are the four areas where we see a lot of challenges in Postgres when you try to scale it to a large degree. And when I say a large degree, I mean we're talking about at least sixteen to twenty-four threads, which is eight to twelve cores, right? We're not talking here about your little system database running on ten percent of the server and holding maybe one or two gigabytes. We're looking at larger problems: hundreds of gigabytes, the terabyte range, something like that.

And in terms of Postgres,
it's a common database for application workloads and for being able to do some lightweight analytics. But what are some of the common bottlenecks that users run into that prevent them from being able to use it for all the different use cases, and that might lead them to use a dedicated data warehouse or a data lake engine for being able to do more intensive analytic workloads?

Yeah, thanks, it's a very good question. As you've already framed in your question, Postgres itself is extremely versatile, but it usually struggles as the quantity of data grows, and that generally tends to revolve around four different areas.

The first one is related to query parallelism. Parallelism was added to Postgres when it was already something like twenty-odd years old. So you're looking at the so-called query parallelism feature being added in version 9.6, and we're now at version 12, so this is actually only a few versions ago. That means that when Postgres executes queries, it doesn't utilize modern multi-core systems quite to the degree you would in a data warehousing context, where you are usually working with the MPP, massively parallel processing, paradigm. So Postgres is kind of holding itself back a bit, and it's also missing some of the features to move data during query processing to actually keep the query parallel for a very long time.

The second part is that scanning large amounts of data that do not lend themselves to an index is a challenge for Postgres. When you're implementing an index, you're usually solving the kind of problem where you're finding a needle in the haystack, or a few needles in the haystack. Whereas when you have a query that requests to scan effectively
a third of your table, an index won't help you much, and this is really where Postgres will then struggle.

The other part is: what if you have many complex queries, and they could be of the first or the second kind I was just describing, running concurrently? Many concurrent complex queries will really push the limits of your storage bottlenecks, the way you retrieve data, and the way data propagates through Postgres. So that's another area where we are seeing bottlenecks.

And finally, my fourth point is that no database is perfect at being able to execute every query in a perfect way. However, there are certain query patterns from the domain of data warehousing that really require a little bit of special processing, a little bit of optimization, for example handling anti-joins differently, and things like that. These are the query patterns that can really turn Postgres queries into so-called never-come-back queries.

So those are the four areas: parallelism, the scanning of large amounts of data, many concurrent complex queries, and then certain query patterns that just turn a query into a never-come-back query. Those kinds of things we are seeing. And in general, this is especially relevant when you're moving into hundreds of gigabytes, or terabytes and beyond. It tends to be a lot less relevant when you're at ten or twenty gigabytes of total database size.

And in terms of the use cases that benefit from this parallelism, the obvious one is the data warehousing use case, where you're being asked to perform large aggregates on data sets, maybe just for specific columns within a set of rows. But what are some of the other use cases that can benefit from the increased parallelism and some of the hardware optimizations that you're building into Swarm64?

Yeah, you've mentioned data warehousing; that is of course the obvious one. And in all honesty, data warehousing, I think, is a very, very catch-all expression.
It's really that there's a very wide variety of how you can have your underlying schema design, or what kind of queries you're asking. So that's already a very broad field. However, there are also other areas that are quite relevant. For example, anything that allows a kind of user-based dashboarding or reporting element. In other words, you may have BI tools, you may have custom dashboards, you may have a software-as-a-service solution that includes some customer interaction. Take salesforce.com as an example, right? These kinds of applications allow users to drill down, to aggregate, to find out what their current status is. So in other words, there's a lot of concurrent reporting and dashboarding happening, and these kinds of problems we are able to address very effectively. It's kind of coming back to the point I mentioned earlier: many, many concurrent complex queries. So that's another use case where we see ourselves being very popular and a very good solution.

And then another area is what might be called new developments, in the area of geospatial data, for example, or machine learning. Just as an example, we did a project with Toyota in Japan around the subject of connected cars, analyzing geospatial data and also looking for a certain predictable response time. We were able to keep that response time window for much, much longer than standard Postgres without the acceleration. If you then translate that back into cost, we actually found that we could get away with much less hardware, and as a result of that you would basically lower your costs by as much as seventy percent. So that's one area: geospatial data and time series data processing. But again, time series probably in context. We're not trying to be the next TimescaleDB, but we are allowing people to process time series if they have the need for it,
in addition to, for example, the geospatial data, the reporting data, et cetera.

And then, as I mentioned, the other area is machine learning. It's very interesting when you need to have a certain kind of response speed, that kind of snappiness that Postgres generally has, and combine that with feeding in a lot of machine learning data and at the same time pulling out data to feed your models. This is something that we are doing with a company called Turbit. They are in the renewable energy space, optimizing wind turbines for energy generation and how they're actually positioned, and also looking at predictive maintenance cases.

So these are just some areas. On one side, the big data warehousing space, with many, many different use cases in that field, but also things related to dashboarding and reporting, and then of course any new developments: geospatial data, with the immensely powerful PostGIS extension, and the machine learning space. Those are some of the areas that people find very interesting.

And then, because of the fact that they're able to get this improved performance out of their existing Postgres database, it removes the necessity to do a lot of the data copying and can simplify the overall system design. I'm wondering, what are some of the other benefits that can be realized by keeping all of your data in the database engine, but what are some of the challenges that it poses as well, particularly in the areas of doing things like data modeling within the database, or, for the data warehousing use case, being able to generate some of the history tables so that you can capture changes over time, and things like that?

Yes, so that's a very good question. Let me frame the environment a bit. This is very much thinking in the Postgres world, right? And what I mean to say there is that Postgres is a very schema-based database, right?
Postgres has pretty good document store capabilities, but again, with everything you encounter, the way you work with the database, the schema is really at the heart of it, and things that are maybe schemaless, or schemaless for a certain time, you then use special operators to work with. That's a very conscious choice. But in general, if you're comfortable with a world of SQL, a world with defined schemas, then this will be versatile, and you will be able to process certain elements, like for example elements in arrays or certain schemaless elements and documents. That is all possible, but your base assumption should be around a schema-based world. That's something that's quite important. So if you're in an environment where you're willing and comfortable working with an explicitly defined schema, you will find this extremely versatile. And as you've already mentioned, you'll be able to find solutions for your different problems, like being able to time travel in your data, being able to audit what has happened in terms of changes, and so on. So if you're thinking Postgres, if that's the mindset you like, you will get very far into all sorts of spaces: data warehousing, logging, geospatial data processing, time series data processing, machine learning. Those kinds of things you'll be able to expand into while staying within your comfortable Postgres working paradigm. I think that is the key qualifier. If you're happy with SQL, if you're happy with Postgres, this is then extending it very naturally.

And increasing the processing throughput of the database can be beneficial for things that are compute intensive, like being able to parallelize the queries. But how does that shift the overall bottlenecks and impact the I/O in terms of the throughput of the database?

Yeah, I mean, the obvious thing that always happens when you're moving to MPP is you run into this I/O bottleneck
you mentioned, right? Suddenly, in many, many queries it becomes the question of how fast you can fetch the data. And one of the things that we did, and we'll probably touch on some of the other things when we talk a little more about architecture, was that we created our own storage format, and that is a hybrid row-column format. It has some columnar indexing. The reason why we went hybrid row-column is that Postgres itself is a row store, and as I always say, it's very difficult to teach a row store complete column store tricks, right? You'll possibly end up with the worst of both worlds. So we embraced part of the row store concept and built that kind of row-column hybrid format that allows you to still process queries in a way adhering to the Postgres logic. We are compressing those. Postgres has its little data pages, and we're keeping somewhat bigger data pages that are also compressed. This generally tends to work very well. And then there's some columnar indexing, as mentioned, to allow us to be a bit smart and not retrieve everything indiscriminately for the query, so you can be a bit selective. I would guess some kind of skip reading, or a certain range index, would probably be the closest thing there.

And all of that is kept in a format, and this format can be processed by the CPU, but it could equally be processed by hardware extensions, like for example FPGAs. So we're looking here at two things, FPGAs and SmartSSDs, that are capable of reading this format and then doing a lot of the processing along the way. That basically resolves the I/O bottleneck with compression, a special layout, selective fetching, and then processing on additional hardware.

So digging further into the actual implementation of Swarm64,
you mentioned that it's a plugin for Postgres, but can you talk through some more of the technical details of how you approach that, and some of the evolution that it has gone through since you first began working on the problem?

So what's pretty good is that we came into it having already built other database extensions, so we were really looking into, okay, what were the lessons we've learned? And we made the conscious choice to stay on an extension level with Postgres. In other words, we would not go in and build our own Postgres. As many of your listeners probably know, a lot of the popular projects and products in the market are actually Postgres derivatives. You have the examples of Amazon Redshift, IBM Netezza, or for example Pivotal Greenplum, which were all once upon a time a version of Postgres that was then taken private and formed into a new product. We decided not to do that. So we started with looking at, okay, where are extension hooks that we can use, where are certain APIs that we can use, and we started expanding from there.

And Postgres is very, very versatile in that space. I mean, it's probably among the most extensible databases, even including closed source databases. What you can do is define certain ways in which data is accessed, for example a custom scan provider. You can define ways in which your data is stored. We started with foreign tables, the foreign data storage API, because there was no native storage engine API yet at the point we started. That is now in version 12, and we are very eager to see how this table storage API will evolve in the future. We may actually go much more in that direction. But for now it's really a combination of defining certain table sources, in our case via that foreign table API, combined with certain access paths that we can define, certain query planner hooks
we can provide to Postgres, and certain cost functions. So it's really been designed very, very well in terms of extensibility. You can just offer yourself to all these different extension hooks, and then your respective function will be called, and you have the ability to tell bog-standard Postgres about all the great things you can do in addition. This is how we worked. It's not the easiest way of working, but it's in a way the most rewarding, because on the one side you're really benefiting from the lower effort and overhead to move between Postgres versions, and secondly, it was actually very easy for us to support other solutions. For example, our product also works with EnterpriseDB. EnterpriseDB's Postgres Advanced Server is actually not open source, and still we were able to compile for it and run on it. So now you can also use our product with a solution like EnterpriseDB, and that would not have been the case if we hadn't gone for the kind of modular, pluggable architecture that Postgres is offering us.

Now, that is how we work our way into the system. Let me just cover a few parts of what we are actually doing. If we take the anatomy of a query: it goes into the system, and in addition to all the different data handling mechanisms Postgres has itself, we are offering it additional ways to process the query. For example, we offer to move data around during the query, so-called shuffling, so that the query can stay parallel longer. That's one of the things we do. We offer Postgres our own join implementation, specifically optimized for joining very large amounts of data. If you want to join tables that have a few million or even a few billion rows themselves, that is something that can very quickly bring Postgres to its limits. We have a special join implementation for that.
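As an illustration of the shuffling idea described above, the following is a minimal Python sketch, not Swarm64's implementation. It assumes rows are plain dictionaries, uses hash partitioning on the join key to redistribute both inputs, and then runs an ordinary build/probe hash join on each partition independently. The function names (shuffle, hash_join, parallel_join) are hypothetical, and threads only stand in for the parallel workers a database would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def shuffle(rows, key, n_partitions):
    """Redistribute rows so that equal join keys land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions


def hash_join(left, right, key):
    """Plain build/probe hash join on a single partition."""
    table = defaultdict(list)
    for row in left:                       # build phase on the left input
        table[row[key]].append(row)
    joined = []
    for row in right:                      # probe phase with the right input
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def parallel_join(left, right, key, n_workers=4):
    """Shuffle both inputs by key, then join each partition pair independently."""
    left_parts = shuffle(left, key, n_workers)
    right_parts = shuffle(right, key, n_workers)
    # Threads are only a stand-in for parallel workers (assumption for the sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(hash_join, left_parts, right_parts, repeat(key)))
    return [row for part in results for row in part]


# Hypothetical sample data for illustration only.
customers = [{"cust": i, "name": f"c{i}"} for i in range(5)]
orders = [{"cust": i % 5, "order": i} for i in range(20)]
joined = parallel_join(customers, orders, "cust")
```

Because rows with equal keys always land in the same partition, no worker needs to see another worker's data after the shuffle, which is the property that lets the join stay parallel.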
So that's something that is offered to Postgres; it can pick it if it wants to. We offer certain rewriting patterns, so if we notice that something is going to be executed very badly, because for example it is going to be some very linear execution mechanism as opposed to doing it in parallel, then we will offer that to Postgres, and the planner will then pick and choose. Once the query is planned and gets executed, we have the matching executor nodes for all these things, and we also have the accelerated I/O I was mentioning before. And when it comes to processing, we can sometimes actually offload the entire query to the hardware accelerators. Those are optional accelerators: you can use FPGAs, or you can use Samsung SmartSSDs. Those FPGAs from Intel or Xilinx, or SmartSSDs from Samsung, will then receive instructions, process data according to the query, and only return the results. So all in all, there is a host of different functions that we are offering to Postgres, and the query planner will kind of choose, like from a buffet. And if you have the optional additional hardware acceleration, we will also offload and push down a lot of the processing directly to the additional hardware, thereby making your system even more efficient.

And then another element of this equation beyond Postgres is the available hardware. You mentioned FPGAs and SmartSSDs, and I know from looking through the documentation that you also have support for Intel Optane persistent memory. I'm curious how the overall landscape of hardware availability has evolved since you first began working on this, and some of the challenges that things like the cloud pose for people who are interested in being able to leverage the specialized hardware that you're able to take advantage of.
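The buffet analogy in the previous answer can be pictured with a toy cost comparison. This is a rough Python sketch under stated assumptions: the Path class, the path names, and every cost number are invented for illustration and do not reflect the actual Postgres cost model or Swarm64's cost functions. It only shows the general principle that an extension can offer additional candidate paths and a cost-based planner then picks the cheapest one for the estimated row count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One candidate execution strategy (hypothetical, simplified)."""
    name: str
    startup_cost: float    # fixed cost paid before the first row is produced
    per_row_cost: float    # cost of processing one row on one worker
    workers: int = 1       # number of parallel workers the path can use

    def total_cost(self, rows: int) -> float:
        # Assumes per-row work divides evenly across workers.
        return self.startup_cost + self.per_row_cost * rows / self.workers


def choose_path(paths, rows):
    """Pick the cheapest candidate for the estimated number of rows."""
    return min(paths, key=lambda p: p.total_cost(rows))


# Built-in style path plus one extension-provided path; all numbers are made up.
CANDIDATES = [
    Path("serial hash join", startup_cost=10.0, per_row_cost=1.0),
    Path("parallel shuffle join", startup_cost=500.0, per_row_cost=1.0, workers=8),
]

small = choose_path(CANDIDATES, rows=100)
large = choose_path(CANDIDATES, rows=1_000_000)
```

With these invented numbers the serial path wins for the small input, because the parallel path's startup cost dominates, while the parallel path wins for the large input. That is the sense in which the extension only offers options and the planner decides.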
Yeah, that's a very good question, and that's also something where we are happy that the market has really moved in, in our opinion, the right direction. When we started with this, as I mentioned earlier, we came from a very hardware-driven world, and we were very early on using these FPGAs as a prototyping platform for database processing. Then many changes happened in the market. On the one side, Intel has really been at the forefront of moving FPGA devices into the data center, and then Xilinx also followed. Then Amazon, already years ago, introduced an FPGA-based instance into their cloud, and from there onwards it's really been step by step, and more and more clouds are enabling data-center-grade FPGA accelerator cards.

In terms of cloud support in the context of Swarm64, there are actually now too many to mention everyone that supports FPGAs, but let me just mention the ones that we directly support. Of course, you've got Amazon. You've got OVH, a large French data center provider. You've got Nimbix, a big US-based high-performance data center, and it's public knowledge that Azure is coming out with an FPGA instance. Those are just some in the market, and the players that we are focusing on at the moment. There you can really get access to FPGAs in the cloud quite easily through the instance types, which makes it ever easier to deploy. On premise, those can be obtained through
OEMs. They are basically extension cards. They look like GPUs, more or less, just with a very different profile of what's inside, but if you're just looking at the card, it looks more or less like a GPU. So nothing new and exciting outside of the box, but of course it gets quite exciting when you look inside.

And then another area of complexity is that, because you are acting as an extension to Postgres, you need to be able to support whatever different versions people are running in their installations. So while there might be a new feature that simplifies your work in version 12, as you mentioned with that table storage API, you still need to be backwards compatible with whatever Postgres versions you're supporting in order to reach a wide range of adoption. So I'm wondering how the overall evolution of Postgres has impacted the product direction and the engineering work necessary on your end to build in all the features you are trying to support, as well as the challenges that you're facing in terms of supporting the range of versions that are available for people who are running Postgres.

So sometimes people are directly plugging into the existing database, but in general we are proposing a one-time backup and a one-time restore, quite simply because when we are deploying to our clients we usually give them a container-based deployment. I know there may be some people who are religious about the teeny tiny bit of performance a containerized approach might cost, but in terms of ease of deployment it makes things so incredibly easy that in the predominant number of cases we actually manage to convince the client to do it this way. And to be honest, eighty to ninety percent of the clients are already very happy with just going with the container. So when you're actually getting Swarm64, you will be getting kind of a matched set.
It will be a bog-standard Postgres, but of the right version that we recommend at that moment, combined with our extension and with all the relevant settings you need, in a container. And if you're using FPGAs or Optane persistent memory on your system, it will also have all the right configuration parameters to make it really easy to deploy to this hardware, so you're getting almost cloud-like comfort there. By the way, we're basically doing the same with our machine images for the different cloud instances I've been mentioning. So we really think it's much more convenient to do a one-time backup and restore, and then not fight with any configuration parameters or any details, than to actually try to retrofit into every single Postgres version that's out there.

However, having said that, we will also make the deployment into more broadly available Postgres versions easier, so that maybe half a year down the line there will be a way for you to extremely quickly just install the extension into something, probably from Postgres 10 or 11 onwards to 12 or 13. So a fairly broad window of versions that we will just support out of the box.

And just to pick up the detail in your question there about the Postgres storage engine: at the moment we are not utilizing the storage engine, because we are actually waiting to see how this evolves. But you're right, once we've actually made that pivot from the foreign data API to the storage engine, that will be forcing us to basically keep two versions maintained, so depending on which Postgres version you run, we would support you in one way or the other. So that comment is true. But in general we've so far, knock on wood, been quite successful in keeping pace with Postgres.

And then the other element of compatibility is in the extensions that people want to be able to use alongside your work at Swarm64. I know you already mentioned PostGIS, which is one of the better known extensions in the ecosystem.
But what are some of the other ones that people will commonly look to use alongside Swarm64? And what are some that you know to be conflicting, that won't work if they're using your extension?

Yeah, let me maybe try to answer that question a little more on a high level. In general, people love using extensions. Not only is PostGIS extremely popular, but so is any kind of extended data type. Custom data types, which are really one of the strongholds of Postgres, are of course important to support, and that's something we do. So I would say custom data types and that kind of custom functionality around Postgres extensibility is really what we see most. Now, what does not work with Swarm64 in the current version? There's a change coming, which I will tease a little bit, but in the current version, as I mentioned, we're keeping our own storage of the data, so anything that relies on how Postgres data is stored will not conflict, but will require a workaround. In other words, we generally advise people to use a mix of what we call native tables, which are the ones that do not have this columnar store, the standard Postgres tables, and also some of those accelerated tables, and generally to mix and match them. It's horses for courses. Now, when you then use a solution like, for example, a background backup tool that invisibly just copies pages, that usually relies on some knowledge about how Postgres data looks, and hence it will run into trouble when trying to work with Swarm64. Similarly with replication schemes that are based on, for example, how the data is stored on disk: again, a similar issue. However, we have recognized that for customers it is sometimes quite useful to be able to just retain the data exactly as they store it, and so in an upcoming product version we will be looking more into what we call a complete drop-in, where people have more of a choice.
They still have the ability to get the extreme acceleration for certain kinds of data. Maybe these are append-only data; we were talking earlier about history tables and things like that, which would fit perfectly into our format. However, you may have data that is, for example, replicated between multiple Postgres databases, et cetera, where you would choose a different storage format. And this is really where the upcoming product versions will go: they will allow you to keep your source format for the cases where it makes complete sense and still give you a higher amount of acceleration, as well as use a bespoke analytical storage format for the cases where you want extreme performance.

And digging more into the FPGA capabilities: I know that most people are going to be familiar with the concepts of the CPU and the GPU as a co-processor, and with some of the relevant statistics of those different pieces of hardware for being able to select one that will fit well. For people who aren't familiar with FPGAs or who haven't worked with them closely, what are some of the benefits that an FPGA can provide, particularly in the database context? And what are some of the specifications that they should be looking at when they're selecting one for installing into their hardware or for deploying into a cloud environment?

So let me take a look at what an FPGA actually is. It's a configurable fabric of elements. It's like a blank sheet of paper that, when it wakes up, is told what it should be, and it could, for example, be a piece of sheet music for a symphony or something. It's quite similar: it's like a sheet of paper being configured to be the processing logic you need. And how this translates to the area of databases is that we turn it into a piece of processing logic that processes the individual data points in your data storage as they move through the chip. So storage is moved through the FPGA. We've turned the entire fabric, the entire
FPGA processing, into custom logic for database processing. Now, some people ask us: do you compile a query into a specific configuration of that FPGA, so that it does only that? No, we don't. We instead use the FPGA as a very SQL-specific but still quite versatile processing unit that does the processing. If you're storing data, the compression is done on the FPGA; or, if you're reading data, you decompress and then execute the SQL query, and all of that happens while data is flowing through the FPGA. You have fantastically fine-grained control over how data moves, and I would say this is probably the single biggest difference between CPUs and GPUs on the one side and FPGAs on the other: because you have that ability to reconfigure, you can make something very custom, and because you can make it custom, you can make sure data moves efficiently. I enjoy a little bit of GPU programming as a hobby, and the challenges you have in making sure that your processing happens effectively, the kind of knowledge you need about all your cache hierarchies and how data moves, that is something you do not need to consider in the FPGA context, because it's all determined by yourself. You are actually defining how data moves, and hence you can make it extremely effective and extremely efficient. So I would say that's one of the core elements. And then finally, another interesting element is this reconfiguration within split seconds. One of the processing units we have, for example, is a streaming text processor that is capable of finding wildcard-based strings, so strings with fuzzy matching, inside your data as it moves through. It's very effective, but as you can imagine, that takes a little bit of space. So with the FPGA being reconfigurable, you could effectively, depending on your workload, have those units included or excluded.
If you have, for example, a nightly or weekly load window, you could reconfigure your FPGA, and that's basically something that our database does in the background: it reconfigures the FPGA to be all writers, then it does the nightly load, and then it turns back into all readers to do the daily query processing. So those are examples of what makes the FPGA kind of unique versus CPUs and GPUs: on the one hand, the ability to really define how my data flows, and on the other hand, the ability to reconfigure, so you can actually shape-shift the device to match your need.

And then for people who are adopting Swarm64, and particularly the hybrid row-column store, how does that impact the way they should be thinking about data modeling and table layout? And also, for people who are working with very large data sets, are there any partitioning or sharding considerations that they might have?

It's pretty close to Postgres itself. In general you can partition the data, but quite often it's not needed, because partitioning is often a requirement to overcome certain performance bottlenecks, and we don't necessarily require that. However, it is entirely possible to do it. So in terms of paradigms that you need to learn, it's really quite standard Postgres. It's just another storage format you can choose, and you can get additional benefit. Having said all that, this storage is really kind of an expert option; it is there to get the best possible performance. When people use the product, they will already get benefit from all the other features, and the additional storage is kind of the icing on the cake. So what we generally recommend to our customers is: start slowly and work your way into it. We don't advocate any big migrations or any big changes. In particular, when you're coming from Postgres, there are usually a few small tweaks you can do, and you will see dramatic differences.

In terms of your experience of building this product and growing the business around it,
what are some of the most interesting, unexpected, or challenging lessons that you've learned?

One of the things I found really interesting is to see how customers, and also our excellent solution engineering team, have actually solved some problems. It shows you the boundaries of your product, and it being used in ways it wasn't really intended to be used, which was really fun to see. So let me just give you one specific example. There was an issue around query processing speed, and that was already a year and a half ago, so with a much earlier version of the product. Essentially, the way the customer and one of our solution engineers got around the problem was that they turned everything into a Swarm64-based table format, like I've been describing earlier. It was really a transactional table they started from, but it was so cheap to make that secondary copy, because everything was very quickly ingested and accelerated by the FPGA, and then it was also so cheap to process once it was in that format, that the entire round trip was still significantly accelerating the query. That was really unexpected. If you think about it: okay, I have a table and I'm doing a very heavy operation on it. And then, wait a second, you actually don't; you take a copy of the table, and then you run the operation on the copy of the table, and you're still faster than processing the original table in the first place. That was quite fascinating to see happen, so that was really a learning point. Now, in a way that was also a little quirky, so with newer versions we would now recommend a different design. But at the end of the day, it's really, really fun
seeing your product being used and seeing some of the performance benefits being put to quite unexpected uses.

And when is Swarm64 the wrong choice, where somebody might be better suited either just using vanilla Postgres or some of the other players in the ecosystem, or migrating to a different set of database technologies?

So generally, if your problem is small or your system is small, I think we're probably not the right choice. For example, some people say, "Oh, I'm running a database server," and what they really mean is they have four or eight physical cores, so eight or sixteen threads. This is really the kind of level where added parallelism becomes a little pointless, because there aren't that many cores to go around in the first place for what you want to parallelize. Similarly, Postgres itself performs pretty well, even with these kinds of challenging industry benchmarks, if you're moving in areas like 10 gigabytes or 30 gigabytes of data. So I wouldn't say that for anything in that range Swarm64 would really be relevant. But it can sometimes already be relevant for 100 gigabytes of data, and then as you move up from there, into terabytes, tens of terabytes, hundreds of terabytes, that's definitely a range where we feel very comfortable. So too small a system or too small a problem, or often the combination of the two: that's something that's definitely not well suited for us. And then the other part, as I've mentioned, is that we're not trying to introduce a new tool and invent a new paradigm, so you should be looking at Postgres, as in: you may be using Postgres already, or you may be considering Postgres. I think this is also a kind of qualifying criterion, so to say. If you really want to work on something that is, for example, NoSQL style, you shouldn't be looking at a Postgres extension, right? So that is, I think, another point. However, it doesn't mean you have to be on Postgres already.
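As a rough illustration of the sizing guidance in this answer, the figures mentioned (four or eight physical cores, 10 to 30 gigabytes, 100 gigabytes, and the terabyte range) can be written down as a small Python sketch. The function name, the exact cut-offs, and the treatment of data sizes between 30 and 100 gigabytes are assumptions made for illustration only; this is not a Swarm64 tool or official guidance.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    physical_cores: int  # physical CPU cores on the database server
    data_gb: float       # approximate size of the analytical data set, in GB


def rough_fit(d: Deployment) -> str:
    """Hypothetical rule of thumb paraphrasing the conversation, not vendor guidance."""
    # "Four or eight physical cores": too few cores for added parallelism to pay off.
    small_system = d.physical_cores <= 8
    # "10 or 30 gigabytes": plain Postgres is described as performing well here.
    small_problem = d.data_gb <= 30
    if small_system or small_problem:
        return "probably not a fit"
    # Around 100 GB it "can sometimes already be relevant".
    # Using 1000 GB as the upper bound of this band is an assumption.
    if d.data_gb < 1000:
        return "sometimes relevant"
    # Terabytes and above: the range described as comfortable.
    return "comfortable range"


if __name__ == "__main__":
    examples = (
        Deployment(8, 500),
        Deployment(32, 20),
        Deployment(32, 100),
        Deployment(64, 5000),
    )
    for d in examples:
        print(d, "->", rough_fit(d))
```

The check uses "or" rather than "and" because the answer treats a small system and a small problem as each being enough to rule the product out, with the combination of the two being the most common case.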
We find people who are looking at Postgres coming from those proprietary data warehouses we were talking about in the beginning, and for those we can actually be an excellent choice.

And then as you look to the near and medium term of the business and the technologies and the evolution of the Postgres ecosystem, what are some of the things that you have planned?

So in general, this notion of becoming more and more invisible, I would say that's the overarching concept. If I imagine where we're going, looking from where we are now: you start with a server, you add some Optane DC, and you press "install extension", and there's Swarm64 using the Optane persistent RAM from Intel, and it will just be doing everything invisibly. You'll get the acceleration. The same with an FPGA card, maybe on a cloud instance. You say, okay, I want to use a certain FPGA-enabled cloud instance, or I'm installing an FPGA card into my server, or I'm buying a new server that already has an FPGA card, say from Intel or Xilinx, or a SmartSSD, or an array of SmartSSD drives from Samsung. And then all you do is install the extension. We detect the hardware and we just adapt. That's really where I'm seeing ourselves going in the future. We've been able to show very, very good performance with the product we have now. It can be dramatically different, like 50x on some queries; usually you see 10 to 20x, depending on your workload. So big, big acceleration. That's great, but we now want to make it easy to use, and that's really where it's going. So you're adding hardware, or you're ordering a server that has new hardware, and then using our extension you'll be able to use it very effectively, and it will all fall into place behind the scenes. We've got some pretty promising prototypes of that running in our lab, and so I'm very confident
we'll go that way and will become more and more invisible, apart from, of course, the massive performance differences that we want to make for our users.

Are there any other aspects of the work that you're doing at Swarm64, or the Postgres ecosystem, or some of the analytical use cases that we've highlighted, that we didn't discuss and that you'd like to cover before we close out the show?

One thing I want to mention is a big shout-out to the community. We've managed to get a first patch through, so this has now been pushed, which was great. It's about making it easier to also back up foreign tables in the Postgres environment, so that will go into one of the new upstream versions of Postgres; it should be there in version 13. So a big, big shout-out to the community for that. In general, we're seeing ourselves as a member of that community, so we are looking all the time at what can be contributed. We're also looking very much into the initiatives of the community around this persistent RAM, Optane DC, and of course FPGA accelerators. So a big shout-out to all the companies there in the Postgres ecosystem. It's a lot of fun to be there, because you've got so much support for this database.

So for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll definitely have you add your contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Okay, I would say a really, really powerful open source visual BI tool that can interact with these different databases. I think that is something that could be quite transformative. So think about an open source Tableau, with that kind of power and capabilities. I don't want to discount any of the projects that are out there,
but I think there's definitely room for one of those existing projects to grow into a really feature-rich and easy-to-use visualizer stack: just connect to different database back ends and then just run. Maybe I'm overlooking something obvious, but from all the tools we've been using at Swarm64, all the open source ones, we didn't find something that is quite as powerful as some of the proprietary offerings out there. So that is maybe something that could be quite transformative in getting people to think more about data management and to utilize databases, in the context of those tools, to the maximum, more than they are using them today.

Yeah, I can definitely agree with that: there are a bunch of great point solutions or great systems that have a lot of features but aren't necessarily very accessible to people who don't want to dig into a lot of custom development for them. So I'll second your point on that. Thank you very much for taking the time today to join me and discuss the work that you've been doing with Swarm64 in trying to optimize the capabilities of Postgres and simplify people's use cases there. It's definitely a very interesting project, so I thank you for the work you're doing, and I hope you enjoy the rest of your day.

Thank you very much. Great to talk to you.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
