A new story from Software Engineering Daily

Automatic TRANSCRIPT

And automated analytics on top, AIOps as we call it, as well as incident response, essentially alerting and notifying the users. Otherwise, in terms of the areas where we're solving this problem, it's log analytics, infrastructure monitoring, APM, real user monitoring, et cetera. So this is a cohesive solution. It looks like a single application, unlike, let's say, how we dealt with all these parts of monitoring in the past. Single user interface, the data is fully connected, it all relies on OpenTelemetry. So we believe that has improved quite a bit how effective and powerful, let's say, monitoring can be, because we bring everything in one place. It works at any scale. We have some of the largest customers in the world, some of the largest retail, ecommerce, you name it, using it at a very, very large scale. And that's our focus as a company in general. And the other big focus we have is analytics and AIOps, right? Now that we have all this data fully connected, not only can we, let's say, manually do more powerful troubleshooting, but it allows us to build analytics on top, to start connecting the dots for the user. And that's kind of the other foundation for what we have built, right? So OpenTelemetry, data fully connected, a single application, all in one place, and analytics on top, at enterprise scale really.

And when you talk about developing a product that's comprehensive in that sense, can you explain how it differs from what was available maybe two or three years ago, and what are the kind of engineering problems that you've been addressing that were not addressed in previous iterations of observability technology?

Sure. First of all, I think we built the Observability Cloud, as we call it, the way we built it because we have been responding to users and customers, right?
What we see is that, because of the increased complexity of these systems and how connected they now are, we don't monitor and troubleshoot infrastructure separately from applications anymore, right? When, let's say, an end user faces a problem, it's oftentimes connected back to the backend application. And that might be connected to the underlying infrastructure, which is usually in the cloud. So that was the problem the users were facing, and that's what we are responding to, right? The difference from the past is that typically we used to have some system that would monitor my infrastructure, maybe a different system that would monitor, let's say, my virtual infrastructure on top. I usually had a different system that monitored maybe my network devices. I definitely had a different APM system that only monitored my application. Oftentimes, the APM provider might have given me a real user monitoring application as well, but that was also not really connected to the way I was troubleshooting backend applications. So all of these were separate and not connected, right? So whenever a problem occurred, it was up to the user, oftentimes in a war room, with multiple, let's say, admins of all these tools getting together, trying to understand where the problem might be, right? And all the troubleshooting, all the advanced troubleshooting, was happening in people's heads, right? Because I had maybe some data in one system, but I had to connect it to the data in another system somewhere else, completely manually. So that was and still is the life of most, let's say, network operations centers in the world, right? And observability is trying to change that. And I think the whole industry is moving in the direction I described, in my opinion.
So when a user hooks into Splunk and they start collecting logs, metrics and traces, can you give a sense of the backend infrastructure that's storing that information? I'd just love to get a sense of the databases and the infrastructure that you use to serve that data, caching infrastructure, et cetera.

So first of all, trying to build a system that, as I said, is enterprise scale and can really handle that data volume in real time is a very hard engineering problem by itself. As I mentioned, all our, let's say, IP when it comes to data collection is part of OpenTelemetry and in the open source. So a lot of the additional value we provide is in how we deal with this data. We're usually dealing with structured and unstructured data. Metrics and traces tend to be very, very structured, and logs tend to be more unstructured. In any case, there is usually a processing layer for all this data as soon as it arrives. So we're trying, with very, very low latency, to ingest the data and route it to the appropriate place. And then when it comes to monitoring, let's say alerting, for example, it has to happen in real time, right? Our goal is to be able to alert the user within, let's say, ten seconds from the moment a data point is generated, right? So they know immediately, in real time, if they have a problem. So we have built, let's say, our own streaming metrics and monitoring infrastructure. That was a lot of the technology that SignalFx had built before the acquisition, and it allows us to monitor all this in real time as it comes into the system. Traditional time series databases tend to store all this data, let it rest, and then query the data for, let's say, alerting or dashboards. But that oftentimes means that you have to wait for.
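The contrast drawn here, between evaluating an alert rule on each data point as it arrives and querying stored data on a schedule, can be sketched roughly as follows. This is an illustrative Python sketch only, not the Splunk or SignalFx implementation. The names `DataPoint`, `stream_alerts`, and `batch_detection_delay` are hypothetical, and the simple "value above threshold" rule is an assumed example of an alert condition.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class DataPoint:
    metric: str       # metric name, e.g. "cpu" (hypothetical)
    timestamp: float  # seconds at which the point was generated
    value: float


def stream_alerts(points: Iterable[DataPoint], metric: str,
                  threshold: float) -> Iterator[DataPoint]:
    """Streaming style: check each point as it arrives.

    Yields the point at which the metric first goes above the
    threshold, and again after it has recovered and breached anew.
    """
    firing = False
    for p in points:
        if p.metric != metric:
            continue  # routing step: ignore other metrics
        breached = p.value > threshold
        if breached and not firing:
            yield p  # alert is raised at the arriving point
        firing = breached


def batch_detection_delay(breach_time: float, query_interval: float) -> float:
    """Store-then-query style: a breach is only noticed at the next
    scheduled query, so the delay is the gap until that query runs."""
    next_query = math.ceil(breach_time / query_interval) * query_interval
    return next_query - breach_time
```

In this sketch the streaming evaluator raises the alert on the breaching point itself, while the store-then-query approach can add up to one full query interval of delay on top of ingestion time.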
