Canada discussed on Software Engineering Daily

Automatic TRANSCRIPT

How do tags play a role in designing an observability workflow?

Tags, you mean tags like, essentially, the ones that might come from the data that flows into the system?

Yes.

So again, tags give structure, right? Ideally you have, essentially, key-values that come along with the data and describe where the data might have come from. Let's say you might have the Kubernetes pod where these logs or traces came from, right? Or you have the service name, and a lot of other details. So these are very, very important pieces of information as I'm trying to slice and dice and troubleshoot an application. And that's where the analytical approach I described earlier comes into play as well, right? You have to be able to store all of that in a way that allows you later to aggregate the data dynamically, in any possible way. Meaning, let's say I have an application that is facing some issues. Oftentimes I might want to ask, is this for all the users, or just, let's say, users that come from iOS? Or users that come from, let's say, Canada? Or users that run a very specific version of our client? And let's say I isolate it down to a very specific subset of users. Then I might want to ask, okay, does this happen for all the types of requests, or just the requests that hit a database? And maybe I isolate it further. And maybe I can keep asking iteratively: okay, now that I know the types of users and the types of requests, does this happen on all my infrastructure, or only on the transactions that hit a particular pod, let's say? Show me all the pods and how they behave. So obviously, as you stack all these kinds of questions on top of each other, you create an explosion of dimensionality, and of course that's where the data comes in, but that's also why you need a way to handle all of that at any scale. Makes sense?

It does, yes. So do you work at all on frontend monitoring?

Yes.

So is there a way to connect frontend traces to backend infrastructure?

Correct. So the way we have built, I guess, what's called real user monitoring, and error reporting on top of that, for things that occur on the client side in JavaScript, is that we use the same, let's say, principles in a very similar way: full fidelity, where we capture every transaction. And as a result, that allows us to fully connect the data to the backend. Because traditional, let's say, end-user monitoring collected a very small sample of the user interactions and another small sample of the backend interactions, and the two samples were collected independently of each other. So it was almost impossible to connect the two. Every time I had an end-user trace, it was very unlikely I would have the backend trace as well. Anyway, in our case, because we collect all the data, both on the client and the backend, the traces are fully connected. So if my user is facing a problem, it allows me very, very quickly to know if the problem comes from the frontend, or if it's a backend issue that has propagated to the frontend, and then to further iterate and solve it. And it's worth noting that the data collection technology we have built for JavaScript we are also now contributing to OpenTelemetry, so it will be available for everyone to use as well.
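To make the two ideas above concrete, here is a minimal OpenTelemetry sketch in Python: it attaches key/value tags (span attributes) that can later be sliced by dimensions such as pod, client OS, or country, and it continues a trace started in the browser by extracting the W3C traceparent header a RUM agent would have propagated. The service name, attribute names, and values are illustrative assumptions, not Splunk's actual implementation.

```python
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Basic SDK setup: print finished spans to the console for this sketch.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")  # hypothetical service


def handle_checkout(headers: dict, user: dict) -> None:
    # Pull the trace context the frontend propagated, so the backend span
    # becomes a child of the browser span instead of a disconnected trace.
    ctx = extract(headers)

    with tracer.start_as_current_span("POST /checkout", context=ctx) as span:
        # Tags / attributes: the dimensions used later to slice and dice.
        # All names and values below are invented for illustration.
        span.set_attribute("k8s.pod.name", "checkout-7d9f4-abcde")
        span.set_attribute("client.os", user["os"])            # e.g. "iOS"
        span.set_attribute("client.version", user["app_version"])
        span.set_attribute("geo.country", user["country"])     # e.g. "CA"
        span.set_attribute("db.accessed", True)
        ...  # actual request handling would go here


# Example call with a traceparent header a browser RUM agent might have sent.
handle_checkout(
    headers={"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
    user={"os": "iOS", "app_version": "3.4.1", "country": "CA"},
)
```

Once every backend span carries attributes like these, "is this only iOS users in Canada on version 3.4.1 hitting the database?" becomes a group-by over those dimensions rather than a manual hunt through logs.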
How does Splunk fit into a proactive monitoring workflow? Or how do you build a proactive monitoring workflow around Splunk, rather than just having to be reactive to failures?

First of all, even when it comes to, let's say, reactive monitoring, one of the principles we have is that we want to do our alerting in real time, right? So if you have a problem, you will know as quickly as it happens, and you can react to it much more quickly. Let's say it won't take minutes from the moment a problem happens until you know about it, assuming there is an alert in place for it. Now, beyond that, we go back to what I was describing before. Because we now have a lot of data, a lot more than we traditionally had, and generally the data is more structured and has a lot more dimensions in it, we can be more proactive, and that's kind of where I think observability and maybe this concept of AIOps come together. We have talked about AIOps for a while, but I don't think we were able to be very effective, because the data wasn't structured enough; the signal-to-noise ratio was never good enough with the data we had, until maybe more recently, with all of the standardization I was describing. So what we try to do, because we have all the signal, is that as soon as an issue happens, we try to visually and proactively tell the user what the problem might be. Of course, if you have implemented SLOs or anything else, which is more general purpose, then that can also help you be a lot more proactive. But the point generally is that we try to understand how the system behaves normally and identify abnormal behavior patterns that might indicate a problem, right? And kind of surface those and have the user take a look at where the problem might be right now, as opposed to all the data that is available.
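The "learn normal behavior, flag deviations" idea described above can be sketched with a toy baseline detector. This is only an illustration of the general concept, not Splunk's detection algorithm; the window size, threshold, and latency values are made up.

```python
# A toy rolling-baseline detector: flag points that deviate from recent history.
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Keep a sliding window of recent values and flag points that deviate
    from the learned baseline by more than `threshold` standard deviations."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` looks anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 10:  # need some history before judging
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        self.window.append(value)
        return anomalous


# Feed per-minute p99 latency (ms) for one service; alert as soon as it spikes.
detector = BaselineDetector()
stream = [120, 118, 125, 119, 122, 121, 117, 123, 120, 119, 118, 122, 480]
for minute, latency in enumerate(stream):
    if detector.observe(latency):
        print(f"minute {minute}: latency {latency} ms deviates from baseline -> alert")
```

A real system would presumably track a baseline per dimension (per service, per pod, per client version, and so on), which is exactly where the tag-based dimensionality discussed earlier matters.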
