Commander, Executive, CCC discussed on CCC Talks
Having watched that, I watched it too, and we thought it would be very interesting to get you on to talk to our listeners more about this topic in particular. I was taken aback by your talk. Incident management, problem management, root causes: all fantastic, but for me, you introduced us to trauma, and your session looked at trauma. The purpose of the talk was to describe traumatic events as something that happens in an organization. We know them as incidents or outages that prevent a business from doing the thing it exists to do and that affect its customers. I'd like you to open up and tell us a little more about your perspective on the trauma element in incident and problem resolution.

Absolutely. The idea behind this came from my own experience of being treated for post-traumatic stress disorder. In individual trauma treatment, as I started to learn how we as individuals respond to trauma, how our bodies respond physiologically, and what steps are taken in treatment, I saw a lot of parallels to how organizations respond. So take the idea that an incident, an outage, or a service disruption, like you said, is basically a traumatic event. Trauma, when it occurs in a person, is when our nervous system is out of answers: we cannot respond, we're out of solutions to a problem, and that introduces trauma. The same kind of thing happens in organizations. When a traumatic event triggers trauma in people, our window of tolerance is broken. Normally, something activates a response: we respond, we come back, we settle.
We continue to move. When there's something traumatic, we might get stuck on or stuck off: either stuck in fight or flight, or stuck in freeze. Organizations tend to do this as well. As I started doing a little more research and thinking this hypothesis through, here is what I found. The great thing about people is that we have the prefrontal cortex, and it's usually a big advantage. It's where executive function happens; it's how we tell the difference between right and wrong, good and bad. But the problem is that our prefrontal cortex responds the same way to perceived trauma as it does to real trauma. For lack of a better term, when there are things that physiologically remind us of something traumatic, our prefrontal cortex responds the same way it does to real trauma. Organizations do the same: when something looks similar, we respond the exact same way. The reality is that the systems involved are very complex, both technical and sociological. So we try to do pattern recognition, which we love to do as humans. We want to say, "This looks like something I've seen before, so I'm going to respond similarly," but the reality is that every incident is unique because of the complex systems at play. So what we tend to do as organizations is get stuck on, where we're always in a kind of flight, or get stuck off, stuck in freeze. All of these things prevent us from innovating and moving forward, and they are very counterproductive to learning from incidents, growing as an organization, and improving how we work.

That's very interesting, bringing that level into it. I've certainly seen organizations over the years have a major incident that was customer-impacting and eventually revenue-impacting, and I've seen the response from the organization's own perspective as well.
I've seen fight, where they try to fight the incident and it just makes things worse for them. I've seen freeze, where they can't make a decision to do anything, which just prolongs the pain. And I've seen flight, where they, or at least the boardroom, say there is no problem when it's evident to everybody that there is; they're trying to hide behind it. So it's very interesting for me to see those human traits fall into the patterns of incident management I've experienced myself over a number of years. I also wanted to get your view. You're big into DevOps, big into this digital services economy that's growing. I take it that a lot of people have been managing incidents for the last twenty or thirty years; it's nothing new. But if they keep managing incidents the way they did in the past, is there a case that they need to think differently about these newer technologies and the newer cloud platforms everything is built around? With some of these new digital services, a provider manages underlying incidents that the organization never sees.

I think that's part of it. There are a lot of reasons we have to continually evolve our incident response. I've been doing IT ops for over two decades, so I've seen things change. Things are different. There's definite data and science behind the idea of "you build it, you run it." The idea of having a small group that responds to all incidents and is able to fix everything really doesn't work, for several reasons. One is that, to be honest, it never really worked that well in the first place, but we got away with it. There was a time in my career when I could hold everything about the systems I supported in my head. If you're supporting a LAMP stack, okay, I know all the components, I know what they do, I can help figure it out.
With today's distributed systems, it's impossible to understand everything, because everything is in a constant state of change and dependency. So in an incident, you want to get the people who have domain knowledge about the affected component involved as quickly as possible, because they can help restore service as quickly as possible. That being said, to do that we have to change the way we think about incident response, because these are such complex systems and, as you mentioned, some of it is outside our organization or abstracted away. How do we see what's happening over in AWS while this is going on, plus we had a release on this microservice an hour ago? And I'm just the person carrying the pager. I can't know all this stuff, and it's crazy to expect me to. We have to distribute this effort, and that includes the way we think about incidents: investigating them, learning from them, and sharing information.

Something we hear a lot at the CCC, when we look at the people aspect of change, is people having to adapt what they do in smaller increments, but more often. How do we change how we do incident management? It's not a big-bang thing where you go away for nine months and try to figure it out; you adapt in small steps. So if you're now doing microservices in a very fast DevOps environment, as you said, the incident manager isn't going to know very much about each one. I think their role is to coordinate: bring people together and take them through the resolution process, using the people involved in, say, a microservice to figure out what went wrong, or, if they can't see it because they're in a fight situation, step back a little and help them through.

I was just going to say, that's a big part of it.
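The routing idea above, getting the people with domain knowledge of the failing component involved first, can be sketched in Python. This is a minimal illustration under stated assumptions, not how PagerDuty or any particular tool implements it: the ownership map, the team and responder names, and the `platform` fallback team are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical service-ownership map ("you build it, you run it"):
# each component is owned by the team that builds it.
OWNERS: dict[str, str] = {
    "checkout-api": "payments",
    "search-index": "search",
    "auth-service": "identity",
}

# Hypothetical "who is on call right now" table, one responder per team.
ON_CALL: dict[str, str] = {
    "payments": "alice",
    "search": "bob",
    "identity": "carol",
    "platform": "dave",  # assumed catch-all team for unowned components
}

FALLBACK_TEAM = "platform"


@dataclass(frozen=True)
class Incident:
    component: str  # the component that is failing
    summary: str


def route(incident: Incident) -> tuple[str, str]:
    """Return (team, responder): the on-call person of the team that owns
    the failing component, or the fallback team's on-call if nobody owns it."""
    team = OWNERS.get(incident.component, FALLBACK_TEAM)
    return team, ON_CALL[team]
```

For example, `route(Incident("search-index", "latency spike"))` returns `("search", "bob")`, while a component missing from the map goes to the hypothetical `platform` team. The point of the design is that the pager holder is chosen by ownership of the component, not by membership in one central ops group.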
At PagerDuty, we're big believers in using the Incident Command System, which was actually developed by first responders outside of tech. It's how first responders in firefighting and similar situations work together. The incident commander can be someone whose day-job title is incident management or not; it doesn't really matter. They're not a resolver. They're the coordinator, the communicator, the decision maker. They know how to get the right people involved, and being a really good incident commander is so much more about psychology, communication, leadership, delegation, and decision making than any kind of technical acumen.

I think it's a really strong role. As you said, it's a leadership role and a communication role, but you do need enough technical insight or knowledge to understand the domains being affected. Not necessarily in-depth knowledge, but you certainly need to know something about them. I've always felt incident management has been a little debased over the years, where it's, "Oh sure, we could outsource it to a third party." Maybe that's an answer, maybe it's not, but certainly for the core critical applications that organizations have, I'd like them to keep control of how those get managed from an incident perspective, especially applications that are customer-facing, revenue-generating, or could lead to reputational damage. So I think it's a stronger role than organizations have possibly given it credit for over the years.

Yes, I think so. A lot of companies are having great success making it a role and not a job. I'll give the example of PagerDuty, but other organizations work the same way. We don't have anyone whose job in the org chart is incident manager. Our incident commanders, our on-call incident command, have other jobs, and what we've found is that some of our best incident commanders are our product owners.
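The Incident Command System rule described here, that the commander coordinates and is never also a resolver, can be expressed as a small Python sketch. The `IncidentRoles` class, the `pick_commander` helper, and the rotation and expert names are hypothetical illustrations, not part of ICS itself or of any incident-management product.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRoles:
    """Hypothetical role assignment for one incident: the commander
    coordinates, communicates, and decides; resolvers fix the problem."""
    commander: str
    resolvers: set[str] = field(default_factory=set)

    def add_resolver(self, person: str) -> None:
        # The commander runs the incident, so they are never also a resolver.
        if person == self.commander:
            raise ValueError(f"{person} is the incident commander, not a resolver")
        self.resolvers.add(person)


def pick_commander(ic_rotation: list[str], paged_experts: set[str]) -> str:
    """Pick the first person on a (hypothetical) incident-commander rotation
    who has not already been paged as a subject-matter expert."""
    for person in ic_rotation:
        if person not in paged_experts:
            return person
    raise LookupError("no available incident commander")


# Usage with hypothetical names: erin is on the IC rotation but is also the
# expert for the failing component, so frank takes command instead.
rotation = ["erin", "frank"]
experts = {"erin", "bob"}
roles = IncidentRoles(commander=pick_commander(rotation, experts))
roles.add_resolver("bob")
print(roles)  # IncidentRoles(commander='frank', resolvers={'bob'})
```

Skipping anyone already paged as an expert keeps the person best equipped to fix the problem free to fix it, which is the reason given for not defaulting to engineers as commanders.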
Product owners make good incident commanders because they understand the product. They don't necessarily understand Cassandra or Kubernetes or the deep technical details, but they know how our product works and how its components work. And really good product owners and product managers tend to have good delegation, communication, and decision-making skills. In the past we made the mistake of saying we wanted our incident commanders to be engineering managers or engineers. The problem is that one of the rules of incident command is that you're not a resolver. So if you're an engineer in your day-to-day job, you're not necessarily a great incident commander, because you might be the person best equipped to solve the problem, but you can't solve it while you're running the incident and trying to move the team forward together. We've had a bunch of really interesting experiences with this, and I've seen a lot of other companies adopt it as they adopt ICS. There's nothing wrong with having a formal incident management role in your organization, but look at what those skills are, where they come from, and how those folks can gain that understanding and follow the same process. It works really well.

Yeah, I think that's a good approach. As you said, you've taken it from first responders and brought it back into the world of technology and services that we live in. One thing we're hearing a lot from the people we talk to about incident management concerns the movement to the cloud. You'll have cloud providers and software-as-a-service providers providing applications and the infrastructure. So if there are any real
incidents in those environments, they don't come to the organization; they're handled by the cloud provider. So typically what we're hearing is that there's a reduction in incident volume per se, which we'd consider positive. However, when an incident does strike, it's generally a bigger issue and a bigger challenge to resolve, because it's being taken care of behind the scenes by the cloud or SaaS provider. Are you seeing this in your experience?

I think what we're seeing is more of a distribution of incidents, because that responsibility is being shared across the organization. So what happens is you have more folks responding to incidents, but each of them less frequently than they used to. That's a challenge. When I think back to being an admin carrying a pager, I'd be on call for a week and probably get paged six times during that week, so I knew how to do incident response because I did it all the time. These days, as we're doing the great thing of putting more people on call, distributing the load, and being more focused on what we own, you can go a lot longer between incidents, and the problem is that practice makes permanent. You don't want to get paged for an incident and not even remember how to log into PagerDuty because you haven't done it for six months. You don't want to be trying to remember how to do post-mortems because you never do them. So there are reasons we want to practice. One is, again, that practice makes permanent. But it also goes back to that trauma thing. Everything about an incident is stressful; it just is. So what we want to do is create a physiological response in ourselves to incident response, so that when we are paged, we are doing everything we can to minimize stress.
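The shift from a central pager to distributed on-call comes down to simple arithmetic: if a team's page volume is shared evenly across its rotation, each person handles the weekly volume times 52, divided by the number of responders, per year. The Python sketch below assumes that even sharing, and every number in it is hypothetical apart from the six pages per on-call week recalled above. It shows how someone can end up going roughly six months between incidents.

```python
WEEKS_PER_YEAR = 52


def pages_per_person_per_year(team_pages_per_week: float, responders: int) -> float:
    """Average pages each person handles per year, assuming the team's page
    volume is spread evenly over a rotation of `responders` people."""
    return team_pages_per_week * WEEKS_PER_YEAR / responders


def weeks_between_pages(team_pages_per_week: float, responders: int) -> float:
    """Average gap, in weeks, between two pages for the same person."""
    return responders / team_pages_per_week


# Hypothetical numbers, for illustration only.
# Old model: a small central ops team (assumed 4 people) shares one noisy pager.
old = pages_per_person_per_year(team_pages_per_week=6, responders=4)
# New model: a service team owns a quieter slice of the system on a longer rotation.
new = pages_per_person_per_year(team_pages_per_week=0.5, responders=13)
gap = weeks_between_pages(team_pages_per_week=0.5, responders=13)

print(f"old: {old:.0f} pages/person/year")  # old: 78 pages/person/year
print(f"new: {new:.0f} pages/person/year, one every {gap:.0f} weeks")
# new: 2 pages/person/year, one every 26 weeks
```

With only a couple of real pages a year, skills do not stay fresh on their own, which is the argument for deliberate practice.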
Here's the most important thing to do in an incident. This is your golden ticket, your silver bullet. It's two words: don't panic. That is more important than anything else. So what we want to do is this: when we do things like planned failure injection and game days...