In a two-part series I wrote on learning to fail (part 1 and part 2), I made some references to being able to fall down and get back up again in the IT world, which is far from a trivial problem.
I pointed out that we need to learn from our failures, understand how we failed in the past and shorten our recovery times in the future.
Matt Groeninger, someone I recently started talking to a lot, wrote a great follow-up piece to all this, based on a conversation we had about a Skype failure and recovery. His piece ultimately leaves us with the following issue to resolve: "How can we both restore service quickly and solve a long-term systemic problem when we can't always tell that two issues are even related?"
Matt's issue echoed through my head for a few days, and then, in a eureka! moment, I think I found his answer. I recalled a really cool conversation I had with some product management people from the BSM (business service management) line of enterprise software here at HP, and it dawned on me - we actually have a product that is built for this specific purpose!
I won't harp too much on this component of the BSM suite, called "Service Intelligence", except to steal this directly from the marketing page linked from the component name:
Anticipate. Optimize. Report.
"HP Service Intelligence is an analytics layer within the HP Business Service Management portfolio that gathers and analyzes information from a dynamic, real-time service model to give you visibility and insight into the performance of your applications, your infrastructure and the connections between them. In today's dynamic, complex IT environments, this enables you to correlate and map physical, virtual and cloud-based elements to fully understand your infrastructure and manage it more effectively."
Putting this back into the security context, because I'm always good for that, a scenario would look something like this:
- [call to help desk] "Hello, your application is loading very slowly and giving me time-outs when I try and open my accounts page"
- [service desk] "We're sorry for the inconvenience, can you provide us with some information such as the application you're accessing?"
- user provides basic information
- service desk enters information into service desk software
- back-end processes pull data, linking assets across the uCMDB (universal configuration management database)
- system links application being identified as responding slowly to a cluster of systems currently under active investigation by the security and network teams
- service desk immediately gets an alert pop-up that there is an investigation into a possible DDoS attack currently being carried out against an application on an adjacent system sharing that connectivity, and that service degradation has been reported and is being worked on
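To make the linking step in that scenario a little more concrete, here is a minimal sketch of the idea in Python. It is not how Service Intelligence or the uCMDB actually work internally, and nothing here is an HP API. The asset names, the `CMDB_LINKS` dependency map, the `ACTIVE_INVESTIGATIONS` table and the `related_investigations` function are all hypothetical, made up purely for illustration. The sketch just walks the relationships outward from the application the caller reported and checks whether anything nearby is already under investigation.

```python
from collections import deque

# Hypothetical CMDB extract: each configuration item and what it connects to.
CMDB_LINKS = {
    "accounts-app": ["web-cluster-1"],
    "web-cluster-1": ["edge-router-7"],
    "payments-app": ["web-cluster-2"],
    "web-cluster-2": ["edge-router-7"],
    "hr-portal": ["web-cluster-3"],
    "web-cluster-3": ["edge-router-9"],
}

# Hypothetical open investigations, keyed by configuration item.
ACTIVE_INVESTIGATIONS = {
    "payments-app": "Possible DDoS attack; service degradation reported",
}


def build_undirected(links):
    """Treat every relationship as two-way, so shared infrastructure is reachable."""
    graph = {}
    for src, targets in links.items():
        for dst in targets:
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set()).add(src)
    return graph


def related_investigations(ci, links, investigations, max_hops=4):
    """Breadth-first walk from `ci`, collecting investigations within max_hops."""
    graph = build_undirected(links)
    seen = {ci}
    queue = deque([(ci, 0)])
    hits = []
    while queue:
        node, hops = queue.popleft()
        if node in investigations:
            hits.append((node, hops, investigations[node]))
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return hits


# What the service desk pop-up could be built from:
for asset, hops, note in related_investigations(
    "accounts-app", CMDB_LINKS, ACTIVE_INVESTIGATIONS
):
    print(f"Related investigation on {asset} ({hops} hops away): {note}")
```

The hop limit is an arbitrary choice in this sketch. In a real environment nearly everything is connected to everything else eventually, so some cut-off or weighting is needed to keep the alert meaningful.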
Before you start typing your reply, I realize this is a rather simple example and that we're not talking about distributed end-user component failure here, but why couldn't we extend the uCMDB out to other critical systems with a little work?
In the Skype case we all talked about, the system should have linked a change that was recently pushed out to users with all the other recent changes when investigating the general slowness, rather than investigating everything in pieces, in silos.
I'm not saying Service Intelligence is the key to completely cutting down on your incident double-work, problem and issue resolution times, and world peace - but what I am saying is that Service Intelligence is built to analyze the relationships between changes, connected systems and components, to help you figure out dependencies in cases such as linked and distributed failures.
I know we said DevOps, and all the wonderful things that come with it, isn't a product you can buy to magically make your enterprise or organization faster and more agile in delivery. But it sure does help to have the right products and automation in place at the right times to ease some of the ultra-complex work that needs to be done, such as linking problems and issues for faster, smoother resolution. The goal is minimum work overlap and maximum resource optimization, aimed at actually fixing the problem, not just temporarily restoring service.
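As a rough illustration of what "linking recent changes to an incident" could mean, here is a small Python sketch. The change records, component names, dates and the 48-hour window are all invented for the example. They are not details of the actual Skype outage, and `changes_to_review` is a hypothetical helper, not a feature of any product. The idea is simply to pull every recent change touching a component involved in the incident into one list, so they get looked at together instead of in silos.

```python
from datetime import datetime, timedelta


def changes_to_review(incident, changes, window=timedelta(hours=48)):
    """Return changes on affected components deployed within `window` before the incident."""
    started = incident["started"]
    candidates = [
        change
        for change in changes
        if change["component"] in incident["components"]
        and timedelta(0) <= started - change["deployed"] <= window
    ]
    # Most recent change first: usually the first one worth a look.
    return sorted(candidates, key=lambda change: change["deployed"], reverse=True)


# Hypothetical incident and change log, for illustration only.
incident = {
    "summary": "general slowness",
    "started": datetime(2024, 1, 10, 9, 0),
    "components": {"desktop-client", "login-service"},
}
changes = [
    {"id": "CHG-1", "component": "desktop-client", "deployed": datetime(2024, 1, 9, 18, 0)},
    {"id": "CHG-2", "component": "billing-db", "deployed": datetime(2024, 1, 10, 7, 0)},
    {"id": "CHG-3", "component": "login-service", "deployed": datetime(2024, 1, 2, 12, 0)},
    {"id": "CHG-4", "component": "login-service", "deployed": datetime(2024, 1, 10, 8, 30)},
]

for change in changes_to_review(incident, changes):
    print(change["id"], change["component"], change["deployed"])
```

A time window plus a component match is a crude filter, and it will miss changes on components nobody realized were involved, which is exactly where a relationship model like the uCMDB would be expected to help.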
Cross-posted from Following the White Rabbit