The Resilient Enterprise: Learning to Fail Part 2

Monday, June 25, 2012

Rafal Los


In the first part of this 2-part series on enterprise adaptation to failure, I gave you my overview on why the tribe mentality in the DevOps movement could and will lead to a more successful adaptation of applications to today's real-world failure scenarios.  

First came the call to admit to ourselves that failure was imminent and unavoidable - and that this wasn't a reason to despair.  

In fact, as I noted from a previous conversation, IT tends to hide its failures, a habit endemic to the human mindset, when we should really be learning from them.  Moving on, let's discuss a few ideas behind how failing with the support of a DevOps tribe can lead to a more resilient enterprise and ultimately better enterprise security.

In the following few sections we're going to take a look at combining tools, processes and the tribe mentality to solve some otherwise ugly problems - and come out the other side of the tunnel a little better off.  Let's go...

Fail fast, fail cheap

The first type of failure that comes to mind when I think of IT failing isn't necessarily a security breach, or even a catastrophic system event.  I'm thinking of the failure to deliver to the business.  As my new running buddy Ben Kepes told me the other day - "fail fast, fail cheap - live to fail again".

What we of course mean by this are those NPI projects the business always has kicking around.  NPI stands for New Product Introduction - or in this case applications or systems that will need to be stood up to support a program or project that may never go past the prototype stage.  The DevOps (and consequently cloud) mentality is phenomenal for this type of work.

A small agile team able to spool up necessary resources without having to procure systems, install, patch and wait can deliver a prototype staggeringly quickly.  Here's the beauty of that... it's much cheaper because if the prototype never makes it to a production state, we can simply release the resources back into the pool and we've only been billed for what we've consumed.

Should the prototype go live we can now scale out resources and build up as-needed because the team that built the prototype has a great inside track on how the thing was built in the first place.

In this example, we can fail fast, fail cheap (if the NPI fails) or we can scale out and serve business need with lightning quick precision when things succeed beyond our wildest imaginations (hey, it happens all the time...).

Improve MTtR (Mean Time to Repair)

Things break. Systems fail. Software hits unexpected, unpredictable and unrecoverable errors.  Stuff gets hacked.  This is just the way the world works, and if this frightens you, well, you probably haven't been in IT long enough to understand its truth.  When I was employed at my previous organization there were two metrics that applications were judged by.

The first was the judgment of the development organization, and the second was the judgment of the operations organization.  What people in management failed to realize was that these were actually driven from the same tribe, or should have been, to optimize these outcomes.  Allow me to explain further, and focus on the operations team for a minute.

In my previous role the operations team was a pan-organizational, outsourced team that was responsible for the uptime of applications once they had gone live. They maintained .NET, Java, and a few PHP applications on top of countless vendor-specific applications such as Salesforce and SAP.  I don't know about you - but I wouldn't ever want to be part of a team like that.  Doomed to failure every time.

Over time we realized the turnover rate was so high because the stress was astronomical, and no one wanted to learn all those platforms only to be blamed for the system being down when it was rarely their fault.  Worse yet, when the issue was an application fault it literally took a conference call of 20 people on an international bridge, troubleshooting over many hours.

Developers used countless defect tracking systems and build-management processes, and the operations team used something guaranteed to be different.  There was very little information-sharing except in the form of emails, which everyone hoarded in their over-stuffed mailboxes, and knowledge was never built.

As I was exiting, stage left, some of the application teams started putting up makeshift wiki pages with information about their applications. They gave the operations teams read-only access to that information at first, then eventually allowed them to contribute.  In the few short months that I witnessed this, the MTtR fell like a stone... so my only hope is that they continued this practice, although I can't verify this.

A common platform for defect tracking that automatically builds a searchable database of defects, errors, and responses can create more resilient applications and sharply improve MTtR, as a first step up that hockey-stick curve.
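To make the idea of a shared, searchable defect record a little more concrete, here is a minimal sketch in Python. It is not any particular product's design: the `DefectRecord` fields, the `DefectIndex` class, and the sample entries are all hypothetical, and a real platform would sit on a proper database or search engine rather than an in-memory dictionary.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DefectRecord:
    """One entry in a shared defect log (all fields are hypothetical)."""
    defect_id: str
    application: str
    error_text: str
    resolution: str


class DefectIndex:
    """In-memory keyword index shared by developers and operations."""

    def __init__(self):
        self._records = {}
        self._index = defaultdict(set)  # token -> ids of records containing it

    @staticmethod
    def _tokens(text):
        # Lowercase alphanumeric words; good enough for a sketch.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, record):
        self._records[record.defect_id] = record
        text = " ".join([record.application, record.error_text, record.resolution])
        for token in self._tokens(text):
            self._index[token].add(record.defect_id)

    def search(self, query):
        """Return the records that contain every word in the query."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        ids = set.intersection(*(self._index.get(t, set()) for t in tokens))
        return [self._records[i] for i in sorted(ids)]


# Made-up sample data, purely for illustration.
index = DefectIndex()
index.add(DefectRecord("D-1", "billing", "Connection pool exhausted under load",
                       "Raised pool size; added timeout"))
index.add(DefectRecord("D-2", "portal", "Login page timeout after deploy",
                       "Rolled back config change"))

for hit in index.search("timeout"):
    print(hit.defect_id, "-", hit.resolution)
```

The point of the sketch is only that whoever gets paged can search past errors and their fixes in one place, instead of digging through someone else's mailbox.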

Looking back, it seems so insane to have the operations team be so detached from the developers who actually wrote the application.  As Gene Kim says, "if you wake the developers up at 2am to fix a problem, issues get fixed remarkably quicker" - for reasons that start to become obvious once you try it.  Information sharing is key.  Knowledge of the application is key.


You know what else a split operations/development organization can rarely do right? Fix issues where code is required without an act of Congress.  This issue is much like the one outlined above: it's all about knowing and articulating where the failure is, figuring out how to get at it faster, and then testing and deploying faster.

There are 7 observable stages of a failure: (1) realization of failure, (2) validation of failure, (3) troubleshooting for root cause(s), (4) identification of root cause(s), (5) fix development, (6) fix application, (7) fix validation.

Many of these have short-circuit capabilities which are left unexploited in large organizations as well as small ones.  The short version is this - you can't just skip steps, but you can make them go faster and waste less time from "Oh no!" to "all better"... and isn't that what we want in the end?
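If you want to see where the time actually goes between "Oh no!" and "all better", the seven stages can be timestamped per incident. The sketch below is one hedged way to do that in Python; the stage names, the `make_incident` helper, and every duration in it are made-up illustrations, not measurements from any real team.

```python
from datetime import datetime, timedelta

# The seven stages listed above, in order (identifiers are my own shorthand).
STAGES = [
    "realization",
    "validation",
    "troubleshooting",
    "root_cause_identified",
    "fix_development",
    "fix_application",
    "fix_validation",
]


def make_incident(start, minutes_per_stage):
    """Build an incident record: when it started and when each stage finished."""
    incident = {"start": start}
    elapsed = start
    for stage, minutes in zip(STAGES, minutes_per_stage):
        elapsed += timedelta(minutes=minutes)
        incident[stage] = elapsed
    return incident


def stage_durations(incident):
    """Time spent in each stage of a single incident."""
    previous = incident["start"]
    durations = {}
    for stage in STAGES:
        durations[stage] = incident[stage] - previous
        previous = incident[stage]
    return durations


def mean_time_to_repair(incidents):
    """Average time from the start of a failure to a validated fix."""
    total = sum((i["fix_validation"] - i["start"] for i in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents; the minutes per stage are invented for illustration.
start = datetime(2012, 6, 25, 2, 0)
slow = make_incident(start, [5, 10, 120, 30, 60, 20, 15])
fast = make_incident(start, [5, 5, 40, 10, 30, 10, 10])

for stage, spent in stage_durations(slow).items():
    print(f"{stage:<22} {spent}")
print("MTtR:", mean_time_to_repair([slow, fast]))
```

Nothing is skipped here; every stage is still recorded. The breakdown simply shows which stages are eating the clock, which is where the short-circuits are worth looking for.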

Let's focus on the continuum from step 3 (troubleshooting) through step 5 (fix development).  In these steps we often have, as described above, multiple organizations involved that rarely talk to each other and, worse yet, don't understand the application from the same perspective.

The development team focuses on the application, and often has little insight into how the environment it's being deployed to is built and maintained, while the operations team is all about the environment but knows little about the application.  This type of learning-on-the-fly is dangerous and prolongs that cycle from steps 3 through 5.

If we had a tribe in place, using a single system of record for diagnostics, system and application configuration, and build information, we could at least speak the same language.  A system of collaboration is critical, so critical in fact that I could see measurable gains (maybe as high as 30%) in the time spent from troubleshooting to fix development.
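As a quick back-of-the-envelope check on what a gain like that would mean for the whole repair cycle, here is a small Python sketch. The per-stage minutes are hypothetical, and the 30% figure is only the guess stated above, not a measured result.

```python
def repair_minutes_after_gain(stage_minutes, gain_percent=30, first=3, last=5):
    """Total repair time if stages first..last (1-based) get gain_percent faster."""
    total = 0.0
    for number, minutes in enumerate(stage_minutes, start=1):
        if first <= number <= last:
            total += minutes * (100 - gain_percent) / 100
        else:
            total += minutes
    return total


# Hypothetical minutes spent in each of the seven stages of one incident.
baseline = [5, 10, 120, 30, 60, 20, 15]

before = sum(baseline)
after = repair_minutes_after_gain(baseline)
print(f"before: {before} min, after: {after:.0f} min")
print(f"overall reduction: {100 * (before - after) / before:.1f}%")
```

With these invented numbers, steps 3 through 5 account for 210 of the 260 minutes, so a 30% improvement in just those steps trims 63 minutes, or roughly a quarter of the total repair time.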

If the developers had direct access to the system which held the logs for the failing application they wouldn't rely on someone else to analyze or even ship logs over for analysis, and this would likely result in identification of an issue faster and with fewer resources.

If these same developers had further access to deploy a fix directly into the broken environment... we may be talking about an even faster loop.  The big question is - are we willing to commingle the developers and the operations people?  This is a big cultural shift.

These are just some of the direct ways that a tribe, or unified DevOps team, can learn to fail better.  Once we're over the mental hurdle that failure is a reality of IT and business, the next mental step becomes a necessity... and that's working together.

Being empowered to work together with technology, access, and thinking is something we have to work on in order to make IT agile enough to serve today's dynamic business.

Cross-posted from Following the White Rabbit
