Resilience engineering, John Allspaw and DOES 2020 conference

28. 07. 2020

Overview

Welcome to 0800-DEVOPS, a newsletter digest of interesting ideas from the world of DevOps, technical practices and increased productivity! Today we're talking to John Allspaw, discussing resilience engineering, and remembering highlights from the DOES 2020 conference.

Check out our newsletter archive. If you like 0800-DEVOPS, please share the good vibe and forward this article to your friends. Thanks, you rock!
Or you can just sign-up here.

Resilience Engineering

Would you consider yourself a resilient organization?
The single most important advantage of leaders in the Age of Software is their ability to learn and adjust to new conditions, i.e. how resilient they are.

Resilience is not about reducing negatives (incidents, errors, violations). It’s about identifying and then enhancing the positive capabilities of people and organizations that allow them to adapt effectively and safely under pressure.

–Sidney Dekker

Resilient organizations don’t take past success as a guarantee of future safety. They keep their eyes and ears open, discover space for improvement, and execute on that relentlessly. Sometimes this space for improvement is clearly visible. Other times we must closely look at the system (maybe even poke it a bit!) and look for anomalies (incidents) that come out. That poking is called Chaos Engineering and we’ll talk about that some other time.

John Allspaw says that incidents shouldn’t be seen as an abnormal state of an otherwise stable system, but as signals showing us where the system should be improved. Organizations, especially technical ones, are getting better and better at developing tools and devising ways to push the system over the edge. That is expected, since such technical solutions are in their comfort zone.

Where organizations are still falling short is what happens after the system fails. How do we address incidents and what do we get out of them?

While many organizations meticulously track hard data about incidents and draw conclusions (which is good!), few of them use these opportunities to improve the organization as a sociotechnical organism (which is less good!).

Incident analysis is not about incidents anymore. It’s about how your organization works.

–Nora Jones

If your incident management ends as soon as the service is restored, with the incident pinned on human error or a “root cause”, be aware that you’re missing out on a great opportunity to learn about your organization and improve the way it works. Nowadays, organizations are complex systems, and as we have learned, complex systems never fail because of one single “root cause”.

Find all this and more in this excellent podcast with Nora Jones, especially why you should focus on people and their collaboration in a sociotechnical organization.


Hand picked

+ In the first post of this three-part series, our Ivo Štampalija reflects on the role the mainframe plays in enterprise organizations and introduces the new direction IBM is taking to support DevOps practices on the mainframe (such as CI/CD) by developing new tools on z/OS. What does this mean for large organizations and mainframe lovers? I can’t wait to see it.

+ Last month I had an opportunity to be part of the DevOps Enterprise Summit (DOES) 2020 conference. Although traveling to London would have been much better, being part of this conference even virtually was a great experience. DOES talks are usually made publicly available after some time. Until then, let me share some highlights.

+ Many companies are manically trying to hire superhumans expected to deal with all aspects of the software delivery process. They are looking for a DevOps engineer. What they’re actually looking for is a system engineer, network engineer, security expert, automation expert, systems analyst, and full-stack developer with a pinch of software architect, all in one. Alessandro Diaferia explains what is wrong with this approach and how companies should go about staffing to “become DevOps”.

+ Based on the famous concept by ThoughtWorks, the CNCF is issuing its own CNCF Technology Radar! Due to the number of different projects under CNCF sponsorship, each edition of the radar will focus on only one specific area. The first issue covers the Continuous Delivery area.

+ In his book “Outliers”, Malcolm Gladwell states that it takes seven consecutive human errors to take an airplane down. An airplane is an excellent example of a complex system. When a complex system fails, it is almost never because of a single “root cause”. Richard Cook wrote an insightful piece on how complex systems fail. It nicely summarizes everything we already intuitively knew about complex systems, like “Catastrophe requires multiple failures – single point failures are not enough” and “Catastrophe is always just around the corner.”

+ As of today, there are 1,418 cloud-native projects under CNCF sponsorship. And that’s only a fraction of the tools and skills used in the DevOps space. Imagine how intimidating it must be to enter this escape room. But it doesn’t have to be like that. Having a high-level visual roadmap with possible options and forks in the road can help you navigate this wide space and find your ideal path. Just roll the dice and make your move.

Read with us

Actionable Agile Metrics for Predictability: An Introduction

Many problems in traditional project management stem from the fact that there is no predictability in the software delivery process. If only we had predictability, we could tell when something would be done. And we wouldn’t have to deal with guesstimates.

Unpredictability comes from a lack of flow. Lack of flow manifests itself as work items queuing up. We need to manage and optimize the flow. How can we tell if we’re successfully managing flow? By tracking three metrics: Work In Progress, Cycle Time, and Throughput. And the Cumulative Flow Diagram (CFD) is the best tool to track them.

Actionable Agile Metrics describes nicely what a CFD is and ties it back to three agile metrics. Implemented properly, CFD can help you manage flow, reach predictability, and stop guesstimating once and for all.
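
To make the three metrics a bit more concrete, here is a minimal sketch of how they could be computed from start and finish dates of work items. The item list, the function names, and the "finish minus start plus one day" convention for Cycle Time are assumptions made for illustration; they are not taken from any particular tool. The last function returns the two numbers a CFD plots for a given day (cumulative arrivals and departures), and the gap between them is the Work In Progress on that day.

```python
from datetime import date

# Hypothetical work items: (id, started, finished).
# finished is None while the item is still in progress.
ITEMS = [
    ("A", date(2020, 7, 1), date(2020, 7, 6)),
    ("B", date(2020, 7, 2), date(2020, 7, 4)),
    ("C", date(2020, 7, 3), date(2020, 7, 10)),
    ("D", date(2020, 7, 8), None),
]


def work_in_progress(items, day):
    """Count items started on or before `day` and not yet finished by the end of `day`."""
    return sum(
        1 for _, start, end in items
        if start <= day and (end is None or end > day)
    )


def cycle_times(items):
    """Elapsed calendar days per finished item, counting both the first and the last day."""
    return [(end - start).days + 1 for _, start, end in items if end is not None]


def throughput(items, first_day, last_day):
    """Count items finished within the inclusive window [first_day, last_day]."""
    return sum(
        1 for _, _, end in items
        if end is not None and first_day <= end <= last_day
    )


def cfd_row(items, day):
    """Return (cumulative arrivals, cumulative departures) as of `day`."""
    arrived = sum(1 for _, start, _ in items if start <= day)
    departed = sum(1 for _, _, end in items if end is not None and end <= day)
    return arrived, departed


if __name__ == "__main__":
    day = date(2020, 7, 5)
    arrived, departed = cfd_row(ITEMS, day)
    print("WIP on", day, "=", work_in_progress(ITEMS, day))
    print("CFD row:", arrived, "arrived,", departed, "departed")
    print("Cycle times (days):", cycle_times(ITEMS))
    print("Throughput 1-7 July:", throughput(ITEMS, date(2020, 7, 1), date(2020, 7, 7)))
```

Calling cfd_row for every day in a period and plotting the two series gives the top and bottom lines of a simple two-state CFD. A real board usually has more workflow states, each adding one more band to the diagram.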

Quote of the Day

“If your postmortem doc includes the verbiage “this is one of the root causes,” YOU DON’T HAVE A ROOT CAUSE.”
—J. Paul Reed, Senior Applied Resilience Engineer @ Netflix

Office hours with 0800-DEVOPS

Folks, we’re adding new timeslots for anyone to book casual private time with our 0800-DEVOPS team and discuss DevOps and other related topics -> NO strings attached!

You just click’n’pick a timeslot that best suits you and we’ll do the rest. Let’s just have a casual chat on interesting topics!


Interview of the Month

John Allspaw

John Allspaw is the founder of Adaptive Capacity Labs and one of the people who drove the DevOps movement from its beginnings. His (and Paul Hammond’s) famous “10+ deploys per day at Flickr” talk at the Velocity conference in 2009 showed us that there is a better way of collaborating and delivering software. Today, John focuses on helping organizations do better in incident management and learn from their own mistakes. I talked with John about incident management, learning organizations, and how the world has changed since 2009.