Process Control Anti-Patterns that Impact DevOps
Part Two in a Series on DevOps Anti-Patterns
Sometimes it’s hard to be sure that the culture is fully DevOps. After all, this is a path of continuous improvement, and always striving to become better is part of the DNA. There’s also a lot of nuance and opinion on what DevOps means, and a lot of ways to get there.
So, I was thinking: what about an alternative symptom-based checklist to measure how far you are on the journey? I came up with some ideas to use as a gauge. Essentially… How DevOpsey are you?
By DevOpsey, of course, I mean all the things in the DevOps Loop (plan, create, verify, package, release, configure, monitor). But it also applies to things like the DevOps Super-pattern, and probably a bunch of other things too.
I kept adding material until I realized the article was getting too big and needed to be split up. This second part addresses some of the anti-pattern symptoms you might come across when process becomes overly zealous and systems thinking isn't applied. These things are often implemented with the best intentions, just like the road to hell.
Let me know what you think.
It takes more than a few hours to go from initial commit to production
As far as benchmarks go, I feel like this is probably the most fundamental indicator.
This one metric says a lot about how many silos, control gates, and manual steps exist in the organization. If you want to understand where all of that time is going, tools like Value Stream Mapping can help you work through it with stakeholders.
It takes more than an hour for subsequent commits to make their way to production
As a developer, you should be able to deploy at any time.
You shouldn’t need to wait until the end of the sprint or program increment to get your code out. You shouldn’t need to wait for the next meeting of the change control board. You shouldn’t need to wait for QA to bless the change.
I mean, I think this is problem #1 with SAFe. Our research shows that if you want to do things at scale, you do it by building high performing teams. — Jez Humble (@jezhumble), June 1, 2019 (https://t.co/4WN6LigYcE)
Even if it takes ages to get service infrastructure stood up (see above), low latency to ripple changes to production is essential for fast feedback. In conjunction with adequate testing, short cycle times will reduce the risk of defects significantly.
One consequence of longer commit-to-deploy cycles is that developers will bundle many commits into a single change, almost certainly increasing risk and complicating diagnosis.
Even if your organization is staunchly control-based and insists on a separate team signing off ahead of feature release, it’s possible to separate deployment (installing new code on production systems) from release (the act of making new code generally available). In this way, you can still test in production and improve confidence that new code doesn’t break existing functionality early.
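The deploy/release separation described above is commonly implemented with a feature flag: new code ships to production "dark" and is only exposed when a toggle flips. A minimal sketch, assuming a simple in-process flag store (the flag names and functions here are illustrative, not any particular library's API):

```python
# Feature-flag gate: code is *deployed* (present in production) but not
# *released* (visible to users) until the flag is switched on.
# FLAGS, flag names, and the checkout functions are illustrative.

FLAGS = {"new_checkout_flow": False}  # deployed dark: off by default


def is_enabled(flag: str) -> bool:
    """Look up a flag; unknown flags default to off, the safe choice."""
    return FLAGS.get(flag, False)


def legacy_checkout(cart):
    return {"path": "legacy", "total": sum(cart)}


def new_checkout(cart):
    return {"path": "new", "total": sum(cart)}


def checkout(cart):
    # The new code path is released only when the flag flips on;
    # existing behavior remains the default.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Because the flag can be enabled for internal users only, this is also what makes "testing in production" practical: the new path runs against real systems before the general population ever sees it.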
You’re frequently blocked waiting on someone to perform some task or approval
You’re not sure who to turn to? You are sure who to turn to, but it’s taking ages to get a response or otherwise unblock your work? You have silos, and it’s killing your velocity.
Whether it’s Operations, Change Control, Database Administration, InfoSec, Compliance, QA, or IT, relying on teams who don’t have you as first priority is tough.
Empowering the organization to simplify process, create thoughtful automation, and otherwise eliminate silos will help dramatically.
Well, the NOC tested, and they said everything was fine…
At one company, a Jenkins slave servicing pipelines started crashing frequently. The release engineering team responsible for building and running these slaves had been dis-empowered to the point that all diagnostics and monitoring were performed by the NOC. If something went wrong, all the team could do was check with the NOC.
In this particular case, the NOC ran some tests and said everything was fine. Unfortunately, there were no details given on what those tests were.
Because of this dis-empowerment, all release engineering could do was get the slave re-imaged and hope for the best. Of course, this fixed nothing. The net result: release engineering spent time fielding complaints from developers and talking to the NOC, the NOC spent time being busy not tracking down the problem, and everyone lost.
This is another great example of how silos will hurt your organization.
Major changes require extensive cross-team coordination
Dr. Strangelove, or: How I Learned to Stop Worrying and Love the Bomb
Putting a team of cross-functional senior staff in a room to roll out a major change, perhaps by way of a "war room" or "mission control", indicates tight coupling and brittleness.
This is distinct from waiting on someone to perform work on your behalf and is more often a symptom of the complexity of application architecture than organizational complexity.
This is not the 1960s, and you are not sending rockets to the moon (although if you are, please say hello). In order to maintain feature velocity and stability, it’s important to develop components and systems that are loosely coupled and can be upgraded independently.
You observe change windows or change quiet periods
Change windows are pre-defined intervals where production changes are permitted. Change quiet periods are pre-defined intervals where production changes are not permitted. Some of these are legitimate for regulatory or control reasons, but often they are driven through fear.
When the motivation is fear, controls are typically instituted because confidence has been lost as severe production incidents cause trauma to the business.
Stalling delivery means that unpublished changes will accumulate. This is risky, since it becomes increasingly hard to remember implementation details.
Isn’t it better to improve confidence in delivery so that small, incremental changes can be made at any time with trivial or no risk?
You have run-books
Run-books contain a set of procedures that typically state *when X happens to Y, do Z* and are fairly common with isolated operations teams and network operations centers. They might also be known as Standard Operating Procedures (SOPs).
Those who insist upon run-books often boast that they are critical to operational success.
In practice, it’s often incredibly hard to keep run-books up to date. After all, developers like writing code and invariably detest spending time writing docs, let alone a set of operational procedures based on what might happen one day.
Even worse: for a sophisticated organization that is well on its way to maturity, the frequency of incidents should fall dramatically, leaving little benefit in creating and maintaining volumes of documentation "just in case". Wouldn't that time be better spent developing more features and tests?
If your organization relies on run-books, then you’re not empowering developers to maintain their own stuff and are insulating them from critical information about how their code is performing.
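Where a run-book step really is mechanical, its "when X happens to Y, do Z" shape is exactly the shape of code, so one alternative is to encode the procedure as automation that is versioned and tested rather than read. A sketch under assumed names (the event names and remediation handlers below are invented for illustration):

```python
# Run-book entries ("when X happens to Y, do Z") encoded as a dispatch
# table of automated remediations. All event names, handlers, and
# messages are illustrative placeholders.

def rotate_logs(host: str) -> str:
    # In a real system this would actually rotate logs on the host.
    return f"rotated logs on {host}"


def restart_service(host: str) -> str:
    # Likewise, a real handler would restart the affected service.
    return f"restarted service on {host}"


# The "run-book": each alert type maps to its remediation.
RUNBOOK = {
    "disk_full": rotate_logs,
    "service_down": restart_service,
}


def handle(event: str, host: str) -> str:
    """Dispatch an alert to its automated remediation; escalate to a
    human when no automation exists for the event."""
    action = RUNBOOK.get(event)
    if action is None:
        return f"escalate: no automation for {event} on {host}"
    return action(host)
```

The point is not the trivial dispatch table but where it lives: in the service's own repository, owned by the developers, exercised by tests, instead of in a document that drifts out of date.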
Whether the company is small or large, one key observation from a summary of the 2019 State of DevOps Report is that companies with a higher degree of control and process tend to perform worse. By the way, it also highlights how this journey is not a purely technical one.
Everyone in an organization has their role to play, but it’s incredibly disheartening to be told to work harder if you’re being hamstrung by the system. Fixing process is a foundational and critical investment.
Eighty-five percent of the reasons for failure are deficiencies in the systems and process rather than the employee. The role of management is to change the process rather than badgering individuals to do better. — W. Edwards Deming
The third part of this series will talk about software related concerns, which are mainly about how to think about writing and maintaining code. You also haven’t seen the last of W. Edwards Deming.
p.s. In case you missed it, the first part in the series covered Operations.
This article originally appeared on Medium.