A few days ago my friend Will Gallego wrote a fantastic piece on how the industry is less safe and less stable. From the cultures we’ve developed to the systems we maintain, everything has been shaken by multiple years and multiple rounds of layoffs.

It’s been 7 years since I wrote a half-way useful blog post. Will’s commentary certainly raised a lot of feelings here and I want to add some book-ends to his piece: How we got here and where we need to go next.

So what happened?

It’s become fashionable to blame all of our current ills on the Zero Interest Rate Phenomenon, but free money really did mean free growth, and there are real, concrete things that have happened as a result since 2008 which we need to keep in mind.

Companies hired more people because they could afford to bring people in earlier in their careers and spend longer getting them to where they needed them to be.
Companies were free to solve problems with humans rather than building better systems.

In the preceding decade the industry had made significant strides in the area of safety. We had standardised the practice of post-incident reviews, developed just and no-blame cultures, and had finally (IMHO) started to grasp the differences between complex and complicated infrastructures.

Complicated systems are OK up to a point. They appear everywhere and they can mostly be reasoned about. We understand how they work. Given the starting conditions we can reasonably predict the outcomes.
Complex systems are not OK: They also appear everywhere but they are much harder to reason about. We may think we understand how they work but the interactions within the system make predicting outcomes difficult. And indeed the same inputs can result in different, unexpected outputs.

This was the foundation on which we had started to develop systems safety at large. Will himself was at the forefront of this effort.

And then the hiring sprees began.

More people means we could solve things faster, right? Right?

Well.. not quite.
Ostensibly, yes, it could, but hindsight has shown us that we lacked some key factors:

As existing teams grew and new teams were created, organisations didn’t have enough technical leaders in place to promote the same culture of safety.
As new people entered the industry they weren’t being exposed to the same risks and negative impacts to systems because we mitigated and avoided issues with humans.
Whereas in smaller organisations you have more forcing functions pulling teams together, in larger organisations teams drift apart which leads to less context sharing and more isolation.

As teams grew larger with less shared context, they started to solve the same problems around safety in different and sometimes incompatible ways.
The issues didn’t stop there. Because we failed to keep safety near the front of our design decisions many teams stopped thinking about it completely. One company I worked for during this time suffered multiple hours-long complete outages over the course of a year before they accepted that even the simple act of configuration changes in one supposedly safe system could have a site-wide blast radius.

We stopped building sufficient levels of safety into our systems
– Me, just now

My friends in the security space are no doubt standing unimpressed at this point, gesturing to the same landscape they’ve been dealing with for decades. I apologise for not listening to you sooner.

The final result of all of this exponential and organic smearing was competition between teams.
I believe we started to reach a tipping point around the time the COVID pandemic broke out. It’s ironic that the pandemic, as destructive as it was, may have helped the industry by slowing down changes for a period of time. Do I have evidence for this? No. Don’t ask for it. Call it the intuition of 25 years in this industry.

The tipping point came in two forms:

The limits of hiring and of the capacity of existing teams to continue absorbing failures.
The end of the pandemic and of ZIRP, which brought a reduction in corporate growth and less new work to do.

Less work meant teams now needed to work harder to justify their existence. As the first rounds of layoffs began many of us saw the net effect of this on the cultures of our organisations. You didn’t have to be the best at your job, you just had to be better than most of the rest. Time to compete!

Unfortunately, when everyone is sitting on a mountain of manure they’ve been cultivating for some time, nobody is left stench-free.

That sounds pretty terrible. Where are we now?

We’re not really in a great place.

10+ years of rapid growth of both teams and systems was bad. When our teams shrank, our systems didn’t become any less complicated. The people who are left at organisations are now holding the Bag of Holding of Turds. No matter how much we dig in, it’s never enough.
Teams barely have the resources to keep their heads above water. Many don’t even manage that.

This is especially true where centralised teams have developed and own platforms that provide services to other teams.

The advice I have is not going to sit well with anyone. You’ve been warned…

Oh dear. What do we do?

We need to start by changing our working models.
Instead of providing every piece of functionality that other teams request, central teams need to provide core functionality and contracts.

Example:
If you run a logging platform, you’re probably responsible for:

The ingestion of logs into a pipeline
The manipulation and redaction rules of those logs – or at least the application of rules written by others
The emission of those logs to one or more endpoints
The management of analytics and visualisation tooling
The compliance ownership and responsibility for that data

A new model would break apart the responsibility between the maintainers of a core platform and the teams that need to use it. The central team would be responsible for:

The uptime of the pipeline
The ingress contract (what format input data should take)
The ability for teams to define, test and deploy their own rules
The egress contract (how teams can define where data goes and what it looks like)
Base capabilities to use the data for incident debugging, medium term analytics, and long term audits

Teams who wish to use the platform would be responsible for:

Writing and managing their own processing rules
Ensuring the data they emit to the pipeline adheres to the contract
Understanding which analytics capabilities they need for the data
Owning and building the specific tooling outside the base capabilities
Ownership of the data and adherence to regulations
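
To make that split a little more concrete, here is a minimal sketch in Python of what the two halves might look like. Everything in it is made up for illustration: the fields in the ingress contract, the `Rule` type, and the `redact_email` rule are assumptions of mine, not any real logging platform's API. The platform side only enforces the contract and runs whatever rules teams have registered; the rule itself belongs to the team that wrote it.

```python
import re
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]

# Hypothetical ingress contract, owned by the platform team:
# every record must carry these fields with these types.
INGRESS_CONTRACT = {"timestamp": str, "service": str, "level": str, "message": str}


def contract_violations(record: Record) -> list[str]:
    """Return the ways in which a record breaks the ingress contract."""
    problems = []
    for field, expected in INGRESS_CONTRACT.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


@dataclass(frozen=True)
class Rule:
    """A processing rule written and owned by a consuming team."""
    owner: str
    apply: Callable[[Record], Record]


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_email(record: Record) -> Record:
    # Team-owned rule: mask anything that looks like an email address.
    return {**record, "message": EMAIL.sub("[redacted]", str(record["message"]))}


def process(record: Record, rules: list[Rule]) -> Record:
    """Platform side: enforce the contract, then apply the teams' rules in order."""
    problems = contract_violations(record)
    if problems:
        raise ValueError("; ".join(problems))
    for rule in rules:
        record = rule.apply(record)
    return record
```

Because `redact_email` is an ordinary function, the owning team can test it in their own build without the platform team being involved, which is rather the point.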

Likewise SRE teams are not solely and entirely responsible for the production readiness of all systems.
They should provide clear guidance on what “production ready” means.
They should develop automated tests to ensure teams are meeting those expectations.
Teams should be able to run those tests as part of their build processes and fix issues themselves.
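
As a rough sketch of what those automated tests could look like, assume each service describes itself in a small manifest. The manifest keys (`owner`, `runbook_url`, `healthcheck_path`, `alerts`) and the checks themselves are my own invention for illustration, not a standard; your definition of "production ready" will differ.

```python
import sys
from typing import Callable

Manifest = dict[str, object]
Check = Callable[[Manifest], bool]

# Hypothetical definition of "production ready", published by the SRE team.
CHECKS: dict[str, Check] = {
    "has an owning team": lambda m: bool(m.get("owner")),
    "has a runbook": lambda m: str(m.get("runbook_url", "")).startswith("https://"),
    "exposes a health check": lambda m: str(m.get("healthcheck_path", "")).startswith("/"),
    "has at least one alert": lambda m: bool(m.get("alerts")),
}


def failed_checks(manifest: Manifest) -> list[str]:
    """Return the names of the readiness checks this manifest fails."""
    return [name for name, check in CHECKS.items() if not check(manifest)]


def main(manifest: Manifest) -> int:
    """Exit-code style entry point so a build step can fail on it."""
    failures = failed_checks(manifest)
    for name in failures:
        print(f"NOT READY: {name}", file=sys.stderr)
    return 1 if failures else 0
```

A team would run something like `main` as a step in their own build and fix whatever it reports, rather than waiting for an SRE to review the service by hand.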

Is that all? That doesn’t sound so bad..

Don’t be silly. The worst is yet to come: we need to rip off some bandages.
All of the tech debt we’ve built up has to be paid off aggressively.

Some organisations have realised this and they’re taking action. The ones who haven’t are going to continue to struggle to advance safety and free up engineering time, and will lag behind their competitors. They will continue to suffer more outages over time while the rest of the industry slowly recovers.

This means some things will deliberately need to be left broken.
Others will get minimum Keep-The-Lights-On effort.
And others will simply need to be turned off even if people are using them, because there isn’t enough human capacity to support them adequately.
If TeamA is providing a service used by TeamB, but TeamA doesn’t have time to manage it, then TeamB should own it or accept that it won’t be available any more.

For many organisations this may be a fairly radical shift in ownership and responsibilities, but the current state of affairs across the industry, where people are being asked to do more and more with less and less, is ultimately untenable.

We will never be able to return to improving the safety and reducing the complexity and complicatedness of our systems until we make this shift and shed the baggage of the past.


Layoffs, industry safety, and where we go from here
