CfgMgmtCamp 2025 Ghent

How we troubleshoot difficult problems: cognition and understanding causality in distributed software systems
2025-02-04, 09:20–10:10, D.Aud (Main)

Troubleshooting can be one of the most difficult aspects of software operations.

There are several reasons for this. One is that our views of the systems we run are often mediated through limited forms of observability. At their best, such tools tend to show us only state, not how systems got into a state.

Another problem is that issues can be intermittent and difficult to reproduce. Many of the most challenging issues involve systems that are not broken, merely slow or consuming excessive resources.

There is uncertainty, and, often, there is a lot of pressure to get things resolved quickly. Much of the time, we don't really understand the system end-to-end when we begin an investigation.

So how do we do this work? This talk will dive into what we know about how the best troubleshooters succeed at their work, combining findings from cognitive science research on 'cognition in the wild' in a variety of domains with my own research on troubleshooting activities in software.

Laura has been a software engineer and Site Reliability Engineer for over two decades. She has worked at both small startups and large organisations such as Slack and Google. Laura has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know, and is currently completing her MSc in Human Factors and Systems Safety at Lund University. She lives in rural Ireland in a small village full of medieval ruins.