Why do things go right?

Photo by Josue Isai Ramos Figueroa on Unsplash

In his 2014 Safety I and Safety II: The past and future of safety management, Erik Hollnagel makes the argument that we should not (just) try to stop things from going wrong. Instead, we need to understand why most things go right, and then ensure that as much as possible indeed goes right. It seems so obvious. Yet it is light years away from how most organizations ‘do’ safety today, with their focus on low numbers of lagging indicators, incidents and injuries.

That said, many organizations have now begun to recognize the severe organizational deficiencies, cultural problems and ethical headaches that lag indicators create for them. Most will be familiar with the following sorts of things:

  • Numbers games and the hiding or renaming of injuries and incidents;
  • Counterproductive and credibility-straining sloganeering (‘Zero Harm!’);
  • Short-termism (driven by quarterly figures);
  • Creative case management and a lack of compassion for those who do get hurt in the course of work (think of the cynical use of ‘suitable duties’ to keep someone off the injury stats or lost time books);
  • The misdirection of accountability through sanctioning, dismissal or exclusion of those who have been hurt in the past (just cancel the contractor’s access card, for instance);
  • The statistical insignificance of any change in typical lagging indicators (‘We went from 3 lost time injuries last year to 1 this year! And we worked a total of 5 million person-hours!’ …Uhmm, right);
  • Organizational learning disabilities and cultures of risk secrecy;
  • Worker cynicism, mistrust and disenchantment;
  • Cases of outright management fraud that have got managers dismissed or even in jail.
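
The point about statistical insignificance in the list above can be checked with a few lines of arithmetic. What follows is a minimal sketch, not a definitive analysis. It assumes (this is not stated in the quote) that both years had the same exposure in person-hours, so that under an unchanged injury rate each injury is equally likely to fall in either year. The function name is hypothetical.

```python
from math import comb


def exact_two_sided_p(count_last_year: int, count_this_year: int) -> float:
    """Exact conditional test for two Poisson counts with equal exposure.

    ASSUMPTION (not from the article): both years had the same number of
    person-hours, so under the null hypothesis of an unchanged injury rate
    each injury is equally likely to land in either year.
    """
    total = count_last_year + count_this_year
    # Binomial(total, 0.5) probabilities for "k injuries landed in last year".
    pmf = [comb(total, k) * 0.5 ** total for k in range(total + 1)]
    observed = pmf[count_last_year]
    # Two-sided p-value: add up every split at least as unlikely as the observed one.
    return sum(p for p in pmf if p <= observed + 1e-12)


# The counts quoted in the list above: 3 lost time injuries, then 1.
p_value = exact_two_sided_p(3, 1)
print(f"p-value for a change from 3 to 1 lost time injuries: {p_value:.3f}")
```

Under these assumptions the p-value comes out at 0.625: a drop from 3 to 1 is what an unchanged injury rate would be expected to produce more often than not, so the celebrated 'improvement' cannot be told apart from noise.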

Erik’s insight is the hinge on which the transition to Safety Differently turns. Let’s not stare ourselves silly at the lowest possible injury and incident numbers, with the ridiculous and counterproductive Siren Song of ‘Zero Harm’ (see Sheratt & Dainty, 2017, for how ‘Zero’ is linked to more fatalities and serious injuries) as the ultimate paradise we want to reach. Deming said it long before anybody in the Resilience community said it: get rid of targets, slogans and exhortations. They get in the way of allowing your people to produce quality work (Deming, 1982). So instead, let’s learn why things go right and find out what we can do to make it even more so. Safety is not about the absence of negatives; it is about the presence of capacities. The field of Resilience Engineering, formally founded at a meeting in Söderköping in Sweden over a decade ago (Dekker, 2006), was of course driven by this logic. I witnessed Erik and others forcefully making this very point, from many different angles, for a week in a room full of peers and stakeholders.

One way to illustrate this point, as Erik indeed does (and others now do as well), is by way of a Gaussian, or normal, curve. The curve shows that the number of things that go wrong (the left side of the curve) is tiny. On the right side of the curve are the heroic, unexpected surprises (a Hudson River landing, for instance) that fall far outside what people would normally experience or have to deal with. In between, in the huge bulbous middle of the figure, sits the quotidian or daily creation of success. This is where good outcomes are made, despite the organizational, operational and financial obstacles, despite the rules, the bureaucracy and the common frustrations. This is where work can be hard, but is still successful. The way to make even further progress on safety, suggests this figure, is not by trying to make the red part of things that go wrong even smaller, but by understanding what accounts for the big middle part where things go right. And then enhancing the capacities that make it so. That way, we don’t make the red part smaller by making the red part smaller. We make the red part smaller by making the white part bigger. Research by René Amalberti tells us that it is indeed likely that this is the way to make progress on safety in already safe systems (Amalberti, 2001, 2006, 2013). In those systems, we have milked the recipes for how to prevent things from going wrong to the maximum already. We have many layers of protection in place. We have rules to the point of overregulation. We monitor, record, investigate and standardize the designs people work with. Ever more things targeted at the red part are not going to make it any smaller. The complexity of the system won’t let us. And in fact, doing more to make that part smaller with what we already know (more rules, more limits on people, more technology and barriers) may contribute to novel pathways to breakdown, accidents and failures (Dekker, 2011).

The way to make the red part (unwanted outcomes) on the left smaller is not by making it impossible for things to go wrong (as we’ve done almost everything in that regard already). We make the red part smaller by making the white part bigger: focusing on why things go right and enhancing the capacities that make it so. Figure by Kelvin Genn.

 

Shifting the paradigm

The question that most organizations yearn to have answered, though, is this: what is going to take the place of their long-held and easily communicated total recordable injury frequency rate? As Thomas Kuhn (1970) pointed out, people are unwilling to relinquish a paradigm, despite all its faults, if there is no plausible, viable alternative to take its place.

A few years back, I was working, together with some students, with a large health authority which employed some 25,000 people. The patient safety statistics were dire, if typical: one in thirteen of the patients who walked (or were carried) through the doors to receive care were hurt in the process of receiving that care. 1 in 13, or roughly 7.7%. These numbers weren’t unique, of course. They were also problematic. Because what exactly is ‘nosocomial harm,’ harm that originates in a hospital? What is ‘medical error’ and when is it putatively responsible for the adverse event that happened to the patient? Indeed, when exactly does the patient become that ‘one’ out of thirteen? These are important (and huge) epistemological and ontological questions. I have vociferously commented on them before (and I didn’t exactly make friends in the field, for example by claiming how safe gun owners are in comparison to doctors) (Dekker, 2007).

But it’s not the point here. When we asked the health authority what they typically found in the one case that went wrong—the one that turned into an ‘adverse event,’ the one that inflicted harm on the patient—here is what they came up with. After all, they had plenty of data to go on: one out of thirteen in a large healthcare system can add up to a sizable number of patients per day. So in the patterns that all this data yielded, they consistently found:

  • Workarounds
  • Shortcuts
  • Violations
  • Guidelines not followed
  • Errors and miscalculations
  • Unfindable people or medical instruments
  • Unreliable measurements
  • User-unfriendly technologies
  • Organizational frustrations
  • Supervisory shortcomings

It seemed a pretty intuitive and straightforward list. It was also a list that firmly belonged to a particular era in our evolving understanding of safety: that of the person as the weakest link, of the ‘human factor’ as a set of mental and moral deficiencies that only great systems and stringent supervision can meaningfully guard against. In that sort of logic, we’ve got great systems and solid procedures—it’s just those people who are unreliable or non-compliant:

  • People are the problem to control
  • We need to find out what people did wrong
  • We write or enforce more rules
  • We tell everyone to try harder
  • We get rid of bad apples

Many organizational strategies, to the extent that you can call them that, were indeed organized around these very premises. Poster campaigns that reminded people of particular risks they needed to be aware of, for instance. Or strict surveillance and compliance monitoring with respect to certain ‘zero-tolerance’ or ‘red-rule’ activities (e.g. hand hygiene, drug administration protocols). Or a ‘just culture’ process that got those lower on the medical competence hierarchy more frequently ‘just-cultured’ (code for suspended, demoted, dismissed, fired) than those with more power in the system. Or some miserably measly attention to supervisor leadership training.

We were of course interested to know the extent to which these investments in reducing the ‘one in thirteen’ had paid off. They hadn’t. The health authority was still stuck at one in thirteen.

What would Erik do?

This is when we asked the Erik Hollnagel question. We asked: “What about the other twelve? Do you even know why they go right? Have you ever asked yourself that question?” The answer we got was “no.” All the resources that the health authority had were directed toward investigating and understanding the ones that went wrong. There was organizational, reputational and political pressure to do so, for sure. And the resources to investigate the instances of harm were too meager to begin with. So this is all they could do. We then offered to do it for them. And so, in an acutely unscientific but highly opportunistic way, we spent time in the hospitals of the authority to find out what happened when things went well, when there was no evidence of adverse events or patient harm.

When we got back together after a period of weeks, we compared notes. At first we couldn’t believe it, thinking that what we had found was just a fluke, an irregular and rare irritant in data that should otherwise have been telling us something quite different. But it turned out that everybody had found that in the twelve cases that go right, that don’t result in an adverse event or patient harm, there were:

  • Workarounds
  • Shortcuts
  • Violations
  • Guidelines not followed
  • Errors and miscalculations
  • Unfindable people or medical instruments
  • Unreliable measurements
  • User-unfriendly technologies
  • Organizational frustrations
  • Supervisory shortcomings

It didn’t seem to make a difference! These things showed up all the time, whether the outcome was good or bad. It should not come as a surprise. Vaughan reminds us of this when she alludes to ‘the banality of accidents’: the interior life of organizations is always messy, only partially well-coordinated and full of adaptations, nuances, sacrifices and work that is done in ways that are quite different from any idealized image of it. When you lift the lid on that grubby organizational life, there is often no discernible difference between the organization that is about to have an accident or adverse event, and the one that won’t, or the one that just had one (Vaughan, 1999).
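
The reasoning here can be made concrete with a small sketch. The counts below are invented purely for illustration; they are not the health authority's data, and the variable and function names are hypothetical. They are chosen to mirror the one-in-thirteen ratio and to show how a factor can appear in almost every bad case and still tell us nothing, because it appears just as often in the good ones.

```python
def risk_of_harm(harmed: int, unharmed: int) -> float:
    """Fraction of cases in a group that ended in harm."""
    return harmed / (harmed + unharmed)


# HYPOTHETICAL counts (illustration only, not real data):
# 1,300 cases in total, 100 of them with harm, i.e. one in thirteen.
harmed_with_workaround = 90
unharmed_with_workaround = 1080
harmed_without_workaround = 10
unharmed_without_workaround = 120

# What an investigation of only the bad cases sees:
total_harmed = harmed_with_workaround + harmed_without_workaround
share_of_harmed_with_workaround = harmed_with_workaround / total_harmed

# What you see once the cases that went right are counted too:
risk_with = risk_of_harm(harmed_with_workaround, unharmed_with_workaround)
risk_without = risk_of_harm(harmed_without_workaround, unharmed_without_workaround)
risk_ratio = risk_with / risk_without

print(f"Workaround present in {share_of_harmed_with_workaround:.0%} of harmed cases")
print(f"Risk of harm with workaround:    {risk_with:.3f}")
print(f"Risk of harm without workaround: {risk_without:.3f}")
print(f"Risk ratio: {risk_ratio:.2f}")
```

With these made-up numbers, workarounds show up in 90% of the cases that went wrong, which looks damning if those are the only cases examined. Yet the risk of harm is the same with or without a workaround (a risk ratio of 1), so the workaround does not separate good outcomes from bad ones.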

This means that focusing on people as a problem to control—increasing surveillance, compliance and sanctioning—does little to reduce the number of negatives. As I relate in The Safety Anarchist (2018), we analyzed thirty adverse events in 380 consecutive cardiac surgery procedures with colleagues in Boston and Chicago (Raman et al., 2016). Despite 100% compliance with the preoperative surgical checklist, thirty adverse events occurred that were specific to the nuances of cardiac surgery and the complexities associated with the procedure, patient physiology and anatomy. Perhaps other adversities were prevented by completely compliant checklist behavior, even in these thirty cases. But we will never know.

You can also see this in measures of safety culture, which typically include rule monitoring and compliance. They actually don’t predict safety outcomes. One study by Norwegian colleagues, conducted in oil production, traced a safety culture survey which had inquired whether operations involving risk were carried out in compliance with rules and regulations (Antonsen, 2009). The survey had also asked whether deliberate breaches of rules and regulations were consistently met with sanctions. The answer to both questions had been a resounding ‘yes.’ Safety on the installation equaled compliance. Ironically, that was a year before that same rig suffered a significant, high-potential incident. Perceptions of compliance may have been great, but a subsequent investigation showed Vaughan’s ‘messy interior;’ the rig’s technical, operational and organizational planning were in disarray, the governing documentation was out of control, and rules were breached in opening a sub-sea well. Not that these negatives were necessarily predictive of the incident (indeed, we need to be wary of hindsight-driven reverse causality): the messy interior would have been present without an incident happening too.

More research in healthcare shows a disconnect between rule compliance as evidenced in surveys, and how well a hospital is actually doing in keeping its patients safe (Meddings et al., 2016). Hospitals that had signed on to a national patient safety project were given technical help—tools, training, new procedures and other support—to reduce two kinds of infections that patients can get during their hospital stay:

  • Central line associated blood stream infection (CLABSI) from devices used to deliver medicine into the bloodstream;
  • Catheter-associated urinary tract infection (CAUTI) from devices used to collect urine.

Using data from hundreds of hospitals, researchers showed that hospital units’ compliance scores did not correlate with how well the units did on preventing these two infections. As with that oil rig, the expectation had been that units with higher scores would do better on infection prevention. They didn’t. In fact, some hospitals where scores worsened showed improvements on infection rates. There appeared to be no association between compliance measurements and infection rates either way.

Identify and enhance the capacities that make things go right

But if these things don’t make a difference between what goes right and what goes wrong, then what does? We were still left with a relatively stable piece of data: one in thirteen went wrong, and kept going wrong. What explained the difference if it wasn’t the absence of negative things (violations, shortcuts, workarounds, and so forth)? This is not just an academic question. If you are a manager (or clinician, or especially patient) in this sort of system, you’d like to know. You would love to get your hands on the levers and push or nudge the system toward more good outcomes and further away from those few bad ones.

So we looked at our notes again. Because there was more in there. And we started holding it up against the literature that we knew, and some that we didn’t yet know. In the twelve cases that went well, we found more of the following than in the one that didn’t go so well:

  • Diversity of opinion and the possibility to voice dissent. Diversity comes in a variety of ways, but professional diversity (e.g. compared to gender and racial diversity) is the most important one in this context. Yet whether the team is professionally diverse or not, voicing dissent can be difficult. It is much easier to shut up than to speak up (Weber, MacGregor, Provan, & Rae, 2018). I was reminded of Ray Dalio, CEO of a large investment fund, who has actually fired people for not disagreeing with him. He said to his employees: “You are not entitled to hold a dissenting opinion…which you don’t voice” (Grant, 2016, p. 190).
  • Keeping a discussion on risk alive and not taking past success as a guarantee for safety. In complex systems, past results are no assurance for the same outcome today, because things may have subtly shifted and changed. Even in repetitive work (landing a big jet, conducting the fourth bypass surgery of the day), repetition doesn’t mean replicability or reliability: the need to be poised to adapt is ever-present (Woods, 2006). Making this explicit in briefings, toolboxes or other pre-job conversations that address the subtleties and choreographies of the present task, will help things go right.
  • Deference to expertise. Deference to expertise is generally deemed critical for maintaining safety. Signals of potential danger, after all, and of a gradual drift into failure, can be missed by those who are not familiar with the messy details of practice. Asking the one who does the job at the sharp end, rather than the one who sits at the blunt end somewhere, is a recommendation that comes from High Reliability Theory as well (Weick & Sutcliffe, 2007). Expertise doesn’t mean only front-line people. The size and complexity of some operations can require a collation of engineering, operational and organizational expertise, but high-reliability organizations push decision making down and around, creating a recognizable pattern of decisions ‘migrating’ to expertise.
  • Ability to say stop. As Barton and Sutcliffe found in an analysis of wildland firefighting, “a key difference between incidents that ended badly and those that did not was the extent to which individuals voiced their concerns about the early warning signs” (2009, p. 1339). Amy Edmondson at Harvard calls for the presence of ‘psychological safety’ as a crucial capacity in teams, one that allows members to safely speak up and voice concerns. In her work on medical teams, too, the presence of such capacities was much more predictive of good outcomes than the absence of non-compliance or other negative indicators (Edmondson, 1999).
  • Broken down barriers between hierarchies and departments. A point frequently made in the organizational literature, and also in the sociological postmortems of big accidents (from Barry Turner in the 1970s to Diane Vaughan recently), is also one of Deming’s reminders, as well as one from the literature on fundamental surprises (Lanir, 1986): the totality of intelligence required to foresee bad things is often present in an organization, but scattered across various units or silos (Woods, Dekker, Cook, Johannesen, & Sarter, 2010). Get people to talk to each other: research, operations, production, safety, personnel. Break down the barriers between them.
  • Don’t wait for audits or inspections to improve. If the team or organization waited for an audit or an inspection to discover failed parts or processes, they were way behind the curve. After all, you cannot inspect safety or quality into a process: the people who do the process create safety, every day (Deming, 1982). Subtle, uncelebrated expressions of expertise are rife (the paper cup on the flap handle of a big jet; the wire tie around the fence so the train driver knows where to stop to tip the mine tailings; draft beer handles on identical controls in a nuclear power plant control room, so as to know which is which; the home-tinkered redesigned crash cart in a hospital ward). These are among the kinds of improvements and ways in which workers ‘finish the design’ of their systems so that error traps are eliminated and things go well rather than badly.
  • Pride of workmanship, another of Deming’s points, is linked to the willingness and ability to improve without being prodded by audits or inspections. Teams that take evident pride in the products of their work (and the workmanship that makes it so) tended to end up with more good results. What can an organization do to support this? They can start by enabling their workers to do what they want to do and need to do, by removing unnecessary constraints and decluttering the bureaucracy surrounding their daily life on the job.

How much ‘more’ of this did we find in the twelve cases (out of thirteen) that went well? That is impossible to answer. As said, the ‘study’—such as it was—was scrambled and unscientific, an opportunistic deep-dive into a complex organization with all the unprepared person-power we could throw at it during a few hectic weeks. So you should see the list above not as conclusions, but as a set of hypotheses. Are these starting points for you and your organization to identify some of the capacities that make things go right? And if so, how would you enhance those capacities? What can you do to make them even better, more omnipresent, and more resilient? It is also an incomplete list. Perhaps you have found other capacities in your teams, in your people, and in your systems and processes that seem to account for good outcomes. What are they? What can you add? It is time to compare notes on a much wider scale—indeed to speed up and scale up our embrace of Safety II and identify and enhance the capacities that make things go right.

 

References

Amalberti, R. (2001). The paradoxes of almost totally safe transportation systems. Safety Science, 37(2-3), 109-126.

Amalberti, R. (2006). Optimum system safety and optimum system resilience: Agonistic or antagonistic concepts. In E. Hollnagel, D. D. Woods, & N. G. Leveson (Eds.), Resilience Engineering: Concepts and Precepts (pp. 253-274). Aldershot, UK: Ashgate Publishing Co.

Amalberti, R. (2013). Navigating safety: Necessary compromises and trade-offs — theory and practice. Heidelberg: Springer.

Antonsen, S. (2009). Safety culture assessment: A mission impossible? Journal of Contingencies and Crisis Management, 17(4), 242-254.

Barton, M. A., & Sutcliffe, K. M. (2009). Overcoming dysfunctional momentum: Organizational safety as a social achievement. Human Relations, 62(9), 1327-1356.

Dekker, S. W. A. (2006). Resilience engineering: Chronicling the emergence of confused consensus. In E. Hollnagel, D. D. Woods, & N. G. Leveson (Eds.), Resilience Engineering: Concepts and Precepts (pp. 77-92). Aldershot, UK: Ashgate Publishing Co.

Dekker, S. W. A. (2007). Doctors are more dangerous than gun owners: A rejoinder to error counting. Human Factors, 49(2), 177-184.

Dekker, S. W. A. (2011). Drift into failure: From hunting broken components to understanding complex systems. Farnham, UK: Ashgate Publishing Co.

Dekker, S. W. A. (2018). The Safety Anarchist: Relying on human expertise and innovation, reducing bureaucracy and compliance. London: Routledge.

Deming, W. E. (1982). Out of the crisis. Cambridge, MA: MIT Press.

Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), 350-383.

Grant, A. (2016). Originals: How non-conformists change the world. London: W. H. Allen.

Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed.). Chicago, IL: University of Chicago Press.

Lanir, Z. (1986). Fundamental surprise. Eugene, OR: Decision Research.

Meddings, J., Reichert, H., Greene, M. T., Safdar, N., Krein, S. L., Olmsted, R. N., . . . Saint, S. (2016). Evaluation of the association between Hospital Survey on Patient Safety Culture (HSOPS) measures and catheter-associated infections: Results of two national collaboratives. BMJ Quality and Safety, doi:10.1136/bmjqs-2015-005012.

Raman, J., Leveson, N. G., Samost, A. L., Dobrilovic, N., Oldham, M., Dekker, S. W. A., & Finkelstein, S. (2016). When a checklist is not enough: How to improve them and what else is needed. The Journal of Thoracic and Cardiovascular Surgery, 152(2), 585-592.

Sheratt, F., & Dainty, A. R. J. (2017). UK construction safety: A zero paradox. Policy and practice in health and safety, 15(2), 1-9.

Vaughan, D. (1999). The dark side of organizations: Mistake, misconduct, and disaster. Annual Review of Sociology, 25(1), 271-305.

Weber, D. E., MacGregor, S. C., Provan, D. J., & Rae, A. R. (2018). “We can stop work, but then nothing gets done.” Factors that support and hinder a workforce to discontinue work for safety. Safety Science, 108, 149-160.

Weick, K. E., & Sutcliffe, K. M. (2007). Managing the unexpected: Resilient performance in an age of uncertainty (2nd ed.). San Francisco: Jossey-Bass.

Woods, D. D. (2006). Essential characteristics of resilience. In E. Hollnagel, D. D. Woods, & N. G. Leveson (Eds.), Resilience Engineering: Concepts and Precepts (pp. 21-34). Aldershot, UK: Ashgate Publishing Co.

Woods, D. D., Dekker, S. W. A., Cook, R. I., Johannesen, L. J., & Sarter, N. B. (2010). Behind human error. Aldershot, UK: Ashgate Publishing Co.

5 thoughts on “Why do things go right?”

  1. Wonderful summary. Thank you for taking the time to write this and share with the wider community. Every little bit helps in our journey!

  2. Thanks for another important discussion. As I shared with my contacts, there’s an unfortunate element in that these discussions tend to be limited to those who already “get it.” Akin to “preaching to the choir.”

  3. It is of interest that you cite Dr Deming, as it had already occurred to me that safer work would naturally come from better work, and yet the greatest barrier to achieving this is not the unwilling worker nor even the willing worker but, as he demonstrated consistently, the manager of the system of work and the style of management that prevails. This follows from the work of Shewhart, focused on quality, that defined the means for managers to predict the performance of a process. Safety as currently practised looks at correcting the faults of a system in a retarded fashion and usually looks no further than the control of human error as manifested in outcomes. A system of management following a Deming Approach to Safety would be constantly improving the system and managing the variation within it, which is a basis for addressing the issues Erik Hollnagel has identified in his work on FRAM. The greatest obstacle to this is the prevailing style of management based on the short-termism you refer to. Deming and others such as Myron Tribus were there already, but the problems identified by Tribus writing about the Germ Theory of Management, and what Deming in his later book “The New Economics” called a tyranny and prison, are what get in the way. If your company is structured on principles derived from McCullum, Thomson, Weber and Taylor and striving for efficiency over effectiveness, then no amount of 4 hr workshopping on Just Culture and Employee Engagement is going to address the Train Wreck style and its underlying ethos of holding others to account for blame as the responsible delinquent. Only by applying the lens of SoPK and then transforming from a Me to We based organisation can silos be broken, fear driven out and the real breakthrough in safety achieved.

  4. Thank you very much indeed for that summary and the possibility to share it. It’s a long journey to address that way of thinking about safety to leaders and other decision makers (like politicians).

  5. I really liked this article, especially the wider point that the majority of work is done in the “middle”. Our availability bias/heuristic means we tend to focus on the extremes, rather than “the huge bulbous middle”. The software industry seems to have the reverse problem, in that we focus on the heroic rock star software developers. So suggestions about how to improve performance are often little more than exhortations to know everything and be better. The reality is that most software is created by average software developers (because that’s just statistics), who don’t quite know what they are doing (because they are inexperienced and the technology is new), working for organisations that don’t know what they want (because the world and markets are uncertain). The final seven points at the end of the article are a good place to start thinking about this middle work. The final point, “Pride of workmanship”, already neatly aligns with the “software craftsmanship” approach.
