Pembroke Refinery is an oil processing facility on the Milford Haven Waterway, in Wales. It was the site of a multiple-fatality explosion in 2011, and of the grounding of the Sea Empress in 1996, which released a major oil spill into Pembrokeshire Coast National Park. Prior to both of these events, Pembroke Refinery made national headlines with another explosion and fire, fortunately non-fatal.
Early on the morning of 24th July, 1994, there was a dry electrical storm raging above the refinery. Just before 9am, lightning struck one of the processing units, starting a fire. The lightning and subsequent fire did not directly cause any permanent harm, but they were the trigger for the events which followed. They created a disturbance in the normal operation of the refinery, and signalled that this was not a normal day at the office.
As part of the emergency shutdown, hydrocarbon flow was halted through part of the plant, the cracking unit. None of the vessels in this plant are supposed to be completely empty, even in shutdown, so a series of valves began to close. These closed valves prevented any of the vessels from completely draining. Once the plant was restarted and flow resumed, the valves opened again. One valve in particular, which we’ll call VALVE B, stayed closed. VALVE B enabled flow from a tank called the debutaniser into another tank called the naphtha splitter. With VALVE B closed, and flow restarted, the debutaniser began to fill up, whilst the naphtha splitter was being starved.
To make matters worse, the control system was getting incorrect signals from VALVE B. According to the control system, VALVE B was open, and everything was working normally. Inside the control room there were status indications for all of the equipment, and a separate monitor with a list of all the alarms, but there was no overview display of the plant. What I imagine when I think of a control room is a screen with a picture of all the tanks joined together, showing how full each one is, and how much is flowing between them. If the operators had such a picture they could have quickly worked out that even though VALVE B was supposedly open, nothing was flowing from the debutaniser to the naphtha splitter. The technology allowed such graphics to be displayed, but had been configured only to show raw data and bar charts. The operators could also only call up part of the overall process at a time. On the debutaniser information screen they could see that pressure was increasing. There was nothing calling their attention to the naphtha splitter, so no reason to bring up that particular screen, or to compare it to the debutaniser.
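The cross-check that an overview display would have made obvious can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of the refinery's control system: the `ValveReading` type, the `find_inconsistent_valves` function, the flow threshold, the "VALVE A" entry and all of the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ValveReading:
    name: str
    reported_open: bool     # what the control system believes (hypothetical field)
    downstream_flow: float  # measured flow past the valve, arbitrary units


def find_inconsistent_valves(readings, min_flow=0.1):
    """Return the names of valves reported open but passing no meaningful flow."""
    return [
        r.name
        for r in readings
        if r.reported_open and r.downstream_flow < min_flow
    ]


# Illustrative data only: VALVE A is invented, and the flows are not from the report.
readings = [
    ValveReading("VALVE A", reported_open=True, downstream_flow=12.5),
    ValveReading("VALVE B", reported_open=True, downstream_flow=0.0),
]

print(find_inconsistent_valves(readings))
```

The point of the sketch is that the check needs two pieces of information side by side, the valve status and the flow on the far side of it, which is exactly what the operators could not see on a single screen.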
As the debutaniser filled up with liquid, pressure increased and had to be relieved. The operators vented the debutaniser into the blowdown and flare system three times. Unlike at Texas City, where similar overspill caused a much worse explosion, all of the equipment was connected to flare towers. Rather than have a separate tower for each piece of equipment, there were towers for different types of product. The “sweet” tower dealt with light hydrocarbons. The “sour” tower dealt with gases that had significant amounts of hydrogen sulfide, and the “acid” tower dealt with mixed material that needed processing before it could be safely burned off.
Just like at Texas City, though, the blowdown system was designed primarily to handle overflow gases, not large amounts of liquid. Within each unit there was a drum for capturing liquid and allowing gas to proceed into the arterial pipework and eventually to the flare. Once this drum was full, liquid would enter the pipework. Because there were only three towers for lots of bits of equipment, this piping was fairly complicated. Even in an overflow situation pressure needed to be carefully managed so that there was a smooth flow to the tower. On 24th July, as the liquid from the third venting passed through one of four 90 degree elbow bends, the pipework failed. It didn’t help that changes a few years earlier to reduce the environmental impact of the plant prevented automatic removal of liquid from the blowdown system. It didn’t help that the pipes, not meant to handle liquid in the first place, were known to be corroded.
Twenty tons of hydrocarbon were released, a vapour cloud formed, and an explosion quickly followed. There were no fatalities. Partly this was due to luck, but to be fair to the plant management there was also solid contingency planning in place, and facilities in range of the explosion were designed to cope with blast damage.
It took two and a half days to put out the fire, but only a few hours for Health and Safety Executive (the United Kingdom body responsible for investigating accidents of this type) representatives to arrive on the scene.
The site was quite dangerous to investigate due to the extensive damage, but the control rooms were mostly intact. In fact they wouldn’t have been badly damaged at all if the earlier lightning strikes hadn’t disabled the air-conditioning system, requiring the protective door to be left open.
The proximate causes of the accident were identified as two process imbalances. There was more liquid going into the debutaniser than coming out of it, and more heat going in than coming out. Both of these speak to a loss of control. Some variation of chemical processes is normal, even desirable. Too much variation, in particular when two parameters drift badly in the wrong direction at the same time, leads to an unsafe situation.
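The liquid imbalance is just bookkeeping: whatever goes in and does not come out must be piling up inside the vessel. A minimal sketch of that calculation follows, assuming evenly spaced flow samples. The function names, the limit and the flow figures are hypothetical and are not taken from the investigation report.

```python
def accumulated_inventory(inflows, outflows, dt=1.0):
    """Net liquid accumulated in a vessel over a series of flow samples.

    inflows and outflows are flow rates sampled at the same instants;
    dt is the time between samples.
    """
    return sum((i - o) * dt for i, o in zip(inflows, outflows))


def is_imbalanced(inflows, outflows, limit, dt=1.0):
    """True when more liquid has accumulated than the chosen limit allows."""
    return accumulated_inventory(inflows, outflows, dt) > limit


# Made-up figures: feed continues at 10 units per step while the outlet stops.
inflows = [10.0, 10.0, 10.0, 10.0]
outflows = [10.0, 0.0, 0.0, 0.0]

print(accumulated_inventory(inflows, outflows))
print(is_imbalanced(inflows, outflows, limit=25.0))
```

The same arithmetic applies to heat, with energy flows in place of liquid flows.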
The report also found that the actions of the operators contributed to these imbalances. They didn’t quite understand what was going on, and so they didn’t take appropriate action to correct the drift. It would be easy (and very wrong) to stop there.
None of the operators went to work that day intending to do a bad job. In fact, they were displaying considerable expertise in coping with a highly unusual situation. They had contained a dangerous fire without anyone being hurt or damage to the plant, and they were in the process of restoring production. However, they were trying to manage within a plant and organisation with very low resilience.
“Resilience” is the capacity of something to cope with disturbance. When measuring the physical resilience of a component we consider the amount that it distorts in response to stress, and how quickly and perfectly it returns to its original shape. Resilience is a positive measure of safety, in the sense that it considers the presence of good things rather than the absence of bad things.
The physical plant at the refinery had very limited flexibility. In a simple flow cycle, it is very important that the system always has more capacity to remove pressure, mass and heat than it does to introduce pressure, mass and heat. This can be achieved by having a second control loop which shuts off inputs if the outputs are not flowing.
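To make the idea of a second control loop concrete, here is a toy simulation of a vessel whose outlet has stopped flowing, run with and without an interlock on the feed. It is a sketch under simplifying assumptions, not a model of the debutaniser: the function names, the 80% trip level, the capacity and the flow rates are all invented for illustration.

```python
def feed_permitted(level, outlet_flow, high_level=0.8, min_outlet_flow=0.1):
    """Secondary loop: block the feed when the vessel is nearly full
    and nothing is leaving it."""
    return not (level >= high_level and outlet_flow < min_outlet_flow)


def simulate(steps, feed, outlet_flow, interlock, capacity=100.0):
    """Return the vessel inventory after `steps` time steps."""
    inventory = 0.0
    for _ in range(steps):
        level = inventory / capacity  # fraction of capacity in use
        if interlock and not feed_permitted(level, outlet_flow):
            inflow = 0.0  # the interlock has shut off the input
        else:
            inflow = feed
        inventory = max(0.0, inventory + inflow - outlet_flow)
    return inventory


# Outlet blocked (flow of zero) while feed continues at 10 units per step.
without_interlock = simulate(20, feed=10.0, outlet_flow=0.0, interlock=False)
with_interlock = simulate(20, feed=10.0, outlet_flow=0.0, interlock=True)

print(without_interlock, with_interlock)
```

Without the interlock the inventory climbs past the vessel's capacity. With it, the feed stops once the high level is reached and the inventory holds steady below capacity.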
The design of the control room disempowered the operators. They had enough information to follow carefully written procedures, but they were not expected to adapt or improvise, and so the system didn’t provide them with the situational awareness needed to show initiative. It wasn’t just the lack of an overview display. There was also an alarm system that just dumped a long list of unprioritised warnings on the operators, and unreliable instruments, which meant that the operators had to form complicated mental models that included both disturbed processes and incorrect reporting of those processes.
Their training and their equipment didn’t let them step back and form a clear picture of what was going on in time to react appropriately.
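As a rough illustration of what a prioritised alarm list involves, the sketch below collapses repeated alarms and shows the most urgent ones first. Everything in it is hypothetical: the `Alarm` type, the tags, the messages and the priority scale (1 is most urgent) are assumptions for the example, not details of the refinery's alarm system.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    tag: str       # hypothetical identifier for the alarm source
    message: str
    priority: int  # 1 = most urgent (assumed scale)


def prioritise(alarms, top_n=3):
    """Collapse repeated alarms and return the most urgent ones first."""
    unique = {}
    for alarm in alarms:
        # Keep one entry per tag, preferring the most urgent instance.
        if alarm.tag not in unique or alarm.priority < unique[alarm.tag].priority:
            unique[alarm.tag] = alarm
    return sorted(unique.values(), key=lambda a: a.priority)[:top_n]


# Invented alarm flood, listed in the order the alarms arrived.
alarms = [
    Alarm("DEBUT-LEVEL", "Debutaniser level high", 1),
    Alarm("AIRCON", "Control room air-conditioning fault", 3),
    Alarm("DEBUT-LEVEL", "Debutaniser level high", 1),
    Alarm("FLARE-DRUM", "Flare drum level high", 1),
    Alarm("LAMP-TEST", "Panel lamp test overdue", 4),
]

for alarm in prioritise(alarms, top_n=2):
    print(alarm.priority, alarm.message)
```

A real system would need far more than sorting, but even this much separates the two alarms that matter from the three that don't.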
Finally, and quite literally, if the elbow bend hadn’t been corroded it might have flexed and returned to shape instead of shattering.
The lightning strike provided an initial disturbance to the system. The equipment and operators were put under unusual stress. Good business and good safety required a return to normal operating conditions as quickly and smoothly as possible. It didn’t have to be a lightning strike which caused the disturbance, and it was probably impossible to enumerate and protect against every single thing that could go wrong. Instead, the overall system needed the positive features of resilience so that it could respond to any disturbance.
Some of these features could have been pure hardware – a blowdown system with more capacity, and less crucial timing and balance. Modifications to the plant to reduce its environmental footprint had actually made this part of the system far less resilient. Other missing resilient features related to the support provided to the operators. Displays that provide increased situational awareness, and alarm systems which take care of prioritisation and interpretation so that operators can focus on the big picture increase resilience.
Whilst the report does not discuss training in depth, it does mention that the team inside the control room were flexible and multi-skilled. This is normally a marker of resilience. Individuals could switch roles to cover for missing staff, and managers were in the practice of “helping out” during upset situations. Unfortunately this individual flexibility didn’t translate into team resilience. Decisions were made on an individual, reactive basis without co-ordination.
Reading between the lines, the gap in trust and understanding was not between line management and control room operators, but between the designers and the whole operational team. All of the operators, including the line managers, were provided with the information that the designers thought they needed to have to operate the plant in its intended fashion.
Charles Perrow, writing in 1984, would have called this a “Normal Accident”. The tight coupling and interactive complexity of the system meant that the operators were not able to comprehend the problems adequately, so they made things worse instead of better. Resilience, one of the themes of Safety Differently and Safety II thinking, provides us with a way to manage safety in a way that can prevent such accidents. It is good to anticipate hazards and design systems and people that can cope with them, but it isn’t enough. We also need to look at safety as a positive attribute of our systems and people.
This post is a modified version of a segment from DisasterCast Episode 42.
Image: Colin Bell/geograph.org.uk