What causes accidents in a slowly drifting system: human errors or technical issues?

Y. Man
Published in UX Collective
Mar 17, 2022


I recently watched a movie called Boîte Noire, which follows a black-box analyst on a mission to uncover the reason behind the deadly crash of a brand-new aircraft. The story immediately reminded me of the Boeing 737 MAX accidents, which cost hundreds of lives, because they share many similarities. The movie raises some good questions, such as how we can understand a complex system’s failure. Is it because someone made a fatal human mistake, or because some piece of unreliable technology was activated and started to move the needle? Historically, we have often claimed that these are the dominant factors contributing to accidents, particularly human error. For example, 75–96% of marine casualties are attributed to some form of human error (Rothblum, 2000), and 94% of deadly car crashes are attributed to some type of human error (National Public Radio, 2017). But is it really so, or can we understand it in different ways? In this article, I would like to walk you through one of the 737 MAX accidents and explain why human errors and technical issues are not really the cause but rather symptoms of deeper problems in the whole socio-technical system (Dekker, 2017). We will also talk about how we can view the chain of incidents from a system perspective (Dekker, 2011), because I believe it is critical for human factors researchers to take a system approach to understanding the “causal relationships” we directly observe in real life.

Lion Air Flight 610, operated with one of the world’s newest and best-selling aircraft, the Boeing 737 MAX 8, crashed into the Java Sea on October 29, 2018. In hindsight, many reports have drawn attention to the MCAS system, because its activation under certain conditions can make manual control of the airplane extremely difficult (essentially, every time the pilots tried to trim the nose up, the automatic system would push it back down, causing a series of uncommanded dives), or to human errors, such as how the pilots failed to notice the “salient” movement of the trim wheel and its distinctive sound (when the stabilizer is automatically controlled, a large wheel located between the pilots spins and makes a sound). Some even said that the accident was caused by technical issues and human errors, so that with more reliable systems and more competent, better-trained operators, it could have been avoided. But what do you think?

In the cockpit of the Boeing 737 MAX 8, a large wheel located between the pilots indicates when the stabilizer is being automatically controlled.
Photo by JC Gellidon on Unsplash

What we know is that the flaps were up and the autopilot was off. The pilots could not identify or solve the root problem when they noticed unreliable airspeed and altitude readings soon after takeoff. The first officer asked whether they should request a return to the airport, but the captain decided to continue. Their communication with air traffic control was somewhat delayed and carried little useful information beyond the statement that they had a flight control problem and unreliable instrument readings. Based on data from the cockpit voice recorder, the pilots were searching for a checklist for mismatched airspeed, which did not actually exist.

There were also traces of technical failure in this accident, as in many others. The left angle-of-attack sensor, which helps determine whether there is sufficient lift and feeds the calculation of airspeed and altitude, had recently been replaced, but it was either incorrectly installed or faulty. Something deeper than the technical aspect is that no one really ensured it was successfully calibrated, even though a procedure and sign-offs were in place.

The crew who flew the same plane hours before the accident experienced the same issue, partly due to the sensor failure, but they realized that the stabilizer was not working correctly and used the cutout switches to disengage the automatic trim system, which helped them avoid a fatal accident.

However, they decided to continue their flight instead of returning to the airport, which was ten minutes away. Although the issue was reported after landing, the pilots only mentioned the symptom of unreliable readings, nothing about the stabilizer or how they solved it.

In addition to all these human errors and technical issues, MCAS was a main target of blame in much of the coverage of the accident. How did it come into being in the first place? Boeing modified the existing 737 in a hasty manner to compete with a competitor’s newly launched fuel-efficient product. The larger, more fuel-efficient engines had to be moved forward on the wings to maintain ground clearance, but the new position caused the aircraft to pitch up under some circumstances. MCAS was therefore born to automatically force the nose down by adjusting the trim.

Boeing persuaded the authorities that the MAX model handled the same way as all previous 737s, in order to reduce the need for costly pilot training. They treated MCAS as a silently running background program that was not worth mentioning in the flight control manuals. So there was really no checklist for the pilots to follow during those tense moments.

Boeing believed that activation of MCAS would be an extremely low-probability event, because all three of the following conditions had to be met: (1) an excessive angle of attack reported by the sensor, (2) autopilot off, and (3) flaps up. However, the design of MCAS turned out to have some fatal problems. One of them was the lack of a fail-safe: the system relied on only one angle-of-attack sensor, even though there are two on the airplane. In this accident, MCAS relied on the faulty left sensor, and that alone made Flight 610 meet all three conditions. MCAS kicked in and the plane started to pitch down automatically.
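To make the single-sensor problem concrete, here is a minimal illustrative sketch, not Boeing’s actual logic: the function names, thresholds, and disagreement rule are my own assumptions. It contrasts an activation check that reads only one angle-of-attack sensor with one that cross-checks both.

```python
# Illustrative sketch only: hypothetical names and thresholds, not the real MCAS implementation.

AOA_LIMIT_DEG = 15.0       # assumed "excessive angle of attack" threshold
DISAGREE_LIMIT_DEG = 5.0   # assumed maximum allowed disagreement between the two sensors


def single_sensor_activation(left_aoa_deg, autopilot_on, flaps_up):
    """Single-sensor check: one faulty reading can satisfy all three conditions."""
    return left_aoa_deg > AOA_LIMIT_DEG and not autopilot_on and flaps_up


def cross_checked_activation(left_aoa_deg, right_aoa_deg, autopilot_on, flaps_up):
    """Cross-checked variant: disagreeing sensors inhibit activation instead of triggering it."""
    sensors_agree = abs(left_aoa_deg - right_aoa_deg) < DISAGREE_LIMIT_DEG
    excessive_aoa = min(left_aoa_deg, right_aoa_deg) > AOA_LIMIT_DEG
    return sensors_agree and excessive_aoa and not autopilot_on and flaps_up


# A faulty left sensor reporting an implausibly high angle of attack:
print(single_sensor_activation(21.0, autopilot_on=False, flaps_up=True))           # True
print(cross_checked_activation(21.0, 1.0, autopilot_on=False, flaps_up=True))      # False
```

The point of the sketch is purely structural: in the single-sensor version, one faulty input satisfies the angle-of-attack condition on its own, while the cross-checked variant treats sensor disagreement as a reason to stay inactive.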

Didn’t Boeing have any awareness of this issue? They did, but they assumed that pilots would instantly recognize that an uncommanded nose-down trim situation had something to do with MCAS and respond quickly. Boeing expected the noticeable design of the spinning wheel and its accompanying sound to sufficiently warn pilots that the stabilizer system was automatically adjusting the trim. However, the pilots of Flight 610 were giving their full attention to the faulty readings and the checklist. It is reasonable to believe that their cognitive resources were exhausted to some extent, so they did not act in the way the Boeing designers assumed they would. Following this lead, the investigation found that the first officer had a poor training record, even though the captain’s training results were normal. Unfortunately, during the troubleshooting process, the captain wanted to locate the checklist himself and had to hand over the flying task to the first officer, who clearly needed additional training.

The narrative could go on, but after hearing all these factors that directly or indirectly contributed to the accident, do you think it can still be simply explained by human errors or technical failures, as most reports would have it?

The conventional way to understand accidents is essentially a person approach to safety (rather than a system approach), in which attention is largely given to the problems in humans (Le Coze, 2013). This easily induces hindsight bias in accident analysis, e.g., “the operator should have… but he/she did not see / smell / hear / understand it”. With this approach in mind, we often hear that people get blamed because they are not like machines; they have variability. However, variability is what makes our systems function well most of the time, and when things are not working well, human error is the price to pay (Hollnagel, 2002). One good example in this Boeing case is that the previous crew noticed the problems with the automatic trim system and successfully adapted to them. It was their variability that actually saved people’s lives. But imagine that disengaging the automatic trim system had added more complexity to the picture; would they then be blamed for the same actions?

Another insight we can gain from this accident is that there are deep organizational issues concerning safety culture, training, and management, and they usually lie beneath the observed deviations in human performance and the design flaws (Reason, 2000; Hopkins, 2014). However, capturing and solving these much less salient issues is not an easy task. Here is why: a system fails because it has been on its way to failure, most likely at a slow pace. Small incremental changes attract no attention, because the ill-structured system keeps working, until at some point it crosses the safety boundary and a system-wide collapse, the accident, happens. That is what Dekker and Pruchnicki (2014) described as systems slowly “drifting into failure”, and they argued that the system failed because it was successful. This is what we see in the Flight 610 accident: things were slowly drifting towards failure on every dimension, but since everything was still working, nobody really questioned it until the moment it went over the safety boundary. For instance, introducing larger engines (to increase efficiency), putting them in new positions (to increase their distance to the ground), and creating a new software program, MCAS (to solve the pitching-up issue introduced by the new engine positions): these linked incremental steps all aimed to solve a certain local issue. On the surface they look reasonable and rational, and the system might keep working safely as usual. Nevertheless, the problem is that local adaptiveness can create an illusion of assistance or a miscalibration of the dynamic situation, as it may lead decision-makers to ignore the mal-adaptiveness on a global scale (Dekker and Pruchnicki, 2014); the system is slowly drifting towards the boundary. Therefore, one important design insight for any complex, high-risk human-technology system is to enhance the system’s capability to stay within the safety boundary during such an incubation period. That requires us to take a holistic approach to understanding the issues we used to simply label as “human errors” or “technical failures”.

References:

Dekker, S. (2011). Drift into Failure: From Hunting Broken Components to Understanding Complex Systems. Farnham: Ashgate Publishing Co.

Dekker, S. (2017). The field guide to understanding ‘human error’. CRC Press.

Dekker, S., & Pruchnicki, S. (2014). Drifting into failure: Theorising the dynamics of disaster incubation. Theoretical Issues in Ergonomics Science, 15, 534–544. https://doi.org/10.1080/1463922X.2013.856495

Hollnagel, E. (2002). Understanding accidents: From root causes to performance variability. Paper presented at the IEEE 7th Conference on Human Factors and Power Plants, Scottsdale, AZ, USA.

Hopkins, A. (2014). Issues in safety science. Safety Science, 67, 6–14. https://doi.org/10.1016/j.ssci.2013.01.007

Le Coze, J.-C. (2013). New models for new times. An anti-dualist move. Safety Science, 59, 200–218. https://doi.org/10.1016/j.ssci.2013.05.010

National Public Radio. (2017). Unsafe driving leads to jump in highway deaths, study finds.

Reason, J. (2000). Human error: Models and management. BMJ: British Medical Journal, 320(7237), 768–770. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1117770/

Rothblum, A. M. (2000). Human Error and Marine Safety. Retrieved from http://bowles-langley.com/wp-content/files_mf/humanerrorandmarinesafety26.pdf
