Monday, August 4, 2014

System safety engineering

Why do complex technologies so often fail, and fail in such unexpected ways? Why is it so difficult for hospitals, chemical plants, and railroads to design their processes in such a way as to dramatically reduce the accident rate? How should we attempt to provide systematic analysis of the risks that a given technology presents and the causes of accidents that sometimes ensue? Earlier posts have looked at the ways that sociologists have examined this problem, but how do gifted engineers address the issue?

Nancy Leveson's current book, Engineering a Safer World: Systems Thinking Applied to Safety (2012), is an outstanding introduction to system safety engineering. This book brings forward the pioneering work that she did in Safeware: System Safety and Computers (1994) with new examples and new contributions to the field of safety engineering.

Leveson's basic insight, here and in her earlier work, is that technical failure is rarely the result of the failure of a single component. Instead, failures result from multiple incidents involving the components, and unintended interactions among the components. So safety is a feature of the system as a whole, not of the individual sub-systems and components. Here is how she puts the point in Engineering a Safer World:
"Safety is a system property, not a component property, and must be controlled at the system level, not the component level." (kl 263)
Traditional risk and failure analysis focuses on specific pathways that lead to accidents, identifying potential points of failure and the singular "causes" of the accident (most commonly including operator error). Leveson believes that this approach is no longer helpful. Instead she argues for what she calls a "new accident model" -- a better and more comprehensive way of analyzing the possibilities of accident scenarios and the causes of actual accidents. This new conception has several important parts (kl 877-903):
  • expand accident analysis by forcing consideration of factors other than component failures and human errors
  • provide a more scientific way to model accidents that produces a better and less subjective understanding of why the accident occurred
  • include system design errors and dysfunctional system interactions
  • allow for and encourage new types of hazard analyses and risk assessments 
  • shift the emphasis in the role of humans in accidents from errors ... to focus on the mechanisms and factors that shape human behavior
  • encourage a shift in the emphasis in accident analysis from "cause" ... to understanding accidents in terms of reasons, that is, why the events and errors occurred
  • allow for and encourage multiple viewpoints and multiple interpretations when appropriate
  • assist in defining operational metrics and analyzing performance data
Leveson is particularly dissatisfied with the formal apparatus used in engineering and elsewhere to analyze safety and accident causation, and she argues that the field rests on a number of misleading conflations that need to be addressed. One of these is the conflation of reliability with safety. Reliability is an assessment of the performance of a component relative to its design. But Leveson points out that systems like automobiles, chemical plants, and weapons systems can consist entirely of highly reliable components and yet give rise to highly destructive and unanticipated accidents.
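The gap between reliability and safety can be illustrated with a toy simulation. The sketch below is not from Leveson's book; the tank, the sensor, the controller, and all of their parameters are hypothetical. Each component does exactly what its own specification says, yet when the sensor refreshes slowly relative to the fill rate, the tank overfills.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    """Hypothetical level sensor. Spec: refresh the reading every `period` ticks."""
    period: int
    last_reading: float = 0.0

    def read(self, tick: int, true_level: float) -> float:
        # Between refreshes the sensor keeps reporting the last sampled value.
        if tick % self.period == 0:
            self.last_reading = true_level
        return self.last_reading


def pump_should_run(reading: float, setpoint: float) -> bool:
    """Hypothetical controller. Spec: run the pump while the reading is below the setpoint."""
    return reading < setpoint


def simulate(sensor_period, fill_rate=1.0, setpoint=8.0, capacity=10.0, ticks=30):
    """Return (peak level, whether the tank exceeded its capacity)."""
    sensor = Sensor(period=sensor_period)
    level = 0.0
    peak = 0.0
    for tick in range(ticks):
        reading = sensor.read(tick, level)
        if pump_should_run(reading, setpoint):
            level += fill_rate  # the pump delivers exactly its rated flow
        peak = max(peak, level)
    return peak, peak > capacity


if __name__ == "__main__":
    for period in (1, 6):
        peak, overflowed = simulate(period)
        print(f"sensor period {period}: peak level {peak}, overflow={overflowed}")
```

No component fails in either run. The overflow in the second case comes from the interaction between the sensor's refresh period and the pump's fill rate, which is a property of the system design rather than of any single part.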

So thinking about accidents in terms of component failure is a serious misreading of the nature of the technologies with which we interact every day. Instead she argues that safety engineering must be systems engineering:
"The solution, I believe, lies in creating approaches to safety based on modern systems thinking and systems theory." (kl 88)
One important part of a better understanding of accidents and safety is a recognition of the fact of complexity in contemporary technology systems -- interactive complexity, dynamic complexity, decompositional complexity, and nonlinear complexity (kl 139). Each of these forms of complexity makes it more difficult to anticipate possible accidents, and more difficult to assign discrete accident pathways to the occurrence of an accident.
"Accidents are complex processes involving the entire sociotechnical system. Traditional event-chain models cannot describe this process adequately." (kl 496)
Leveson is highly critical of iterative safety engineering -- what she calls the "fly-fix-fly" approach. Given the severity of outcomes that are possible when it comes to control systems for nuclear weapons, the operations of nuclear reactors, or the air traffic control system, we need to be able to do better than simply improving safety processes following an accident (kl 148).

The model that she favors is called STAMP (Systems-Theoretic Accident Model and Processes; kl 1059). This model replaces the linear component-by-component analysis of technical devices with a system-level representation of their functioning. The STAMP approach begins with an effort to identify crucial safety constraints for a given system. (For example, in the Union Carbide plant at Bhopal, "never allow MIC to come in contact with water"; in design of the Mars Polar Lander, "don't allow the spacecraft to impact the planet surface with more than a maximum force" (kl 1074); in design of public water systems, "water quality must not be compromised" (kl 1205).) Once the constraints are specified, the issue of control arises: what are the internal and external processes that ensure that the constraints are continuously satisfied? This devolves into a set of questions about system design and system administration: the instrumentation that is developed to measure compliance with the constraint, and the management systems that are in place to ensure continuous compliance.
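One way to picture this bookkeeping is as a table that pairs each safety constraint with the controller responsible for enforcing it and the feedback that controller relies on. The sketch below is only an illustration under stated assumptions: the three constraints paraphrase the examples quoted above, but the controller and feedback assignments are hypothetical placeholders, not taken from the book.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConstraintControl:
    """One safety constraint plus who enforces it and what feedback they receive."""
    constraint: str
    controller: Optional[str] = None                     # None: nobody is assigned
    feedback: List[str] = field(default_factory=list)    # instrumentation, reports


def find_gaps(controls):
    """List constraints that have no controller, or a controller with no feedback."""
    findings = []
    for c in controls:
        if c.controller is None:
            findings.append(f"no controller enforces: {c.constraint}")
        elif not c.feedback:
            findings.append(f"{c.controller} has no feedback for: {c.constraint}")
    return findings


# Hypothetical assignments, for illustration only.
CONTROLS = [
    ConstraintControl(
        "never allow MIC to come in contact with water",
        controller="plant operations (hypothetical)",
        feedback=["tank pressure gauge (hypothetical)"],
    ),
    ConstraintControl(
        "do not exceed maximum impact force on landing",
        controller="descent software (hypothetical)",
    ),
    ConstraintControl("water quality must not be compromised"),
]

if __name__ == "__main__":
    for finding in find_gaps(CONTROLS):
        print(finding)
```

A check of this kind only flags missing entries. Whether a listed controller and its feedback are actually adequate to keep the constraint satisfied is the substantive question that the analysis itself has to answer.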
Also of interest in the book is Leveson's description of a new systems-level way of analyzing the hazards associated with a device or technology, STPA (System-Theoretic Process Analysis) (kl 2732). She describes STPA as the hazards analysis associated with the risks identified by STAMP:
STPA has two main steps:
  1. Identify the potential for inadequate control of the system that could lead to a hazardous state.
  2. Determine how each potentially hazardous control action identified in step 1 could occur. (kl 2758)
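The two steps above can be sketched as a small worksheet generator. This is a minimal illustration, not Leveson's tooling: the control actions (`brake`, `steer`) and the analyst's judgments are hypothetical, and the four categories are a paraphrase of the general ways Leveson describes a control action becoming hazardous. The judgment calls in both steps still belong to the analyst; the code only organizes the prompts.

```python
from itertools import product

# Paraphrased categories of inadequate control (wording is an assumption).
UCA_CATEGORIES = (
    "required action not provided",
    "unsafe action provided",
    "provided too early, too late, or out of sequence",
    "stopped too soon or applied too long",
)


def step1_candidates(control_actions):
    """Step 1: enumerate every (action, category) pair for the analyst to review."""
    return list(product(control_actions, UCA_CATEGORIES))


def step2_worksheet(candidates, is_hazardous):
    """Step 2: for each pair judged hazardous, open an empty list of causal scenarios."""
    return {pair: [] for pair in candidates if is_hazardous(pair)}


def analyst_judgment(pair):
    """Hypothetical stand-in for the analyst's decision about which pairs are hazardous."""
    action, category = pair
    return action == "brake" and category != "unsafe action provided"


# Hypothetical control actions for a single autonomous vehicle.
candidates = step1_candidates(["brake", "steer"])
worksheet = step2_worksheet(candidates, analyst_judgment)

# The analyst then records how each hazardous control action could come about.
worksheet[("brake", "required action not provided")].append(
    "obstacle missing from the controller's process model (hypothetical cause)"
)
```

Crossing every action with every category is deliberately exhaustive: it forces each combination to be considered and either ruled out or carried forward, instead of starting from the failures that happen to come to mind.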
Here is an example of the process through which an STPA risk analysis proceeds for NASA (kl 2995).

It would be very interesting to see how an engineer would employ the STAMP and STPA methodologies to evaluate the risks and hazards associated with swarms of autonomous vehicles. Each vehicle is a system that can be analyzed using the STAMP methodology. But an expressway carrying hundreds of autonomous vehicles (perhaps interspersed with less predictable human drivers) is likewise a system with complex characteristics.

Each individual vehicle has a hierarchical system of control designed to ensure safe transportation of its passengers and the vehicle itself; what are the failure modes for this control system? And what about the swarm -- given that each vehicle is responsive to the other vehicles around it, how will individual cars respond to unusual circumstances (a jack-knifed truck blocking all three lanes, let's say)? It would appear that autonomous vehicles create the kinds of novel hazards with which Leveson begins her book -- complexity, non-linear relationships, emergent properties of the whole that are unexpected given the expected operations of the components. The fly-fix-fly approach would suggest deploying a certain number of experimental vehicles and then evaluating their interactions in real-world settings. A more disciplined approach using the methodologies of STAMP and STPA would make systematic efforts to identify and control the pathways through which accidents can occur.

Here is a simulated swarm of autonomous vehicles:

But accidents happen; neither software nor control systems are perfect. So what would be the result of one disabling fender-bender in the intersection, followed by a half dozen more; followed by a gigantic pileup of robo-cars?
