
Thursday, July 18, 2019

Safety and accident analysis: Longford


Andrew Hopkins has written a number of fascinating case studies of industrial accidents, usually in the field of petrochemicals. These books are crucial reading for anyone interested in arriving at a better understanding of technological safety in the context of complex systems involving high-energy and tightly-coupled processes. Especially interesting is his Lessons from Longford: The ESSO Gas Plant Explosion. The Longford gas plant suffered an explosion and fire in 1998 that killed two workers, badly injured others, and interrupted the supply of natural gas to the state of Victoria for two weeks. Hopkins is a sociologist, but has developed substantial expertise in the technical details of petrochemical refining plants. He served as an expert witness in the Royal Commission hearings that investigated the accident. The accounts he offers of these disasters are genuinely fascinating to read.

Hopkins makes the now-familiar point that companies often seek to lay responsibility for a major industrial accident on operator error or malfeasance. This was Esso's defense concerning its corporate liability in the Longford disaster. But, as Hopkins points out, the larger causes of failure go far beyond the individual operators whose decisions and actions were proximate to the event. Training, operating plans, hazard analysis, availability of appropriate onsite technical expertise -- these are all the responsibility of the owners and managers of the enterprise. And regulation and oversight of safety practices are the responsibility of state agencies. So it is critical to examine the operations of a complex and dangerous technology system at all these levels.

A crucial part of management's responsibility is to engage in formal "hazard and operability" (HAZOP) analysis. "A HAZOP involves systematically imagining everything that might go wrong in a processing plant and developing procedures or engineering solutions to avoid these potential problems" (26). This kind of analysis is especially critical in high-risk industries including chemical plants, petrochemical refineries, and nuclear reactors. It emerged during the Longford accident investigation that HAZOP analyses had been conducted for some aspects of risk but not for all -- even in areas where the parent company Exxon was itself already fully engaged in analysis of those risky scenarios. The risk of embrittlement of processing equipment when exposed to super-chilled conditions was one that Exxon had already drawn attention to at the corporate level because of prior incidents.
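It may help to make concrete what the bookkeeping of a HAZOP looks like. The sketch below is a minimal, hypothetical illustration in Python: it pairs the process parameters of a single plant node with standard guide words to enumerate the deviations a study team would then analyze. The node name, parameters, and guide words are generic assumptions of mine, not Esso's or Exxon's actual worksheets.

```python
# Minimal, hypothetical sketch of HAZOP-style bookkeeping: pair each process
# parameter of a node with standard guide words to enumerate candidate
# deviations. The node and parameters are illustrative, not Longford's.
from itertools import product

GUIDE_WORDS = ["NO", "MORE", "LESS", "REVERSE", "OTHER THAN"]

def hazop_deviations(node, parameters):
    """Enumerate placeholder worksheet rows for one plant node; causes,
    consequences, and safeguards are filled in by the study team."""
    return [
        {"node": node, "parameter": p, "guide_word": g,
         "deviation": f"{g} {p}",
         "causes": None, "consequences": None, "safeguards": None}
        for p, g in product(parameters, GUIDE_WORDS)
    ]

rows = hazop_deviations("heat exchanger (hypothetical node)",
                        ["flow", "temperature", "pressure"])
print(len(rows), "deviations to review; e.g.,", rows[6]["deviation"])
```

The point of the exercise is completeness: a deviation such as "LESS temperature" in a heat exchanger is exactly the kind of super-chilling scenario that, on Hopkins's account, was never systematically examined for plant 1.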

A factor that Hopkins judges to be crucial to the occurrence of the Longford Esso disaster is the decision made by management to remove engineering staff from the plant to a central location where they could serve a larger number of facilities "more efficiently":
A second relevant change was the relocation to Melbourne in 1992 of all the engineering staff who had previously worked at Longford, leaving the Longford operators without the engineering backup to which they were accustomed. Following their removal from Longford, engineers were expected to monitor the plant from a distance and operators were expected to telephone the engineers when they felt a need to. Perhaps predictably, these arrangements did not work effectively, and I shall argue in the next chapter that the absence of engineering expertise had certain long-term consequences which contributed to the accident. (34)
One result of this decision was that, when the Longford incident began, there were no engineering experts on site who could correctly identify the risks it created. Technicians therefore restarted the process by reintroducing warm oil into the super-chilled heat exchanger. The metal had become brittle as a result of the extremely low temperatures and cracked, leading to the release of fuel and subsequent explosion and fire. As Hopkins points out, Exxon experts had long been aware of the hazards of embrittlement. However, it appears that the operating procedures developed by Esso at Longford ignored this risk, and operators and supervisors lacked the technical/scientific knowledge to recognize the hazard when it arose.

The topic of "tight coupling" (the tight interconnection across different parts of a complex technological system) comes up frequently in discussions of technology accidents. Hopkins shows that the Longford case gives a new spin to this idea. In the case of the explosion and fire at Longford, it turned out to be very important that plant 1 was interconnected by numerous plumbing connections to plants 2 and 3. This meant that fuel from plants 2 and 3 continued to flow into plant 1 and greatly extended the length of time it took to extinguish the fire. Plant 1 had to be fully isolated from plants 2 and 3 before the fire could be extinguished (or plants 2 and 3 could be restarted), and there were so many plumbing connections among them, poorly understood at the time of the fire, that isolating them took a great deal of time (32).
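A toy model can make the isolation problem vivid. In the sketch below (my own illustration, with invented connection names rather than Longford's actual piping), the plants are nodes and the plumbing connections are edges; isolating plant 1 means finding and closing every edge that crosses its boundary, which is exactly the step that was slow and poorly documented during the fire.

```python
# Toy model of interconnection: plants as nodes, plumbing as edges. Isolating
# one plant means closing every connection that crosses its boundary. The
# connection names are invented, not Longford's actual piping.
connections = {
    ("plant 1", "plant 2"): ["rich oil line", "flare header", "fuel gas line"],
    ("plant 1", "plant 3"): ["condensate line", "utility tie-in"],
    ("plant 2", "plant 3"): ["shared compressor suction"],
}

def isolation_list(target):
    """List every connection that must be closed to isolate `target`."""
    return [f"{a} <-> {b}: {line}"
            for (a, b), lines in connections.items()
            for line in lines
            if target in (a, b)]

# The more boundary-crossing connections there are -- and the fewer of them
# that are documented -- the longer the isolation, and hence the fire, lasts.
for item in isolation_list("plant 1"):
    print(item)
```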

Hopkins addresses the issue of government regulation of high-risk industries in connection with the Longford disaster. Writing in 1999 or so, he recognizes the trend towards "self-regulation" in place of government rules stipulating the operation of various industries. He contrasts this approach with deregulation -- the effort to allow the issue of safe operation to be governed by the market rather than by law.
Whereas the old-style legislation required employers to comply with precise, often quite technical rules, the new style imposes an overarching requirement on employers that they provide a safe and healthy workplace for their employees, as far as practicable. (92)
He notes that this approach does not necessarily reduce the need for government inspections; but the goal of regulatory inspection will be different. Inspectors will seek to satisfy themselves that the industry has done a responsible job of identifying hazards and planning accordingly, rather than looking for violations of specific rules. (This parallels to some extent his discussion of two different philosophies of audit, one of which is much more conducive to increasing the systems-safety of high-risk industries; chapter 7.) But his preferred regulatory approach is what he describes as "safety case regulation". (Hopkins provides more detail about the workings of a safety case regime in Disastrous Decisions: The Human and Organisational Causes of the Gulf of Mexico Blowout, chapter 10.)
The essence of the new approach is that the operator of a major hazard installation is required to make a case or demonstrate to the relevant authority that safety is being or will be effectively managed at the installation. Whereas under the self-regulatory approach, the facility operator is normally left to its own devices in deciding how to manage safety, under the safety case approach it must lay out its procedures for examination by the regulatory authority. (96)
The preparation of a safety case would presumably include a comprehensive HAZOP analysis, along with procedures for preventing or responding to the occurrence of possible hazards. Hopkins reports that the safety case approach to regulation is being adopted by the EU, Australia, and the UK with respect to a number of high-risk industries. This discussion is highly relevant to the current debate over aircraft manufacturing safety and the role of the FAA in overseeing manufacturers.

It is interesting to realize that Hopkins is implicitly critical of another of my favorite authors on the topic of accidents and technology safety, Charles Perrow. Perrow's central idea of "normal accidents" brings along with it a certain pessimism about the ability to increase safety in complex industrial and technological systems; accidents are inevitable and normal (Normal Accidents: Living with High-Risk Technologies). Hopkins takes a more pragmatic approach and argues that there are engineering and management methodologies that can significantly reduce the likelihood and harm of accidents like the Esso gas plant explosion. His central point is that we don't need to be able to anticipate a long chain of unlikely events in order to identify the hazards in which these chains may eventuate -- for example, loss of coolant in a nuclear reactor or loss of warm oil in a refinery process. For these end states of many different possible accident scenarios, procedures need to be in place to guide the responses of engineers and technicians when "normal accidents" occur (33).

Hopkins highlights the challenge to safety created by the ongoing modification of a power plant or chemical plant; later modifications may create hazards not anticipated by the rigorous accident analysis performed on the original design.
Processing plants evolve and grow over time. A study of petroleum refineries in the US has shown that "the largest and most complex refineries in the sample are also the oldest ... Their complexity emerged as a result of historical accretion. Processes were modified, added, linked, enhanced and replaced over a history that greatly exceeded the memories of those who worked in the refinery." (33)
This is one of the chief reasons why Perrow believes technological accidents are inevitable. However, Hopkins draws a different conclusion:
However, those who are committed to accident prevention draw a different conclusion, namely, that it is important that every time physical changes are made to plant these changes be subjected to a systematic hazard identification process. ...  Esso's own management of change philosophy recognises this. It notes that "changes potentially invalidate prior risk assessments and can create new risks, if not managed diligently." (33)
(I believe this recommendation conforms to Nancy Leveson's theories of system safety engineering as well; link.)

Here is the causal diagram that Hopkins offers for the occurrence of the explosion at Longford (122).


The lowest level of the diagram represents the sequence of physical events and operator actions leading to the explosion, fatalities, and loss of gas supply. The next level represents the organizational factors identified in Hopkins's analysis of the event and its background. Central among these factors are the decision to withdraw engineers from the plant; a safety philosophy that focused on lost-time injuries rather than system hazards and processes; failures in the incident reporting system; failure to perform a HAZOP for plant 1; poor maintenance practices; inadequate audit practices; inadequate training for operators and supervisors; and a failure to identify the hazard created by interconnections with plants 2 and 3. The next level identifies the causes of the management failures -- Esso's overriding focus on cost-cutting and a failure by Exxon as the parent company to adequately oversee safety planning and share information from accidents at other plants. The final two levels of causation concern governmental and societal factors that contributed to the corporate behavior leading to the accident.

(Here is a list of major industrial disasters; link.)


Saturday, May 25, 2019

The 737 MAX disaster as an organizational failure


The topic of the organizational causes of technology failure comes up frequently in Understanding Society. The tragic crashes of two Boeing 737 MAX aircraft in the past year present an important case to study. Is this an instance of pilot error (as has occasionally been suggested)? Is it a case of engineering and design failures? Or are there important corporate and regulatory failures that created the environment in which the accidents occurred, as the public record seems to suggest?

The formal accident investigations are not yet complete, and the FAA and other air safety agencies around the world have not yet approved the aircraft for flight after the suspension of certification that followed the second crash. There will certainly be a detailed and expert study of this case at some point in the future, and I will be eager to read the resulting book. In the meantime, though, it is useful to bring the perspectives of Charles Perrow, Diane Vaughan, and Andrew Hopkins to bear on what we can learn about this case from the public media sources that are available. The preliminary sketch of a case study offered below is a first effort and is intended simply to help us learn more about the social and organizational processes that govern the complex technologies upon which we depend. Many of the dysfunctions identified in the safety literature appear to have had a role in this disaster.

I have made every effort to offer an accurate summary based on publicly available sources, but readers should bear in mind that it is a preliminary effort.

The key conclusions I've been led to include these:

The updated flight control system of the aircraft (MCAS) created the conditions for crashes under rare combinations of flight conditions and instrument failures (a simplified sketch of this vulnerability appears after this list).
  • Faults in the AOA sensor and the MCAS flight control system persisted through the design process
  • Pilot training and information about changes in the flight control system were likely inadequate to permit pilots to override the control system when necessary
There were fairly clear signs of organizational dysfunction in the development and design process for the aircraft:
  • Disempowered mid-level experts (engineers, designers, software experts)
  • Inadequate organizational embodiment of safety oversight
  • Business priorities placing cost savings, timeliness, profits over safety
  • Executives with divided incentives
  • Breakdown of internal management controls leading to faulty manufacturing processes 
Cost-containment and speed trumped safety. It is hard to avoid the conclusion that the corporation put cost-cutting and speed ahead of the professional advice and judgment of the engineers. Management pushed the design and certification process aggressively, leading to implementation of a control system that could fail in foreseeable flight conditions.

The regulatory system seems to have been at fault as well, with the FAA taking a deferential attitude towards the company's assertions of expertise throughout the certification process. The regulatory process was "outsourced" to a company that already has inordinate political clout in Congress and the agencies.
  • Inadequate government regulation
  • FAA lacked direct expertise and oversight sufficient to detect design failures. 
  • Too much influence by the company over regulators and legislators
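To make the first point concrete, here is a deliberately simplified sketch of the generic vulnerability described in public reporting: an automated trim function that trusts a single angle-of-attack (AOA) sensor, with no cross-check, will repeatedly command nose-down trim if that sensor fails high. The thresholds, gain, and structure below are invented for illustration; this is not Boeing's actual control law.

```python
# Deliberately simplified illustration of a single-sensor trim function.
# All numbers and structure are invented; this is NOT Boeing's control law.
def automatic_trim_command(aoa_deg, flaps_up,
                           aoa_limit_deg=14.0, nose_down_increment=-0.6):
    """Return a stabilizer trim increment in degrees (negative = nose down)."""
    if flaps_up and aoa_deg > aoa_limit_deg:
        return nose_down_increment   # no cross-check against a second sensor
    return 0.0

# A failed-high AOA reading retriggers nose-down trim on every cycle, which
# the crew must recognize and counteract each time.
faulty_aoa = 22.5   # sensor failed high; assume the true angle of attack is normal
for cycle in range(3):
    print(f"cycle {cycle}: trim command = {automatic_trim_command(faulty_aoa, True)}")
```

A more defensive design would, at a minimum, compare two independent AOA sources and disable the function when they disagree; the organizational question raised above is why that kind of check did not survive the design and certification process.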
Here is a video presentation of the case as I currently understand it (link). 

See also this earlier discussion of regulatory failure in the 737 MAX case (link). The work of several experts on the topic of organizational failure -- including Perrow, Vaughan, and Hopkins -- is especially relevant to the current case.

Tuesday, October 23, 2018

Sexual harassment in academic contexts


Sexual harassment of women in academic settings is regrettably common and pervasive, and its consequences are grave. At the same time, it is a remarkably difficult problem to solve. The "me-too" movement has shed welcome light on specific individual offenders and has generated more awareness of some aspects of the problem of sexual harassment and misconduct. But we have not yet come to a public awareness of the changes needed to create a genuinely inclusive and non-harassing environment for women across the spectrum of mistreatment that has been documented. The most common institutional response following an incident is to create a program of training and reporting, with a public commitment to investigating complaints and enforcing university or institutional policies rigorously and transparently. These efforts are often well intentioned, but by themselves they are insufficient. They do not address the underlying institutional and cultural features that make sexual harassment so prevalent.

The problem of sexual harassment in institutional contexts is a difficult one because it derives from multiple features of the organization. The ambient culture of the organization is often an important facilitator of harassing behavior -- often enough a patriarchal culture that is deferential to the status of higher-powered individuals at the expense of lower-powered targets. Executive leadership in many institutions continues to be predominantly male, and these leaders often bring with them a set of gendered assumptions that they fail to recognize. The hierarchical nature of the power relations of an academic institution is conducive to mistreatment of many kinds, including sexual harassment. Bosses to administrative assistants, research directors to post-docs, thesis advisors to PhD candidates -- these unequal relations of power create a conducive environment for sexual harassment in many varieties. In each case the superior actor has enormous power and influence over the career prospects and work lives of the women over whom they exercise power. And then there are the habits of behavior that individuals bring to the workplace and the learning environment -- sometimes habits of masculine entitlement, sometimes disdainful attitudes towards female scholars or scientists, sometimes an underlying willingness to bully others that finds expression in an academic environment. (A recent issue of the Journal of Social Issues (link) devotes substantial research to the topic of toxic leadership in the tech sector and the "masculinity contest culture" that this group of researchers finds to be a root cause of the toxicity this sector displays for women professionals. Research by Jennifer Berdahl, Peter Glick, Natalya Alonso, and more than a dozen other scholars provides in-depth analysis of this common feature of work environments.)

The scope and urgency of the problem of sexual harassment in academic contexts is documented in excellent and expert detail in a recent study report by the National Academies of Sciences, Engineering, and Medicine (link). This report deserves prominent discussion at every university.

The study documents the frequency of sexual harassment in academic and scientific research contexts, and the data are sobering. Here are the results of two indicative studies at Penn State University System and the University of Texas System:




The Penn State survey indicates that 43.4% of undergraduates, 58.9% of graduate students, and 72.8% of medical students have experienced gender harassment, while 5.1% of undergraduates, 6.0% of graduate students, and 5.7% of medical students report having experienced unwanted sexual attention and sexual coercion. These are staggering results, both in terms of the absolute number of students who were affected and the negative effects that these experiences had on their ability to fulfill their educational potential. The University of Texas study shows a similar pattern, but also permits us to see meaningful differences across fields of study. Engineering and medicine provide significantly more harmful environments for female students than non-STEM and science disciplines. The authors make a particularly worrisome observation about medicine in this context:
The interviews conducted by RTI International revealed that unique settings such as medical residencies were described as breeding grounds for abusive behavior by superiors. Respondents expressed that this was largely because at this stage of the medical career, expectation of this behavior was widely accepted. The expectations of abusive, grueling conditions in training settings caused several respondents to view sexual harassment as a part of the continuum of what they were expected to endure. (63-64)
The report also does an excellent job of defining the scope of sexual harassment. Media discussion of sexual harassment and misconduct focuses primarily on egregious acts of sexual coercion. However, the authors of the NAS study note that experts currently encompass sexual coercion, unwanted sexual attention, and gender harassment under this category of harmful interpersonal behavior. The largest sub-category is gender harassment:
"a broad range of verbal and nonverbal behaviors not aimed at sexual cooperation but that convey insulting, hostile, and degrading attitudes about" members of one gender (Fitzgerald, Gelfand, and Drasgow 1995, 430). (25)
The "iceberg" diagram (p. 32) captures the range of behaviors encompassed by the concept of sexual harassment. (See Leskinen, Cortina, and Kabat 2011 for extensive discussion of the varieties of sexual harassment and the harms associated with gender harassment.)


The report emphasizes organizational features as a root cause of a harassment-friendly environment.
By far, the greatest predictors of the occurrence of sexual harassment are organizational. Individual-level factors (e.g., sexist attitudes, beliefs that rationalize or justify harassment, etc.) that might make someone decide to harass a work colleague, student, or peer are surely important. However, a person that has proclivities for sexual harassment will have those behaviors greatly inhibited when exposed to role models who behave in a professional way as compared with role models who behave in a harassing way, or when in an environment that does not support harassing behaviors and/or has strong consequences for these behaviors. Thus, this section considers some of the organizational and environmental variables that increase the risk of sexual harassment perpetration. (46)
Some of the organizational factors that they refer to include the extreme gender imbalance that exists in many professional work environments, the perceived absence of organizational sanctions for harassing behavior, work environments where sexist views and sexually harassing behavior are modeled, and power differentials (47-49). The authors make the point that gender harassment is chiefly aimed at indicating disrespect towards the target rather than sexual exploitation. This has an important implication for institutional change. An institution that creates a strong core set of values emphasizing civility and respect is less conducive to gender harassment. They summarize this analysis in the statement of findings as well:
Organizational climate is, by far, the greatest predictor of the occurrence of sexual harassment, and ameliorating it can prevent people from sexually harassing others. A person more likely to engage in harassing behaviors is significantly less likely to do so in an environment that does not support harassing behaviors and/or has strong, clear, transparent consequences for these behaviors. (50)
So what can a university or research institution do to reduce and eliminate the likelihood of sexual harassment for women within the institution? Several remedies seem fairly obvious, though difficult.
  • Establish a pervasive expectation of civility and respect in the workplace and the learning environment
  • Diffuse the concentrations of power that give potential harassers the opportunity to harass women within their domains
  • Ensure that the institution honors its values by refusing the "star culture" common in universities that makes high-prestige university members untouchable
  • Be vigilant and transparent about the processes of investigation and adjudication through which complaints are considered
  • Create effective processes that ensure that complainants do not suffer retaliation
  • Consider candidates' receptivity to the values of a respectful, civil, and non-harassing environment during the hiring and appointment process (including research directors, department and program chairs, and other positions of authority)
  • Address the gender imbalance that may exist in leadership circles
As the authors put the point in the final chapter of the report:
Preventing and effectively addressing sexual harassment of women in colleges and universities is a significant challenge, but we are optimistic that academic institutions can meet that challenge--if they demonstrate the will to do so. This is because the research shows what will work to prevent sexual harassment and why it will work. A systemwide change to the culture and climate in our nation's colleges and universities can stop the pattern of harassing behavior from impacting the next generation of women entering science, engineering, and medicine. (169)

Sunday, October 21, 2018

System effects


Quite a few posts here have focused on the question of emergence in social ontology, the idea that there are causal processes and powers at work at the level of social entities that do not correspond to similar properties at the individual level. Here I want to raise a related question, the notion that an important aspect of the workings of the social world derives from "system effects" of the organizations and institutions through which social life transpires. A system accident or effect is one that derives importantly from the organization and configuration of the system itself, rather than the specific properties of the units.

What are some examples of system effects? Consider these phenomena:
  • Flash crashes in stock markets as a result of automated trading
  • Under-reporting of land values in agrarian fiscal regimes 
  • Grade inflation in elite universities 
  • Increase in product defect frequency following a reduction in inspections 
  • Rising frequency of industrial errors at the end of work shifts 
Here is how Nancy Leveson describes systems causation in Engineering a Safer World: Systems Thinking Applied to Safety:
Safety approaches based on systems theory consider accidents as arising from the interactions among system components and usually do not specify single causal variables or factors. Whereas industrial (occupational) safety models and event chain models focus on unsafe acts or conditions, classic system safety models instead look at what went wrong with the system's operation or organization to allow the accident to take place. (KL 977)
Charles Perrow offers a taxonomy of systems as a hierarchy of composition in Normal Accidents: Living with High-Risk Technologies:
Consider a nuclear plant as the system. A part will be the first level -- say a valve. This is the smallest component of the system that is likely to be identified in analyzing an accident. A functionally related collection of parts, as, for example, those that make up the steam generator, will be called a unit, the second level. An array of units, such as the steam generator and the water return system that includes the condensate polishers and associated motors, pumps, and piping, will make up a subsystem, in this case the secondary cooling system. This is the third level. A nuclear plant has around two dozen subsystems under this rough scheme. They all come together in the fourth level, the nuclear plant or system. Beyond this is the environment. (65)
Large socioeconomic systems like capitalism and collectivized socialism have system effects -- chronic patterns of low productivity and corruption in the latter case, a tendency to inequality and immiseration in the former case. In each case the observed effect is the result of embedded features of property and labor in the two systems that give rise to specific kinds of outcomes. And an important dimension of social analysis is to uncover the ways in which ordinary actors, pursuing ordinary goals within the context of the two systems, bring about quite different outcomes at the level of the "mode of production". And these effects do not depend on there being a distinctive kind of actor in each system; in fact, one could interchange the actors and still find the same macro-level outcomes.

Here is a preliminary effort at a definition for this concept in application to social organizations:
A system effect is an outcome that derives from the embedded characteristics of incentive and opportunity within a social arrangement, characteristics that lead normal actors to engage in activity that produces the hypothesized aggregate effect.
Once we see what the incentive and opportunity structures are, we can readily see why some fraction of actors modify their behavior in ways that lead to the outcome. In this respect the system is the salient causal factor rather than the specific properties of the actors -- change the system properties and you will change the social outcome.
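A minimal simulation can illustrate the claim that the system, rather than the actors, is the salient causal factor. In the sketch below (all numbers are illustrative assumptions of mine), an identical population of actors is placed in two incentive structures that differ only in the penalty attached to cutting corners; the aggregate rate of corner-cutting differs sharply even though the actors do not.

```python
# Identical actors, two incentive structures, different aggregate outcomes.
# The payoff numbers are illustrative assumptions, not empirical estimates.
import random

def corner_cutting_rate(penalty, n_actors=1000, seed=0):
    """Share of actors who cut corners, given the system's penalty level."""
    rng = random.Random(seed)          # same seed -> same "actors" each run
    cutters = 0
    for _ in range(n_actors):
        private_benefit = rng.uniform(0.0, 1.0)
        if private_benefit > penalty:  # only the system parameter differs
            cutters += 1
    return cutters / n_actors

print("lax system:   ", corner_cutting_rate(penalty=0.2))
print("strict system:", corner_cutting_rate(penalty=0.8))
```

Changing the system property (the penalty) changes the social outcome; interchanging the actors would not.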

When we refer to system effects we often have unintended consequences in mind -- unintended both by the individual actors and the architects of the organization or practice. But this is not essential; we can also think of examples of organizational arrangements that were deliberately chosen or designed to bring about the given outcome. In particular, a given system effect may be intended by the designer and unintended by the individual actors. But when the outcomes in question are clearly dysfunctional or "catastrophic", it is natural to assume that they are unintended. (This, however, is one of the specific areas of insight that comes out of the new institutionalism: the dysfunctional outcome may be favorable for some sets of actors even as it is unfavorable for the workings of the system as a whole.)
 
Another common assumption about system effects is that they are remarkably stable through changes of actors and efforts to reverse the given outcome. In this sense they are thought to be somewhat beyond the control of the individuals who make up the system. The only promising way of undoing the effect is to change the incentives and opportunities that bring it about. But to the extent that a given configuration has emerged along with supporting mechanisms protecting it from deformation, changing the configuration may be frustratingly difficult.

Safety and its converse are often described as system effects. By this, two things are often meant. First, there is the important insight that traditional accident analysis favors "unit failure" at the expense of more systemic factors. And second, there is the idea that accidents and failures often result from "tightly linked" features of systems, both social and technical, in which variation in one component of a system can have unexpected consequences for the operation of other components of the system. (Charles Perrow describes the topic of loose and tight coupling in social systems in Normal Accidents, 89 ff.)

Sunday, September 30, 2018

Philosophy and the study of technology failure

image: Adolf von Menzel, The Iron Rolling Mill (Modern Cyclopes)

Readers may have noticed that my current research interests have to do with organizational dysfunction and large-scale technology failures. I am interested in probing the ways in which organizational failures and dysfunctions have contributed to large accidents like Bhopal, Fukushima, and the Deepwater Horizon disaster. I've had to confront an important question in taking on this research interest: what can philosophy bring to the topic that would not be better handled by engineers, organizational specialists, or public policy experts?

One answer is the diversity of viewpoint that a philosopher can bring to the discussion. It is evident that technology failures invite analysis from all of these specialized experts, and more. But there is room for productive contribution from reflective observers who are not committed to any of these disciplines. Philosophers have a long history of taking on big topics outside the defined canon of "philosophical problems", and often those engagements have proven fruitful. In this particular instance, philosophy can look at organizations and technology in a way that is more likely to be interdisciplinary, and perhaps can help to see dimensions of the problem that are less apparent from a purely disciplinary perspective.

There is also a rationale based on the terrain of the philosophy of science. Philosophers of biology have usually attempted to learn as much about the science of biology as they can manage, but they lack the level of expertise of a research biologist, and it is rare for a philosopher to make an original contribution to the scientific biological literature. Nonetheless it is clear that philosophers have a great deal to add to scientific research in biology. They can contribute to better reasoning about the implications of various theories, they can probe the assumptions about confirmation and explanation that are in use, and they can contribute to resolving important conceptual disagreements. Biology is in a better state because of the work of philosophers like David Hull and Elliott Sober.

Philosophers have also made valuable contributions to science and technology studies, bringing a viewpoint that incorporates insights from the philosophy of science and a sensitivity to the social groundedness of technology. STS studies have proven to be a fruitful place for interaction between historians, sociologists, and philosophers. Here again, the concrete study of the causes and context of large technology failure may be assisted by a philosophical perspective.

There is also a normative dimension to these questions about technology failure for which philosophy is well prepared. Accidents hurt people, and sometimes the causes of accidents involve culpable behavior by individuals and corporations. Philosophers have a long history of contribution to these kinds of problems of fault, law, and just management of risks and harms.

Finally, it is realistic to say that philosophy has an ability to contribute to social theory. Philosophers can offer imagination and critical attention to the problem of creating new conceptual schemes for understanding the social world. This capacity seems relevant to the problem of describing, analyzing, and explaining large-scale failures and disasters.

The situation of organizational studies and accidents is in some ways more hospitable for contributions by a philosopher than other "wicked problems" in the world around us. An accident is complicated and complex but not particularly obscure. The field is unlike quantum mechanics or climate dynamics, which are inherently difficult for non-specialists to understand. The challenge with accidents is to identify a multi-layered analysis of the causes of the accident that permits observers to have a balanced and operative understanding of the event. And this is a situation where the philosopher's perspective is most useful. We can offer higher-level descriptions of the relative importance of different kinds of causal factors. Perhaps the role here is analogous to messenger RNA, providing a cross-disciplinary kind of communications flow. Or it is analogous to the role of philosophers of history who have offered gentle critique of the cliometrics school for its over-dependence on a purely statistical approach to economic history.

So it seems reasonable enough for a philosopher to attempt to contribute to this set of topics, even if the disciplinary expertise a philosopher brings is more weighted towards conceptual and theoretical discussions than undertaking original empirical research in the domain.

What I expect to be the central finding of this research is the idea that a pervasive and often unrecognized cause of accidents is a systemic organizational defect of some sort, and that it is enormously important to have a better understanding of common forms of these deficiencies. This is a bit analogous to a paradigm shift in the study of accidents. And this view has important policy implications. We can make disasters less frequent by improving the organizations through which technology processes are designed and managed.

Tuesday, September 25, 2018

System safety


An ongoing thread of posts here is concerned with organizational causes of large technology failures. The driving idea is that failures, accidents, and disasters usually have a dimension of organizational causation behind them. The corporation, research office, shop floor, supervisory system, intra-organizational information flow, and other social elements often play a key role in the occurrence of a gas plant fire, a nuclear power plant malfunction, or a military disaster. There is a tendency to look first and foremost for one or more individuals who made a mistake in order to explain the occurrence of an accident or technology failure; but researchers such as Perrow, Vaughan, Tierney, and Hopkins have demonstrated in detail the importance of broadening the lens to seek out the social and organizational background of an accident.

It seems important to distinguish between system flaws and organizational dysfunction in considering all of the kinds of accidents mentioned here. We might specify system safety along these lines. Any complex process has the potential for malfunction. Good system design means creating a flow of events and processes that make accidents inherently less likely. Part of the task of the designer and engineer is to identify chief sources of harm inherent in the process -- release of energy, contamination of food or drugs, unplanned fission in a nuclear plant -- and design fail-safe processes so that these events are as unlikely as possible. Further, given the complexity of contemporary technology systems, it is critical to attempt to anticipate unintended interactions among subsystems -- each of which may be functioning correctly even as their interaction leads to disaster in unusual but possible scenarios.

In a nuclear processing plant, for example, there is the hazard of radioactive materials being brought into proximity with each other in a way that creates unintended critical mass. Jim Mahaffey's Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima offers numerous examples of such unintended events, from the careless handling of plutonium scrap in a machining process to the transfer of a fissionable liquid from a vessel of one shape to another. We might try to handle these risks as an organizational problem: more and better training for operatives about the importance of handling nuclear materials according to established protocols, and effective supervision and oversight to ensure that the protocols are observed on a regular basis. But it is also possible to design the material processes within a nuclear plant in a way that makes unintended criticality virtually impossible -- for example, by storing radioactive solutions in containers that simply cannot be brought into close proximity with each other.

Nancy Leveson is a national expert on defining and applying principles of system safety. Her book Engineering a Safer World: Systems Thinking Applied to Safety is a thorough treatment of her thinking about this subject. She offers a handful of compelling reasons for believing that safety is a system-level characteristic that requires a systems approach: the fast pace of technological change, reduced ability to learn from experience, the changing nature of accidents, new types of hazards, increasing complexity and coupling, decreasing tolerance for single accidents, difficulty in selecting priorities and making tradeoffs, more complex relationships between humans and automation, and changing regulatory and public view of safety (kl 130 ff.). Particularly important in this list is the comment about complexity and coupling: "The operation of some systems is so complex that it defies the understanding of all but a few experts, and sometimes even they have incomplete information about the system's potential behavior" (kl 137).

Given the fact that safety and accidents are products of whole systems, she is critical of the accident methodology generally applied to serious industrial, aerospace, and chemical accidents. This methodology involves tracing the series of events that led to the outcome, and identifying one or more events as the critical cause of the accident. However, she writes:
In general, event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management decision making, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events. A narrow focus on technological components and pure engineering activities or a similar narrow focus on operator errors may lead to ignoring some of the most important factors in terms of preventing future accidents. (kl 452)
Here is a definition of system safety offered later in ESW in her discussion of the emergence of the concept within the defense and aerospace fields in the 1960s:
System Safety ... is a subdiscipline of system engineering. It was created at the same time and for the same reasons. The defense community tried using the standard safety engineering techniques on their complex new systems, but the limitations became clear when interface and component interaction problems went unnoticed until it was too late, resulting in many losses and near misses. When these early aerospace accidents were investigated, the causes of a large percentage of them were traced to deficiencies in design, operations, and management. Clearly, big changes were needed. System engineering along with its subdiscipline, System Safety, were developed to tackle these problems. (kl 1007)
Here Leveson mixes system design and organizational dysfunctions as system-level causes of accidents. But much of her work in this book and her earlier Safeware: System Safety and Computers gives extensive attention to the design faults and component interactions that lead to accidents -- what we might call system safety in the narrow or technical sense.
A systems engineering approach to safety starts with the basic assumption that some properties of systems, in this case safety, can only be treated adequately in the context of the social and technical system as a whole. A basic assumption of systems engineering is that optimization of individual components or subsystems will not in general lead to a system optimum; in fact, improvement of a particular subsystem may actually worsen the overall system performance because of complex, nonlinear interactions among the components. (kl 1007)
Overall, then, it seems clear that Leveson believes that both organizational features and technical system characteristics are part of the systems that created the possibility for accidents like Bhopal, Fukushima, and Three Mile Island. Her own accident model, STAMP (Systems-Theoretic Accident Model and Processes), designed to help identify the causes of accidents, emphasizes both kinds of system properties.
Using this new causality model ... changes the emphasis in system safety from preventing failures to enforcing behavioral safety constraints. Component failure accidents are still included, but our conception of causality is extended to include component interaction accidents. Safety is reformulated as a control problem rather than a reliability problem. (kl 1062)
In this framework, understanding why an accident occurred requires determining why the control was ineffective. Preventing future accidents requires shifting from a focus on preventing failures to the broader goal of designing and implementing controls that will enforce the necessary constraints. (kl 1084)
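The reformulation of safety as a control problem can be given a minimal sketch. In the toy example below, the tank model and limits are my own invented illustration (not Leveson's STAMP notation): a controller enforces a safety constraint on process behavior rather than waiting for a component to fail.

```python
# Toy illustration of "safety as a control problem": a controller enforces a
# constraint on process behavior rather than reacting to component failures.
# The tank model and limits are invented; this is not Leveson's STAMP notation.
class TankProcess:
    def __init__(self):
        self.level = 0.0

    def step(self, inflow, outflow):
        self.level += inflow - outflow

class SafetyController:
    """Enforces the constraint: tank level must never exceed max_level."""
    def __init__(self, max_level):
        self.max_level = max_level

    def control(self, process, requested_inflow):
        # Block any control action that would violate the constraint, even
        # though no individual component has "failed".
        if process.level + requested_inflow > self.max_level:
            return max(0.0, self.max_level - process.level)
        return requested_inflow

tank, controller = TankProcess(), SafetyController(max_level=10.0)
for _ in range(5):
    tank.step(inflow=controller.control(tank, requested_inflow=3.0), outflow=0.0)
print("final level:", tank.level)   # capped at 10.0 by the enforced constraint
```

On this view, an accident investigation asks why the control structure failed to enforce the constraint, not merely which component broke.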
Leveson's brief analysis of the Bhopal disaster in 1984 (kl 384 ff.) emphasizes the organizational dysfunctions that led to the accident -- and that were completely ignored by the Indian state's investigation of the accident: out-of-service gauges, alarm deficiencies, inadequate response to prior safety audits, shortage of oxygen masks, failure to inform the police or surrounding community of the accident, and an environment of cost cutting that impaired maintenance and staffing. "When all the factors, including indirect and systemic ones, are considered, it becomes clear that the maintenance worker was, in fact, only a minor and somewhat irrelevant player in the loss. Instead, degradation in the safety margin occurred over time and without any particular single decision to do so but simply as a series of decisions that moved the plant slowly toward a situation where any slight error would lead to a major accident" (kl 447).

Monday, May 7, 2018

What the boss wants to hear ...


According to David Halberstam in his outstanding history of the war in Vietnam, The Best and the Brightest, a prime cause of disastrous decision-making by Presidents Kennedy and Johnson was an institutional imperative in the Defense Department to come up with a set of facts that conformed to what the President wanted to hear. Robert McNamara and McGeorge Bundy were among the highest-level miscreants in Halberstam's account; they were determined to craft an assessment of the situation on the ground in Vietnam that conformed best with their strategic advice to the President.

Ironically, a very similar dynamic led to one of modern China's greatest disasters, the Great Leap Forward famine in 1959. The Great Helmsman was certain that collective agriculture would be vastly more productive than private agriculture; and following the collectivization of agriculture, party officials in many provinces obliged this assumption by reporting inflated grain statistics throughout 1958 and 1959. The result was a famine that led to at least twenty million excess deaths during a two-year period as the central state shifted resources away from agriculture (Frank Dikötter, Mao's Great Famine: The History of China's Most Devastating Catastrophe, 1958-62).

More mundane examples are available as well. When information about possible sexual harassment in a given department is suppressed because "it won't look good for the organization" and "the boss will be unhappy", the organization is on a collision course with serious problems. When concerns about product safety or reliability are suppressed within the organization for similar reasons, the results can be equally damaging, to consumers and to the corporation itself. General Motors, Volkswagen, and Michigan State University all seem to have suffered from these deficiencies of organizational behavior. This is a serious cause of organizational mistakes and failures. It is impossible to make wise decisions -- individual or collective -- without accurate and truthful information from the field. And yet the knowledge of higher-level executives depends upon the truthful and full reporting of subordinates, who sometimes have career incentives that work against honesty.

So how can this unhappy situation be avoided? Part of the answer has to do with the behavior of the leaders themselves. It is important for leaders to explicitly and implicitly invite the truth -- whether it is good news or bad news. Subordinates must be encouraged to be forthcoming and truthful; and bearers of bad news must not be subject to retaliation. Boards of directors, both private and public, need to make clear their own expectations on this score as well: that they expect leading executives to invite and welcome truthful reporting, and that they expect individuals throughout the organization to provide truthful reporting. A culture of honesty and transparency is a powerful antidote to the disease of fabrications to please the boss.

Anonymous hotlines and formal protection of whistle-blowers are other institutional arrangements that lead to greater honesty and transparency within an organization. These avenues have the advantage of being largely outside the control of the upper executives, and therefore can serve as a somewhat independent check on dishonest reporting.

A reliable practice of accountability is also a deterrent to dishonest or partial reporting within an organization. The truth eventually comes out -- whether about sexual harassment, about hidden defects in a product, or about workplace safety failures. When boards of directors and organizational policies make it clear that there will be negative consequences for dishonest behavior, this gives an ongoing incentive of prudence for individuals to honor their duties of honesty within the organization.

This topic falls within the broader question of how individual behavior throughout an organization has the potential for giving rise to important failures that harm the public and harm the organization itself.


Thursday, April 5, 2018

Empowering the safety officer?


How can industries involving processes that create large risks of harm for individuals or populations be modified so they are more capable of detecting and eliminating the precursors of harmful accidents? How can nuclear accidents, aviation crashes, chemical plant explosions, and medical errors be reduced, given that each of these activities involves large bureaucratic organizations conducting complex operations and with substantial inter-system linkages? How can organizations be reformed to enhance safety and to minimize the likelihood of harmful accidents?

One of the lessons learned from the Challenger space shuttle disaster is the importance of a strongly empowered safety officer in organizations that deal in high-risk activities. This means the creation of a position dedicated to ensuring safe operations that falls outside the normal chain of command. The idea is that the normal decision-making hierarchy of a large organization has a built-in tendency to maintain production schedules and avoid costly delays. In other words, there is a built-in incentive to treat safety issues with lower priority than most people would expect.

If there had been an empowered safety officer in the launch hierarchy for the Challenger launch in 1986, there is a good chance this officer would have listened more carefully to the Morton-Thiokol engineering team's concerns about low-temperature damage to O-rings and would have ordered a halt to the launch sequence until temperatures in Florida rose to the critical value. The Rogers Commission faulted the decision-making process leading to the launch decision in its final report on the accident (The Report of the Presidential Commission on the Space Shuttle Challenger Accident - The Tragedy of Mission 51-L in 1986 - Volume One, Volume Two, Volume Three).

This approach is productive because empowering a safety officer creates a different set of interests in the management of a risky process. The safety officer's interest is in safety, whereas other decision makers are concerned about revenues and costs, public relations, reputation, and other instrumental goods. So a dedicated safety officer is empowered to raise safety concerns that other officers might be hesitant to raise. Ordinary bureaucratic incentives may lead to underestimating risks or concealing faults; so lowering the accident rate requires giving some individuals the incentive and power to act effectively to reduce risks.

Similar findings have emerged in the study of medical and hospital errors. It has been recognized that high-risk activities are made less risky by empowering all members of the team to call a halt in an activity when they perceive a safety issue. When all members of the surgical team are empowered to halt a procedure when they note an apparent error, serious operating-room errors are reduced. (Here is a report from the American College of Obstetricians and Gynecologists on surgical patient safety; link. And here is a 1999 National Academy report on medical error; link.)

The effectiveness of a team-based approach to safety depends on one central fact. There is a high level of expertise embodied in the staff operating a surgical suite, an engineering laboratory, or a drug manufacturing facility. Empowering these individuals to stop a procedure when they judge there is an unrecognized error in play greatly extends the amount of embodied knowledge brought to bear on the process. The surgeon, the commanding officer, or the lab director is no longer the sole expert whose judgments count.

But it also seems clear that these innovations don't work equally well in all circumstances. Take nuclear power plant operations. In Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima James Mahaffey documents multiple examples of nuclear accidents that resulted from the efforts of mid-level workers to address an emerging problem in an improvised way. In the case of nuclear power plant safety, it appears that the best prescription for safety is to insist on rigid adherence to pre-established protocols. In this case the function of a safety officer is to monitor operations to ensure protocol conformance -- not to exercise independent judgment about the best way to respond to an unfavorable reactor event.

It is in fact an interesting exercise to try to identify the kinds of operations in which these innovations are likely to be effective.

Here is a fascinating interview in Slate with Jim Bagian, a former astronaut, one-time director of the Veterans Administration's National Center for Patient Safety, and distinguished safety expert; link. Bagian emphasizes the importance of taking a system-based approach to safety. Rather than focusing on finding blame for specific individuals whose actions led to an accident, Bagian emphasizes the importance of tracing back to the institutional, organizational, or logistic background of the accident. What can be changed in the process -- of delivering medications to patients, of fueling a rocket, or of moving nuclear solutions around in a laboratory -- that would make the likelihood of an accident substantially lower? (Here is a co-authored piece by Bagian and others on the topic of team-based patient safety in the operating room; link.)

The safety principles involved here seem fairly simple: cultivate a culture in which errors and near-misses are reported and investigated without blame; empower individuals within risky processes to halt the process if their expertise and experience indicates the possibility of a significant risky error; create individuals within organizations whose interests are defined in terms of the identification and resolution of unsafe practices or conditions; and share information about safety within the industry and with the public.

Sunday, March 25, 2018

Mechanisms, singular and general


Let's think again about the semantics of causal ascriptions. Suppose that we want to know what caused a building crane to collapse during a windstorm. We might arrive at an account something like this:
  • An unusually heavy gust of wind at 3:20 pm, in the presence of this crane's specific material and structural properties, with the occurrence of the operator's effort to adjust the crane's extension at 3:21 pm, brought about cascading failures of structural elements of the crane, leading to collapse at 3:25 pm.
The process described here proceeds from the "gust of wind striking the crane" through an account of the material and structural properties of the device, incorporating the untimely effort by the operator to readjust the device's extension, leading to a cascade from small failures to a large failure. And we can identify the features of causal necessity that were operative at the several links of the chain.

Notice that there are few causal regularities or necessary and constant conjunctions in this account. Wind does not usually bring about the collapse of cranes; if the operator's intervention had occurred a few minutes earlier or later, perhaps the failure would not have occurred; and small failures do not always lead to large failures. Nonetheless, in the circumstances described here there is causal necessity extending from the antecedent situation at 3:15 pm to the full catastrophic collapse at 3:25 pm.

Does this narrative identify a causal mechanism? Are we better off describing this as a series of cause-effect pairs, none of which represents a causal mechanism per se? Or, on the contrary, can we look at the whole sequence as a single causal mechanism -- though one that is never to be repeated? Does a causal mechanism need to be a recurring and robust chain of events, or can it be a highly unique and contingent chain?

Most mechanisms theorists insist on a degree of repeatability in the sequences that they describe as "mechanisms". A causal mechanism is the triggering pathway through which one event leads to the production of another event in a range of circumstances in an environment. Fundamentally a causal mechanism is a "molecule" of causal process which can recur in a range of different social settings.

For example:
  • X typically brings about O.
Whenever this sequence of events occurs, in the appropriate timing, the outcome O is produced. This ensemble of events {X, O} is a single mechanism.

And here is the crucial point: to call this a mechanism requires that the sequence recur in multiple instances across a range of background conditions.

This suggests an answer to the question about the collapsing crane: the sequence from gust to operator error to crane collapse is not a mechanism, but is rather a unique causal sequence. Each part of the sequence has a causal explanation available; each conveys a form of causal necessity in the circumstances. But the aggregation of these cause-effect connections falls short of constituting a causal mechanism because the circumstances in which it works are all but unique. A satisfactory causal explanation of the internal cause-effect pairs will refer to real repeatable mechanisms -- for example, "twisting a steel frame leads to a loss of support strength". But the concatenation does not add up to another, more complex, mechanism.

Contrast this with "stuck valve" accidents in nuclear power reactors. Valves control the flow of cooling fluid around the reactor fuel. If the fuel is deprived of coolant, it rapidly overheats and melts. A "stuck valve - loss of fluid - critical overheating" sequence is a recognized mechanism of nuclear meltdown, and it has been observed in a range of nuclear-plant crises. It is therefore appropriate to describe this sequence as a genuine causal mechanism in the production of a nuclear plant failure.
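
To make the contrast concrete, here is a minimal sketch in Python -- purely illustrative, and not drawn from Glennan or any particular mechanisms theorist -- of the difference between a mechanism, understood as a repeatable template of event types, and a singular causal sequence, understood as a dated trace of token events.

    # Illustrative sketch: a mechanism as a reusable template of event types,
    # versus a singular causal sequence as a one-off trace of dated events.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Mechanism:
        """A repeatable causal template: an ordered chain of event types."""
        name: str
        stages: List[str]

        def instantiate(self, setting: str) -> List[str]:
            # The same template can be realized in many different settings.
            return [f"{setting}: {stage}" for stage in self.stages]

    # A recognized, recurring mechanism of reactor failure.
    stuck_valve = Mechanism(
        name="stuck valve meltdown",
        stages=["coolant valve sticks shut", "loss of coolant flow",
                "fuel overheats", "core damage"],
    )

    # The template recurs across a range of background conditions ...
    for plant in ["Plant A", "Plant B", "Plant C"]:
        print(stuck_valve.instantiate(plant))

    # ... whereas the crane collapse is a dated trace of token events that is
    # not expected to recur as a whole.
    crane_collapse_trace = [
        "3:20 pm  unusually heavy gust strikes the crane",
        "3:21 pm  operator adjusts the crane's extension",
        "3:21-3:25 pm  cascading failures of structural elements",
        "3:25 pm  collapse",
    ]
    print("\n".join(crane_collapse_trace))

The details of the representation do not matter; the point is simply that the stuck-valve mechanism is specified at the level of event types and can be instantiated in indefinitely many settings, whereas the crane sequence is exhausted by its single dated occurrence.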

(Stuart Glennan takes up a similar question in "Singular and General Causal Relations: A Mechanist Perspective"; link.)

Saturday, March 10, 2018

Technology lock-in accidents

image: diagram of molten salt reactor

Organizational and regulatory features are sometimes part of the causal background of important technology failures. This is particularly true in the history of nuclear power generation. The promise of peaceful uses of atomic energy was enormously attractive at the end of World War II. In abstract terms, generating usable power from atomic reactions is quite simple: what is needed is a controllable fission reaction in which the heat produced by fission is captured to run a steam-powered electrical generator.

The technical challenges presented by harnessing nuclear fission in a power plant were large. Fissionable material needed to be produced in the form of usable fuel. A control system needed to be designed to maintain the fission reaction at the desired level. And, most critically, a system for removing heat from the fissioning fuel needed to be designed so that the reactor core would not overheat and melt down, releasing energy and radioactive materials into the environment.
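
To see why the heat-removal problem is the critical one, consider a crude lumped heat balance for a reactor core: the core temperature obeys roughly C dT/dt = P - h(T - Tc), where heat is generated at rate P and removed at a rate proportional to the temperature difference between core and coolant. Here is a toy Python sketch of that balance; the parameter values are assumptions chosen only for illustration, not data for any real plant.

    # A toy lumped heat balance for a reactor core (illustrative numbers only):
    #     C * dT/dt = P - h * (T - T_coolant)
    C = 5.0e8          # effective heat capacity of the core, J/K (assumed)
    P = 1.0e8          # heat still being generated, e.g. decay heat, W (assumed)
    h_normal = 2.0e6   # heat-removal coefficient with full coolant flow, W/K (assumed)
    h_lost = 2.0e4     # residual heat removal after flow is interrupted, W/K (assumed)
    T_coolant = 300.0  # coolant temperature, deg C
    T = 330.0          # initial core temperature, deg C

    for t in range(0, 3601):                 # one hour, one-second time steps
        h = h_normal if t < 600 else h_lost  # coolant flow is lost at t = 600 s
        T += (P - h * (T - T_coolant)) / C   # explicit Euler step, dt = 1 s
        if t % 600 == 0:
            print(f"t = {t:4d} s   core temperature = {T:6.1f} C")

With full coolant flow the core temperature settles near its operating point; once flow is lost, it climbs by hundreds of degrees within the hour even though no additional fission is assumed. That, in caricature, is why so much of the design effort and cost described below goes into guaranteeing that cooling is never interrupted.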

Early reactor designs took different approaches to the heat-removal problem. Liquid metal reactors used a metal such as sodium as the fluid that circulates through the core, carrying heat away to a heat sink for dispersal; water reactors used pressurized water to serve that function. The sodium breeder reactor appeared to be viable, but incidents like the Fermi 1 disaster near Detroit cast doubt on the wisdom of that choice. The design that emerged as the dominant choice in civilian power production was the light water reactor. But light water reactors presented their own technological challenges, most especially the risk of a massive steam explosion in the event of a power interruption to the cooling system. To obviate this risk, reactor designs incorporated multiple levels of redundancy to ensure that no such power interruption would occur, and much of the construction cost of a modern light water plant is dedicated to these systems -- containment vessels, redundant power supplies, and so on. In spite of these design efforts, however, light water reactors at Three Mile Island and Fukushima did in fact melt down under unusual circumstances -- with particularly devastating results at Fukushima. The nuclear power industry in the United States essentially died as a result of public fears of the possibility of meltdown of nuclear reactors near populated areas -- fears that were validated by several large nuclear disasters.

What is interesting about this story is that US nuclear scientists and engineers developed an alternative reactor design in the 1950s that solved the problem of harnessing the heat of a nuclear reaction in a significantly different way and that posed a dramatically lower risk of meltdown and radioactive release. This is the molten salt reactor, first built at the Oak Ridge National Laboratory as part of the loopy idea of creating an atomic-powered aircraft that could remain aloft for months. This reactor design operates at atmospheric pressure, and the technological challenges of maintaining a molten salt cooling system are readily solved. Because no water is involved in the cooling system, the greatest danger in a nuclear power plant -- a violent steam explosion -- is eliminated entirely: molten salt will not turn to steam. Chinese nuclear energy researchers are currently developing a next generation of molten salt reactors, and they may well succeed in designing a reactor system that is both cheaper and dramatically safer with respect to low-probability, high-consequence accidents (link). The technology also has the advantage of making much more efficient use of the nuclear fuel, leaving a dramatically smaller amount of radioactive waste to dispose of.

So why did the US nuclear industry abandon the molten salt reactor design? This seems to be a case of lock-in by an industry and its regulatory system. Once the industry settled on the light water reactor, the Nuclear Regulatory Commission built its regulations and licensing requirements for new reactors around that design. It then became extremely difficult for a utility or private energy corporation to shoulder the research, development, and construction costs associated with a radical change of design. There is currently an effort by an American company to develop a new-generation molten salt reactor, and the effort is inhibited by the knowledge that it will take a minimum of ten years to gain certification and licensing for a commercial plant based on the new design (link).

This story illustrates the possibility that a process of technology development may get locked into a particular approach that embodies substantial public risk, and it may be all but impossible to subsequently adopt a different approach. In another context Thomas Hughes refers to this as technological momentum, and it is clear that there are commercial, institutional, and regulatory reasons for this "stickiness" of a major technology once it is designed and adopted. In the case of nuclear power the inertia associated with light water reactors is particularly unfortunate, given that it blocked other solutions that were both safer and more economical.

(Here is a valuable review of safety issues in the nuclear power industry; link. Also relevant is Robin Cowan, "Nuclear Power Reactors: A Study in Technological Lock-in"; link -- thanks, Özgür, for the reference. And here is a critical assessment of molten salt reactor designs from the Bulletin of the Atomic Scientists (link).)

Saturday, February 24, 2018

Nuclear accidents


diagrams: Chernobyl reactor before and after

Nuclear fission is one of the world-changing discoveries of the mid-twentieth century. The atomic bomb projects of the United States led to the atomic bombing of Japan in August 1945, and the hope for limitless electricity brought about the proliferation of a variety of nuclear reactors around the world in the decades following World War II. And, of course, nuclear weapons proliferated to other countries beyond the original circle of atomic powers.

Given the enormous energies associated with fission and the dangerous and toxic properties of the radioactive components of fission processes, the possibility of a nuclear accident is a particularly frightening one for the modern public. The world has seen the results of several massive nuclear accidents -- Chernobyl and Fukushima in particular -- and the devastating effects they have had on human populations and on the social and economic wellbeing of the regions in which they occurred.

Safety is therefore a paramount priority in the nuclear industry, in research labs and in military and civilian applications alike. So what is the state of safety in the nuclear sector? Jim Mahaffey's Atomic Accidents: A History of Nuclear Meltdowns and Disasters: From the Ozark Mountains to Fukushima is a detailed and carefully researched attempt to answer this question. And the information he provides is not reassuring. Beyond the celebrated and well-known disasters at nuclear power plants (Three Mile Island, Chernobyl, Fukushima), Mahaffey describes hundreds of accidents involving reactors, research laboratories, weapons plants, and deployed nuclear weapons that have received far less public attention. These accidents resulted in very few lives lost, but their frequency is alarming. They are indeed "normal accidents" (Perrow, Normal Accidents: Living with High-Risk Technologies). For example:
  • a Japanese fishing boat is contaminated by fallout from Castle Bravo test of hydrogen bomb; lots of radioactive fish at the markets in Japan (March 1, 1954) (kl 1706)
  • one MK-6 atomic bomb is dropped on Mars Bluff, South Carolina, after a crew member accidentally pulled the emergency bomb release handle (February 5, 1958) (kl 5774)
  • Fermi 1 liquid sodium plutonium breeder reactor experiences fuel meltdown during startup trials near Detroit (October 4, 1966) (kl 4127)
Mahaffey also provides detailed accounts of the most serious nuclear accidents and meltdowns of the past forty years: Three Mile Island, Chernobyl, and Fukushima.

The safety and control of nuclear weapons is of particular interest. Here is Mahaffey's summary of "Broken Arrow" events -- the loss of atomic and fusion weapons:
Did the Air Force ever lose an A-bomb, or did they just misplace a few of them for a short time? Did they ever drop anything that could be picked up by someone else and used against us? Is humanity going to perish because of poisonous plutonium spread that was snapped up by the wrong people after being somehow misplaced? Several examples will follow. You be the judge. 
Chuck Hansen [U.S. Nuclear Weapons - The Secret History] was wrong about one thing. He counted thirty-two “Broken Arrow” accidents. There are now sixty-five documented incidents in which nuclear weapons owned by the United States were lost, destroyed, or damaged between 1945 and 1989. These bombs and warheads, which contain hundreds of pounds of high explosive, have been abused in a wide range of unfortunate events. They have been accidentally dropped from high altitude, dropped from low altitude, crashed through the bomb bay doors while standing on the runway, tumbled off a fork lift, escaped from a chain hoist, and rolled off an aircraft carrier into the ocean. Bombs have been abandoned at the bottom of a test shaft, left buried in a crater, and lost in the mud off the coast of Georgia. Nuclear devices have been pounded with artillery of a foreign nature, struck by lightning, smashed to pieces, scorched, toasted, and burned beyond recognition. Incredibly, in all this mayhem, not a single nuclear weapon has gone off accidentally, anywhere in the world. If it had, the public would know about it. That type of accident would be almost impossible to conceal. (kl 5527)
There are a few common threads in the stories of accident and malfunction that Mahaffey provides. First, there are failures of training and knowledge on the part of front-line workers. The physics of nuclear fission is often counter-intuitive, and the idea of critical mass does not fully capture the danger of a quantity of fissionable material. The geometry in which the material is stored makes a decisive difference to whether it goes critical. Fissionable material is often transported and manipulated in liquid solution; and the shape and configuration of the vessel in which the solution is held make a difference to the probability of exponential growth of neutron emission -- that is, runaway fission of the material. Mahaffey documents accidents in nuclear materials processing plants that resulted from plant workers applying what they knew from industrial plumbing to basic shop-floor problems. All too often the result was a flash of blue light and the release of a great deal of heat and radioactive material.
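
Mahaffey's point about geometry can be made vivid with the standard one-group approximation from reactor physics: the effective multiplication factor is roughly k_eff = k_inf / (1 + M^2 * B^2), where the geometric buckling B^2 depends only on the size and shape of the vessel, and a compact vessel leaks fewer neutrons than a tall, narrow one holding the same volume of solution. The following Python sketch uses illustrative values for k_inf and M^2 -- they are assumptions for the sake of the example, not data for any real fissile solution.

    # Rough illustration (toy numbers) of why vessel geometry matters for criticality.
    # One-group diffusion theory:  k_eff ~ k_inf / (1 + M^2 * B^2), where
    #   sphere of radius R:            B^2 = (pi / R)^2
    #   cylinder, radius R, height H:  B^2 = (2.405 / R)^2 + (pi / H)^2
    import math

    K_INF = 1.8    # infinite-medium multiplication factor (assumed)
    M2 = 0.003     # migration area, m^2 (assumed)
    VOLUME = 0.05  # 50 liters of the same fissile solution in every vessel

    def k_eff(buckling: float) -> float:
        return K_INF / (1.0 + M2 * buckling)

    def cylinder_buckling(radius: float, volume: float) -> float:
        height = volume / (math.pi * radius ** 2)
        return (2.405 / radius) ** 2 + (math.pi / height) ** 2

    def sphere_buckling(volume: float) -> float:
        radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
        return (math.pi / radius) ** 2

    vessels = {
        "tall narrow cylinder (R = 0.10 m)": cylinder_buckling(0.10, VOLUME),
        "squat cylinder (R = 0.25 m)": cylinder_buckling(0.25, VOLUME),
        "sphere of the same volume": sphere_buckling(VOLUME),
    }

    for name, b2 in vessels.items():
        k = k_eff(b2)
        status = "SUPERCRITICAL" if k > 1.0 else "subcritical"
        print(f"{name:35s}  k_eff = {k:4.2f}  ({status})")

With these made-up material constants, the tall narrow column is comfortably subcritical, while the same fifty liters in a squat tank or a sphere is not -- exactly the sort of difference that an intuition trained on industrial plumbing would never flag.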

Second, there is a fault at the opposite end of the knowledge spectrum -- the tendency of expert engineers and scientists to believe that they can solve complicated reactor problems on the fly. This turned out to be a critical problem at Chernobyl (kl 6859).
The most difficult problem to handle is that the reactor operator, highly trained and educated with an active and disciplined mind, is liable to think beyond the rote procedures and carefully scheduled tasks. The operator is not a computer, and he or she cannot think like a machine. When the operator at NRX saw some untidy valve handles in the basement, he stepped outside the procedures and straightened them out, so that they were all facing the same way. (kl 2057)
There are also clear examples of inappropriate supervision in the accounts shared by Mahaffey. Here is an example from Chernobyl.
[Deputy chief engineer] Dyatlov was enraged. He paced up and down the control panel, berating the operators, cursing, spitting, threatening, and waving his arms. He demanded that the power be brought back up to 1,500 megawatts, where it was supposed to be for the test. The operators, Toptunov and Akimov, refused on grounds that it was against the rules to do so, even if they were not sure why. 
Dyatlov turned on Toptunov. “You lying idiot! If you don’t increase power, Tregub will!”  
Tregub, the Shift Foreman from the previous shift, was officially off the clock, but he had stayed around just to see the test. He tried to stay out of it. 
Toptunov, in fear of losing his job, started pulling rods. By the time he had wrestled it back to 200 megawatts, 205 of the 211 control rods were all the way out. In this unusual condition, there was danger of an emergency shutdown causing prompt supercriticality and a resulting steam explosion. At 1:22:30 a.m., a read-out from the operations computer advised that the reserve reactivity was too low for controlling the reactor, and it should be shut down immediately. Dyatlov was not worried. “Another two or three minutes, and it will be all over. Get moving, boys!” (kl 6887)
This was the turning point in the disaster.

A related fault is the intrusion of political and business interests into the design and conduct of high-risk nuclear operations. Leaders want a given outcome without understanding the technical details of the processes they are demanding; subordinates like Toptunov are eventually cajoled or coerced into taking the problematic actions. The persistence of advocates for liquid sodium breeder reactors represents a higher-level example of the same fault. Associated with this role of political and business interests is an impulse towards secrecy and concealment when accidents occur, and towards deliberate understatement of the public dangers an accident creates -- a fault amply demonstrated in the Fukushima disaster.

Atomic Accidents provides a fascinating history of events of which most of us are unaware. The book is not primarily intended to offer an account of the causes of these accidents, but rather of the ways in which they unfolded and the consequences they had for human welfare. (Generally speaking, his view is that nuclear accidents in North America and Western Europe have caused remarkably few human casualties.) And many of the accidents he describes are exactly the sorts of failures that are common in all large-scale industrial and military processes.

(Large-scale technology failure has come up frequently here. See these posts for analysis of some of the organizational causes of technology failure (link, link, link).)

Wednesday, December 13, 2017

Varieties of organizational dysfunction


Several earlier posts have made the point that important technology failures often include organizational faults in their causal background.

It is certainly true that most important accidents have multiple causes, and it is crucial to have as good an understanding as possible of the range of causal pathways that have led to air crashes, chemical plant explosions, or drug contamination incidents. But in the background we almost always find the organizations and practices through which complex technical activities are designed, implemented, and regulated. Human actors, organized into patterns of cooperation, collaboration, competition, and command, are as crucial to technical processes as are power lines, cooling towers, and computer control systems. So it is imperative that we follow the lead of researchers like Charles Perrow (The Next Catastrophe: Reducing Our Vulnerabilities to Natural, Industrial, and Terrorist Disasters), Kathleen Tierney (The Social Roots of Risk: Producing Disasters, Promoting Resilience), or Diane Vaughan (The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA) and give close attention to the social- and organization-level failures that sometimes lead to massive technological failures.

It is useful to have a few examples in mind as we undertake to probe this question more deeply. Here are a number of important accidents and failures that have been carefully studied.
  • Three Mile Island, Chernobyl nuclear disasters
  • Challenger and Columbia space shuttle disasters
  • Failure of United States anti-submarine warfare in 1942-43
  • Flawed policy and decision-making in US leading to escalation of Vietnam War
  • Flawed policy and decision-making in France leading to Dien Bien Phu defeat
  • Failure of Nuclear Regulatory Commission to ensure reactor safety
  • DC-10 design process
  • Osprey design process
  • failure of Federal flood insurance to appropriately guide rational land use
  • FEMA failure in Katrina aftermath
  • Design and manufacture of the Edsel sedan
  • High rates of hospital-acquired infections in some hospitals
Examples like these allow us to begin to create an inventory of organizational flaws that sometimes lead to failures and accidents:
  • siloed decision-making (design division, marketing division, manufacturing division all have different priorities and interests)
  • lax implementation of formal processes
  • strategic bureaucratic manipulation of outcomes 
    • information withholding, lying
    • corrupt practices, conflicts of interest and commitment
  • short-term calculation of costs and benefits
  • indifference to public goods
  • poor evaluation of data; misinterpretation of data
  • lack of high-level officials responsible for compliance and safety
These deficiencies may be analyzed in terms of a more abstract list of organizational failures:
  • Poor decisions given existing priorities and facts
    • poor priority-setting processes
    • poor information-gathering and analysis
  • failure to learn and adapt from changing circumstances
  • internal capture of decision-making; corruption, conflict of interest
  • vulnerability of decision-making to external pressures (external capture)
  • faulty or ineffective implementation of policies, procedures, and regulations

******

Nancy Leveson is a leading authority on the systems-level causes of accidents and failures. A recent white paper can be found here. Here is the abstract for that paper:
New technology is making fundamental changes in the etiology of accidents and is creating a need for changes in the explanatory mechanisms used. We need better and less subjective understanding of why accidents occur and how to prevent future ones. The most effective models will go beyond assigning blame and instead help engineers to learn as much as possible about all the factors involved, including those related to social and organizational structures. This paper presents a new accident model founded on basic systems theory concepts. The use of such a model provides a theoretical foundation for the introduction of unique new types of accident analysis, hazard analysis, accident prevention strategies including new approaches to designing for safety, risk assessment techniques, and approaches to designing performance monitoring and safety metrics. (1; italics added)
Here is what Leveson has to say about the social and organizational causes of accidents:

2.1 Social and Organizational Factors

Event-based models are poor at representing systemic accident factors such as structural deficiencies in the organization, management deficiencies, and flaws in the safety culture of the company or industry. An accident model should encourage a broad view of accident mechanisms that expands the investigation from beyond the proximate events.

Ralph Miles Jr., in describing the basic concepts of systems theory, noted that:

Underlying every technology is at least one basic science, although the technology may be well developed long before the science emerges. Overlying every technical or civil system is a social system that provides purpose, goals, and decision criteria (Miles, 1973, p. 1).

Effectively preventing accidents in complex systems requires using accident models that include that social system as well as the technology and its underlying science. Without understanding the purpose, goals, and decision criteria used to construct and operate systems, it is not possible to completely understand and most effectively prevent accidents. (6)