Showing posts with label safety. Show all posts

Tuesday, August 15, 2023

Are organizations emergent?


Do organizations have properties that are in some recognizable way independent from the behaviors and intentions of the individuals who inhabit them? In A New Social Ontology of Government I emphasized the ways in which organizations fail because of actor-level features: principal-agent problems, inconsistent priorities and goals across different working groups, strategic manipulation of information by some actors to gain advantage over other actors, and the like. With a nod to Fligstein and McAdam's theory of strategic action fields (link), I took an actor-centered approach to the workings (and dysfunctions) of organizations. I continue to believe that these are accurate observations about the workings of organizations and government agencies, but now that I've reoriented my thinking away from a strictly actor-centered approach to the social world (link), I'm interested in asking the questions about meso-level causes I did not ask in A New Social Ontology.

For example: 

(a) Are there relatively stable meso-level features of organizations that constrain and influence individual behavior in consistent ways that produce relatively stable meso-level outcomes? 

(b) Are there routine behaviors that are reproduced within the organization by training programs and performance audits that give rise to consistent patterns of organizational workings? 

(c) Are there external structural constraints (legal, environmental, locational) that work to preserve certain features of the organization's scheme of operations? 

It seems that the answer to each of these questions is "yes"; but this in turn seems to imply that organizations have properties that persist over time and through changes of personnel. They are not simply the result of the sum of the behaviors and mental states of the participants. These meso-level properties are subject to change, of course, depending on the behaviors and intentions of the individuals who inhabit the organization; but they are sometimes stable across extended periods of time and individual personnel. Or in other words, there seem to be meso-level features of organizations that are emergent in some moderate sense.

Here are possible illustrations of each kind of "emergent" property.

(a) Imagine two chemical plants Alpha and Beta making similar products with similar industrial processes and owned by different parent corporations. Alpha has a history of occasional fires, small explosions, and defective equipment, and it was also the site of a major chemical fire that harmed dozens of workers and neighbors. Beta has a much better safety record; fires and explosions are rare, equipment rarely fails in use, and no major fires have occurred for ten years. We might then say that Alpha and Beta have different meso-level safety characteristics, with Alpha lying in the moderate risk range and Beta in the low risk range. Now suppose that we ask an all-star team of industrial safety investigators to examine both plants, and their report indicates that Alpha has a long history of cost reduction plans, staff reductions, and ineffective training programs, whereas Beta (owned by a different parent company) has been well funded for staffing, training, and equipment maintenance. This is another meso-level property of the two plants -- production decisions guided by profitability and cost reduction at Alpha, and production decisions guided by both profitability and a commitment to system safety at Beta. Finally, suppose that our team of investigators conducts interviews and focus groups with staff and supervisors in the two plants, and finds that there are consistent differences between the two plants about the importance of maintaining safety as experienced by plant workers and supervisors. Supervisors at Alpha make it clear that they disagree strongly with the statement, "interrupting the production process to clarify anomalous temperature readings would be encouraged by the executives", whereas their counterparts at Beta indicate that they agree with the statement. This implies that there is a significant difference in the safety culture of the two plants -- another meso-level feature of the two organizations. 
All of these meso-level properties persist over decades and through major turnover of staff. Supervisors and workers come and go, but the safety culture, procedures, training, and production pressure persist, and new staff are introduced to these practices in ways that reproduce them. And -- this is the key point -- these meso-level properties lead to different rates of failure at the two plants over time, even though none of the actors at Alpha intend for accidents to occur. 
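The claim that stable meso-level properties produce different failure rates over time, independent of any actor's intentions, can be illustrated with a toy simulation. All of the numbers below are invented for illustration: two plants perform the same volume of operations, but their differing safety practices give them slightly different per-operation incident probabilities, and the accident records diverge without anyone at either plant intending an accident.

```python
import random

# Toy illustration: two hypothetical plants run identical operations,
# but their meso-level safety properties give them different
# per-operation incident probabilities. The probabilities are invented.
random.seed(42)

P_INCIDENT = {"Alpha": 0.0005, "Beta": 0.0001}  # illustrative values
OPERATIONS_PER_YEAR = 10_000
YEARS = 20

def incidents(plant: str) -> int:
    """Count incidents at one plant over the simulated period."""
    p = P_INCIDENT[plant]
    return sum(
        1
        for _ in range(OPERATIONS_PER_YEAR * YEARS)
        if random.random() < p
    )

results = {plant: incidents(plant) for plant in ("Alpha", "Beta")}
for plant, n in results.items():
    print(f"{plant}: {n} incidents in {YEARS} years")
```

The point of the sketch is only that a small, persistent difference in a meso-level parameter accumulates into a large difference in outcomes over time; no individual draw reflects anyone's intention.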

(b) This example comparing industrial plants with different safety rates also serves to answer the second question posed above about training and oversight. The directors and staff who conduct training in an industrial organization can have high commitment or low commitment to their work -- energetic and focused training programs or perfunctory and forgettable training programs -- and the difference will be notable in the performance of new staff as they take on their responsibilities. For example, training for control room directors may always emphasize the importance of careful annotation of the day's events for the incoming director on the next shift. But the training may be highly effective, resulting in informative documentation across shift changes; or it may be ineffective and largely disregarded. In most cases poor documentation does not lead to a serious accident; but sometimes it does. So organizations with effective training on procedures and operations will have a better chance of avoiding serious accidents. Alpha has weak training programs, while Beta has strong training programs (and each dedicates commensurate resources to training). Routine behaviors at Alpha lead to careless implementation of procedures, whereas routine behaviors at Beta result in attentive implementation, and as a result, Beta has a better safety performance record.

(c) What about the external influences that have an effect on the overall safety performance of an industrial plant? The corporate governance and ownership of the plant is plainly relevant to safety performance through the priorities it establishes for production, profitability, and safety. If the corporation's highest priority is profitability, then safety procedures and investments take a back seat. Local budget managers are pressed to find cost reductions, and staff positions and equipment devoted to safety are often the easiest category of budget reduction to achieve. On the other hand, if the corporation's guidance to plant executives is a nuanced set of priorities within which both production goals and safety goals are given importance, there is a better chance of preserving the investments in process inspectors, better measurement instruments, and on-site experts who can be called on to offer advice during a plant emergency. This differentiating feature of corporate priority-setting too is a meso-level property that contributes to the level of safety performance in a chemical plant, independent of the knowledge and intentions of local plant managers, directors, and workers.

These brief hypothetical examples seem to establish a fairly mundane form of "emergence" for organizational properties. They provide examples of causal independence of meso-level properties of organizations. And significantly, each of these meso-level features can be identified in case studies of important industrial failures -- the Boeing 737 Max (link), the Deepwater Horizon disaster (link), or the 2005 Texas City refinery explosion (link).

It may be noted that there are two related ideas here: the idea that a higher-level property is emergent from the properties of the constituent entities; and the idea that a higher-level feature may be causally persistent over time and over change of the particular actors who make up the social entity. The connection is this: we might argue that the causally persistent property at the meso-level is different in nature and effect from the causal properties (actions, behaviors, intentions) of the individuals who make up the organization. So causal persistence of meso-level properties demonstrates emergence of a sort.


Saturday, July 22, 2023

Regulatory failure in freight rail traffic

On any given day some 7,000 freight trains are in motion around the United States, with perhaps 70,000 individual freight cars and intermodal units in transit daily (link). (Here is the US DOT Federal Railroad Administration (FRA) website, which provides a fair amount of information about the industry.) Freight rail is big business, with record profits over the past several years. And it is occasionally an industry prone to accidents, failures, and disasters. The most recent major incident was the derailment of a Norfolk Southern train near the town of East Palestine, Ohio, resulting in the release of a large amount of vinyl chloride, a flammable and toxic chemical. The full extent of this catastrophe is not yet known.

Derailments, crashes, fires, and explosions make the news. But there is a more insidious process at work as well: the relentless effort by the large freight rail companies to increase profits by increasing the volume of freight and reducing costs. And -- as is true in many other risky business operations -- reducing costs has worrisome consequences for safety (link). Reducing personnel is one way of reducing costs, and crew size on freight trains has been reduced substantially. There were only two crew members on the East Palestine, Ohio train (engineer and conductor) that derailed, and the industry wants to preserve the freedom to reduce the cabin crew to a single engineer (link). Increasing the number of cars -- and therefore the length of individual trains -- is another way of reducing costs for a ton-mile of transportation; and sure enough, trains are now traveling around the country that are substantially more than a mile long. Another strategy for cost "containment" is the strategy of tightening operations in and around large rail yards, streamlining the process of re-mixing cars into trains with different destinations. And the tighter the schedules become, the more tightly-linked the system becomes. So a disruption in St. Louis can lead to congestion in Pittsburgh.

The railroad companies and the Association of American Railroads make the case that the rail safety record has improved significantly over recent years; link. And it is of course true that there is a business case for maintaining safe operations. However, it is plain that voluntary efforts at maintaining public safety are insufficient when they conflict directly with other business priorities.

Rail operations and business management plainly involve risks for the public; so government regulation of the industry is crucial. But the economic power of railroads -- today as well as in the 1880s -- allows the companies, their associations, and their lobbyists to block sensible regulations that plainly serve the interests of the public (plainly, at least, to neutral observers). Because railroads are largely a form of interstate commerce, states and local authorities have little or no ability to regulate safety. Instead, this authority is assigned to the Congress and the Department of Transportation's Federal Railroad Administration. And yet the hazards created by railroads are inherently local, and local and state authorities have almost no jurisdiction.

Consider the photo above. It is a freight train stopped across an unguarded rural rail crossing in Michigan. The train will sit across the road for an extended period of time, from ten minutes to an hour. And the relatively few people who use the road to get to work, to take children to school, to go shopping or bowling (yes, we have bowling alleys in Michigan) -- these people will simply have to wait, or to take a circuitous route around the obstruction. Fortunately in this local instance in Michigan the blockages are relatively short and there are other routes that drivers can take.

But turn now to York, Alabama, as reported in the July 15, 2023 New York Times (link). Peter Eavis, Mark Walker, and Niraj Chokshi describe a chronic problem in this small town on a rail line owned by Norfolk Southern. "Freight trains frequently stop and block the roads of York, Ala., sometimes for hours. Emergency services and health care workers can't get in, and those trapped inside can't get out." "On a sweltering election day in June 2022, a train blockage lasted more than 10 hours, forcing many people, some old and ill, to shelter in an arts center." And the problem is getting worse, as freight trains become longer and longer, with more frequent (and longer) periods in which a train blocks a crossing.

The article makes the point that state and local laws aimed at regulating these blockages have been regularly overturned by the courts, and efforts to introduce Federal remedies have failed. "Congressional proposals to address the issue have failed to overcome opposition from the rail industry." The article indicates that the lobbying efforts of the rail companies and their industry associations are highly effective in shaping legislation and regulation that affects the industry. They report that the rail companies and the AAR have spent $454 million in lobbying over the past twenty years, including campaign contributions to key legislators. They also make the point that the extent of the problem of extended blockages of rail crossings is poorly documented, since there are more than 200,000 rail crossings and only a low level of reporting of individual blockages. Long freight trains are part of the problem, because trains longer than a mile exceed the length of many sidings that were previously used to manage train traffic without blocking crossings.

This is a familiar problem -- the problem of industry capture of regulatory agencies through the use of their financial and political resources. The industry wants to have the freedom to organize operations as it sees fit; and its first goal is to maintain and increase profits. The public needs regulatory agencies that depend on neutral expert assessment and rule-setting that protects the safety of the families who are affected by the industry; and yet -- as Charles Perrow argued in "Cracks in the Regulatory State" -- all too often the regulatory process defers to the business interests of the industry (link). He writes:

Almost every major industrial accident in recent times has involved either regulatory failure or the deregulation demanded by business and industry. For more examples, see Perrow (2011). It is hard to make the case that the industries involved have failed to innovate because of federal regulation; in particular, I know of no innovations in the safety area that were stifled by regulation. Instead, we have a deregulated state and deregulated capitalism, and rising environmental problems accompanied by growing income and wealth inequality. (210).

Blocked crossings are an inconvenience of everyday life for many people. But they can also lead to life-threatening situations when ambulances and fire vehicles cannot gain access to scenes of emergency. Leaving the problem of blocked crossings to the railroad companies -- rather than a sensible set of FRA rules and penalties -- is surely a prescription for a worsening problem over time. As Willie Lake, the mayor of York, put the point in the New York Times article: "They have no incentive" to make substantial changes in their operations to substantially improve the blocked-crossing problem. The FRA needs to provide clear and sensible regulations that give the companies the incentives needed to address the problem.

(The Washington Post ran an extensive story in May on blocked rail crossings, with examples from Leggett, Texas; link. The National Academy of Sciences is conducting a study of the safety implications of freight trains longer than 7,500 feet (link).)


Sunday, July 9, 2023

The air traffic control system and ethno-cognition



Diane Vaughan's recent Dead Reckoning: Air Traffic Control, System Effects, and Risk is an important contribution to the literature on safety in complex socio-technical systems. The book is an ethnographic study of the workspaces and the men and women who manage the flow of aircraft throughout US airspace. Her ethnographic work for this study was extensive and detailed. She is interested in arriving at a representation of air traffic control arrangements as a system, and she pays ample attention to the legal and regulatory arrangements embodied in the Federal Aviation Administration as the administrator of this system. As an organizational sociologist, she understands full well that "institutions matter" -- the institutions and organizations that have been created and reformed over time have specific characteristics that influence the behavior of the actors who work within the system, and influence in turn the effectiveness and safety of the system. But her central finding is that it is the situated actors who do their work in control towers and flight centers who are critical to the resilience and safety of the system. And what is most important about the character of their work is the embodied social cognition that they have achieved through training and experience. She uses the term "ethno-cognition" to refer to the extended system of concrete and embodied knowledge that is distributed across the corps of controllers in air traffic control centers and towers across the country.

Vaughan emphasizes the importance of “situated action” in the workings of a complex socio-technological system: “the dynamic between the system’s institutional environment, the organization as a socio-technical system, and the controllers’ material practices, interpretive work, and the meanings the work has for them” (p. 11). She sees this intellectual frame as a bridge between the concrete activities in a control tower and the meso-level arrangements and material infrastructure within which the work proceeds. This is where "micro" meets macro and meso in the air traffic control system.

The key point here is that the skilled air traffic controller is not just the master of an explicit set of protocols and procedures. He or she has gained a set of cognitive skills that are “embodied” rather than formally represented as a system of formal rules and facts. “Collectively, controllers’ cultural system of knowledge is a set of embodied repertoires – cognitive, physical, emotional, and material practices – that are learned and drawn upon to craft action from moment to moment in response to changing conditions. In constructing the act, structure, culture, and agency combine” (p. 122). Vaughan's own process of learning through this extended immersion sounds a great deal like Michael Polanyi’s description in Personal Knowledge of the acquisition of “tacit knowledge” by a beginning radiologist; she learned to “see” the sky in the way that a trained controller saw it. The controllers have mastered a huge set of tacit repertoires that permit them to understand and control the rapidly changing air spaces around them.

A special strength of the book is the detailed attention Vaughan gives to the historicity and contingency of institutions and organizations. Vaughan’s approach is deliberately “multi-level”, including government agencies, institutions, organizations, and individuals in their workplaces. Vaughan takes full account of the fact that institutions change over time as a result of the actions of a variety of actors, and changes in the institutional settings have consequences for the workings of embedded technological systems. She points out, moreover, that these changes are almost always “patch-work” changes, involving incremental efforts to fit new technologies or team practices into existing organizational forms. “Incrementally, problem-solving people and organizations inside the air traffic control system have developed strategies of resilience, reliability, and redundancy that provided perennial dynamic flexibility to the parts of the system structure, and they have improvised tools of repair to adjust innovations to local conditions, contributing to system persistence” (p. 9). Institutions and organizations change largely through processes of "social hacking" and adjustment, rather than wholesale redesign, and she finds that the small number of instances of efforts to fully redesign the system have failed.

Particularly valuable in Vaughan’s narrative is her fluid integration of processes and factors at the macro, meso, and micro levels. High-level features like economic pressures on airlines, budget constraints within the Federal government (e.g. delaying implementation of long-range radar in the air traffic control system; 83), and military imperatives on the movements of aircraft (83); meso-level features like the regulatory system for air traffic safety as it emerged and evolved; and micro-level features like the architecture of the workspaces of controllers over time and the practices and problem-solving abilities that were embodied in their work – all these levels are represented in almost every page of the book. And Vaughan points out that all of these processes have the potential of creating unanticipated system dysfunctions beyond their direct effects. Even the facts of the diffusion of high-speed commercial jets and the rise in military staffing demands during the Vietnam War had important and unanticipated system consequences for the air traffic control system.

Vaughan refers frequently to the causal role that “history” plays in complex technology systems like the air traffic control system. But she avoids the error of reification of “history” by carefully paraphrasing what this claim means to her: “History has a causal effect on the present only through the agency of multiple heterogeneous social actors and actions originating in different institutional and organizational locations and temporalities that intersect with a developing system and through its life course in unanticipated ways…. Causal explanations of historical events, institutions, and outcomes are best understood by storylike explanations that capture the sequential unfolding of events in and over time, revealing the interaction of structures and social actions that drive change” (p. 42). This clarification correctly disaggregates the causal powers of “history” into the actors, institutions, and processes whose influences over time have contributed to the current workings of the socio-technical system. Further, it provides a useful contribution to the literature on institutional change with its granular level of detail concerning the “career” of the air traffic control system over several decades. (Here she draws on the work of sociologists like Andrew Abbott on the role of temporality in social explanations.)

One illustration of Vaughan’s attention to historical details occurs in her account of the extended process of invention, design, test, and publicity undertaken by the Wright brothers. This narrative illustrates the multi-level influences that contributed to the establishment of a paradigm of heavier-than-air flight in the early decades of the twentieth century – individual innovation, networks of transmission of ideas, institutional context, and the authority and reputation of the magazine Scientific American (pp. 49-63). And, like Thomas Hughes' historical account of the development of electric power in the United States in Networks of Power, her account is fully attentive to the contingency and path-dependency of these processes. This material makes a genuine contribution to science and technology studies and to recent work in the history of technology. 

Vaughan sounds a cautionary note about the safety and resilience of the air traffic system, and its (usually) excellent record of preventing mid-air and on-ground collisions among aircraft. She has argued persuasively throughout the book that these features of safety and resilience depend crucially on the well-trained and experienced controllers who observe and control the airspace. But she notes as well the perennial desire of both private businesses and government agencies to squeeze costs and "waste" out of complex processes. In the context of the air traffic control system, this has meant trying to reduce the number of controllers through "streamlined" processes and more extensive technologies. Her reaction to these impulses is clearly a negative one: reducing staff in air traffic control towers is a very good way of making unlikely events like mid-air collisions incrementally more likely; and that fact translates into an increasing likelihood of loss of life (and the business and government losses associated with major disasters). We should not look at reasonable staffing levels in control towers as a "wasteful" organizational practice.

(Here is an earlier post on "Expert Knowledge" that is relevant to Vaughan's findings; link.)

Thursday, August 25, 2022

Organizational factors and nuclear power plant safety

image: Peach Bottom Nuclear Plant

The Nuclear Regulatory Commission has responsibility for ensuring the safe operations of the nuclear power reactors in the United States, of which there are approximately 100. There are significant reasons to doubt whether its regulatory regime is up to the task. Part of the challenge is the technical issue of how to evaluate and measure the risks created by complex technology systems. Part is the fact that it seems inescapable that organizational and management factors play key roles in nuclear accidents -- factors the NRC is ill-prepared to evaluate. And the third component of the challenge is the fact that the nuclear industry is a formidable adversary when it comes to "intrusive" regulation of its activities. 

Thomas Wellock is the official historian of the NRC, and his work shows an admirable degree of independence from the "company line" that the NRC wishes to present to the public. Wellock's book, Safe Enough?: A History of Nuclear Power and Accident Risk, is the closest thing we have to a detailed analysis of the workings of the commission and its relationships to the industry that it regulates. A central focus in Safe Enough is the historical development of the key tool used by the NRC in assessing nuclear safety, the methodology of "probabilistic risk assessment" (PRA). This is a method for aggregating the risks associated with multiple devices and activities involved in a complex technology system, based on failure rates and estimates of harm associated with failure. 
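The basic arithmetic behind PRA can be sketched simply: each failure path through the system is assigned a probability and an estimated consequence, and the system-level risk is the sum of expected harms across the paths. The sketch below is a deliberately toy version -- real PRA uses fault trees and event trees and models dependencies among components, and all of the component names, probabilities, and harm figures here are invented for illustration, not drawn from any actual reactor study.

```python
# Toy sketch of probabilistic risk assessment (PRA) aggregation.
# All names, probabilities, and harm estimates are hypothetical.

failure_paths = [
    # (description, annual failure probability, estimated harm in arbitrary units)
    ("coolant pump failure plus backup failure", 1e-5, 1_000_000),
    ("control rod mechanism jam",                1e-6, 5_000_000),
    ("operator misreads anomalous gauge",        1e-3,    10_000),
]

# Expected annual harm for each path: probability times consequence.
expected_harms = {desc: p * harm for desc, p, harm in failure_paths}

# Aggregate risk, assuming the paths are independent -- a strong
# simplification; real PRA models common-cause dependencies.
total_risk = sum(expected_harms.values())

for desc, eh in expected_harms.items():
    print(f"{desc}: expected harm {eh:.2f}/yr")
print(f"aggregate expected harm: {total_risk:.2f}/yr")
```

Note that in this invented example the high-probability, low-consequence path (the operator error) contributes as much expected harm as the low-probability, high-consequence equipment failure -- which is precisely why organizational factors that raise everyday error rates matter to aggregate risk, and why their exclusion from the PRA framework is consequential.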

This preoccupation with developing a single quantitative estimate of reactor safety reflects the engineering approach to technology failure. However, Charles Perrow, Diane Vaughan, Scott Sagan, and numerous other social scientists who have studied technology hazards and disasters have made clear that organizational and managerial failures almost always play a key role in the occurrence of a major accident such as Three Mile Island, Fukushima, or Bhopal. This is the thrust of Perrow's "normal accident" theory and Vaughan's "normalization of deviance" theory. And organizational effectiveness and organizational failures are difficult to measure and quantify. Crucially, these factors are difficult to incorporate into the methodology of probabilistic risk assessment. As a result, the NRC has almost no ability to oversee and enforce standards of safety culture and managerial effectiveness.

Wellock addresses this aspect of an incomplete regulatory system in "Social Scientists in an Adversarial Environment: The Nuclear Regulatory Commission and Organizational Factors Research" (link). The problem of assessing "human factors" has been an important element of the history of the NRC's efforts to regulate the powerful nuclear industry, and failure in this area has left the NRC handicapped in its ability to address pervasive ongoing organizational faults in the nuclear industry. Wellock's article provides a detailed history of efforts by the NRC to incorporate managerial assessment and human-factors analysis into its safety program -- to date, with very little success. And, ironically, the article demonstrates a key dysfunction in the organization and setting of the NRC itself; because of the adversarial relationship that exists with the nuclear industry, and the influence that the industry has with key legislators, the NRC is largely blocked from taking commonsense steps to include evaluation of safety culture and management competence into its regulatory regime.

Wellock makes it clear that both the NRC and the public have been aware of the importance of organizational dysfunctions in the management of nuclear plants since the Three Mile Island accident in 1979. However, the culture of the organization itself makes it difficult to address these dysfunctions. Wellock cites the experience of Valerie Barnes, a research psychologist on staff at the NRC, who championed the importance of focusing attention on organizational factors and safety culture. "She recalled her engineering colleagues did not understand that she was an industrial psychologist, not a therapist who saw patients. They dismissed her disciplinary methods and insights into human behavior and culture as 'fluffy,' unquantifiable, and of limited value in regulation compared to the hard quantification bent of engineering disciplines" (1395). 

The NRC took the position that organizational factors and safety culture could only properly be included in the regulatory regime if they could be measured, validated, and incorporated into the PRA methodology. The question of the quantifiability and statistical validity of human-factors research and safety-culture research turned out to be insuperable -- largely because these were the wrong standards for evaluating the findings of these areas of the social sciences. "In the new program [in the 1990s], the agency avoided direct evaluation of unquantifiable factors such as licensee safety culture" (1395). (It is worth noting that this presumption reflects a thoroughly positivistic and erroneous view of scientific knowledge; link, link. There are valid methods of sociological investigation that do not involve quantitative measurement.) 

After the Three Mile Island disaster, both the NRC and external experts on nuclear safety had a renewed interest in organizational effectiveness and safety culture. Analysis of the TMI disaster made organizational dysfunctions impossible to ignore. Studies by the Battelle Human Affairs Research Center were commissioned in 1982 (1397), to permit design of a regulatory regime that would evaluate management effectiveness. Here again, however, the demand for quantification and "correlations" blocked the creation of a regulatory standard for management effectiveness and safety culture. Moreover, the nuclear industry was able to resist efforts to create "intrusive" inspection regimes involving assessment of management practices. "In the mid-1980s, the NRC deferred to self-regulating initiatives under the leadership of the Institute for Nuclear Power Operations (INPO). This was not the first time the NRC leaned on INPO to avoid friction with industry" (1397). 

A serious event at the Davis-Besse plant in Ohio in 1983 focused attention on the importance of management, organizational dysfunction, and safety culture, and a National Academy of Sciences report in 1988 once again recommended that the NRC must give high priority to these factors -- quantifiable or not (Human Factors Research and Nuclear Safety; link).

The panel called on the NRC to prioritize research into organizational and management factors. “Management can make or break a plant,” Moray told the NRC’s Advisory Committee for Reactor Safeguards. Even more than the man-machine interface, he said, it was essential that the NRC identify what made for a positive organizational culture of reliability and safety and develop appropriate regulatory feedback mechanisms that would reduce accident risk. (1400)

These recommendations led the NRC to commission an extensive research consultancy with a group of behavioral scientists at Brookhaven Laboratory. The goal of this research, once again, was to identify observable and measurable factors of organization and safety culture that would permit quantification of both of these intangible features of nuclear plants -- and ultimately to permit their incorporation into PRA models.

 Investigators identified over 20 promising organizational factors under five broad categories of control systems, communications, culture, decision making, and personnel systems. Brookhaven concluded the best measurement methodologies included research surveys, behavioral checklists, structured interview protocols, and behavioral-anchored rating scales. (1401)

However, this research foundered on three problems: the cost of evaluating a nuclear operator on this basis; the "intrusiveness" of the methods needed to evaluate these organizational systems; and the intransigent, adversarial opposition of the operators of nuclear plants to these kinds of assessment. It also emerged that it was difficult to establish correlations between the organizational factors identified and the safety performance of a range of plants. The NRC backed down from its effort to directly assess organizational effectiveness and safety culture, and instead opted for a new "Reactor Oversight Process" (ROP) that made use only of quantitative factors associated with safety performance (1403).

A second and more serious incident at the Davis-Besse nuclear plant in 2002 resulted in a near-miss loss-of-coolant accident (link), and investigation by the NRC and GAO compelled the NRC to once again bring safety culture back onto the regulatory agenda. Executives, managers, operators, and inspectors were all found to have behaved in ways that greatly increased the risk of a highly damaging LOCA at Davis-Besse. The NRC imposed more extensive organizational and managerial requirements on the operators of the Davis-Besse plant, but these protocols were not extended to other plants.

It is evident from Wellock's 2021 survey of the NRC history of human-factors research and organizational research that the commission is currently incapable of taking seriously the risks to reactor safety created by the kinds of organizational failures documented by Charles Perrow, Diane Vaughan, Andrew Hopkins, Scott Sagan, and many others. NRC has shown that it is aware of these social-science studies of technology system safety. But its intellectual commitment to a purely quantitative methodology for risk assessment, combined with the persistent ability of the nuclear operators to prevent forms of "intrusive" evaluation that they don't like, leads to a system in which major disasters remain a distinct possibility. And this is very bad news for anyone who lives within a hundred miles of a nuclear power plant.


Friday, December 10, 2021

China's food-safety governance system


Food safety is a very high-level concern for ordinary consumers. This is true because the food we eat can poison us or ruin our health, and yet consumers have little ability to evaluate the safety of the foods available in the marketplace. Therefore government regulation of food safety appears to be mandatory in any complex society. 

Regulation requires several things: science-based regulations on the processes and composition of the products being regulated; consistent and disinterested inspection; effective enforcement of regulations and sanctions for violations; and oversight by a regulatory agency that is independent of the industry being regulated and insulated from the general political interests of the government within which it exists.

China has experienced many food-contamination scandals in the past twenty years, and food safety is ranked as a high-level concern by many Chinese citizens. In his 2012 post on food safety in China in the Council on Foreign Relations blog (link), Yanzhong Huang writes that "in the spring of 2012, a survey carried out in sixteen major Chinese cities asked urban residents to list 'the most worrisome safety concerns.' Food safety topped the list (81.8%), followed by public security (49%), medical care safety (36.4%), transportation safety (34.3%), and environmental safety (20.1%)". And the public anxiety is well justified (link). In 2012 Bi Jingquan, then the head of the China Food and Drug Administration, testified that "Chinese food safety departments conducted more than 15 million individual inspections in the first three quarters of the year and found more than 500,000 incidents of illegal behavior" (link). Especially notorious is the milk-melamine contamination scandal of 2008, which resulted in the hospitalization of over 50,000 children and the deaths of at least six children and infants.

The question of interest here has to do with China's governmental system of regulation of safety in the food system. What are the regulatory arrangements currently in place? And do these governmental systems provide a basis for a reasonable level of confidence in the quality and safety of China's food products?

Liu, Mutukumira, and Chen 2019 (link) provide a detailed and comprehensive analysis of the evolution of food-safety regulation in China since 1949. This review article is worth studying in detail for the light it sheds on the challenge of establishing an effective system of regulation in a vast population governed by a single-party state. The article is explicit about the food-safety problems that persist on a wide scale in China:

Food safety incidents still occur, including abuse of food additives, adulterated products as well as contamination by pathogenic microorganisms, pesticides, veterinary drug residues, and heavy metals, and use of substandard materials. (abstract)

The authors refer to a number of important instances of widespread food contamination and dangerous sanitary conditions, including "spicy gluten strips" consumed by teenagers.

Liu et al recommend "coregulation" for the China food system, in which government and private producers each play a crucial role in evaluating and ensuring safe food processes and products. They refer to the "Hazard Analysis Critical Control Point (HACCP) system" that should be implemented by food producers and processors (4128), and they emphasize the need in China for a system that succeeds in ensuring safe food at low regulatory cost.

Increasing number of countries uses new coregulation schemes focusing on a specific type of coregulation where regulations are developed by public authorities and then implemented by the coordinated actions of public authorities and food operators or “enforced self-regulation” (Guo, Bai, & Gong, 2019; Rouvière & Caswell, 2012).... Coregulation aims to combine the advantages of the predictability and binding nature of legislation with the flexibility of self-regulatory approaches. (4128)

Here is their outline of the chronology of food-safety regimes in China since 1949:



Previous posts have discussed some of the organizational dysfunctions associated with "coregulation" and its cognate concepts (link). The failures of design and implementation of the Boeing 737 Max are attributed in large part to the system of delegated regulation used by the Federal Aviation Administration (link, link). And the Nuclear Regulatory Commission too appears to defer extensively to "industry expertise" in its approach to regulation (link). The problems of regulatory capture and weak, ineffective governmental regulatory institutions are well understood in the US and Europe. And this experience supports a healthy skepticism about the likely effectiveness of "coregulation" in China's food system as well. Earlier posts have emphasized the importance of independence of regulatory agencies from both the political interests of the government and the economic interests of the industry that they regulate. This independence appears to be all but impossible in China's governmental structure and Party rule.

Another weakness identified in Liu et al concerns the level and organizational home of enforcement of food-safety regulations. "The supervision of food safety is mainly dependent on law enforcement departments" (4128). This system is organizationally flawed for at least two reasons. First, it implies a lack of coordination, with different jurisdictions (cities, provinces, counties) exercising different levels and forms of enforcement. And second, it raises the prospect of corruption, both petty and grand, in which inspectors, supervisors, and enforcers are induced to look the other way at infractions. This problem was noted in a prior post on fire safety regulation in China (link). The localism inherent in the food safety system in China is evident in Figure 1:



And the authors highlight the dysfunction that is latent in this diagram:

The local government is responsible for food safety information. At the same time, the local government accepts the leadership of the central government and is responsible to the central government, which forms a principal-agent relationship under asymmetric information. Meanwhile, food producers are in a position of information superiority over local governments and are regulated by the local governments. Therefore, the relationship of the central government, local governments, and food producers is multiple principal-agent relationship. Under the standard of fiscal decentralization and political assessment, local governments are both food safety regulatory agencies and regional competitive entities, so the collusion between local governments, or different counties, and enterprises becomes a rational choice (Tirole, 1986). (4134)
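The collusion logic the authors describe can be made concrete with a toy expected-payoff calculation. This is a hypothetical sketch: the `regulator_payoff` function and all of its numbers are illustrative assumptions of mine, not drawn from Liu et al or Tirole; the point is only that when central monitoring is weak, collusion between a local regulator and a food producer is the payoff-maximizing choice.

```python
# Toy model of the regulator-firm collusion incentive described above.
# All parameter values are hypothetical and chosen for illustration.

def regulator_payoff(collude: bool, bribe: float, growth_credit: float,
                     p_detect: float, penalty: float) -> float:
    """Expected payoff to a local regulator.

    Honest enforcement forgoes the bribe and the political credit
    for local growth (a sanctioned producer contributes less to the
    region's economic performance); collusion yields both, minus the
    expected penalty if the central government detects it.
    """
    if collude:
        return bribe + growth_credit - p_detect * penalty
    return 0.0  # baseline: no bribe, no extra growth credit

# With weak central monitoring (5% detection), collusion dominates:
weak = regulator_payoff(True, bribe=10, growth_credit=5,
                        p_detect=0.05, penalty=100)
print(weak)   # 10.0 > 0: collusion is the rational choice

# Raising the detection probability to 50% flips the incentive:
strong = regulator_payoff(True, bribe=10, growth_credit=5,
                          p_detect=0.5, penalty=100)
print(strong)  # -35.0 < 0: honest enforcement now dominates
```

The sketch makes visible why the authors point to fiscal decentralization and political assessment: both raise the `growth_credit` term for local governments, tilting the calculation toward collusion unless central detection and penalties are correspondingly strengthened.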

It appears incontrovertible that "publicity" is an important factor in enhancing safety in any industry. If the public is informed about incidents -- whether food contamination, chemical plant spills, or nuclear disasters -- public concern can lead to full and rigorous accident investigation and to process changes supporting greater safety in the future. Conversely, if the government suppresses the news media's ability to provide information about these kinds of incidents, there is much less public pressure for more effective safety regulation. Chinese leaders' determination to tightly control the flow of information is decidedly harmful to the goal of increasing food safety and other dimensions of environmental safety.

Liu et al describe the progression of food safety laws and policies over five decades, and they appear to believe that the situation of food safety has improved in the most recent period. They also note, however, that much remains to be done:

With the enactment of the 2015 FSL, China developed and reinforced various regulatory tools. However, there are areas of the law and regulation that need further work, such as effective coordination among government agencies, a focus on appropriate risk communication, facilitating social governance and responsibility, nurturing a food safety culture from bottom-up, and assisting farmers at the primary level (Roberts & Lin, 2016). (4131)

These areas for future improvement are fundamental for establishing a secure and effective safety regime -- whether in the area of food safety or other areas of environmental and industrial safety. And to these we may add several more important factors that are currently absent: independence of regulatory agencies from government direction and industry capture; freedom of information permitting the public to be well informed about incidents when they occur; and an enforcement system that deters and corrects bad performance and process inadequacies.


Friday, October 15, 2021

Fire safety in urban China


A rapidly rising percentage of the Chinese population lives in high-rise apartment buildings in hundreds of cities around the country. There is concern, however, about the quality and effectiveness of fire-safety regulation and enforcement for these buildings (as well as factories, warehouses, ports, and other structures). This means that high-rise fires represent a growing risk in urban China. Here is a news commentary from CGTN (link) describing a particularly tragic 2010 high-rise fire that engulfed a 28-story building in Shanghai, killing 58 people. This piece serves to identify the parameters of the problem of fire safety more generally.

It is of course true that high-rise fires have occurred in many cities around the world, including the notorious Grenfell Tower disaster in 2017. And many of those fires also reflect underlying problems of safety regulation in the jurisdictions in which they occurred. But the problems underlying infrastructure safety seem to converge with particular seriousness in urban China. And, crucially, major fire disasters in other countries are carefully scrutinized in public reports, providing accurate and detailed information about the causes of the disaster. This scrutiny creates the political incentive to improve building codes, inspection regimes, and enforcement mechanisms of safety regulations. This open and public scrutiny is not permitted in China today, leaving the public largely ignorant of the background causes of fires, railway crashes, and other large accidents.

It is axiomatic that modern buildings require effective and professionally grounded building codes and construction requirements, adequate fire safety system requirements, and rigorous inspection and enforcement regimes that ensure a high level of compliance with fire safety regulations. Regrettably, it appears that no part of this prescription for fire safety is well developed in China.

The CGTN article mentioned above refers to the "effective" high-level fire safety legislation that the central government adopted in 1998, the Fire Control Law of the People's Republic of China (link), and this legislation warrants close study. However, close examination suggests that this guiding legislation lacks crucial elements needed to ensure compliance with safety regulations -- especially when compliance is highly costly for the owners and managers of buildings and other facilities. Previous disasters in China suggest a pattern: poor inspection and enforcement before an accident or fire, followed by prosecution and punishment of the individuals involved in its aftermath. But this is not an effective mechanism for ensuring safety. Owners, managers, and officials are more than willing to run the small risk of future prosecution in exchange for savings in the present costs of operating their facilities.

The systemic factors that act against fire safety in China include at least these pervasive social and political conditions: ineffective and corrupt inspection offices, powerful property managers who are able to ignore safety violations, pressure from the central government to avoid interfering with rapid economic growth, government secrecy about disasters when they occur, and lack of independent journalism capable of freely gathering and publishing information about disasters.

In particular, the fact that the news media (and now social media as well) are tightly controlled in China is a very serious obstacle to improving safety when it comes to accidents, explosions, train wrecks, and fires. The Chinese news media do not publish detailed accounts of disasters as they occur, and they usually are unable to carry out the investigative journalism needed to uncover background conditions that have created the circumstances in which these catastrophes arise (ineffective or corrupt inspection regimes; enforcement agencies that are hampered in their work by the political requirements of the state; corrupt practices by private owners/managers of high-rise properties, factories, and ports; and so on). It is only when the public can become aware of the deficiencies in government and business that have led to a disaster, that reforms can be designed and implemented that make those disasters less likely in the future. But the lack of independent journalism means leaving the public in the dark about these important details of their contemporary lives.

The story quoted above is from CGTN, a Chinese news agency, and it is unusual for its honesty in addressing some of the deficiencies of safety management and regulation in Shanghai. CGTN is an English-language Chinese news service, owned and operated by the Chinese state-owned media organization China Central Television (CCTV). As such it is under full editorial control by offices of the Chinese central government. And the government is rarely willing to permit open and honest reporting of major disasters and of the organizational, governmental, and private dysfunctions that led to them. It is noteworthy, therefore, that the story is somewhat explicit about the dysfunctions and corruption that led to the Shanghai disaster. It quotes an article in China Daily (owned by the publicity department of the CCP) that refers to poor enforcement and corruption:

However, a 2015 article by China Daily called for the Fire Control Law to be more strictly enforced, saying that the Chinese public now “gradually takes it for granted that when a big fire happens there must be a heavy loss of life.”

While saying “China has a good fire protection law,” the newspaper warned that it was frequently violated, with fire engine access blocked by private cars, escape routes often blocked and flammable materials still being “widely used in high buildings.”

The article also pointed at corruption within fire departments, saying inspections have “become a cash cow,” with businesses and construction companies paying bribes in return for lax safety standards being ignored.

So -- weak inspections, poor compliance with regulations, and corruption. Both the CCTV report and the China Daily story it quotes are reasonably explicit about unpalatable truths. But note -- the CGTN story was prepared for an English-speaking audience, and is not available to ordinary Chinese readers in China. And this appears to be the case for the China Daily article that was quoted as well. And most importantly -- the political climate surrounding the journalistic practices of China Daily has tightened very significantly since 2015.

Another major institutional obstacle to safety in China is the lack of genuinely independent regulatory safety agencies. The 1998 Fire Control Law of the People's Republic of China is indicative. The legislation refers to the responsibility of local authorities (provincial, municipal) to establish fire safety organizations; but it is silent about the nature, resources, and independence of inspection authorities. Here is the language of the first several articles of the Fire Control Law:

Article 2 Fire control work shall follow the policy of devoting major efforts into prevention and combining fire prevention with fire fighting, and shall adhere to the principle of combining the efforts of both specialized organizations and the masses and carry out responsibility system on fire prevention and safety.

Note that this article immediately creates a confusion of responsibility concerning the detailed tasks of establishing fire safety: "specialized organizations" and "the masses" carry out responsibility.

Article 3 The State Council shall lead and the people's governments at all levels be responsible for fire control work. The people's government at all levels shall bring fire control work in line with the national economy and social development plan, and ensure that fire control work fit in with the economic construction and social development.

Here too is a harmful diffusion of responsibility: "the people's governments at all levels [shall] be responsible ...". In addition, a new priority is introduced: consistency with the "national economy and social development plan". This implies that fire safety regulations and agencies at the provincial and municipal level must balance economic needs against the needs of ensuring safety -- a potentially fatal division of priorities. If substituting non-flammable cladding on an 80-story residential building will add one billion yuan to the total cost of the building -- does this requirement impede the "national economy and development plan"? Can the owners and managers resist the new regulation on the grounds that it is too costly?

Article 4 The public security department of the State Council shall monitor and administer the nationwide fire control work; the public security organs of local people's governments above county level shall monitor and administer the fire control work within their administrative region and the fire control institutions of public security organs of the people's government at the same level shall be responsible for the implementation. Fire control work for military facilities, underground parts of mines and nuclear power plant shall be monitored and administered by their competent units. For fire control work on forest and grassland, in case there are separate regulations, the separate regulations shall be followed.

Here we find specific institutional details about oversight of "nationwide fire control work": it is the public security organs that are tasked to "monitor and administer" fire control institutions. Plainly, the public security organs have no independence from the political authorities at provincial and national levels; so their conduct is suspect when it comes to the task of "independent, rigorous enforcement of safety regulations".

Article 5 Any unit and individual shall have the obligation of keeping fire control safety, protecting fire control facilities, preventing fire disaster and reporting fire alarm. Any unit and adult shall have the obligation to take part in organized fire fighting work.

Here we are back to the theme of diffusion of responsibility. "Any unit and individual shall have the obligation of keeping fire control safety" -- this statement implies that there should not be free-standing, independent, and well-resourced agencies dedicated to ensuring compliance with fire codes, conducting inspections, and enforcing compliance by reluctant owners.

It seems, then, that the 1998 Fire Control Law is largely lacking in what should have been its primary purpose: specification of the priority of fire safety; establishment of independent safety agencies at various levels of government, with independent powers of enforcement and adequate resources to carry out their fire safety missions; and a clear statement that there should be no interference with the proper inspection and enforcement activities of these agencies -- whether by other organs of government or by large owner-operators.

The 1998 Fire Control Law was extended in 2009, and a chapter was added entitled "Supervision and Inspection". Clauses in this chapter offer somewhat greater specificity about inspections and enforcement of fire-safety regulation. Departments of local and regional government are charged to "conduct targeted fire safety inspections" and "promptly urge the rectification of hidden fire hazards" (Article 52). (Notice that the verb "urge" is used rather than "require".) Article 53 specifies that the police station (public security) is responsible for "supervising and inspecting the compliance of fire protection laws and regulations". Article 54 addresses the issue of possible discovery of "hidden fire hazards" during fire inspection; this requires notification of the responsible unit of the necessity of eliminating the hazard. Article 55 specifies that if a fire safety agency discovers that fire protection facilities do not meet safety requirements, it must report to the emergency management department of higher-level government in writing. Article 56 provides specifications aimed at preventing corrupt collaboration between fire departments and units: "Fire rescue agencies ... shall not charge fees, shall not use their positions to seek benefits". And, finally, Article 57 specifies that "all units and individuals have the right to report and sue the illegal activities of the authorities" if necessary. Notice, however, that, first, all of this inspection and enforcement activity occurs within a network of offices and departments dependent ultimately on central government; and second, the legislation remains very unspecific about how this set of expectations about regulation, inspection, and enforcement is to be implemented at the local and provincial levels. 
There is nothing in this chapter that gives the observer confidence that effective regulations will be written, that effective inspection processes will be carried out, or that failed inspections will lead to prompt remediation of hazardous conditions.

The Tianjin port explosion in 2015 is a case in point (link, link). Poor regulations, inadequate and ineffective inspections, corruption, and bad behavior by large private and governmental actors culminated in a gigantic pair of explosions of 800 tons of ammonium nitrate. This was one of the worst industrial and environmental disasters in China's recent history, and resulted in the loss of 173 lives, including 104 poorly equipped fire fighters. Prosecutions ensued after the disaster, including the conviction and suspended death sentence of Ruihai International Logistics Chairman Yu Xuewei for bribery, and the conviction of 48 other individuals for a variety of crimes (link). But punishment after the fact is no substitute for effective, prompt inspection and enforcement of safety requirements.

It is not difficult to identify the organizational dysfunctions in China that make fire safety, railway safety, food safety, and perhaps nuclear safety difficult to attain. What is genuinely difficult is to see how these dysfunctions can be corrected in a single-party state. Censorship, subordination of all agencies to central control, the omnipresence of temptations to corrupt cooperation -- all of these factors seem to be systemic within a one-party state. The party state wants to control public opinion; therefore censorship. The party state wants to control all political units; therefore a lack of independence for safety agencies. And positions of decision-making that create lucrative "rent-seeking" opportunities for office holders -- therefore corruption, from small payments to local inspectors to massive gifts of wealth to senior officials. A pluralistic, liberal society embodying multiple centers of power and freedom of press and association is almost surely a safer society. Ironically, this was essentially Amartya Sen's argument in Poverty and Famines: An Essay on Entitlement and Deprivation, his classic analysis of famine and malnutrition: a society embodying a free press and reasonably free political institutions is much more likely to respond quickly to conditions of famine. His comparison was between India in the Bengal famine (1943) and China in the Great Leap Forward famine (1959-61).

Here is a Google translation of Chapter V of the 2009 revision of the Fire Protection Law of the People's Republic of China mentioned above.

Chapter V Supervision and Inspection

Article 52 Local people's governments at all levels shall implement a fire protection responsibility system and supervise and inspect the performance of fire safety duties by relevant departments of the people's government at the same level.

The relevant departments of the local people's government at or above the county level shall, based on the characteristics of the system, conduct targeted fire safety inspections, and promptly urge the rectification of hidden fire hazards.

Article 53 Fire and rescue agencies shall supervise and inspect the compliance of fire protection laws and regulations by agencies, organizations, enterprises, institutions and other entities in accordance with the law. The police station may be responsible for daily fire control supervision and inspection, and conduct fire protection publicity and education. The specific measures shall be formulated by the public security department of the State Council.

The staff of fire rescue agencies and public security police stations shall present their certificates when conducting fire supervision and inspection.

Article 54: Fire rescue agencies that discover hidden fire hazards during fire supervision and inspection shall notify the relevant units or individuals to take immediate measures to eliminate the hidden hazards; if the hidden hazards are not eliminated in time and may seriously threaten public safety, the fire rescue agency shall, in accordance with regulations, impose temporary sealing measures on the dangerous parts or premises.

Article 55: If the fire rescue agency discovers during fire supervision and inspection that the urban and rural fire safety layout and public fire protection facilities do not meet fire safety requirements, or finds a major fire hazard affecting public safety in the area, it shall report in writing to the emergency management department of the people's government at the same level.

The people's government that receives the report shall verify the situation in a timely manner, organize or instruct relevant departments and units to take measures to make corrections.

Article 56 The competent department of housing and urban-rural construction, fire rescue agencies and their staff shall conduct fire protection design review, fire protection acceptance, random inspections and fire safety inspections in accordance with statutory powers and procedures, so as to be fair, strict, civilized and efficient.

Housing and urban-rural construction authorities, fire rescue agencies, and their staff conducting fire protection design review, fire inspection and acceptance, record and spot checks, and fire safety inspections shall not charge fees and shall not use their positions to seek benefits; nor shall they use their positions to designate, directly or in disguised form, the brands or sales units of fire-fighting products, or the fire-fighting technical service organizations or construction units, for users or construction units.

Article 57 The competent housing and urban-rural construction departments, fire and rescue agencies and their staff perform their duties, should consciously accept the supervision of society and citizens.

All units and individuals have the right to report and sue the illegal activities of the housing and urban-rural construction authorities, fire and rescue agencies and their staff in law enforcement. The agency that receives the report or accusation shall investigate and deal with it in a timely manner in accordance with its duties.

*    *    *    *    *

(Here is a detailed technical fire code for China from 2014 (link).)


Thursday, January 2, 2020

The power of case studies in system safety



Images: Andrew Hopkins titles



Images: Other safety sources

One of the genuinely interesting aspects of the work of Andrew Hopkins is the extensive case studies he has conducted of the causation of serious industrial accidents. A good example is his analysis of the explosion of an Esso natural gas processing plant in Longford, Australia in 1998, presented in Lessons from Longford: The ESSO Gas Plant Explosion, with key findings also summarized in this video. Also valuable is Hopkins' analysis of the Deepwater Horizon blowout in the Gulf of Mexico (link). Here he dispassionately walks through the steps of the accident and identifies faults at multiple levels (operator, engineering, management, corporate policy).

In addition to these books about major accidents and disasters, Hopkins has also created a number of very detailed videos based on the analysis presented in the case studies. These videos offer vivid recreation of the accidents along with a methodical and evidence-based presentation of Hopkins' analysis of the causes of the accidents at multiple levels.

It is intriguing to consider whether it would be possible to substantially improve the "safety thinking" of executives and managers in high-risk industries through an intensive training program based on case studies like these. Intensive system safety training for executives and managers is clearly needed. If complex processes are to be managed in a way that avoids catastrophic failures, executives and managers need to have a much more sophisticated understanding of safety science. Further, they need more refined skills in designing and managing risky processes. And yet much training about industrial safety focuses on the wrong level of accidents -- shop floor accidents, routine injuries, and days-lost metrics -- whereas there is a consensus among safety experts that the far larger source of hazard in complex industrial processes lies at the system level.

We might think of Hopkins' case studies (and others that are available in the literature) as the basis of cognitive and experiential training for executives and managers on the topic of system safety, helping them gain a broader understanding of the kinds of failures that are known to lead to major accidents and better mental skills for managing risky processes. This might be envisioned in analogy with the training that occurs through scenario-based table-top exercises for disaster response for high-level managers, where the goal is to give participants a practical and experiential exposure to the kinds of rare situations they may be suddenly immersed in and a set of mental tools through which to respond. (My city's top fire official and emergency manager once said to a group of senior leaders at my university at the end of a presentation about the city's disaster planning: "When disaster strikes, your IQ will drop by 20 points. So it is imperative that you work with lots of scenarios and develop a new set of skills that will allow you to respond quickly and appropriately to the circumstances that arise. And by the way -- a tornado has just blown the roof off the humanities building, and there are casualties!")

Consider a program of safety training for managers along these lines: simulation-based training, based on detailed accident scenarios, with a theoretical context introducing the ideas of system accidents, complexity, tight coupling, communications failures, lack of focus on organizational readiness for safety, and the other key findings of safety research. I would envision a week-long training offering exposure to the best current thinking about system safety, along with exposure to extensive case studies and a number of interactive simulations based on realistic scenarios.

I taught a graduate course in public policy on "Organizational causes of large technology failures" this year that made substantial use of case materials like these. Seeing the evolution that masters-level students underwent in the sophistication of their understanding of the causes of large failures, it seems very credible that senior-manager training like that described here would indeed be helpful. The learning that these students did on this subject was evident through the quality of the group projects they did on disasters. Small teams undertook to research and analyze failures as diverse as the V-22 Osprey program, the State of Michigan Unemployment Insurance disaster (in which the state's software system wrongly classified thousands of applicants as having submitted fraudulent claims), and the Chinese melamine milk adulteration disaster. Their work products were highly sophisticated, and very evidently showed the benefits of studying experts such as Diane Vaughan, Charles Perrow, Nancy Leveson, and Andrew Hopkins. I feel confident that these students would be able to take these perspectives and skills into the complex organizations in which they may work in the future, and their organizations will be safer as a result.

This kind of training would be especially useful in sectors that involve inherently high risks of large-scale accidents -- for example, the rail industry, marine shipping, aviation and space design and manufacturing, chemical and petrochemical processing, hospitals, banking, the electric power grid, and the nuclear industry.

(I should note that Hopkins himself provides training materials and consultation on the subject of system safety through FutureMedia Training Resources (link).)

Saturday, December 28, 2019

High-reliability organizations


Charles Perrow takes a particularly negative view of the possibility of safe management of high-risk technologies in Normal Accidents: Living with High-Risk Technologies. His summary of the Three Mile Island accident is illustrative: “The system caused the accident, not the operators” (12). Perrow’s account of TMI is chiefly an account of complex and tightly-coupled system processes, and the difficulty these processes create for operators and managers when they go wrong. And he is doubtful that the industry can safely manage its nuclear plants.

It is interesting to note that systems engineer and safety expert Nancy Leveson addresses the same features of “system accidents” that Perrow addresses, but with greater confidence about the possibility of devising engineering and organizational remedies. A recent expression of her theory of technology safety is provided in Engineering a Safer World: Systems Thinking Applied to Safety; see also Resilience Engineering: Concepts and Precepts.

In examining the safety of high-risk industries, our goal should be to identify some of the behavioral, organizational, and regulatory dysfunctions that increase the likelihood and severity of accidents, and to consider organizational and behavioral changes that would serve to reduce the risk and severity of accidents. This is the approach taken by a group of organizational theorists, engineers, and safety experts who explore the idea and practice of a “high reliability organization”. Scott Sagan describes the HRO approach in these terms in The Limits of Safety:
The common assumption of the high reliability theorists is not a naive belief in the ability of human beings to behave with perfect rationality, it is the much more plausible belief that organizations, properly designed and managed, can compensate for well-known human frailties and can therefore be significantly more rational and effective than can individuals. (Sagan, 16)
Sagan lists several conclusions advanced by HRO theorists, based on a small number of studies of high-risk organizational environments. Researchers have identified a set of organizational features that appear to be common among HROs:
  • Leadership safety objectives: place the highest priority on avoiding serious operational failures altogether; organizational leaders must communicate this objective clearly and consistently to the rest of the organization
  • Redundancy: multiple and independent channels of communication, decision-making, and implementation can produce a highly reliable overall system
  • Decentralization: authority must be distributed so that the individuals closest to a problem can respond rapidly and appropriately to dangers
  • Culture: recruit individuals who help maintain a strong organizational culture emphasizing safety and reliability
  • Continuity: maintain continuous operations, vigilance, and training
  • Organizational learning: learn from prior accidents and near-misses, and improve the use of simulation and imagination of failure scenarios
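The redundancy point can be made concrete with a back-of-the-envelope calculation (my illustration, not Sagan's): if n channels are genuinely independent and each fails with probability p, the probability that all of them fail at once is p^n. The caveat, central to Perrow's tight-coupling argument, is that real channels are rarely fully independent, so this idealized figure understates the true risk.

```python
def all_channels_fail(p: float, n: int) -> float:
    """Probability that all n channels fail, where each channel fails
    independently with probability p.

    This is an idealized model: common-cause failures (tight coupling)
    would make the true joint-failure probability higher than p**n.
    """
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return p ** n

# One channel that fails 1% of the time, versus three redundant channels:
print(all_channels_fail(0.01, 1))  # 0.01
print(all_channels_fail(0.01, 3))  # about 1e-06 -- a 10,000-fold improvement, if independence holds
```

The design lesson is that redundancy buys reliability only to the extent that the channels do not share failure modes; a common power supply or a shared managerial blind spot collapses the exponent back toward n = 1.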
Here is Sagan's effort to compare Normal Accident Theory with High Reliability Organization Theory:


The genuinely important question here is whether there are indeed organizational arrangements, design principles, and behavioral practices that are consistently effective in significantly reducing the incidence and harmfulness of accidents in high-risk enterprises, or whether on the other hand, the ideal of a "High Reliability Organization" is more chimera than reality.

A respected organizational theorist who has written extensively on high-reliability organizations and practices is Karl Weick. He and Kathleen Sutcliffe attempt to draw some usable maxims for high reliability in Managing the Unexpected: Sustained Performance in a Complex World. They use several examples of real-world business failures to illustrate their central recommendations, including an in-depth case study of the Washington Mutual financial collapse in 2008.

The chief recommendations of their book come down to five maxims for enhancing reliability:
  1. Pay attention to weak signals of unexpected events
  2. Avoid extreme simplification
  3. Pay close attention to operations
  4. Maintain a commitment to resilience
  5. Defer to expertise
Maxim 1 (preoccupation with failure) encourages a style of thinking -- an alertness to unusual activity or anomalous events and a commitment to learning from near-misses in the past. This alertness is both individual and organizational; individual members of the organization need to be alert to weak signals in their areas, and managers need to be receptive to hearing the "bad news" when ominous signals are reported. By paying attention to "weak signals" of possible failure, managers will have more time to design solutions to failures when they emerge.

Maxim 2 addresses the common cognitive mistake of subsuming unusual or unexpected outcomes under more common and harmless categories. Managers should be reluctant to accept simplifications. The Columbia space shuttle disaster seems to fall in this category, where senior NASA managers dismissed evidence of foam strike during lift-off by subsuming it under many earlier instances of debris strikes.

Maxim 3 addresses the organizational failure associated with distant management -- top executives who are highly "hands-off" in their knowledge and actions with regard to ongoing operations of the business. (The current Boeing story seems to illustrate this failure; even the decision to move the corporate headquarters to Chicago, very distant from the engineering and manufacturing facilities in Seattle, illustrates a hands-off attitude towards operations.) Executives who look at their work as "the big picture" rather than ensuring high-quality activity within the actual operations of the organization are likely to oversee disaster at some point.

Maxim 4 is both cognitive and organizational. "Resilience" refers to the "ability of an organization (system) to maintain or regain a dynamically stable state, which allows it to continue operations after a major mishap and/or in the presence of a continuous stress". A resilient organization is one where process design has been carried out in order to avoid single-point failures, where resources and tools are available to address possible "off-design" failures, and where the interruption of one series of activities (electrical power) does not completely block another vital series of activities (flow of cooling water). A resilient team is one in which multiple capable individuals are ready to work together to solve problems, sometimes in novel ways, to ameliorate the consequences of unexpected failure.

Maxim 5 emphasizes the point that complex activities and processes need to be managed by teams incorporating experience, knowledge, and creativity in order to be able to confront and surmount unexpected failures. Weick and Sutcliffe give telling examples of instances where key expertise was lost at the frontline level through attrition or employee discouragement, and where senior executives substituted their judgment for the recommendations of more expert subordinates.

These maxims involve a substantial dose of cognitive practice, changing the way that employees, managers, and executives think: the importance of paying attention to signs of unexpected outcomes (pumps that repeatedly fail in a refinery), learning from near-misses, making full use of the expertise of members of the organization, .... It is also possible to see how various organizations could be evaluated in terms of their performance on these five maxims -- before a serious failure has occurred -- and could improve performance accordingly.

It is interesting to observe, however, that Weick and Sutcliffe do not highlight some factors that have been given strong priority in other treatments of high-reliability organizations:
  • establishing a high priority for system safety at the highest management levels of the organization (which unavoidably competes with cost and profit pressures)
  • an empowered safety executive positioned outside the scope of production and business executives
  • the possible benefits of a somewhat decentralized system of control
  • the possible benefits of redundancy
  • well-designed training aimed at enhancing system safety as well as personal safety
  • a culture of honesty and compliance when it comes to safety
When mid-level managers are discouraged from bringing forward their concerns about the "signals" they perceive in their areas, this is a pre-catastrophe situation.

There is a place in the management literature for a handbook of research on high-reliability organizations; at present, such a resource does not exist.

(See also Sagan and Blandford's volume Learning from a Disaster: Improving Nuclear Safety and Security after Fukushima.)