Be Like Pixar, Not NASA

It began at 4:00 in the morning on March 28, 1979, at Three Mile Island, Pennsylvania. The nuclear reactor was operating at nearly full power when a secondary cooling circuit malfunctioned and the temperature of the primary coolant rose sharply. That rise made the reactor shut down automatically. In the seconds it took to deactivate the reactor, a relief valve failed to close. Operators couldn't diagnose or deal with the unexpected shutdown in the heat of the moment, and the nuclear core suffered severe damage.

Sociologist Charles Perrow later analyzed why the Three Mile Island accident had happened, hoping to anticipate other disasters to come. The result was his seminal book Normal Accidents. His goal, he said, was to “propose a framework for characterizing complex technological systems such as air traffic, marine traffic, chemical plants, dams, and especially nuclear power plants according to their riskiness.”

One factor was complexity: The more components and interactions in a system, the more challenging it is when something goes wrong. With scale comes complexity, whether we are thinking of the technology or the organization that supports it. Imagine you run a start-up where everyone sits in the same loft space. From where you sit, you can easily see what they are all doing. In a large organization, that visibility is lost. The moment a leader can’t see the inner workings of the system itself—in this case, staff activities—complexity rises.

Perrow associated this type of complexity with tech failures. At Three Mile Island, operators couldn’t just walk up to its core and measure the temperature manually or peek inside to discover there was not enough coolant. Similarly, executives in a large company can’t monitor every employee all the time without incurring resentment. They have to rely on indirect indicators, such as performance evaluations and sales results. Large companies also rely on complex information technologies and complex supply chains.

Another factor, wrote Perrow, was a system’s coupling: the level of interdependence among its components. When systems are both complex and tightly coupled, they are more likely to produce negative unexpected consequences and get out of control.

Perrow did not include artificial intelligence (A.I.) or even software among the technologies whose interactions he charted. But by the criteria he laid out for technological risk, A.I. systems fit into Perrow's framework alongside nuclear power plants, space missions, and DNA sequencing. When some element doesn't work according to plan, cascading effects can ripple through the system in wholly unexpected ways.

Tight and Loose Coupling

Tightly coupled systems have architectures—technological and social—that promote interdependence among their components and often isolation from outside connection. This makes them efficient and self-protective but less robust.

Loosely coupled systems, by contrast, have more open and diverse architectures. Changes in one module, section, or component hardly affect the other components. Each operates somewhat independently of the others. A loosely coupled architecture is easy to maintain and scale. It is also robust, in that problems don’t propagate easily to other parts of the system.
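
To make the distinction concrete in software terms, here is a minimal, hypothetical sketch (the component names are invented for illustration, not drawn from Perrow): a tightly coupled design hard-wires one component to another's internals, so a change or failure there propagates immediately, while a loosely coupled design depends only on a narrow interface that lets substitutes be swapped in.

```python
# Minimal, hypothetical sketch of tight vs. loose coupling between two components.
# The class names are invented for illustration.
from typing import Protocol


class Warehouse:
    def items_in_stock(self, sku: str) -> int:
        return {"widget": 42}.get(sku, 0)


# Tightly coupled: the storefront constructs and reaches into one specific
# warehouse. If Warehouse changes or fails, the storefront breaks with it.
class TightStorefront:
    def __init__(self) -> None:
        self.warehouse = Warehouse()  # hard-wired dependency

    def can_sell(self, sku: str) -> bool:
        return self.warehouse.items_in_stock(sku) > 0


# Loosely coupled: the storefront depends only on a narrow interface, so any
# inventory source (a different vendor, a stub, a backup) can be swapped in.
class InventorySource(Protocol):
    def items_in_stock(self, sku: str) -> int: ...


class LooseStorefront:
    def __init__(self, inventory: InventorySource) -> None:
        self.inventory = inventory  # injected dependency

    def can_sell(self, sku: str) -> bool:
        return self.inventory.items_in_stock(sku) > 0


print(TightStorefront().can_sell("widget"))             # True
print(LooseStorefront(Warehouse()).can_sell("widget"))  # True, but the wiring is replaceable
```

The same logic scales up to organizations: the looser the wiring between parts, the less a local change or failure dictates to everything downstream.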

Executives who run large organizations tend to favor a tightly coupled system. It is what they know. They grew up in their industries seeing a small number of people making decisions that affect millions of people. But tightly coupled systems can be harder to control. Think of a floor covered with dominoes that are lined up. When you tip one over, it will then, in sequence, knock down the entire array of dominoes—a simple example of a tightly coupled system. Now try to stop it once the domino effect is in motion. It’s much harder than you would think.

A large company is also generally a tightly coupled system, especially compared to small businesses and local mom-and-pop retailers. If you have a complaint about a corner store's product, you can take it back and the owners will take it in stride, handling it in a different way for each customer. They have control over their actions. Employees of a large company or a franchise, by contrast, are tightly coupled to the company's branding and scaled-up procedures, and to one another. Those who want to operate differently from the standard procedures must buck the tightly coupled network.

During the pandemic, we realized just how tightly coupled and interconnected our supply chains are—how one container ship stuck in the Suez Canal can delay global shipments for months. Many organizations have been looking to create more robust redundancies, effectively loosening the coupling in their supply chains by finding alternate vendors and investing in local sources.

The Formula for Disaster

Organizational sociologist Diane Vaughan is an expert on the ways systems can repeatedly engender catastrophe. She started studying the issue after the Challenger disaster of 1986, when the space shuttle exploded shortly after launch. The “technical cause,” she later wrote, was “a failure of the rubber O-rings to seal the shuttle’s solid rocket booster joints. But the NASA organization also failed.”

NASA had been launching space shuttles with damaged O-rings since 1981. Pressured by the launch schedule, the agency leaders had ignored engineers’ warnings right up to the day of the launch. In fact, within the established rules, the agency had labeled the O-ring damage an “acceptable risk.”

Vaughan spent the next five years researching and writing The Challenger Launch Decision, an in-depth book about the organizational problems leading to the technological disaster. Like Perrow, she concluded that this type of organization would repeatedly produce catastrophic mistakes. After the book came out, she later noted, “I heard from engineers and people in many different kinds of organizations who recognized the analogies between what happened at NASA and the situations at their organizations. ‘NASA is us,’ some wrote.”

Another disaster, this time involving the space shuttle Columbia, occurred on February 1, 2003, when the orbiter broke apart on re-entry. Another seven astronauts died. A technical review found that a piece of foam insulation had broken off during launch and struck a wing. Once again, engineers had warned the agency, and the warnings had been ignored. Once again, Vaughan became closely involved in investigating the causes, ultimately joining the government's Columbia Accident Investigation Board. She testified to the board that she had found the same organizational causes for both accidents.

In her writing on the disasters, Vaughan cites Perrow, noting that NASA’s tightly coupled, complex nature made it systematically prone to occasional major errors. The key decision makers had fallen prey to a “normalization of deviance,” in which dangerous complacency gradually became the ordinary way of doing things. “We can never totally resolve the problem of complexity, but we have to be sensitive to our organizations and how they work,” she wrote. “While many of us work in complex organizations, we don’t realize how much the organizations that we inhabit completely inhabit us. This is as true for those powerful actors at the top of the organization responsible for creating culture as it is for the people in the tiers below them who carry out their directives and do the everyday work.”

In these disasters, she testified to the board, “the technological failure was a result of NASA’s organizational failure.”

Tightly Coupled A.I.

Software designer Alan Chan argues that some innate aspects of artificial intelligence tend to make everything it touches more complex and more tightly coupled. Even when a project is supposed to be "responsible A.I.," working with an automated algorithm can override the best intentions of the software engineers.

"Although designers may try as much as possible to include all the relevant features, they may only come to know the relevance of some features after an accident informs them to that effect," says Chan. "Moreover, while a human observer is limited by the ways in which their senses interact with measurement instruments, an A.I. subsystem is limited not only by the same conditions as the human observer but also by the fact that human observers select the features for consideration. The measurement instruments may themselves be faulty, which was a crucial factor in the Three Mile Island accident."

In Perrow's parlance, "normal accidents" can be expected to increase over time in such systems. This is particularly true when not only the A.I. system itself but also the organizational ecosystem around it is complex and tightly coupled.

In the tech arena, the process of optimization itself exacerbates tight coupling. It creates strong dependencies and, therefore, ripple effects. Imagine an A.I. system tasked with allocating production resources in a supply chain, with maximizing output as its only goal. That single-minded focus would push the whole system to couple itself more tightly.

The algorithm would resolve any tradeoff between flexibility and optimization in favor of optimization. It would not keep reserve stocks, for instance, because reserves would register only as a drag on inventory. Coded this way, the system aligns with the company's strategy, but it is so tightly coupled that it falters under stress, as many supply chains did at the start of the COVID-19 pandemic. At various times in recent years, this dynamic has led to shortages of protective equipment, semiconductor chips, diapers, and infant formula.
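
To make that tradeoff concrete, here is a minimal sketch with hypothetical numbers and functions (not any company's actual planning system): an allocator that commits every unit of capacity to maximize output has nothing left to absorb a shock, while one that keeps a modest reserve rides it out.

```python
# Hypothetical sketch: committing all capacity vs. keeping a reserve buffer.
# The figures are invented for illustration.

def allocate(capacity_units: int, reserve_fraction: float = 0.0) -> dict:
    """Split capacity between committed orders and reserve stock."""
    reserve = int(capacity_units * reserve_fraction)
    return {"committed": capacity_units - reserve, "reserve": reserve}


def unmet_demand(plan: dict, demand_spike: int) -> int:
    """How much of an unexpected demand spike the plan cannot cover."""
    return max(0, demand_spike - plan["reserve"])


optimized = allocate(10_000)                        # maximize output: zero slack
buffered = allocate(10_000, reserve_fraction=0.10)  # accept 10% lower throughput

spike = 600  # a sudden surge, e.g., pandemic-driven demand
print(unmet_demand(optimized, spike))  # 600 units short: the shock cascades downstream
print(unmet_demand(buffered, spike))   # 0 units short: the slack absorbs the shock
```

The optimized plan wins on every normal day and loses badly on the abnormal one; that asymmetry is what tight coupling buys.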

Another case of a tightly coupled A.I. system is Zillow's failed use of an automated decision-making algorithm to purchase homes. As an online real estate marketplace, Zillow was originally designed to help sellers and buyers make more informed decisions. In 2018, it opened a new division with a business model based on buying and flipping homes, using a machine learning system called Zillow Offers. As home prices rose quickly during the COVID-19 pandemic, Zillow's iBuying algorithms used data such as a home's age, condition, and zip code to predict which homes would grow in value. But the system failed to account for the radical uncertainty caused by the virus and badly underestimated how fast the housing market was changing. Zillow also faced a backlash when a real estate agent, Sean Gotcher, created a viral video decrying the company's perceived manipulation of the housing market. By November 2021, the firm had sold only 17,000 of the 27,000 homes it had purchased.
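
As a stylized illustration of that failure mode (entirely hypothetical numbers, not Zillow's actual model), consider a price model fit to a calm historical trend: it keeps extrapolating that trend as one confident number, with nothing in its output to signal how much more uncertain the estimate has become.

```python
# Hypothetical sketch: a naive price model extrapolates a stable historical trend.
# The prices are invented; this is not Zillow's algorithm.

historical_prices = [300_000, 309_000, 318_300, 327_800]  # steady ~3% annual growth


def fitted_growth_rate(prices: list[int]) -> float:
    """Average year-over-year growth observed in the training window."""
    rates = [later / earlier for earlier, later in zip(prices, prices[1:])]
    return sum(rates) / len(rates)


def predict_next(prices: list[int]) -> float:
    """A single point estimate: last price times the historical growth rate."""
    return prices[-1] * fitted_growth_rate(prices)


print(round(predict_next(historical_prices)))  # one confident number, no error bars

# When the market regime shifts, actual prices can land far outside anything in
# the training window, but the model's output gives no hint of that widened risk.
```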

Decoupling Zillow’s home-buying business from its online marketplace may have saved the company or at least part of its reputation. Ultimately, Zillow shut down its home-buying division, cut 25 percent of the company’s work force—about 2,000 employees—and wrote off a loss of $304 million in housing inventory.

To John Sviokla, who holds a Harvard doctorate in management information systems, tight coupling is directly related to the opaque nature of algorithmic systems: the closed-box effect. “If I can’t look inside the system and see the weights given to different factors,” he says, “then it is de facto tightly coupled. From a semantic standpoint, I can’t communicate with it. I can only manage it by trying to figure out how it works, based on the behaviors it produces. I am not given access to the assumptions going in, or how it works. I either have to reject it or use it—those are my only two choices.”

Chan argues that the greatest risk lies in A.I. systems that are both tightly coupled and complex, embedded in organizations that are themselves tightly coupled and complex. Accidents are especially likely when the organizational conditions line up. Since the exact conditions of failure cannot be predicted or prevented in detail, and since the organizational structure keeps these systems from being resilient, algorithmic, autonomous, and automated systems pose a continual challenge. Even when they are working well, it is impossible to make them absolutely fail-safe against a "normal accident."

If you want to make the system safer and less harmful, you have to loosen it up.

Loosening a System

Pixar Animation Studios, the creator of the films Toy Story and Finding Nemo, has a well-known ritual that takes advantage of the studio's loosely coupled nature. Whenever a film under development hits a rough spot, the director can convene the company's "brain trust" for an in-depth critique. After the session, the director and the film's team decide what to do with the advice. It takes a thick skin to have a work under review, but the result is immense, tangible improvement.

"There are no mandatory notes, and the brain trust has no authority," Pixar cofounder Ed Catmull explained in Harvard Business Review. "This dynamic is crucial. It liberates the trust members, so they can give their unvarnished expert opinions, and it liberates the director to seek help and fully consider the advice."

It took Pixar a while to understand why this system helped so much. “When we tried to export the brain trust model to our technical area, we found at first that it didn’t work,” Catmull wrote. “As soon as we said, ‘This is purely peers giving feedback to each other,’ the dynamic changed, and the effectiveness of the review sessions dramatically improved.”

Note that Pixar’s organizational design is deliberately loose. The brain trust’s reactions are treated not as demands but as creative opportunities. These opportunities allow for simplicity on the other side of complexity.

Charles Perrow devoted much of Normal Accidents to a study of complex sociotechnical operations that had not ended in crisis or catastrophe. One option, he found, was to make decision making simple by focusing on just one or two activities: You centralize decision making around this relatively simple set of goals so that there is clear direction for channeling all the complexities involved. Another option was to put in place some basic organizational designs. A risk audit and oversight group may seem like yet another boring bureaucratic function, but if it is led by someone who understands loose coupling, it will be staffed by a diverse group of people who make sense of complex issues together.

And there is another alternative: to loosen the system. To bring decision making to the lowest possible level in the hierarchy, and to make sure every part of the organization can operate autonomously. To encourage people to communicate freely, so that no one small group is seen as the single source of knowledge about a key issue. To move decision making as close as possible to the point of action, and to bring people together regularly to learn from each other rather than compete across silos.

A.I. chatbots can tighten couplings in complex systems, intensifying the communications within them and automating the ways companies control behavior. That could lead to more disasters and missteps. But they could also loosen complex systems by providing alternative links and making it easier to seek out alternative views. Success often depends on finding someone with an outside sensibility who respects the inside priorities. Generative A.I. systems may make it easier to find those people and introduce them to one another.

There is much to learn about the interrelationship between machine behavior, human behavior, and the behavior of larger sociotechnical systems such as corporations and governments. In the end, it doesn’t matter whether we think our A.I. systems are intelligent. What matters most is what they do and how they grow, and how we grow along with them.
