Skip to main content

Widespread software outages can often be prevented. Resilience – a software program's ability to handle and maintain critical functionality in the face of bugs or other unexpected conditions – is essential given the significant role software plays in the infrastructure of the economy. As with previously discussed security principles, many common types of software flaws can be preemptively addressed through systematic and known processes that prevent or minimize the likelihood of outages. Systematic processes for software resilience include rigorous testing of both code and configuration, incremental rollout procedures, and the development and use of APIs that minimize risk. 

Agency staff have acknowledged that resilience can be threatened by lack of competition in critical inputs. Dominant firms pushing maximum adoption of their digital products and services can create single points of failure for entire industries, within and beyond the digital economy. As the number of users relying on the same enterprise software increases, so too can the scale of disruption caused by outages of that software. Every software outage is an opportunity to take a critical look at systems in place to achieve resilience and assess these systems for appropriateness and sufficiency.

Beyond code, configuration and data changes can lead to failures. An inability to treat these with the same care and robust processes can lead to outages. Software is made up of code, instructions that tell the computer how to operate. But these instructions may reference configuration files or other data which can also lead to failures. Consider a local ice cream store with a mobile app. The app may contain code for displaying a menu based on a set of items defined in a configuration file (or received from a server). Even if the code for displaying the menu doesn't change, a change to the configuration file could trigger a previously unknown bug — for example an apostrophe being added for the first time with "Cookies 'n Cream" — causing the app to crash. 

The principle of treating code, configuration files, and data changes with the same care is especially critical when building software run by banks, hospitals, and transportation services. Software vendors must use appropriate software architecture and processes to guard against unexpected failures from both code and non-code changes. 

Everything Everywhere All at Once may be a great movie – but it's a poor strategy for deploying software changes. Even the most rigorous testing environment can't mirror the diversity of real-world computers. When deploying changes to automatically updating software, one common strategy to mitigate this risk is to initially deploy to a small subset of machines, and then rolling out to more users after it’s confirmed that the smaller subset has continued to function without interruption. It's important to consider this strategy for not just code changes, but also updates to configuration and data. 

Auto-updating software can be an important mechanism to ensure that critical security or stability updates get deployed in a timely manner. For security software, it can help ensure companies using the software are running a version able catch the latest threats. Yet, software vendors may unnecessarily risk causing widespread outages by deploying these updates broadly without a mechanism to detect and reverse changes that cause major issues for a high percentage of machines before the software is widely deployed. 

Platforms and operating systems that fail to provide resilient APIs risk the stability of the whole system. An API, or application programming interface, is a way for two pieces of software to communicate in a structured way. For example, the local ice cream store's app may use the phone's location API to tell a customer how far they are from the store. A poor API design could mean that a mistake by the app's programmer causes the customer's whole phone to crash, rather than just the app itself. Building APIs that account for human error can help programmers by limiting the damage from mistakes and preventing small issues from becoming larger ones. 

However, creating resilient APIs is not an excuse to lock competitors out of the access they need to build effective alternatives. In order to be competitive with a platform or operating system vendor's own offerings, some types of software require broad, low-level access (the ability to read detailed information and control aspects of the inner workings of a system). Some might argue that allowing such access to third parties requires companies to sacrifice resiliency, and that they can only have one or the other. But just as competition through interoperability can coexist with privacy and security, so too can interoperability coexist with reliability. When platforms or operating system build safer versions of these interfaces, they must do so in a way that, to the fullest extent possible, supports, rather than suppresses, the ecosystem of existing and emerging products. Incumbent companies shouldn't use API changes in the name of resilience as an excuse to cut off broad access for competitors, especially not while continuing to retain broad access for their own offerings.

Widespread software outages can lead to substantial consumer injury including the inability to access critical services and financial loss. Given the enormous impacts on consumers, small businesses, and the public that arise from failures in computer systems, FTC staff will continue to monitor such developments and investigate when warranted. The FTC has long examined companies' computer systems to determine whether proper steps were taken to secure the systems and the consumer data they hold. The FTC’s record investigating computer systems has shown that there are steps companies can take to minimize the likelihood of bad outcomes. While the state of the art in software continues to improve, there are known approaches that systemically and preemptively address many potential issues, either entirely preventing them or dramatically reducing the likelihood of them occurring. 

Thank you to the staff who contributed to this post: Simon Fondrie-Teitler, Hannah Garden-Monheit, Wells Harrell, Amritha Jayanti, Erik Jones, Stephanie T. Nguyen, Shaoul Sussman, Grady Ward, Ben Wiseman.  

More from the Technology Blog

Get Business Blog updates