Politics 80X: Politics of the Internet

CRASHING

Can the Net Crash?

Any complex system can fail. It need not be hacked, or brought down physically by someone. It could, for example, be caught in an earthquake, or in a fire which broke out for reasons unrelated to it. It could lose its power supply. Operators err. A storage medium may be subject to inexplicable change at the bit level ("bit rot"). And the whole system could fail for reasons internal to its design. Several years ago a major Japanese bank lost its primary and backup computer systems simultaneously because the design assumed a ceiling on one number which was later exceeded.

The following, from The New York Times of 15 April 1998, illustrates how severe a crash can be:

From Monday afternoon until yesterday afternoon . . . a few million people found that their credit cards were useless and automated teller machines at banks were dead because a national high-speed data network of the AT&T Corporation had crashed. . .

. . . industry experts . . . said it was the data network equivalent of the breakdown of AT&T's long-distance switching system in January 1990, when about half of the nation's long-distance calls ended in busy signals or recorded messages for nine hours. . .

AT&T's frame relay network has 145 switching hubs, or nodes. . . one executive said that the problem began with a data transmission between two hubs on the network--one in Albany and another in Cambridge, Mass. The problem cascaded uncontrollably to the other hubs in the network, for some as-yet-undetermined reason.

If it matters whether the system is up or not, designers will attempt to control the effects of unwanted and unanticipated crashes. The techniques are rather obvious. Maintain multiple, near-simultaneous systems ("mirrored systems"). Maintain a stand-by system, to which operations can be transferred if the first fails. Log all transactions. Back up. In the case of database systems, include features to 'roll back' to the point at which an anomaly occurred, and on restarting 'roll forward' from that point to ensure that all transactions are correctly entered.
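The logging, backup, and roll back / roll forward ideas can be made concrete with a small sketch. It is only an illustration, not how any real database does it: the Ledger class, its method names, and the account names are all hypothetical, and the "database" is just an in-memory dictionary. It also simplifies by rolling back to the last backup (a "checkpoint") rather than to the exact point of the anomaly, and then replaying the logged transactions made after that backup.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy account store with an append-only transaction log (hypothetical)."""
    balances: dict = field(default_factory=dict)
    log: list = field(default_factory=list)          # every transaction, in order
    checkpoint: dict = field(default_factory=dict)   # last known-good backup
    checkpoint_pos: int = 0                          # log length at that backup

    def apply(self, account, amount):
        """Log the transaction first, then change the balance."""
        self.log.append((account, amount))
        self.balances[account] = self.balances.get(account, 0) + amount

    def take_checkpoint(self):
        """Back up the current state and remember where the log stood."""
        self.checkpoint = dict(self.balances)
        self.checkpoint_pos = len(self.log)

    def roll_back(self):
        """Discard the live state and restore the last backup."""
        self.balances = dict(self.checkpoint)

    def roll_forward(self):
        """Replay every logged transaction made since the backup."""
        for account, amount in self.log[self.checkpoint_pos:]:
            self.balances[account] = self.balances.get(account, 0) + amount


ledger = Ledger()
ledger.apply("alice", 100)
ledger.take_checkpoint()
ledger.apply("alice", -30)
ledger.apply("bob", 30)

ledger.balances = {"alice": -999}   # simulate state garbled by a crash
ledger.roll_back()                  # restore the backup: alice has 100
ledger.roll_forward()               # replay the log: alice 70, bob 30
print(ledger.balances)
```

The point of the design is that the log, not the live state, is the record of what happened: as long as the backup and the log survive, the correct balances can be rebuilt after a failure.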

The politically significant point to bear in mind, when thinking for example about the Year 2000 or Y2K problem, is that complex computer systems are designed to function in a world in which failure is possible, and system administrators routinely anticipate failure and construct hedges against its consequences.

A second politically significant issue concerns the consequences of failure. Failures are often partial, not complete. Some effects are small, even minor, while others may be catastrophic. How do the people who rely on this computer system gauge the consequences of failure? How much failure, and of what kinds, can be tolerated? For how long? At what cost? A student makes this type of judgment every time she decides whether to back up a paper, whether to lodge a backup disk "off-site", whether to send a copy of the paper by email to someone in another city.

And it may not be clear whether a problem is due to system failure, operator error, or an attack. On 14 July 1997 a rival to InterNIC, the company which has assigned domain names, mounted an attack on the InterNIC Domain Name Server. As a result traffic was shunted to a spoof web site. Three days later the entire Web appeared on the verge of breakdown. By one account "Web sites disappear, emails go undelivered or lost, the Net is in chaos. The Net moves closer to meltdown than ever before. The problem is traced to an 'unprecedented' malfunction in InterNIC's DNS server." [Irish Times, 19 July 1997.] But InterNIC claims that a technician's failure to react to an automated alarm in Herndon, Virginia, garbled computer files where millions of Internet addresses were kept, causing Network Solutions Inc to send garbled files to 10 large Internet-connected 'root server' computers around the world. Whatever the source, millions of Web sites were inaccessible for hours. [Ibid.]