BellSouth reports that the telecom damages from Hurricane Katrina in the New Orleans/Mississippi Gulf Coast area are on the order of half a billion dollars, with repairs taking 4 to 6 months, according to preliminary estimates. Roughly 1.1 million lines are still out. [Source: Arshad Mohammed, *The Washington Post*, 6 Sep 2005; PGN-ed] http://www.washingtonpost.com/wp-dyn/content/article/2005/09/05/AR2005090501231.html [Of course, those costs are dwarfed by the overall catastrophe. The huge magnitude of the natural disaster, the lack of foresight over past decades in protecting the levees, and the many problems with emergency responses are horrendous. This once again reminds us of the extent to which we tend to deprecate far-sighted proactive risk management. PGN]
The crew members of a Cypriot airliner that crashed Aug 14 near Athens became confused by a series of alarms as the plane climbed, failing to recognize that the cabin was not pressurizing until they grew mentally disoriented because of lack of oxygen and lost consciousness, according to several people connected with the investigation into the crash. In addition, the German pilot and the young/inexperienced Cypriot co-pilot did not have a common language in which they could speak fluently, and had difficulty understanding each other's standard airline English. A total of 121 people were killed in the crash after the plane climbed and flew on autopilot, circling near Athens until one engine stopped running because of a lack of fuel. The sudden imbalance of power, with only one engine operating, caused the autopilot to disengage and the plane to begin to fall. [Source: Don Phillips, International Herald Tribune, 7 Sep 2005; PGN-ed] http://www.nytimes.com/2005/09/07/international/europe/07cypriot.html [Also noted by Chuck Weinstock]
In this era of fly-by-wire, I am fond of saying that, as far as I know, there has never been a commercial aircraft accident caused by anomalies in flight control software. And it has been 17 years (the first A320 was introduced into service in 1988). It is thus well to remember that designing and writing critical software-based systems for such applications is not a routine task that we now know how to perform. In fact, there are plenty of anomalies that crop up that the public doesn't hear about. Here is one that made it out, and a pointer to another. The B777 is a high-capacity Boeing electric airplane (that is, fly-and-a-lot-of-other-things-by-wire) designed inter alia for intercontinental travel. The aircraft has been in service since 1995, and 490 of them have been delivered (for comparison, of Boeing's "jumbo", there are 625 B747-400 delivered, and a further 477 "classic" B747 still in service). The B777 has just been subjected to an emergency Airworthiness Directive (AD) from the U.S. FAA. In April 2005, the FAA issued AD 2005-10-03 requiring "modification of the operational program software (OPS) of the air data inertial reference unit (ADIRU) from software version part number (P/N) 3470-HNC-100-03 to software version P/N 3475-HNC-100-06 or 3474-HNC-100-07. That AD resulted from a report of the display of erroneous heading information to the pilot due to a defect in the OPS of the ADIRU." An AD is issued in response to an identified hazard, and the reasons are given as a list of possible consequences of the hazard, including worst-case consequences: "We issued that AD to prevent the display of erroneous heading information to the pilot, which could result in loss of the main sources of attitude data, consequent high pilot workload, and subsequent deviation from the intended flight path." Attitude data consist of angle of pitch (nose up or down), angle of bank (left/right wing low/high) and heading. 
The flight control system, and pilots, also use the rate of change of these quantities, as well as their accelerations, although these are not presented as separate displays to pilots (one notes these are "trends" through cognitive processing rather than display). Emergency AD 2005-18-51 was issued on August 29, 2005. An unsafe condition had been identified through analysis of an incident, and Boeing had issued an Alert Service Bulletin on August 26 addressing the problem with workarounds. The Emergency AD makes these actions mandatory. The FAA explains as follows: Since [AD 2005-10-03] was issued, we received a recent report of a significant nose-up pitch event on a Boeing Model 777-200 series airplane while climbing through 36,000 feet altitude. The flight crew disconnected the autopilot and stabilized the airplane, during which time the airplane climbed above 41,000 feet, decelerated to a minimum speed of 158 knots, and activated the stick shaker. A review of the flight data recorder shows there were abrupt and persistent errors in the outputs of the ADIRU. These errors were caused by the OPS using data from faulted (failed) sensors. This problem exists in all software versions after P/N 3470-HNC-100-03, beginning with P/N 3477-HNC-100-04 approved in 1998 and including the versions mandated by AD 2005-10-03. While these versions have been installed on many airplanes before we issued AD 2005-10-03, they had not caused an incident until recently, and the problem was therefore unknown until then. OPS using data from faulted sensors, if not corrected, could result in anomalies of the fly-by-wire primary flight control, autopilot, auto-throttle, pilot display, and auto-brake systems, which could result in high pilot workload, deviation from the intended flight path, and possible loss of control of the airplane. ................ 
We have evaluated all pertinent information and identified an unsafe condition that is likely to exist or develop on other Boeing Model 777 airplanes of this same type design. Therefore, we are issuing this AD to prevent the OPS from using data from faulted (failed) sensors, which could result in anomalies of the fly-by-wire primary flight control, autopilot, auto-throttle, pilot display, and auto-brake systems. These anomalies could result in high pilot workload, deviation from the intended flight path, and possible loss of control of the airplane. This new AD supersedes AD 2005-10-03. Note that the consequences list has been extended by "possible loss of control of the airplane". According to John Sampson, the incident to which the AD refers occurred to a Malaysian Airlines B777-200 on 3 August 2005, on Flight MH 124 from Perth to Kuala Lumpur. The aircraft returned to Perth after 51 minutes flight for an emergency landing after an ADIRU malfunction which caused a "flight control outage". This is the first public statement I know of that addresses classes of Byzantine faults. Byzantine faults have occurred, seriously, in avionics before, but the details are not public (see the quote from Driscoll et al. below). Byzantine faults are faults in which agents (sensors, computers) in a distributed system "lie" to their interlocutors: they do not fail silently but distribute erroneous data, or data which is read differently by different receivers. The name arose from a whimsical analogy by Lamport, Shostak and Pease to a group of Byzantine generals trying to reach agreement in a situation in which no one trusts anyone else to speak the truth. The classic papers from twenty years ago are [4,5], and, I understand, arose from SRI International's involvement in trying formally to verify the operating system of the first digital flight control computer, SIFT. 
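A small sketch may help make concrete the phrase "data which is read differently by different receivers". It assumes a single sender whose output level has degraded to a marginal voltage, and three receivers whose switching thresholds differ slightly. The names (read_bit, THRESHOLDS, R1..R3) and all the voltages are hypothetical and chosen only for illustration; nothing here is taken from the B777 ADIRU or from any real avionics bus.

```python
def read_bit(voltage, threshold):
    """Interpret an analogue level as a logic bit using this receiver's threshold."""
    return 1 if voltage >= threshold else 0

# Hypothetical receivers with slightly different switching thresholds (volts).
THRESHOLDS = {"R1": 1.4, "R2": 1.6, "R3": 1.5}

def views(voltage):
    """What each receiver believes the single sender transmitted."""
    return {name: read_bit(voltage, t) for name, t in THRESHOLDS.items()}

def receivers_agree(voltage):
    """True if every receiver read the same bit from the same signal."""
    return len(set(views(voltage).values())) == 1

healthy = 3.3    # assumed clean logic 1
marginal = 1.5   # assumed degraded driver, stuck between logic levels

print(views(healthy), receivers_agree(healthy))
print(views(marginal), receivers_agree(marginal))
```

With the healthy level all three receivers read the same bit. With the marginal level they do not, although each receiver is working as designed and the sender has not gone silent. A scheme that only looks for silent or self-reported failures would see nothing wrong with the sender here.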
Dealing with Byzantine faults became an extremely active area of distributed computing theory, but practitioners did not take them so seriously at first, perhaps partially due to the very high resource consumption of the solutions: Lamport, Shostak and Pease showed that any correct algorithm to achieve consensus required a large number of processors (roughly speaking, at least 3n+1, where n is the number of "liars") and a lot of processor cycles. It follows that solutions judged to be practical are unlikely to be complete solutions, and therefore one must analyse the actual problem space more closely to find out where one can most profitably handle possible problems, and which areas one can ignore. The SAFEbus, the backplane communications bus of the B777 flight control system, now standardised as ARINC 659, was designed by Ken Hoyme and Kevin Driscoll at Honeywell. Driscoll, with Honeywell colleagues Hall and Zumsteg, and Sivencrona (Chalmers Uni, Sweden) wrote a paper in SAFECOMP 2003 in which they described occurrences of Byzantine faults in avionics and how one can deal with them (or not, as the case may be). They say "Byzantine faults in safety-critical systems are real and occur with failure rates far more frequently than 10**(-9) faults per operational hour. In addition, the very nature of Byzantine faults allows them to propagate through traditional fault containment zones, thereby invalidating system architectural assumptions." Driscoll et al. refer to a set of incidents in which the occurrence of Byzantine failures "threatened to ground all of one type of aircraft". This set of incidents is not publicly available. I quote: This aircraft had a massively redundant system (theoretically, enough redundancy to tolerate at least two Byzantine faults), but no amount of redundancy can succeed in the event of a Byzantine fault unless the system has been designed specifically to tolerate these faults. 
In this case, each Byzantine fault occurrence caused the simultaneous failure of two or three "independent" units. The calculated probability of two or three simultaneous random hardware failures in the reporting period was 5 x 10**(-13) and 6 x 10**(-23) respectively. After several of these incidents, it was clear that these were not multiple random failures, but a systematic problem. The fleet was just a few days away from being grounded, when a fix was identified that could be implemented fast enough to prevent idling a large number of expensive aircraft. The significance of the 10**(-9) figure is that airworthiness requires that a "catastrophic" failure of aircraft systems occur with a rate less than this per operational hour. The figure was originally chosen to be low enough that one would not expect a catastrophic failure during the lifetime of the fleet of that type aircraft (whether this calculation still holds is a separate question). Any hazard (for example, failure) with potentially catastrophic consequences which is seen or judged to have more frequent occurrence can lead to withdrawal of the airworthiness certificate of the type. Hence Driscoll et al.'s story. I do not know (yet) whether the fault identified in AD 2005-18-51 is of one of the types specifically considered by Driscoll et al. Circumstances in which messages are sent which are misinterpreted by receivers are not at all unusual. It is not clear that Driscoll et al. would classify these all as Byzantine faults. A well-known occurrence in which an error message was misinterpreted as navigation data is the in-flight break-up of the first Ariane 5, Ariane Flight 501. Another example from another area of transportation is the grounding of the cruise ship Royal Majesty in 1995, in which incident the autopilot was designed in the expectation that the GPS would fail silent, but the GPS continued to send dead-reckoning data for over a day when it failed to receive a signal. 
The ship tracked 17 miles off course and grounded on Rose and Crown shoal near Nantucket Island, near Boston, MA [9,10]. The fault models of the system designers in the avionics case described by Driscoll et al., as well as in the Ariane 5 architecture and that of the STN Atlas NACOS 25 autopilot on the Royal Majesty, were inappropriate for the task.

There are various conclusions one can draw:

* The kinds of numbers used in Fault Tree Analysis for random hardware failures in software-based systems give no good indication of the rate of systematic failures (due to design or to errors in software) which can be expected.
* Fault-handling models are crucial parts of the architecture and their assumptions are critical. (This is made clear by the incidents discussed in [6,8,9].)
* That there have been no accidents does not mean that there are no occurrences of substantial problems with potentially catastrophic consequences with software-based critical avionics.

Acknowledgments

Thanks to John Sampson for pointing me to ; John Rushby for pointing me to ; Rod Chapman for pointing me to .

References

[1] World Airliners Directory, Flight International, 26 Oct - 1 Nov 2004.
[2] U.S. Federal Aviation Administration, AD 2005-18-51. Available at http://www.airweb.faa.gov/Regulatory_and_Guidance_Library/rgad.nsf/0/25F9233FE09B613F8625706C005D0C53?OpenDocument
[3] Accidents and Incidents, Aviation Safety Week, 08 August 2005.
[4] Pease, M., Shostak, R., and Lamport, L., Reaching Agreement in the Presence of Faults, J. ACM 27(2): 228-234, 1980. Available from http://research.microsoft.com/users/lamport/pubs/pubs.html
[5] Lamport, L., Shostak, R., and Pease, M., The Byzantine Generals Problem, ACM Trans. Prog. Lang. Sys. 4(3): 382-401, 1982. Available from http://research.microsoft.com/users/lamport/pubs/pubs.html
[6] Driscoll, K., Hall, B., Sivencrona, H., and Zumsteg, P., Byzantine Fault Tolerance, from Theory to Reality, in S. Anderson, M. Felici and B. Littlewood, eds, Computer Safety, Reliability, and Security, 22nd International Conference, SAFECOMP 2003, Edinburgh, UK, September 23-26, 2003, Book series Lecture Notes in Computer Science No. 2788, Springer-Verlag, 2003. Available from http://www.ce.chalmers.se/old/staff/sivis/articles/Safecomp_2003_revised.pdf
[7] Lloyd, E., and Tye, W., Systematic Safety, U.K. Civil Aviation Authority Publications, 1982.
[8] O'Halloran, C., Ariane 5: Learning from Failure, in J. Bowen and J. Woodcock, eds, Proceedings of the Grand Challenge 6 Workshop on Dependable Systems Evolution, at Formal Methods '05 conference, 18 July 2005, Newcastle-upon-Tyne. Available at http://www.fmnet.info/gc6/fm05/proceedings.pdf
[9] Heidiecker, L., Hoffmann, N., Husemann, P., Ladkin, P. B., Paller, J., Sanders, J., Stuphorn, J., Vangerow, A., WBA of the Royal Majesty Accident, RVS-RR-03-01, RVS Group, University of Bielefeld, 1 July 2003. Available from www.rvs.uni-bielefeld.de -> Publications -> Research Reports.
[10] U.S. National Transportation Safety Board, Marine Accident Report, Grounding of the Panamanian Passenger Ship Royal Majesty on Rose and Crown Shoal near Nantucket, Massachusetts, June 10, 1995. Report Number NTSB/Mar-97/01. Available from http://www.ntsb.gov

Peter B. Ladkin, University of Bielefeld, Germany www.rvs.uni-bielefeld.de
(was: US Navy to drop paper charts, Lichtensteiger, RISKS-24.02) > For commercial shipping, with it's much smaller crews, and civilian > sailors, the level of faith placed in a GPS and chartplotter scares me. The cruise ship Royal Majesty ran aground off Nantucket Island in 1995. The crew had been relying for over a day on an autopilot taking readings from a GPS position sensor. The GPS signal had been lost, supposedly thanks to the aerial being inadvertently disconnected, shortly after setting off from Bermuda, and the sensor was giving information through dead reckoning. It seems no one noticed, despite having a Loran available as a cross check. Also, the auto pilot error-handling was based on the GPS sensor being fail-silent, which was an incorrect assumption. During the last hours, tracking some 17 miles off course, into known dangerous shallow waters, various obvious signals were ignored (white and "confused" water ahead; static shore lights sitting in the middle of the "ocean") as well as misidentification, and failure of identification, of buoys. The ship ran aground on Rose and Crown shoal and needed to be salvaged. The report is on the U.S. NTSB WWW site. Slides from a talk, as well as a paper, giving a Why-Because Analysis of the accident may be found at www.rvs.uni-bielefeld.de -> Bieleschweig Workshops -> Second Workshop -> Talks Thanks to Luke Emmet of Adelard for suggesting this as a case study in 2002. Peter B. Ladkin, University of Bielefeld, Germany www.rvs.uni-bielefeld.de
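The fail-silent assumption mentioned above can be put into a short sketch. This is only an illustration of the general idea, not the design of the autopilot on the Royal Majesty: the Fix record, its mode field, the usable function, the flat grid in nautical miles and the 1-mile tolerance are all hypothetical. A consumer that assumes its position source fails silent would make only the first check below; the other two cover a source that keeps talking after it has lost its signal.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Fix:
    x_nm: float   # east position on a hypothetical flat grid, nautical miles
    y_nm: float   # north position on the same grid
    mode: str     # "GPS" for a satellite fix, "DR" for dead reckoning, etc.

def usable(primary, cross_check=None, max_gap_nm=1.0):
    """Decide whether to steer by `primary`; returns (ok, reason).

    `primary` is None when the source is silent. `cross_check` is an
    optional independent fix (for example from Loran).
    """
    if primary is None:
        return False, "primary silent"            # the only case a fail-silent model covers
    if primary.mode != "GPS":
        return False, "primary not a satellite fix"
    if cross_check is not None:
        gap = hypot(primary.x_nm - cross_check.x_nm,
                    primary.y_nm - cross_check.y_nm)
        if gap > max_gap_nm:
            return False, "sources disagree"
    return True, "ok"

print(usable(Fix(0.0, 0.0, "DR")))
print(usable(Fix(0.0, 0.0, "GPS"), Fix(17.0, 0.0, "LORAN")))
```

Both calls are rejected in this sketch: the first because the source reports that it is dead reckoning, the second because an independent source disagrees by far more than the assumed tolerance.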
1% of Vermont's 6,000 gas pumps are unable to compute with gas prices over $2.99. [PGN-ed from an Associated Press item, 3 Sep 2005] http://www.boston.com/news/local/articles/2005/09/03/for_some_pumps_3_doesnt_compute/
Government plans to introduce e-voting for next year's local council elections have been dropped. According to the government spokesman (Elections Minister Harriet Harman), "the time is not right". The government has not ruled out further attempts to introduce e-voting. Oliver Heald MP, Shadow Secretary of State for Constitutional Affairs, has described the whole process as a shambles, citing the security concerns with e-voting. BBC News: http://news.bbc.co.uk/1/hi/uk_politics/4219008.stm The Register: http://www.theregister.com/2005/09/06/govt_voting/
In the never-ending tale of woe surrounding the German social services and unemployment software A2LL (produced by T-Systems, the software arm of the former German state Telecom company), *Der Spiegel* has just reported that the software miscalculates the health insurance premiums that the government pays every month - to the tune of 25 million Euros too much, every month. The bill is footed by the taxpayers, of course, since T-Systems wisely put a cap into the contract for reparations - a maximum of 5 million Euros is all T-Systems needs to pay: Spiegel: http://www.spiegel.de/wirtschaft/0,1518,372998,00.html Tagesschau: http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4712732_REF2,00.html Tagesspiegel: http://archiv.tagesspiegel.de/archiv/06.09.2005/2035255.asp Wikipedia for more background information on A2LL: http://de.wikipedia.org/wiki/A2LL According to *Der Spiegel*, an expert commission is already discussing what to do with the software, which was taken into service just in January of 2005. It has been declared to be in such a state of non-maintainability and non-adaptability ("nicht mehr wartungs- und entwicklungsfähig") that they are speaking about entirely new software - to be written, of course, by T-Systems, who brought on this mess in the first place. They are just trying to decide whether to start a new central "solution" or a decentralized one for each unemployment office, as there are many local rules and insurance providers that seem to be causing difficulty. The problem is with the insurance premiums for the unemployed, which were lowered retrospectively in March to save money for the government. A health insurance umbrella organization, VdAK, says it has difficulty in determining how much to pay back, if anything, because they do not know for exactly which people and months the wrong premium was calculated. A previous large error reported completely wrong data to the insurance companies on who exactly was insured when. 
The VdAK has said that when the German Social Services BA (Bundesagentur für Arbeit) gets their software straightened out, they will be glad - for a fee, of course - to see if they can repay the premiums paid in error. (In other news, the health insurance companies reported a surprise surplus recently...) Even with the error now known, the software cannot be fixed this year at all [the last time I looked we had about a third of a year left....-dww], although it seems that just the rate for the premiums needs to be adjusted from 14.3% to 13.2%. The problem seems to stem from there being hundreds of different insurance providers, all with slightly different premium calculations. See RISKS-23.53, 23.60, 23.92. Prof. Dr. Debora Weber-Wulff FHTW Berlin, FB 4, Internationale Medieninformatik Treskowallee 8, 10313 Berlin +49-30-5019-2320 email@example.com
The Berlin newspaper Tagesspiegel reports on a curious court case: http://archiv.tagesspiegel.de/archiv/31.08.2005/2022942.asp#art Seems that a social worker found a neat way to dole out funds to himself a few years ago. [And yes, Peter, the pun is intentional! --dww] Social services have a money machine set up in which, when a client is given money, instead of having it transferred to their account, a chip card is selected, and the number of the card typed into a computer program that controls payouts. The client takes the card to an ATM-like money machine, puts the card in, keys in the secret password which is [I hope you are sitting down... --dww] the *birthday* of the client, and takes out the money. A camera films the transaction, but the tapes are erased about 6 weeks later. The program records the payout in the files of the client, and only people with proper passwords have access to the payout system. This is called security. About 27.000 Euros (about the same in dollars these days) disappeared about 2 years ago. The revision department nailed down 22 transactions that had been conducted without an entry in the files of a client, and the clients knew nothing of the windfalls. The accused kept his mouth shut during the trial, and it was uncovered that the cards were not kept track of and "flew around the offices"; people would log onto their payout computers and remain logged in all day, sometimes leaving the office without locking the door. It would have been trivial for a colleague to quickly use a computer to load up a card, then slip it to an accomplice and have them pick up the cash. In addition, everyone seemed to know everyone else's passwords... The defence lawyer also noted that the social workers were all mad about the extra work they had to do for the new German dole system, so it really could have been anyone. 
Berlin remains out the 27.000 Euros and has to pay court costs, the accused keeps his job (but was transferred, probably to the filing room), and the judge recommends they re-think the security of the payout system. I'm with the judge on this one! Prof. Dr. Debora Weber-Wulff FHTW Berlin, FB 4, Internationale Medieninformatik Treskowallee 8, 10313 Berlin Tel: +49-30-5019-2320 Fax: +49-30-5019-2300 firstname.lastname@example.org http://www.f4.fhtw-berlin.de/people/weberwu/
The September issue of *IEEE Spectrum* has a number of articles of interest to comp.risks readers. Cover story about why the VCF project failed. Article about Praxis High-Integrity Systems in Bath, England, where they use formal methods to ensure program correctness. And an article on why software fails, including a list of 31 projects from 1992 to 2005 that failed after billions had been spent.
Last week I watched the chauffeur of a Mercedes car. There was a parking spot left just in front of another Mercedes. They were different types, though both fairly new. As I watched, the chauffeur got out of her car and pushed the button on the remote control to close the doors. The system worked. The doors of the Mercedes closed. The already parked Mercedes responded with a happy 'click' and opened its doors. The chauffeur, confident the click was her car telling her everything was fine, didn't pay attention, until I pointed out to her that she had opened the other Mercedes. She tried several times. When her car opened the other one closed. And vice versa. But she didn't see it as a problem: she could close the doors of her car and walk away. Until I pointed out the system probably worked the other way round as well ...
> [...] the claimed modus operanti seems unlikely as short range > wireless would be inactive unless the laptops were powered on [...] Actually my Apple G4 laptop has an entry in the Bluetooth properties to allow Bluetooth devices to wake up the computer. This is to enable the user to move a Bluetooth mouse or press a key on a Bluetooth keyboard to wake up the laptop. Of course, Bluetooth-enabled PDAs and cellphones are also at risk since these also respond to Bluetooth queries unless the feature has been turned off by the user. First generation Bluetooth devices imposed a significant burden on the battery of a portable device, which is why the user was made more aware of the wireless network (prominent annunciators indicating Bluetooth activity etc.). Newer Bluetooth devices can operate in very low power mode (light sleep) so they can be left turned on continuously. As the power requirements are decreased further, Bluetooth activity may become "transparent" to the user, resulting in another silent feature that can bite unsuspecting users. Vassilis Prevelakis, Ph.D., Computer Science Department, Drexel University, Philadelphia, PA 19104-2875 http://vp.cs.drexel.edu +1 215-895-2920