The Risks Digest

The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 24 Issue 3

Weds 7 September 2005


Katrina's telecom damage tops $400 Million; repairs may take months
Monty Solomon
Cockpit confusion found in Cypriot airliner crash
Lindsay Marshall
Flight Control System Software Anomalies
Peter B. Ladkin
Ships relying on GPS-based systems
Peter B. Ladkin
VT Gas pumps give up at $3/gallon
Monty Solomon
UK Elections: Web and text vote trials dropped
Chris Leeson
German social services software with new, costly errors
Debora Weber-Wulff
Not guilty because of system deficiencies
Debora Weber-Wulff
The FBI Virtual Case File and other disasters
Mercedes car-door locking functionality
Leon Kuunders
Re: Risks of Bluetooth pirates?
Vassilis Prevelakis
Info on RISKS (comp.risks)

Katrina's telecom damage tops $400 Million; repairs may take months

<Monty Solomon <>>
Tue, 6 Sep 2005 22:49:42 -0400

BellSouth reports that the telecom damages from Hurricane Katrina in the New
Orleans/Mississippi Gulf Coast area are on the order of half a billion
dollars, with repairs taking 4 to 6 months, according to preliminary
estimates.  Roughly 1.1 million lines are still out.  [Source: Arshad
Mohammed, *The Washington Post*, 6 Sep 2005; PGN-ed]

  [Of course, those costs are dwarfed by the overall catastrophe.  The huge
  magnitude of the natural disaster, the lack of foresight over past decades
  in protecting the levees, and the many problems with emergency responses
  are horrendous.  This once again reminds us of the extent to which we tend
  to deprecate far-sighted proactive risk management.  PGN]

Cockpit confusion found in Cypriot airliner crash

<"Lindsay Marshall" <>>
Wed, 7 Sep 2005 16:46:06 +0100

The crew members of a Cypriot airliner that crashed Aug 14 near Athens
became confused by a series of alarms as the plane climbed, failing to
recognize that the cabin was not pressurizing until they grew mentally
disoriented because of lack of oxygen and lost consciousness, according to
several people connected with the investigation into the crash.  In
addition, the German pilot and the young/inexperienced Cypriot co-pilot did
not have a common language in which they could speak fluently. and had
difficulty understanding each other's standard airline English.  A total of
121 people were killed in the crash after the plane climbed and flew on
autopilot, circling near Athens until one engine stopped running because of
a lack of fuel. The sudden imbalance of power, with only one engine
operating, caused the autopilot to disengage and the plane to begin to fall.
[Source: Don Phillips, International Herald Tribune, 7 Sep 2005; PGN-ed]
  [Also noted by Chuck Weinstock]

Flight Control System Software Anomalies

<"Peter B. Ladkin" <>>
Wed, 31 Aug 2005 11:19:04 +0200

In this era of fly-by-wire, I am fond of saying that, as far as I know,
there has never been a commercial aircraft accident caused by anomalies
in flight control software. And it has been 17 years (the first A320
was introduced into service in 1988).

It is thus well to remember that designing and writing critical
software-based systems  for such applications is not a routine task that
we now know how to perform. In fact, there are plenty of anomalies that
crop up that the public doesn't hear about. Here is one that made it out,
and a pointer to another.

The B777 is a high-capacity Boeing electric airplane (that is, fly-and-a-
lot-of-other-things-by-wire) designed inter alia for intercontinental
travel.  The aircraft has been in service since 1995, and 490 of them have
been delivered (for comparison, of Boeing's "jumbo", there are 625 B747-400
delivered, and a further 477 "classic" B747 still in service) [1].

The B777 has just been subjected to an emergency Airworthiness Directive
(AD) from the U.S. FAA [2]. In April 2005, the FAA issued AD 2005-10-03
requiring "modification of the operational program software (OPS) of the air
data inertial reference unit (ADIRU) from software version part number (P/N)
3470-HNC-100-03 to software version P/N 3475-HNC-100-06 or
3474-HNC-100-07. That AD resulted from a report of the display of erroneous
heading information to the pilot due to a defect in the OPS of the ADIRU."

An AD is issued in response to an identified hazard, and the reasons are
given as a list of possible consequences of the hazard, including worst-case
consequences: "We issued that AD to prevent the display of erroneous heading
information to the pilot, which could result in loss of the main sources of
attitude data, consequent high pilot workload, and subsequent deviation from
the intended flight path."

Attitude data consist of angle of pitch (nose up or down), angle of bank
(left/right wing low/high) and heading. The flight control system, and
pilots, also use the rate of change of these quantities, as well as their
accelerations, although these are not presented as separate displays to
pilots (one notes these are "trends" through cognitive processing rather
than display).

Emergency AD 2005-18-51 was issued on August 29, 2005. An unsafe condition
had been identified through analysis of an incident, and Boeing had issued
an Alert Service Bulletin on August 26 addressing the problem with
workarounds. The Emergency AD makes these actions mandatory. The FAA
explains as follows:

  Since [AD 2005-10-03] was issued, we received a recent report of a
  significant nose-up pitch event on a Boeing Model 777-200 series airplane
  while climbing through 36,000 feet altitude. The flight crew disconnected
  the autopilot and stabilized the airplane, during which time the airplane
  climbed above 41,000 feet, decelerated to a minimum speed of 158 knots,
  and activated the stick shaker. A review of the flight data recorder shows
  there were abrupt and persistent errors in the outputs of the ADIRU. These
  errors were caused by the OPS using data from faulted (failed) sensors.
  This problem exists in all software versions after P/N 3470-HNC-100-03,
  beginning with P/N 3477-HNC-100-04 approved in 1998 and including the
  versions mandated by AD 2005-10-03. While these versions have been
  installed on many airplanes before we issued AD 2005-10-03, they had not
  caused an incident until recently, and the problem was therefore unknown
  until then. OPS using data from faulted sensors, if not corrected, could
  result in anomalies of the fly-by-wire primary flight control, autopilot,
  auto-throttle, pilot display, and auto-brake systems, which could result
  in high pilot workload, deviation from the intended flight path, and
  possible loss of control of the airplane. ................

  We have evaluated all pertinent information and identified an unsafe
  condition that is likely to exist or develop on other Boeing Model 777
  airplanes of this same type design. Therefore, we are issuing this AD to
  prevent the OPS from using data from faulted (failed) sensors, which could
  result in anomalies of the fly-by-wire primary flight control, autopilot,
  auto-throttle, pilot display, and auto-brake systems. These anomalies
  could result in high pilot workload, deviation from the intended flight
  path, and possible loss of control of the airplane. This new AD supersedes
  AD 2005-10-03.

Note that the consequences list has been extended by "possible loss of
control of the airplane". According to John Sampson, the incident to which
the AD refers occurred to a Malaysian Airlines B777-200 on 3 August 2005, on
Flight MH 124 from Perth to Kuala Lumpur [3]. The aircraft returned to Perth
after 51 minutes flight for an emergency landing after an ADIRU malfunction
which caused a "flight control outage".

This is the first public statement of which I know which addresses classes
of Byzantine faults. Byzantine faults have occurred, seriously, in avionics
before but the details are not public (see the quote from [6] below).
Byzantine faults are faults in which agents (sensors, computers) in a
distributed system "lie" to their interlocutors: they do not fail silently
but distribute erroneous data, or data which is read differently by
different receivers. The name arose from a whimsical analogy by Lamport,
Shostak and Pease to a group of Byzantine generals trying to reach agreement
in a situation in which no one trusts anyone else to speak the truth. The
classic papers from twenty years ago are [4,5], and I understand arose from
SRI International's involvement in trying formally to verify the operating
system of the first digital flight control computer, SIFT.

Dealing with Byzantine faults became an extremely active area of distributed
computing theory, but practitioners did not take them so seriously at first,
perhaps partially due to the very high resource consumption of the
solutions: Lamport, Shostak and Pease showed that any correct algorithm to
achieve consensus required a large number of processors (roughly speaking,
at least 3n+1, where n is the number of "liars") and a lot of processor
cycles. It follows that solutions judged to be practical are unlikely to be
complete solutions, and therefore one must analyse the actual problem space
more closely to find out where one can most profitably handle possible
problems, and which areas one can ignore.

The SAFEbus, the backplane communications bus of the B777 flight control
system, now standardised as ARINC 659, was designed by Ken Hoyme and Kevin
Driscoll at Honeywell. Driscoll, with Honeywell colleagues Hall and Zumsteg,
and Sivencrona (Chalmers Uni, Sweden) wrote a paper in SAFECOMP 2003 in
which they described occurrences of Byzantine faults in avionics and how one
can deal with them (or not, as the case may be) [6]. They say "Byzantine
faults in safety-critical systems are real and occur with failure rates far
more frequently than 10**(-9) faults per operational hour. In addition, the
very nature of Byzantine faults allows them to propagate through traditional
fault containment zones, thereby invalidating system architectural

Driscoll et al. refer to a set of incidents in which the occurrence of
Byzantine failures "threatened to ground all of one type of aircraft".  This
set of incidents is not publicly available. I quote:

  This aircraft had a massively redundant system (theoretically, enough
  redundancy to tolerate at least two Byzantine faults). but, no amount of
  redundancy can succeed in the event of a Byzantine fault unless the system
  has been designed specifically to tolerate these faults. In this case,
  each Byzantine fault occurrence caused the simultaneous failure of two or
  three "independent" units. The calculated probability of two or three
  simultaneous random hardware failures in the reporting period was 5 x
  10**(-13) and 6 x 10**(-23) respectively. After several of these
  incidents, it was clear that these were not multiple random failures, but
  a systematic problem. The fleet was just a few days away from being
  grounded, when a fix was identified that could be implemented fast enough
  to prevent idling a large number of expensive aircraft.

The significance of the 10**(-9) figure is that airworthiness requires that
a "catastrophic" failure of aircraft systems occur with a rate less than
this per operational hour. The figure was originally chosen to be low enough
that one would not expect a catastrophic failure during the lifetime of the
fleet of that type aircraft (whether this calculation still holds is a
separate question) [7]. Any hazard (for example, failure) with potentially
catastrophic consequences which is seen or judged to have more frequent
occurrence can lead to withdrawal of the airworthiness certificate of the
type. Hence Driscoll et al.'s story.

I do not know (yet) whether the fault identified in AD 2005-18-51 is of one
of the types specifically considered by Driscoll et al.

Circumstances in which messages are sent which are misinterpreted by
receivers are not at all unusual. It is not clear that Driscoll et al. would
classify these all as Byzantine faults. A well-known occurrence in which an
error message was misinterpreted as navigation data is the in-flight
break-up of the first Ariane 5, Ariane Flight 501 [8].  Another example from
another area of transportation is the grounding of the cruise ship Royal
Majesty in 1995, in which incident the autopilot was designed in the
expectation that the GPS would fail silent, but the GPS continued to send
dead-reckoning data for over a day when it failed to receive a signal. The
ship tracked 17 miles off course and grounded on Rose and Crown shoal near
Nantucket Island, near Boston, MA [9,10]. The fault models of the system
designers in the avionics case in [6], as well as in the Ariane 5
architecture and that of the STN Atlas NACOS 25 autopilot on the Royal
Majesty, were inappropriate for the task.

There are various conclusions one can draw:

* The kinds of numbers used in Fault Tree Analysis for random hardware
  failures in software-based systems give no good indication of the rate of
  systematic failures (due to design or to errors in software) which can be

* Fault-handling models are crucial parts of the architecture and their
  assumptions are critical. (This is made clear by the incidents discussed
  in [6,8,9].)

* That there have been no accidents does not mean that there are no
  occurrences of substantial problems with potentially catastrophic
  consequences with software-based critical avionics.


Thanks to John Sampson for pointing me to [2]; John Rushby for pointing me
to [6]; Rod Chapman for pointing me to [8].


[1] World Airliners Directory, Flight International, 26 Oct - 1 Nov 2004.

[2] U.S. Federal Aviation Administration, AD 2005-18-51
Available at

[3] Accidents and Incidents, Aviation Safety Week, 08 August 2005.

[4] Pease, M., Shostak, R., and Lamport, L., Reaching Agreement in the
Presence of Faults, J. ACM 27(2): 228-234, 1980. Available from

[5] Lamport, L., Shostak, R., and Pease, M., The Byzantine Generals Problem,
ACM Trans. Prog. Lang. Sys. 4(3): 382-401, 1982. Available from

[6] Driscoll, K., Hall, B., Sivencrona, H., and Zumsteg, P., Byzantine Fault
Tolerance, from Theory to Reality, in S. Anderson, M. Felici and
B. Littlewood, eds, Computer Safety, Reliability, and Security, 22nd
International Conference, SAFECOMP 2003, Edinburgh, UK, September 23-26,
2003, Book series Lecture Notes in Computer Science No. 2788,
Springer-Verlag, 2003. Available from

[7] Lloyd, E., and Tye, W., Systematic Safety, U.K. Civil Aviation Authority
Publications, 1982.

[8] O'Halloran, C., Ariane 5: Learning from Failure, in J. Bowen and
J. Woodcock, eds, Proceedings of the Grand Challenge 6 Workshop on
Dependable Systems Evolution, at Formal Methods '05 conference, 18 July
2005, Newcastle-upon-Tyne. Available at

[9] Heidiecker, L., Hoffmann, N., Husemann, P., Ladkin, P. B., Paller, J.,
Sanders, J., Stuphorn, J., Vangerow, A., WBA of the Royal Majesty Accident,
RVS-RR-03-01, RVS Group, University of Bielefeld, 1 July 2003. Available
from -> Publications -> Research Reports.

[10] U.S. National Transportation Safety Board, Marine Accident Report,
Grounding of the Panamanian Passenger Ship Royal Majesty on Rose and Crown
Shoal near Nantucket, Massachusetts, June 10, 1995. Report Number
NTSB/Mar-97/01. Available from

Peter B. Ladkin, University of Bielefeld, Germany

Ships relying on GPS-based systems

<"Peter B. Ladkin" <>>
Wed, 31 Aug 2005 13:57:18 +0200

(was: US Navy to drop paper charts, Lichtensteiger, RISKS-24.02)

> For commercial shipping, with it's much smaller crews, and civilian
> sailors, the level of faith placed in a GPS and chartplotter scares me.

The cruise ship Royal Majesty ran aground off Nantucket Island in 1995.  The
crew had been relying for over a day on an autopilot taking readings from a
GPS position sensor. The GPS signal had been lost, supposedly thanks to the
aerial being inadvertently disconnected, shortly after setting off from
Bermuda, and the sensor was giving information through dead reckoning. It
seems no one noticed, despite having a Loran available as a cross
check. Also, the auto pilot error-handling was based on the GPS sensor being
fail-silent, which was an incorrect assumption.

During the last hours, tracking some 17 miles off course, into known
dangerous shallow waters, various obvious signals were ignored (white and
"confused" water ahead; static shore lights sitting in the middle of the
"ocean") as well as misidentification, and failure of identification, of
buoys. The ship ran aground on Rose and Crown shoal and needed to be

The report is on the U.S. NTSB WWW site. Slides from a talk, as well as a
paper, giving a Why-Because Analysis of the accident may be found at -> Bieleschweig Workshops -> Second Workshop ->

Thanks to Luke Emmet of Adelard for suggesting this as a case study in 2002.

Peter B. Ladkin, University of Bielefeld, Germany

VT Gas pumps give up at $3/gallon

<Monty Solomon <>>
Sat, 3 Sep 2005 14:09:31 -0400

1% of Vermont's 6,000 gas pumps are unable to compute with gas prices over
$2.99.  [PGN-ed from an Associated Press item,  3 Sep 2005]

UK Elections: Web and text vote trials dropped

<Chris Leeson <>>
Wed, 7 Sep 2005 10:10:34 +0100

Government plans to introduce e-voting for next year's local council
elections have been dropped. According to the government spokesman
(Elections Minister Harriet Harman), "the time is not right".  The
government has not ruled out further attempts to introduce e-voting.  Oliver
Heald MP, Shadow Secretary of State for Constitutional Affairs, has
described the whole process as a shambles, citing the security concerns with

BBC News:
The Register:

German social services software with new, costly errors

<Debora Weber-Wulff <>>
Tue, 06 Sep 2005 09:06:53 +0200

In the never-ending tale of woe surrounding the German social services and
unemployment software A2LL (produced by T-Systems, the software arm of the
former German state Telecom company), the Spiegel has just reported that the
software miscalculates the health insurance premiums that the government
pays every month - to the tune of 25 million Euros too much, every
month. The bill is footed by the taxpayers, of course, since T-Systems
wisely put a cap in to contract for reparations - a maximum of 5 million
Euros is all T-Systems needs to pay:

Wikipedia for more background information on A2LL:

According to *Der Spiegel*, an expert commission is already discussing what
to do with the software, which was taken into service just in January of
2005.  It has been declared to be in such a state of non-maintainability and
non-adaptability ("nicht mehr wartungs- und entwicklungsfähig") that they
are speaking about an entirely new software - to be written, of course, by
T-Systems, who brought on this mess in the first place. They just are trying
to decide whether to start a new central "solution" or a decentralized one
for each unemployment office, as there are many local rules and insurance
providers that seem to be causing difficulty.

The problem is with the insurance premiums for the unemployed, which was
lowered retrospectively to save money for the government in March. A health
insurance umbrella organization, VdAK, says it has difficulty in determining
how much to pay back, if anything, because they do not know for exactly
which people and months the wrong premium was calculated. A previous large
error reported completely wrong data on who exactly was insured when to the
insurance companies. The VdAK has said that when the German Social Services
BA (Bundesagentur für Arbeit) gets their software straightened out, they
will be glad - for a fee, of course - to see if they can repay the premiums
payed in error. (In other news, the health insurance companies reported a
surprise surplus recently...)

Even with the error now known, the software will not be able to be fixed
this year at all [the last time I looked we had about a third of a year
left....-dww], although it seems that just the rate for the premiums needs
to be adjusted from 14.3% to 13.2%. The problem seems to stem from there
being hundreds of different insurance providers, all with slightly different
premium calculations.

See RISKS-23.53, 23.60, 23.92.

Prof. Dr. Debora Weber-Wulff FHTW Berlin, FB 4, Internationale Medieninformatik
Treskowallee 8, 10313 Berlin  +49-30-5019-2320

Not guilty because of system deficiencies

<Debora Weber-Wulff <>>
Wed, 31 Aug 2005 18:14:00 +0200

The Berlin newspaper Tagespiegel reports on a curious court case:

Seems that a social worker found a neat way to dole out funds to himself a
few years ago.  [And yes, Peter, the pun is intentional! --dww]

Social services have a money machine set up in which, when a client is given
money, instead of having it transferred to their account, a chip card is
selected, and the number of the card typed into a computer program that
controls payouts. The client takes the card to an ATM-like money machine,
puts the card in, key is the secret password which is [I hope you are
sitting down... --dww] the *birthday* of the client, and takes out the
money. A camera films the transaction, but erases the tapes about 6 weeks

The program records the payout in the files of the client, and only people
with proper passwords have access to the payout system. This is called

About 27.000 Euros (about the same in dollars these days) disappeared about
2 years ago. The revision department nailed down 22 transactions that had
been conducted without an entry in the files of a client, and the clients
knew nothing of the windfalls.

The accused kept his mouth shut during the process, and it was uncovered
that the cards were not kept track of and "flew around the offices", people
would log onto their payout computers and remain logged in all day,
sometimes leaving the office without locking the door.  It would have been
trivial for a colleague to quickly use a computer to load up a card, then
slip it to an accomplice and have them pick up the cash. In addition,
everyone seemed to know everyone else's passwords...

The defence lawyer also noted that the social workers were all mad about the
extra work they had to do about the new German dole system, so it really
could have been anyone.

Berlin remains out the 27.000 Euros and has to pay court costs, the accused
keeps his job (but was transferred, probably to the filing room), and the
judge recommends they re-think the security of the payout system. I'm with
the judge on this one!

Prof. Dr. Debora Weber-Wulff
FHTW Berlin, FB 4, Internationale Medieninformatik
Treskowallee 8, 10313 Berlin
Tel: +49-30-5019-2320      Fax: +49-30-5019-2300

The FBI Virtual Case File and other disasters

Mon, 5 Sep 2005 20:24:08 -0500 (CDT)

The September issue of *IEEE Spectrum* has a number of articles of interest
to comp.risks readers.  Cover story about why the VCF project failed.
Article about Praxis High-Integrity Systems in Bath, England, where they use
formal methods to ensure program correctness.  And an article on why
software fails, including a list of 31 projects from 1992 to 2005 that
failed after billions had been spent.

Mercedes car-door locking functionality

<"Leon Kuunders" <>>
Sat, 13 Aug 2005 00:56:48 +0200

Last week I watched the chauffeur of a Mercedes car. There was a parking
spot left just in front of another Mercedes. Both different types, though
fairly new. As I watched by the chauffeur got out of her car and pushed the
button on the remote control to close the doors.

The system worked. The doors of the Mercedes closed. The already parked
Mercedes responded with a happy 'click' and opened it's doors. The
chauffeur, confident the click was her car telling everything was fine,
didn't pay attention, until I pointed her to the fact that she opened the
other Mercedes.

She tried several times. When her car opened the other one closed. And vice
versa. But she didn't see it as a problem, she could close the doors of her
car and walk away.  Until I pointed out the system probably worked the other
way round as well ...

Re: Risks of Bluetooth pirates? (Kramer, RISKS-24.02)

< (Vassilis Prevelakis)>
Mon, 29 Aug 2005 22:14:58 -0400 (EDT)

> [...] the claimed modus operanti seems unlikely as short range
> wireless would be inactive unless the laptops were powered on [...]

Actually my Apple G4 laptop has an entry in the Bluetooth properties to
allow Bluetooth devices to wake up the computer.  This is to enable the user
to move a Bluetooth mouse or press a key on a Bluetooth keyboard to wake up
the laptop.

Of course, Bluetooth-enabled PDAs and cellphones are also at risks since
these also respond to Bluetooth queries unless the feature has been turned
off by the user.

First generation Bluetooth devices imposed a significant burden on the
battery of a portable device which is why the user was made more aware of
the wireless network (prominent annunciators indicating Bluetooth activity
etc.). Newer Bluetooth devices can operate in very low power mode (light
sleep) so they can be left turned on continuously. As the power requirements
are decreased further, Bluetooth activity may become "transparent" to the
user resulting in another silent feature can bite unsuspecting users.

Vassilis Prevelakis, Ph.D., Computer Science Department, Drexel University,
Philadelphia, PA 19104-2875  +1 215-895-2920

Please report problems with the web pages to the maintainer