The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 9 Issue 4

Thursday 13 July 1989

Contents

o Air Traffic Computer Fails 104 Times in a Day
Rodney Hoffman
o A320/MD-11 F-B-W differ on pilot authority
Mark Seecof
Rodney Hoffman
o UK MoD S/W Std -- "Crystal Clock" Architecture
Bob Munck
o A Biological Virus Risk
Frank Houston
o Software engineering models -- an apology
Rich D'Ippolito
o Info on RISKS (comp.risks)

Air Traffic Computer Fails 104 Times in a Day

Rodney Hoffman <Hoffman.ElSegundo@Xerox.com>
12 Jul 89 07:59:01 PDT (Wednesday)
Summary of an article by Jeffrey A. Perlman in the `Los Angeles Times', 12 July
1989, headlined AIR TRAFFIC COMPUTER FAILS 104 TIMES IN A DAY:

Unspecified hardware problems caused 104 brief failures in a new
multimillion-dollar computer system throughout the day Sunday at the Coast
Terminal Radar Approach Control, known as Coast TRACON, in El Toro.  The
failures, up to five minutes in length, wiped altitude, speed, aircraft
identification, and radio frequency data off of controllers' screens, leaving
only aircraft blips or targets.

The new system has no backups, unlike the aging system it replaced 2.5 months
ago.  Furthermore, a decoder which could have provided some of the lost
information from aircraft transponders, is inoperative because parts are on
back order.  During the failures, controllers had no way of knowing an
aircraft's altitude or identity except through voice communications.

Controllers said the failures endangered air safety, although the FAA minimized
the hazards.  Although there were no "near misses," many aircraft departing
from LA International airport did encounter short delays because controllers
were swamped.

An emergency technical crew was flown in from New Jersey and worked all night
Monday to correct the problem.  A controller spokesman said there were still at
least three more failures as of Tuesday afternoon, although the FAA's assistant
regional manager said he was unaware of any further computer failures after
Monday night's repairs.

The National Air Traffic Controllers Association has previously filed
complaints with the FAA about the El Toro system and about TRACON facilities
elsewhere that have received the same system.

[MORE SUBSEQUENTLY FROM RODNEY:]

The previously unspecified hardware problems are now spelled out:
According to Jim Panter, manager of the facility, "no new computer system
was involved.  It was strictly old equipment that needed to be replaced."
The series of computer outages was traced to four brittle, broken wires
inside aging memory storage units.  Four of the Univac computer's 11 memory
modules, in use for at least 12 years, went out on Sunday because of the
wires.  A heat build-up aggravated the situation, with the computer system
shutting itself down and restarting, catching up with new data, as it tried
on its own to locate the memory storage problem.

A special crew from the FAA's technical center in Atlantic City, NJ, worked
on successive nights this week to solve the problem, but because of a parts
shortage some outages still occurred Monday and Tuesday, Panter said.
There were no outages Wednesday, he said.

The computer system involved has been in use at least since 1972, although
it has been upgraded several times with new software.  The software was not
involved in Sunday's failures, although it has been involved in problems at
other TRACON facilities.  "Just like any program, the new software has had
its glitches," said FAA spokesman John Leyden in Washington.

  [This item was also noted -- briefly -- by Dave Curry.  PGN]


A320/MD-11 F-B-W differ on pilot authority

<lcc.marks@SEAS.UCLA.EDU>
Wed, 12 Jul 89 11:23:44 PDT
These excerpts from "Flying the Electric Skies," M. Mitchell Waldrop, Science
(30 June 1989, p.1532ff, v.244).  Elisions... and [bracketed paraphrasing] are
mine as part of condensation.  Comments follow.

  [...]  McDonnell Douglas... thinks [pure fly-by-wire isn't quite mature, so
it's]...  taking a different tack with its new MD-11, a three-engine widebody
scheduled to start service in late 1990.  The MD-11 is even more highly
automated than the A320 in many ways. ... But the MD-11's computerized controls
most emphatically will have mechanical backups.  ``The pilots have to have
control in all situations, not just the normal ones,'' says Douglas' Miller.
``A lot of what ended up in the Airbus got done because it was neat,'' claims
Joel Ornelas, manager of the MD-11 design effort.  ``Engineers love it.  But
the pilots...?''

  [...]  Airbus [doesn't agree.  They think their stuff is great and saves
weight and can't fail; but]... the A320 does have a partial backup system: a
set of cables running back to the tail and rudder.  In an absolute power
failure those cables are supposed to let the crew keep the airplane under
control until they establish emergency power--or if need be, longer.  ``In test
flights we've demonstrated {operation with the backups} from cruise... to
landing... which is more than they were designed to do.''

  [...  However, the pilot of an A320] is... going to have to live with the
A320's preprogrammed restrictions on what it will do.  This guardian angel
behavior, known more formally as the Flight Envelope Protection System, is
widely considered by aviation professionals to be... far more revolutionary...
than fly-by-wire per se.  In effect, the A320's designers have decreed that
their judgement about the aircraft's limits will always take precedence over
the pilot's judgement.  And that is not a constraint that any pilot can take
lightly.

  ``*Nothing* must have the authority to forbid the pilot to take the actions
he needs to,'' says McDonnell Douglas' Miller.  The problem with giving away
that authority to a computer, he says, is that ``a computer is totally
fearless--it doesn't know that it's about to hit something.''

  Imagine, for example, that an A320 pilot makes a sharp turn to avoid an
imminent collision, or else dives and then has to pull out to avoid the ground.
No matter how desperately he or she hauls back on that sidestick, the
envelope protection system will not let the airplane respond beyond a certain
rate: namely, the rate that limits stress on the airframe to... 2.5G.

  ...this seems like plenty.  [It would] require quite a violent maneuver [to
pull 2.5G]. ...But consider... China Airlines Flight 006 on 19 February 1985.
While cruising at 41,000 feet some 300 miles NW of San Francisco... the 747
suffered [power loss and cockpit confusion leading to a] near vertical dive.
Over the next three minutes it plunged nearly 6 miles, until the captain was
able to maneuver it back into level flight at only 9500 feet.  He measurably
warped the wings.  He caused several million dollars worth of other structural
damage.  But he saved the airplane and passengers.  And to do it he had to pull
an estimated 5.5G, or more than twice the A320 limits.

  Airbus, not surprisingly, responds that an A320 never would have [entered the
dive] in the first place. ...Be that as it may, notes [747 pilot] Waldrip, many
pilots do like to feel that they can bend or break the aircraft [if they must].
[McDonnell Douglas agrees.]  ``Our strategy...  is that the pilot must have
overriding authority,'' says Miller, ``and he must be able to exercise that
control by the normal method''--the same eye-hand-brain control loop that a
pilot develops through... experience.

  For example, imagine a pilot caught by wind shear, a sudden and very
dangerous burst of cold air falling out of a rain cloud.  The best thing to do
in such a situation is to pull the nose up and put the power to the wall.
Never mind if the engines have to be rebuilt later; you're just trying to keep
from hitting the ground.  However, this is hardly the time to be thinking about
things like special override switches, says Miller.  ``You want your normal...
actions to always have the expected effects.''  So on an MD-11, the pilot would
push the throttle all the way forward until it stopped to get the full rated
power of the engine, with the built in limits providing the same kind of
security as the A320's envelope protection system.  But then by doing the
instinctive thing-- pushing very, very hard-- he could break through to a
regime the A320 absolutely forbids: extreme thrust, well beyond the rated
power.  And so it goes throughout the MD-11, says Miller.  The limits are there
for safety, ``but to override you just have to pull hard or push hard.''  [...]

The Mac-Dac engineers seem to have worked hard on the ``human-factors'' stuff.
I'm struck by the thought that `push hard to override' is rather like "pounding
on things," which has sort-of appeared in office computer interfaces; for
example... giving a likely-to-be-an-error command to the vi text editor will
provoke a message the first time you try it, but if you repeat the command
immediately vi will shrug and do what you want.

It seems to me that risks from faulty computer-human interaction probably
can be reduced by this sort of thing.  When time is critical, it's handy
if "more force" gets you what you want without fumbling around for some
obscure override control.
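
The contrast between a hard limit and a "push hard to override" limit can be
sketched in a few lines of Python.  This is only an illustration: the names
SoftLimit and hard_limit, the notion of a numeric "effort," and every number
below are hypothetical, not taken from the MD-11 or A320 designs.

```python
from dataclasses import dataclass


@dataclass
class SoftLimit:
    """A limit that holds under normal effort but yields to extra force.

    All values are illustrative assumptions, not aircraft data.
    """
    rated_max: float      # output at the normal stop (1.0 = rated power)
    absolute_max: float   # most the system can deliver past the stop
    breakthrough: float   # effort needed to push through the stop

    def output(self, demand: float, effort: float) -> float:
        """Return the commanded output for a demand made with some effort."""
        demand = max(0.0, demand)
        if demand <= self.rated_max:
            return demand
        if effort >= self.breakthrough:
            # Operator pushed hard: honor the demand up to the absolute limit.
            return min(demand, self.absolute_max)
        # Ordinary effort: hold at the rated stop.
        return self.rated_max


def hard_limit(demand: float, rated_max: float) -> float:
    """A hard clamp: no amount of effort gets past the limit."""
    return min(max(0.0, demand), rated_max)


throttle = SoftLimit(rated_max=1.0, absolute_max=1.15, breakthrough=40.0)
normal_push = throttle.output(demand=1.1, effort=10.0)   # held at the stop
hard_push = throttle.output(demand=1.1, effort=50.0)     # breaks through
```

The point of the sketch is that the override uses the same control and the
same motion as normal operation; only the effort differs.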

The issue of "who's in charge here, anyway?" is more interesting.  The
A320 approach assumes that pilot error is more likely than proper-though-
unusual control motions.  Since pilots are still more versatile than the
computers in their planes, they should probably have the last word.  Also,
pilots can learn new tricks faster than computers can receive certified
software upgrades.  Post-aircrash review (and recently, simulator
recreation/experimentation) often produces "how-to" bulletins for pilots
to deal with unusual situations.  It might be a long time before that
stuff could be translated into revised FCS software.

Mark Seecof, Locus Computing Corp. Los Angeles (213-337-5218)
My opinions only, of course...


Fly-by-wire overview

Rodney Hoffman <Hoffman.ElSegundo@Xerox.com>
12 Jul 89 14:52:10 PDT (Wednesday)
[More excerpts from the same 'Science' article as in the foregoing...  PGN]

The stories briefly survey the philosophy, state of the art, and controversy of
fly-by-wire aircraft and (overly-?) protective new extensions.  Designers on
both sides of the argument, as well as pilots, are given their say.  [...]

Airbus engineering test pilot Udo Guenzel:  ``Suppose you suddenly find
yourself staring at a Cessna that has wandered into your airspace.  So you
swerve.  Now, in a standard airliner, you would probably hold back from
maneuvering as hard as you could for fear of tumbling out of control, or
worse.... But in the A320, you could just slam the controller all the way to
the side and instantly get out of there as fast as the plane will take you.''

The A320 has five separate computers, "redundant software obtained from two
different vendors to minimize the possibility that the same bug will appear
simultaneously," heavy shielding on its data cables to keep out electromagnetic
interference, "and, yes, the A320 does have a partial [mechanical] backup
system."

The sidebar reviews more of the history of fly-by-wire and flight
automation philosophy:

   "We started out with cockpit automation backwards," says Northwest
   Airlines 747 pilot Kenneth Waldrip.  In the 1970s and early 1980s,
   he says, "the idea was that the computers would fly the plane and the
   pilot would monitor them in case anything went wrong...."  There was
   only one problem with that scenario, Waldrip says:  humans are
   absolutely terrible at passive monitoring.... People get bored.  Their
   attention flags.  They start missing things.  Worse, a passive pilot
   would often have to tackle an emergency cold....

   By the mid-1980s, aircraft designers, pilot trainers, and the aviation
   community generally had gone through a 180-degree turn in their concept
   of what automation should do.  The new philosophy, which often goes
   under the name of "human-centered" automation, was illustrated in 1980
   in a seminal paper by human factors researchers Earl Wiener (Univ. of
   Miami) and Renwick Curry (NASA Ames Research Center).  They used the
   image of an "Electric Cocoon" [similar to the Flight Envelope Protection
   System  of today's  A320].

   Instead of having people watch the machines, human-centered automation
   means having the machines watch the people.... and it means putting
   automation back on the right track: as an assistant to the pilots.


UK MoD S/W Std -- "Crystal Clock" Architecture

Bob Munck <munck@mbunix.mitre.org>
Thu, 13 Jul 89 14:31:52 EDT
I can't let go unchallenged Norm Finn's and Nancy Leveson's statements
(RISKS-9.2, .3) on what are variously called "polling," "time-slot," or
"crystal clock" system architectures:

   "Polling is absolutely the safest and most-used method for
    programming embedded systems for safety-critical applications."

   "The trivial scheduling algorithm makes the interaction between the
    routines vastly easier to write and verify than is the case in a
    system that can switch tasks at random times."

   "Software with these characteristics is safer because it is
    deterministic rather than non-deterministic -- the number of states
    may often be reduced to a small enough number to perform extensive
    (and sometimes exhaustive) testing and analysis."

I don't want to comment on the _theoretical_ truth of these statements;
that argument may never be resolved.  However, in my personal experience
with major DoD systems over many years, this approach to system
architecture has led to many more absolute abominations than any other.
                                  ^^^^^^^^ ^^^^^^^^^^^^
The major problem, I believe, is that it does not work well with normal
design and coding team organization and management.  The approach AS
PRACTICED seems to grind together the pieces of any original functional
organization, break them into small pieces, and then "shotgun" them
all over the time-slot hierarchy.  Traceability from requirements to
code, never our strong suit, is lost.  "Invisible dependencies" become
the order of the day: code in major cycle XXYa34/minor cycle Ybbb_CAL
assumes that variable IXXm3_AP was set correctly by a side-effect of
code in XXZdd3/PQ88_SET/ECP9347q; when XXZdd3 is reorganized because it
wasn't fitting in its time-slot, the assumption becomes incorrect.  This
is first noticed when the $100 million satellite attempts to orbit at
-800 miles altitude.  Because our teams are large, scattered,
ill-managed, and leavened with the mediocre, the possible advantages
quoted above are overwhelmed.
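
A minimal sketch of such an "invisible dependency," in Python for brevity.
The task names, the shared-state dictionary, and the schedules are all
hypothetical; real time-slot executives run on fixed timer ticks and are far
larger.  The sketch assumes only that tasks communicate through shared
variables and that nothing records which task must run before which.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
Task = Callable[[State], None]
Schedule = List[List[Task]]   # one inner list of tasks per minor cycle


def update_altitude(state: State) -> None:
    # Side effect: refreshes a shared variable that other tasks read.
    state["altitude_km"] = state["raw_altitude_m"] / 1000.0


def compute_burn(state: State) -> None:
    # Hidden assumption: update_altitude already ran in this major cycle.
    state["burn"] = max(0.0, state["target_km"] - state["altitude_km"])


def run_major_cycle(schedule: Schedule, state: State) -> State:
    """Run every task of every minor cycle once, in slot order."""
    for minor_cycle in schedule:
        for task in minor_cycle:
            task(state)
    return state


def fresh_state() -> State:
    return {"raw_altitude_m": 300000.0, "altitude_km": 0.0,
            "target_km": 400.0, "burn": 0.0}


# Original layout: the producer happens to run before the consumer.
original: Schedule = [[update_altitude, compute_burn]]

# After update_altitude is moved to a later slot to fit its time budget,
# compute_burn silently reads the stale altitude_km value.
reorganized: Schedule = [[compute_burn], [update_altitude]]

burn_before = run_major_cycle(original, fresh_state())["burn"]
burn_after = run_major_cycle(reorganized, fresh_state())["burn"]
```

Both schedules run without any error; only the computed result changes, which
is why this kind of fault tends to surface late.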

I feel about these dueling approaches as I do about HOLs: a good
programmer/team/company can write a good system in any language, and a
bad one can write a bad system in any language.  The time-slot systems
I've seen might have been even worse had they been designed to be
multi-tasking; the choice of architecture was not the root cause of
failure.  It may be a symptom.
                         -- Bob Munck, MITRE Corporation, McLean


A Biological Virus Risk

Frank Houston <houston@itd.nrl.navy.mil>
Wed, 12 Jul 89 12:54:43 -0400
         Over the past six years, FDA has documented an increase in
       medical device problem reports of and voluntary recalls for
       system design errors and computer programming errors.  Some of
       the reported errors have compromised patient safety.  Included
       are blood bank computer systems used in controlling the release of
       whole blood and blood products for treating patients.  Some of
       the errors could have allowed institutions to release infected
       blood products.

         Such inappropriate release is particularly hazardous to the
       public health because blood products infected with HIV (AIDS virus)
       and hepatitis might become available for transfusion.  Fortunately,
       so far, no cases of HIV infection have been traced directly to
       computer errors, but agency experts and field investigators believe
       this result is only because the institutions have not depended
       heavily on software controls to authorize release of blood products.

One particularly troubling aspect of the problem, uncovered during an
inspection, was a lack of security on serologic data entries.  The system
allowed more than one operator to edit the same record simultaneously, with the
result that the last one to close the record had the final word on the contents
of the record.  That is to say, one operator could enter a test result of
positive for hepatitis and exit while at the same time a second operator
entered some other result (the default hepatitis result being negative); so
when the second operator exited, the default negative for hepatitis became the
permanent entry.  This seems to violate well known principles of data
integrity, but it happened.
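
The failure described above is the classic "lost update."  One common guard
is a version check at save time (optimistic concurrency control).  The Python
below is a single-process sketch under stated assumptions: RecordStore,
StaleRecordError, and the field names are hypothetical and do not describe
the blood bank software in question.  A real system would also need the check
and the write to be one atomic step, for instance inside a database
transaction.

```python
from dataclasses import dataclass
from typing import Dict


class StaleRecordError(Exception):
    """Raised when a save is based on an out-of-date copy of a record."""


@dataclass
class Record:
    fields: Dict[str, str]
    version: int = 0


class RecordStore:
    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def create(self, key: str, fields: Dict[str, str]) -> None:
        self._records[key] = Record(dict(fields), 0)

    def open(self, key: str) -> Record:
        # Each operator edits a private copy tagged with the version seen.
        rec = self._records[key]
        return Record(dict(rec.fields), rec.version)

    def save(self, key: str, edited: Record) -> None:
        current = self._records[key]
        if edited.version != current.version:
            # Someone else saved since this copy was opened: refuse to
            # overwrite, so the operator must reopen and re-apply the edit.
            raise StaleRecordError(key)
        self._records[key] = Record(dict(edited.fields), current.version + 1)


store = RecordStore()
store.create("unit-1", {"hepatitis": "negative"})   # default result

first = store.open("unit-1")
second = store.open("unit-1")        # opened while the first is editing

first.fields["hepatitis"] = "positive"
store.save("unit-1", first)          # accepted; version becomes 1

second.fields["blood_type"] = "O+"   # still carries the default "negative"
try:
    store.save("unit-1", second)
    second_save_rejected = False
except StaleRecordError:
    second_save_rejected = True
```

With the check in place, the last operator to close the record no longer has
the final word; the positive result survives.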

The software in question is no longer used, but the problem remains:  how does
one assure that dangerous bugs have been eliminated from commercial software
when one has neither real-time information about nor control over the way it is
developed?

Frank Houston, FDA/CDRH


Software engineering models -- an apology

<rsd@SEI.CMU.EDU>
Wed, 12 Jul 89 17:25:50 EDT
[The lateness of this response and to my e-mail is due to a city watermain
break that flooded the basement of the building with the cooling equipment
in it, shutting down our building for 6 days -- as noted in RISKS-9.1.]

First, let me apologize to Stan Shebs and the readers of RISKS for the tone
and content of my recent article.  It was not intended to be a shoot-the-
messenger reply, and I'm sorry that it appears to be.  I most likely read
the original article as a series of general unsupported charges against a
specific former employer who had no way to respond, and took my cue from the
clearly provocative title.

Let me make the attempt to extract some lessons from the original article
and all of the subsequent discussion:

Mr. Loux asks if the intent was to discourage whistle-blowers.  Of course
not; I didn't take the article as whistle-blowing to begin with.  The intent
was to elicit more specific information on exactly what the charges were and
who they were against.  Many of the statements were made with little or no
support other than admitted hearsay, and I'm sorry I didn't respond to the
message instead of the statements.  I will disregard those statements in the
remaining discussion.

It was not clear to me that there was a connection between the "model"
software engineering methodology and the file transfer problem -- I should
not have missed that.  Mr. Shebs's clarification of this point and expansion
to the comparison of meeting the letter of DoD acquisition requirements with
software quality should be seen as evidence for the immaturity of the SE
discipline everywhere -- not just for the defense industry.

It is the real lack of true engineering methods that leads to the acceptance
of "models" where the emphasis and attention is placed on the checking off
of boxes on a delivery schedule.  After all, the incentive structure is set
up to reward schedule adherence, so it is easy to predict that the efforts
will be devoted to getting all of the boxes checked at the right times.  How
can you fault the management for doing what they are paid to do?  Even a
well-educated management (in terms of technical realism) will have a hard
time making the right response under such reward conditions.  Many
non-defense contractors have found that offering the programmers extra
rewards for meeting schedules does not get better software, nor does it get
it on time, when the impossible is requested.

Mr. Shebs also states that the greatest risk might have been from things that
weren't anticipated during testing.  I suggest that waiting until testing is
too late.  The time to do the risk analysis is in the initial stages of
design, so that the testing procedures are only used to verify that the
implementation is faithful to the design.  It is because there is not the
clear separation of design from implementation in the software industry that
testing is misunderstood and misused.

The issue of problem reports and their unclear relationship to software
quality is also a result of the failure of traditional methodologies to
separate design from implementation, and from their nearly exclusive focus
on process.  The best process in the world can't be expected to do more than
produce economical, consistent copies of poor designs!  The problem will be
with us until we shift from process orientation to product orientation.

I would like to take issue with Mr. Loux's characterization of the problem as
being inherent in engineering.  True engineers do not respond to absurd or
impossible situations.  Engineering, as an applied science, has been the basis
for reasonable expectations, as engineers are averse to undertaking projects
of any kind without having a base of working models of prior solutions
(captured scientific knowledge) to adapt.  Without such models, there is
nothing on which to build a set of application techniques to apply the
knowledge, let alone a way to generate a rational set of tests and expectations
for the final product.

Which company, defense or otherwise, will be the first to refuse to bid on a
project requiring technology that it has never delivered, and which software
engineers will be the first to refuse to agree to deliver something for which
no models have been proven?
                                                  Rich
