The Risks Digest

The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 2 Issue 15

Tuesday, 25 Feb 1986

Contents

o Software Safety Survey
Nancy Leveson
o Titanic Effect
Nancy Leveson
o F-18 spin accident
Henry Spencer
o Space shuttle problems
Brad Davis
o Misdirected modems
Matt Bishop
o Info on RISKS (comp.risks)

Software Safety Survey

Nancy Leveson <nancy@ICSD.UCI.EDU>
22 Feb 86 12:20:49 PST (Sat)
I have been interested in safety for the past five years and have just
completed a long-term project to write a survey of software safety.  It
includes sections on whether there is a problem (probably not of doubt to
those who already read RISKS), why there is a problem, the implications for
software engineering research, the relationship of software safety to
software reliability and security research, a definition of software safety,
a brief survey of relevant aspects of system safety, and a description of
software safety techniques and research issues (requirements analysis,
verification and validation of safety, assessment, software design, and
human factors).  Although good software engineering techniques will
undoubtedly add to the safety of software, this is not a software
engineering survey -- emphasis is on needed additions and changes to current
software engineering techniques and research and on new procedures which
have special relevance to safety.  It is also a technical rather than a
political document (although a few ethical and political issues are
mentioned in the conclusions).

The paper is currently in the form of a technical report although it has
been submitted to Computing Surveys (chosen primarily because of the size of
the document -- about 60 pages).  I will be glad to send out a reasonable
number of copies in exchange for any comments which might help me to improve
it.  Comments on what you like and think is correct or helpful would be nice
along with the complaints.  If you would like a copy, send your regular mail
address (not e-mail) to me:  nancy@uci.edu (after March 4 my e-mail address
is changing to nancy@ics.uci.edu).

Nancy Leveson
ICS Dept.
University of California, Irvine


Titanic Effect

Nancy Leveson <nancy@ICSD.UCI.EDU>
22 Feb 86 12:21:30 PST (Sat)
In Peter Neumann's latest SEN column, he mentions the Titanic Effect without
an explanation of why it occurs.
     [Actually, JAN Lee introduced it unattributably in RISKS-1.21:
           The severity with which a system fails is directly proportional
           to the intensity of the designer's belief that it cannot.   PGN]

I would like to suggest a hypothesis.  The Titanic effect is essentially the
statement that the worst accidents often occur in systems which are thought
to be completely safe.  The Titanic was thought to be so safe that normal
safety procedures, such as having an adequate number of lifeboats, were
neglected with the result that many more lives were lost than might have
been necessary.

The lesson is an important one because it goes back to the problems of using
quantitative assessment techniques.  Quantitative risk assessment can
provide insight and understanding and allow comparison of alternatives.
Probabilistic approaches have merit in that the necessity to calculate very
low probability numbers forces on the analyst a discipline that requires
studying the system in great detail.  But there is also the danger of
placing implicit belief in the accuracy of a calculated number.  That is, it
is easy to place too much emphasis on the models and forget the many
assumptions which are implied.  The models can also never capture all the
factors, such as quality of life, that are important in a problem.  (see
Morgan -- Probing the Question of Technology-Induced Risk, IEEE Spectrum,
Nov. 1981).

Getting back to the Titanic, certain assumptions were made in the analysis
that did not hold in practice.  For example, the ship was built to stay
afloat if four or less of the sixteen water-tight compartments (spaces below
the waterline) were flooded.  Previously, there had never been an incident
where more than four compartments of a ship were damaged so this assumption
was considered reasonable.  Unfortunately, the iceberg ruptured five spaces.
It can be argued that the assumptions were the best possible given the state
of knowledge at that time.  The mistake was in placing too much faith in the
assumptions and the models and not taking measures in case they were
incorrect (like the added cost of putting on-board an adequate number of
lifeboats).  The Titanic is not an isolated example.  Safety devices (such
as sensors in solid-rocket boosters or software safety analysis and design
procedures) cost -- usually in terms of dollars and performance.  There are
often attempts to get around them by using models which show that the
hazards are so negligible that the cost is unjustified.  On the other hand,
too much safety can make the system unusable or unprofitable which is not
the answer either.

The Titanic provides an important lesson for us involved in building
safety-critical computer systems.  Our current models for software
reliability make a large number of assumptions which may be unrealistic or
just not hold for particular systems.  This is true also, to a lesser
extent, with the hardware reliability models.  Much effort is frequently
diverted to proving theoretically that a system meets a stipulated level of
risk when the effort could much more profitably be applied to eliminating,
minimizing, and controlling hazards.  I have seen a great deal of effort
spent on trying to prove that a system which contains software has two or
three more "9's" after the decimal point in the reliability models when the
effort and resources might have been more effective if applied to using
sophisticated software engineering and software safety techniques.

Nancy Leveson
ICS Dept.
University of California, Irvine


F-18 spin accident

<ihnp4!utzoo!henry@seismo.CSS.GOV>
Tue, 25 Feb 86 02:44:45 EST
Recent reading of a book on the F-18 turned up a couple of details on the
spin accident that might be of interest; I don't think these were part of
the original reports.

(For those who haven't heard about this before...  The US Navy's F-18
fighter is heavily computerized, including "fly by wire" controls in which
the computers always mediate between the pilot's controls and the aircraft.
One thing the software does is to limit control-surface movements to within
safe ranges, so the pilot cannot accidentally break the aircraft in combat
maneuvering.  The 12th prototype was lost when it got into a peculiar type
of spin and the software did not give the pilot enough control authority
for recovery.  The pilot ejected safely.  After investigation, the software
was modified.)

Detail number one has something to say about the problems of exhaustive
testing:  even after the problem was known to exist, it took 110 attempts
to duplicate the spin!

Detail number two is that the spin was *not* unrecoverable.  The spin-test
F-18 was equipped with an anti-spin chute just in case, but the pilot who
first duplicated the spin discovered that he could recover without the chute
by setting the outer-side engine to flight idle and the inner-side engine to
full afterburner.  The original pilot could have done this, had he thought
of it.  This might strengthen the case for giving flight-control software
more authority, so that such unorthodox substitutes for ineffective or damaged
controls could be employed automatically.

Reference:  Bill Gunston, "F/A-18 Hornet", Ian Allan 1985, p. 43.  Gunston
does comment that apart from this one strange spin mode, the F-18 is probably
the most spin-proof fighter in existence.

                Henry Spencer @ U of Toronto Zoology
                {allegra,ihnp4,linus,decvax}!utzoo!henry


Re: Space shuttle problems (a comment on risks in general)

Brad Davis <b-davis@utah-cs.arpa>
Mon, 24 Feb 86 19:53:39 MST
If the current speculation about the shuttle is true (that seals on the
solid fuel rockets failed because of the cold and that the Thiokol
engineers asked for a delay because of their worries) then we should
look more at the humans in any decision chain.  Most of the major
failures that I can recall were due to humans overriding the expert
system (whether it was a electronic, mechanical, or human expert
system) or just messing things up in the first place (like the UPL
power generator that was connected to the power grid backwards, made
a real big electric motor for a short time).

Brad Davis  {ihnp4, decvax, seismo}!utah-cs!b-davis
        b-davis@utah-cs.ARPA
One drunk driver can ruin your whole day.


Re: Misdirected modems

Matt Bishop <mab@riacs.ARPA>
24 Feb 1986 2221-PST (Monday)
Reminds me of something I read at 7 SOSP.  Someone got the bright idea
of collecting computer horror stories (an excellent idea, by the way!)
and one of them involved one computer calling another.  The connection
suddenly quit working after about a year.  The system people got
curious and hooked an audio device to the telephone line.  They then
told the computer to contact its counterpart.  They heard the computer
dial, the phone at the other end go off hook, and the computer send its
whistling tones indicating it had something to say.  From the other end
came the words, "Martha, it's that nut with the whistle again."
Problem solved.

Matt

     [Thanks for the anonymous plug.  It was I who anthologized the
      anecdotes for 7 SOSP, and they all appeared in ACM Software
      Engineering Notes Vol 5 no 1 (January 1980) -- as noted at the
      very bottom of my disaster list in RISKS-2.1!  PGN

Please report problems with the web pages to the maintainer