The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 9 Issue 27

Thursday 21 September 1989

Contents

o Re: Brian Randell's commentary on safety analysis
Nancy Leveson
o Re: Risks of Distributed Systems
Charles Shub
o Re: Hospital problems due to software bug
Will Martin
o Mailer Bug moves to MCI?
Jerry Durand
o Loose wires, master clocks and satellites
Peter Jones
PGN
o Info on RISKS (comp.risks)

Re: Brian Randell's commentary on safety analysis

Nancy Leveson <nancy@ICS.UCI.EDU>
Wed, 20 Sep 89 17:54:03 -0700
In Risks 9.21, Brian Randell asks for the reactions to his comments from
people who have tried applying conventional risk assessment and management
techniques to systems involving "really complex software."  I have had
experience applying such techniques to many real systems involving software,
although I am not sure what Brian means by his qualification "really complex."
My arguments in this forum have always stressed the need to eliminate
unnecessary complexity in safety-critical systems, and I usually refuse to be
involved in the analysis of overly complex systems until they have been
simplified.  Brian writes:

  >Ideally, deployment of any potentially risky computer-based system will be
  >preceded by the sort of careful assessment of the risks involved that is
  >typical in a number of engineering disciplines. There are a number of
  >well-established techniques for such risk assessment, such as Failure Modes
  >and Effects Analysis, Event Tree Analysis, and Fault Tree Analysis. As I
  >understand it, all of these involve enumeration and consideration of
  >possibilities, and identification of dependencies, which are then
  >represented in some sort of graph structure, but none of the ensuing
  >analyses take account of the possibility that this graph structure will
  >be incorrect. To my mind, this makes these techniques of limited value
  >for systems employing computers running large suites of software.

Any type of analysis can be wrong, whether applied to hardware or software.
In some respects, safety analysis applied to software may actually be less
prone to error than that applied to hardware.  Software represents the
logical structure in itself and, therefore, the analysis is performed on
the actual thing that is being analyzed.  For systems involving hardware and
physical devices, abstract models must be formed first on which the analysis
is applied.  This extra step (building the logical representation of the
hardware) adds an additional possibility of introducing error.  In my
experience, safety analysis on software is no more error-prone than that
performed on complex hardware systems.  For example, I have found that it is
much easier to build correct software fault trees than system fault trees.

It has also been my experience that the system safety analyses and control
devices do take into account the possibility of errors in particular analyses
and failure of safety protective devices.

By the way, not all risk assessment techniques involve graph structures.
Of the three mentioned, FMEA records information in the form of tables and
does not include identification of dependencies.  Event Tree Analysis is
equivalent to reachability analysis in computer science models and does
represent reachable states in the form of a tree or graph structure.  Fault
Tree Analysis uses a tree in a very different way -- the tree is just a
convenient notation for representing a boolean expression that records the
relationship between states or events.  It could easily be replaced (and often
is) by formal logic expressions.
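
To illustrate the point that a fault tree is only notation for a boolean
expression, here is a minimal Python sketch. The three basic events and the
shape of the tree are hypothetical, chosen for illustration, and do not come
from any system discussed in this forum. The sketch assumes a tree built only
from AND and OR gates.

```python
from itertools import product

# Hypothetical basic events; none of these names come from the post.
EVENTS = ("sensor_stuck", "check_skipped", "valve_command_wrong")


def hazard(sensor_stuck: bool, check_skipped: bool, valve_command_wrong: bool) -> bool:
    """Top event of a tiny fault tree, written directly as a boolean expression.

    Tree form: TOP = OR( AND(sensor_stuck, check_skipped), valve_command_wrong )
    """
    return (sensor_stuck and check_skipped) or valve_command_wrong


def minimal_cut_sets():
    """Enumerate every event combination and keep the minimal ones causing TOP."""
    causing = []
    for values in product((False, True), repeat=len(EVENTS)):
        if hazard(*values):
            causing.append({e for e, v in zip(EVENTS, values) if v})
    # A cut set is minimal if no other causing set is a proper subset of it.
    return [s for s in causing if not any(other < s for other in causing)]
```

Calling minimal_cut_sets() on this tree yields two sets: the single event
valve_command_wrong, and the pair sensor_stuck with check_skipped. Exhaustive
enumeration is only practical for very small trees.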

  >In making this statement, I have three characteristics of such systems in
  >mind: (i) their great logical complexity, and hence the danger of their
  >harbouring potentially risky design faults, (ii) the largely discrete
  >nature of their behaviour, which means that concepts such as "stress",
  >"failure region", "safety factor", which are basic to conventional risk
  >management have little meaning,

As I understand it, safety factors are built into hardware systems because
of the possible inaccuracies of calculations on continuous models, the
limitations of these models in representing real systems, and the possibility
that the basic assumptions of the models are incorrect for the actual physical
systems that the models represent.

Because software can be analyzed using discrete mathematics and logic,
some of the need for safety factors in the analysis is eliminated.
It is, of course, always possible for the underlying assumptions (e.g., that
the computer hardware does not experience failures) of software analyses
to be violated.  Brian is correct that safety margins may not be useful to
protect against this.  However, run-time checks on basic assumptions (using
built-in test, assertions, TMR, etc.) can be used in computer systems to
provide the same function as safety margins in continuous systems.  Failure
regions and safety envelopes are easily applied to software. Furthermore,
many, if not most, of the safety devices used in conventional risk management
are applicable to discrete systems.
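
One of the run-time checks mentioned above, TMR (triple modular redundancy),
can be sketched in a few lines of Python. This is an illustrative majority
voter only; the function name and the choice to raise an exception when no
majority exists are assumptions, not a description of any particular system.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value agreed on by at least two of three redundant channels.

    Raises RuntimeError when all three disagree, so that the caller can move
    the system to a safe state instead of acting on an unverified value.
    """
    value, count = Counter((a, b, c)).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among redundant channels")
    return value
```

For example, tmr_vote(5, 5, 7) masks the single faulty channel and returns 5,
while tmr_vote(1, 2, 3) raises. The voter protects only against faults that
affect the channels independently; a design fault shared by all three copies
passes straight through.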

  >and (iii) the almost ethereal nature of software, which makes it much
  >more difficult to identify appropriate components and to understand their
  >interactions (both planned and accidental) than with many physical systems.

I don't know what is meant by the "ethereal" nature of software here.  Floyd,
Hoare, and others who have followed them have demonstrated that the formal
semantics of software can be specified and that software can be treated
as a mathematical object.  Admittedly, analysis on semantically complex
software is difficult.  This is why the argument has been made that the
semantic complexity of the software used in safety-critical systems must
be limited as long as our analysis techniques are limited.   However, I have
rarely found that this limitation is not possible, even when the system of
which the computer is a component appears quite complex at the system level.
(This goes back to my recent discussion with Dave Parnas in Risks).

I think it is a mistake to underestimate the complexity and difficulty of
identifying unplanned interactions in physical systems.  Perrow's book
on "Normal Accidents" argues that such interactions (stemming from complexity
and coupling) are inherently not possible to identify and control in complex
physical systems, and he provides many examples of accidents that have
occurred as a result.  Again, given that at least the software can be thought
of and analyzed as a model of itself, in the future we may find that it is
actually easier to identify and control such interactions in software than in
physical systems.

  >It is of course good practice to design software with a highly modular
  >structure, and to try to isolate critical parts of the software in as small
  >and simple a subsystem as possible. However with a really complex system such
  >structuring is again essentially ethereal and by no means simple. There will
  >therefore remain a significant and unquantifiable likelihood that such
  >modularisation and isolation is itself faulty.

Again, I do not understand the use of the word "ethereal" with respect to
software structure.  Of course, modularization and isolation may be faulty
in software.  It may also be faulty in hardware.  Engineers also use
analysis techniques, e.g., Sneak Circuit Analysis, to attempt to identify
unplanned interactions in electronic systems.  Unfortunately, this type of
fault is as unquantifiable for hardware as it is for software.

John Shore, in "The Sachertorte Algorithm," has argued that physical systems
often have simpler interfaces because of the inherent difficulty of building
complex interfaces in such physical systems versus the simplicity of making
complex interfaces in software.  But that does not mean that it is not
possible for software engineers to exercise discipline in order to control
and limit the complexity of the software interfaces to that which we can
analyze with some degree of confidence.  Modularization, information hiding,
and isolation applied to software are essentially the same procedures that
are used to control complexity in hardware.

  >I thus believe that the graph structure which purports to represent, at any
  >significant level of detail, the internals of (especially the software in) a
  >complex computing system, is likely to be wrong, in that it will not
  >represent any residual design faults properly, so that such faults will
  >remain an (unquantified) contributor to overall system risks.

I am not sure what Prof. Randell here means by a "graph structure" representing
the "internals" of software.  For example, a Software Fault Tree represents a
boolean expression of events that can lead to a hazardous condition.  It does
not represent residual design faults (although the process of building the tree
may identify some critical ones) or the structure of the software.

I agree that not all design faults can be identified with high confidence for
software.  But safety analysis does not require this.  In my experience in
using software fault tree analysis on real software systems, the most useful
aspect of the technique may lie in its ability to identify hazardous states
that need to be detected at runtime (perhaps by assertions or acceptance
tests) regardless of the actual design faults (or underlying computer hardware
faults) that caused them.
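
A minimal Python sketch of that idea follows: the hazardous state is tested
for directly on the controller's output, whatever fault produced it. The
heater example, the temperature limit, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    heater_on: bool


SAFE_COMMAND = Command(heater_on=False)
MAX_SAFE_TEMP_C = 80.0  # hypothetical limit, for illustration only


def is_hazardous(temp_c: float, cmd: Command) -> bool:
    """Hazardous state: heating commanded at or above the limit."""
    return cmd.heater_on and temp_c >= MAX_SAFE_TEMP_C


def guarded(controller, temp_c: float) -> Command:
    """Run the controller, then apply the acceptance test to its output."""
    try:
        cmd = controller(temp_c)
    except Exception:
        return SAFE_COMMAND  # a crash in the controller is treated like a hazard
    return SAFE_COMMAND if is_hazardous(temp_c, cmd) else cmd
```

The guard does not need to know why the controller misbehaved; it only needs
a correct statement of the hazardous state and a safe command to fall back on.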

I should add that most safety analysis techniques on physical systems are
not able to quantify the contribution of residual hardware design faults
on system risks either.  They basically quantify only risks based on physical
failures, not design errors.

  >The question of whether we will ever be able either to guarantee that a complex
  >computing system is entirely free of design faults, or alternatively able to
  >quantify the likely impact of residual design faults is moot.
  >The point I am trying to make here is that in present circumstances, I
  >see no current alternative to basing the assessment of the risks of the
  >overall system on a worst case scenario for the behaviour of the computing
  >system, based on the physical capabilities of its interfaces to the
  >outside world, rather than on mere hopes about its internal activities - and
  >to taking appropriate precautions externally to the computing system. Yet,
  >unfortunately, one sees these days the opposite trend: namely the increasing
  >use of computing systems in situations where there is little or no ability
  >of some surrounding system to mask the failures of the computing system.

I agree with this conclusion.  It is something I have preached for a long time,
and, in fact, it is exactly this worst case scenario analysis that is involved
in the application of safety assessment and management techniques to software
and to systems containing computers.  For example, it is what I was suggesting
in my discussion in Risks with Bev Littlewood about using a probability of "1"
for software failure in system fault trees.  Unfortunately, it is not
sufficient, and "taking appropriate precautions externally to the computing
system" often requires the same type of system level safety analysis including
the internal design of the software that Brian seems to be arguing above is
not possible.
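
The effect of assuming a probability of 1 for software failure can be shown
with a small calculation. The Python sketch below assumes independent events
and a hypothetical two-input AND gate (software fails AND an external
interlock fails); the numbers are invented for illustration.

```python
def and_gate(*probs: float) -> float:
    """Probability that all independent input events occur."""
    p = 1.0
    for x in probs:
        p *= x
    return p


def or_gate(*probs: float) -> float:
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for x in probs:
        none_occur *= 1.0 - x
    return 1.0 - none_occur


def hazard_probability(p_software_fails: float, p_interlock_fails: float) -> float:
    """Hypothetical top event: the software fails AND the interlock fails."""
    return and_gate(p_software_fails, p_interlock_fails)
```

With the worst-case assumption, hazard_probability(1.0, 1e-4) equals the
interlock's own failure probability, so the external protection carries the
whole quantitative argument. Any smaller figure for the software would lower
the result only on the strength of a number that cannot be justified.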

In particular, risk assessment and management techniques for physical systems
may be no less error prone than those applied to software.  The hardware backup
systems that Brian (and I) suggest may themselves fail or contain design
faults.  This does not mean that they should not be used; it merely means
that multiple "levels of defense" (a term common in the nuclear industry)
are necessary including both hardware and software safeguards and analysis.
We should not rely entirely on hardware safety devices.  Risk assessment
and management at the system level that excludes the behavior of the
software in the analysis and in the design of the software is incomplete
and thus potentially dangerous.

Furthermore, it is not always possible to design physical devices to mask
completely or with high confidence the failures and errors of the computing
subsystem.  One reason this is true is that the software is usually controlling
the other components of the overall system, and it is difficult to build
physical devices that are able to identify and mask control errors (versus
total failures) before a hazardous system state is reached.  Where such masking
is possible, it should obviously be done.  But what will we do about other very
desirable systems (e.g., medical systems that could in themselves save lives)
where this is not possible?


Re: Risks of Distributed Systems (RISKS-9.26)

Charles Shub <cdash@boulder.Colorado.EDU>
Wed, 20 Sep 89 12:17:56 MDT
>...  Using Ada gives a standardized mechanism for introducing issues such
>     as exceptions and (concurrent) tasks.

Ah yes, Ada* has a standardized mechanism for (concurrent) tasks, but Ada*
unfortunately does not have a good (IMHO) model for concurrent activities to
communicate. Their rendezvous is as bad as the "remote procedure call"
technology. We also need to discuss asynchronous IPC which is done at even
fewer places. This message is an example of asynchronous interprocess
communication. I can guarantee you that I'm doing other things until you
respond or acknowledge (or this message gets dropped in a bit bucket somewhere)
and neither RPC nor the rendezvous allows that. So please don't tout Ada* as
the cure-all for concurrency. We all know that
        "dod created ada and it was good."

* Ada (and nuclear annihilation for that matter) are trademarks of the
  Department of Defense.
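
The difference can be sketched in Python with a thread and a queue standing
in for two processes and a mailbox. This is an illustration of asynchronous
message passing in general, not of Ada or of any RPC system; all names are
hypothetical.

```python
import queue
import threading


def receiver(mailbox: queue.Queue, log: list) -> None:
    """Drain the mailbox until the None sentinel arrives."""
    while True:
        msg = mailbox.get()
        if msg is None:
            break
        log.append(f"received: {msg}")


def run_demo() -> list:
    mailbox: queue.Queue = queue.Queue()      # unbounded, so put() does not block
    log: list = []
    worker = threading.Thread(target=receiver, args=(mailbox, log))
    mailbox.put("hello")                      # asynchronous send: returns at once
    log.append("sender: doing other work")    # sender proceeds without a reply
    worker.start()                            # the receiver only starts now
    mailbox.put(None)                         # sentinel: tell the receiver to stop
    worker.join()
    return log
```

The sender finishes its send and carries on before the receiver even exists.
With a rendezvous or a blocking remote procedure call, the sender would
instead wait at the send until the other side accepted the call.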


Re: Hospital problems due to software bug (RISKS-9.26)

Will Martin <wmartin@STL-06SIMA.ARMY.MIL>
Wed, 20 Sep 89 13:05:16 CDT
>Due to a fault in the aging software, the machines were unable to accept as
>valid the date September 19, 1989, ...
>One computer specialist described the problem as a "birth defect", ...

Just what we need -- more jargon for the media to splash around. "Birth defect"
instead of just simple "error" or the traditional "bug".

Does anyone know if this was one of the built-in magic-number date breakdowns
that have previously been mentioned on RISKS? That is, the ones where the
system date/time is maintained in a field containing the number of seconds
since some arbitrary start-date in the past, and which will fill up and trip
back to 0 at some predictable future date (at which time all applications using
that OS or system will trash their date processing and mangle any data based on
dates... :-( ).

I was hoping that someone out there has kept track of and will post a note
listing those magic dates for various OS's and systems. It will be a useful
reference for all of us.
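
For any fixed-width counter, the rollover date can be computed rather than
looked up. The Python sketch below assumes a counter that starts at zero at a
known epoch and counts in fixed units. The first example is the familiar
signed 32-bit count of seconds from 1 January 1970 used on Unix systems. The
second is pure arithmetic on an assumed representation (a signed 16-bit count
of days from 1 January 1900); nothing in the report says that the hospital
software stored dates this way.

```python
from datetime import datetime, timedelta


def rollover_moment(epoch: datetime, bits: int, signed: bool, unit_seconds: int) -> datetime:
    """First moment that a fixed-width tick counter starting at `epoch` cannot hold."""
    max_ticks = 2 ** (bits - 1 if signed else bits) - 1
    return epoch + timedelta(seconds=(max_ticks + 1) * unit_seconds)


# Signed 32-bit seconds since 1970-01-01 (Unix time).
unix_32 = rollover_moment(datetime(1970, 1, 1), 32, True, 1)

# Assumed representation, for illustration: signed 16-bit days since 1900-01-01.
days_16 = rollover_moment(datetime(1900, 1, 1), 16, True, 86_400)

print(unix_32)  # 2038-01-19 03:14:08
print(days_16)  # 1989-09-19 00:00:00
```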

Regards, Will Martin


Mailer Bug moves to MCI

<JDurand@cup.portal.com>
Wed, 20-Sep-89 19:02:35 PDT
I received the following notice today from MCI; it sounds like your MAILER
bug (RISKS-9.22, 23) is contagious!
                             Jerry Durand, Durand Interstellar

Date:     Wed Sep 20, 1989  6:45 am  PDT
From:     FAX Help / MCI ID: 369-3746
TO:     * Durand Interstellar / MCI ID: 114-9128
Subject:  Multiple Fax Messages

Dear Customer,

   According to our records you sent a Fax Dispatch message Tuesday,
afternoon September 19, 1989.  Due to a temporary software bug in the
system, it is possible that MCI Mail attempted to deliver your message
numerous times. Therefore, you may receive several message confirmations
or cancellation notices.

   This problem, which only affected Fax Dispatch has been corrected.
MCI Mail has also taken the necessary steps to insure that you are not
billed for any extra messages.

   We regret any inconvenience this may have caused you, and will do our
utmost to avoid a recurrence of this situation.

Sincerely,

MCI Mail Customer Support


Loose wires, master clocks and satellites

Peter Jones <MAINT@UQAM.bitnet>
Thu, 21 Sep 89 10:57:17 EDT
Two articles in RISKS 9.26, namely "Man-Machine Failure at 1989 World Rowing
Championships" and "An Interesting Answer to the Distributed Time Problem",
reminded me of a certain performance of Beethoven's Ninth Symphony on December
12, 1988, in which I sang in the tenor section offstage in Montreal. The
National Film Board of Canada has just released a video documentary about
this event, entitled "Satellite Symphony --- One Woman's Dream".

The dream in question was to conduct an orchestra in Montreal, with choirs in
San Mateo, Geneva and Moscow. Unfortunately, as the links between these places
were via satellite, there were perceptible delays in satellite transmission,
making it difficult to get everyone in sync.

One solution tried was to send a cue to the remote choirs ahead of the conductor
in Montreal, so that their sound would come back in time with the orchestra.
This proved to be unworkable, for the delay required was in terms of
milliseconds, and not in terms of beats and fractions thereof. Also the
response times of the remote choir conductors and their singers were difficult to
assess, and the main conductor might not be at the same tempo from one
performance to another.

The solution that was finally adopted was to transmit a recording of the dress
rehearsal of the previous evening to the remote choirs, ahead of the live sound.
With the aid of an earphone, the conductor in Montreal would hear and follow
the same recording, with a delay to compensate for the two-way satellite
transmission of her cues and the choirs' singing in response. (I'm glossing over
the technicality that the number of satellite "hops", hence the delay, was different
for each choir.)
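
The timing arithmetic behind this scheme can be sketched as follows. This is
a simplified model, not a description of what the broadcast engineers
actually did: it assumes each choir sings exactly in time with the recording
it hears, it ignores human response time, and the delay figures in the usage
note are invented.

```python
def playback_offsets_ms(one_way_delay_ms: dict) -> dict:
    """Hold times (ms) so that every choir's sound reaches the hall together.

    Each choir's feed of the recording is held back by the difference between
    the slowest round trip and its own. The conductor's earphone copy is held
    back by the slowest round trip itself.
    """
    round_trip = {name: 2 * d for name, d in one_way_delay_ms.items()}
    slowest = max(round_trip.values())
    offsets = {name: slowest - rt for name, rt in round_trip.items()}
    offsets["conductor_earphone"] = slowest
    return offsets
```

With hypothetical one-way delays of 270 ms to San Mateo and Geneva and 540 ms
to Moscow, the function holds the San Mateo and Geneva feeds by 540 ms, sends
Moscow's at once, and delays the earphone copy by 1080 ms.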

So far, we have seen the master-clocking difficulties. What happened on concert
night? Yes, a proverbial loose wire. The conductor's earphone malfunctioned,
although the choirs heard the recorded sound by satellite! After a 45-second
wait, and a fruitless call for assistance (she couldn't get a reply on the
defective earphone without leaving the podium, or having someone come out on
stage), she decided to start anyway. Radio Canada technicians sent the live
sound to the remote choirs. The end result was that the choirs missed their
first entry. Fortunately, the Montreal Symphony's choir was on hand backstage
in Montreal as a backup, providing the illusion to the audience that the other
choirs had indeed come in. The rest of the choirs managed to come in further
on, more or less on time. The resulting sound was, as Spock of Star Trek would
say, "Fascinating".
                         Peter Jones     MAINT@UQAM     (514)-987-3542


Re: Loose wires, master clocks and satellites

"Peter G. Neumann" <neumann@csl.sri.com>
Thu, 21 Sep 1989 13:22:01 PDT
Following Hollywood hi-tech practice, and minimizing the real-time risks, next
time you might try videotaping the orchestral dress rehearsal in Montreal --
with the option of the local backstage chorus being heard only on ear-phones by
the conductor and players, and recorded on a separate audio track -- then
letting San Mateo, Geneva and Moscow dub their contributions independently
(while each watched the videotaped conductor and heard the dress rehearsal
orchestra audio), and finally mixing the whole thing together in a `live'
performance.  That way the conductor would not have to rely on ear-phones --
she would be mimicking herself on the tape monitor while the sound of taped
choruses and the prerecorded orchestra filled the hall.  The Montreal orchestra
folks could synch their playing (especially the wind players, who would indeed
be lip-synching) to the recording -- or simply fake it altogether!  With a
little computer processing you could even do real-time resynchronization to
compensate for recording machines that were slightly off speed.  However, the
whole thing seems somewhat silly because the conductor could not influence the
performance in progress, which was presumably her original intent.  And the
results had to be MURKY!  With the transmission delays, you cannot really
afford to let anyone hear another group anyway unless recorded; because the
time delay from `ictus' (low-point of the beat) to sound attack generally
varies with the tempo, and because it is very difficult to anticipate
adequately from auditory cues alone, the visual cues are much more important --
even if tape delayed.  Furthermore, if there is no live playing or singing
except maybe locally, you can afford to run everything tape delayed with
variable delays in the premix.  (I hope they weren't singing in English, French
and Russian!  But I am reminded of two high-school bands getting together
unrehearsed in 1949 for the Star Spangled Banner, with one playing in Bb, the
other in Ab.)

Perhaps I got parODied away.  It happens every now and then.  But this is
really a marvelous real-time synchronization problem, and the risks in pulling
it off are considerable.  On the other hand, if you lose one chorus completely,
probably nobody in the audience would ever know -- except that the Russian
accent in German might suddenly disappear.

Freud would have been amused with the aural fixation.

     ``Freude, Freude, got 'er, 'funken,     {as in RundFunk = broadcast}
       Sighed, `umschlungen, millionen'!''   {THAT's a lot of singers.}
                                             Owed to Joy.  Schiller.

     ``... achoired a certain measure of renown ...''       Tom Lehrer.
