The Risks Digest

The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 10 Issue 40

Tuesday 18 September 1990

Contents

o Software Unreliability and Social Vulnerability
Perry Morrison
o DTI/SERC Safety Critical Systems Research Programme
Brian Randell
o Canadian Transportation Accident Investigation
Brian Fultz
o Risk of Collision
Brian Fultz
o Knight reference: `Shapes of bugs'
Pete Mellor
o Info on RISKS (comp.risks)

Software Unreliability and Social Vulnerability (Re: RISKS-10.31)

Perry Morrison MATH <pmorriso@gara.une.oz.au>
15 Sep 90 07:30:18 GMT
Many thanks to all those who have contributed their views on issues raised
in my article with Tom Forester -"Software Unreliability and Social
Vulnerability", Futures, June 1990. I don't have time or space to respond
in detail to all of the comments made, but I would like to pick up on some of
the main threads while they are still salient for most risks readers.

First, I was never completely happy with the blanket conclusion of removing
digital systems from life-critical applications, but it has served a purpose
in bringing software unreliability to the attention of the general public in
a fashion that has more rigour than the tabloids would provide (I hope!).
In the process, it has brought what I believe are increasingly important
issues back to the arena of software/systems specialists.

Although the paper restricts itself to software/hardware failures, it is
highly influenced by Charles Perrow and the "Normal Accidents" argument of
complex systems. As others have already stated, the central issue is
really system complexity -- the limits it places upon our capacity to predict
system behaviour. Unfortunately, digital systems have two other properties
that exacerbate their unreliability: (a) it is easier to build much more
complex digital systems than analogue ones (or at least we have so far) and
(b) digital representation and computation provides more scope for
catastrophic/unpredictable behaviour than analogue techniques -- essentially
the "untestable number of system states" and "each state is a possible
catastrophic discontinuity" arguments that we have (hopefully correctly)
borrowed from Dave Parnas and others.

Our motivation stemmed from our concern that we (others, I mean) have
already built systems of such complexity that we no longer understand or
can predict in any adequate sense, their full range of possible behaviours.
This concern is exacerbated by the tendency of complex systems to also control
complex responsibilities or awesome energies and involve huge monetary
investment.  Such systems -- things like the Shuttle, some particle
accelerator/conglomerates and others exhibit very high levels of unreliability.
Note however that it is not only the digital components of
such "hybrid" systems that are unreliable, but their analogue components as
well -- valves etc. Again, it is enormous complexity and our inability to get
adequate intellectual handles on it that is the root problem.

However, for better or worse, society is steaming down the road toward
greater application of complex digital techniques... to everything -- financial
systems, control of basic services, defence, communications.... In our
view, this will lead to more complex digital systems with concomitant
levels of unreliability and sometimes quite disastrous results. Our major
purpose was to point out that in creating such powerful systems we cannot
have our cake and eat it too -- we should not be surprised that when they
fail (as they occasionally will being human created artifacts) the
consequences will be catastrophic. The unreliability of complex systems
(which, as outlined above tend to be digital) and the power/energies bound
up within them should lead us to expect such consequences. Whether this
scenario is any better than other alternatives (whatever they are) is a value
judgement of some complexity itself. It is a value judgement because it
relies upon subjective assessment of what constitutes an acceptable level
of risk for a given system and the responsibilities it is tasked with.

Note also, that we tend to ignore the notion that there is more than one
legitimate assessor of risk -- not just the developer or client, but the
users and those that the system affects in direct or indirect ways. This
is essentially the PR problem that nuclear reactors and fly-by wire systems
have -- perhaps the psychological processes of individuals are often awry,
but they clearly assess the risks of FBW (and the owners/operators merely
mirror the commercial consequences of this concern) as worrying. In
this sense, successful design is as much an educational/consultative
process as abstract discussions about reliability.

I need to take some time to address the issue of risk in a way that
I don't believe any commentator has to this point.

1. It is important to note that most commentators in this discussion have
assumed that most system design/implementation happens in a relatively
controlled, client-expert relationship where the expert has a fair
degree of control over the qualities that the system will have. Often
this is not the case -- increasingly, subsystems are being designed and
plugged into ad-hoc mega-systems (like the international financial system
communications networks) thereby adding to overall complexity without
the advantage of any over-riding design to guide the development of the
subsystem (apart from protocols say). I believe that this is a dangerous
trend and it is a characteristic that many communications nets and other
systems have -- they just grow like topsy and "design" if it can be called
that is blatantly ad-hoc.  Clearly, even conventional systems are added to and
expanded throughout their useful lifetime, but generally some design documents
are available and eventual limits to the system can be reliably assessed.
Assessment of risks in the former situation and judgements of required
levels of reliability can become very difficult indeed.

2. The application or nonapplication of digital techniques and the subissue
of whether analogue systems are better in some applications need to
be considered against some systemic cost/benefit analysis or analysis
of risk (how appropriate!). Phil Maker of the Northern Territory Uni called and
pointed out to me that he designed defibrillators (things that restablish
the normal electrical activity of the heart I believe) and that the
chances of such people dying without a defibrillator is 50% in any given year.

Clearly, software unreliability is an insignificant risk in that situation --
small numbers of people with high existing risk and enormous potential
benefits -- using systems of comparatively low complexity. On the other hand,
a complex digital system controlling a nuclear reactor may (possibly) provide
efficiency benefits that are insignificant compared to the consequences of a
nuclear accident stemming from software unreliability . i.e. a situation in
which marginal benefits and high complexity exist, coupled with catastrophic
consequences in the (perhaps low probability) event of systemic complexity
bringing about an accident.

**Perhaps** a less complex analogue system could be applied in this case, if
at all possible. Note however that reduction of complexity in a situation with
high background risk and/or potential catastrophic consequences (subjective I
know) is what I would (now) advocate.  (Note that the example of a computer
based intensive care unit has all the qualities of a good application
of complex systems -- high existing risk, good potential benefits, localised
failure consequences). It just so happens (I believe) that analogue systems
by their nature, limit how complex a system can become. On this point, shortly
after the article was published, we received a call from someone in
authority at the Shell oil company. They mentioned that they were
investigating the possibility of going back to analogue systems for their
drilling rigs because of their concerns relating to software unreliability.

3. In answer to the implicit question -- "what else can we do but apply digital
systems to life-critical (or potentially catastrophic) situations" one should
be clear that many "needs" or areas of application are only driven by the
possibilities that computers provide. They are often not genuine,
pre-existing needs. For example, my father hardly knew what cheques were
for until later in life. Only rich people had a need for such things
and banks wouldn't wear the manual processing costs for large numbers of
cheque accounts. Computers eventually allowed that to happen but I'm not
sure that there was a screaming demand for cheque books at the time. Yes it's
convenient and yes the whole electronic funds basis of our economies
allows us to do wonderful things with enormous ease. At the same
time it has created a system that is subject to massive, international abuse,
theft, fraud and (perhaps) great risk of rapid systemic collapse given a
sufficiently powerful event. Given these benefits and arguably large
risks, is the electronic financial system a good idea? The major point
however is that NO-ONE ever had the chance to evaluate the risks involved
because the system grew incrementally without any central design base.
It just happened.

Some of us may ask -- "how else could the international financial system be
controlled?" Clearly, it can't be controlled by anything but digital
techniques, since they spawned it in the first place, but the point is we
know have a whole new class of problems to deal with due to the develoment
of the system. This may all sound like old hat and recycled luddism, but
as our article pointed out, there's no substitute for technologists being
concerned about the implications of the systems they design and implement.
For example, I imagine that antarctic mining would create some really
interesting temperature/electromagnetic reliability problems for software
and hardware designers. But for my money, antarctic mining is irresponsible
with large potential for dramatic environmental damage. Obviously, we
shouldn't restrict ourselves to the mere technical problems involved in our
work. It might sound quaint, but too many systems are simply technology
driven rather than need driven. Stop the clock, go back to the caves you
ask? No, but we shouldn't complain when complexity makes unreliability a
problem for implementors and an issue for the general public. If we
actually EXPLAINED to the public the design contraints we face and INVOLVE
them in the design, they may come to accept levels of unreliability that
we think they won't tolerate. On the other hand they might throw up their
hands and say "what the hell do think you're doing". Either way it would
be very interesting.

4. Where does this leave us? Merely with the ideology that lots of technology
commentators have peddled for a long time -- the notion that technologists
must properly evaluate the implications of what they do and actively
involve in the design process the people they affect. This includes amongst
other things a consensus of opinion on what is a sufficient level of
reliability given the array of risks. Too often we want prefabricated
solutions/methodologies like "what are the best methods for providing
an error minimized system at minimal cost for applications of type x".
That is the engineer talking to his colleagues and designing for an
abstract user group or (real) corporate owner. It is meaningless unless the
designer/implementor understands that such prescriptions must be modified
in the face of existing circumstances. Those circumstances include what the
users and others affected by the system think.

In this same vein, some commentators seem to believe that we are hostile to
probabilistic measures of software reliablity/quality. This is not so -- just
that probabilities by themselves are not enough. They are meaningless without
some understanding of the reliability/quality required in the intended
application by the intended users.  That is, as already explained, low levels
of reliability may be an unacceptable source of risk in some situations
(reactors?) but not in others (defibrillators) even when the software has
equivalent probability of failure. How do you know how reliable the software
must be? How do we know if digital systems are best or analogue? The answer
is that as a generalization, law or principle we probably don't. But in the
end we should remember that we are designing for people and if they are
included and fully informed and satisfied with the
safety/reliability/performance provisions in a design, then that should be
enough.

As always, the customer is always right.

Perry Morrison


DTI/SERC Safety Critical Systems Research Programme

Brian Randell &rian.Randell@newcastle.ac.uk>
Mon, 17 Sep 90 9:38:19 BST
The following article is reprinted in its entirety from the British Computer
Society's Computer Bulletin (vol. 2, part 7, Sept. 1990, p.2). It marks the
launching of a new joint initiative by the Department of Trade and Industry and
the Science and Engineering Research Council, to fund industry/university
collaborative research projects relating to safety-critical systems. I
personally very much like the tone of the article - and hope that it can be
matched by the research projects that get selected.  Brian Randell

  Brian Randell, Computing Laboratory, University of Newcastle upon Tyne, UK
  EMAIL =   Brian.Randell@newcastle.ac.uk
  PHONE =   +44 91 222 7923    FAX = +44 91 222 8232

: Safety Critical Systems - Bob Malcolm
:
: On the stairs and landing of my cottage I have four light-bulbs. A few
: weeks ago the landing went dark. No, it was not a blown fuse, but the
: fourth light bulb gone. As technical co-ordinator for the DTI-SERC
: Safety Critical Systems Research Programme I smiled - wanly, as they
: say.
:
: I had failed to replace the three bulbs which had failed previously. I
: had fallen prey to one of the classic safety critical system problems
: - failure to maintain redundant channels. Why did I do it? Why do
: people store rubbish in fire-escapes? It is no good saying that 'They
: shouldn't do it'. They do. What are designers to do about it?
:
: The challenge in building safety critical systems is to achieve safety
: in the real world. We may not understand the world and may not be able
: to describe it properly, populated as it is by real and really
: perverse people - designers and managers as well as users and vandals.
:
: It is not enough to think only of 'meeting a specification',
: especially when the majority of failures arise because of
: specification errors. Nor is it sufficient to add something about
: getting the specification right, since we can never be sure that we
: have succeeded.
:
: One view of safety critical systems is that all failures have their
: origins in human error of some kind - whether a design or operational
: mistake, a failure to understand either a system or its failure modes,
: or a failure at a higher level to appreciate and accommodate the
: likelyhood of these errors.
:
: To produce safety arguments for systems based on reasoning about a
: wide range of technological and human factors requires not only that
: we compare chalk and cheese, but that we add it together in some way.
: We will need to balance arguments about the style of design with
: arguments about its correct construction; arguments about proof with
: arguments about likelihood; arguments about comprehension with
: arguments about rigorous notation. Who knows - maybe if we succeed in
: general then we will succeed in the particular of reconciling
: formalists and 'the rest'. To do that we would also need to make
: explicit the rigour which is presently only implicit in so-called
: 'informal' methods and tools.
:
: I doubt whether we can do all these things without the help of
: psychologists, operational researchers, even philosophers. Computer
: technologists and experts in application domains must work with them
: to devise better global solutions and, as far as is possible and
: sensible, common solutions. We must avoid the 'displacement' problem
: where fixing one problem simply creates a different and perhaps worse
: problem elsewhere. And we must find ways of quantifying the
: effectiveness of what we do - whether as a contribution to a 'safety
: argument' or to a probabilistic risk prediction.
:
: But we need the cooperation of all these different disciplines not
: just for the particular skills which they individually bring to the
: party. We need them because of their very diversity - to enrich our
: intellectual stock.
:
: In the new DTI-SERC programme we must produce results which will be of
: commercial value in the medium term. To do this we must have a firmer,
: better-reasoned, basis for what we do. We might stand a chance if we
: are prepared to open our minds.
:
: Hard scientists must stop treating soft science as non-science: soft
: scientists must not feel disenfranchised. Formalists must seek the
: value of what they call informal, and help in its formalisation;
: informalists must find the justification for what they do.
:
: Safety critical systems pose problems which push us toward
: co-operation between disciplines. I hope that industry and academia
: seize the opportunity to do this within the new programme.
:
: [Bob Malcolm is an independent consultant in the field of systems and
: software engineering.  He currently holds a number of contracts for
: consultancy in research and technology transfer, and is technical
: co-ordinator for the new DTI-SERC Safety Critical Systems Research
: Programme. He is a visiting professor at City University, London.   BR]


Canadian Transportation Accident Investigation

&rian_Fultz@carleton.ca>
Sat, 15 Sep 90 07:31:35 EDT
The above body in Canada is responsibility for making reports on "accidents"
and "aviation occurrences".  They produce reports on safety related problems.
With reference to report 87-A74947 "Risk of collision between Delta Air Lines
Lockheed L-1011 N1739D and Contential Air Lines Boeing 747-200 N608PE
<location> North Atlantic 52 degrees 14 north, 34 degrees, 00 West Long &ime>
08 July 1987.

As the report is over 4 pages long in reduced for I will give only the parts I
think are risk-relevant.  At 30 degrees West Longitude Delta 37 began deviating
to the south of it's assigned track and closed laterally with Contential untill
the aircraft crossed the assigned track of the Contential ...  The front half
of Deltaa 37 passed beneath the rear fueselage of the Contential 25 with less
than 100 feet of vertical separation.  Neither crew saw the other aircraft in
time to take evasive action. ....  The Captain used the Flight Management
System to ...  Passed possition report at 30 degrees west < wrong by 16 minutes
> ...  Gross navigaional error < deff'n more than 25 miles off > Conclusions <
I am going to use the numbering of CTAT > 1) The near collision resulted from a
INS < interial navigation system> data input error by the Deltal Flight 37
crew.  9) Although cross checks were contained in explicit detail in the manual
used by the crew, they were set out in such terms and format that a manditory
requirement to cross check at each and every waypoint may not have been clearly
conveyed to aircrew.  13)The estimate for Delta 37 at 40 west showing a
difference of 16 minutes was not challenged by ATC nor was it required to be...
19 no evidence was found that would indicate the Delta 37 INS was
malfunctioning prior to or during the flight.  ....lots of recommendations to
make pilots make cross checks...
                                              Brian Fultz


Risk of Collision

&rian_Fultz@carleton.ca>
Sat, 15 Sep 90 08:12:31 EDT
Between Boeing 767 C-GPWA and Boeing 727 C-GAAY < location>
Toronto < time > 18 February 1987

The board writes .. < I have taken out aviation specific terms > The Boeing 727
and the 767 were on identical tracks to Toronto.  The 727 was at 29,000 feet
the 767 directly overhead at 31,000 feet....  ATC cleared the 767 to 11,000
feet ...  ATC 3 attempts to stop 767 ... not understood ... eventually halted
1/4 mile EAST of 727 ...  evasive vectors issued by ATC ... evasive
instructions turned aircraft toward each other

In the Pertinent information section:
Radar recordings indicated that the flight data blocks for the aircraft
overlapped on the controllers radar screen.  The system then repositioned the
display: the 767 to the WEST and the 727 to the EAST < see above > ....  The
repositioned data block display was opposite to the actual relative positions
of the two aircraft.
                                             Brian Fultz


Knight reference: 'Shapes of bugs'

Pete Mellor <pm@cs.city.ac.uk>
Mon, 17 Sep 90 21:34:49 PDT
In a RISKS-10.31, I wrote:

> A serious attempt *has* been made (by John Knight - reference not to hand)
> to examine the shapes of bugs in programs, i.e. the topological properties
> of those subsets of the input space which activate program faults.
> Chaotists will be pleased to learn that they were fractal.

Several people have asked for the full reference. It is:-

Paul E. Amman, John C. Knight: 'Data diversity: an approach to fault tolerance'
     Proc. FTCS, 1987, pp 122-126,
     0731-3071/87/0000/0122$01.00 (c) 1987 IEEE

The paper concerns the use of `data diversity' as a means of improving program
reliability, as opposed to *design diversity*. The idea is that, since the
activation of a software fault depends on a particular selection of input data,
it may be possible to perturb the inputs in such a way that the output is still
valid in all cases, and yet not all of the perturbed versions of the input
activate the fault. A voter can then process the various outputs in a similar
fashion to N-version programming.

The ability to perturb inputs in a valid way depends upon the shape and size of
the `failure region'. To quote:

  The `failure domain' of a program is the set set of input points which
  cause program failure [3]. The geometry of a failure domain, which we call a
  `failure region', describes the distribution of points in the failure
  domain and determines the effectiveness of data diversity. The
  fault-tolerance of a system employing data diversity depends upon the
  ability of the re-expression algorithm to produce data points that lie
  outside of a failure region, given an initial data point that lies within
  a failure region.
  ...[stuff omitted]
  We have obtained two-dimensional cross sections of several failure
  regions for faults in programs used in a previous experiment [5].
  ...[stuff omitted]
  The cross sections shown are typical for these programs. This small sample
  illustrates two important points. First, at the resolution used in scanning,
  [The inputs are simulated data from radar scanners tracking an object. - PM]
  these particular failure regions are locally continuous. Second, since the
  failure regions vary greatly in size [by a factor of 4 x 10~10 ! - PM],
  exiting them varies greatly in difficulty.

[End quote]

You will see that the authors do not, in fact, claim that the failure regions
described in this paper are fractal, so apologies if my statement misled anyone.
However, they cross-refer another paper which describes these faults:

S.S. Brilliant, J.C. Knight, N.G. Leveson: `Analysis of faults in an N-version
software experiment', University of Virginia Technical Report No. TR-86-20,
September 1986.

The two papers cross-referred in the above extract are:

[3] F. Cristian: `Exception Handling', in `Resilient Computing Systems',
    volume 2, T. Anderson, ed., John Wiley & Sons, New York (to appear).

[5] J.C. Knight, N.G. Leveson: `An experimental evaluation of the assumption
    of independence in multiversion programming', IEEE Transactions on
    Software Engineering, Vol. SE-12, No. 1, January 1986, pp 96-109

BTW my mail system has been playing up. If anyone has mailed me, e.g., to
request a copy of the Forester and Morrison paper, and not received a reply,
please retransmit.

Peter Mellor, Centre for Software Reliability, City University, Northampton
Sq., London EC1V 0HB +44(0)71-253-4399 Ext. 4162/3/1 p.mellor@uk.ac.city (JANET)

Please report problems with the web pages to the maintainer

Top