The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 4 Issue 14

Wednesday, 19 November 1986

Contents

o Re: On placing the blame
Matt Bishop
o At last, a way to reduce [net]news traffic
Jerry Aguirre via Matthew P Wiener
o Safety-Critical Software in the UK
Appendix B of ACARD report
o Info on RISKS (comp.risks)

Re: On placing the blame (RISKS 4.13)

Matt Bishop <mab@riacs.edu>
Wed, 19 Nov 86 15:59:31
There's an old joke about computer scientists who build the most
advanced, intelligent computer ever.  As a test, they ask it "Is there
a God?"  It responds, "There is now!"

Sadly, a lot of people tend to think of computers as infallible.
(We've discussed this in Risks before, I think.)  Computer scientists
know better, and try to educate the public to this fact of life.
Peter Denning asks "Have some computer people become unduly eager to
accept the blame when there is a mishap in a system that contains a
computer?"   If the answer is yes, one cause may be an eagerness to
demonstrate to the public that the machines are not perfect.

Others once thought of the computer as infallible but have come to realize
it is only as good as the people who build it, program it, and feed it data.
When something (or someone, for that matter) once put on a pedestal falls
off, there is a very human tendency to be more harsh towards that thing than
something never put on a pedestal.  We may be seeing some of this in
"journalists [becoming] unduly accustomed to fingering the computer for
every mishap" (although I suspect it's not just journalists who do this!)

There's also a third tendency at work here -- it's a lot easier to blame
someone whom you don't have to look in the eye.  With a human, you would
have to (say, when you were firing him, when you were prosecuting him, or so
forth).  Also, a human can strike back verbally or nonverbally.  A computer
can do none of these things, and best of all you don't have to think about
its feelings when you chastise it.  Maybe that's part of it too.

Matt Bishop


At last, a way to reduce [net]news traffic

Matthew P Wiener <weemba@brahms.Berkeley.EDU>
Wed, 12 Nov 86 14:41:03 PST
Newsgroups: net.news
From: jerry@oliveb.UUCP <Jerry Aguirre, Olivetti ATC; Cupertino, Ca >
Date: 11 Nov 86 17:39:44 GMT

Most of you are probably aware that there was a premature posting of
newsgroup messages for all the proposed newsgroup renamings.  This caused
many (if not most) sites to exceed the maximum allowed number of newsgroups
in their active files.  Some sites are still recovering from this problem.

It is interesting to note that the volume of news articles for last week
was less than half what it was for the previous week.

    Oct 27 11:48 to Nov 1 23:58 6,755 articles
    Nov 1 23:59 to Nov 10 15:15 3,102 articles

The reduction in volume gives you some idea of the number of sites that
were blown off the air.

I know it took me a couple of hours to clean up old newsgroups,
recompile news with larger tables, and reprocess the failing batches.
(My news daemon renames and saves batches when rnews exits with an error
status.)  Multiply that times the number of sites on the net and you
probably get many thousands of manhours spent cleaning up.

Amazing to think how vulnerable the net is to the actions of one individual.

                    Jerry Aguirre, Olivetti ATC

   [And this was precisely the glitch that triggered the macro error that
    led to the saga prior to the real RISKS-4.7!  To add to the irony,
    MPW's message slipped through a crack last week while I was travelling.
    I just found it while cleaning up the RISKS mailbox!  PGN]
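
The two windows quoted in the posting are of unequal length (about five and
a half days against more than eight and a half), so the raw counts
understate the drop.  A minimal Python sketch that normalizes them to
articles per day follows.  It assumes the timestamps are 1986 times in a
single time zone, as printed; the variable and function names are
illustrative and not part of the original posting.

```python
from datetime import datetime

# Windows and article counts as quoted in the posting (year assumed: 1986).
WINDOWS = [
    (datetime(1986, 10, 27, 11, 48), datetime(1986, 11, 1, 23, 58), 6755),
    (datetime(1986, 11, 1, 23, 59), datetime(1986, 11, 10, 15, 15), 3102),
]


def articles_per_day(start, end, count):
    """Average number of articles per 24 hours within one window."""
    days = (end - start).total_seconds() / 86400.0
    return count / days


rates = [articles_per_day(*window) for window in WINDOWS]
raw_ratio = WINDOWS[1][2] / WINDOWS[0][2]   # ratio of raw article counts
rate_ratio = rates[1] / rates[0]            # ratio of per-day rates

if __name__ == "__main__":
    print("articles/day:", [round(r, 1) for r in rates])
    print("raw count ratio:", round(raw_ratio, 2))
    print("daily rate ratio:", round(rate_ratio, 2))
```

On these figures the raw count falls to about 46% of the earlier window,
while the daily rate falls to roughly 29%.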


Safety-Critical Software in the UK

Peter G. Neumann <Neumann@CSL.SRI.COM>
Wed 19 Nov 86 14:20:48-PST
             [John Rushby called to my attention a remarkable report on the
              British view of software in the future.  The entire report is
              fascinating reading, but in particular the following appendix
              is of sufficient interest to the RISKS community that it is
              reproduced here in its entirety for the private use of RISKS
              readers.  It represents an important step toward addressing the
              problems of developing safety-critical software.  PGN]


            ``Software: A Vital Key to UK Competitiveness''
  Cabinet Office: Advisory Council for Applied Research and Development (ACARD)
      London, Her Majesty's Stationery Office.  (C) Crown Copyright 1986

                 Appendix B: Safety-Critical Software

The problem: non-technical

B.1 No computer software failure has killed or injured a large number of
people.  It is just conceivable that such a tragedy could occur.  What steps
should be taken to:

* prevent such a disaster,

* cope with it when it does occur,

* ensure such a disaster, having happened once, cannot recur?


The problem: technical

B.2 Stored-program digital computers must be among the most reliable
mechanisms ever built by man.  Millions of computers throughout the
world are executing millions of instructions per second for millions
of seconds without a single error in any of the millions of bits from
which each computer is made.  In spite of this, nobody trusts a
computer; and this lack of faith is amply justified.

B.3 The fault lies not so much in the computer hardware as in the programs
which control it, programs full of the errors, oversights, inadequacies
and misunderstandings of the programmers who compose them.  There are some
large and widely used programs in which hundreds of new errors are
discovered each month; and even when these errors are corrected, the error
rate remains constant over several decades.  Indeed it is suspected that
each correction introduces on average more than one new error.  Other
estimates offer the dubious comfort that only a negligible proportion of all
the errors in these programs will ever be discovered.
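
The suspicion in B.3 can be illustrated with a very simple model.  The
sketch below is an assumption for illustration, not something the report
specifies: each correction removes exactly one error and introduces, on
average, a fixed number of new ones (the hypothetical parameter
`introduced_per_fix`).

```python
def latent_errors(initial, fixes, introduced_per_fix):
    """Expected number of latent errors after `fixes` corrections.

    Assumes each correction removes exactly one error and introduces
    `introduced_per_fix` new errors on average (a hypothetical parameter).
    """
    return initial + fixes * (introduced_per_fix - 1.0)


if __name__ == "__main__":
    # Starting population and number of fixes are made-up figures.
    for r in (0.5, 1.0, 1.2):
        print(r, latent_errors(1000, 500, r))
```

In this model the number of latent errors shrinks only when fewer than one
new error is introduced per correction; at exactly one it stays constant,
and above one it grows with every fix.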

B.4 New computers are beginning to be used in increasingly life-critical
applications, where the correction of errors on discovery is not an
acceptable option, for example industrial process control, nuclear reactors,
weapon systems, station-keeping of ships close to oil rigs, aero engines and
railway signalling.  The engineers in charge of these projects are
naturally worried about the correctness of the programs performing these
tasks, and they have suggested a number of expedients for tackling the
problem.  Many of these methods are of limited effectiveness because they
are based on false analogies rather than on a true appreciation of the
nature of computer programs and the activity of programming.

B.5 The steps which ACARD has been considering in answer to the
introductory question are discussed under the following headings:

* Disaster prevention

* Disaster management

* Disaster analysis


Disaster prevention

B.6 The initiative for disaster prevention must come from the UK government
and system customers.  Current software is built, operated and maintained
using methods and tools which are not keeping pace with the development of
the hardware, nor with the increased sophistication demanded by new
applications; nor does it take account of progress of research into the
reliability of programs.  The necessary improvements in software engineering
require investment in advanced development and production techniques,
education, training and legislation.  Legal obligations should be at least
as stringent as those imposed by the Data Protection Act, and the care and
time required for detailed drafting of legislation will be just as great.  A
start must be made immediately.

B.7 The remainder of this appendix outlines an imaginable solution that may
emerge over the next fifteen years.  It is intended to promote rather than
to pre-empt a discussion of the details.


Registration

B.8 A register must be established of those (software) systems which,
if they fail, will endanger lives or public safety.


Operation (demand side)

B.9 Before any organization can operate a life-critical computer
system it must first obtain a License To Operate (LTO), which will
only be issued when the operator can demonstrate that certain
conditions (detailed below) have been met.

B.10 Each life-critical system must be operated by a Certified Software
Engineer who is named as being personally responsible for the system.  This
Certified Software Engineer must have received the appropriate mathematical
training in safety-critical software engineering.

B.11 A life-critical system must be adequately maintained; this must
be one of the conditions of the LTO.  Maintenance (that is,
rectification and development) must be the responsibility of a named
Certified Software Engineer.


Certification

B.13 An LTO must only be granted when a Safety Certificate has been issued.
Certificates must be issued for limited periods, for example, five years.
Operational systems will thus need to be recertificated (relicensed)
periodically (analogous to Certificate of Airworthiness).
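
The licensing conditions in paras B.8 to B.13 can be read as a checklist.
The following Python sketch is one possible encoding, offered only as an
illustration: the class names, field names and messages are invented here,
and the report does not prescribe any data format or procedure.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SafetyCertificate:
    issued: date
    expires: date  # limited period, for example five years after issue (B.13)


@dataclass
class LifeCriticalSystem:
    name: str
    registered: bool                          # on the register (B.8)
    operating_engineer: Optional[str]         # named Certified Software Engineer (B.10)
    maintenance_engineer: Optional[str]       # named Certified Software Engineer (B.11)
    certificate: Optional[SafetyCertificate]  # Safety Certificate (B.13)


def lto_blockers(system: LifeCriticalSystem, today: date) -> List[str]:
    """Return reasons an LTO could not be granted; an empty list means none found."""
    blockers = []
    if not system.registered:
        blockers.append("system is not on the register")
    if not system.operating_engineer:
        blockers.append("no named engineer responsible for operation")
    if not system.maintenance_engineer:
        blockers.append("no named engineer responsible for maintenance")
    if system.certificate is None:
        blockers.append("no Safety Certificate issued")
    elif not (system.certificate.issued <= today <= system.certificate.expires):
        blockers.append("Safety Certificate is not currently valid")
    return blockers


if __name__ == "__main__":
    # All names and dates below are made up for the example.
    cert = SafetyCertificate(issued=date(1987, 1, 1), expires=date(1991, 12, 31))
    plant = LifeCriticalSystem("reactor-control", True, "A. Engineer", "B. Engineer", cert)
    print(lto_blockers(plant, date(1990, 6, 1)))
    print(lto_blockers(plant, date(1992, 6, 1)))
```

Returning the list of unmet conditions, rather than a bare yes or no, keeps
the reason for a refusal visible to the operator and to any later Enquiry.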


Reliability data collection

B.14 To aid research into system reliability, and to assist Boards of
Enquiry, all registered life-critical software systems must supply
operating data to the Licensing Authority.


Disaster management

B.15 In the past, the danger arising from failure of computer hardware and
software has been limited by switching off the computer and reverting to
manual operation if necessary.  In future, there will be applications for
which this fall-back procedure is not available.  The computers will have to
continue to run, and any necessary software changes and corrections will
have to be inserted into the incorrectly running system.  For these
applications, specially stringent precautions are necessary.


Procedures

B.16 The Licensing Authority should require disaster management procedures
to be laid down in advance of operation and practiced regularly during
operation (that is 'fire drill practice').  The documentation of the system
must meet a standard which would permit a team of experts/specialists to
master it during the progress of an emergency.


Data Logging

B.17 The disaster management procedures should include the logging of data
so that any subsequent Enquiry can ascertain the progress and cause of the
disaster (analogous to the 'black box recorder' in an aeroplane).


Emergency call-out

B.18 There must be more than one Certified Software Engineer available to
the operating company; and a duty rota should ensure that one of them is
always available at short notice.  Procedures must be set up for calling out
a team of expert specialists in a longer-lasting emergency.


Disaster analysis

B.19 During the normal (safe) operation of any life-critical system,
data on its performance and reliability must be made available to the
Licensing Authority.  This data will be made available to any Enquiry.
(This is additional to the data logging required in para B.14.)


Board of Enquiry

B.20 Any disaster should be the subject of an official Board of Enquiry
(similar to rail and air disaster enquiries).  A Board of Enquiry must have
the power to make changes to the system under investigation and/or the
methods, tools, products and staff associated with the certification
procedure.


Any error triggers Board of Enquiry

B.21 Any error, no matter how 'small', in a software system which has been
certified as being safe must be the subject of an Enquiry.  This is the only
way of discovering weaknesses in the certification process itself, or misuse
or misunderstanding of its application.  Enquiries concerning non-fatal
errors should not have disciplinary implications, so that operators are
encouraged always to give notification of minor faults.


Near Miss

B.22 Any serious 'near miss' must be reported to the Licensing
Authority.  An Enquiry should be held if the Licensing Authority is
concerned at the incident's implications.


Safety certification

B.23 The UK must develop the ability to certify safety aspects of
software system construction and operation.  These include:

* certification of the mathematical soundness of the methods of construction;

* certification that certified methods are properly applied during
construction and subsequent maintenance (rectification and
development);

* certification of the tools used during construction and maintenance;

* certification of the software engineers who build and maintain the systems;

* certification of the end product, that is, the software itself.


B.24 Methods should not be certified which are merely 'good practice'.
Safety and reliability require more rigorous theoretical bases than
existing good practice, so that system behavior can be accurately and
consistently predicted; hence the need for mathematical soundness to
enable prediction to be based on mathematical proof.

B.25 Certification of a tool will only be given when it is shown that
the tool preserves the mathematical soundness of the method it supports.

B.26 Certification of software engineers will only be given when they have
completed an approved level of formal mathematical and methodological
training together with an approved track record of experience.
Certification should be of limited duration; recertification should require
additional formal training, both of the refresher type and in new developments.
Recertification should occur at regular intervals.

B.27 Certification of end products (and their components) implies proof
obligations in addition to thorough testing.  Proofs must be performed and
checked by competent mathematicians or by a machine running certified
software.

B.28 As in other branches of engineering, the rigour of the inspection
procedures should be adjusted to the degree of risk, the severity of
the danger and the cost.  For example, we can imagine the emergence of
several levels of certification:

    a.  Disaster Level.  Failure could involve more than ten deaths.
The whole of the software must be checked by formal mathematical proof,
which is itself checked by a competent mathematician.  Further precautions
are required if damage limitation by switch-off is not feasible (para B.15).

    b.  Safety Level.  Where failure could cause one death, but further
danger can be averted by switch-off.  The whole of the software must be
constructed by proof-oriented methods, checked by a competent mathematician.
On occurrence of a fatality, the mandatory Enquiry must name the programmer
and mathematician responsible, who might be liable for criminal negligence.
Perhaps one error per 100,000 lines of code would be a realistic
expectation, so that most shorter programs will contain no errors.

    c.  High Quality Level.  Appropriate for software sold commercially,
where error could bring financial loss to the customer.  By law, such losses
should be reimbursed.  All programmers involved should be certified
competent in mathematical methods of software design and construction.
Their use of the methods is checked by sampling.  An acceptable error rate
would be one error per 10,000 lines of code delivered.  Each error corrected
requires recertification at Safety Level.  If the target error rate is
exceeded, certification is withdrawn.  Eventually, all software used to
construct other certified software should be certified to this level; and
the construction of 'disaster level' software should include independent
checks on the correct working of support software used (for example, check
of binary code against higher level source codes).

    d.  Normal Quality.  Corresponds roughly to the best of current
practice (say, one error per 1,000 lines of code).  The methods used to
construct software to higher levels of reliability may also be used to
achieve normal reliability; and this should bring a significant improvement
in programmer productivity and a reduction in the whole life cycle costs of
the programs they produce.
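
The error densities quoted for levels b to d can be turned into rough
expectations for a program of a given size.  The sketch below is
illustrative only: it assumes errors are spread independently and uniformly
through the code (a Poisson model, which the report does not state), and
the dictionary and function names are invented for this example.  No
density is quoted for the Disaster Level, so it is omitted.

```python
import math

# Error densities quoted in para B.28, expressed as errors per line of code.
ERRORS_PER_LINE = {
    "safety": 1 / 100_000,       # level b
    "high_quality": 1 / 10_000,  # level c
    "normal": 1 / 1_000,         # level d
}


def expected_errors(level, lines):
    """Expected number of errors in a program of `lines` lines."""
    return ERRORS_PER_LINE[level] * lines


def prob_error_free(level, lines):
    """Probability of zero errors, under the assumed Poisson model."""
    return math.exp(-expected_errors(level, lines))


if __name__ == "__main__":
    for level in ERRORS_PER_LINE:
        print(level,
              round(expected_errors(level, 10_000), 2),
              round(prob_error_free(level, 10_000), 3))
```

Under that assumption a 10,000-line program built to Safety Level would be
error-free with a probability of about 0.90, which is consistent with the
remark that most shorter programs will contain no errors.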
