The Risks Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 14 Issue 48

Wednesday 7 April 1993

Contents

o Re: Shuttle Failure Blamed On Computer Glitch
Kriss A. Hougland
o Safety-Critical Software, special issue of IEEE Software
John Knight
o London Ambulance Service Inquiry Report
Brian Randell
o Info on RISKS (comp.risks)

Re: Shuttle Failure Blamed On Computer Glitch (RISKS-14.47)

"Kriss A. Hougland" <hougland@enuxhb.eas.asu.edu>
Wed, 7 Apr 1993 11:35:21 -0700
From all the information on the shuttle delay, the situation seems to be:
A faulty sensor or broken wire that monitors the status of a valve.

So far, I have heard that the problem is still a computer glitch.  This is
not correct.  The software performed as required.  The solution to the
problem is:

1) find and fix the problem  -- I would speculate a very $$$ option

2) update the software to override the situation -- quick and easy, but
    very risky if the problem is the valve.

It looks like people are fixing hardware problems in software again.  There is
a classic risk in overriding hardware problems with software: the override may
not be done correctly, or it may cause a nasty side effect elsewhere in the
program (oops -- I was using that variable to turn on the engines!)

I hope at NASA, they are willing to assume the risk of correcting hardware
problems in software.  (NASA does have some good brains so I think they are
taking a very educated guess from the telemetry.)  I would hate to see another
shuttle go up in flames (sorry about the pun).


Safety-Critical Software, special issue of IEEE Software

<jck@neptune.cs.virginia.edu>
Wed, 7 Apr 93 16:20:35 EDT
CALL FOR ARTICLES                              IEEE SOFTWARE
                    SAFETY-CRITICAL SOFTWARE

     A forthcoming special issue of IEEE Software will focus on
safety-critical software development.  The theme of the special issue is to
document recent achievements and current challenges in both research and
application of safety-critical software technology.  Papers are solicited that
report recent research results, both theoretical and experimental.  Similarly,
papers are solicited that document the best current practices, experience with
these practices, and the major outstanding problems that the applications
community sees.
     Original articles are sought on relevant topics including (but not
limited to):

o Experience in safety-critical applications development in areas such as
  avionics, nuclear power systems, and medical devices.
o Results of experiments in any area related to safety-critical software
  development.
o Significant challenge areas whose definition and motivation arise from
  practical experience.
o Development methods, processes, and standards designed for safety-critical
  software.
o Specification and verification techniques.
o Dependability assessment and modelling.
o Tools and environments supporting safety-critical software development.

Submitted papers must not have been previously published nor be under
consideration for publication elsewhere.  To be considered for the special
issue, please send eight copies of the complete manuscript to either of the
guest editors:

   John C. Knight                  Bev Littlewood
   Department of Computer Science  Center for Software Reliability
   University of Virginia          The City University
   Thornton Hall                   Northampton Square
   Charlottesville                 London, EC1V 0HB
   VA 22903, USA                   UK
   (knight@virginia.edu)           (b.littlewood@city.ac.uk)

   Submission deadline is June 15 for IEEE SOFTWARE


London Ambulance Service Inquiry Report (long)

<Brian.Randell@newcastle.ac.uk>
Wed, 24 Mar 1993 12:58:12 GMT
  [Brian noted that his reason for sending this to RISKS was that, unlike
  the previous postings, this one is AUTHORITATIVE.  He also wanted to give a
  clear impression of the scope and level of detail of the computer-related
  parts of the report, and of how they fitted into the report as a whole.
  PGN]

I have today managed to obtain a copy of the actual 80-page "Report of the
Inquiry into the London Ambulance Service, February 1993".

The terms of reference of the Inquiry were "To examine the operation of the
CAD [Computer-Aided Dispatch] system, including:

   a) the circumstances surrounding its failures on Monday and Tuesday 26 and
      27 October 1992

   b) the process of its procurement

and to identify the lessons to be learned for the operation and management
of the London Ambulance Service against the imperatives of delivering
service at the required standard, demonstrating good working relationships
and restoring public confidence."

The Inquiry Team membership is listed as

- Don Page, Chief Executive of South Yorkshire Metropolitan Ambulance and
Paramedic Service NHS Trust

- Paul Williams, senior computer audit partner of BDO Binder Hamlyn

- Dennis Boyd, former Chief Conciliation Officer of the Advisory,
Conciliation and Arbitration Service (ACAS)

The principal background facts given about the LAS in the report are that
the service "covers a geographical area of about 600 square miles. It is
the largest ambulance service in the world. It covers a resident population
of some 6.8 million, but its daytime population is larger particularly in
Central London. LAS carries over 5,000 patients every day. It receives
between 2,000 and 2,500 calls daily; this includes between 1,300 and 1,600
999 calls."

The Inquiry's Report carries no copyright notice, and is freely available
(see end of this message). Here are the scanned-in Table of Contents, and
the complete text of the Sections entitled "COMPUTER AIDED DESPATCH
SUMMARY", "COMPUTER AIDED DESPATCH RECOMMENDATIONS", "KEY SYSTEM PROBLEMS",
"CAUSES AND EFFECTS OF BREAKDOWN ON 26 AND 27 OCTOBER 1992", and "FAILURE
OF THE COMPUTER SYSTEM, 4 NOVEMBER 1992" (The section "CAUSES AND EFFECTS
OF BREAKDOWN ON 26 AND 27 OCTOBER 1992" also contains a very detailed and
interesting "Cause-Effects" diagram, with about 35 boxes and many directed
links, which is not reproduced here.)

Brian Randell, Dept. of Computing Science, University of Newcastle, Newcastle
upon Tyne, NE1 7RU, UK Brian.Randell@newcastle.ac.uk +44 91 222 7923

    = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

REPORT OF THE INQUIRY INTO THE LONDON AMBULANCE SERVICE FEBRUARY 1993

CONTENTS
SECTION and Sub-Section

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
Computer Aided Despatch Summary
Management and Operations Summary
Computer Aided Despatch Conclusions
Management and Operations Conclusions
Computer Aided Despatch Recommendations
Management and Operations Recommendations
Resource Implications of Inquiry Team Report

BACKGROUND
Terms of Reference and Inquiry Team Membership
Facts About the LAS
Computer Aided Despatch
LAS and CAD
Report Description

THE SYSTEM AND ITS DEVELOPMENT
Rationale For a CAD System
Background to CAD
Concept/Design
Supplier Selection - The Procurement Process
Project Management
Systems Testing/Implementation
Technical Communications
Human Resources and CAD Training
The System Structure

26 AND 27 OCTOBER AND 4 NOVEMBER 1992
CAD Conclusions
Demand on LAS Services 26 and 27 October
Key System Problems
System Configuration Changes
Causes and Effects of Breakdown on 26 and 27 October 1992
Failure of the Computer System, 4 November 1992

THE WAY FORWARD FOR CAD

MANAGEMENT AND OPERATION OF THE LAS
The Scope of LAS Operations
Managing the LAS
Management / Union Relationships
Resource Management
Personnel Management
LAS Accountability
Public Confidence

ANNEX A: List of organisations and individuals who gave evidence

ANNEX B: Glossary of abbreviations

     -----------------

COMPUTER AIDED DESPATCH SUMMARY

1001 What is clear from the Inquiry Team's investigations is that neither
the Computer Aided Despatch (CAD) system itself, nor its users, were ready
for full implementation on 26 October 1992. The CAD software was not
complete, not properly tuned, and not fully tested. The resilience of the
hardware under a full load had not been tested. The fall back option to the
second file server had certainly not been tested. There were outstanding
problems with data transmission to and from the mobile data terminals.
There was some scepticism over the accuracy record of the Automatic Vehicle
Location System (AVLS). Staff, both within Central Ambulance Control (CAC)
and ambulance crews, had no confidence in the system and were not all fully
trained. The physical changes to the layout of the control room on 26
October 1992 meant that CAC staff were working in unfamiliar positions,
without paper backup, and were less able to work with colleagues with whom
they had jointly solved problems before. There had been no attempt to
foresee fully the effect of inaccurate or incomplete data available to the
system (late status reporting/vehicle locations etc.). These imperfections
led to an increase in the number of exception messages that would have to
be dealt with and which in turn would lead to more call backs and
enquiries. In particular the decision on that day to use only the computer
generated resource allocations (which were proven to be less than 100%
reliable) was a high risk move.

1002 Whilst understanding fully the pressures that the project team were
under to achieve a quick and successful implementation it is difficult to
understand why the final decision was made, knowing that there were so many
potential imperfections in the system.

1003 The development of a strategy for the future of computer aided
despatch within the London Ambulance Service (LAS) must involve a full
process of consultation between management, staff, trade union
representatives and the Service's information technology advisers. It may
also be appropriate to establish a wider consultative panel involving
experts in CAD from other ambulance services, the police and fire brigade.
Consequently the recommendations from the Inquiry Team should be regarded
as suggestions and options for the future rather than as definitive
recommendations on the way forward. What is certain is that the next CAD
system must be made to fit the Service's current or future organisational
structure and agreed operational procedures. This was not the case with the
current CAD.

     -----------------

COMPUTER AIDED DESPATCH RECOMMENDATIONS

1009 These are the main recommendations drawn by the Inquiry Team from its
investigations into the CAD system, each of which is covered fully in the
main text. We recommend:

a) that LAS continues to plan the implementation of a CAD system [3009];

b) that the standing financial instructions should be extended to provide
more qualitative guidance for future major IT procurements [3032];

c) that any future CAD system must conform to the following imperatives:

i. it must be fully reliable and resilient with fully tested levels of back-up;

ii. it must have total ownership by management and staff, both within CAC
and the ambulance crews;

iii. it must be developed and introduced in a timescale which, whilst
recognising the need for earliest introduction, must allow fully for
consultation, quality assurance, testing, and training;

iv. management and staff must have total, demonstrable, confidence in the
reliability of the system;

v. the new system must contribute to improving the level and quality of the
provision of ambulance services in the capital;

vi. any new system should be introduced in a stepwise approach, with, where
possible, the steps giving maximum benefit being introduced first;

vii. any investment in the current system should be protected and carried
forward to the new system only if it results in no compromises to the above
objectives [5004];

d) re-training of CAC staff be carried out on the system to ensure that
they are familiar with its features and that they are operating the system
in a totally consistent way [5025];

e) a suitably qualified and experienced project manager be appointed
immediately to coordinate and control the implementation of the proposed
first stage of CAD [5027];

f) that a specialist review be undertaken of communications in the light of
the final objectives of CAD and that any recommendations arising are
actioned as part of the proposed second phase of CAD [5033];

g) the establishment of a Project Subcommittee of the LAS Board [5040];

h) that LAS recruit an IT Director, who will have direct access to the LAS
Board [5041].

     -----------------

KEY SYSTEM PROBLEMS

4007 As detailed earlier there were a number of basic flaws in the CAD
system and its supporting infrastructure. In summary, the system and its
concept has several major problems:

a) a need for near perfect input information in an imperfect world;

b) poor interface between crews, MDTs [Mobile Data Terminals] and the system;

c) unreliability, slowness and operator interface.

**Need for near perfect information**


4008 The system relied on near perfect information of vehicle location and
crew/vehicle status. Without accurate knowledge of vehicle locations and
status the system could not allocate the optimum resource to an incident.
Although some poor allocations may be attributable to errors in the
allocation routine, it is believed that the majority of allocation errors
were due to the system not knowing the correct vehicle location or status
of vehicles that may have proved more appropriate.
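
[Illustrative sketch, not part of the report. It assumes a simple
"closest available vehicle" rule; the report does not describe the actual
allocation routine. The callsigns, coordinates and status values are all
hypothetical. It shows how a correct routine still proposes a poor resource
when one status update never reaches the system.]

```python
import math


def nearest_available(incident, vehicles):
    """Return the callsign of the closest vehicle believed to be available."""
    candidates = [v for v in vehicles if v["status"] == "available"]
    if not candidates:
        return None
    closest = min(candidates, key=lambda v: math.dist(incident, v["position"]))
    return closest["callsign"]


# What the despatch system believes (hypothetical data). A1 is really free,
# but its status change was never received, so it still shows as busy.
believed = [
    {"callsign": "A1", "position": (0.0, 1.0), "status": "busy"},
    {"callsign": "A2", "position": (5.0, 5.0), "status": "available"},
]

# What is actually true on the road.
actual = [
    {"callsign": "A1", "position": (0.0, 1.0), "status": "available"},
    {"callsign": "A2", "position": (5.0, 5.0), "status": "available"},
]

incident = (0.0, 0.0)
proposed = nearest_available(incident, believed)  # based on stale status
optimum = nearest_available(incident, actual)     # based on the true state
```

[Here the routine itself has no bug: the distant vehicle is proposed only
because the nearer one is wrongly recorded as busy.]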

**Poor interface between crews, MDTs and the system**


4009 Given that the system required almost perfect information on vehicle
location and status, each of the component parts of the chain from crews to
despatch system must operate well. This was not the case. From our
investigations, possible reasons for the despatch system not knowing the
correct vehicle location or status of vehicles that may have proved more
appropriate include:

a) a failure of the system to catch all of the data;

b) a genuine failure of crews to press the correct status button owing to
the nature and pressure of certain incidents;

c) poor coverage of the radio system, i.e. black spots;

d) crews failing to press status buttons as they became frustrated with
re-transmission problems;

e) a radio communications bottle neck, e.g. when crews commence duty and
try to log on via their vehicle's MDT or during other busy periods;

f) missing or swapped callsigns;

g) faults in the "hand shaking" routines between MDTs and the despatch
system, e.g. MDTs showing Green and OK, but system screens showing them in a
different status;

h) crews intentionally not pressing the correct status buttons or pressing
them in an incorrect order;

i) crews taking a different vehicle to that which they have logged on to,
or a different vehicle/crew responding to that allocated by the system;

j) incorrect or missing vehicle locations;

k) too few call takers.

4010 The above reasons are often interconnected.

**Unreliability, Slowness and Operator Interface**

4011 It is reported that the system "fell over" a few times before 26
October 1992. More common was the frequent "locking up" of screens. Staff
had been instructed to re-boot their screens if they locked up. The system
also slowed up when under load and whilst it was doing its "house keeping"
at 02:00 hours each morning.

4012 General imperfections include:

a) failure to identify all duplicated calls;

b) lack of prioritisation of exception messages;

c) exception messages and awaiting attention queues scrolling off the top
of the allocators'/exception rectifiers' screens;

d) software resource allocation errors;

e) general robustness of the system (workstation and MDT "lockups");

f) slow response times for certain screen based activities.
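
[Illustrative sketch, not part of the report. Items b) and c) above describe
unprioritised exception messages scrolling off the screen. One common way to
keep the most urgent message at the front regardless of arrival order is a
priority queue. The priority levels and message texts below are hypothetical;
the report does not say how messages should have been ranked.]

```python
import heapq
import itertools


class ExceptionQueue:
    """Exception messages ordered by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal-priority messages come out first-in, first-out.
        self._counter = itertools.count()

    def push(self, priority, message):
        # Lower number means more urgent (an assumed convention).
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def pop(self):
        _priority, _seq, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)


queue = ExceptionQueue()
queue.push(3, "status buttons pressed out of order")
queue.push(1, "no vehicle mobilised for incident")
queue.push(2, "vehicle location missing")
queue.push(1, "second unmobilised incident")

# Drain the queue: urgent messages surface first even though they arrived later.
order = [queue.pop() for _ in range(len(queue))]
```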

     -----------------

CAUSES AND EFFECTS OF BREAKDOWN ON 26 AND 27 OCTOBER 1992

4016 On 26 and 27 October 1992 the computer system itself did not fail in a
technical sense. Response times did on occasions become unacceptable, but
overall the system did what it had been designed to do. However, much of
the design had fatal flaws that would, and did, cumulatively lead to all of
the symptoms of systems failure.

4017 In order to work effectively the system needed near perfect
information all of the time. Without this the system could not be expected
to propose the optimum resource to be allocated to an incident. There were
many imperfections in this information which individually may not be
serious, but which cumulatively were to lead to system "failure".

4018 The changes to CAC operation on 26 and 27 October 1992 made it
extremely difficult for staff to intervene and correct the system.
Consequently, the system rapidly knew the correct location and status of
fewer and fewer vehicles. The knock on effects were:

a) poor, duplicated and delayed allocations;

b) a build up of exception messages and the awaiting attention list;

c) a slow up of the system as the messages and lists built up;

d) an increased number of call backs and hence delays in telephone answering.

4019 Each effect quickly reinforced the others leading to severe
lengthening of response times. A more detailed explanation follows.

4020 A cause and effect diagram is shown opposite, Diagram 4.5, for the
operation of the system on 26 and 27 October 1992. As the number of
incidents increases there are several naturally reinforcing loops which
escalate the problems. A description of the course of events and
interactions follows.

4021 When the system was fully implemented at 07:00 hours 26 October 1992
the system was lightly loaded. Staff and system could cope with the various
problems (left hand side of the diagram) which caused the despatch system
to have imperfect information on the fleet and its status. As the number of
incidents increased, incorrect vehicle location or status information
received by the system increased. With the new room configuration and
method of operation, allocators were less able to spot and correct errors.

4022 The amount of incorrect location and status information in the system
increased with four direct effects:

a) the system made incorrect allocations: multiple vehicles sent to same
incident, or not the closest vehicle sent;

b) the system had fewer resources to allocate, increasing the problems of
effect a);

c) as previously allocated incidents fed through the system and suffered
from the problems on the left hand side of the diagram which resulted in
the system not having the resource's correct status, the system placed
covered calls that had not gone through the amber, red, green status cycle,
back on the attention waiting list;

d) failures because of the problems on the left hand side of the diagram
caused the system to generate exception messages.

4023 Starting with effect 4022 d), the number of exception messages
increased rapidly to such an extent that staff were unable to clear the
queue. As the exception message queue grew the system slowed. The situation
was made worse as unrectified exception messages generated more exception
messages. With the increasing number of "awaiting attention" and exception
messages it became increasingly easy to fail to attend to messages that had
scrolled off the top of the screen. Failing to attend to these messages
arguably would have been less likely in a "paper-based" environment.
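
[Illustrative sketch, not part of the report. Paragraph 4023 describes a
reinforcing loop: unrectified exception messages generate further messages,
and the queue outgrows what staff can clear. The toy model below uses
invented rates and an assumed respawn rule purely to show the shape of that
loop; none of the numbers come from the Inquiry.]

```python
def simulate_backlog(arrivals_per_tick, cleared_per_tick, respawn_every, ticks):
    """Return the exception-message backlog at the end of each tick.

    Assumed rule: every `respawn_every` messages still unrectified at the
    start of a tick generate one extra message during that tick.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick + backlog // respawn_every
        backlog = max(0, backlog - cleared_per_tick)
        history.append(backlog)
    return history


# Light load: staff clear messages faster than they arrive, backlog stays empty.
light = simulate_backlog(arrivals_per_tick=4, cleared_per_tick=5,
                         respawn_every=4, ticks=6)

# Heavier load: arrivals exceed clearing capacity, and the leftover messages
# add to the inflow, so the backlog grows faster each tick.
heavy = simulate_backlog(arrivals_per_tick=8, cleared_per_tick=5,
                         respawn_every=4, ticks=6)

growth = [b - a for a, b in zip(heavy, heavy[1:])]
```

[In this model the step from one tick to the next never shrinks once load
exceeds capacity, which is the accelerating pattern the paragraph describes.]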

4024 Effects 4022 b) and c). With fewer resources to allocate the system
would recommend what it saw as the closest vehicle. This was often an
incorrect allocation as a closer vehicle was actually available. It took
longer to allocate resources for three reasons:

a) the allocator had to spend more time finding and confirming suitable
resources;

b) incidents were held until a suitable resource became available;

c) resource proposal software took longer to process as resources became
more distant.

4025 There was a re-enforcing effect in that as allocators tried to contact
a resource, that resource was unavailable for allocation to another
incident. Once an allocator "clicked onto" a resource its status turned to
dark green thus preventing it from being allocated elsewhere. It is
reported that one allocator was allocating resources, but not mobilising
them. Any delay in allocation or mobilisation was a delay to a patient.
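
[Illustrative sketch, not part of the report. Paragraph 4025 says that
clicking onto a resource made it unavailable to other allocators, even if it
was never mobilised. The model below reproduces that behaviour and contrasts
it with a hypothetical hold expiry. The expiry rule, the 120-second value and
all names are assumptions for illustration; the report does not propose them.]

```python
class Resource:
    def __init__(self, callsign):
        self.callsign = callsign
        self.held_by = None     # allocator who has "clicked onto" the resource
        self.held_at = None     # time the hold started, in seconds
        self.mobilised = False


def available(resource, now, hold_timeout=None):
    """Can this resource be offered to an allocator at time `now`?"""
    if resource.mobilised:
        return False
    if resource.held_by is None:
        return True
    # Hypothetical rule: a hold that has not led to mobilisation expires.
    return hold_timeout is not None and now - resource.held_at >= hold_timeout


def hold(resource, allocator, now, hold_timeout=None):
    """Try to reserve the resource; return True on success."""
    if not available(resource, now, hold_timeout):
        return False
    resource.held_by = allocator
    resource.held_at = now
    return True


vehicle = Resource("A1")
first = hold(vehicle, "allocator-1", now=0)        # held, never mobilised
blocked = hold(vehicle, "allocator-2", now=300)    # no expiry: locked out
released = hold(vehicle, "allocator-2", now=300, hold_timeout=120)
```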

4026 It also took longer to allocate resources as more two line summaries
fed through the system. Standard two line summaries of incidents awaiting
resource allocation included those that had previously been covered, but
were not seen by the system as complete. As this queue built up it caused
the system to slow.

4027 At one stage two line summaries were scrolling onto the screen so fast
that in trying to stop summaries moving off the screen, allocators were
further slowed in their tasks.

4028 In summary, effects 4022 b) and c) contributed to incorrect
allocations, a slowing of the system and uncovered incidents all leading to
delays to patients. The number of uncovered incidents was probably
increased when at one stage the exception report queue was cleared in an
effort to increase the speed of the system.

4029 Effect 4022 a), incorrect allocations, led directly to patient delays
and crew frustration. Crew frustration was further increased by delays in
arriving at the scene and the reaction from the public.

4030 Crew frustration may have been responsible for:

a) increasing the instances when crews didn't press the status buttons in
the correct sequence;

b) the allocated crew taking a different vehicle, or a different crew and
vehicle responding to the incident.

4031 In the month preceding 26 and 27 October 1992 crew frustration also
led to an increase in radio traffic which, owing to the potential for radio
bottlenecks, increased the number of failed data mobilisations and voice
communication delays. In turn, and completing the loop, failed data
mobilisations and voice communications delays led to further increased
voice communications and crew frustration. On 26 October instruction was
for minimum voice communication. Statistics show that the number of
successful data mobilisations increased. However, with no voice
communications, wrong or multiple allocations were not corrected thus
negating the beneficial effect of increased data mobilisations.

4032 Turning to telephone communications between the public and CAC, delays
to patients and uncovered incidents greatly increased the number of call
backs, thus increasing the total number of calls handled. An increased call
volume, together with a slow system and too few call takers caused
significant delays in telephone answering, thereby further increasing
delays to patients.

     --------------

FAILURE OF THE COMPUTER SYSTEM, 4 NOVEMBER 1992

4033 Following the CAD problems of 26 and 27 October 1992, CAC had reverted
to a semi manual method of operation, identical to that which had operated
with a variable degree of success before 26 October.

4034 This method of working comprised:

a) calls being taken on the CAD system (including use of gazetteer);

b) incident details being printed out in CAC;

c) optimum vehicle resource identified through contact with nearest station
to incident;

d) mobilisation of the resource via CAD, direct to the station printer or
to the MDT.

4035 In general CAC staff were comfortable with operating this system as
they found the computer based call taking and the gazetteer for the most
part reliable. There were known inadequacies with the gazetteer and
occasional "lock-up" problems with workstations, but overall the benefits
outweighed the disadvantages. The vehicle crews were also more comfortable
as the stations still had local flexibility in deciding which resource to
allocate to an incident. The radio voice channels were available to help
clear up any mobilisation misunderstandings. Largely as a result of the
problems of the previous week, additional call taking staff had been
allocated to each shift thus reducing significantly the average call
waiting time.

4036 This system operated with reasonable success from the afternoon of 27
October 1992 up to the early hours of 4 November.

4037 However, shortly after 2am on 4 November the system slowed
significantly and, shortly after this, locked up altogether. Attempts were
made to re-boot (switch off and restart workstations) in the manner that
CAC staff had previously been instructed by Systems Options to do in these
circumstances. This re-booting failed to overcome the problem with the
result that calls in the system could not be printed out and mobilisations
via CAD from incident summaries could not take place. CAC management and
staff, having assured themselves that all calls had been accounted for by
listening to the voice tapes, and having taken advice from senior
management, reverted fully to a manual, paper-based system with voice or
telephone mobilisation. As these problems occurred in the early hours when
the system was not stretched the operational disruption was minimised.

4038 SO [Systems Options Ltd.] were called in immediately to investigate
the reasons for the failure. In particular LAS required an explanation as
to why the specified fallback to the standby system had not worked.

4039 The Inquiry Team has concluded that the system crash was caused by a
minor programming error. In carrying out some work on the system some three
weeks previously the SO programmer had inadvertently left in the system a
piece of program code that caused a small amount of memory within the file
server to be used up and not released every time a vehicle mobilisation was
generated by the system. Over a three week period these activities had
gradually used up all available memory thus causing the system to crash.
This programming error should not have occurred and was caused by
carelessness and lack of quality assurance of program code changes. Given
the nature of the fault it is unlikely that it would have been detected
through conventional programmer or user testing.
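
[Illustrative sketch, not part of the report. Paragraph 4039 describes memory
that was used up and not released on every vehicle mobilisation. The model
below shows why such a fault survives short tests and only appears after many
mobilisations. The memory size, the bytes lost per mobilisation and all names
are invented; the report gives no figures.]

```python
class FileServer:
    """Toy model of a server with a fixed pool of memory."""

    def __init__(self, memory_bytes):
        self.free = memory_bytes

    def allocate(self, n):
        if n > self.free:
            raise MemoryError("file server out of memory")
        self.free -= n

    def release(self, n):
        self.free += n


def mobilise(server, scratch_bytes, leaky):
    """Generate one mobilisation, using some scratch memory."""
    server.allocate(scratch_bytes)
    # ... the mobilisation itself would be generated here ...
    if not leaky:
        server.release(scratch_bytes)  # the step the faulty code omitted


def mobilisations_until_crash(memory_bytes, scratch_bytes, leaky, limit):
    """Return the mobilisation count at which memory runs out, or None."""
    server = FileServer(memory_bytes)
    for count in range(1, limit + 1):
        try:
            mobilise(server, scratch_bytes, leaky)
        except MemoryError:
            return count
    return None


# Hypothetical sizes: 1,000,000 bytes free, 64 bytes lost per mobilisation.
short_test = mobilisations_until_crash(1_000_000, 64, leaky=True, limit=500)
long_run = mobilisations_until_crash(1_000_000, 64, leaky=True, limit=100_000)
fixed_run = mobilisations_until_crash(1_000_000, 64, leaky=False, limit=100_000)
```

[With these made-up numbers a test of 500 mobilisations sees no failure, while
a long run exhausts memory after several thousand; the corrected version never
does. This matches the report's remark that conventional testing was unlikely
to detect the fault.]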

4040 The failure of the fallback procedures arises as a consequence of what
was believed at the time to be only a temporary addition of printers. The
concept of the system was that it would operate on a totally paperless
basis. Printers were only added, as a short term expedient, in order to
implement at least a partial system at the originally planned
implementation date of 8 January 1992.

4041 The fallback to the second server was never implemented by SO as an
integral part of this level of CAD implementation. It was always specified,
and indeed implemented, as part of the complete paperless system and thus
arguably would have activated had the system actually crashed on 26 and 27
October 1992. However, there is no record of this having been tested and
there can be no doubt that the effects of server failure on the
printer-based system had not been tested. This was a serious oversight on
the part of both LAS IT staff and SO and reflects, at least in part, the
dangers of LAS not having their own network manager.

ISBN 0 905133 70 6

Further copies available from: Communications Directorate, South West Thames
Regional Health Authority, 40 Eastbourne Terrace, London W2 3QR  071-725 2551

Dept. of Computing Science, University of Newcastle, Newcastle upon Tyne,
NE1 7RU, UK  Brian.Randell@newcastle.ac.uk   PHONE = +44 91 222 7923
