There's an old joke about computer scientists who build the most advanced, intelligent computer ever. As a test, they ask it "Is there a God?" It responds, "There is now!" Sadly, a lot of people tend to think of computers as infallible. (We've discussed this in Risks before, I think.) Computer scientists know better, and try to educate the public to this fact of life.

Peter Denning asks "Have some computer people become unduly eager to accept the blame when there is a mishap in a system that contains a computer?" If the answer is yes, one cause may be an eagerness to demonstrate to the public that the machines are not perfect.

Others once thought of the computer as infallible but have come to realize it is only as good as the people who build it, program it, and feed it data. When something (or someone, for that matter) once put on a pedestal falls off, there is a very human tendency to be more harsh towards that thing than something never put on a pedestal. We may be seeing some of this in "journalists [becoming] unduly accustomed to fingering the computer for every mishap" (although I suspect it's not just journalists who do this!)

There's also a third tendency at work here -- it's a lot easier to blame someone whom you don't have to look in the eye. With a human, you would have to (say, when you were firing him, when you were prosecuting him, or so forth.) Also, a human can strike back verbally or nonverbally. A computer can do none of these things, and best of all you don't have to think about its feelings when you chastise it. Maybe that's part of it too.

Matt Bishop

Newsgroups: net.news
From: jerry@oliveb.UUCP <Jerry Aguirre, Olivetti ATC; Cupertino, Ca >
Date: 11 Nov 86 17:39:44 GMT

Most of you are probably aware that there was a premature posting of newsgroup messages for all the proposed newsgroup renamings. This caused many (if not most) sites to exceed the maximum allowed number of newsgroups in their active files. Some sites are still recovering from this problem.

It is interesting to note that the volume of news articles for last week was less than half what it was for the previous week.

    Oct 27 11:48 to Nov  1 23:58    6,755 articles
    Nov  1 23:59 to Nov 10 15:15    3,102 articles

The reduction in volume gives you some idea of the number of sites that were blown off the air. I know it took me a couple of hours to clean up old newsgroups, recompile news with larger tables, and reprocess the failing batches. (My news daemon renames and saves batches when rnews exits with an error status.) Multiply that times the number of sites on the net and you probably get many thousands of manhours spent cleaning up.

Amazing to think how vulnerable the net is to the actions of one individual.

Jerry Aguirre, Olivetti ATC

    [And this was precisely the glitch that triggered the macro error that led to the saga prior to the real RISKS-4.7! To add to the irony, MPW's message slipped through a crack last week while I was travelling. I just found it while cleaning up the RISKS mailbox! PGN]
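
As a rough check on the two figures quoted above: the intervals are not the same length, so one way to compare them is as articles per day. The sketch below is only illustrative. It assumes the year is 1986 (taken from the Date header) and that each count covers exactly the interval shown; the function name `articles_per_day` is made up for this example.

```python
from datetime import datetime

# Intervals and counts as quoted in the message; the year is an assumption
# taken from the Date header (11 Nov 86).
periods = [
    ("before", datetime(1986, 10, 27, 11, 48), datetime(1986, 11, 1, 23, 58), 6755),
    ("after", datetime(1986, 11, 1, 23, 59), datetime(1986, 11, 10, 15, 15), 3102),
]


def articles_per_day(start, end, count):
    """Average number of articles per 24 hours over the interval."""
    days = (end - start).total_seconds() / 86400
    return count / days


rates = {label: articles_per_day(s, e, n) for label, s, e, n in periods}
ratio = rates["after"] / rates["before"]

for label, rate in rates.items():
    print(f"{label}: {rate:.0f} articles/day")
print(f"after/before ratio: {ratio:.2f}")
```

On these assumptions the per-day rate falls by more than the raw counts suggest, because the second interval is the longer of the two.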

    [John Rushby called to my attention a remarkable report on the British view of software in the future. The entire report is fascinating reading, but in particular the following appendix is of sufficient interest to the RISKS community that it is reproduced here in its entirety for the private use of RISKS readers. It represents an important step toward the problems of developing safety-critical software. PGN]

``Software: A Vital Key to UK Competitiveness''
Cabinet Office: Advisory Council for Applied Research and Development (ACARD)
London, Her Majesty's Stationery Office. (C) Crown Copyright 1986

Appendix B: Safety-Critical Software

The problem: non-technical

B.1 No computer software failure has killed or injured a large number of people. It is just conceivable that such a tragedy could occur. What steps should be taken to:

  * prevent such a disaster,
  * cope with it when it does occur,
  * ensure such a disaster, having happened once, cannot recur?

The problem: technical

B.2 Stored-program digital computers must be among the most reliable mechanisms ever built by man. Millions of computers throughout the world are executing millions of instructions per second for millions of seconds without a single error in any of the millions of bits from which each computer is made. In spite of this, nobody trusts a computer; and this lack of faith is amply justified.

B.3 The fault lies not so much in the computer hardware as in the programs which control them, programs full of the errors, oversights, inadequacies and misunderstandings of the programmers who compose them. There are some large and widely used programs in which hundreds of new errors are discovered each month; and even when these errors are corrected, the error rate remains constant over several decades. Indeed it is suspected that each correction introduces on average more than one new error.
Other estimates offer the dubious comfort that only a negligible proportion of all the errors in these programs will ever be discovered.

B.4 New computers are beginning to be used in increasingly life-critical applications, where the correction of errors on discovery is not an acceptable option, for example industrial process control, nuclear reactors, weapon systems, station-keeping of ships close to oil rigs, aero engines and railway signalling. The engineers in charge of these projects are naturally worried about the correctness of the programs performing these tasks, and they have suggested a number of expedients for tackling the problem. Many of these methods are of limited effectiveness because they are based on false analogies rather than on a true appreciation of the nature of computer programs and the activity of programming.

B.5 The steps which ACARD has been considering in answer to the introductory question are discussed under the following headings:

  * Disaster prevention
  * Disaster management
  * Disaster analysis

Disaster prevention

B.6 The initiative for disaster prevention must come from the UK government and system customers. Current software is built, operated and maintained using methods and tools which are not keeping pace with the development of the hardware, nor with the increased sophistication demanded by new applications; nor does it take account of progress of research into the reliability of programs. The necessary improvements in software engineering require investment in advanced development and production techniques, education, training and legislation. Legal obligations should be at least as stringent as those imposed by the Data Protection Act, and the care and time required for detailed drafting of legislation will be just as great. A start must be made immediately.

B.7 The remainder of this appendix outlines an imaginable solution that may emerge over the next fifteen years.
It is intended to promote rather than to pre-empt a discussion of the details.

Registration

B.8 A register must be established of those (software) systems which, if they fail, will endanger lives or public safety.

Operation (demand side)

B.9 Before any organization can operate a life-critical computer system it must first obtain a License To Operate (LTO), which will only be issued when the operator can demonstrate that certain conditions (detailed below) have been met.

B.10 Each life-critical system must be operated by a Certified Software Engineer who is named as being personally responsible for the system. This Certified Software Engineer must have received the appropriate mathematical training in safety-critical software engineering.

B.11 A life-critical system must be adequately maintained; this must be one of the conditions of the LTO. Maintenance (that is, rectification and development) must be the responsibility of a named Certified Software Engineer.

Certification

B.13 An LTO must only be granted when a Safety Certificate has been issued. Certificates must be issued for limited periods, for example, five years. Operational systems will thus need to be recertificated (relicensed) periodically (analogous to Certificate of Airworthiness).

Reliability data collection

B.14 To aid research into system reliability, and to assist Boards of Enquiry, all registered life-critical software systems must supply operating data to the Licensing Authority.

Disaster management

B.15 In the past, the danger arising from failure of computer hardware and software has been limited by switching off the computer and reverting to manual operation if necessary. In future, there will be applications for which this fall-back procedure is not available. The computers will have to continue to run, and any necessary software changes and corrections will have to be inserted into the incorrectly running system. For these applications, specially stringent precautions are necessary.

Procedures

B.16 The Licensing Authority should require disaster management procedures to be laid down in advance of operation and practiced regularly during operation (that is 'fire drill practice'). The documentation of the system must meet a standard which would permit a team of experts/specialists to master it during the progress of an emergency.

Data Logging

B.17 The disaster management procedures should include the logging of data so that any subsequent Enquiry can ascertain the progress and cause of the disaster (analogous to the 'black box recorder' in an aeroplane).

Emergency call-out

B.18 There must be more than one Certified Software Engineer available to the operating company; and a duty rota should ensure that one of them is always available at short notice. Procedures must be set up for calling out a team of expert specialists in a longer-lasting emergency.

Disaster analysis

B.19 During the normal (safe) operation of any life-critical system, data on its performance and reliability must be made available to the Licensing Authority. This data will be made available to any Enquiry. (This is additional to the data logging required in para B.14.)

Board of Enquiry

B.20 Any disaster should be the subject of an official Board of Enquiry (similar to rail and air disaster enquiries). A Board of Enquiry must have the power to make changes to the system under investigation and/or the methods, tools, products and staff associated with the certification procedure.

Any error triggers Board of Enquiry

B.21 Any error, no matter how 'small', in a software system which has been certified as being safe must be the subject of an Enquiry. This is the only way of discovering weaknesses in the certification process itself, or misuse or misunderstanding of its application. Enquiries concerning non-fatal errors should not have disciplinary implications, so that operators are encouraged always to give notification of minor faults.

Near Miss

B.22 Any serious 'near miss' must be reported to the Licensing Authority. An Enquiry should be held if the Licensing Authority is concerned at the incident's implications.

Safety certification

B.23 The UK must develop the ability to certify safety aspects of software system construction and operation. These include:

  * certification of the mathematical soundness of the methods of construction;
  * certification that certified methods are properly applied during construction and subsequent maintenance (rectification and development);
  * certification of the tools used during construction and maintenance;
  * certification of the software engineers who build and maintain the systems;
  * certification of the end product, that is, the software itself.

B.24 Methods should not be certified which are merely 'good practice'. Safety and reliability require more rigorous theoretical bases than existing good practice, so that system behavior can be accurately and consistently predicted; hence the need for mathematical soundness to enable prediction to be based on mathematical proof.

B.25 Certification of a tool will only be given when it is shown that the tool preserves the mathematical soundness of the method it supports.

B.26 Certification of software engineers will only be given when they have completed an approved level of formal mathematical and methodological training together with an approved track record of experience. Certification should be of limited duration; recertification should require additional formal training both of the refresher type and new developments. Recertification should occur at regular intervals.

B.27 Certification of end products (and their components) implies proof obligations in addition to thorough testing. Proofs must be performed and checked by competent mathematicians or by a machine running certified software.

B.28 As in other branches of engineering, the rigour of the inspection procedures should be adjusted to the degree of risk, the severity of the danger and the cost. For example, we can imagine the emergence of several levels of certification:

a. Disaster Level. Failure could involve more than ten deaths. The whole of the software must be checked by formal mathematical proof, which is itself checked by a competent mathematician. Further precautions required if damage limitation by switch-off is not feasible (para B.15).

b. Safety Level. Where failure could cause one death, but further danger can be averted by switch-off. The whole of the software must be constructed by proof-oriented methods, checked by a competent mathematician. On occurrence of a fatality, the mandatory Enquiry must name the programmer and mathematician responsible, who might be liable for criminal negligence. Perhaps one error per 100,000 lines of code would be a realistic expectation, so that most shorter programs will contain no errors.

c. High Quality Level. Appropriate for software sold commercially, where error could bring financial loss to the customer. By law, such losses should be reimbursed. All programmers involved should be certified competent in mathematical methods of software design and construction. Their use of the methods is checked by sampling. An acceptable error rate would be one error per 10,000 lines of code delivered. Each error corrected requires recertification at Safety Level. If the target error rate is exceeded, certification is withdrawn. Eventually, all software used to construct other certified software should be certified to this level; and the construction of 'disaster level' software should include independent checks on the correct working of support software used (for example, check of binary code against higher level source codes).

d. Normal Quality. Corresponds roughly to the best of current practice (say, one error per 1,000 lines of code).
The methods used to construct software to higher levels of reliability may also be used to achieve normal reliability; and this should bring a significant improvement in programmer productivity and a reduction in the whole life cycle costs of the programs they produce.
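
The error rates suggested in B.28 can be turned into rough expectations for a program of a given size. The sketch below is illustrative only. It assumes (this is not stated in the report) that errors occur independently at a constant rate per line, so that the number of errors in a program follows a Poisson distribution. The function names and the 20,000-line example are hypothetical, and Disaster Level is omitted because the report gives no rate for it.

```python
import math

# Suggested rates from B.28, expressed as lines of code per error.
LINES_PER_ERROR = {
    "Safety Level": 100_000,
    "High Quality Level": 10_000,
    "Normal Quality": 1_000,
}


def expected_errors(lines, level):
    """Mean number of errors in a program of `lines` lines at this level."""
    return lines / LINES_PER_ERROR[level]


def prob_error_free(lines, level):
    """Chance of zero errors, under the Poisson assumption stated above."""
    return math.exp(-expected_errors(lines, level))


program_size = 20_000  # hypothetical program length in lines
for level in LINES_PER_ERROR:
    mean = expected_errors(program_size, level)
    p_zero = prob_error_free(program_size, level)
    print(f"{level}: expected {mean:.1f} errors, P(no errors) = {p_zero:.3f}")
```

Under this model a program much shorter than 100,000 lines has a high chance of containing no errors at Safety Level, which is one reading of the remark that "most shorter programs will contain no errors"; at Normal Quality the same program would almost certainly contain some.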