Some of the details of the situation regarding the Darlington nuclear power plant's computerized shutdown systems referred to in the Nucleonics Week article quoted by M. Thomas (RISKS 11.08) are given in a previous Nucleonics Week article of May 24, 1990. I will try to summarize that article and include some further details from the Atomic Energy Control Board's (AECB) review process and licensing decision. That article correctly indicates that the problems with these shutdown systems were that the software is difficult to verify and difficult to modify. These difficulties caused both delays and extra cost. The article also mentions briefly that the software "will have to be rewritten because it is not designed so changes can be easily incorporated." This requirement was placed on Ontario Hydro by the AECB as part of the licensing decision. To quote that AECB licence: "The Board ... has concluded that the existing shutdown system software, while acceptable for the present, is unsuitable in the longer term, and that it should be redesigned. Before redesign is undertaken, however, an appropriate standard must be defined." What led to this conclusion was an extensive and thorough analysis of the software. The original software submitted by Ontario Hydro for AECB review was obviously very complex and convoluted. The introduction of digital computers had been taken as an opportunity to add additional complexity and new monitoring functions to the shutdown systems over and above previous analog and hybrid systems. The AECB was concerned about how such software could be reviewed and demonstrated to be safe. Dave Parnas was contracted to advise the AECB, and Nancy Leveson was contracted to advise Ontario Hydro. Nancy Leveson's advice resulted in a hazard analysis by Ontario Hydro and some revisions to the code for better fault detection and safety. 
The review method eventually chosen by the AECB was strongly based on Dave Parnas' work and involved rewriting the software requirements in "A-7 style" event and condition tables, deriving similar format "program-function tables" from the source code, and comparing the two sets of tables. It should be pointed out that the software was not designed with such a verification process in mind. When Ontario Hydro safety analysts and AECB reviewers (myself included) tried to verify this software, we encountered many problems simply because the designers and programmers had not expected to have their work verified by mathematical techniques. Nor could they have; the decision to do such a verification was made after the software was designed and coded. Nevertheless, the techniques were successfully applied to the software for the two shutdown systems. Since automated tools were not available, most of the work had to be done manually, and to compensate for human error, all verifications were independently reviewed. The AECB's audit of this process constituted a second independent review of the most critical 30-40% of the software.

Despite the difficulties, the AECB did eventually license the Darlington reactor. The NW article of May 24 quotes Zygmond Domaratzki of the AECB: "At the end of the long tedious process we went through to review the software....(W)e don't have any reservations about its ability to shut down the reactor in an emergency." The other result of this process was considerable assurance that the software would not perform any unintended, unsafe actions. Every part of the code was analyzed, some unintended actions were discovered, but all actions were determined to be safe.

The current situation is that Ontario Hydro and Atomic Energy Canada Limited (AECL) are developing methods for specifying, designing and verifying safety critical software.
These methods will be applied to the development and verification of some prototype systems before they are adopted for general use, and for the redesign of the Darlington shutdown systems. The goal of these methods is to make software easier to modify and easier to verify and review. The AECB is monitoring this process closely.

The AECB also is working (with the help of Dave Parnas and Wolfgang Ehrenberger of GRS in Germany) to develop Canadian standards for safety critical software in nuclear power plants. We are monitoring international developments in this area to ensure that Canadian standards are on par with the rest of the world. A separate, but related issue is to find a method of predicting the reliability of the software. I hope to join that RISKS discussion again shortly.

Richard P. Taylor, AECB, Canada   firstname.lastname@example.org

*I have tried to be brief and informative rather than simply quoting published material. Any misquotes or additional material are strictly my own interpretation and should not be taken as the position of the AECB. All the usual disclaimers apply. Please also note that the AECB is distinct and separate from AECL even though I get my e-mail via an AECL address.

> According to the report, "Ontario Hydro faced a similar situation at its
> Darlington station, in which proving the safety effectiveness of a
> sophisticated computerized shutdown system delayed startup of the first
> unit through much of 1989. Last year, faced with regulatory complaints
> that the software was too difficult to adapt to operating changes, Hydro
> decided to replace it altogether". [ I hope that Dave Parnas or Nancy
> Leveson can fill in the details here.]

This is not exactly the situation as I understand it. Although I have not been directly involved for a while, I do have contact with people at Ontario Hydro.

It is true that granting of a low-power testing license for the reactor was delayed due to questions about how to certify the software. Both Dave and I were consulting on this -- Dave for the Atomic Energy Control Board (the government agency) and me for Ontario Hydro. Dave and I disagreed about what OH had to do to ensure the safety of the shutdown software, and they ended up having to satisfy both of us.

Very briefly, my major requirements were that the software be subjected to a hazard analysis, including a software fault tree analysis. A few other minor suggestions involved such things as rewriting the code slightly so that it was easier to read and review. A paper on the results of the software hazard analysis (using Software Fault Tree Analysis) was just presented at the PSAM (Probabilistic Safety Assessment and Management) Conference in L.A. two weeks ago. The Software Fault Tree Analysis took 2 man months. There were no "errors" found, but they did make 42 changes to the code as a result of what they learned by doing it (e.g., changed the order of some statements to make it more naturally fault tolerant and added assertions to detect hazardous states at various points in the code).
They reported to me (and in the PSAM presentation) that they liked the software fault tree analysis technique, have used it on some other control system software, and are planning to use it again in the future. A little more about this can be found in my current CACM paper on how to build safety-critical software. Dave required that they rewrite their requirements specification in the A-7 style and that they do a "handproof" of the code using functional abstraction from the code (called Program Function (PF) Tables). This was quite costly and painful in comparison with the fault tree analysis (at PSAM I was told that the PF tables took 30 man years), but it is also more complete. I heard that a few errors were found in the specification (not in the code) as a result -- but this may not be correct. I have also heard from several people at Ontario Hydro that they are not happy with the prospect of having to repeat the PF analysis when changes are made in the code (which the AECB has decreed), and some have suggested getting rid of the software altogether to avoid having to go through this type of PF table analysis again. nancy
Dr. Tanner Andrews writes:

) The theory here is that running 100 units for 100 hours gives you
) the same information as running one unit for 10000 hours.

The theory is crocked. It builds heat slowly. The actual behavior:

    100 hours: a little warm
    200 hours: case is softening
    250 hours: case melts
    257 hours: catches fire

The times and failure modes will vary, depending on the type of device in question.

He's just re-discovered the problem of correlated failures, which was what my whole article was about. I gave a very similar example concerning digital watches.

Martyn Thomas asks:

How can we have confidence that the means by which we have combined the n-versions [of an n-version program] (for example, the voting logic) has a failure probability below 1 in 10^9?

Clearly, we can't. If your view of an n-version program is that it just produces some numbers that you somehow combine to get some other number, you've got a problem. But the real issue is how you build reliable systems that somehow affect the real world.

Consider the brake system in an automobile. It is divided into independent halves from the master cylinder on; the halves control diagonally opposite pairs of wheels. Either half can stop the car, and failures that affect both are "unlikely".

Now suppose we wished to build a computer-controlled brake. We might try to get reliable operation by having redundant computers and a voter which then applied all four brakes. But it would make much more sense to have a pair of independent computers controlling diagonally opposite pairs of wheels. The "voter" is then the car itself, and physical laws guarantee that if either vote says "stop", the car stops. (This actually comes full circle to a comment I made a number of months back about the significance of physical laws in mechanical systems, and the lack of such "enforced by the universe" laws in digital systems.)

This is a HARD problem, there's no denying that!
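The brake comparison can be made concrete with a rough sketch. Everything here is an assumption for illustration: the function names, the choice of three computers behind the voter, the sample probabilities, and above all the premise that channels fail independently apart from an explicit common-cause term. It is not a model of any real braking system.

```python
def voted_failure(p_channel: float, p_voter: float) -> float:
    """Three computers feeding one voter (hypothetical layout).

    The system fails if the voter itself fails, or if at least
    two of the three channels fail on the same demand.
    """
    majority_fail = 3 * p_channel**2 * (1 - p_channel) + p_channel**3
    return p_voter + (1 - p_voter) * majority_fail


def split_failure(p_channel: float, p_common: float = 0.0) -> float:
    """Two independent halves, either of which can stop the car.

    The system fails only if both halves fail, or if a common-cause
    event (probability p_common) takes out both at once.
    """
    return p_common + (1 - p_common) * p_channel**2


if __name__ == "__main__":
    p = 1e-3  # assumed per-demand failure probability of one channel
    print("voted :", voted_failure(p, p_voter=1e-6))
    print("split :", split_failure(p))
    print("split with common cause:", split_failure(p, p_common=1e-5))
```

Under these assumptions the voter's own failure probability and the common-cause term set a floor that no amount of channel redundancy removes, which is why the estimate of correlation between modules matters so much.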
How can we be sure that our analysis of the upper bound on failure correlation among modules is accurate? How accurate does it need to be - does it need to have a probability of less than 1 in 10^9 that it is grossly wrong? (By "grossly wrong" I mean wrong enough to invalidate the calculation that the overall system meets the "1 in 10^9" figure). This would seem impossible. Consider, for example, the probability that the common specification is wrong.

We can never be SURE. Try and come up with any analysis that makes you SURE (with high probability) that your car will stop when you hit the brakes - or, for that matter, that the sun will rise tomorrow. The language of mathematics is very misleading here: Mathematics deals with models of the world, not the world itself. There is no certainty, even of probabilistic estimates, in the world. But we have to muddle on.

I'm not knowledgeable enough in statistical theory to comment on how one even measures the correlation, much less what the appropriate tests and sample sizes are. On the other hand, fairly small experiments showed the FAILURE of the independence hypothesis for naive n-version programming.

Also, it's worth commenting on an assumption about testing that many people make implicitly: That only tests that do NOT use special knowledge about the system being tested are acceptable. In fact, hardly anything is ever tested that way - it's just not practical. It requires too many tests and takes too long, often longer than the useful lifetime of the object under test.

Consider, for example, computing MTBF for disks. People ask how a manufacturer can come up with an estimate of 100,000 hours on a fairly new product. The answer is two-fold: For failures that occur essentially at random during the lifetime of a disk, testing a number of disks in parallel gives you a valid estimate.
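The parallel-testing arithmetic for the random-failure case can be sketched as follows. This assumes a constant failure rate (exponentially distributed lifetimes), under which only the total number of unit-hours matters; the function name and the sample numbers are illustrative, not taken from any manufacturer's procedure.

```python
def mtbf_estimate(units: int, hours_each: float, failures: int) -> float:
    """Point estimate of MTBF: total unit-hours divided by observed failures.

    Valid only when the failure rate is constant over time, i.e. there
    is no wear-out or other age-related failure mechanism.
    """
    if failures <= 0:
        raise ValueError("need at least one observed failure for a point estimate")
    return units * hours_each / failures


# Hypothetical test: 1000 disks run for 500 hours each, 5 failures observed.
print(mtbf_estimate(1000, 500.0, 5))  # 100000.0
```

Under the constant-rate assumption, 100 units for 100 hours and one unit for 10000 hours are the same 10000 unit-hours. Once the failure rate depends on age, the two tests no longer carry the same information.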
For failures related to aging - i.e., those whose probability goes up as time in service goes up - there are a variety of "accelerated aging" techniques. Almost anything that's the result of a chemical reaction (e.g., deterioration of lubricants) will proceed faster if you run the device at higher temperatures. Similarly, Dr. Andrews's slow heating will occur much faster in a higher temperature environment. Many kinds of mechanical failures are due to CHANGE in temperature; a common test environment cycles the temperature repeatedly. Similar considerations apply to humidity. Vibration stresses can be easily applied.

Someone asked about bearing failure after many thousands of hours. It may take a bearing thousands of hours to fail, but the failure process doesn't suddenly happen - subtle changes are going on in the bearing and the lubricants over that period of time. A close examination - much more than a "yes, it's still turning" determination - will find chemical and physical changes: Breakdown of the lubricant, migration of metal into the lubricant (whether in macroscopic (chips of metal) or microscopic (dissolved metal) quantities), scoring of the bearing races, changes in metal crystal structure.

We have a huge amount of experience with these kinds of systems, and know what their plausible failure modes are. Are we always right? No, of course not - but again, we have to muddle through.

Paul Ammann writes:

> 1. Testing (whether by explicit test in a lab or by actual
>    use in the field) of very large numbers of copies of
>    the system
> 2. Functional decomposition of the system into a number of
>    modules such that failure can occur only when ALL the
>    modules fail.

The first technique assesses performance directly, and can be applied to any system, regardless of its construction. As Jerry points out, various assumptions must be made about the environment in which the testing takes place. The second technique estimates performance from a predictive model....
I am uncomfortable with merging the issues of direct measurement with those of indirect estimation. The difficulties in 1 are primarily system issues; details of the various components are by and large irrelevant. In technique 2 the major issue is the failure relationship between components. I don't believe the distinction is sharp. Again, most type 1 testing is NOT a naive "try it for a while and see what happens"; one designs tests based on assumptions about plausible failure modes. This is, in effect, a predictive model: We predict that we've isolated all the important contributors to system failure. If we're careful, we even TEST that prediction: After all our tests are complete, we check to see how many failures were the results of causes we did not include in designing our tests. If there are too many, we may have to go back and do it again. Conversely, we can simply build the system from what we think are independent modules and then do brute force testing for overall reliability. The Eckhardt and Lee model (TSE Dec 1985) makes it clear that performance prediction is much more difficult. To evaluate a particular type of system, one must know what fraction of the components are expected to fail over the entire distribution of inputs. The exact data is, from a practical point of view, impossible to collect. Unfortunately, minor variations in the data result in radically different estimates of performance. For a specific system, it is not clear (to me, anyway) what an appropriate "upper bound of failure correlation among modules" would be, let alone how one would obtain it. See my earlier comments. I don't believe there is any magic solution to this problem; just as in the design of physical artifacts, it's something we'll just have to learn about and solve on a case by case basis. > Either technique can be used to get believable >failure estimates in the 1 in 10^8 (or even better) range. Such >estimates are never easy to obtain - but they ARE possible. 
> Rejecting them out of hand is as much a risk as accepting them at face value.

This statement came out sounding stronger than I intended. I don't believe we have the capability today to build a computer-based system for which we could believe error estimates of this sort. Nor do I see any techniques available today that could provide such an estimate. However, I don't see any fundamental reason to believe that such techniques could not exist.

BTW, it's also worth considering just how strong such a guarantee is, and in particular how many of the systems we already deal with in the world are much, much riskier. If I remember the numbers right, about 30,000 people die in car accidents every year. If we do some really stupid estimating, and assume that everyone in the US (about 3*10^8 people) gets into a car once a day, then my chance of dying in a car accident is 1 in 10^4 each year. Not a very reliable system, is it?

In fact, I've always found it interesting how much more we demand from digital systems than we demand from mechanical ones. For example, we always reassure beginners that no incorrect input to a program can physically harm the machine. And yet, consider what will happen to your car should you take it down the highway at 60mph and then suddenly shift into reverse. Does this bother you? Does it make you afraid to drive?

-- Jerry
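The back-of-the-envelope car estimate above can be written out as a short calculation. The input figures are the ones quoted in the message (about 30,000 deaths a year, about 3*10^8 people, one trip per person per day); the per-trip figure is an extra step added here for illustration, and the function names are hypothetical.

```python
def annual_risk(deaths_per_year: float, population: float) -> float:
    """Crude per-person probability of dying in a car accident in one year."""
    return deaths_per_year / population


def per_trip_risk(deaths_per_year: float, population: float,
                  trips_per_year: int = 365) -> float:
    """Same estimate spread evenly over one trip per day (an assumption)."""
    return annual_risk(deaths_per_year, population) / trips_per_year


if __name__ == "__main__":
    yearly = annual_risk(30_000, 3 * 10**8)
    trip = per_trip_risk(30_000, 3 * 10**8)
    print(f"per year: about 1 in {round(1 / yearly):,}")
    print(f"per trip: about 1 in {round(1 / trip):,}")
```

This is the same "really stupid estimating" as in the text: it ignores differences in exposure between people and treats every trip as equally risky.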
The following message was posted on a local bulletin board.

Msg#:28798 *HOUSTON SHOUTS* 02-14-91 22:23:32 (Read 10 Times)
From: DONALD SAXMAN
To: HELGA
Subj: WRITE YOUR CONGRESSMEN

(This message is really for anyone). It was recently brought to my attention that the Saudi Arabian government has replaced American traffic controllers in the Desert Shield war zone with native Saudis. This was done partly to appease Saudi nationalists and partly because some of the American military air traffic controllers were women. (Saudi and other Islamic military pilots aren't particularly fond of being directed by women, but they can live with it. Saudi civilian pilots reportedly would refuse to even listen to instructions from female air traffic controllers, pretending they didn't exist).

Anyway, the Saudi controllers may or may not be as good at their job as the Americans. But they reportedly don't speak English very well. (Sidebar: English is supposed to be the international air traffic control language, but there are some holdouts that don't follow this standard. Many of these are Islamic countries, although Saudi Arabia apparently does use English-speakers.)

Anyway, UN Coalition forces are already having trouble coordinating operations. Pilots who operate from outside of the war zone, like refuelers or B-52s, are particularly at risk.

Anyway, there has been a suggestion made that users write their Congressmen and complain about this situation. It couldn't hurt. If anyone out there has Usenet or Fidonet access, I'd appreciate them forwarding this message so that it gets as wide exposure as possible.

(email@example.com)
Grand Central Terminal in New York City has a number of ticket vending machines that permit travellers who don't want to stand in long lines to purchase tickets. I had an unpleasant experience with one of them a while ago. I arrived at the terminal before the 6:00AM opening of the ticket offices with the intention of taking a 6:10AM train. Not knowing that the ticket offices would open at 6 (it's not posted) I went to one of the ticket machines and pressed the code for my destination. It told me to insert $8.60. The signs on the front of the machine informed me that I could use bills of any denomination up to and including $20. Having only $20s on me (bless the cash machine) I inserted one of them. The light at the little window where tickets and change are delivered flashed just as if it was delivering my ticket (I'd used the machine before and I knew what to expect) but nothing came out. My money wasn't returned, I got no ticket, and I got no change. Deciding that putting another $20 into the machine wasn't wise, I wandered around the main concourse looking for someone in authority for a few minutes until the ticket windows opened. I explained the situation to the ticket agent (after buying a ticket to my destination) and he said "Oh, yes, it does that if you put a 20 in for a purchase less than $10," and gave me a form to fill out requesting a refund. A few weeks later I received a letter from some official of the rail line asserting that they hadn't found any excess $20 bills in their machines and implying that I had been attempting to cheat them out of money. He also asserted that there were instructions on the machine explaining not to use $20 bills when less than $10 worth of tickets was being purchased. 
I had looked for such indications on the occasion of my loss and not found them, though on my next visit to the station after receiving the letter I found that little signs saying that had, in fact, been glued to the face of the machine in the intervening time. I wrote another letter suggesting that when the machine detected this problem it could print out a receipt on ticket stock and give it to the user so that he would have documentation for his loss. This letter, which included several other suggestions for simple, inexpensive solutions to the problem, evoked a rather hostile letter in response. At that point I gave up, though I did fantasize about blowing the machine up for several weeks after. Marc Donner
I had heard that having to have a purchase voided out can tie up credit block allocations for a while, but here's my experience of last Friday that illustrates what can happen when Murphy really gets rolling... I had a moderately sized chunk of cash I needed to pay for with my VISA card, over at the local Pay 'n Save. Half the staff was out sick and the checkers who were in were overworked. The checker opens a new register and runs my card through. Nothing happens. She finds that the printer for the reader is turned off. She turns it on. Still nothing. So she runs the card through again. This time, everything acts as normal until it goes to print out a register slip for me to sign. The paper on the printer jams. Much swearing later, she tries to get into the register to void the purchase, and finds that it has billed me TWICE for the amount of purchase, once for when the printer was off, once for when the printer jammed. The register doesn't know how to handle this, and refuses to void the second charge. She has to go find the manager, who manages to consult the reference manual and get the printer voided. Okay, now that things are finally (theoretically) working, they run my card through again. It comes back declined. Why? The two rather largish sums that were voided are stuck in my credit allocation, of course. Of course it went through, the first two times. I suppose I could have written a check at that point, but I decided to stubborn it out. (Besides, I was interested in finding out what could be done, now.) The salesgirl calls the VISA number, hoping she can get a manual override. A mechanical voice wants her to punch in the store code. She doesn't have it. She hangs up, goes to another person, gets the store code. She calls again, punches in the store code, then is asked for the credit card number. I'm sorry, says the other end, that credit is declined. Click. She gets the manager. 
He punches things in, and manages to get a real human being, and tries to explain what's going on to the credit authorization person. That it was okayed the first two times, but now (with two voided charges on the allocation) it's topped my credit limit. I'm sorry, says the credit drone, but I am not authorized to okay that credit authorization, despite what you tell me. That can only be done by the credit card company. Fine, says the manager, do you have a number for Citibank? Call the customer service number on the back of the card, she suggests. He calls the number, and explains to the Citibank service representative what has gone on. The Citibank rep says that if he puts the card through one more time, he can manually override the declined credit order. They put my card through one more time. The credit goes through. I get a slip to sign. However, you can bet I will be looking at my next bill very carefully. Also, the override was obviously a once-only, and the credit is still set aside, somewhere, as I tried to use my card for a small purchase, this week, and it was declined. That second charge is still in the system, somewhere. Some time, someone is going to have to program the system to accept voids. --Jane Beckman [firstname.lastname@example.org]
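The behaviour described in this story, where voided charges keep counting against the credit limit, can be sketched with a toy ledger. This is purely a hypothetical model: the class, its methods, the `release_hold` flag and the dollar amounts are all invented for illustration and say nothing about how VISA or Citibank actually implement authorizations.

```python
class CreditLine:
    """Toy model of a credit limit with authorization holds."""

    def __init__(self, limit: float) -> None:
        self.limit = limit
        self.holds = {}      # authorization id -> amount held
        self._next_id = 1

    def available(self) -> float:
        """Credit remaining after subtracting all outstanding holds."""
        return self.limit - sum(self.holds.values())

    def authorize(self, amount: float):
        """Place a hold and return its id, or return None if declined."""
        if amount > self.available():
            return None
        auth_id = self._next_id
        self._next_id += 1
        self.holds[auth_id] = amount
        return auth_id

    def void(self, auth_id: int, release_hold: bool = True) -> None:
        """Void a charge. With release_hold=False the hold stays in place,
        mimicking the behaviour in the story."""
        if release_hold:
            self.holds.pop(auth_id, None)


if __name__ == "__main__":
    line = CreditLine(limit=1000)
    first = line.authorize(400)
    second = line.authorize(400)
    line.void(first, release_hold=False)
    line.void(second, release_hold=False)
    print(line.authorize(400))   # None: declined, the voided holds still count
    line.void(first)
    line.void(second)
    print(line.authorize(400) is not None)   # True once the holds are released
```

In this sketch the fix is a single line: a void has to remove the hold it created.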
I have become sensitive to my exposure due to electronically compiled and disseminated personal data, but, until recently, I had never considered ways in which the users of such data expose themselves to possible losses. I was both amused and disconcerted to learn that a company which uses a credit reference service makes it easier for a competitor to target customers through traces which are maintained by the credit agency.

This last week I received in the mail, from MCI, an offer for a rebate in exchange for electing them as my long distance carrier. [Ignore for the present discussion ethical issues raised by the particular incentive mechanism which MCI employed.] I had expected, and did receive, a number of enquiries from various alternative carriers at the time when equal access provisions went into effect in this area. I was, however, perplexed as to why they chose to target me now.

It took a bit of reflection, but I finally concluded that one focus of MCI's current mailing is the holders of ATT Universal cards. [MCI used an address which gave them away.] Not really the kind of thing which one company would deliberately give to a competitor. So I called ATT to ask what happened. I was informed that they knew the likely path which the information had traveled, but that once they had made a credit enquiry, they were powerless to prevent MCI from approaching the credit agency and obtaining a list of those people for whom ATT had requested credit histories.