A major power outage here on Tuesday demonstrated the risks of excessive automation and administrative convenience. Our computing environment consists of a heterogeneous network of Sun, DEC, and IBM workstations and related file servers. When a Sun workstation boots, a hardware PROM issues a RARP request to establish the workstation's network address and to identify a server that can provide it with a boot program and then the Unix kernel. The boot PROM uses the Trivial File Transfer Protocol (TFTP) to request the boot program. It initially issues a TFTP request to the server it has identified, but if that request times out, it broadcasts a TFTP request on its local network looking for any server that can provide it with a boot program. It repeats this process until it receives a boot program. On the Suns, the PROM has no built-in knowledge of its own network address or the network address of the server. There are some good reasons for keeping the boot PROM ignorant of its network environment and using a broadcast protocol, including the administrative convenience of not having to reconfigure workstations when the server changes, and a degree of robustness in a multi-server environment.

In recent years there have been security problems related to the TFTP protocol, so in our environment the DEC workstations run security monitoring software that keeps a log of failed TFTP attempts to help detect potential intruders. The security software writes a log file of failed TFTP requests and also puts a message on the affected machine's console.

What got us into trouble after the power outage was that the Sun workstations came back online, but the corresponding Sun servers came up in a wedged state in which they responded to the initial RARP request but then failed to respond to any workstation's TFTP request for a boot program. After the initial TFTP request to the Sun server timed out, our network was flooded with TFTP requests from many Sun workstations, all trying to find any server that could boot them. In the meantime, the DEC workstations on the network had rebooted successfully and were being used by a number of professors and students. However, these machines soon became unusable due to the effort required to deal with the flood of TFTP requests. The security monitoring software contributed to this problem by writing messages to each machine's console window (ignorable, but consumptive of resources) and by almost filling a critical file system with its log files. If that file system had filled up, the machines would have been totally unusable. Even if we hadn't been running the security monitoring software, the usability of these workstations would have been impaired by the handling of the TFTP requests.

There are several things that could have been done better:

- The question of whether falling back to a broadcast protocol for booting is the right approach should be reexamined. On most systems the set of servers that could successfully respond to a boot request is (a) small, (b) well known, and (c) changes very slowly over time.

- The boot PROMs should use some form of backoff strategy when TFTP requests consistently fail, to avoid overloading the network (a sketch of one such strategy appears after this note).

- Our security logging software needs to be more robust in dealing with its log files. Waiting until a log file write fails due to a full file system is too late if the full file system will cause other processes to crash.
This is tricky, since we don't want to introduce a mechanism that would allow an intruder to overwhelm the security software with failed attempts and then proceed to do dirty work once logging has been suspended due to log file overflow. A curious legal question comes to mind: could the manufacturer or the proprietor of the workstation containing the boot PROM be held guilty of a "denial of service attack" on our DEC workstations? If an individual had issued all of those TFTP requests, we certainly would be considering the question.
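On the backoff suggestion above, here is a minimal sketch of bounded exponential backoff with jitter, in C. All of the names (send_tftp_broadcast, reply_received, msleep) are hypothetical placeholders; real boot-PROM code has no C library and would be far more constrained.

    #include <stdlib.h>    /* rand() */

    #define BASE_DELAY_MS  1000u     /* first retry after about 1 second */
    #define MAX_DELAY_MS   60000u    /* never wait more than a minute    */

    extern int  send_tftp_broadcast(void);  /* hypothetical */
    extern int  reply_received(void);       /* hypothetical */
    extern void msleep(unsigned ms);        /* hypothetical */

    void boot_retry_loop(void)
    {
        unsigned delay = BASE_DELAY_MS;

        for (;;) {
            if (send_tftp_broadcast() == 0 && reply_received())
                return;             /* got a boot program */

            /* Sleep for the current delay plus up to 25% random jitter,
               so workstations that rebooted together do not retry in
               lockstep and flood the network. */
            msleep(delay + (unsigned)rand() % (delay / 4 + 1));

            if (delay < MAX_DELAY_MS / 2)
                delay *= 2;         /* double the delay, up to the cap */
            else
                delay = MAX_DELAY_MS;
        }
    }

Even a strategy this simple would likely have reduced the post-outage flood to a trickle within a few minutes.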
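And on the log-robustness point, a sketch of a proactive headroom check before each log write. statfs() is the traditional Unix call for file system statistics; the log path, mount point, and threshold are illustrative assumptions. Per the caveat above, this addresses only the full-file-system hazard, not the harder problem of an intruder deliberately flooding the log.

    #include <stdio.h>
    #include <sys/vfs.h>    /* statfs(); <sys/statfs.h> on some systems */

    #define LOG_PATH  "/var/log/tftp_failures"  /* hypothetical path */
    #define HEADROOM  (4L * 1024 * 1024)        /* keep >= 4 MB free */

    /* Returns 0 on success, -1 if the write was suppressed or failed.
       The caller can count suppressed entries and log one summary line
       later, rather than suspending logging silently. */
    int log_failed_tftp(const char *msg)
    {
        struct statfs fs;
        FILE *fp;

        if (statfs("/var", &fs) == 0 &&
            (long)fs.f_bavail * (long)fs.f_bsize < HEADROOM)
            return -1;  /* near-full: don't take the last few megabytes */

        fp = fopen(LOG_PATH, "a");
        if (fp == NULL)
            return -1;
        fprintf(fp, "%s\n", msg);
        fclose(fp);
        return 0;
    }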
I'm an editor at Windows Magazine. In our May issue I wrote a news story reporting a bug in the Cyrix Cx486DX CPU. The Cyrix Cx486DX was designed to be completely software-compatible with Intel's i486DX processor. However, Ed Curry of Lone Star Evaluation Labs (LSEL) found a bug relating to floating-point operations while doing some in-depth compatibility testing. Cyrix shipped thousands of chips with this bug before April 1994, but has now fixed the problem.

The bug occurs when a register load instruction (such as MOV reg,mem) is followed by an instruction that clears the floating-point status register (FCLEX). If the memory location being referenced is in the CPU's internal cache, the MOV instruction works fine. If, however, the MOV requires an external bus cycle, executing the FCLEX instruction aborts the cycle. As a result, the register is not loaded properly. The risk here is that someone may run software on the Cx486DX that generates incorrect results where an i486DX would work fine.

The Cyrix position is that this is a minor bug and that we (Windows Magazine and LSEL) are making too much of it. However, LSEL has seen the bug in their test code compiled under OS/2 and Windows NT. The test code performs typical engineering and scientific calculations, so it's not contrived or artificial. We have not found the problem in any shrink-wrapped application. Most MS-DOS and Microsoft Windows compilers insert an FWAIT instruction before any floating-point instruction, so programs built with them generally won't exhibit the problem.

What does the Risks readership think? Are we making too much of this? Is anyone out there using a PC with a Cx486DX?
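For concreteness, here is a sketch of the failing pattern written as GCC-style inline assembly in C. The function and its surroundings are illustrative assumptions (LSEL's actual test code is not reproduced here); only the MOV/FCLEX sequence comes from the report above.

    /* Illustrative only.  On a buggy Cx486DX, if the MOV below misses
       the on-chip cache and starts an external bus cycle, the FNCLEX
       (FCLEX without a preceding FWAIT) reportedly aborts that cycle,
       and the destination register is not loaded properly. */
    int load_then_clear_status(volatile int *mem)   /* hypothetical */
    {
        int reg;

        __asm__ volatile(
            "movl   (%1), %0\n\t"  /* MOV reg,mem: may need a bus cycle */
            /* "fwait\n\t"            inserting FWAIT here is the usual
                                      workaround compilers apply        */
            "fnclex"               /* clear FP status: triggers the bug */
            : "=r"(reg)
            : "r"(mem)
            : "memory");
        return reg;
    }

Timing-dependent failures like this are exactly why shrink-wrapped applications can mask the bug: whether the MOV hits the cache depends on everything that ran before it.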
I accepted a no-installation-cost trial of Caller ID and found it somewhat useful for correlating call times with answering-machine messages, but 90% of my received calls were from outside my area and thus had no number displayed, only the date and time. Last night I noticed that the box had cleared out its memory. No call had been received on that line between the time I had last checked it and the time I noticed the empty list. The risk here is that if some message can be sent to the box through the phone line to clear its list, the box becomes less useful for someone trying to catch a crank caller, or even to log when important calls or messages were received. If the Caller ID protocol includes such a message, it could undoubtedly be faked by someone with physical access to a residence's network interface or to telephone-company signalling. Boxes more advanced than the promotional box I was given may well have precautions or a printed log, but I would imagine that the promotional boxes are widely used. --Robert
The voicemail system I use allows incoming FAXes to be saved and handled as messages. Upon receipt, the system notifies the user that there is an incoming FAX message, and you can even query for the number of pages. When a message exists in the "voice mailbox", one can have the system forward it to a real FAX machine (either a preselected "primary" FAX or any other phone number). Requesting such a forward places the FAX message into a queue, meaning it may actually be sent at some future time.

Last week I received a 5-page FAX message. It did not come from a local caller (one on the same telephone switch). All I knew was that it was five pages. I sent it off to my primary FAX machine, and an hour or so later went to pick it up. No FAX for me there. I tried again. No FAX for me. FAX machine broken? After a day of this, I sent the FAX to a machine and promptly went to watch. Out came a list of imported tequila prices, and several blank pages! I recalled seeing several such lists at the other FAX machines... but none were addressed to me! Surely they weren't mine... but a closer inspection showed that the FAX phone number listed was indeed mine (perhaps a missing area code?). Whoa! That kind of business is illegal here! And I had been spreading the things around the area. At least I didn't have my name on them, but the phone number was mine!

Welcome to the wonderful world of hi-tech. It used to be that FAX machines were relatively rare, and "dialing" a wrong number would mean the FAX didn't get sent. Now, EVERY phone here can receive a FAX, and we can send multiple copies out without knowing what it is we sent! Yes, I'll be a bit more careful in the future.

[A surprising number of readers chided me for NOT having appended a "You mean a FAX PAS? PGN" addendum. THANKS! PGN]
The memo Jerry Leichter posted was an actual Silicon Graphics memo. However, life for Silicon Graphics and Tom Davis is not quite so bleak as some might think. Tom Davis wrote the original memo to point out problems and ask everyone to help fix them. It was very effective. I installed a beta version of the new 5.2 release on my Indy in January, and only rebooted the machine a couple of weeks ago because I was moving to another building. Sure, I had to add another swap file on-the-fly about once a month because my emacs processes grew so large :-), but the system did not crash. And performance is quite snappy. "Watch the skies." Since the memo has been popping up all over the net, Tom has written a reply to it, included below. There isn't really a RISKS tie-in, unless you count the risk of having only the "bad" half of a story get wide distribution.

Joan Eslinger / Silicon Graphics / email@example.com

-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

I am the author of the original memo below, which was intended for internal Silicon Graphics use only, and was not for anyone outside the company. But since it has been leaked to the net, and is beginning to be used by competitors' sales people, I feel a response is required. I don't believe that these problems are unique to Silicon Graphics. From discussions with friends who are insiders in many different companies, I am certain that similar memos could be written about the software of each of our competitors. What I like about working for Silicon Graphics is that at least here, something is being done about it -- I worked for companies in the past where the response would have been to stick our heads in the sand in hopes that the problems would just fix themselves. If I hadn't thought that the memo would catalyze some change here, I wouldn't have written it. The details appear as comments to my original article below. Luckily, the article is six months old, and we have had a chance to make some significant progress. Typically, each faster generation of hardware is followed by software that more than compensates for the increased speed, but as a result of this memo, Silicon Graphics has been able to skip one of the slowing software cycles, making, instead, a performance- and quality-based release. The next release is going to be similar, and in the meantime we get an extra hardware boost from the faster R4600 processors.

-- Tom Davis, Silicon Graphics

General comments: As a fairly direct result of this memo, SGI decided not to continue "business as usual" in software development. The approach we took to the problem was the following: With the 4.0.5abcdefghi... fiasco, and the fact that the 5.* releases were still for specific machines, our developers were desperate for an all-platforms release. We decided to make such a release relatively soon -- and 5.2 actually MRed in February. The 5.2 release had two goals: primarily, all-platform support, and, given that it went out in February, as much performance tuning and bug fixing as time allowed. In that period, the performance of 16MB systems was essentially doubled, which improved performance on larger systems as well, but to a lesser degree. Significant numbers of bugs were fixed as well. Some people hoped that a few quick fixes would bring back all the performance in 5.2, but a little investigation indicated that there was a list of things to be done, and that another quality release would be required.
The 5.3 release, not officially scheduled but expected to MR around October or November, is that quality (performance and bug-fix) release. We'll add a few new features, but they will be the exception rather than the rule. The longer time before the 5.3 release should give us time to do a thorough job of solving our problems. For 5.3, there's also time to set up solid performance and bug-fixing goals, and these are already being discussed.

And most important -- the worst problems were with 16 MB systems that paged their brains out. They are better now, but not great; we no longer sell them. One of the 5.3 goals is to improve performance (or reduce sizes enough) that it will be acceptable on a 16 MB machine. The kernel memory leaks are all fixed, and many of the important programs have been reduced in size. For 5.2, five or six of our most heavily used programs were subjected to close scrutiny to find out where the performance went, and many were significantly improved. A lot more work is planned for 5.3 to reduce the sizes of the executables. Work is continuing on the DSOs to split them up properly so that they don't all have to be loaded, and to improve their performance and start-up time. We're working to make "quick-starting" happen more automatically.

> PERFORMANCE UPDATE

I don't think it's unusual to do benchmarks with non-standard compiler settings. Both we and our competition have done so for a long time. We do ship all the libraries, et cetera, necessary to duplicate these results, so customers for whom speed is the only objective can pay the cost of larger executables in exchange for the added speed. Unfortunately, I can't re-run some of these tests, but 5.2 is definitely better than 5.1. I think the 5.1 fiasco has caused a lot of our management to see the light, and in conversations with people at all levels, it's clear that nobody wants to see anything like it happen again. The 5.2 and future 5.3 releases seem to be steps in the right direction. But there's still a lot of work to do, and we in engineering can use every minute between now and the 5.3 release to improve things. The 5.3 release is being planned with reasonable beta cycles, and with enough time between now and "code freeze" to make significant improvements.

> Management Issues:

I think this sort of disconnect is not too unusual -- there is always enormous pressure to announce a very low entry price point, and the 16MB system provided that. Everybody does this with the full knowledge that on a minimum system you won't be able to run many interesting applications, and almost everyone will have to purchase a bit more memory. It's just that in the case of Indy, there were so many new features that the proposed minimal system was embarrassingly slow. The "fix" is simply not to ship the 16MB systems, which will ensure that everyone gets a very usable machine. All we really lose is our low entry price point, and the gain is that we won't have to deal with the few irate customers who bought a minimal system. Although some of our performance loss is due to more complicated features, the vast majority is due to the fact that more memory is required, and without it the systems page, with a consequent dramatic reduction in performance. The 4.0.X -> 5.X change on our large machines was measurable, but not nearly so noticeable as on the smaller ones. We're still not completely there (as far as I can tell) with respect to better software management.
The good thing is that many of our higher-level managers are acutely aware of the problem now -- Forest Baskett and Tom Jermoluk are extremely concerned, for example. It's too bad it took a shock like 5.1 to make everybody take notice, but at least they did, and we're doing the right sorts of things to correct it.

[Moderator deleted the entire interstitiated message from RISKS-15.80. PGN]
For years, people have been postulating projects that are too complicated to comprehend, and we have seen several examples of what happens when this occurs. IMHO the only solution is to separate functions into stand-alone pieces that are themselves understandable and that rest on a common, understandable foundation. Where many have felt that a single integrated system is best, I have often been called in to "put out fires", and the first thing I do is to separate the problem into "atoms", the least divisible pieces. It is astounding how often a problem that cannot be seen when tightly wrapped in a package becomes obvious when viewed by itself. Sometimes you just can't see the tree for the forest.

> Some people claim that we need new software debugging tools to look at
> the problem, and that may be true, but it's not a short-term solution,
> and it runs the risk of causing us to spend all our time designing
> performance measurement tools, rather than fixing performance.

This is disturbing. Unless you have the tools to properly examine a system, you cannot tell what is really going on, and the recurring theme of the memo seems to be that no one knows. Without the proper tools, the job will never be completed. Again I can only speak from personal experience, but I cannot count the times when, called in to fix a problem, I have had supervisors get very antsy waiting for something to happen while the envelope was still being defined. I have found that unless the system is understood, it *can't* be fixed (see the "little silver hammer" syndrome).

The problem with the engineers also appears symptomatic. Engineers are supremely good at taking a concept and making it work. They are not generally good at determining that a concept is flawed in the first place; often they will instead continue to work as if the concept were correct and they were just lacking in skill. This leads to precisely the morale problems described. The major problem with engineers is that they accomplish the impossible so often that the marketeers come to expect it of them.

The real problem seems to be simply "no one in charge", and it is all too common in large organizations. History is rife with examples of companies, states, and countries that became too concentrated at the top and fell victim to the huns/vandals/Standard Oil as they rose to power.

"Think of it as Evolution in Action" - Jerry Pournelle

Padgett
Inspecting Critical Software: An Intensive 3-day Course

Offered by The Faculty of Engineering, McMaster University, Hamilton, Ontario, Canada
Taught by Prof. David Lorge Parnas, with the support of TRIO
June 7, 8, 9, 1994

1. Background

Software is critical to the operation of modern companies and is frequently a key component of modern products. Some pieces of software are particularly critical; if they are not correct, the system will have serious failures. Standard methods of software inspection are not systematic. This course teaches a procedure for software inspection that is based on a sound mathematical model and can be carried out systematically by large groups. The software inspection procedure combines methods used at IBM, work originally done at the U.S. Naval Research Laboratory for the A-7E aircraft, and procedures applied to the inspection of software at the Darlington Nuclear Power Generating Station. The method has been refined and enhanced by the Software Engineering Research Group at McMaster University's Communication Research Laboratory. It can be applied to software written in any imperative programming language.

2. What Will Participants Learn?

Participants in the course should return to their workplace with an understanding of the way that mathematics can be used to document and analyze programs. They will also return with documentation of a piece of their employer's code that can be used to explain the work to others.

3. Programme

Day 1: Predicate Logic and Program-Functions/Relations

1) Overview and Case Study: A discussion of previous applications of the method.

2) Predicate Logic: The inspection method is based on predicate logic, which will be reviewed in this session.

3) Tabular Expressions: This session will be devoted to the writing of readable predicates using two-dimensional notations rather than classical one-dimensional expressions. There will be numerous examples. Participants will be taught to read and write tabular expressions.

4) Describing Program Function: This session will be devoted to writing program descriptions using predicates and tables.

Day 2: Inspection of Dijkstra's Dutch National Flag Program

Participants will be given a copy of E.W. Dijkstra's explanation of a program along with several sample programs. They will be asked to apply the inspection method and approve or reject each program. The instructor and some assistants will be available as consultants during this process.

Day 3 Morning: Inspection of a "Real" Program

Working in small groups, the participants will take a section of a program from their company and inspect it using the method learned so far, producing documentation as they go.

Day 3 Afternoon: Report on the Inspection Results, Discussion of Testing

The first part of the afternoon will be devoted to a series of reports by the participants on the results of their efforts in the morning. The remainder of the afternoon will be devoted to a discussion of the interaction between testing and inspection. We treat testing not as an alternative to inspection, but as complementary to it. We discuss the way that the documentation produced in the inspection process can be used in the testing process.

4. Learning By Doing

The course is language-independent. In fact, on the third day, participants will inspect code written in whatever language they use in the workplace.
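To give a flavor of the tabular notation described under Day 1 (this small example is illustrative and is not taken from the course materials), the function computed by a program that stores the absolute value of x in y can be written as:

    \[
    y' \;=\;
    \begin{array}{|c|c|}
    \hline
    x \ge 0 & x < 0 \\
    \hline
    x & -x \\
    \hline
    \end{array}
    \qquad
    x' = x
    \]

Here primed variables denote values after execution; each column pairs a condition (top row) with the value y takes when that condition holds (bottom row). The two-dimensional layout makes it easier to check that the cases are disjoint and cover all inputs.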
This course presents an approach to active design reviews in which the reviewers write precise documentation about the program and explain their documentation to an audience of other reviewers. A significant part of each day will be spent using the ideas that have been presented to determine whether or not programs do what they are supposed to do. On the last day, participants will inspect a small program that they brought with them from their company. Participants should leave the course with an improved ability to inspect software.

5. Who Should Attend?

Participants should be experienced programmers who are not afraid of learning a little mathematics. The mathematical basis for the method is classical and takes up only a few hours in the course; however, it is fundamental to understanding the method. It is expected that participants will be used to reading code written by others, and it will be helpful if they can read Pascal.

6. What Should You Bring With You?

For the exercise on the third day, each participant should bring a small program, perhaps 50 lines, that is critical to some project. It need not be "mature" code, but it should compile and have survived some testing or use. If there are several participants from the same company, they may work in small groups on slightly larger programs. You may want to bring a reference manual and some conventional documentation about the program with you. It will help if one of the participants is familiar with the program.

7. The Instructor

The course will be taught by Prof. David L. Parnas, an internationally recognized expert on Software Engineering. Dr. Parnas initiated and led the U.S. Navy's Software Cost Reduction Project, where the tabular notation was first used; advised the AECB on the use of these methods at Darlington; worked with IBM's Federal Systems Division; leads the Software Engineering Research Group at McMaster University; and is a Project Leader for the Telecommunications Research Institute of Ontario.

Information about costs, registration, etc. can be obtained from: Jan Arsenault, Faculty of Engineering, JHE-201A, McMaster University, 1280 Main Street West, Hamilton, ON, Canada, L8S 4L7. Telephone: 905 525 9140 x 24910 email: firstname.lastname@example.org