Read RISKS-1.31 only this morning. Found the detailed story on the Bank of New York fiasco extremely ironic. Last night in the debate at Stanford on the technical feasibility of the SDI, Richard Lipton chose the financial network as an example of the advantages of a distributed system (such as he is proposing for SDI) over a centralized one. "There have been no catastrophes." [What about the ARPANET collapse? PGN]

More generally, I am interested in reactions to Lipton's proposal that SDI reliability would be improved by having hundreds or thousands of "independent" orbiting "battle groups," with no communication between separate groups (to prevent failures from propagating), and separately designed and implemented hardware and software for each group (to prevent common design flaws from affecting multiple groups). He said that the SDIO/PCSBM panel had concluded that the integrated Battle Management and Command, Communication, and Control (C3) software called for in the Fletcher Report could never be made trustworthy, but that a "meta-system" of simple, non-communicating components might be.

Jim H.
To: risks@SRI-CSL.ARPA

As pointed out at the SDI economic debate at MIT 11/19, all the other computer software systems (and predictions of technological failure that have proven wrong) were problems of man against nature. SDI software, as Parnas also pointed out, is a problem of man against man. All historical analogies to SDI that involve successes in technological problems of man against nature are therefore worthless. If it were in the Soviet Union's vital interest to wreck the operation of the phone system, and they were willing to spend a few billion a year to do so, of course they could do it. (Remember how the mass transit system in Tokyo was shut down by 50 terrorists a few weeks ago?)

By the way, the economic debate I referred to was quite interesting. Panelists were Bernard O'Keefe, CEO of nuclear weapons contractor EG&G, Lester Thurow, MIT economist, and Leo Steg, former manager of GE's Space systems division. O'Keefe's company does Star Wars research, yet he ridiculed the Star Wars concept and pointed out the economic dangers. Thurow didn't give an opinion on Star Wars, but he pointed out the problems of diversion of talent, the new competition from other countries, and the fallacy of thinking that we can spend the Soviet Union into the ground. Steg was pro-technology, pro-SDI, pro-spinoffs.

The Boston Globe had an article about it on the front page of the Business Section (11/21), but I felt the article was shallow and biased. If you'd like to judge for yourself, I have a TAPE of the discussion, AND a TRANSCRIPT. If interested, please contact me and I'll be happy to send you one of these (but you'll have to pay what it costs).

-Rich (cowan@mit-xx)
One of our Directors has asked me to inquire about a reputed Bell Labs study from 7 or so years ago, which he heard about at a conference. This study was on "failure probabilities"; one of the statements or conclusions he recalls was that if you have a string of five sequential decisions, one after the other, each based upon the preceding, the reliability of the result is at the 59% level.

I don't really have much other than this to go on, so, if this comment rings a bell with you, and you know the study (or studies) that this sort of conclusion came out of, I would greatly appreciate it if you could mail me a reference. If you know of work being done in this area by other organizations or particular researchers, any comments or rumors or hearsay or pointers to published work or theses would be welcomed. If any documents related to this area of research exist on-line and are not proprietary, please feel free to mail me copies of anything you think might be relevant.

The context of this is to provide some sorts of standards of comparison or generally-acceptable figures to use when evaluating the quality of a very complex and involved large software system. Please e-mail responses to one of the addresses below. Thank you.

Will Martin
US Army Materiel Command Automated Logistics Mgmt Systems Activity
UUCP/USENET: seismo!brl-bmd!wmartin or ARPA/MILNET: wmartin@almsa-1.ARPA
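A note on the arithmetic behind the 59% figure: it is what simple series reliability gives if each of the five decisions is independently correct 90% of the time, since 0.9^5 = 0.59049. The 90% per-step value and the independence assumption are inferences made here for illustration; the message does not say what the study actually assumed. A minimal Python sketch (the function name `chain_reliability` is hypothetical):

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every decision in a chain is correct.

    Assumes each decision is correct with the same probability and that
    errors are independent, so the probabilities simply multiply.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    # Compare a few assumed per-step reliabilities over a five-step chain.
    for p in (0.99, 0.95, 0.90):
        print(f"per-step {p:.2f} -> five-step chain {chain_reliability(p, 5):.3f}")
```

Under these assumptions, a chain is always less reliable than its weakest step, and the loss grows quickly with the number of steps.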
From: Hoey@NRL-AIC (Dan Hoey) hoey@NRL-AIC.ARPA 12/14/85 03:54:44
Re: Software alert: DATE-86
Received: from nrl-aic by MIT-MC.ARPA 11 Dec 85 10:59:57 EST
Date: 11 Dec 1985 09:55:47 EST (Wed)
From: Dan Hoey <hoey@nrl-aic.ARPA>

Early this year a message appeared on ARPANET-BBOARDS commemorating the ten-year anniversary of DATE-75. A somewhat more ominous anniversary will occur in four weeks, on 9 January 1986. Users of the TOPS-10 operating system should beware of software failures beginning on that date.

DATE-75 is the name of a set of program modifications applied to the TOPS-10 operating system, running on DEC PDP-10 computers. Before the modifications, the TOPS-10 system could only represent dates between 1 January 1964 and 4 January 1975. The DATE-75 modifications added three more bits to the representation of dates, so that dates up to 1 February 2052 could be represented. To maximize compatibility with existing software, the three extra bits were taken from several unused positions in existing data structures. The change was announced in mid-1974, and several tens of person-years went into updating software to recognize the new dates.

Unfortunately, reassembling these bits into an integer representing the date was somewhat tricky. Also, some programs had already used the spare bits for other purposes. There were a large number of bugs that surfaced on 5 January 1975, the first day whose representation required the DATE-75 modification. Many programs ignored or cleared the new bits, and thought that the date was 1 January 1964. Other programs interpreted the new bits incorrectly, and reported dates in 1986 or later. Date-related program bugs were frequent well into the Spring of 1975.

On 9 January 1986, the second bit of the DATE-75 extension will come into use. Users of software developed in the 60's and early 70's on the TOPS-10 operating system should beware of problems with testing and manipulation of dates.
Beware especially of programs that were patched after manifesting bugs in 1975, for in the rush to fix the bugs it is possible that some programs were modified to assume that the date was between 1975 and 1986. Any date that is off by a multiple of eleven years and four days is probably caused by this type of bug.

Dan Hoey
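The dates in the warning above can be checked with a little arithmetic. The sketch below assumes the commonly documented TOPS-10 encoding, ((year - 1964) * 12 + (month - 1)) * 31 + (day - 1); the message itself does not give the formula, so treat it as an assumption, although it reproduces every date quoted. The names `encode` and `decode` are hypothetical, and the sketch works on an already reassembled 15-bit integer; it does not model where the three extra bits were stored.

```python
from typing import Tuple

BASE_YEAR = 1964  # first representable year, per the message


def encode(year: int, month: int, day: int) -> int:
    """Pack a date using the assumed TOPS-10 formula (31-day pseudo-months)."""
    return ((year - BASE_YEAR) * 12 + (month - 1)) * 31 + (day - 1)


def decode(value: int) -> Tuple[int, int, int]:
    """Unpack an integer date into (year, month, day) under the same formula."""
    day = value % 31 + 1
    months = value // 31
    return BASE_YEAR + months // 12, months % 12 + 1, day


if __name__ == "__main__":
    print(decode(2**12 - 1))  # largest 12-bit date
    print(decode(2**12))      # first date needing the first extra bit
    print(decode(2**13))      # first date needing the second extra bit
    print(decode(2**15 - 1))  # largest 15-bit date
    # A program that keeps only the original 12 bits sees 1 January 1964
    # on the first day that needs an extra bit.
    print(decode(encode(1975, 1, 5) & 0o7777))
```

Under this formula one bit position in the extension is worth 4096 day-units, which is 11 pseudo-years of 372 units plus 4, matching the "eleven years and four days" error signature described above.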
Once upon a time (when Rutgers.ARPA was flakey and I got five copies of SF-LOVERS in one hour), I discussed this problem with Mark Crispin (who maintains the Twenex mailer daemon, MMAILR). There are some real-world constraints that make it difficult to do things exactly as one would like here. I will use MMAILR for an example because it is the only mailer whose internals I have examined in detail.

Firstly, it is obviously preferable to send twice than to send not at all (in the general case anyway; obviously everybody has some examples of things they would rather not receive at all :-)), so typically the *last* thing that happens is that the message is marked sent and the queued disk copy either deleted or updated. So there's a window there during which a system crash will cause a duplicate.

The size of this window is increased in MMAILR because it only does this marking on a per-message basis, not a per-recipient basis (i.e., if you have a message that is going to 25 different recipients on different machines, the disk copy only gets updated after all the recipients have been tried). I was a little puzzled by this, so I asked Mark. The theory is that if your system is crashing a lot, the last thing you want to do is increase the amount of I/O to the disk copy of the queued message (thus increasing the chance that the system will crash while the update to disk is in progress, thus maybe trashing the message file).

One could conceivably argue that the mailer should update between recipients when each recipient takes a non-negligible amount of time, but how do you know? Doing this for, say, Arpanet mail might be reasonable, but doing it for a fast local net wouldn't (it would spend most of its time doing disk I/O). Furthermore, for any given net the delay is a function of the current load, which is difficult to determine except by trying to use it.
By this point you are spending more time trying to special-case your way around the problem than you are delivering mail, so you lose anyway.

One thing that might help prevent this sort of thing on a flakey system would be to delay startup of the mailer daemon for a little while after system boot. I expect that most of these cases of ten copies to a single person are cases where the system crashes within five or ten minutes of being rebooted.

--Rob
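The trade-off described above, fewer disk updates against a wider duplicate window, can be shown with a toy simulation. This is not MMAILR's code; it is a minimal Python sketch under stated assumptions: one message, one crash followed by one restart, and a crash that happens only between recipients. The names `deliver`, `checkpoint_each`, and `crash_after` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def deliver(recipients: List[str], checkpoint_each: bool,
            crash_after: Optional[int]) -> Tuple[Dict[str, int], int]:
    """Simulate sending one queued message, with at most one crash.

    checkpoint_each: update the disk copy after every recipient (True)
        or once after all recipients have been tried (False).
    crash_after: number of sends completed before the simulated crash,
        or None for no crash.
    Returns (copies received per recipient, number of disk updates).
    """
    copies = {r: 0 for r in recipients}
    on_disk: Set[str] = set()  # recipients the queued disk copy marks as sent
    disk_writes = 0

    def one_run(crash_at: Optional[int]) -> bool:
        nonlocal disk_writes
        sent_now: List[str] = []  # in-memory only; lost if we crash
        for r in recipients:
            if r in on_disk:
                continue  # already marked sent before the crash
            if crash_at is not None and len(sent_now) == crash_at:
                return False  # crash: nothing more reaches the disk
            copies[r] += 1
            sent_now.append(r)
            if checkpoint_each:
                on_disk.add(r)
                disk_writes += 1
        if not checkpoint_each:
            on_disk.update(sent_now)  # single update at the very end
            disk_writes += 1
        return True

    if not one_run(crash_after):
        one_run(None)  # mailer restarts after reboot and retries
    return copies, disk_writes
```

With five recipients and a crash after three sends, the per-message scheme resends to the first three recipients but touches the disk copy once, while the per-recipient scheme sends no duplicates in this model at the cost of five disk updates. A crash between a send and its disk update would still produce one duplicate even with per-recipient marking, which is the unavoidable window mentioned above.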