The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 1 Issue 32

Monday, 23 Dec 1985

Contents

o Can Bank of New York Bank on Star Wars?
Jim Horning
o Cohen's AT&T SDI Software Analogy
Richard A. Cowan
o Failure probabilities in decision chains
Will Martin
o Ten-year any-worseries
Dan Hoey
o Multiple digests as a result of crashed systems
Rob Austein

Can Bank of New York Bank on Star Wars? [PGN's retitling]

Jim Horning <horning@decwrl.DEC.COM>
20 Dec 1985 1413-PST (Friday)
Read RISKS-1.31 only this morning. Found the detailed story on the Bank
of New York fiasco extremely ironic. Last night in the debate at Stanford
on the technical feasibility of the SDI, Richard Lipton chose the
financial network as an example of the advantages of a distributed
system (such as he is proposing for SDI) over a centralized one. "There
have been no catastrophes."  [What about the ARPANET collapse?  PGN]

More generally, I am interested in reactions to Lipton's proposal that
SDI reliability would be improved by having hundreds or thousands of
"independent" orbiting "battle groups," with no communication between
separate groups (to prevent failures from propagating), and separately
designed and implemented hardware and software for each group (to
prevent common design flaws from affecting multiple groups). He said
that the SDIO/PCSBM panel had concluded that the integrated Battle
Management and Command, Communication, and Control (C3) software called
for in the Fletcher Report could never be made trustworthy, but that a
"meta-system" of simple, non-communicating components might be.

Jim H.


Cohen's AT&T SDI Software Analogy

Richard A. Cowan <COWAN@XX.LCS.MIT.EDU>
Thu 19 Dec 85 19:04:45-EST
To: risks@SRI-CSL.ARPA

As pointed out at the SDI economic debate at MIT 11/19, all the other
computer software systems (and predictions of technological failure
that have proven wrong) were problems of man against nature.  SDI
software, as Parnas also pointed out, is a problem of man against man.

All historical analogies to SDI that involve successes in
technological problems of man against nature are therefore worthless.
If it were in the Soviet Union's vital interest to wreck the operation
of the phone system, and they were willing to spend a few billion a
year to do so, of course they could do it.  (Remember how the mass transit
system in Tokyo was shut down by 50 terrorists a few weeks ago?)

By the way, the economic debate I referred to was quite interesting.
Panelists were Bernard O'Keefe, CEO of nuclear weapons contractor
EG&G, Lester Thurow, MIT economist, and Leo Steg, former manager of
GE's Space systems division.  O'Keefe's company does Star Wars
research, yet he ridiculed the Star Wars concept and pointed out the
economic dangers.  Thurow didn't give an opinion on Star Wars, but he
pointed out the problems of diversion of talent, the new competition
from other countries, and the fallacy of thinking that we can spend
the Soviet Union into the ground.  Steg was pro-technology, pro-SDI,
pro-spinoffs.  The Boston Globe had an article about it on the front
page of the Business Section (11/21), but I felt the article was
shallow and biased.

If you'd like to judge for yourself, I have a TAPE of the discussion,
AND a TRANSCRIPT.  If interested, please contact me and I'll be happy
to send you one of these (but you'll have to pay what it costs).

-Rich  (cowan@mit-xx)


Failure probabilities in decision chains

Will Martin <wmartin@BRL.ARPA>
Thu, 19 Dec 85 14:58:42 EST
One of our Directors has asked me to inquire about a reputed Bell Labs
study from 7 or so years ago, which he heard about at a conference. This
study was on "failure probabilities"; one of the statements or
conclusions he recalls was that if you have a string of five sequential
decisions, one after the other, each based upon the preceding, the
reliability of the result is at the 59% level. I don't really have much
other than this to go on, so, if this comment rings a bell with you, and
you know the study (or studies) that this sort of conclusion came out
of, I would greatly appreciate it if you could mail me a reference. If
you know of work being done in this area by other organizations or
particular researchers, any comments or rumors or hearsay or pointers to
published work or theses would be welcomed.
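
One reading that reproduces the 59% figure, offered here as an assumption and not as the study's actual model: if each of the five decisions is correct with probability 0.9, and the decisions fail independently, the whole chain is correct with probability 0.9^5, or about 0.59. A minimal Python sketch (the name `chain_reliability` and the 0.9 per-step value are illustrative, not taken from the study):

```python
import math

def chain_reliability(step_reliabilities):
    """Probability that every step of a sequential chain is correct,
    assuming the steps succeed or fail independently of one another."""
    return math.prod(step_reliabilities)

# Assumption: five sequential decisions, each 90% reliable.
five_steps = [0.9] * 5
print(f"{chain_reliability(five_steps):.3f}")  # prints 0.590
```

Under the same independence assumption, the result falls quickly as the chain grows, which is why a per-step figure alone says little about a long chain.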

If any documents related to this area of research exist on-line and are
not proprietary, please feel free to mail me copies of anything you
think might be relevant. The context of this is to provide some sorts of
standards of comparison or generally-acceptable figures to use when
evaluating the quality of a very complex and involved large software
system.

Please e-mail responses to one of the addresses below. Thank you.

Will Martin
US Army Materiel Command Automated Logistics Mgmt Systems Activity

UUCP/USENET: seismo!brl-bmd!wmartin   or   ARPA/MILNET: wmartin@almsa-1.ARPA


Ten-year any-worseries [Plucked off the BBOARDS]

19-Dec-1985 2140
From: Hoey@NRL-AIC (Dan Hoey)

    hoey@NRL-AIC.ARPA 12/14/85 03:54:44 Re:  Software alert:  DATE-86
    Received: from nrl-aic by MIT-MC.ARPA 11 Dec 85 10:59:57 EST
    Date: 11 Dec 1985 09:55:47 EST (Wed)
    From: Dan Hoey <hoey@nrl-aic.ARPA>

    Early this year a message appeared on ARPANET-BBOARDS commemorating the
    ten-year anniversary of DATE-75.  A somewhat more ominous anniversary
    will occur in four weeks, on 9 January 1986.  Users of the TOPS-10
    operating system should beware of software failures beginning on that
    date.

    DATE-75 is the name of a set of program modifications applied to the
    TOPS-10 operating system, running on DEC PDP-10 computers.  Before the
    modifications, the TOPS-10 system could only represent dates between 1
    January 1964 and 4 January 1975.  The DATE-75 modifications added three
    more bits to the representation of dates, so that dates up to 1
    February 2052 could be represented.  To maximize compatibility with
    existing software, the three extra bits were taken from several unused
    positions in existing data structures.  The change was announced in
    mid-1974, and several tens of person-years went into updating software
    to recognize the new dates.
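
The date limits quoted above can be reproduced with a short sketch. It assumes the date is stored as ((year - 1964) * 12 + (month - 1)) * 31 + (day - 1); the message does not spell this formula out, but it is consistent with every date the message mentions. The names `encode` and `decode` are illustrative.

```python
BASE_YEAR = 1964  # assumption: day zero is 1 January 1964

def encode(year, month, day):
    """Pack a date into a single integer (assumed TOPS-10-style layout:
    every month is treated as having 31 day slots)."""
    return ((year - BASE_YEAR) * 12 + (month - 1)) * 31 + (day - 1)

def decode(n):
    """Inverse of encode: return (year, month, day)."""
    months, day0 = divmod(n, 31)
    years, month0 = divmod(months, 12)
    return (BASE_YEAR + years, month0 + 1, day0 + 1)

print(decode(0))            # (1964, 1, 1)  smallest value
print(decode(2**12 - 1))    # (1975, 1, 4)  largest 12-bit value
print(decode(2**15 - 1))    # (2052, 2, 1)  largest 15-bit value
print(encode(1975, 1, 5))   # 4096, the first value needing a 13th bit
```

The three printed limits match the 1 January 1964, 4 January 1975, and 1 February 2052 figures given in the message.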

    Unfortunately, reassembling these bits into an integer representing the
    date was somewhat tricky.  Also, some programs had already used the
    spare bits for other purposes.  There were a large number of bugs that
    surfaced on 5 January 1975, the first day whose representation required
    the DATE-75 modification.  Many programs ignored or cleared the new
    bits, and thought that the date was 1 January 1964.  Other programs
    interpreted the new bits incorrectly, and reported dates in 1986 or
    later.  Date-related program bugs were frequent well into the Spring of
    1975.

    On 9 January 1986, the second bit of the DATE-75 extension will come
    into use.  Users of software developed in the 60's and early 70's on
    the TOPS-10 operating system should beware of problems with testing and
    manipulation of dates.  Beware especially of programs that were patched
    after manifesting bugs in 1975, for in the rush to fix the bugs it is
    possible that some programs were modified to assume that the date was
    between 1975 and 1986.  Any date that is off by a multiple of eleven
    years and four days is probably caused by this type of bug.
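
The 9 January 1986 rollover and the "eleven years and four days" symptom can be illustrated with a sketch. It assumes the same layout as above, ((year - 1964) * 12 + (month - 1)) * 31 + (day - 1), which the message does not state explicitly. The function `patched_1975_decode` is a hypothetical example of a hurried 1975 fix, not code from any real program.

```python
BASE_YEAR = 1964
FIRST_EXT_BIT = 1 << 12   # first DATE-75 extension bit (value 4096)
SECOND_EXT_BIT = 1 << 13  # second DATE-75 extension bit (value 8192)

def decode(n):
    """Unpack an integer date into (year, month, day), assumed layout."""
    months, day0 = divmod(n, 31)
    years, month0 = divmod(months, 12)
    return (BASE_YEAR + years, month0 + 1, day0 + 1)

def patched_1975_decode(n):
    """Hypothetical buggy patch: keep the old 12 bits and assume the
    first extension bit is always set, ignoring any higher bits."""
    return decode((n & (FIRST_EXT_BIT - 1)) | FIRST_EXT_BIT)

print(decode(SECOND_EXT_BIT))               # (1986, 1, 9)
print(patched_1975_decode(SECOND_EXT_BIT))  # (1975, 1, 5)
print(divmod(FIRST_EXT_BIT, 31))            # (132, 4): 132 months + 4 days
```

Dropping or misplacing one extension bit shifts the value by 4096, which is 132 months (eleven years) plus four days in this layout.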

    Dan Hoey


Multiple digests as a result of crashed systems

Rob Austein <SRA@XX.LCS.MIT.EDU>
Sat, 21 Dec 1985 13:50 EST
Once upon a time (when Rutgers.ARPA was flakey and I got five copies
of SF-LOVERS in one hour), I discussed this problem with Mark Crispin
(who maintains the Twenex mailer daemon, MMAILR).  There are some
real-world constraints that make it difficult to do things exactly as
one would like here.  I will use MMAILR for an example because it is
the only mailer whose internals I have examined in detail.

Firstly, it is obviously preferable to send twice than to send not at
all (in the general case anyway, obviously everybody has some examples
of things they would rather not receive at all :-)), so typically the
*last* thing that happens is that the message is marked sent and the
queued disk copy either deleted or updated.  So there's a window there
during which a system crash will cause a duplicate.  The size of this
window is increased in MMAILR because it only does this marking on a
per-message basis, not a per-recipient basis (i.e., if you have a
message that is going to 25 different recipients on different
machines, the disk copy only gets updated after all the recipients
have been tried).  I was a little puzzled by this, so I asked Mark.
The theory is that if your system is crashing a lot, the last thing
you want to do is increase the amount of I/O to the disk copy of the
queued message (thus increasing the chance that the system will crash
while the update to disk is in progress, thus maybe trashing the
message file).  One could conceivably argue that the mailer should
update between recipients when each recipient takes a non-negligible
amount of time, but how do you know?  Doing this for, say, Arpanet
mail might be reasonable, but doing it for a fast local net wouldn't
(it would spend most of its time doing disk I/O).  Furthermore, for
any given net the delay is a function of the current load, which is
difficult to determine except by trying to use it.  By this point you
are spending more time trying to special-case your way around the
problem than you are delivering mail, so you lose anyway.
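
The tradeoff described above can be sketched as a toy simulation. This is not MMAILR's code; `run_mailer`, `simulate`, and the host names are hypothetical, and the crash is modeled as happening only between recipients. It counts duplicate deliveries and disk updates for per-message versus per-recipient marking when the system crashes partway through a recipient list and the mailer reruns after reboot.

```python
def run_mailer(recipients, marked, copies, per_recipient, crash_after=None):
    """One pass over a queued message. `marked` is what the disk copy
    records as sent. Returns the number of disk updates performed."""
    writes = 0
    sent_this_run = []
    for r in recipients:
        if r in marked:
            continue  # disk copy says this recipient is already done
        if crash_after is not None and len(sent_this_run) == crash_after:
            return writes  # crash: anything not yet on disk is forgotten
        copies[r] += 1
        sent_this_run.append(r)
        if per_recipient:
            marked.add(r)  # one disk update per recipient
            writes += 1
    if not per_recipient:
        marked.update(sent_this_run)  # single update after all recipients
        writes += 1
    return writes

def simulate(n_recipients, crash_after, per_recipient):
    """Crash once after `crash_after` sends, then rerun cleanly.
    Returns (duplicate copies delivered, total disk updates)."""
    recipients = [f"host{i}" for i in range(n_recipients)]
    marked, copies = set(), dict.fromkeys(recipients, 0)
    writes = run_mailer(recipients, marked, copies, per_recipient, crash_after)
    writes += run_mailer(recipients, marked, copies, per_recipient)
    return sum(c - 1 for c in copies.values()), writes

print(simulate(25, 20, per_recipient=False))  # (20, 1)
print(simulate(25, 20, per_recipient=True))   # (0, 25)
```

In this model, per-recipient marking removes the duplicates at the cost of 25 disk updates instead of one. A crash between a send and its disk update would still produce one duplicate, which the sketch does not model.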

One thing that might help prevent this sort of thing on a flakey
system would be to delay startup of the mailer daemon for a little
while after system boot.  I expect that most of these cases of ten
copies to a single person are cases where the system crashes within
five or ten minutes of being rebooted.

--Rob
