The RISKS Digest

Forum on Risks to the Public in Computers and Related Systems

ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Volume 18 Issue 34

Friday 16 August 1996

Contents

o California DMV records NOT secure
Mark Seecof
o Re: London train crash: update
Scott Alastair
Jim Reid
o Re: 128-bit Netscape registration
Bernard Peek
o Re: Fault-tolerant software, "upgrade hell"
Kurt Fredriksson
Wayne Hayes
Valdis Kletnieks
Vladimir Z. Nuri
o Re: Electromagnetic pulses to stop car chases?
Michael Brady
o Re: Western Power Outage
Steve Forrette
o Re: America Offline
Valdis Kletnieks
Lowell Gilbert
o Re: Bread-riots and circuses
Hal Lockhart
o Info on RISKS (comp.risks)

California DMV records NOT secure

Mark Seecof <Mark.Seecof@latimes.com>
Thu, 15 Aug 1996 12:26:37 -0700
I'd like to correct an oft-repeated misstatement: that California DMV
records have been secured.  In fact, quite minor procedural barriers to
access by unsophisticated people at DMV counters have been instituted, but
only slightly more savvy people can still get info, and the DMV continues to
sell mass data extractions to reusers.  Also, DMV has taken no steps at all
to secure data or systems against misuse by DMV staff or others (e.g.,
police) with privileged access.


Re: London train crash: update (RISKS-18.32)

"Scott Alastair (Exchange)" <ScottA@logica.com>
Tue, 13 Aug 1996 13:23:20 +0100
What happened was that a train from Euston to Milton Keynes with about 400
passengers on board was travelling North at about 60mph. Another empty
train, which should have been waiting at a red signal just outside Watford
Junction, started moving slowly South and crossed over a set of points onto
the North-bound fast line straight into the path of the fast train. The fast
train driver saw what was happening and sounded his horn but could not avoid
a collision. I believe that both drivers managed to jump clear before the
crash and were not badly injured.

"Black boxes" were recovered from both crashed trains, and showed that
signalling and train systems were working properly just before the crash.
These black boxes, I understand, are a recent innovation and are not fitted
to all trains.

It turned out that the crashed trains were both fairly new (about 10 years
old) and had well-designed carriages which dissipated the force of the
impact; also, that the collision was glancing and not head-on.

Newspaper photographs showed track torn up and the front carriages of both
trains at crazy angles; one was hanging over an embankment. The other
carriages of both trains were derailed but had not been thrown over. The
line was closed for about 2 days while the carriages were removed and track
repaired.

It was a huge stroke of luck that the collision involved new rolling stock:
on some other lines carriages are 30 or 40 years old and are of an antique
"slam-door" design which concertinas in the event of a crash.  I have read
that, of about 70 deaths on the United Kingdom railways in the past 10
years, all but one have occurred in these older carriages.

There are four separate enquiries going on at the moment - by Railtrack (who
own the track), North London Trains (who own the trains), British Transport
Police and the UK Government's Health and Safety Executive - so I presume
the reason for the supposedly stationary train moving will come out in due
course.

Alastair Scott  scotta@logica.com


Re: London train crash: update (RISKS-18.32)

Jim Reid <jim.reid@eurocontrol.be>
Tue, 13 Aug 1996 11:20:39 +0200
A number of risks have emerged from the recent crash. The first concerns
Automatic Train Protection, ATP. This system is claimed to stop any train
which passes a red signal. It is not used on the British railway network,
though it is deployed on other European railways. [Apologies to any
trainspotters for any simplification I've made.] The UK railway companies
say that ATP is too expensive - it costs too much for each life it saves,
though how they work that out is beyond me. (Another risk?) They claim that
the money required for ATP would be better spent on other safety measures
like modern, stronger passenger carriages. So, rather than prevent trains
crashing into each other, they think the best strategy is to let them crash,
but make the rolling stock safer. (Yet another risk?)

The next risk is the absurd way in which Britain's railways are now run
after privatisation. One company - Railtrack - owns the stations, track and
signalling systems. It seems more interested in property development and
turning stations into shopping malls than the rail infrastructure. This
company would have to install ATP, which would make a big dent in their
balance sheet. So, it's hardly surprising that they are not enthusiastic
about ATP, even though they have some responsibility for safety. Railtrack
charges
train companies for the use of its stations and track. The train companies
operate the services, but they don't own the rolling stock, which belongs to
leasing companies who hire it out to the rail operators.

Aside from the bureaucracy and ticketing nightmares, there are serious
safety risks in this setup. Safety of the trains is the responsibility of
the leasing companies who own them and the operating companies who use them
and presumably the companies who maintain them. Where the boundaries are is
anybody's guess. Railtrack have responsibility for the safety of the track
and signalling systems. However, if they were to deploy ATP, the leasing and
operating companies would have to pay for the extra kit in the trains. Where
the boundaries of responsibility lie between Railtrack and the leasing and
operating companies is yet unknown. There was a recent report that a small
fire was put out by staff throwing dirt and sand at the junction box. The
box belonged to Railtrack, the staff worked for an operating company.  They
feared being disciplined if they used company property - their company's
fire extinguishers - to help another company.

Sitting on top of this is a government agency, the Health and Safety
Executive which is responsible for safety in the workplace amongst other
things. Where their responsibility kicks in is yet another unknown. With no
single body responsible for train safety, it's hard to apportion blame for
crashes or establish better procedures and communications protocols to make
them less likely in future. The companies and agencies involved end up
shrugging their shoulders and pointing at each other.

An added problem is that none of these companies has safety as a prime
objective. They all want to cut costs to boost their profits. For Railtrack,
money spent on safety measures comes straight off their bottom line,
reducing dividends for shareholders and profitability bonuses for
management. For the leasing companies, repainting old trains is more cost
effective than buying new ones which presumably have better safety features.
For the operating companies, the cost of leasing is one of the few costs
they can control. [They can also work their drivers harder, but that will be
another safety risk.] Thus, they prefer to run old, less safe, trains
because they are cheaper to lease than new ones.


Re: 128-bit Netscape registration (Arndt, RISKS-18.33)

Bernard Peek <bap@intersec.demon.co.uk>
Thu, 15 Aug 96 19:02:01 GMT
> Apparently they pass on the information you entered to another service and
> presumably if you don't show up you don't get to download the software.

That's quite interesting. Now what happens if I telnet into my other service
provider (in California) and enter a perfectly valid name, US address, and
phone number?

Not that I would do it, of course; it seems an unnecessarily convoluted
way of achieving a fairly simple objective.

Risks readers might like to ponder a Catch 22 situation I found myself in
some years ago. Working for a UK company selling high-tech equipment (68000
processors) I wasn't permitted to supply anyone on a blacklist. I could have
been extradited and jailed for doing that. Of course I wasn't permitted to
actually see the blacklist either.

And on another subject entirely...

> The final report said that the crash was the result of the pilots fighting
> the autopilot.

I've just heard a report of a crash that happened here two days ago.
It was a result of the two pilots fighting each other. As a result a
business jet crashed on one of Europe's busiest roads. Four injuries,
no fatalities.

Bernard Peek  bap@intersec.demon.co.uk


Re: Fault-tolerant software, "upgrade hell" (Nuri, RISKS-18.33)

Kurt Fredriksson <etxkfrn@aom.ericsson.se>
Thu, 15 Aug 96 12:43:44 +0200
The author is right. This is not a novel idea. The Ericsson AXE exchange has
had the functionality to upgrade software in a running system from the first
system delivered more than ten years ago. This was a deliberate design
decision due to the harsh demands on "uptime" in telecommunication systems.

Kurt Fredriksson

   [Peter Denning reminds us that the Newcastle work on Recovery Blocks
   -- Brian Randell et al. -- is relevant here, and certainly worthy of
   mention.  It goes way back to the mid 1970s.  It was an extremely well
   thought-out effort, with a language, a run-time system, and a supporting
   hardware architecture for dealing with concurrent processes running
   different versions of an algorithm (alternat[iv]e blocks), such that the
   collection of processes would not terminate until at least one of them
passed an acceptance test.  Their system included automatic checkpointing
   so that you could back up properly to the last known state that passed an
   acceptance test.  Thanks to PJD for the reminiscence.  PGN]


Re: Fault-tolerant software, "upgrade hell" (Nuri, RISKS-18.33)

Wayne Hayes <wayne@cdf.toronto.edu>
Thu, 15 Aug 1996 00:27:25 -0400
"Vladimir Z. Nuri" <vznuri@netcom.com> writes about the possibility of
running new software as a "shadow" of currently running software and
(perhaps automatically) testing its reliability before switching it to
"actively" controlling the system.
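The shadow arrangement Nuri describes can be sketched roughly like this (a
toy illustration; the two "versions" and all names below are hypothetical,
not anyone's actual system):

```python
# Sketch of "shadow computation": every request is served by the active
# version, while a candidate version runs alongside on the same input.
# Divergences are recorded; promotion happens only after the shadow has
# matched the active version on enough live traffic with no mismatches.

class ShadowRunner:
    def __init__(self, active, shadow, required_matches=1000):
        self.active = active
        self.shadow = shadow
        self.required_matches = required_matches
        self.matches = 0
        self.mismatches = []

    def handle(self, request):
        result = self.active(request)          # the answer actually served
        try:
            candidate = self.shadow(request)   # same input, behind the scenes
        except Exception as exc:
            self.mismatches.append((request, repr(exc)))
            return result
        if candidate == result:
            self.matches += 1
        else:
            self.mismatches.append((request, candidate))
        return result

    def ready_to_promote(self):
        return not self.mismatches and self.matches >= self.required_matches

# Hypothetical versions of the same routine: v2 mishandles a boundary case.
def v1(n):
    return n * (n + 1) // 2                        # sum of 1..n

def v2(n):
    return sum(range(1, n + 1)) if n > 0 else -1   # wrong for n == 0

runner = ShadowRunner(v1, v2, required_matches=3)
for n in [3, 5, 0, 7]:
    runner.handle(n)

print(runner.matches)              # 3 -- shadow agreed on 3, 5, 7
print(runner.mismatches)           # [(0, -1)] -- regression caught in shadow
print(runner.ready_to_promote())   # False
```

Note that the shadow is exercised only on live inputs the active version
also handles, which is exactly the limitation discussed next.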

This method may be fine for bug fixes, but it has a basic limitation
because it ignores a fundamental issue of software upgrades:
that the new software may have new functionality which is unavailable
in the old version, and thus will never be tested while the old
software is the only one that is "active".

For example, Bell Canada recently introduced phone mail to residences
that previously only had "call answer".  Since phone mail has
fundamentally new functionality over call answer, the phone mail can
not be tested under real-world conditions until its functionality is
activated and actually used by the users.  And if it fails, it will be
difficult to reliably fall back to call answer alone, because then any
phone mail messages left in the queue may be lost or left dangling.


Re: Fault-tolerant software, "upgrade hell" (Nuri, RISKS-18.33)

Valdis Kletnieks <valdis@black-ice.cc.vt.edu>
15 Aug 1996 17:03:22 GMT
I believe that the venerable Multics system supported this to some degree.
Also, IBM's AIX system has gotten quite good at update-on-the-fly (although
it's not *quite* up to full 24x7 yet), mostly due to the fact that most of
the operating system kernel is loaded on the fly.  There are still some
gotchas, most notably in trying to reload a device driver after applying
maintenance...

>What we have is a sort of "shadow computation" going on behind the scenes...

Unfortunately, if you have a system that's running at 85% capacity, you will
require just about twice as much processing power.  Also, you introduce new failure
modes.  I believe the Space Shuttle uses a 5-way redundant system, with 4
systems made and programmed by one contractor, and the 5th a different
design and programming from a separate contractor.

More than once, a shuttle launch has been scrubbed because the voting
mechanism itself broke down.
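A majority voter over redundant implementations can be sketched in a few
lines (the three "versions" here are hypothetical stand-ins, not shuttle
code), including the failure mode where the voter itself must give up:

```python
# Sketch of voting over N independently written versions, in the spirit of
# the shuttle's redundant arrangement: each version computes the same
# answer and a strict majority decides.  No majority is itself a failure.

from collections import Counter

def vote(versions, x):
    results = [v(x) for v in versions]
    answer, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority -- the voting mechanism gives up")
    return answer

versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * x + (1 if x == 3 else 0),   # one faulty version
]
print(vote(versions, 2))   # 4 -- unanimous
print(vote(versions, 3))   # 9 -- 2-of-3 majority outvotes the faulty one
```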

On the other hand, I seem to remember that when the great long-distance
telephone collapses happened a few years ago, a telco official was asked why
they just don't reboot the switch, and he replied that this implied that the
switches were booted a first time - apparently, some of them had been
upgraded from mechanical rotaries through to several generations of more
electronic and computerized designs, without ever actually going down...

Anybody have more info on how the telcos do software upgrades?  They seem to
have quite the good record on it (barring a few historical botches ;)

Valdis Kletnieks  Computer Systems Engineer  Virginia Tech


Re: fault-tolerant software for escaping "upgrade hell"

"Vladimir Z. Nuri" <vznuri@netcom.com>
Thu, 15 Aug 96 15:49:06 -0700
A nice analogy for fault-tolerant system upgrades struck me after I wrote
the earlier article. When highways are being resurfaced, they are rarely
barricaded entirely; instead traffic is diverted during off-peak periods and
run at less-than-peak efficiency. I am trying to spur designers to think of
the flow of data through programs as exactly the same kind of situation:
ideally data could be easily rerouted through different parts of a system
even as it is being upgraded. Of course, roads do not have the complexity of
software-- the pavers do not have to worry about it not working properly
once installed.

Kurt Fredriksson remarks that the "Ericsson AXE exchange" has the
functionality to upgrade software on-the-fly. I am not familiar with it but
I wonder if it has all the key properties I mentioned or if it subtly relies
on any of the hidden "gotchas" such as assuming designers will infallibly
carry over functionality between versions (i.e. no regressions).  It is easy
to claim that software can be instantly switched to a new version (or back
to an old one), but such a mechanism is not entirely desirable without other
features such as the ability to seamlessly compare the compatibility results
of new versions with earlier ones.

In other words, one of my basic assertions in the essay is that regressions
in software, where bugs in existing functionality are introduced in new
versions, can be caught through more systematic methods than are generally
being employed today. Wayne Hayes writes in a response that the system I
propose cannot handle "new functionality". In the short essay I could not
include this obvious caveat although it is quite apparent and a significant
limitation. Actually, the essence of the mini-essay focuses on an elegant
and graceful way of avoiding "regressions", and obviously one cannot have a
regression when the functionality did not previously exist.

But Mr. Hayes also brings up an excellent related point about situations
where new functionality conflicts with old functionality. In such cases
designers could actually write code that "bridges" the two versions, such
that they have fallback algorithms when the new code fails to function. In
other words, they write a bit of additional code that tries to gracefully
fall back or rearrange the system so that one can run the earlier version.
Of course, that could have bugs too, and we tend to run into issues of
infinite regress in some of these ideas. (Nevertheless, often companies have
to assume that many of their software versions are running simultaneously.)
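A minimal sketch of such a bridge, assuming the new and old implementations
share a call signature (all function names below are hypothetical
placeholders):

```python
# Sketch of the "bridge" idea: try the new code path, but fall back to the
# proven old version if the new one raises or fails an acceptance check.

def bridged(new_impl, old_impl, acceptable):
    def wrapper(*args):
        try:
            result = new_impl(*args)
            if acceptable(result):
                return result
        except Exception:
            pass                      # fall through to the old version
        return old_impl(*args)
    return wrapper

# Hypothetical example: the new parser mishandles empty input.
def old_parse(s):
    return s.split(",") if s else []

def new_parse(s):
    return [f.strip() for f in s.split(",")]   # returns [''] for ""

parse = bridged(new_parse, old_parse, acceptable=lambda r: all(r))
print(parse("a, b"))   # ['a', 'b'] -- new path used
print(parse(""))       # [] -- acceptance check failed, old path used
```

Of course, the acceptance check and the bridge itself can have bugs too,
which is the infinite regress mentioned above.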

In fact there is a whole array of issues that readers can immediately spot--
this is exactly the kind of development and attention I suggest be channeled
into creating computer languages and hardware systems that take into account
all the various scenarios.

I want to highlight this point of Mr. Hayes'-- the system I am describing
for graceful system upgrades is not designed to guarantee that new
functionality is correct, only that old functionality is not "clobbered" in
an upgrade, and that there is no system downtime in the process.  Of course the process of
testing new functionality is an entire art form in itself.  However, the
fact that "seamless software upgrading" does not guarantee the correctness of
new functionality does not mean it is not superior to the systems we have now,
which frequently do not even guarantee old functionality in practice
(although the designers would insist they do in theory-- a perception gap I
am explicitly challenging).  If a new feature does not work correctly but
the system is still running, that's highly desirable. Designers would be
elated to be able to test new features without fear of breaking the overall
system.

One nice way of thinking about this is the following: every software package
is a core of functions, say A = X + Z. X is the core code that should remain
compatible into the future. Z is code identified as obsolete and "to be
deleted" in a future version.  A new version, A', adds some new
functionality Y, so that A' = A + Y - Z = X + Y. A "regression" happens
whenever something in X fails to function in the new version, something I am
suggesting can be better dealt with by a change in designers' perceptions
and tools relative to the inherent evolutionary aspect of software (in
contrast to the perception of inertness that holds today).
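This framing can be made concrete in a few lines. Treating a version as a
map from feature names to implementations, a regression check runs only over
the shared core X (all feature names and implementations below are
hypothetical, purely for illustration):

```python
# Sketch of the A = X + Z, A' = X + Y framing: a regression is any feature
# of the shared core X whose behaviour differs between versions.  New
# functionality Y has no old counterpart, so it cannot be checked this way.

def regressions(old, new, test_inputs):
    """Return features in the shared core X that behave differently."""
    core = old.keys() & new.keys()        # X: carried-over functionality
    bad = []
    for name in sorted(core):
        for x in test_inputs.get(name, []):
            if old[name](x) != new[name](x):
                bad.append(name)
                break
    return bad

A = {                       # old version: X + Z
    "square": lambda x: x * x,
    "double": lambda x: 2 * x,
    "legacy": lambda x: x,               # Z: scheduled for deletion
}
A_prime = {                 # new version: X + Y
    "square": lambda x: x ** 2,
    "double": lambda x: x + x + 1,       # a bug slipped into core code
    "cube":   lambda x: x ** 3,          # Y: new, nothing to compare against
}

inputs = {"square": [0, 2, 5], "double": [0, 2, 5], "cube": [2]}
print(regressions(A, A_prime, inputs))   # ['double']
```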

To summarize the above points in this framework, Mr. Hayes quite rightfully
points out that if you don't have any prior information on Y, the concept of
"regression" is not applicable. But also in his example, the problem of
isolating exactly what Y is relative to X is apparent. Trying to draw the
line between the two would be very difficult in some situations such as the
case he gives, where new functionality is not merely an addition to old
functionality but a replacement of aspects of it. However, merely placing
attention onto the problem improves its chances for solution, and the
designer is forced to explicitly answer the question, "how can I gracefully
add this new feature and possibly revert to the old one if it doesn't work?"

Mr. Kletnieks mentions another demand of graceful software upgrading: if a
system is running at anywhere over 50% capacity, you would have to add
capacity to mirror it if it is a real-time system.  But to me this is
essentially like describing mirrored disk drives to someone and having them
reply, "but you have to have twice as many disk drives as you now have".
That's the basic part of the cost/benefit tradeoff.  Also, I think that many
of the ideas can be used in non-realtime systems which would not require an
upgrade in capacity but would instead imply slower running times. Of course
there is additional overhead in introducing such ideas, just as in OOP the
overhead of function calls is increased by the indirection; the power gained
is inseparable from that sacrifice.

The use of voting software in the space shuttle is one example of
a system that is similar to what I am proposing. What he describes is
essentially the idea of using shadowed computation on a single process
to guarantee reliability, but the area of *upgrading* that I
am focusing on is not as apparent in that example. Mr. Kletnieks' example
makes a good point, however: the code designed to deal with multiple versions
may itself have bugs in it. Again, the infinite regress-- what code will
check the checker code? I suggest this will be less of a problem when the
checker code is intrinsic to the language, so that this would be like
asking the question, "what if the compiler has bugs in it"?

I would also like to point out that it ought to be up to the end-user to
determine actual compatibility of versions. Today we have a system in which
the companies that create the software give the assurance that it is or is
not compatible, and the end-user often cannot determine whether this is so
without committing to the new version, with the "upgrade hell" scenario
frequently ensuing despite everyone's best intentions. In contrast, with a
suitably flexible and powerful computation environment like I describe, it
would be less relevant what the company promises, and the end-user could
have ultimate control over testing and switching versions to match his own
demands.  How can a company guarantee that the rigor of its testing
environment exceeds that of all its customers? History generally
suggests they can't, and I suggest that systems be designed with this in
mind, which give greater control to the end user in arbitrarily and
seamlessly testing and switching between versions.

Based on responses, I am struck that the ideas I was highlighting in my
original essay were explored somewhat thoroughly in the 70's, yet they
apparently haven't made it into widespread use in software or hardware.  I
suggest they could be applied in varying degrees into areas outside of the
incredibly demanding environments of telecommunications and space
exploration with fruitful results (I would also like to hear more about how
telecommunication software is upgraded from RISKS readers-- the infamous
AT&T switching disaster of a few years ago shows that it was even "recently"
subject to catastrophic human miscalculations).

The graceful-upgrade paradigm seems to have failed to make the difficult
leap from theoretical curiosity to widespread use and awareness by
"Joe Codehead". Again I think the analogy to OOP is relevant.  It took a
massive paradigm shift in consciousness for OOP to "trickle down" throughout
the software industry.  I suspect that the ideas for seamless software
upgrades are roughly as significant and valuable and am writing partly in
the hope others can take up this ideology for research and implementation
beyond my own limited and minimal elucidations.


Re: Electromagnetic pulses to stop car chases? (Wayner, RISKS-18.32)

Michael Brady <michaelb@gemsbok.corp.sgi.com>
13 Aug 1996 19:33:30 GMT
>       Very precisely directed beams are required, ...

Of course the traditional car-stopping alternatives are a fusillade of
gunfire, a highly hazardous high-speed pursuit, ramming of the fleeing
vehicle, and, all too frequently, a "spirited arrest" by the pursuing
officers.  Except for the risk of turning pacemakers "up", "down", or "off",
this non-lethal tool might be worth the collateral damage.

Michael Brady, CPP, Corporate Security Manager, Silicon Graphics, Inc.
Global Facilities, World Wide Administration Division michaelb@corp.sgi.com

   [Might as well try EMP (nuclear) while we are at it!
   It would reduce traffic (but not congestion).  PGN]


Re: Western Power Outage

Steve Forrette <stevef@wrq.com>
Tue, 13 Aug 1996 13:15:23 -0700
Regarding the power outage that hit several Western US states last Saturday,
I had an interesting experience.  I was in Las Vegas that weekend, and
arrived at the Hilton hotel/casino where I was staying shortly after the
power outage started.  At the time, I was unaware of the outage, and things
appeared more or less normal inside, since the casino was on generator
power.  However, when I got to the front of the check-in line, I was told
that I could not be issued a key to my room, since all of the machines that
make the room keys (in this case, plastic cards with mag stripes) were
"down."  At first, they were sending new guests to their rooms accompanied
by security, who would let them in with a master key.  However, this quickly
became overwhelming for them as the outage progressed, so they told us just
to check back every 30 minutes until they got the card key machines working
again.

With nothing better to do, I settled down in the bar :-), where I learned
about the power outage.  At this point, it was really interesting to see
what Hilton considered essential enough to have on the backup generators,
and what was "unessential" and therefore could be dark during the power
outage.  It was no surprise to see that *everything* on the casino floor was
considered essential, right down to the chandeliers.  Reportedly, everything
went dark for a few seconds when the outage began and while the generators
spun up.  After that, whatever was on the backup system came back on.  The
slot machines remembered how many credits each player had, so at least that
part of them must be on some sort of UPS.

In the bar, most of the lights were out, but the lights behind the bar, as
well as all of the equipment needed to keep it open (beer and soft drink
taps, cash registers, etc.) were on.

It was interesting that at the front desk, all of the computers were up, so
they could check guests in and out, but the card key machines were not.  The
risk here is that even when you have a backup generator, your operations can
still be crippled if you have a poorly thought-out strategy for what you place
on the backup system.

Steve Forrette, stevef@wrq.com


Re: America Offline (Mellor, RISKS-18.33)

Valdis Kletnieks <valdis@black-ice.cc.vt.edu>
15 Aug 1996 17:19:51 GMT
The problem with this is that although it gives you a nice-looking set
of graphs, it probably doesn't help you make any predictions until
*after* you find and categorize the problem.  All it *really* tells
you is how long it takes to code various types of fixes.

When the system goes belly up, you probably can't right off the bat
say "oh, that's a one-line error causing a memory overlay, 20 minutes
to fix" or "We know what that is, it's a major design flaw that will
take 4 man-weeks".

I once had to find a memory overlay in ISODE 8.0 (which is on the
order of 500K lines of code). The error would only trip after about 6
million calls to the malloc() memory allocator, once about 120M of
data had been allocated on the fly. Took me about 20 seconds to figure
out "overlay".  I then spent 3 80-hour weeks chasing it (we had a
deadline to meet).  Towards the end of the second week, I was becoming
thoroughly convinced that the entire memory management system was
trash and needed to be overhauled.

Final fix was 3 lines of C code, to repair where a programmer had forgotten
to deal with one boundary condition.

I'm sure anybody who's been doing systems admin/support in the trenches for
more than a few years has a whole collection of horror stories where the
initial diagnosis had absolutely no relationship to the actual problem....

Valdis Kletnieks  Computer Systems Engineer  Virginia Tech


Re: America Offline (Mellor, RISKS-18.33)

Lowell Gilbert <lowell@epilogue.com>
Fri, 16 Aug 1996 08:29:33 -0400
Pete Mellor claims (in RISKS-18.33) that if software producers/maintainers
keep better records, they'll become Real Engineers and be just as able to
accurately predict the "effort required to diagnose and fix a fault" as
other kinds of Real Engineers.

While I wouldn't remotely suggest that better measurements couldn't improve
predictability, truly accurate estimates of time to fix a bug that's not yet
understood will always be an unachievable goal.  The metaphor of hardware
design, although tempting because software designers tend to work closely
with hardware designers, is so weak as to be downright disingenuous.
Software *is* different from hardware in some very relevant ways, but the
most critical for this discussion is the fact that every bug is different.
You can measure mean-time-to-failure for a light bulb quite accurately, but
that's because every light bulb failure is the same.  Every software
problem, however, is unique.  [to paraphrase Tolstoy] Once a software bug is
fixed, it shouldn't occur again in the (repaired) code base.

You can make predictions of service intervals extremely precise by
collecting enough data about previous problems.  However, they will be no
more accurate than the similarity between the problems.  The RISK, as usual,
is that the world will unexpectedly be more complicated than our model of
it.  [The key word being "unexpectedly."  Of *course* the world will be more
complicated than the model.  That's the *purpose* of making a model.]


Re: Bread-riots and circuses (O'Connell, RISKS-18.32)

Hal Lockhart <hal@platsol.com>
Thu, 15 Aug 1996 11:34:25 -0400
This reminded me of a true situation that I remember reading about a while
back (1980's?).  In some city in the Mideast (Beirut?), there was fighting
in the streets every night.  But one night a week there was peace, because
everyone stayed home to watch two episodes of Kojak broadcast one after the
other by two different TV stations.  The locals named the phenomenon "double
Kojak".

Harold W. Lockhart Jr., Platinum Solutions Inc., 8 New England Executive
Park, Burlington, MA 01803 USA hal@platsol.com (617)229-4980 X1202

  [From hijack to lojack by Kojak?  Here is an opportunity for
  pacifist Trojan horses: distributing highly addictive
  interactive computer games to both sides.  PGN]
