I'd like to correct an oft-repeated misstatement: that California DMV records have been secured. In fact, only minor procedural barriers have been instituted against access by unsophisticated people at DMV counters; anyone even slightly more savvy can still get the information, and the DMV continues to sell mass data extractions to reusers. Moreover, the DMV has taken no steps at all to secure its data or systems against misuse by DMV staff or others (e.g., police) with privileged access.
What happened was that a train from Euston to Milton Keynes with about 400 passengers on board was travelling North at about 60mph. Another empty train, which should have been waiting at a red signal just outside Watford Junction, started moving slowly South and crossed over a set of points onto the North-bound fast line straight into the path of the fast train. The fast train driver saw what was happening and sounded his horn but could not avoid a collision. I believe that both drivers managed to jump clear before the crash and were not badly injured.

"Black boxes" were recovered from both crashed trains, and showed that signalling and train systems were working properly just before the crash. These black boxes, I understand, are a recent innovation and are not fitted to all trains.

It turned out that the crashed trains were both fairly new (about 10 years old) and had well-designed carriages which dissipated the force of the impact; also, that the collision was glancing and not head-on. Newspaper photographs showed track torn up and the front carriages of both trains at crazy angles; one was hanging over an embankment. The other carriages of both trains were derailed but had not been thrown over. The line was closed for about 2 days while the carriages were removed and the track repaired.

It was a huge stroke of luck that the collision involved new rolling stock: on some other lines carriages are 30 or 40 years old and are of an antique "slam-door" design which concertinas in the event of a crash. I have read that, of about 70 deaths on the United Kingdom railways in the past 10 years, all but one have occurred in these older carriages.

There are four separate enquiries going on at the moment - by Railtrack (who own the track), North London Trains (who own the trains), British Transport Police and the UK Government's Health and Safety Executive - so I presume the reason for the supposedly stationary train moving will come out in due course.
Alastair Scott firstname.lastname@example.org
A number of risks have emerged from the recent crash. The first concerns Automatic Train Protection (ATP). This system is claimed to stop any train which passes a red signal. It is not used on the British railway network, though it is deployed on other European railways. [Apologies to any trainspotters for any simplification I've made.] The UK railway companies say that ATP is too expensive - that it costs too much for each life it saves, though how they work that out is beyond me. (Another risk?) They claim that the money required for ATP would be better spent on other safety measures, such as modern, stronger passenger carriages. So, rather than prevent trains from crashing into each other, they think the best strategy is to let them crash, but make the rolling stock safer. (Yet another risk?)

The next risk is the absurd way in which Britain's railways are now run after privatisation. One company - Railtrack - owns the stations, track and signalling systems. It seems more interested in property development and turning stations into shopping malls than in the rail infrastructure. This company would have to install ATP, which would make a big dent in its balance sheet. So it's hardly surprising that they are not enthusiastic about ATP, even though they have some responsibility for safety.

Railtrack charges train companies for the use of its stations and track. The train companies operate the services, but they don't own the rolling stock; that belongs to leasing companies, who hire it to the rail operators. Aside from the bureaucracy and ticketing nightmares, there are serious safety risks in this setup. Safety of the trains is the responsibility of the leasing companies who own them, the operating companies who use them, and presumably the companies who maintain them. Where the boundaries are is anybody's guess. Railtrack have responsibility for the safety of the track and signalling systems.
However, if they were to deploy ATP, the leasing and operating companies would have to pay for the extra kit in the trains. Where the boundaries of responsibility lie between Railtrack and the leasing and operating companies is as yet unknown. There was a recent report that a small fire in a junction box was put out by staff throwing dirt and sand at it. The box belonged to Railtrack; the staff worked for an operating company. They feared being disciplined if they used company property - their company's fire extinguishers - to help another company.

Sitting on top of this is a government agency, the Health and Safety Executive, which is responsible for safety in the workplace, amongst other things. Where its responsibility kicks in is yet another unknown. With no single body responsible for train safety, it's hard to apportion blame for crashes or to establish better procedures and communications protocols to make them less likely in future. The companies and agencies involved end up shrugging their shoulders and pointing at each other.

An added problem is that none of these companies has safety as a prime objective. They all want to cut costs to boost their profits. For Railtrack, money spent on safety measures comes straight off the bottom line, reducing dividends for shareholders and profitability bonuses for management. For the leasing companies, repainting old trains is more cost-effective than buying new ones, which presumably have better safety features. For the operating companies, the cost of leasing is one of the few costs they can control. [They can also work their drivers harder, but that is another safety risk.] Thus, they prefer to run old, less safe, trains because they are cheaper to lease than new ones.
> Apparently they pass on the information you entered to another service and
> presumably if you don't show up you don't get to download the software.

That's quite interesting. Now what happens if I telnet into my other service provider (in California) and enter a perfectly valid name, US address and phone number? Not that I would do it, of course; it seems an unnecessarily convoluted way of achieving a fairly simple objective.

Risks readers might like to ponder a Catch-22 situation I found myself in some years ago. Working for a UK company selling high-tech equipment (68000 processors), I wasn't permitted to supply anyone on a blacklist. I could have been extradited and jailed for doing that. Of course, I wasn't permitted to actually see the blacklist either.

And on another subject entirely...

> The final report said that the crash was the result of the pilots fighting
> the autopilot.

I've just heard a report of a crash that happened here two days ago. It was the result of the two pilots fighting each other. As a result, a business jet crashed on one of Europe's busiest roads. Four injuries, no fatalities.

Bernard Peek email@example.com
The author is right. This is not a novel idea. The Ericsson AXE exchange has had the functionality to upgrade software in a running system since the first system was delivered, more than ten years ago. This was a deliberate design decision, due to the harsh demands on "uptime" in telecommunication systems. Kurt Fredriksson

[Peter Denning reminds us that the Newcastle work on Recovery Blocks — Brian Randell et al. — is relevant here, and certainly worthy of mention. It goes way back to the mid-1970s. It was an extremely well-thought-out effort, with a language, a run-time system, and a supporting hardware architecture for dealing with concurrent processes running different versions of an algorithm (alternat[iv]e blocks), such that the collection of processes would not terminate until at least one of them passed an acceptance test. Their system included automatic checkpointing so that you could back up properly to the last known state that passed an acceptance test. Thanks to PJD for the reminiscence. PGN]
"Vladimir Z. Nuri" <firstname.lastname@example.org> writes about the possibility of running new software as a "shadow" of currently running software and (perhaps automatically) testing its reliability before switching it to "actively" controlling the system. This method may be fine for bug fixes, but it has a fundamental limitation: it ignores a basic issue of software upgrades, namely that the new software may have functionality which is unavailable in the old version, and thus will never be exercised while the old software is the only one that is "active".

For example, Bell Canada recently introduced phone mail to residences that previously had only "call answer". Since phone mail has fundamentally new functionality over call answer, the phone mail cannot be tested under real-world conditions until its functionality is activated and actually used by the users. And if it fails, it will be difficult to reliably fall back to call answer alone, because any phone mail messages left in the queue may then be lost or left dangling.
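[The "shadow" scheme and its limitation can be made concrete. What follows is a minimal illustrative sketch, not code from any real system -- all names are hypothetical. Requests are always served by the old version; the candidate runs in shadow and divergences are logged. Note that, exactly as argued above, only functionality the two versions share ever gets exercised. PGN-ed.]

```python
def shadow_run(active, shadow, request, mismatches):
    """Serve from `active`; run `shadow` on the same input and compare."""
    result = active(request)
    try:
        candidate = shadow(request)
        if candidate != result:
            mismatches.append((request, result, candidate))
    except Exception as exc:  # a crash in the shadow must never hurt users
        mismatches.append((request, result, exc))
    return result

# Hypothetical old and new implementations of the same function;
# the new one has a deliberate fault for inputs >= 100.
def old_version(x):
    return x * 2

def new_version(x):
    return x + x if x < 100 else x * 3

mismatches = []
for req in (1, 50, 150):
    shadow_run(old_version, new_version, req, mismatches)

# Only the divergent input is flagged; users always saw old_version's answer.
assert [m[0] for m in mismatches] == [150]
```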
I believe that the venerable Multics system supported this to some degree. Also, IBM's AIX system has gotten quite good at update-on-the-fly (although it's not *quite* up to full 24x7 yet), mostly due to the fact that most of the operating-system kernel is loaded on the fly. There are still some gotchas, most notably in trying to reload a device driver after applying maintenance...

>What we have is a sort of "shadow computation" going on behind the scenes...

Unfortunately, if you have a system that's running at 85% capacity, you will require just about twice as much processor. Also, you introduce new failure modes. I believe the Space Shuttle uses a 5-way redundant system, with 4 systems made and programmed by one contractor, and the 5th a different design and programming from a separate contractor. More than once, a shuttle launch has been scrubbed because the voting mechanism itself broke down.

On the other hand, I seem to remember that when the great long-distance telephone collapses happened a few years ago, a telco official was asked why they didn't just reboot the switch, and he replied that this implied that the switches had been booted a first time - apparently, some of them had been upgraded from mechanical rotaries through several generations of more electronic and computerized designs, without ever actually going down. Anybody have more info on how the telcos do software upgrades? They seem to have quite a good record on it (barring a few historical botches ;)

Valdis Kletnieks Computer Systems Engineer Virginia Tech
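[The Shuttle-style voting arrangement mentioned above can be sketched in miniature. This is an illustrative toy that assumes nothing about the real Shuttle software: several independently written versions of a routine run on the same input and the majority answer wins; if no majority exists, the voter itself fails -- the analogue of a launch being scrubbed by the voting mechanism. PGN-ed.]

```python
from collections import Counter

def vote(implementations, x):
    """Run every implementation on x and return the majority answer.

    Raises RuntimeError when no strict majority exists, which is the
    toy equivalent of the voting mechanism forcing a scrub.
    """
    results = [impl(x) for impl in implementations]
    answer, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among versions: %r" % results)
    return answer

# Three hypothetical independently written versions of "square";
# the third has a fault that only bites at x == 3.
versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * x + (1 if x == 3 else 0),
]

assert vote(versions, 2) == 4   # all three agree
assert vote(versions, 3) == 9   # faulty version is outvoted 2-to-1
```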
A nice analogy for fault-tolerant system upgrades struck me after I wrote the earlier article. When highways are being resurfaced, they are rarely barricaded entirely; instead, traffic is diverted during off-peak periods and run at less-than-peak efficiency. I am trying to spur designers to think of the flow of data through programs as exactly the same kind of situation: ideally, data could be easily rerouted through different parts of a system even as it is being upgraded. Of course, roads do not have the complexity of software-- the pavers do not have to worry about the road not working properly once installed.

Kurt Fredriksson remarks that the Ericsson AXE exchange has the functionality to upgrade software on the fly. I am not familiar with it, but I wonder if it has all the key properties I mentioned, or if it subtly relies on any of the hidden "gotchas", such as assuming designers will infallibly carry over functionality between versions (i.e., no regressions). It is easy to claim that software can be instantly switched to a new version (or back to an old one), but such a mechanism is not entirely desirable without other features, such as the ability to seamlessly compare the compatibility results of new versions with earlier ones. In other words, one of my basic assertions in the essay is that regressions in software, where bugs in existing functionality are introduced in new versions, can be caught through more systematic methods than are generally employed today.

Wayne Hayes writes in a response that the system I propose cannot handle "new functionality". In the short essay I could not include this obvious caveat, although it is quite apparent and a significant limitation. Actually, the essence of the mini-essay focuses on an elegant and graceful way of avoiding "regressions", and obviously one cannot have a regression where the functionality did not previously exist. But Mr.
Hayes also brings up an excellent related point about situations where new functionality conflicts with old functionality. In such cases designers could actually write code that "bridges" the two versions, so that they have fallback algorithms when the new code fails to function. In other words, they write a bit of additional code that tries to gracefully fall back or rearrange the system so that the earlier version can be run. Of course, that code could have bugs too, and we tend to run into issues of infinite regress in some of these ideas. (Nevertheless, companies often have to assume that many of their software versions are running simultaneously.) In fact there is a whole array of issues that readers can immediately spot-- this is exactly the kind of development and attention I suggest be channeled into creating computer languages and hardware systems that take all the various scenarios into account.

I want to highlight this point of Mr. Hayes': the system I am describing for graceful system upgrades is not designed to guarantee that new functionality is correct, only that old functionality is not "clobbered" in an upgrade, with no system downtime in the process. Of course, the process of testing new functionality is an entire art form in itself. However, the fact that "seamless software upgrading" does not guarantee the correctness of new functionality does not mean it is not superior to the systems we have now, which frequently do not even guarantee old functionality in practice (although the designers would insist they do in theory-- a perception gap I am explicitly challenging). If a new feature does not work correctly but the system is still running, that's highly desirable. Designers would be elated to be able to test new features without fear of breaking the overall system.

One nice way of thinking about this is the following: every software package is a core of functions, say A = X + Z. X is the core code that should remain compatible into the future.
Z is code identified as obsolete and "to be deleted" in a future version. A new version, A', adds some new functionality Y, so that A' = A + Y - Z = X + Y. A "regression" happens whenever something in X fails to function in the new version, something I am suggesting can be better dealt with by a change in designers' perceptions and tools relative to the inherent evolutionary aspect of software (in contrast to the perception of inertness that holds today). To summarize the above points in this framework, Mr. Hayes quite rightfully points out that if you don't have any prior information on Y, the concept of "regression" is not applicable. But his example also shows the problem of isolating exactly what Y is relative to X. Trying to draw the line between the two would be very difficult in some situations, such as the case he gives, where new functionality is not merely an addition to old functionality but a replacement of aspects of it. However, merely placing attention on the problem improves its chances for solution, and the designer is forced to explicitly answer the question, "how can I gracefully add this new feature and possibly revert to the old one if it doesn't work?"

Mr. Kletnieks mentions another demand of graceful software upgrading: a real-time system running at anywhere over 50% capacity would need added capacity to mirror itself. But to me this is essentially like describing mirrored disk drives to someone, and their saying, "but you have to have twice as many disk drives as you now have". That's the basic part of the cost/benefit tradeoff. Also, I think that many of the ideas can be used in non-real-time systems, which would not require an upgrade in capacity but would instead imply slower running times. Of course there is additional overhead in introducing such ideas, just as, for example, in OOP the inherent overhead of function calls is increased due to the indirection.
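[The A = X + Z, A' = X + Y framing suggests a simple regression harness: replay inputs known to exercise only the shared core X and diff the two versions. A minimal sketch follows; the functions and the injected fault are hypothetical, invented for illustration. PGN-ed.]

```python
def version_a(op, a, b):
    # X (core): "add", "mul";  Z (obsolete): "sub", dropped in the next version
    return {"add": a + b, "mul": a * b, "sub": a - b}[op]

def version_a_prime(op, a, b):
    if op == "add":
        return a + b
    if op == "mul":
        # injected regression in X: wrong answer whenever a == 7
        return a * b if a != 7 else a * b + 1
    if op == "div":
        return a / b                    # Y: new functionality, no prior baseline
    raise KeyError(op)

# Replay inputs that exercise only the shared core X.
core_inputs = [("add", 2, 3), ("mul", 4, 5), ("mul", 7, 2)]
regressions = [case for case in core_inputs
               if version_a(*case) != version_a_prime(*case)]

assert regressions == [("mul", 7, 2)]
```

As Mr. Hayes' point predicts, the new "div" operation (Y) cannot be checked this way, since there is no prior behaviour to compare it against.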
Of course, the power derived is inherent to this sacrifice. The use of voting software in the Space Shuttle is one example of a system similar to what I am proposing. What Mr. Kletnieks describes is essentially the idea of using shadowed computation on a single process to guarantee reliability, but the area of *upgrading* that I am focusing on is not as apparent in that example. His example makes a good point, however: the code designed to deal with multiple versions may itself have bugs in it. Again, the infinite regress-- what code will check the checker code? I suggest this will be less of a problem when the checker code is intrinsic to the language, so that this would be like asking, "what if the compiler has bugs in it?"

I would also like to point out that it ought to be up to the end user to determine the actual compatibility of versions. Today we have a system in which the companies that create the software give the assurance that it is or is not compatible, and the end user often cannot determine whether this is so without committing to the new version, with the "upgrade hell" scenario frequently ensuing despite everyone's best intentions. In contrast, with a suitably flexible and powerful computation environment such as I describe, what the company promises would be less relevant, and the end user could have ultimate control over testing and switching versions to match his own demands. How can a company guarantee that it has a rigor in its testing environment that exceeds that of all its customers? History generally suggests it can't, and I suggest that systems be designed with this in mind, giving greater control to the end user in arbitrarily and seamlessly testing and switching between versions.

Based on the responses, I am struck that the ideas I was highlighting in my original essay were explored somewhat thoroughly in the 1970s, yet they apparently haven't made it into widespread use in software or hardware.
I suggest they could be applied, in varying degrees, to areas outside the incredibly demanding environments of telecommunications and space exploration, with fruitful results. (I would also like to hear more from RISKS readers about how telecommunication software is upgraded-- the infamous AT&T switching disaster of a few years ago shows that it was even "recently" subject to catastrophic human miscalculations.) The graceful-upgrade paradigm seems not to have made the difficult leap from theoretical curiosity to widespread use and awareness by "Joe Codehead". Again, I think the analogy to OOP is relevant. It took a massive paradigm shift in consciousness for OOP to "trickle down" throughout the software industry. I suspect that the ideas for seamless software upgrades are roughly as significant and valuable, and I am writing partly in the hope that others can take up this ideology for research and implementation beyond my own limited and minimal elucidations.
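[The "bridging" idea discussed above -- writing a bit of extra code that falls back to the proven version when the new code fails -- might be sketched as follows. This is a toy under stated assumptions: the function names, the acceptance test, and the deliberately buggy rewrite are all hypothetical. PGN-ed.]

```python
def bridged(new_impl, old_impl, acceptance_test):
    """Wrap two versions: try the new one, fall back to the old on failure."""
    def wrapper(*args):
        try:
            result = new_impl(*args)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # fall through to the proven version
        return old_impl(*args)
    return wrapper

def old_parse(s):
    return s.split(",")          # proven behaviour

def new_parse(s):
    # hypothetical rewrite with an injected bug on empty input
    if not s:
        raise IndexError("oops")
    return s.split(",")

safe = bridged(new_parse, old_parse, lambda r: isinstance(r, list))

assert safe("a,b") == ["a", "b"]   # new code handles it
assert safe("") == [""]            # new code raises; old version answers
```

Of course, as noted above, the bridge itself could have bugs -- the infinite regress does not disappear; it is merely pushed into a smaller, simpler piece of code.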
> Very precisely directed beams are required, ...

Of course, the traditional car-stopping alternatives are a fusillade of gunfire; a highly hazardous high-speed pursuit; ramming of the fleeing vehicle; and, all too frequently, a "spirited arrest" by the pursuing officers. Except for the risk of turning pacemakers "up", "down", or "off", this non-lethal tool might be worth the collateral damage.

Michael Brady, CPP, Corporate Security Manager, Silicon Graphics, Inc. Global Facilities, World Wide Administration Division email@example.com

[Might as well try EMP (nuclear) while we are at it! It would reduce traffic (but not congestion). PGN]
Regarding the power outage that hit several Western US states last Saturday, I had an interesting experience. I was in Las Vegas that weekend, and arrived at the Hilton hotel/casino where I was staying shortly after the power outage started. At the time, I was unaware of the outage, and things appeared more or less normal inside, since the casino was on generator power. However, when I got to the front of the check-in line, I was told that I could not be issued a key to my room, since all of the machines that make the room keys (in this case, plastic cards with mag stripes) were "down." At first, they were sending new guests to their rooms accompanied by security, who would let them in with a master key. However, this quickly became overwhelming as the outage progressed, so they told us just to check back every 30 minutes until they got the card-key machines working again. With nothing better to do, I settled down in the bar :-), where I learned about the power outage.

At this point, it was really interesting to see what Hilton considered essential enough to have on the backup generators, and what was "unessential" and therefore could be dark during the power outage. It was no surprise to see that *everything* on the casino floor was considered essential, right down to the chandeliers. Reportedly, everything went dark for a few seconds when the outage began and while the generators spun up. After that, whatever was on the backup system came back on. The slot machines remembered how many credits each player had, so at least that part of them must be on some sort of UPS. In the bar, most of the lights were out, but the lights behind the bar, as well as all of the equipment needed to keep it open (beer and soft-drink taps, cash registers, etc.), were on. It was interesting that at the front desk, all of the computers were up, so they could check guests in and out, but the card-key machines were not.
The risk here is that even when you have a backup generator, your operations can still be crippled by a poorly thought-out strategy for what you place on the backup system. Steve Forrette, firstname.lastname@example.org
The problem with this is that although it gives you a nice-looking set of graphs, it probably doesn't help you make any predictions until *after* you find and categorize the problem. All it *really* tells you is how long it takes to code various types of fixes. When the system goes belly up, you probably can't right off the bat say "oh, that's a one-line error causing a memory overlay, 20 minutes to fix" or "We know what that is, it's a major design flaw that will take 4 man-weeks".

I once had to find a memory overlay in ISODE 8.0 (which is on the order of 500K lines of code). The error would only trip after about 6 million calls to the malloc() memory allocator, once about 120M of data had been allocated on the fly. Took me about 20 seconds to figure out "overlay". I then spent 3 80-hour weeks chasing it (we had a deadline to meet). Towards the end of the second week, I was becoming thoroughly convinced that the entire memory management system was trash and needed to be overhauled. Final fix was 3 lines of C code, to repair where a programmer had forgotten to deal with one boundary condition.

I'm sure anybody who's been doing systems admin/support in the trenches for more than a few years has a whole collection of horror stories where the initial diagnosis had absolutely no relationship to the actual problem....

Valdis Kletnieks Computer Systems Engineer Virginia Tech
Pete Mellor claims (in RISKS-18.33) that if software producers/maintainers keep better records, they'll become Real Engineers and be just as able to accurately predict the "effort required to diagnose and fix a fault" as other kinds of Real Engineers. While I wouldn't remotely suggest that better measurements couldn't improve predictability, truly accurate estimates of the time to fix a bug that's not yet understood will always be an unachievable goal. The metaphor of hardware design, although tempting because software designers tend to work closely with hardware designers, is so weak as to be downright disingenuous.

Software *is* different from hardware in some very relevant ways, but the most critical for this discussion is the fact that every bug is different. You can measure mean time to failure for a light bulb quite accurately, but that's because every light-bulb failure is the same. Every software problem, however, is unique [to paraphrase Tolstoy]. Once a software bug is fixed, it shouldn't occur again in the (repaired) code base. You can make predictions of service intervals extremely precise by collecting enough data about previous problems. However, they will be no more accurate than the similarity between the problems. The RISK, as usual, is that the world will unexpectedly be more complicated than our model of it. [The key word being "unexpectedly." Of *course* the world will be more complicated than the model. That's the *purpose* of making a model.]
This reminded me of a true situation that I remember reading about a while back (1980's?). In some city in the Mideast (Beirut?), there was fighting in the streets every night. But one night a week there was peace, because everyone stayed home to watch two episodes of Kojak broadcast one after the other by two different TV stations. The locals named the phenomenon "double Kojak". Harold W. Lockhart Jr., Platinum Solutions Inc., 8 New England Executive Park, Burlington, MA 01803 USA email@example.com (617)229-4980 X1202 [From hijack to lojack by Kojak? Here is an opportunity for pacifist Trojan horses: distributing highly addictive interactive computer games to both sides. PGN]