Troubleshooting Professional Magazine
Intermittents
Copyright (C) 1998 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Troubleshooting Professional Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.
Or maybe not so strange. Even though Troubleshooting Process doesn't guarantee solution of an intermittent, it sure helps. Since intermittents are an order of magnitude more difficult than reproducibles, ordinary personnel are given reproducibles while Troubleshooting Process experts get the intermittents. I guess the bottom line is we all need to become experts at intermittents. That's what this issue of Troubleshooting Professional is all about.
This issue sports our first-ever regular column, Linux Log. Every month, Linux Log will contain a Linux article fitting the theme of the issue.
And be sure to see the letters to the editor. The November issue, whose theme was Linux, broke all records for ANY Troubleshooters.Com web page. It appears to have been visited over 3000 times on 11/19/1998. And brought truckloads of mail, much of it (deserved) flames concerning my underestimation of non-Intel CPUs, and my holding Linux to a higher standard than Windows without making that fact clear. I find this month's letters to the editor intelligent and informative.
So kick back and relax as you read this issue. And remember, if you're a Troubleshooter, this is your magazine. Enjoy!
An intermittent is a problem for which there is no known procedure to consistently reproduce its symptom.
Note these words:
KNOWN: Most problems can be reproduced. After all, if a problem couldn't be reproduced, it would never happen at all, not even by chance. The issue isn't whether the problem can be reproduced, it's whether the Troubleshooter can reproduce it at will, in order to perform authoritative tests. And to do so, he or she must be aware of the sequence of actions necessary and sufficient to reproduce the fault. The point here is that intermittence, or non-reproducibility, is dependent on the information available to the Troubleshooter, rather than the condition of the system.
Here's a true story illustration: My continuously running software hung once or twice a day. I finally managed to see it happen, capture the exact sequence of input files causing it, and preserve that sequence permanently by marking the files read-only. From then on it happened every time. The reproduction procedure was to assemble that exact combination of input files in that order.
PROCEDURE: Reproduction often depends on a complex combination of factors, and often on the order in which those factors appear. Thus reproduction requires a procedure to invoke the factors in proper sequence.
CONSISTENTLY: I've heard many people define a problem as "reproducible", and go on to explain that it will happen within an hour of turning the machine on. I'd call that an intermittent, because the machine is reproducing the problem, not the Troubleshooter. Unless it consistently happens in 57 minutes and 32 seconds, the Troubleshooter can't reproduce it. And, as we'll discuss in the next article, if the Troubleshooter can't make it happen, troubleshooting becomes much harder.
Now let's look at an intermittent. For a given test, there's no way to know whether the symptom vanished because of the test, or because the intermittent just picked that moment to change state. Inability to conclusively rule out sections necessitates re-visits to those sections and circular Troubleshooting. Time to solution can go up by factors of 10, 100, or occasionally more. I've seen intermittents take months to solve.
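To put rough numbers on that uncertainty, here is a minimal Python sketch. It assumes, purely for illustration, that the intermittent shows its symptom independently in each test with a fixed probability. Real intermittents are rarely that tidy, and the function name and the 5% figure are hypothetical rather than measured on any system.

```python
def chance_symptom_vanished_by_luck(symptom_probability, clean_tests):
    """Probability that an UNFIXED intermittent stays quiet through every one
    of clean_tests tests, assuming each test independently shows the symptom
    with probability symptom_probability (an illustrative assumption)."""
    if not 0.0 <= symptom_probability <= 1.0:
        raise ValueError("symptom_probability must be between 0 and 1")
    if clean_tests < 0:
        raise ValueError("clean_tests must not be negative")
    return (1.0 - symptom_probability) ** clean_tests


if __name__ == "__main__":
    # Hypothetical intermittent that shows its symptom in 5% of tests.
    for tests in (1, 10, 50):
        quiet = chance_symptom_vanished_by_luck(0.05, tests)
        print(f"{tests:3d} clean tests: an unfixed fault stays quiet {quiet:.0%} of the time")
```

Under these assumptions, ten clean tests in a row would still happen about 60% of the time with the fault untouched, which is why a vanished symptom rules out so little.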
In safety related cases, solution is necessary but costly. The Universal Troubleshooting Process is helpless against sparse intermittents. They're better handled by other Troubleshooting Processes designed specifically for them, including one called Root Cause Analysis. We'll discuss that later.
There's a part of us all that says "it was just a glitch". But if the brakes fail again it might be on a mountain road with a 3000 foot drop.
The mechanic might replace the calipers as a likely suspect, and tell you it's probably OK. Customer testing is usually fine (if the customer is aware that this is being done), but in this case it could kill the customer.
The mechanic might replace everything in the braking system. But what if the root cause was outside the braking system? As cars get more computerized, this becomes more likely.
You might get a new car. While that would certainly eliminate the root cause, it's costly.
These are some of the special concerns when intermittents meet safety critical systems. Test and narrow just don't cut it. If the symptom happens again, lives can be lost. General maintenance won't cut it. The worst outcome would be to have the symptom vanish, only to wonder if it will happen again, and what injuries will result.
The right solution is to find the root cause, prove logically that it would have caused the EXACT symptom or syndrome, prove that it was likely to happen, find out exactly what caused that root cause to happen (often it's a problem with people, documentation, procedures, safety checks, etc.), and prevent future occurrence. And make no mistake: this costs big money.
Ultimately, every intermittent in a safety critical system becomes a tradeoff between safety and money. Take the car's brakes. Few of us would buy a new car or assign a committee to do a Root Cause Analysis. What we'd probably do is take the mechanic's advice about what to replace, then spend the next week testing the daylights out of the brakes on long, straight deserted roads. After that, we'd cross our fingers.
How differently we'd react if the system were part of our nuclear defense system. A single failure could wipe out the human race. Here we'd get every expert, regardless of cost. We'd do a Root Cause Analysis, walk through everything, review every possibility. We wouldn't hesitate to spend a billion dollars.
Regarding safety, sparse intermittents are the worst. If brakes fail once every 20 times, it's easy to test in a safe manner, reproduce the problem, and solve it. No big deal. Even in the nuclear defense system, we could disconnect all missiles (I'm hypothesizing here, I have no idea how this stuff works), and every 20 times get more info.
But if the brakes fail once a year, there's no practical way to reproduce it. We have the choice of bearing the cost of working backwards theoretically to try to isolate it, or doing our best and crossing our fingers.
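The gap between those two cases can be sketched with a little arithmetic. The Python below assumes each try (one use of the brakes, or one day of driving) fails independently with a fixed probability, and asks how many tries it takes to be 95% sure of seeing at least one failure. The function name, the 95% figure, and reading "once a year" as one failure per 365 days are illustrative assumptions.

```python
import math


def tries_to_see_failure(failure_probability, confidence=0.95):
    """Smallest number of independent tries needed so that the chance of
    seeing at least one failure reaches the requested confidence.

    Solves 1 - (1 - p)**n >= confidence for the smallest whole n."""
    if not 0.0 < failure_probability < 1.0:
        raise ValueError("failure_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_probability))


if __name__ == "__main__":
    # Hypothetical rates: a brake that fails 1 time in 20, and one that
    # fails about once a year (treated here as 1 day in 365).
    print("1-in-20 brake, tries needed:", tries_to_see_failure(1 / 20))
    print("once-a-year brake, days needed:", tries_to_see_failure(1 / 365))
```

Under these assumptions the 1-in-20 brake needs 59 tries, an afternoon's work on a deserted road, while the once-a-year brake needs roughly 1,100 days of driving, about three years.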
It may be unpleasant to talk about, but intermittents in safety critical systems are always a tradeoff.
We always face the question "what maintenance is appropriate for General Maintenance". That's discussed thoroughly in the February 1998 issue of Troubleshooting Professional, and also in Troubleshooters.Com. In intermittent situations, we include a much broader range of activities under the General Maintenance umbrella, because the alternatives are so much more difficult. As an example, in a computer with a reproducible problem, we'd almost never start by reseating every card and replacing every cable. We'd do that for a computer intermittent. We're gambling that the time and money we spend will save us huge expenses troubleshooting that intermittent.
As mentioned in the article on Safety and Intermittents, General Maintenance is unsuitable for intermittents in safety related systems. There's nothing worse than having the problem disappear, and not knowing whether it will ever occur again, possibly resulting in injury or loss of life.
Preventive Maintenance (PM) is best done in situations where the organization recognizes the need, and is willing to spend the time and money. It works poorly where management believes "if it ain't broke, don't fix it". That type of management is optimized to put out fires, and that's what they spend most of their time doing. And of course, if their product is safety critical, the firefighters will soon be litigated out of business.
Manipulation, coupled with astute observation, is often the quickest way to solve an intermittent.
Finally, after seeing a crash, I shut down the system. I then used Netware's salvage command to bring back the input files processed just before the crash. I marked them read-only so they couldn't be renamed or deleted. I then fired up the system, which instantly crashed. Bringing those files to a local PC, I was able to make it happen every time. There was just something about that sequence of files that made it happen.
Now that it happened every time, it was just a matter of watching in the debugger until I found that I was passing a local string as a subroutine return. Of course, as soon as the subroutine ended the string variable went out of scope, and its memory was free for any other code to overwrite... The key to solving this sparse intermittent was converting it from an intermittent to a reproducible by finding a series of files which would consistently reproduce the problem.
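The trick of freezing the triggering inputs can be sketched in a few lines of Python. This is not the code used in the story (that system ran on Netware, and the files were recovered with salvage). It's a minimal illustration, assuming you already have the suspect files listed in the order they were processed. The function name, the numbered-prefix naming, and the directory layout are all hypothetical.

```python
import shutil
import stat
from pathlib import Path


def freeze_input_sequence(input_files, repro_dir):
    """Copy input_files, in the given order, into repro_dir and mark each
    copy read-only so the sequence can't be renamed away or altered.

    A numeric prefix records the processing order. Returns the new paths."""
    repro = Path(repro_dir)
    repro.mkdir(parents=True, exist_ok=True)
    frozen = []
    for position, source in enumerate(input_files, start=1):
        source = Path(source)
        target = repro / f"{position:04d}_{source.name}"
        shutil.copy2(source, target)  # copies contents and timestamps
        # Read-only for owner, group, and others.
        target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        frozen.append(target)
    return frozen
```

Once the sequence is frozen, feeding those copies to the program in order becomes the reproduction procedure, and every later test is run against the same inputs.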
We're not quite there yet. The problem is that the same technology that allows our computers to do the troubleshooting has been incorporated in modern machines, making them orders of magnitude more complex, and therefore out of the reach of statistical analysis by an affordable computer. But we're getting closer, and some day we'll have it.
Statistical analysis is actually used today, although in non-rigorous forms. When the Troubleshooter wiggles, cleans, reseats, moves, tweaks, heats and cools, he or she observes the results and informally analyzes the statistical significance of the results. When the Troubleshooter finds something way off the bell curve, he or she narrows the scope of the problem.
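That informal judgment of "way off the bell curve" can be made slightly more explicit. The sketch below, in Python, asks: if a manipulation (heating a chip, say) had no effect, how likely is it that we'd see this many failures anyway? It assumes independent trials and a known baseline failure rate, neither of which a real intermittent guarantees, and the function name and sample numbers are hypothetical.

```python
from math import comb


def chance_of_at_least(failures, trials, baseline_rate):
    """Probability of seeing `failures` or more failures in `trials`
    independent trials if each trial fails with probability baseline_rate.

    This is the upper tail of the binomial distribution. A very small
    result suggests the manipulation, not luck, is producing the failures."""
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be between 0 and 1")
    if not 0 <= failures <= trials:
        raise ValueError("failures must be between 0 and trials")
    return sum(
        comb(trials, k) * baseline_rate**k * (1.0 - baseline_rate) ** (trials - k)
        for k in range(failures, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers: the symptom normally shows in 2% of trials,
    # but it showed in 4 of 10 trials while one component was being heated.
    print(chance_of_at_least(4, 10, 0.02))
```

With those made-up numbers the result is well under one in a thousand, so the Troubleshooter would be justified in narrowing the search to the heated component. A result near 0.5 would say the manipulation told us nothing.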
And more rigorous statistical analyses are often practiced. We often use strip-chart recorders to record when symptoms happen, then compare the events to written logs of environmental factors. With today's technology, it would be quite easy to use a computer instead of a strip chart recorder, to record not only the events, but the environmental factors. And while it's at it, it could do a statistical analysis, and sound an alarm when it finds a probable reproduction sequence. The difference between this and the computer analysis described at the beginning of this section is:
It's not presented as a rigorous series of steps, preferring to present the "steps" as a series of tools with a recommended order of performance. The tools include defining the problem and collecting data, task analysis, change analysis, control barrier analysis, event and causal factor charting, interviewing, determining root cause, and developing corrective actions. A detailed description of Root Cause Analysis is beyond the scope of this article, but if you'd like to learn more a good starting point is the book "The Root Cause Analysis Handbook" by Max Ammerman, ISBN 0-527-76326-8. If you understand the Universal Troubleshooting Process, you'll find this book's concepts familiar, and its optimization to determining cause after occurrence of an event ingenious.
He describes how his overclocked system, which had been running perfectly under DOS/Win31, crashed incessantly under Win95. He said he was terribly distressed (kind of like I was in "Litt Takes the Nine Count"). He slowed the processor, and the problem went away. Michael is overclock-savvy, so he immediately suspected a heat problem (most overclocking problems involve heat). He applied heatsink compound (also called silicone grease), and was able to overclock it again with no Windows 95 problems.
It sounded exactly like my problems with the Cyrix/noname motherboard. I know most integrators don't bother with heatsink compound. And Michael Verstichelen says "AMD and Cyrix have tended so far to push their CPUs a lot more", meaning they run hotter at their rated clock speed. I wish I had that old motherboard/Cyrix to test whether this was the problem, but I refunded them. I'll bet you dollars to donuts a bigger heatsink, with a better fan, attached with heatsink compound, as well as use of several case fans, would have made the situation different.
If you're noticing lots of intermittence in your computer, underclock it by 20% to 40%. See if the problems decrease dramatically. If so, it's either a timing issue or a heating issue. If, at full rated clock speed, either your CPU or heatsink is too hot to touch for 5 seconds, get better cooling and see if that helps. Remember to use heatsink compound to better thermally couple the CPU and heatsink.
Always try to notice whether the intermittence happens in the first few minutes after powering up in the morning. If it doesn't, but instead waits a few minutes to start happening, your intermittence is almost certainly caused by an overheated CPU.
And here are a few tips:
Built in December, 1997, it's a Pentium II 300 on an Abit LX6 motherboard, with 128meg of 10ns 4 clock SDRAM, 6.4 gig disk, Win98 installed clean and fresh. As fate would have it, my power supply burned out a week ago, so I had my vendor replace it (heck, the labor was free if I paid for the parts), and while he was at it I had him install a faster, higher torque ball bearing heat sink fan (with a new heatsink), and an additional fan sucking air into the front of the case. Although my vendor didn't use heatsink compound, he showed me that the new heatsink had this special kind of metal designed to promote thermal conductivity from the CPU. This metal looked something like mu metal or lead. I guess what I'm saying here is that this machine is equipped better than average to be overclocked.
The computer's performance appeared "crisper", with menus operating quicker and screens painting faster. Programs appeared to load quicker, but I didn't stopwatch it, so it could have been an illusion caused by the "crispness". Please keep in mind that because I sped up the bus, memory access and disk access also sped up.
Be sure you're using the right memory. LX and newer boards require four clock SDRAM, not the 2 clock that worked on the TX's. The new 100mhz motherboards require a special kind of SDRAM to keep up with them. Make sure you're using the right ram for your setup.
See if this removes your intermittence. If so, don't worry about performance -- later you can turn most of the caching and shadowing on, 1 by 1, to gain back most of your performance while keeping any offending settings disabled.
Please note that it's possible the real cause is a hot processor, and dumbing down the bios appears to fix it, but really masks the CPU overheat problem. Under such a scenario, the true solution is to decrease the CPU's temperature.
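The one-by-one re-enabling described above is a simple loop, sketched here in Python. The setting names are examples only, and is_stable stands in for whatever soak test you trust (hours or days of normal use, in practice); both are hypothetical. With an intermittent, a "stable" verdict is only as good as the length of that soak test.

```python
def reenable_one_by_one(disabled_settings, is_stable):
    """Turn settings back on one at a time, keeping each one only if the
    machine stays stable with it enabled.

    disabled_settings: setting names in the order you want to try them.
    is_stable: callable taking the list of currently enabled settings and
               returning True if no intermittence was seen during the test.
    Returns (kept_enabled, offenders)."""
    kept_enabled = []
    offenders = []
    for setting in disabled_settings:
        trial = kept_enabled + [setting]
        if is_stable(trial):
            kept_enabled = trial       # performance regained, keep it on
        else:
            offenders.append(setting)  # leave this one disabled
    return kept_enabled, offenders
```

Changing one setting per test keeps each result attributable to a single change. Re-enabling several at once is faster, but a failure then leaves you guessing which setting caused it.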
On our desktops hourly crashes and reboots are a hassle, but we live with them. I've heard of nobody who could set up a crashless (let's say for 2 months at a time) Windows system, even NT.
On the other hand, when we're running a network or a busy website, crashes must be assigned a cause and eliminated -- not ignored. There's just too much chance of a crash leaving a database or file update in an illegal state, no matter how much commit, rollback and transaction processing we have. And of course, frequent crashes are seen by online customers as a sign of amateurism.
Linux stays "up" for days and weeks at a time. And when it fails, it's much easier to find the cause. Windows is a nice, convenient operating system for personal use. And indeed, Windows NT offers improved reliability. But when reliability is at a premium, and problems, especially intermittents, are intolerable, Linux is my choice among the low cost operating systems.
The November issue of Troubleshooting Professional
Magazine (Linux issue) produced more letters to the editor than all previous
issues combined. It appears I made the following mistakes:
Here's an elaboration of what I expect from Linux: I expect (and have not been disappointed) Linux to work in a logical fashion, and keep working that way day after day. I expect it to go easy on system resources. I forgive Windows for crashing or hanging hourly. After all, crashing is what Windows does best. I hold Linux to a higher standard. Life isn't fair.
Your statement; "Cyrex/Amd are iffy in Linux" in the current "Troubleshooting Pro" is quite frankly, a load of bull! I have been running Linux for three years on various boxes almost all of them with AMD or Cyrix cpus. My very first Linux box had an AMD 486/66 and ran Redhat 2.0 with no problems. I currently own a K6-300, a K6-233, two AMD 5X86's (one over clocked at 160mhz) a Cyrix PR200+ and a Intel 486/66, all at one time or the other have run Linux with no problems. Most still do. I don't know who gave you this bogus info but he doesn't know his head from a hole in the ground.
You can put a complete distribution of Linux on a dos partition;
I have installed the "Monkey" five floppy distribution, (available from
SunSITE) on a 80meg drive, along with Win 3.1, ran the SVGA X-server, had
full networking, with room to spare, all on the above mentioned Intel 486/66.
8meg of ram is enough for Linux from the shell but I recommend at least
16meg for the X-server.
--
Rick,
Rick writes further:
After further thought, here's what I think is going on; People
with a Cyrix or AMD K5-K6 think that they have a "Pentium clone" and build
their kernel with Pentium support. WRONG! Only Intel makes
a Pentium, The AMD K5-K6 and Cyrix 686 processors while providing Pentium
performance are not Pentiums! You must build your kernel with generic
386 support. It's a case of not reading the manual. (The AMD
486 works fine with 486 support built into the kernel.) Most all
distributions come with the generic 386 support built into the kernel for
this very reason.
Rick. Smith <riter311@gte.net>
Editors Note: Rick, your letter typifies the piles
of letters from satisfied AMD/Cyrix Linux users. Thanks for setting the
record straight, and especially thanks for pointing out the kernel issue.
The only point I have to make here is you mention that Linux is "iffy" on anything but the most common of hardware.... You may be correct in this point, but NT supports *less* hardware than Linux does... Just thought you might want to know that.
Cheers,
- Brendan
Brendan Rankin <brendan.rankin@siliconmetrics.com>
Editors Note: Brendan, you point out so well that I judged Linux much more critically than Windows, and in doing so understated the value of non-Intel processors. Thank you.
I am also worried that people will look at your sizing recommendations and think "hmm, 300MB for Windows and 1.137GB for Linux? Linux is fat!" Although I agree that installing everything means that nothing is left out, I have been able to install Linux on my home machine which has a 420MB drive. I was as accurate as possible in choosing the class of packages that I installed, but I did not choose individual packages. (Times that I have on other machines, the RPM package manager noted which dependent packages it needed and prompted me for permission to install them.)
Thank you for helping to dispel the FUD that surrounds Linux.
/Jonathan M. Prigot
Brigham and Women's Hospital
Editors Note: Jonathan, thanks so much for pointing
out the 300mb/1.137gb discrepancy. The 300mb was for a MINIMAL windows,
while the 1.137g was for MAXIMAL Linux. I was trying to make it easier
on the first-time installer by allowing him or her not to worry about what
packages to install. But to be fair, I've heard of perfectly functional
80mb Linux installations. I haven't had a chance to try FIPS yet, but I'm
looking forward to it.
Khimenko Victor <khim@sch57.msk.ru>
Editors Note: Thanks so much Victor, for pointing out mc (Midnight Commander). It's much more friendly than git, and is better for day to day administration.
You can also set the default runlevel in /etc/inittab to be X.
id:5:initdefault:
However, you're better off just starting X from a user's local .tcshrc
or .bashrc file.
jedi <jedi@penguin.lvcablemodem.com>
Editors Note: Jedi -- thanks for pointing out the value of xf86config. It's absolutely a more configurable video configuration tool, making Xconfigurator look kind of "one button". My advice for the Linux neophyte would be to try Xconfigurator because it's much easier. If the result is less than satisfactory, dig out all the specs for your card and monitor, and have at it with xf86config.
I'd advise readers to run X manually with the startx
command, rather than automatically, because it's too easy to get into a
video mode catch22, and if the user doesn't know about Ctrl-Alt-Backspace
they could conceivably damage their monitor. In a later email Jedi pointed
out that overdriving the monitor is much less likely in Linux than Windows.
Once again I've held Linux to a higher standard than Windows (which runs
only in graphic mode). Nevertheless, I always recommend letting the user
choose to run X, and am so glad Linux gives us the choice.
'Linux Today' had a link to an article in the Motley Fool, which started me thinking along several related lines.
Microsoft's stock has gone up in price for the company's entire history. That makes people buy it, which drives it up further in a positive feedback loop. Microsoft's growth has also been maintained by its enormous profit margins, but it is now coming under increasing price pressure, not only from Linux but also from IBM, which is planning to give away an entry-level DB2 for Linux. Has IBM suddenly noticed MS's achilles heel?
Bad news for MS is good news for everyone else. The DoJ trial has become a PR catastrophe for them. They should have knuckled under and meekly signed a consent decree. This, combined with the problems they are having with Win2k, is causing the public perception of MS to change. Influential publications, like the Motley Fool and the Guardian, are starting to say that MS has lost its long-term potential, and it is time to start divesting from MS.
What if MS's stock price should fall? People would start selling it off, including the Microserfs who have been paid in stock options all these years. They would want to realize their gains before it fell further. The price of the stock would drop precipitously. Since the wages at MS are below industry average, all their top talent would leave, and their deadlines would slip back even further.
Win2k doesn't have plug'n'play, and I'm betting it won't have PnP by 3Q2000. If they start pushing it as an OS for the desktop, they will be able only to sell it on named systems from Hewlett-Packard, Compaq, Sony et al. -- machines with outrageous price tags and a fixed hardware configuration. But the industry is now going towards cheap, custom-built systems. This will make Linux more attractive. As a result of falling computer prices and the DoJ case, it will be much harder for MS to force computer makers to pre-install Win2k.
Perhaps Ed Muth knows that Win2k won't have PnP in time. In a debate with Bob Young, he stressed the "integration" of MS apps. This is Microspeak for bundling, and more "integration" means less flexibility in configuration (both H/W and S/W) and fewer choices for consumers.
Le roi est mort! Vive le roi!
Gareth Barnard <wander@nortexinfo.net>
Editors Note: When, in the May 1998 issue of Troubleshooting Professional, I speculated that Linux could overthrow Microsoft, it sounded like the rantings of a madman. Now mainstream media is speculating the same thing, and Microsoft themselves pointed to Linux as proof that they have competition. Gareth makes a unique contribution by pointing out Microsoft's financial Achilles heel. He succinctly points out that Microsoft is built on a financial bubble that has so far benefitted from positive feedback (catch 22, etc), but could fall just as fast when that positive feedback turns around. Thanks Gareth!
By submitting content, you give Troubleshooters.Com the non-exclusive, perpetual right to publish it on Troubleshooters.Com or any A3B3 website. Other than that, you retain the copyright and sole right to sell or give it away elsewhere. Troubleshooters.Com will acknowledge you as the author and, if you request, will display your copyright notice and/or a "reprinted by permission of author" notice. Obviously, you must be the copyright holder and must be legally able to grant us this perpetual right. We do not currently pay for articles.
Troubleshooters.Com reserves the right to edit any submission for clarity or brevity. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.
Submissions should be emailed to
@troubleshooters.com, with subject line Article Submission. The first
paragraph of your message should read as follows (unless other arrangements
are previously made in writing):