Troubleshooters.Com Presents

Troubleshooting Professional Magazine

Volume 2, Issue 11, December 1998
Intermittents

Copyright (C) 1998 by Steve Litt. All rights reserved. Materials from guest authors copyrighted by them and licensed for perpetual use to Troubleshooting Professional Magazine. All rights reserved to the copyright holder, except for items specifically marked otherwise (certain free software source code, GNU/GPL, etc.). All material herein provided "As-Is". User assumes all risk and responsibility for any outcome.


Contents

Editor's Desk
Letters to the Editor
Vote for Your Favorite Troubleshooting Professional
Definition of an Intermittent
Why Intermittents are So Hard to Troubleshoot
Sparse Intermittents and Events
Safety and Intermittents
Weapons in the War on Intermittents
An Interesting Source of Computer Intermittence
Linux Log
How to Submit an Article
URLs Mentioned in this Issue

Editor's Desk

By Steve Litt

Isn't it odd? You spend several years learning Troubleshooting Process, which guarantees solution of any reproducible (non-intermittent) problem on a well-defined system. Once word is out that you're an expert, they all want to hire you -- to fix intermittents. We live in a strange world.

Or maybe not so strange. Even though Troubleshooting Process doesn't guarantee solution of an intermittent, it sure helps. Since intermittents are an order of magnitude more difficult than reproducibles, ordinary personnel are given reproducibles while Troubleshooting Process experts get the intermittents. I guess the bottom line is we all need to become experts at intermittents. That's what this issue of Troubleshooting Professional is all about.

This issue sports our first-ever regular column, Linux Log. Every month, Linux Log will contain a Linux article fitting the theme of the issue.

And be sure to see the letters to the editor. The November issue, whose theme was Linux, broke all records for ANY Troubleshooters.Com web page. It appears to have been visited over 3000 times on 11/19/1998. It also brought truckloads of mail, much of it (deserved) flames concerning my underestimation of non-Intel CPUs, and my holding Linux to a higher standard than Windows without making that fact clear. I find this month's letters to the editor intelligent and informative.

So kick back and relax as you read this issue. And remember, if you're a Troubleshooter, this is your magazine. Enjoy!

Steve Litt can be reached at Steve Litt's email address.

Vote for Your Favorite Troubleshooting Professional

By Steve Litt

Please vote for your favorite Troubleshooting Professional issue (by year and month) and/or article (by year, month and title). Submit your entry to Steve Litt's email address. The results will be announced in our second anniversary issue, which comes out in January.


Definition of an Intermittent

By Steve Litt

An intermittent is a problem for which there is no known procedure to consistently reproduce its symptom.

Note these words:

KNOWN Most problems can be reproduced. After all, if they couldn't be reproduced, they wouldn't happen by chance. The issue isn't whether the problem can be reproduced, it's whether the Troubleshooter can reproduce it at will, in order to perform authoritative tests. And to do so, he or she must be aware of the sequence of actions necessary and sufficient to reproduce the fault. The point here is that intermittence, or non-reproducibility, is dependent on the information available to the Troubleshooter, rather than the condition of the system.
Here's a true story illustration: My continuously running software hung once or twice a day. I finally managed to see it happen, captured the exact sequence of input files causing it, and preserved that sequence permanently by marking the files read-only. From then on it happened every time. The reproduction procedure was to assemble that exact combination of input files in that order.

PROCEDURE Reproduction often depends on a complex combination of factors, and often on the order in which those factors appear. Thus reproduction requires a procedure to invoke the factors in proper sequence.

CONSISTENTLY I've heard many people define a problem as "reproducible", and go on to explain that it will happen within an hour of turning the machine on. I'd call that an intermittent, because the machine is reproducing the problem, not the Troubleshooter. Unless it consistently happens in 57 minutes and 32 seconds, the Troubleshooter can't reproduce it. And, as we'll discuss in the next article, if the Troubleshooter can't make it happen, troubleshooting becomes much harder.


Why Intermittents are So Hard to Troubleshoot

By Steve Litt

Let's start by discussing why reproducibles are mathematically guaranteed soluble on a well-defined system. The Troubleshooter can reproduce the symptom at will. If the Troubleshooter performs a test that stops the known procedure from reproducing the symptom, he or she has clearly ruled out part of the search area. After a number of such tests, the Troubleshooter will have narrowed the cause to a single component, which can then be repaired or replaced. Below is a drawing of the process as it applies to a reproducible. Assuming the (never achieved) ideal of exact binary search, twenty tests could find a single component in a system of 1,048,576 components.
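The twenty-test figure comes from halving the search area on every test, since 2 to the 20th power is 1,048,576. Here is a minimal Python sketch of that idealized narrowing. It assumes exactly one faulty component and a test that always reports correctly which half holds the fault; the names `isolate_fault` and `symptom_present`, and the fault position, are made up for illustration.

```python
def isolate_fault(n_components, symptom_present):
    """Binary-search for the single faulty component.

    symptom_present(lo, hi) is a hypothetical test that returns True when
    the fault lies among components lo .. hi-1.
    Returns (index of the faulty component, number of tests performed).
    """
    lo, hi = 0, n_components
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if symptom_present(lo, mid):
            hi = mid          # fault is in the lower half
        else:
            lo = mid          # lower half ruled out for good
    return lo, tests


FAULTY = 777_777              # illustrative fault location


def symptom_present(lo, hi):
    return lo <= FAULTY < hi


component, tests = isolate_fault(1_048_576, symptom_present)
print(component, tests)
```

Every test here permanently rules out half of what remains. That only works because the symptom is reproducible, so a negative result can be trusted.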

Now let's look at an intermittent. For a given test, there's no way to know whether the symptom vanished because of the test, or because the intermittent just picked that moment to change state. Inability to conclusively rule out sections necessitates re-visits to those sections and circular Troubleshooting. Time to solution can go up by factors of 10, 100, or occasionally more. I've seen intermittents take months to solve.
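One way to put a number on that uncertainty is to ask how many symptom-free trials in a row are needed before "it just picked that moment to behave" becomes an unlikely explanation. The Python sketch below rests on a strong assumption that real intermittents rarely honor: every trial is independent and shows the symptom with the same probability while the fault is present. The function name and the 5% cutoff are illustrative choices.

```python
import math


def clean_trials_needed(p_symptom, max_luck=0.05):
    """Consecutive symptom-free trials needed before the chance that the
    symptom merely stayed away by luck drops to max_luck or less.

    Assumes independent trials, each showing the symptom with probability
    p_symptom while the fault is still present (an idealization).
    """
    if not 0.0 < p_symptom < 1.0:
        raise ValueError("p_symptom must be strictly between 0 and 1")
    if not 0.0 < max_luck < 1.0:
        raise ValueError("max_luck must be strictly between 0 and 1")
    # With the fault present, n clean trials in a row happen with
    # probability (1 - p_symptom) ** n. Solve that for n.
    return math.ceil(math.log(max_luck) / math.log(1.0 - p_symptom))


for p in (0.5, 0.05, 0.001):
    print(p, clean_trials_needed(p))
```

The rarer the symptom, the more clean trials each single test costs. That's one way to see why time to solution balloons.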


Sparse Intermittents and Events

By Steve Litt

The rarer the occurrence of the intermittent, the harder it is to fix. Intermittents so rare as to be economically unfeasible to reproduce are called sparse intermittents. And the sparsest of all is the one that happens once. That's called an event. Sometimes intermittents that happen more than once, but not on anything resembling a regular basis, are also called events. Except in cases where injury, death, or extreme economic hardship are involved, sparse intermittents are best ignored.

In safety related cases, solution is necessary but costly. The Universal Troubleshooting Process is helpless against sparse intermittents. They're better handled by other Troubleshooting Processes designed specifically for them, including one called Root Cause Analysis. We'll discuss that later.


Safety and Intermittents

By Steve Litt

For safety critical systems, intermittents present a special problem. Your brakes fail at 75 mph, but you manage to steer and engine brake your way to safety. The brake mechanic says he can't reproduce the problem. What now?

There's a part of us all that says "it was just a glitch". But if they fail again it might be on a mountain road with a 3000 foot drop.

The mechanic might replace the calipers as a likely suspect, and tell you it's probably OK. Customer testing is usually fine (if the customer is aware that this is being done), but in this case it could kill the customer.

The mechanic might replace everything in the braking system. But what if the root cause was outside the braking system? As cars get more computerized, this becomes more likely.

You might get a new car. While that would certainly eliminate the root cause, it's costly.

These are some of the special concerns when intermittents meet safety critical systems. Test and narrow just don't cut it. If the symptom happens again, lives can be lost. General maintenance won't cut it. The worst outcome would be to have the symptom vanish, only to wonder if it will happen again, and what injuries will result.

The right solution is to find the root cause, prove logically that it would have caused the EXACT symptom or syndrome, prove that it was likely to happen, find out exactly what caused that root cause to happen (often it's a problem with people, documentation, procedures, safety checks, etc.), and prevent future occurrence. And make no mistake: this costs big money.

Ultimately, every intermittent in a safety critical system becomes a tradeoff between safety and money. Take the car's brakes. Few of us would buy a new car or assign a committee to do a Root Cause Analysis. What we'd probably do is take the mechanic's advice about what to replace, then spend the next week testing the daylights out of the brakes on long, straight deserted roads. After that, we'd cross our fingers.

How differently we'd react if the system were part of our nuclear defense system. A single failure could wipe out the human race. Here we'd get every expert, regardless of cost. We'd do a Root Cause Analysis, walk through everything, review every possibility. We wouldn't hesitate to spend a billion dollars.

Regarding safety, sparse intermittents are the worst. If brakes fail one time in 20, it's easy to test in a safe manner, reproduce the problem, and solve it. No big deal. Even in the nuclear defense system, we could disconnect all missiles (I'm hypothesizing here, I have no idea how this stuff works), and get more info every 20 tries or so.

But if the brakes fail once a year, there's no practical way to reproduce it. We have the choice of bearing the cost of working backwards theoretically to try to isolate it, or doing our best and crossing our fingers.

It may be unpleasant to talk about, but intermittents in safety critical systems are always a tradeoff.


Weapons in the War on Intermittents

By Steve Litt

The Troubleshooter has several weapons in the war on intermittents.

General maintenance
Preventative maintenance
Turn the intermittent against itself
Convert the intermittent into a reproducible
Statistical analysis
Root Cause Analysis
Ignore it

General Maintenance

General Maintenance is step 5 in the Universal Troubleshooting Process, and was discussed extensively in the February, 1998 issue of Troubleshooting Professional Magazine. It involves steps like cleaning, adjusting, measuring to specification, observation, re-seating connections, improving cooling systems, etc. A classic example is cleaning the battery terminals on any car exhibiting starting problems. It may or may not fix the problem, but it's easy, every car should have clean battery terminals, and if it is the terminals, narrowing to the root cause would have taken much more time than doing the maintenance. General Maintenance is a gamble that over time it will save more time than it consumes.

We always face the question "what maintenance is appropriate for General Maintenance?" That's discussed thoroughly in the February 1998 issue of Troubleshooting Professional, and also in Troubleshooters.Com. In intermittent situations, we include a much broader range of activities under the General Maintenance umbrella, because the alternatives are so much more difficult. As an example, in a computer with a reproducible problem, we'd almost never start by reseating every card and replacing every cable. We'd do that for a computer intermittent. We're gambling that the time and money we spend will save us huge expenses troubleshooting that intermittent.

As mentioned in the article on Safety and Intermittents, General Maintenance is unsuitable for intermittents in safety related systems. There's nothing worse than having the problem disappear, and not knowing whether it will ever occur again, possibly resulting in injury or loss of life.

Preventative Maintenance

Preventative Maintenance (PM) is the single best weapon in the war on intermittents. It's quite compatible with safety related systems, because it's done before there's a symptom to erase. In such systems it saves a fortune in downtime, troubleshooting, and just maybe, loss of life.

PM is best done in situations where the organization recognizes the need, and is willing to spend the time and money. It works poorly where management believes "if it ain't broke, don't fix it". That type of management is optimized to put out fires, and that's what they spend most of their time doing. And of course, if their product is safety critical, the firefighters will soon be litigated out of business.

Turn the intermittent against itself

I once worked with an electronic tech who, well, let's just say he'll never compete with Einstein. The low frequency model of a transistor was a little more than he could handle. But give this guy an intermittent, and he'd fix it in no time. Simple. He'd wiggle everything, use a heat gun and freon to toggle the temperature, and wait for it to happen. He'd slowly narrow the wiggling and heating until he could toggle the symptom by manipulating a single component, then replace it. He needed absolutely no electronic theory. Like an Aikido master, he turned his opponent's strength against him.

Manipulation, coupled with astute observation, is often the quickest way to solve an intermittent.

Convert the intermittent into a reproducible

Let me repeat the true story I told earlier. I had a computer program that continuously scanned bulletin boards, retrieved information, and processed it. This was vital to the business of several real estate operations. Trouble was, it hung once or twice a day, and it was creating a huge problem. Several employees and I went to the site frequently to try to "catch" it, but we usually couldn't see it screw up. Even when we did, all we could do was to try to guess what happened from hindsight.

Finally, after seeing a crash, I shut down the system. I then used Netware's salvage command to bring back the input files processed just before the crash. I marked them read-only so they couldn't be renamed or deleted. I then fired up the system, which instantly crashed. Bringing those files to a local PC, I was able to make it happen every time. There was just something about that sequence of files that made it happen.

Now that it happened every time, it was just a matter of watching in the debugger until I found that I was passing a local string as a subroutine return. Of course, as soon as the subroutine ended the string variable went out of scope, and its memory was free for other code to overwrite... The key to solving this sparse intermittent was converting it from an intermittent to a reproducible by finding a series of files which would consistently reproduce the problem.
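The capture step in that story generalizes: freeze the inputs seen just before a failure, in order, so they can be replayed at will. The Python sketch below shows the idea under stated assumptions. It is not the original program. `preserve_inputs`, `replay`, and the `process_file` callback are hypothetical names, and the read-only marking uses ordinary file permissions rather than Netware's facilities.

```python
import os
import shutil
import stat
from pathlib import Path


def preserve_inputs(input_files, capture_dir):
    """Copy suspect input files, in order, into capture_dir and mark the
    copies read-only so later runs cannot rename, alter, or consume them."""
    capture_dir = Path(capture_dir)
    capture_dir.mkdir(parents=True, exist_ok=True)
    preserved = []
    for position, src in enumerate(input_files):
        # A zero-padded sequence prefix keeps the original order recoverable.
        dst = capture_dir / f"{position:04d}_{Path(src).name}"
        shutil.copy2(src, dst)
        os.chmod(dst, stat.S_IREAD)   # owner read-only
        preserved.append(dst)
    return preserved


def replay(preserved, process_file):
    """Feed the preserved files to process_file (a hypothetical stand-in
    for the program under test) in their captured order."""
    for path in sorted(preserved):
        process_file(path)
```

If replaying the preserved set triggers the symptom every time, the intermittent has become a reproducible, and ordinary narrowing can begin.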

Statistical analysis

We're finally getting to the point where this is technologically feasible. The theory is simple. Hook a computer's outputs to the inputs and adjustments of the machine, hook the computer's inputs to test points on the machine, and run a program that exercises every combination of machine inputs and adjustments, in various orders of occurrence. The computer records occurrences of the intermittent symptom, and notes the combination of machine inputs and adjustments at that time, recording everything in a multidimensional database. A statistical package looks for any input, adjustment, or sequence thereof associated with symptom occurrence or cessation three standard deviations (or whatever) out, and bang, you've got a probable reproduction procedure. Once it's reproducible it's attacked like any other problem, using the 10 step Universal Troubleshooting Process. Or, the part falling to the edge of the bell curve can be replaced, and another test suite run. If the problem stops occurring, it's a statistically significant safe bet that the problem's solved. Statistical analysis of intermittents will end their reign of terror.
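As a rough illustration of the "three standard deviations out" idea, the Python sketch below takes a log of test runs, each recording the machine settings and whether the symptom appeared. It flags any setting whose symptom count is far above what the overall symptom rate would predict. This is a crude screen that looks at one factor at a time, not a full statistical package. The function name, the record layout, and the made-up data are all assumptions.

```python
import math
from collections import defaultdict


def suspicious_settings(runs, threshold=3.0):
    """runs: list of (settings_dict, symptom_seen) pairs.

    Returns [((factor, value), z), ...] for every setting whose symptom
    count sits at least `threshold` standard deviations above the count
    expected from the overall symptom rate (binomial approximation).
    """
    if not runs:
        return []
    base_rate = sum(1 for _, seen in runs if seen) / len(runs)
    if base_rate in (0.0, 1.0):
        return []                           # nothing to compare against
    counts = defaultdict(lambda: [0, 0])    # (factor, value) -> [runs, hits]
    for settings, seen in runs:
        for factor, value in settings.items():
            entry = counts[(factor, value)]
            entry[0] += 1
            entry[1] += int(seen)
    flagged = []
    for key, (n, hits) in counts.items():
        expected = n * base_rate
        sigma = math.sqrt(n * base_rate * (1.0 - base_rate))
        z = (hits - expected) / sigma
        if z >= threshold:
            flagged.append((key, round(z, 2)))
    return sorted(flagged, key=lambda item: -item[1])


# Made-up log: the symptom clusters when the machine is both hot and loaded.
runs = []
for temp in ("cool", "hot"):
    for load in ("low", "high"):
        failures = 30 if (temp, load) == ("hot", "high") else 1
        for i in range(100):
            runs.append(({"temp": temp, "load": load}, i < failures))

for setting, z in suspicious_settings(runs):
    print(setting, z)
```

A flagged setting is only a candidate. The next step is still to try it as a reproduction procedure and see whether the symptom follows.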

We're not quite there yet. The problem is that the same technology that allows our computers to do the troubleshooting has been incorporated in modern machines, making them orders of magnitude more complex, and therefore out of the reach of statistical analysis by an affordable computer. But we're getting closer, and some day we'll have it.

Statistical analysis is actually used today, although in non-rigorous forms. When the Troubleshooter wiggles, cleans, reseats, moves, tweaks, heats and cools, he or she observes the results and informally analyzes the statistical significance of the results. When the Troubleshooter finds something way off the bell curve, he or she narrows the scope of the problem.

And more rigorous statistical analyses are often practiced. We often use strip-chart recorders to record when symptoms happen, then compare the events to written logs of environmental factors. With today's technology, it would be quite easy to use a computer instead of a strip chart recorder, to record not only the events, but the environmental factors. And while it's at it, it could do a statistical analysis, and sound an alarm when it finds a probable reproduction sequence. The differences between this and the computer analysis described at the beginning of this section are:

The computer measures, but does not influence
The technician is responsible for defining what constitutes "occurrence". For instance, the technician finds a test point whose voltage changes when the problem occurs, and defines the normal and abnormal voltage ranges.
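A measure-only monitor of this kind can be sketched in a few lines of Python. The technician supplies the normal voltage range. The program classifies each sample and compares an environmental reading during occurrences against the same reading during normal operation. The `Sample` fields, the voltage limits, and the readings below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float         # seconds since monitoring began
    test_point_volts: float  # voltage at the technician's chosen test point
    ambient_temp_c: float    # one example of a logged environmental factor


def split_by_occurrence(samples, normal_low, normal_high):
    """Separate samples into (abnormal, normal) using the technician's
    definition of the normal voltage range."""
    abnormal, normal = [], []
    for s in samples:
        in_range = normal_low <= s.test_point_volts <= normal_high
        (normal if in_range else abnormal).append(s)
    return abnormal, normal


def mean_temp(samples):
    return sum(s.ambient_temp_c for s in samples) / len(samples)


log = [
    Sample(0, 5.0, 22), Sample(60, 5.1, 23), Sample(120, 3.9, 41),
    Sample(180, 4.9, 24), Sample(240, 6.3, 39),
]
abnormal, normal = split_by_occurrence(log, normal_low=4.75, normal_high=5.25)
print([s.timestamp for s in abnormal], mean_temp(abnormal), mean_temp(normal))
```

In this made-up log the abnormal samples coincide with much higher ambient temperature. That's the kind of lead a strip chart plus a written log would also give, only faster.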

Speaking of statistical analysis we do today, look at software test suites. Properly done, lots of people sit down at terminals and exercise the application according to script, recording each action along with the time and description of any anomalies found. When all is said and done, anomaly occurrences are correlated to activities, an analysis is done, and complex reproduction sequences often reveal themselves. Note that today software can run test suites, rather than using live people.

Root Cause Analysis

Root Cause Analysis is actually an entirely separate Troubleshooting Process, optimized for determining, after the fact, the root cause of sparse intermittents. It's especially useful in analyzing an intermittent so sparse it's an event, or a small number of events. A further benefit of Root Cause Analysis is that it can be used to solve problems in multi-human systems like companies, departments, etc.

It's not presented as a rigorous series of steps. Instead, the "steps" are presented as a series of tools with a recommended order of performance. The tools include defining the problem and collecting data, task analysis, change analysis, control barrier analysis, event and causal factor charting, interviewing, determining root cause, and developing corrective actions. A detailed description of Root Cause Analysis is beyond the scope of this article, but if you'd like to learn more a good starting point is the book "The Root Cause Analysis Handbook" by Max Ammerman, ISBN 0-527-76326-8. If you understand the Universal Troubleshooting Process, you'll find this book's concepts familiar, and its optimization to determining cause after occurrence of an event ingenious.

Ignore it

What do you do when your Windows 95/98 computer crashes? Most of us simply reboot. There are just too many causes for this problem to troubleshoot it to the root cause, and these crashes usually don't create a safety or data loss problem. The crashes are just an inconvenience we're willing to put up with. In general, the following factors point to ignoring an intermittent:

Rare intermittents without safety implications
Any intermittents without safety implications where finding a solution would be difficult (e.g. Windows).

Ignoring it is never an option when safety is involved. Thus, it's better to replace every component in a car's braking system than ignore a problem that caused a single brake failure incident. Note that the best solution would be to find and fix the probable cause, then test in a way likely to determine whether the problem had gone away.

Summary

Intermittents are never easy to solve, but the weapons described in this article can make life a little easier for the Troubleshooter.


An Interesting Source of Computer Intermittence

By Steve Litt

Michael Verstichelen's excellent CPU Overclocking Information website inspired this month's Troubleshooting Professional. Some of you might have read the article entitled "Litt Takes the Nine Count" in the October, 1997 Troubleshooting Professional Magazine. In that article I described how a series of intermittents installing Windows 95 on a no-name motherboard with a 166mhz Cyrix nearly made me lose it. Specifically, everything ran fine in DOS, and in Win95 on non-compressed disks, but it bombed every time (but at different stages of the install) while compressing the drives. I blamed it on either a cheap motherboard or Cyrix incompatibilities, until I read Mr. Verstichelen's site.

He describes how his overclocked system, which had been running perfectly under DOS/Win31, crashed incessantly under Win95. He said he was terribly distressed (kind of like I was in "Litt Takes the Nine Count"). He slowed the processor, and the problem went away. Michael is overclock-savvy, so he immediately suspected a heat problem (most overclocking problems involve heat). He applied heatsink compound (also called silicone grease), and was able to overclock it again with no Windows 95 problems.

It sounded exactly like my problems with the Cyrix/noname motherboard. I know most integrators don't bother with heatsink compound. And Michael Verstichelen says "AMD and Cyrix have tended so far to push their CPUs a lot more", meaning they run hotter at their rated clock speed. I wish I had that old motherboard/Cyrix to test whether this was the problem, but I returned them for a refund. I'll bet you dollars to donuts a bigger heatsink, with a better fan, attached with heatsink compound, as well as use of several case fans, would have made the situation different.

If you're noticing lots of intermittence in your computer, underclock it by 20% to 40%. See if the problems decrease dramatically. If so, it's either a timing issue or a heating issue. If, at full rated clock speed, either your CPU or heatsink is too hot to touch for 5 seconds, get better cooling and see if that helps. Remember to use heatsink compound to better thermally couple the CPU and heatsink.

Always try to notice whether the intermittence happens the first few minutes after powering up in the morning. If it doesn't, but instead waits a few minutes to start happening, your intermittence is almost certainly caused by an overheated CPU.

And here are a few tips:

Use heatsink compound. Make sure your CPU is as tightly thermally coupled to its heatsink as possible.
Use a big heatsink with lots of mass and area.
Use a ball bearing fan. Make sure it's fast, and has plenty of torque.
Don't overclock. The time you gain from a tiny percentage speed increase, you might lose in crashes or bugs, and certainly you'll lose truckloads of time if you need to troubleshoot a cpu temperature intermittent. Please remember that if you overclock your 300 to a 333, your cpu runs only about 11% faster. But since your bottlenecks on most operations are your disk, your modem line, and maybe your video, you might not notice the difference at all. But you better believe your cpu will run hotter. There are three exceptions to my "don't overclock" rule:

Near the end of your CPU's useful life, you can delay obsolescence a few months by overclocking. And if you burn something out, well, you were going to buy a new one anyway. Just make sure in any overclocking situation you keep your cpu cool.
There are some people who just like to soup things up. They know it's going to be a lot of work and expense, and they will experience failures, but it's worth it. Take the guy who drops a 440 with three deuces in a 68 Dodge Dart. He knows he could get a smoother, more comfortable, more reliable, and almost as fast ride with a Northstar-equipped Cadillac. But those Cadillacs are a dime a dozen -- no bragging rights, and none of his personal mark on the car. Likewise, the guy with a freon-cooled Pentium II 450 running at 600 has bragging rights. He knows he'll have crashes, intermittents, and that he'll pay a fortune to assemble the cooling system. He sees the benefit as a finely tuned machine, not as the utility that machine can give him. Note, however, that in certain cases the same bragging rights, a whole lot more power, and fewer problems can be achieved by building a multi-486 Linux "Supercomputer", as described in the May 1998 issue of Troubleshooting Professional Magazine.
You need bleeding edge technology. Some of us need power not available now, and can't wait for Moore's law to provide it. Money's not a major concern -- we just have to do it. Here's where we get into cryogenically cooled multiple PentiumII450's, etc.

Buy from a reliable vendor. Some vendors make a little extra money by selling a 300mhz, claiming it's a 333, and overclocking the system. Some go as far as to erase markings, or buy processors with the markings already erased. This practice often produces intermittents.
Watch your vendor like a hawk. A few days ago I stripped my Pentium 150 bare, including separating the heatsink and the CPU. On the CPU, stuck right on the intended contact point with the heatsink, was an adhesive paper control tag. Maybe vendors should require their employees to pass an IQ test right along with the drug test :-). When the vendor assembles your PC, watch him like a hawk. Don't assume he's smart.
Buy a name-brand motherboard. A little too much capacitance here, an overly long trace there, an unusual chipset, and the next thing we know we have intermittents. Noname boards cost $80. Brand name boards cost $120-$180, and you can find reviews of them on the 'net, to make sure they're reliable. One hundred dollars is a cheap price to pay to eliminate a major source of intermittence.
Don't overvolt. Some people crank the voltage on the CPU to try to get it to run faster. Since power dissipation is proportional to the square of the voltage, this is bound to overheat, causing intermittence or worse.
Use additional fans in your case. Try to have fans in the front of the case sucking air in, while the power supply fan blows it out.
Never have extra holes in the case. Always "cork up" holes made by cards you took out, or drives you took out. Holes in the case can short-circuit air flow away from your CPU's heatsink.
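To put rough numbers on the "don't overvolt" tip, the Python sketch below scales power with the square of the voltage, as stated above. It also lets power scale linearly with clock frequency, which is the usual first-order rule for CMOS logic but is an added assumption here, not something from the tips. The voltages and clock speeds are illustrative values, not measurements of any particular CPU.

```python
def relative_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """Approximate power after a change, relative to before.

    Power is taken as proportional to voltage squared (from the tip above)
    and, as an added first-order assumption, proportional to clock frequency.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Illustrative values only: a 2.8 V core raised to 3.0 V.
overvolt_only = relative_power(3.0, 2.8)
# The same overvolt combined with a 300 -> 337.5 overclock.
overvolt_and_overclock = relative_power(3.0, 2.8, f_new=337.5, f_old=300.0)

print(f"{overvolt_only:.3f} {overvolt_and_overclock:.3f}")
```

Under these assumptions a voltage bump of about 7% costs roughly 15% more heat by itself, and all of that heat has to leave through the same heatsink.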

Litt Tries an Overclock

In the interest of journalistic completeness I'm writing this article with my Pentium II 300 clocked at 337.5mhz (75 * 4.5 instead of 66 * 4.5). You'll note I sped up the entire bus. I've run the bus at 75 before, but at 75 * 4 = 300, not 75 * 4.5 = 337.5. Before relating my findings (obviously, the CPU hasn't burned out yet), let me tell you a little about my system.
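The arithmetic behind those settings is just bus speed times multiplier. A small Python sketch with illustrative helper names makes it easy to compare candidate settings against the rated 300 before touching any jumpers.

```python
def core_clock_mhz(bus_mhz, multiplier):
    """Core clock is the bus clock times the CPU multiplier."""
    return bus_mhz * multiplier


def percent_over(new_mhz, rated_mhz):
    """How far above the rated speed a setting runs, in percent."""
    return (new_mhz - rated_mhz) / rated_mhz * 100.0


RATED_MHZ = 300.0
for bus, mult in ((75, 4.0), (75, 4.5), (75, 5.0)):
    clock = core_clock_mhz(bus, mult)
    print(bus, mult, clock, percent_over(clock, RATED_MHZ))
```

Note that 75 * 4 leaves the CPU at its rated 300 while still running the bus faster than 66. That explains why that setting speeds up memory and disk access without overclocking the processor itself.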

Built in December, 1997, it's a Pentium II 300 on an Abit LX6 motherboard, with 128meg of 10ns 4 clock SDRAM, 6.4 gig disk, Win98 installed clean and fresh. As fate would have it, my power supply burned out a week ago, so I replaced it. I had my vendor replace it (heck, the labor was free if I paid for the parts), and while he was at it I had him install a faster, higher torque ball bearing heat sink fan (with a new heatsink), and an additional fan sucking air into the front of the case. Although my vendor didn't use heatsink compound, he showed me that the new heatsink had this special kind of metal designed to promote thermal conductivity from the CPU. This metal looked something like mu metal or lead. I guess what I'm saying here is that this machine is equipped better than average to be overclocked.

Before the Overclock

Before the overclock, the center of the heatsink was warm to the firm touch of the non-calloused part of my fingertip, becoming uncomfortable but bearable after 5-8 seconds. I could keep my finger on there indefinitely without real pain. The CPU itself, which is covered with plastic, was not at all uncomfortable to the touch.

After the Overclock

After an hour running in an overclocked condition, the center of the heatsink was somewhat hot to the firm touch of the non-calloused part of my fingertips, becoming painful after about 5 seconds. I had to remove my finger after 10 seconds. The CPU itself, which is covered with plastic, was warm to the touch and slightly uncomfortable after 10 seconds of firm pressure from the non-calloused portion of my finger.

The computer's performance appeared "crisper", with menus operating quicker and screens painting faster. Programs appeared to load quicker, but I didn't stopwatch it, so it could have been an illusion caused by the "crispness". Please keep in mind that because I sped up the bus, memory access and disk access also sped up.

How Far Could I Have Gone?

I'm pretty sure my overclock as described wasn't especially risky. What would have happened if I'd put it up to 75 * 5 = 375mhz? I don't know. Because this is my main business machine, I chose not to risk that even for five minutes. My guess is that the heatsink would have burned my finger within a second or two. I've never been comfortable with electronics running that hot -- even audio power transistors. But that's all a guess, because I was unwilling to experiment further with my main business computer.

I Clocked it Back Down

In fact, I clocked it back down to 75 * 4 = 300. I've previously run it like that for months at a time, and it does just fine, even with a PCI network card, so I have perfect confidence. After a suitable "cooling down" period, the CPU's heatsink went back to its original perceived temperature. I plan to use this computer until a couple months into the new millennium, when presumably the machines being sold will be absolutely positively Y2K compliant. Given that long remaining use as my primary computer, I'm unwilling to risk a CPU overclock. It's nice to know, however, that as 1999 winds down and my machine is slow as molasses, I can clock it up to 338, or maybe even 375 (probably with more fans, bigger heatsink and gobs of heatsink compound), and have a somewhat contemporary machine.

If It's Not a Heat Problem

If, after getting the CPU to run cool, you still have the intermittence, here are some things to look at:

Virus Check

Viruses often produce intermittent-appearing problems. In response to oddball symptoms, run a virus check. Don't have it automatically fix the problem -- after all, you may want to toggle the symptom. I recommend Norton Antivirus, because they give free virus signature updates.

Memory Check

Run a diagnostic program designed to test memory. Ideally, all memory SIMMs or DIMMs should be identical, and bought at the same time, from the same manufacturer, with the same timing. If this is not the case, temporarily remove all SIMMs and DIMMs but the biggest chunk, and see if the intermittents disappear. If they do, find whatever memory solution best fits your pocketbook.

Be sure you're using the right memory. LX and newer boards require four clock SDRAM, not the 2 clock that worked on the TX's. The new 100mhz motherboards require a special kind of SDRAM to keep up with them. Make sure you're using the right ram for your setup.

Disk Check

After suitably backing up, run a disk check program (Windows comes with Scandisk). Make sure you do a write/read surface scan (that's why you want to back up first). Address any problems as appropriate.

Dumb Down the BIOS

Some BIOS settings, especially Caching and Shadowing, can cause problems. Reboot the system, go into BIOS setup, and look for a "safe defaults" setting. If any kind of "safe defaults" setting is available, choose it. After doing that, re-detect all your drives. Shut off ALL caching and shadowing, including:
Disable Internal (L1) cache
Disable External (L2) cache
Disable Caching of system BIOS
Disable caching and shadowing of the video ROM
Disable all caching and shadowing of memory locations C000-FFFF

See if this removes your intermittence. If so, don't worry about performance -- later you can turn most of the caching and shadowing on, 1 by 1, to gain back most of your performance while keeping any offending settings disabled.

Please note that it's possible the real cause is a hot processor, and dumbing down the BIOS appears to fix it but really just masks the CPU overheat problem. Under such a scenario, the true solution is to decrease the CPU's temperature.
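
The one-by-one re-enabling described above can be sketched as a simple loop. This is only an illustration: the option names are hypothetical labels, and is_stable stands in for however long you need to run the machine to be satisfied the symptom is gone.

```python
def reenable_one_by_one(settings, is_stable):
    """Turn settings back on one at a time, keeping only the harmless ones.

    settings:  names of BIOS options that were all disabled (hypothetical labels).
    is_stable: callable taking the set of currently enabled options and
               returning True if the machine ran without the symptom.
    Returns (enabled, offenders).
    """
    enabled, offenders = set(), []
    for name in settings:
        trial = enabled | {name}
        if is_stable(trial):
            enabled = trial         # no symptom: leave it on
        else:
            offenders.append(name)  # symptom came back: keep it disabled
    return enabled, offenders


if __name__ == "__main__":
    options = ["L1 cache", "L2 cache", "system BIOS caching",
               "video ROM shadowing", "C000-FFFF shadowing"]

    # Stand-in for real testing: pretend video ROM shadowing is the culprit.
    def fake_test(on):
        return "video ROM shadowing" not in on

    enabled, offenders = reenable_one_by_one(options, fake_test)
    print("Re-enabled:", sorted(enabled))
    print("Left disabled:", offenders)
```

Because the symptom is intermittent, a "stable" verdict is only as good as the length of the test run. A setting that passes a short test may still be the offender.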

Reseat Connections, Replace Cables

An excellent piece of General Maintenance is to reseat all connections, and to replace any cables that are remotely suspicious. Compared to the time necessary to troubleshoot an intermittent down to its root cause, this General Maintenance is economical in both time and money.

Summary

Computer intermittence can be a real pain, but fortunately there are steps we can take to limit it.

Steve Litt can be reached at Steve Litt's email address.

Linux Log

Linux Log is now a regular column in Troubleshooting Professional Magazine, authored by Steve Litt. Each month we'll explore a facet of Linux as it relates to that month's theme. Today we'll discuss Linux from the point of view of intermittence.

A substantial factor influencing the likelihood of intermittents is the quality of design of the system. A poorly designed, unmodular, or overly complex system boosts the likelihood of intermittents:

Poor design increases the likelihood of component failure.
Poor design increases the likelihood of user error which is interpreted as system failure. If such user error is rare, the "failure" will be seen as intermittent in nature.
Problems in unmodular or overly complex systems require convoluted reproduction sequences, and thus are often perceived as intermittent.

That brings me to what I like so much about Linux. It's better designed than its competitors. Fewer problems, and especially fewer of those horrible, time consuming intermittents. Linux is constructed pretty much like you'd expect an operating system to be, with a kernel, drivers, user shell and a file system. Everything does what you'd expect it to do. The rare problems are easier to section off and easier to reproduce. And because of its superior design, Linux has far fewer failures.

On our desktops hourly crashes and reboots are a hassle, but we live with them. I've heard of nobody who could set up a crashless (let's say for 2 months at a time) Windows system, even NT.

On the other hand, when we're running a network or a busy website, crashes must be assigned a cause and eliminated -- not ignored. There's just too much chance of a crash leaving a database or file update in an illegal state, no matter how much commit, rollback and transaction processing we have. And of course, frequent crashes are seen by online customers as a sign of amateurism.

Linux stays "up" for days and weeks at a time. And when it fails, it's much easier to find the cause. Windows is a nice, convenient operating system for personal use. And indeed, Windows NT offers improved reliability. But when reliability is at a premium, and problems, especially intermittents, are intolerable, Linux is my choice among the low cost operating systems.

Steve Litt can be reached at Steve Litt's email address.

Letters to the Editor

All letters become the property of the publisher (Steve Litt), and may be edited for clarity or brevity. We especially welcome additions, clarifications, corrections or flames from vendors whose products have been reviewed in this magazine. We reserve the right to not publish letters we deem in bad taste (bad language, obscenity, hate, lewd, violence, etc.).

The November issue of Troubleshooting Professional Magazine (Linux issue) produced more letters to the editor than all previous issues combined. It appears I made the following mistakes:

I exaggerated any Cyrix/AMD incompatibilities.
I held Linux to a higher standard than Windows, but failed to mention that fact.
I left out valuable features of Linux.

Based on the success of the letter writers, I'm changing my processor recommendation to this: Evidence is that Cyrix and AMD will run just fine with Linux, assuming the processor is kept cool and the kernel is compiled as generic 386. Therefore these processors may represent the best bang for the buck. On your very first Linux installation, you may wish to eliminate any question of incompatibility by using an Intel processor. On subsequent installations, choose your processor on the basis of price and benefits.

Here's an elaboration of what I expect from Linux: I expect (and have not been disappointed) Linux to work in a logical fashion, and keep working that way day after day. I expect it to go easy on system resources. I forgive Windows for crashing or hanging hourly. After all, crashing is what Windows does best. I hold Linux to a higher standard. Life isn't fair.

Cyrix and AMD Rock!

Sir;

Your statement; "Cyrex/Amd are iffy in Linux" in the current "Troubleshooting Pro" is quite frankly, a load of bull! I have been running Linux for three years on various boxes almost all of them with AMD or Cyrix cpus. My very first Linux box had an AMD 486/66 and ran Redhat 2.0 with no problems. I currently own a K6-300, a K6-233, two AMD 5X86's (one over clocked at 160mhz) a Cyrix PR200+ and a Intel 486/66, all at one time or the other have run Linux with no problems. Most still do. I don't know who gave you this bogus info but he doesn't know his head from a hole in the ground.

You can put a complete distribution of Linux on a dos partition; I have installed the "Monkey" five floppy distribution, (available from SunSITE) on a 80meg drive, along with Win 3.1, ran the SVGA X-server, had full networking, with room to spare, all on the above mentioned Intel 486/66. 8meg of ram is enough for Linux from the shell but I recommend at least 16meg for the X-server.
--
Rick,

Rick writes further:
After further thought, here's what I think is going on; People with a Cyrix or AMD K5-K6 think that they have a "Pentium clone" and build their kernel with Pentium support. WRONG! Only Intel makes a Pentium, The AMD K5-K6 and Cyrix 686 processors while providing Pentium performance are not Pentiums! You must build your kernel with generic 386 support. It's a case of not reading the manual. (The AMD 486 works fine with 486 support built into the kernel.) Most all distributions come with the generic 386 support built into the kernel for this very reason.

Rick. Smith <riter311@gte.net>

Editors Note: Rick, your letter typifies the piles of letters from satisfied AMD/Cyrix Linux users. Thanks for setting the record straight, and especially thanks for pointing out the kernel issue.

Equal Justice for All

Steve,

The only point I have to make here is you mention that Linux is "iffy" on anything but the most common of hardware.... You may be correct in this point, but NT supports *less* hardware than Linux does... Just thought you might want to know that.

Cheers,

- Brendan
Brendan Rankin <brendan.rankin@siliconmetrics.com>

Editors Note: Brendan, you point out so well that I judged Linux much more critically than Windows, and in doing so understated the value of non-Intel processors. Thank you.

I just read your fine issue on Linux. While overall I feel that it was good, I just have two minor quibbles. First, it is not necessary to delete the Windows partition and then reinstall Windows. (I don't know about your experience, but our machines come with Windows and the vendor's proprietary software, uh "enhancements", preinstalled. If you reinstall Windows, then you'll never see those vendor drivers again.) There is on my RH 5.1 disk (I'm still waiting for my 5.2 kit to arrive) a program called fips.exe in the \dosutils directory. FIPS is a "poor person's" PartitionMagic program that allows one to resize partitions without deleting their contents. I have used this program to resize FAT16 partitions to make space for Linux.

I am also worried that people will look at your sizing recommendations and think "hmm, 300MB for Windows and 1.137GB for Linux? Linux is fat!" Although I agree that installing everything means that nothing is left out, I have been able to install Linux on my home machine which has a 420MB drive. I was as accurate as possible in choosing the class of packages that I installed, but I did not choose individual packages. (Times that I have on other machines, the RPM package manager noted which dependent packages it needed and prompted me for permission to install them.)

Thank you for helping to dispel the FUD that surrounds Linux.

/Jonathan M. Prigot
Brigham and Women's Hospital

Editors Note: Jonathan, thanks so much for pointing out the 300mb/1.137gb discrepancy. The 300mb was for a MINIMAL windows, while the 1.137g was for MAXIMAL Linux. I was trying to make it easier on the first-time installer by allowing him or her not to worry about what packages to install. But to be fair, I've heard of perfectly functional 80mb Linux installations. I haven't had a chance to try FIPS yet, but I'm looking forward to it.

Linux Has The Features

Why GIT? Have you ever seen MC (Midnight Commander)? Its internal editor, with Shift-arrows and Shift-Del/Shift-Ins/Ctrl-Ins on the console, is very attractive for a former Windows user (and you could use it via telnet as well, but of course this will not be so easy: F3, F5, F6, etc). Or try the `cd ftp://ftp.cdrom.com/pub` command. Or try pressing Enter on a .tar.gz, .zip or .rpm file. To me the ability to see archives and ftp sites is more attractive than the ability to send ascii mail, for example...

Khimenko Victor <khim@sch57.msk.ru>

Editors Note: Thanks so much Victor, for pointing out mc (Midnight Commander). It's much more friendly than git, and is better for day to day administration.

You forgot about xf86config. That is the native configurator that comes with XFree86. It tends to be more reliable than Xconfigurator but has a less nice frontend. Occasionally a user will need to remove their /etc/X11/XF86Config file and start over with xf86config, reconfiguring X from scratch.

You can also set the default runlevel in /etc/inittab to be X.

id:5:initdefault:

However, you're better off just starting X from a user's local .tcshrc
or .bashrc file.

jedi <jedi@penguin.lvcablemodem.com>

Editors Note: Jedi -- thanks for pointing out the value of xf86config. It's absolutely a more configurable video configuration tool, making Xconfigurator look kind of "one button". My advice for the Linux neophyte would be to try Xconfigurator because it's much easier. If the result is less than satisfactory, dig out all the specs for your card and monitor, and have at it with xf86config.

I'd advise readers to run X manually with the startx command, rather than automatically, because it's too easy to get into a video mode catch22, and if the user doesn't know about Ctrl-Alt-Backspace they could conceivably damage their monitor. In a later email Jedi pointed out that overdriving the monitor is much less likely in Linux than Windows. Once again I've held Linux to a higher standard than Windows (which runs only in graphic mode). Nevertheless, I always recommend letting the user choose to run X, and am so glad Linux gives us the choice.
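
If you're not sure which way a machine is currently set, the initdefault line Jedi mentions tells you. Here is a minimal sketch that pulls the default runlevel out of inittab text. It assumes the usual id:runlevels:action:process layout; the function name and the sample text are made up for illustration.

```python
def default_runlevel(inittab_text):
    """Return the runlevel named on the initdefault line, or None if absent."""
    for line in inittab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # inittab entries have the form id:runlevels:action:process
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == "initdefault":
            return fields[1]
    return None


# Hypothetical sample; on a live system read the text from /etc/inittab.
sample = """\
# Default runlevel
id:5:initdefault:
"""
print(default_runlevel(sample))
```

On Red Hat style systems, runlevel 5 starts X at boot and runlevel 3 gives a text-mode multi-user login from which you can run startx. Other distributions may number their runlevels differently.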

Microsoft's Finances: More or Less Stable than Windows?

Dear sir;

'Linux Today' had a link to an article in the Motley Fool, which started me thinking along several related lines.

Microsoft's stock has gone up in price for the company's entire history. That makes people buy it, which drives it up further in a positive feedback loop. Microsoft's growth has also been maintained by its enormous profit margins, but it is now coming under increasing price pressure, not only from Linux but also from IBM, which is planning to give away an entry-level DB2 for Linux. Has IBM suddenly noticed MS's achilles heel?

Bad news for MS is good news for everyone else. The DoJ trial has become a PR catastrophe for them. They should have knuckled under and meekly signed a consent decree. This, combined with the problems they are having with Win2k, is causing the public perception of MS to change. Influential publications, like the Motley Fool and the Guardian, are starting to say that MS has lost its long-term potential, and it is time to start divesting in MS.

What if MS's stock price should fall? People would start selling it off, including the Microserfs who have been paid in stock options all these years. They would want to realize their gains before it fell further. The price of the stock would drop precipitously. Since the wages at MS are below industry average, all their top talent would leave, and their deadlines would slip back even further.

Win2k doesn't have plug'n'play, and I'm betting it won't have PnP by 3Q2000. If they start pushing it as an OS for the desktop, they will be able only to sell it on named systems from Hewlett-Packard, Compaq, Sony et al. -- machines with outrageous price tags and a fixed hardware configuration. But the industry is now going towards cheap, custom-built systems. This will make Linux more attractive. As a result of falling computer prices and the DoJ case, it will be much harder for MS to force computer makers to pre-install Win2k.

Perhaps Ed Muth knows that Win2k won't have PnP in time. In a debate with Bob Young, he stressed the "integration" of MS apps. This is Microspeak for bundling, and more "integration" means less flexibility in configuration (both H/W and S/W) and fewer choices for consumers.

Le roi est mort! Vive le roi!

Gareth Barnard <wander@nortexinfo.net>

Editors Note: When, in the May 1998 issue of Troubleshooting Professional, I speculated that Linux could overthrow Microsoft, it sounded like the rantings of a madman. Now mainstream media is speculating the same thing, and Microsoft themselves pointed to Linux as proof that they have competition. Gareth makes a unique contribution by pointing out Microsoft's financial Achilles heel. He succinctly points out that Microsoft is built on a financial bubble that has so far benefitted from positive feedback (catch 22, etc), but could fall just as fast when that positive feedback turns around. Thanks Gareth!

Submit letters to the editor to Steve Litt's email address, and be sure the subject reads "Letter to the Editor". We regret that we cannot return your letter, so please make a copy of it for future reference.

How to Submit an Article -- NOTE BETTER POLICY

We anticipate two to five articles per issue, with issues coming out monthly. We look for articles that pertain to the Troubleshooting Process, or articles on tools, equipment or systems with a Troubleshooting slant. This can be done as an essay, with humor, with a case study, or some other literary device. A Troubleshooting poem would be nice. Submissions may mention a specific product, but must be useful without the purchase of that product. Content must greatly overpower advertising. Submissions should be between 250 and 2000 words long.

By submitting content, you give Troubleshooters.Com the non-exclusive, perpetual right to publish it on Troubleshooters.Com or any A3B3 website. Other than that, you retain the copyright and sole right to sell or give it away elsewhere. Troubleshooters.Com will acknowledge you as the author and, if you request, will display your copyright notice and/or a "reprinted by permission of author" notice. Obviously, you must be the copyright holder and must be legally able to grant us this perpetual right. We do not currently pay for articles.

Troubleshooters.Com reserves the right to edit any submission for clarity or brevity. Any published article will include a two sentence description of the author, a hypertext link to his or her email, and a phone number if desired. Upon request, we will include a hypertext link, at the end of the magazine issue, to the author's website, providing that website meets the Troubleshooters.Com criteria for links and that the author's website first links to Troubleshooters.Com. Authors: please understand we can't place hyperlinks inside articles. If we did, only the first article would be read, and we can't place every article first.

Submissions should be emailed to @troubleshooters.com, with subject line Article Submission. The first paragraph of your message should read as follows (unless other arrangements are previously made in writing):

I (your name), am submitting this article for possible publication in Troubleshooters.Com. I understand that by submitting this article I am giving the publisher, Steve Litt, perpetual license to publish this article on Troubleshooters.Com or any other A3B3 website. Other than the preceding sentence, I understand that I retain the copyright and full, complete and exclusive right to sell or give away this article. I acknowledge that Steve Litt reserves the right to edit my submission for clarity or brevity. I certify that I wrote this submission and no part of it is owned by, written by or copyrighted by others.

After that paragraph, write the title, text of the article, and a two sentence description of the author.

URLs Mentioned in this Issue

http://www.cam.org/~agena/overcloc.html: Michael Verstichelen's CPU Overclocking Information page.
http://www.concentric.net/~Maxainc/indexm.htm: Website of Max Ammerman, author of "The Root Cause Analysis Handbook".
http://www.troubleshooters.com/tpromag/9802.htm: The 2/98 Troubleshooting Professional Magazine, Theme: General Maintenance.
http://www.troubleshooters.com/tpromag/9805.htm: The 5/98 Troubleshooting Professional Magazine, Theme: Linux, featuring parallel supercomputers.
http://www.troubleshooters.com/tpromag/9811.htm: The 11/98 Troubleshooting Professional Magazine, Theme: Linux, featuring a soup to nuts tutorial.
http://www.troubleshooters.com: Steve Litt's website.
http://www.troubleshooters.com/sympwiz.htm: Steve Litt's Java Symptom Description Wizard.

KNOWN	Most problems can be reproduced. After all, if they couldn't be reproduced, they wouldn't happen by chance. The issue isn't whether the problem can be reproduced, it's whether the Troubleshooter can reproduce it at will, in order to perform authoritative tests. And to do so, he or she must be aware of the sequence of actions necessary and sufficient to reproduce the fault. The point here is that intermittence, or non-reproducibility, is dependent on the information available to the Troubleshooter, rather than the condition of the system. Here's a true story illustration: My continuously running software hung once or twice a day. I finally managed to see it happen, capture the exact sequence of input files causing it, and preserve that sequence permanently by marking the files read-only. From then on it happened every time. The reproduction procedure was to assemble that exact combination of input files in that order.
PROCEDURE	Reproduction often depends on a complex combination of factors, and often on the order in which those factors appear. Thus reproduction requires a procedure to invoke the factors in proper sequence.
CONSISTENTLY	I've heard many people define a problem as "reproducible", and go on to explain that it will happen within an hour of turning the machine on. I'd call that an intermittent, because the machine is reproducing the problem, not the Troubleshooter. Unless it consistently happens in 57 minutes and 32 seconds, the Troubleshooter can't reproduce it. And, as we'll discuss in the next article, if the Troubleshooter can't make it happen, troubleshooting becomes much harder.
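
The three terms above can be tied together in a short sketch: a known input sequence is bundled into a procedure, and the procedure counts as a reproduction only if it produces the fault on every run. Everything here is hypothetical, including the file names and the fake system standing in for the real one.

```python
def reproduction_rate(procedure, trials=10):
    """Run a reproduction procedure repeatedly; return the fraction of runs that fault.

    procedure: callable returning True when the fault was observed.
    """
    hits = sum(1 for _ in range(trials) if procedure())
    return hits / trials


def classify(rate):
    """Reproducible only if the Troubleshooter gets the fault every single time."""
    return "reproducible" if rate == 1.0 else "intermittent"


def make_procedure(system, input_sequence):
    """Bundle a known input sequence into a repeatable reproduction procedure."""
    def procedure():
        return system(input_sequence)
    return procedure


# Hypothetical stand-in for the real system: it "hangs" only when
# order_17 is processed immediately after refund_03.
def fake_system(files):
    return any(a == "refund_03" and b == "order_17"
               for a, b in zip(files, files[1:]))


known = ["invoice_01", "refund_03", "order_17"]
rate = reproduction_rate(make_procedure(fake_system, known), trials=5)
print(classify(rate))
```

A machine that faults "within an hour" would score somewhere below 1.0 on any fixed procedure, which is why it belongs on the intermittent side of the line.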