assumptions and their ills

Yesterday I did something dumb, and I only realized it today because I don’t trust an easy success. Let’s see if you can spot the flaw in my reasoning:

Background:

  • A process (X) is run on a series of items in a queue.
  • Items are added to the queue continuously, about 500 per hour.
  • A processor (Z) is started once an hour. It performs X on all the items in the queue, then quits once the queue is empty.
  • If there are any errors, the processor emails them to me after it quits.
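
In case the shape of Z matters to your guess, here's a minimal sketch of how a processor like that might look. The names (`run_z`, `process_x`, `email_errors`) are mine, not from the real system:

```python
def run_z(work_queue, process_x, email_errors):
    """Hypothetical processor Z: drain the queue, then report errors."""
    errors = []
    while work_queue:                 # quits only once the queue is empty
        item = work_queue.pop(0)
        try:
            process_x(item)
        except Exception as err:
            errors.append(err)        # errors accumulate silently...
    if errors:
        email_errors(errors)          # ...and are emailed only after Z quits
```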

The problem:

  1. I noticed 100 random failures in process X each hour.
  2. I hypothesized that X is failing due to intermittent system unavailability.
  3. I checked the hypothesis by looking for clusters of X failures at times of high load. (There were.)
  4. I “fixed” it by pausing the Z processor for 60 seconds whenever there’s a failure (to let system resources recover).
  5. 12 hours after the fix, I got no failure emails and declared victory.
  6. Not so fast: Not only did I not fix the problem, I caused something worse.

Can you figure out what I did wrong?

Hints and solution, Invisiclue™ style (select to view):

  • The fix was a single line of code: "sleep 60" whenever process X failed.
  • The failures weren’t caused by system load; the errors clustered because high load = more items going through the queue. Waiting fixed nothing, so there were still 100 failures per hour.
  • Process X normally takes a fraction of a second.
  • Processor Z starts once an hour, but that doesn’t mean it stops once an hour.
  • Errors are only reported when Z stops.
  • The pause makes process X take a minute longer, about 100 times an hour. There are only 60 minutes in an hour.
  • Since new items are added to the queue constantly, the queue never gets empty and Z never stops. Thus, no errors.
  • There are no other hints.
  • Really.
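
Okay, one bonus hint in code form: a toy simulation of Z after the "fix", in simulated seconds. The specific numbers are assumptions for illustration (items arrive uniformly at 500/hour, X itself is instantaneous, every 5th item fails, giving the 100 failures/hour from the story):

```python
ARRIVAL_INTERVAL = 3600 / 500   # one new item every 7.2 simulated seconds
PAUSE = 60                      # the sleep added after each failure
FAIL_EVERY = 5                  # every 5th item fails = 100 failures/hour

clock = 0.0
queue_len = 500                 # assume Z starts with an hour's backlog
arrived = 500
processed = 0

while queue_len > 0 and clock < 12 * 3600:   # stop after 12 simulated hours
    processed += 1
    queue_len -= 1
    if processed % FAIL_EVERY == 0:
        clock += PAUSE          # the pause is where all the time goes;
                                # X itself takes a fraction of a second
    # count the items that arrived while we were sleeping
    total_arrived = 500 + int(clock / ARRIVAL_INTERVAL)
    queue_len += total_arrived - arrived
    arrived = total_arrived

print(queue_len)   # → 2900: 100 failures/hour * 60 s = 6000 s of sleeping
                   # per 3600 s hour, so the queue only grows. Z never
                   # drains it, never quits, and never sends the email.
```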