September | 2010 | Global Spin

Yesterday I did something dumb, and I only realized it today because I don’t trust an easy success. Let’s see if you can spot the flaw in my reasoning:

Background:

A process (X) is run on a series of items in a queue.
Items are added to the queue continuously, about 500 per hour.
A processor (Z) is started once an hour. It performs X on all the items in the queue, then quits once the queue is empty.
If there are any errors, the processor emails them to me after it quits.

The problem:

I noticed 100 random failures in process X each hour.
I hypothesized that X is failing due to intermittent system unavailability.
I checked the hypothesis by looking for clusters of X failures at times of high load. (There were some.)
I “fixed” it by pausing the Z processor for 60 seconds whenever there’s a failure (to let system resources recover).
12 hours after the fix, I got no failure emails and declared victory.
Not so fast: Not only did I not fix the problem, I caused something worse.
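For anyone who wants to trace the logic rather than guess, here is a rough sketch of one hourly run of the processor with the pause added. It is only an illustration of the setup described above, not the real code: `run_processor`, `process_x`, `send_email`, and the injectable `sleep` parameter are all hypothetical names, and the queue is assumed to be a simple in-memory `deque`.

```python
import time
from collections import deque


def run_processor(queue, process_x, send_email, pause_seconds=60, sleep=time.sleep):
    """One hourly run of processor Z, including the 60-second pause "fix".

    All names here are hypothetical stand-ins for the system in the post.
    """
    errors = []
    while queue:                      # Z quits only once the queue is empty
        item = queue.popleft()
        try:
            process_x(item)           # perform X on the item
        except Exception as exc:
            errors.append((item, exc))
            sleep(pause_seconds)      # the "fix": let system resources recover
    if errors:
        send_email(errors)            # errors are emailed only after Z quits
    return errors
```

Passing a stand-in for `sleep` (for example, a list's `append` method) makes it possible to step through a run without actually waiting, and to see how much pausing a given number of failures adds up to.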

Can you figure out what I did wrong?

Global Spin

a glimpse into the tiny mind of Chris Radcliff


assumptions and their ills