Hard disk MTBF and failure rates in plain English

I was doing some research to buy a new hard disk to store my digital life on. The hard disk industry has been going through interesting times over the last few years. Disk capacities have mostly plateaued, rotational speeds have slowed down, and for the first time we see bigger drives being slower than some smaller ones. The Backblaze hard drive stats have revealed some big differences in reliability between manufacturers and models.

The good news is that since the Backblaze statistics started getting attention, hard drive manufacturers seem to have taken action and reliability appears to be improving. At the time of writing, HGST seems to be the reliability king and the situation between Western Digital and Seagate seems to have flipped. Seagate, which was producing some atrociously unreliable drives in 2014 and 2015, appears to have improved.

This shift towards improving reliability, in some cases at the expense of raw performance, is far from being a bad thing. SSDs have covered most high performance use cases and I find that reliability is among the most important aspects of a hard drive.

Mean Time Between Failures (MTBF)

In most hard drive specification sheets, the calculated reliability is quoted as MTBF hours. Typical values are in the range of 500,000 to 1,500,000 hours. For instance, for the HGST HMS5C4040BLE640 the MTBF specification is 800K hours, which is roughly 91 years of continuous operation (800,000 / 8,760 hours per year). Unsurprisingly, this does not imply that the hard drive you are buying will last for a century. MTBF values represent the estimated failure rate of the component based on lots of assumptions and, most importantly, describe only the constant failure rate part of the bathtub curve.

[Figure: Bathtub curve]

In order to convert the MTBF to a failure rate we apply the formula

FailureRate = 1/MTBF

For the HGST HMS5C4040BLE640, with the MTBF expressed in years, this gives us an annualized failure rate of

FailureRate = 1/MTBF = 8,760 / 800,000 ≈ 1.1%
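
As a quick sanity check, the conversion from MTBF hours to an annual failure rate can be sketched in a few lines of Python. The function name and the 8,760 hours-per-year constant (365 days, drive powered on around the clock) are my own assumptions for illustration, not something taken from a spec sheet.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumes 24/7 operation and ignores leap years


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Approximate the annual failure rate as 1 / (MTBF expressed in years)."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    afr = annualized_failure_rate(800_000)  # the 800K hours quoted for the HGST drive
    print(f"Annualized failure rate: {afr:.2%}")  # about 1.1%
```

Changing the hours-per-year assumption (for example, a drive that is only powered on 8 hours a day) changes the result proportionally, which is one reason quoted MTBF figures are hard to compare.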

What surprised me is that this value actually makes sense! It is in the same ballpark as the 0.45% annualized failure rate observed in the Backblaze study. So, if your drive does not die within the first few weeks, it will survive the n = 3 year warranty period with a probability of

ProbabilityOfSurvival(n) = (1 - FailureRate)^n

Applying the formula gives us about 96.8% (based on the theoretical failure rate of roughly 1.1%) or 98.7% (based on the Backblaze failure rate) for the drive to survive for 3 years.
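
The survival calculation is easy to reproduce. The sketch below implements the formula above as written. The second function is my own addition: it is the standard exponential model for a constant failure rate, included only as a cross-check. All names are hypothetical, and the rates are the ones discussed in this post.

```python
import math


def survival_probability(annual_failure_rate: float, years: float) -> float:
    """Chance of surviving `years`, treating each year as an independent trial."""
    return (1.0 - annual_failure_rate) ** years


def survival_probability_exponential(mtbf_years: float, years: float) -> float:
    """Cross-check (my addition): continuous constant-failure-rate model."""
    return math.exp(-years / mtbf_years)


if __name__ == "__main__":
    warranty_years = 3
    rates = [("spec sheet", 0.0107), ("Backblaze", 0.0045)]
    for label, afr in rates:
        p = survival_probability(afr, warranty_years)
        print(f"{label}: {p:.1%} chance of surviving {warranty_years} years")
```

For failure rates this small, the two models agree to within a fraction of a percentage point, so the simpler formula is good enough for a back-of-the-envelope estimate.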

All of the above is mostly a theoretical exercise, though. It assumes a constant failure rate, ignores the rest of the "bathtub curve", and glosses over the fact that each manufacturer uses its own methodology to produce an estimated MTBF. Quoted MTBFs do not seem to correlate closely with the actual reliability of the drives. If you are interested in a more structured analysis of reliability metrics and graphical alternatives, check out Statistical Analysis of Field Data for Repairable Systems by David Trindade and Swami Nathan.

Finally, I will refer you to this excellent APC white paper, where the author makes a fairly amusing calculation of the MTBF of a human.

EDIT:
Seagate has a very informative article that explains how MTBF is derived, and why they have stopped using it in favor of Annualized Failure Rate (AFR).
