The 5 SMART stats that actually predict hard drive failure


Post by FlyingPenguin

http://www.computerworld.com/article/28 ... ilure.html

Interesting read.

However, you have to understand why SMART is almost totally useless - except on data center servers like the ones used in that article.

SMART usually gives you practically no warning on an average workstation, because by the time any of the SMART indicators go into the red, the damage has slowly crept up on the drive and it's usually on its last legs. That's because most workstation PCs don't hammer drives very hard (unless maybe you're using one as a full-time video editing machine). Most of your data is "stale" (unchanging), and we don't work drives very hard on home or even office PCs.

However, a server in a data center that's doing hardcore work - like Backblaze's data center drives - gets hammered hard and continuously, and that's more likely to give you some early warning.

Now the same thing can be accomplished using Spinrite on your own drives. A lot of companies use Spinrite to "certify" new hard drives. I myself run Spinrite on level 4 on all my important spinning drives (server & workstation) every 6 months or so. I also run it on a new drive and any used drive that I want to put on the shelf as a spare.

Running Spinrite level 4 on a drive REALLY hammers it: each byte on the drive is read, inverted, written, read again, then inverted again and written back to restore the original data. That forces the magnetic recording of every byte to be strengthened and re-written wherever the track currently resides (as a drive ages and the mechanics get sloppy, the tracks drift, and old data becomes unreadable because the heads don't quite line up with the old track). The process also forces the drive's internal controller to read and write every single byte, which makes it aware of any problems.
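
To make that read / invert / write / read / invert / write cycle concrete, here is a toy Python sketch that runs it on an in-memory buffer. This is only an illustration of the pattern described above, NOT Spinrite's actual code, and the names `invert` and `refresh_sector` are made up for the example. A real tool works on raw sectors through the drive controller, not on a Python bytearray.

```python
def invert(data: bytes) -> bytes:
    """Flip every bit of every byte."""
    return bytes(b ^ 0xFF for b in data)


def refresh_sector(sector: bytearray) -> None:
    """Toy version of the refresh cycle on one in-memory 'sector'."""
    original = bytes(sector)        # 1. read
    flipped = invert(original)      # 2. invert
    sector[:] = flipped             # 3. write the inverted pattern
    if bytes(sector) != flipped:    # 4. read again and verify
        raise IOError("inverted pattern did not read back cleanly")
    sector[:] = invert(flipped)     # 5. invert again and write, restoring the data
    if bytes(sector) != original:
        raise IOError("original data was not restored")


sector = bytearray(b"stale data that has not been rewritten in years")
before = bytes(sector)
refresh_sector(sector)
print(bytes(sector) == before)
```

The point of writing the inverted pattern first is that every bit gets flipped and then flipped back, so every bit is physically rewritten twice and the data still ends up unchanged.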

Assuming your controller supports SMART (often it needs to be enabled in BIOS, although Spinrite will try to enable it itself if it can) you get a real-time SMART data display in Spinrite while it's running a scan. One of the first things you notice on a modern drive is the horrendous number of ECC corrections that occur during a level 4 scan. Some modern mobos don't properly send SMART data through BIOS to a DOS app like Spinrite (which is why Spinrite 6.1 will bypass BIOS and not run in DOS). That's why I keep an old Intel Core 2 Quad Q6600 PC as my workbench computer, because it fully supports SMART in BIOS for both IDE and SATA (yes, it has IDE ports and SATA ports in it).

Check out this drive, which is a 2-year-old Seagate 3.5" 1TB drive out of a dead desktop USB enclosure (as usual, the enclosure's USB interface failed - I see this a lot). I just finished scanning it today:

Image

I personally suspect that drive manufacturers put drives that don't pass quality control for desktop PCs into these USB enclosures. USB drives generally are only used for archiving, and thus don't get worked as hard as a drive in a PC. I see bad SMART data on USB enclosure drives all the time. That's why I would rather buy a good quality desktop drive and install it in my own enclosure.

This drive is 90% of the way through the level 4 scan (been running 18 hours) and it has over 125 million ECC corrections. That's not over its lifetime, that's just in the last 18 hours that Spinrite has been running! That means that while Spinrite is doing its thing, the drive controller has fixed 125+ million errors in the process of trying to read and write all those sectors Spinrite is accessing aggressively.

What's scary is that this is NORMAL on a modern drive! Modern drives - even a new one right out of the box - are constantly performing error correction because the platter densities are so incredibly high that writing one sector can interfere with the magnetic flux of an adjoining sector. Keep in mind that spinning drives are actually analog recordings - bits are written and read as analog voltages. A positive voltage is a 1 and a negative voltage a zero. If the recording is weak, the drive may have to re-read the sector several times to read it properly and check it against its CRC checksum to guarantee a good read. This is where all these ECC correction errors come from. If it has to do this over a certain number of times, then the sector gets swapped out as bad.
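
To put the 125 million figure in perspective, here is the back-of-the-envelope arithmetic as a few lines of Python. The only inputs are the two numbers quoted above (125 million corrections, 18 hours); the function name is just for illustration.

```python
def corrections_per_second(corrections: int, hours: float) -> float:
    """Average ECC corrections per second over the scan so far."""
    return corrections / (hours * 3600)


rate = corrections_per_second(125_000_000, 18)
print(f"{rate:,.0f} ECC corrections per second")  # about 1,929 per second
```

So the controller was quietly fixing roughly two thousand read errors every second, on average, for the whole scan.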

In the photo above, notice the two red dots in the top bar? That means ECC correction is above factory recommendation and that enough ECC errors occurred to raise a red flag - even in SMART's normally brain-dead scoring system. Ideally you never want to see any red bars after a full Level 4 scan. I personally would throw away any drive showing more than 40% of any bar in red.

I've marked this drive as "Early failure indicated". The red dots are an early indicator that the drive is wearing out. Regular daily use of the drive in a home PC would never give any indication of this - nor would it show up in normal SMART data. It's only because I hammered this drive with a level 4 scan for 20 hours that it showed its weakness. The drive MIGHT work just fine for another 3 years in a home PC (or it might die in a month - I wouldn't trust it). In time, as those ECC errors occur even more often, the read speeds would drop and the OS would start feeling sluggish (yes, that's why some old PCs get slower) and then one day some important OS kernel file will become unreadable and it will BSOD and won't boot anymore. Or a partition table will be on an unreadable sector and you won't be able to see any files on it. Or maybe it will just go right into the click of death when the mechanism wears out.

I wouldn't use this drive as a boot drive - except maybe on a charity PC or some junky 8-year-old computer in my garage. What I usually do with a drive like this is use it for archival backup. Every once in a while I like to copy all my important files off my server (I do have other backups, including a cloud backup), put them on a drive like this, and throw it in my safe deposit box. The drive isn't going to deteriorate sitting in there unused, and will certainly be readable one more time in 5 or 6 years if I need it to be.

If this was a client's drive, I'd replace it and give them this drive in an anti-static bag as an archive backup to put in a drawer, or a fire safe. Write the date on it and you have a drive with all your data on it up to this date that anyone with a SATA to USB adapter can read for them.


BE AWARE that a Spinrite level 4 scan on a 1TB drive can take 15 - 24 hours or more, depending on your controller speed, and the condition of the drive (I've seen really creaky drives take twice as long). Steve is working on Spinrite 6.1 (a free upgrade) which will be much faster because it will access SATA controllers directly instead of using IDE emulation via DOS as it does now.
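
For a rough idea of how long your own scan will take, you can extrapolate from progress so far. The sketch below uses the numbers from my 1TB drive (90% done after 18 hours) and assumes the scan speed stays constant, which it won't exactly on a creaky drive. The function names are made up for the example, and 1TB is taken as 10^12 bytes.

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate total scan time, assuming a constant scan rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


def effective_mb_per_second(capacity_bytes: float, total_hours: float) -> float:
    """Drive capacity covered per second of scanning, in MB/s (1 MB = 10^6 bytes)."""
    return capacity_bytes / (total_hours * 3600) / 1e6


total = estimate_total_hours(18, 0.90)
print(f"estimated total: {total:.1f} hours")
print(f"effective rate: {effective_mb_per_second(1e12, total):.1f} MB/s")
```

That works out to about 20 hours total, or roughly 14 MB/s of capacity covered. Keep in mind level 4 reads and writes every byte more than once, so the drive is actually moving a lot more data than that figure suggests.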

ALSO, as I've said before, you don't want to run a level 4 scan on a drive that's already near death's door. Level 4 is preventative maintenance - refreshing all the stale sectors on the drive - and for certifying a drive. Level 2 is what you want to use to recover data from an unreadable drive - it's much less aggressive: it reads every byte, forcing the drive controller to perform ECC correction or sector swaps on any weak sectors, and only writes if it comes across an unreadable sector. Spinrite will try VERY hard to read as much of that bad sector as it can one last time, and then have the drive swap it out - this is when you will see those green RECOVERED SECTOR "R" icons or the red BAD SECTOR "B" icons in the drive map in Spinrite's main display. If Spinrite can read SMART, and SMART indicates that the drive is a breath away from the grave, Spinrite will warn you that you may kill the drive. It will sternly warn you to try to back up any data you can still read before even performing a level 2 scan on a drive in that bad a shape.

ALSO BE AWARE that you should never run Spinrite level 4 on an SSD. It will only prematurely wear out memory cells. Instead, you should run level 2 on an SSD, which will force the controller to check the integrity of every byte and swap out any bad cells.
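
If it helps to see those rules of thumb in one place, here they are as a tiny Python function. This is just my own advice from the paragraphs above written as code, not anything Spinrite does internally, and `suggest_level` and its parameters are names invented for the example.

```python
def suggest_level(is_ssd: bool, near_failure: bool):
    """Return (level, note) following the rules of thumb in this post."""
    if near_failure:
        # A dying drive gets the gentle scan, and only after a backup attempt.
        return 2, "back up whatever is still readable first, then try level 2"
    if is_ssd:
        # Level 4 rewrites everything, which only wears out flash cells.
        return 2, "level 2 checks every byte without rewriting the whole drive"
    # Healthy spinning drive: full refresh for maintenance or certification.
    return 4, "level 4 refreshes stale sectors and certifies the drive"


for is_ssd, near_failure in [(False, False), (True, False), (False, True)]:
    level, note = suggest_level(is_ssd, near_failure)
    print(f"ssd={is_ssd} near_failure={near_failure}: level {level} ({note})")
```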

If you want more detailed info on what the SMART indicators mean, and how to read them, Steve has a detailed explanation on this page:
https://www.grc.com/sr/smart-studymode.htm

SMART isn't totally useless, and I have a nifty free utility called Active Hard Disk Monitor running on my server, to give me some idea of the condition of the drives: http://www.disk-monitor.com/
I dunno if the Free Trial is time limited. I'm running an old version on my home server that is a free version that never expires. Even at $6, it's still worth it.
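
If you'd rather script a check yourself, here is a minimal Python sketch that picks the five attributes from the Backblaze article (SMART IDs 5, 187, 188, 197 and 198) out of an attribute table. The sample text is made up for illustration and only mimics the table layout printed by smartmontools' `smartctl -A`. On a real system you would feed in the actual output instead. The "raw value above zero means investigate" rule is how I understand Backblaze's approach, so treat it as an assumption, and `flagged_attributes` is a name invented for the example.

```python
# The five attributes Backblaze watches, keyed by SMART ID.
WATCHED = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}

# Made-up sample in the style of a smartctl attribute table (not real data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       125000000
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""


def flagged_attributes(table: str) -> dict:
    """Return {id: raw_value} for watched attributes whose raw value is above zero."""
    flagged = {}
    for line in table.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id = int(parts[0])
        if attr_id not in WATCHED:
            continue
        try:
            raw = int(parts[9])  # first token of the RAW_VALUE column
        except ValueError:
            continue  # skip raw values that are not plain integers
        if raw > 0:
            flagged[attr_id] = raw
    return flagged


for attr_id, raw in flagged_attributes(SAMPLE).items():
    print(f"SMART {attr_id} ({WATCHED[attr_id]}): raw value {raw}")
```

Note that the huge Raw_Read_Error_Rate number in the sample is ignored on purpose: it isn't one of the five, and on many drives that raw counter is normally enormous.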