ATA Bad Sectors HOWTO
by Herbert Molenda
Everyone gets bad sectors eventually (unless you are lucky/spoiled enough to get a new PC every year); they are noted by the tell-tale error messages:
Buffer I/O error on device sdc2, logical block 173539331
ata4: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
ata4: status=0x51 { DriveReady SeekComplete Error }
ata4: error=0x40 { UncorrectableError }
That is, of course, if you are running a normal OS like *nix. Good luck figuring out what the hell is going on when this happens in Windows: probably some kind of generic “Input/Output error” message will be flying about, if you even get a bootable system.
Regardless of what you were running, go out and get yourself a copy of RIP (Recovery-is-possible Bootable Linux Rescue CD). Most of the operations you are about to perform require the drives to be in an unmounted (offline) state, so doing this from a running system may be impossible and is definitely not recommended.
First off, you should be sure you are actually dealing with bad sectors and not something at the filesystem level, such as simple corruption. When something goes wrong at the filesystem level (software), the system sometimes misplaces or corrupts individual files. This is easily handled with a filesystem scan/fix program such as scandisk or fsck, which will either be able to piece the files back together or not (probably not). However, when true bad sectors occur at the hardware level (the above-mentioned errors, and sometimes repetitive clicking sounds from the actual drive), the fix involves manipulating the actual blocks on the drive, regardless of the filesystems and partitions present.
Statistically speaking, once a drive develops bad blocks/sectors, the likelihood of more occurring is greatly increased. With today’s drive prices it’s not worth using the drive any longer: conduct the fix to get the filesystem mountable again and get whatever data you can onto a new drive. Nearly all modern drives support a feature set called S.M.A.R.T., which includes many low-level scanning and maintenance tools. Moreover, these drives are not stupid: when an attempt is made to write to a bad block, the drive automatically adds that block to its list of reallocated sectors and never attempts to use it again. Note the emphasis on write: regular scanning programs find bad sectors by reading them, and this will not force reallocation. They will not attempt writes since they do not want to destroy the data contained in the block, even though that data (a small amount, usually 1-4 kB depending on the filesystem’s block size, and nothing at all if the block lies within free space) is pretty much useless anyway. Very rarely a bad block will suddenly start functioning after 3-5 read attempts.
So the procedure in a nutshell: find the bad blocks, force reallocation by writing zeros to the block address, repair the filesystem, back up.
Find the bad blocks:
S.M.A.R.T. has what is called an “Extended Offline Test”, which goes block-by-block attempting to read until it finds a failure, then aborts. The test runs in the background, and after a certain time you can view the results. This tells you the first block that is unreadable; if you suspect there are many bad blocks, it is prudent to run an additional script to check the surrounding blocks, or simply re-run the smartctl test.
# smartctl -t long -d ata /dev/sdc
# smartctl -l selftest -d ata /dev/sdc
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  1  Extended offline    Completed: read failure        40%            16786             2037331
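If you end up repeating the test several times, it can help to pull the LBA out of the log automatically. The following is only a sketch: first_error_lbas is a hypothetical helper name of my own, and it assumes the failed entries contain the text "read failure" with the LBA in the last column, as in the log above.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and print
# the last column (LBA_of_first_error) of every "read failure" entry.
first_error_lbas() {
    awk '/read failure/ { print $NF }'
}

# Intended use (needs root and a real drive):
#   smartctl -l selftest -d ata /dev/sdc | first_error_lbas

# Demonstration on the log entry shown above:
printf '%s\n' '1 Extended offline Completed: read failure 40% 16786 2037331' |
    first_error_lbas    # prints 2037331
```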
The Logical Block Address (LBA) of the first error is given, counted in 512-byte sectors from the start of the drive. To be able to do anything with it, however, we must translate it to the address of the block as seen by the filesystem. Depending on the filesystem in question you must either know or find out the block size being used. Here are some typical values (if you are not sure, use an appropriate tool for your filesystem and find out):
FAT32: cluster size depends on the volume size (commonly 4096 bytes or more)
NTFS (default): 4096 bytes
ext3 (default): 4096 bytes (1024 bytes on very small filesystems)
reiserfs (default): 4096 bytes
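For ext2/ext3, one way to find out is to read the "Block size" line that tune2fs -l prints. A minimal sketch, assuming that line format; ext_block_size is a hypothetical helper name, and other filesystems need their own tools.

```shell
# Hypothetical helper: extract the block size (in bytes) from the output of
# `tune2fs -l`, read on stdin.
ext_block_size() {
    awk -F: '/^Block size/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Intended use (ext2/ext3 only, needs root):
#   tune2fs -l /dev/sdc2 | ext_block_size

# Demonstration on a sample line in tune2fs's format:
printf 'Block size:               4096\n' | ext_block_size    # prints 4096
```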
In this example (since I was repairing a reiserfs system) I am using 4096 bytes as the block size (Windows calls the equivalent unit a cluster). Take a look at your partition layout to determine where the block lies.
# fdisk -ul
...
Disk /dev/sdc: 122.9 GB, 122942324736 bytes
255 heads, 63 sectors/track, 14946 cylinders, total 240121728 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              63     2008124     1004031   82  Linux swap / Solaris
/dev/sdc2         2008125   240107489   119049682+  83  Linux
I have truncated the above output, but you can see that LBA 2037331 lies in the second partition of the third drive, /dev/sdc2 (which spans sectors 2008125 to 240107489). Subtract the partition’s start sector to get the offset within the partition; for reiserfs, or any filesystem whose block size is not 512 bytes, a simple calculation then gives the filesystem’s address of the block.
# echo "(2037331-2008125)*512/4096" | bc -l
3650.75000000000000000000
3650 is our block (notice the damage is actually near the last 1/4 of it)!
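The same lookup and arithmetic can be sketched as a few shell functions. These are hypothetical helpers of my own, not part of any tool mentioned here, and they assume 512-byte drive sectors, a block size that is a multiple of 512, and partition lines retyped by hand as "device start end".

```shell
# Print the partition whose sector range contains LBA $1.
# stdin: one "device start end" triple per line (taken from fdisk -ul).
partition_for_lba() {
    awk -v lba="$1" '$2 <= lba && lba <= $3 { print $1 }'
}

# Print the filesystem block that contains LBA $1.
#   $2 = partition start sector, $3 = filesystem block size in bytes
# Shell arithmetic is integer-only, so the division rounds down.
lba_to_fs_block() {
    echo $(( ($1 - $2) / ($3 / 512) ))
}

# Print which 512-byte sector inside that block is the bad one (0-based).
sector_in_block() {
    echo $(( ($1 - $2) % ($3 / 512) ))
}

printf '%s\n' '/dev/sdc1 63 2008124' '/dev/sdc2 2008125 240107489' |
    partition_for_lba 2037331             # prints /dev/sdc2
lba_to_fs_block 2037331 2008125 4096      # prints 3650
sector_in_block 2037331 2008125 4096      # prints 6
```

Sector 6 of the 8 sectors in the block (counting from 0) is the same position as the .75 that bc reported.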
Zero the block:
# dd if=/dev/zero of=/dev/sdc2 bs=4096 count=1 seek=3650
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000255 seconds, 16.1 MB/s
If the speed noted above is something like 1000MB/s+, then it’s likely that something is wrong with your command.
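If you want to convince yourself of what bs, count and seek do before pointing dd at a real partition, you can rehearse on a scratch file. This is only an illustration on a temporary file made with mktemp; the block number 3 and the 8-block image size are arbitrary choices of mine.

```shell
# Scratch-file rehearsal only; nothing here touches a real drive.
img=$(mktemp)

# Build an 8-block image (4096-byte blocks) filled with the letter x.
dd if=/dev/zero bs=4096 count=8 2>/dev/null | tr '\0' 'x' > "$img"

# Zero block 3 only. conv=notrunc is needed for a regular file, since dd
# would otherwise truncate it after the written block; a block device such
# as /dev/sdc2 is not truncated.
dd if=/dev/zero of="$img" bs=4096 count=1 seek=3 conv=notrunc 2>/dev/null

# Count the non-zero bytes left in block 3 and in its neighbour, block 4.
blk3=$(dd if="$img" bs=4096 count=1 skip=3 2>/dev/null | tr -d '\0' | wc -c | tr -d ' ')
blk4=$(dd if="$img" bs=4096 count=1 skip=4 2>/dev/null | tr -d '\0' | wc -c | tr -d ' ')
echo "non-zero bytes: block 3 = $blk3, block 4 = $blk4"

rm -f "$img"
```

Block 3 should come back with 0 non-zero bytes and block 4 with all 4096 intact, which shows that seek counts in units of bs and that only one block is overwritten.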
Re-scan:
The reason I rescan the entire disk is to make sure no new bad blocks are developing in the previously confirmed area (if they are, your chances of repair just dropped dramatically).
# smartctl -t long -d ata /dev/sdc
If your long scan completes without error, that’s it! Run the appropriate filesystem repair utility and reboot. If not, repeat the procedure with the newly reported LBA.
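Deciding whether another round is needed can be sketched as a small check on the self-test log. latest_selftest_ok is a hypothetical helper; it assumes the newest entry is listed first and that a clean run is reported as "Completed without error". The sample lines below are illustrative, not output from my drive.

```shell
# Hypothetical helper: succeed (exit 0) only if the first "Extended offline"
# entry in `smartctl -l selftest` output (read on stdin) completed cleanly.
latest_selftest_ok() {
    awk '/Extended offline/ { ok = /Completed without error/; found = 1; exit }
         END { exit !(found && ok) }'
}

# Intended use (needs root and a real drive):
#   smartctl -l selftest -d ata /dev/sdc | latest_selftest_ok && echo "clean"

# Demonstration on two illustrative log entries:
if printf '%s\n' '# 1  Extended offline    Completed without error       00%     16790         -' |
    latest_selftest_ok; then
    echo "clean: run the filesystem repair utility"
fi
if ! printf '%s\n' '# 1  Extended offline    Completed: read failure       40%     16786         2037331' |
    latest_selftest_ok; then
    echo "still failing: zero the reported block and scan again"
fi
```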
Alternative:
If you notice the number of bad blocks increasing, or there are simply large areas of damage, a complete duplicate of the partition/drive may be in order. A program called dd_rescue is ideal for this, as it will copy the entire area block-by-block and write out zeros at your destination in place of bad blocks.
Most people in this case will not have the free space available locally to duplicate an entire drive. Either go out and buy a new drive, or mount some space on another machine over ssh/smb.
# sshfs username@192.168.0.2:/home/username /mnt/hd
For smb, smbmnt takes the mount point followed by its options:
Usage: smbmnt mount-point [options]
Use dd_rescue to copy the entire drive to a file, or directly to another drive mounted locally.
# dd_rescue -A -v -b 4096 /dev/sdc2 /mnt/hd/badblocks.img
Always set -b to the filesystem block size, as above (dd_rescue’s separate -B option sets the fallback block size it uses after a read error). When the copy operation completes you will have an idea of how much data was lost; then run the filesystem repair utility on the copy to rebuild it.