Archive for April 16th, 2014

Save data from failing hard disk on Linux – Rescuing data from failing disk with bad blocks

Wednesday, April 16th, 2014

save-data-from-failing-hard-drive-data-recovery-badblocks-linux_1.jpg
Sooner or later your Linux Desktop or Linux server hard drive will start breaking up, whether you have a hardware or software RAID 1, 6 or 10 you can  and good hard disk health monitoring software you can react on time but sometimes as admins we have to take care of old servers which either have RAID 0 or missing RAID configuration and or disk firmware is unable to recognize failing blocks on time and remap them. Thus it is quite useful to have techniques to save data from failing hard disk drives with physical badblocks.

With ddrescue tool there is still hope for your Linux data though disk is full of unrecoverable I/O errors.

apt-cache show ddrescue
 

apt-cache show ddrescue|grep -i description -A 12

Description: copy data from one file or block device to another
 dd_rescue is a tool to help you to save data from crashed
 partition. Like dd, dd_rescue does copy data from one file or
 block device to another. But dd_rescue does not abort on errors
 on the input file (unless you specify a maximum error number).
 It uses two block sizes, a large (soft) block size and a small
 (hard) block size. In case of errors, the size falls back to the
 small one and is promoted again after a while without errors.
 If the copying process is interrupted by the user it is possible
 to continue at any position later. It also does not truncate
 the output file (unless asked to). It allows you to start from
 the end of a file and move backwards as well. dd_rescue does
 not provide character conversions.

 

To use ddrescue for saving data first thing is to shutdown the Linux host boot the system with a Rescue LiveCD like SystemRescueCD – (Linux system rescue disk), Knoppix (Most famous bootable LiveCD / LiveDVD), Ubuntu Rescue Remix or BackTrack LiveCD – (A security centered "hackers" distro which can be used also for forensics and data recovery), then mount the failing disk (I assume disk is still mountable :). Note that it is very important to mount the disk as read only, because any write operation on hard drive increases chance that it completely becomes unusable before saving your data!

To make backup of your whole hard disk data to secondary mounted disk into /mnt/second_disk

# mkdir /mnt/second_disk/rescue
# mount /dev/sda2 /mnt/second_disk/rescue
# dd_rescue -d -r 10 /dev/sda1 /mnt/second_disk/rescue/backup.img
# mount -o loop /mnt/second_disk/rescue/backup.img

In above example change /dev/sda2 to whatever your hard drive device is named.

Whether you have already an identical secondary drive attached to the Linux host and you would like to copy whole failing Linux partition (/dev/sda) to the identical drive (/dev/sdb) issue:

ddrescue -d -f -r3 /dev/sda /dev/sdb /media/PNY_usb/rescue.logfile

If you got just a few unreadable files and you would like to recover only them then run ddrescue just on the damaged files:

ddrescue -d –R -r 100 /damaged/disk/some_dir/damaged_file /mnt/secondary_disk/some_dir/recoveredfile

-d instructs to use direct I/O
-R retrims the error area on each retry
-r 100 sets the retry limit to 100 (tries to read data 100 times before resign)

Of course this is not always working as on some HDDs recovery is impossible due to hard physical damages, if above command can't recover a file in 10 attempts it is very likely that it never succeeds …

A small note to make here is that there is another tool dd_rescue (make sure you don't confuse them) – which is also for recovery but GNU ddrescue performs better with recovery.
How ddrescue works is it keeps track of the bad sectors, and go back and try to do a slow read of that data in order to read them.
By the way BSD users would happy to know there is ddrescue port already, so data recovery on BSDs *NIX filesystems if you're a Windows user you can use ddrescue to recover data too via Cygwin.
Of course final data recovery is also very much into God's hands so before launching ddrescue, don't forget to say a prayer 🙂

How to detect failing storage LUN on Linux – multipath

Wednesday, April 16th, 2014

detect-failing-lun-on-linux-failing-scsi-detection

If you login to server and after running dmesg – show kernel log command you get tons of messages like:

# dmesg

 

end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 0
sd 0:0:0:2: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 0
sd 0:0:0:2: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 0

 

In /var/log/messages there are also log messages present like:

# cat /var/log/messages
...

 

Apr 13 09:45:49 sl02785 kernel: end_request: I/O error, dev sda, sector 0
Apr 13 09:45:49 sl02785 kernel: klogd 1.4.1, ———- state change ———-
Apr 13 09:45:50 sl02785 kernel: sd 0:0:0:1: SCSI error: return code = 0x00010000
Apr 13 09:45:50 sl02785 kernel: end_request: I/O error, dev sdb, sector 0
Apr 13 09:45:55 sl02785 kernel: sd 0:0:0:2: SCSI error: return code = 0x00010000
Apr 13 09:45:55 sl02785 kernel: end_request: I/O error, dev sdc, sector 0

 

 

This is a sure sign something is wrong with SCSI hard drives or SCSI controller, to further debug the situation you can use the multipath command, i.e.:


multipath -ll | grep -i failed
_ 0:0:0:2 sdc 8:32 [failed][faulty]
_ 0:0:0:1 sdb 8:16 [failed][faulty]
_ 0:0:0:0 sda 8:0 [failed][faulty]

 

As you can see all 3 drives merged (sdc, sdb and sda) to show up on 1 physical drive via the remote network connectedLUN to the server is showing as faulty. This is a clear sign something is either wrong with 3 hard drive members of LUN – (Logical Unit Number) (which is less probable)  or most likely there is problem with the LUN network  SCSI controller. It is common error that LUN SCSI controller optics cable gets dirty and needs a physical clean up to solve it.

In case you don't know what is storage LUN? – In computer storage, a logical unit number, or LUN, is a number used to identify a logical unit, which is a device addressed by the SCSI protocol or protocols which encapsulate SCSI, such as Fibre Channel or iSCSI. A LUN may be used with any device which supports read/write operations, such as a tape drive, but is most often used to refer to a logical disk as created on a SAN. Though not technically correct, the term "LUN" is often also used to refer to the logical disk itself.

What LUN's does is very similar to Linux's Software LVM (Logical Volume Manager).