Preventive measures against hard disk failures with smartd / Installing smartmontools on Linux

Friday, March 15th, 2013

Many admins might not know about the smartmontools Linux package. It provides two useful tools, smartctl and smartd, which use the Self-Monitoring, Analysis and Reporting Technology system, often abbreviated as S.M.A.R.T. SMART support is nowadays available on virtually all modern ATA, SATA and SCSI hard disks. The smartmontools package is installable from the default package repositories of virtually all Linux distributions.

Having smartmontools installed on every critical production server is a must, because it serves as an early notification system when a hard disk is on the verge of failing (i.e. the physical media of the hard disk starts getting damaged). Over the last 14 years I have worked as a Linux sysadmin. I've used smartmontools on hundreds of servers, and many times it saved companies hundreds of dollars simply by reporting that a system HDD was dying, so that the server or hard disk could be replaced with an identically configured one.

smartmontools supports monitoring of single hard disks as well as disks configured at the hardware level to work in a RAID array. As of the time of writing you can check the list of hardware RAID controllers supported by smartmontools here.

1. Installing smartmontools

a) To install smartmontools on Debian, Ubuntu and other .deb based servers:

debian:~# apt-get install --yes smartmontools
.....

b) On CentOS, Fedora, RHEL and other RPM based distributions install with:

[root@centos ~]# yum --yes install smartmontools
.....
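If the same install step has to run on a mixed fleet of servers, the two commands above can be wrapped in a small helper. This is only a minimal sketch: the function names detect_pkg_manager and install_cmd_for are made up for illustration, and it only knows about the two package managers mentioned above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of smartmontools).

# Print the first supported package manager found in PATH.
detect_pkg_manager() {
    for pm in apt-get yum; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}

# Print the install command from this article for a given package manager.
install_cmd_for() {
    case "$1" in
        apt-get) echo "apt-get install --yes smartmontools" ;;
        yum)     echo "yum --yes install smartmontools" ;;
        *)       return 1 ;;
    esac
}

# Example use (as root):
#   pm=$(detect_pkg_manager) && $(install_cmd_for "$pm")
```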

2. Configuring and Enabling smartd hard disk health monitoring

a) on Debian and derivatives

Edit /etc/default/smartmontools:

debian:~# vim /etc/default/smartmontools

By default the file looks something like this:

# Defaults for smartmontools initscript (/etc/init.d/smartmontools)
# This is a POSIX shell fragment

# List of devices you want to explicitly enable S.M.A.R.T. for
# Not needed (and not recommended) if the device is monitored by smartd
#enable_smart="/dev/hda /dev/hdb"
#enable_smart="/dev/hda"
# uncomment to start smartd on system startup
#start_smartd=yes

# uncomment to pass additional options to smartd on startup
#smartd_opts="--interval=1800"

After editing, the config file should look something like this:

# Defaults for smartmontools initscript (/etc/init.d/smartmontools)
# This is a POSIX shell fragment

# List of devices you want to explicitly enable S.M.A.R.T. for
# Not needed (and not recommended) if the device is monitored by smartd
#enable_smart="/dev/hda /dev/hdb"
enable_smart="/dev/sda"
# uncomment to start smartd on system startup
start_smartd=yes

# uncomment to pass additional options to smartd on startup
#smartd_opts="--interval=1800"
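For those who would rather script the change than open vim, the start_smartd line can be uncommented with sed. A minimal sketch, assuming the file has the default layout shown above; the function name enable_smartd_start is made up for illustration, and it keeps a .bak copy of the original file next to it.

```shell
#!/bin/sh
# enable_smartd_start FILE
# Uncomment the "start_smartd=yes" line in a Debian-style defaults file.
# A backup of the original is kept as FILE.bak.
enable_smartd_start() {
    conf="$1"
    cp "$conf" "$conf.bak" &&
    sed 's/^#[[:space:]]*start_smartd=yes/start_smartd=yes/' "$conf.bak" > "$conf"
}

# Example use (as root):
#   enable_smartd_start /etc/default/smartmontools
```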

 

b) smartd options on CentOS, RHEL and Fedora

By default on RPM based distros there is no need for special configuration. However, for custom setups edit /etc/sysconfig/smartmontools and /etc/smartd.conf.
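As an illustration of such a custom case, here is a sketch that prints a per-device line for /etc/smartd.conf. The directives are standard smartd.conf ones: -a monitors all attributes, -o on enables automatic offline testing, -S on enables attribute autosave, -s schedules a short self-test daily at 02:00 and a long one on Saturdays at 03:00, and -m sets the mail recipient. The function name smartd_conf_line, the schedule and the recipient are assumptions to be adapted, not something the distro ships.

```shell
#!/bin/sh
# smartd_conf_line DEVICE MAILTO
# Print one smartd.conf entry for DEVICE that mails warnings to MAILTO.
# The self-test schedule below is only an example.
smartd_conf_line() {
    printf '%s -a -o on -S on -s (S/../.././02|L/../../6/03) -m %s\n' "$1" "$2"
}

# Example use (as root):
#   smartd_conf_line /dev/sda root >> /etc/smartd.conf
```

Keep in mind that if the stock smartd.conf contains an active DEVICESCAN line, smartd ignores the lines after it, so that line would have to be commented out for per-device entries to take effect.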

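c) Starting the smartd service

On Debian and derivatives the init script is /etc/init.d/smartmontools (as the comment at the top of /etc/default/smartmontools says). On CentOS, RHEL and Fedora: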

[root@centos default]# /etc/init.d/smartd start
Starting smartd:           [  OK  ]

3. Checking hard disk failure status with smartctl

The simplest way to check whether a hard disk passes its SMART health check is:

debian:~# /usr/sbin/smartctl -H /dev/sda

smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK
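For scripts and cron jobs it is easier to look at smartctl's exit status than to parse its text output. According to the smartctl manual page the exit status is a bit mask, and the sketch below decodes it into readable messages. The function name explain_smartctl_status and the wording of the messages are my own; only the meaning of the bits comes from the manual.

```shell
#!/bin/sh
# explain_smartctl_status CODE
# Print one line for every bit set in a smartctl exit status.
explain_smartctl_status() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    # Messages are listed in bit order, bit 0 first.
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed" \
        "SMART status: DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        if [ $(( (code >> bit) & 1 )) -eq 1 ]; then
            echo "$msg"
        fi
        bit=$((bit + 1))
    done
}

# Example use:
#   /usr/sbin/smartctl -H /dev/sda >/dev/null; explain_smartctl_status $?
```

Bit 3 (value 8) is the one that matters most here: it is set when the health check itself reports the disk as failing.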

 

 

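To print the drive identification details (model, serial number, firmware version and whether SMART is available and enabled):

debian:~# /usr/sbin/smartctl -i /dev/sda1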

smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST340014AS
Serial Number:    4MQ0LV3B
Firmware Version: 3.43
User Capacity:    40,020,664,320 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri Mar 15 15:27:12 2013 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
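The last line of this output is worth checking on new machines, since some drives ship with SMART disabled. A small sketch that tests for it, reading smartctl -i output on standard input; the function name smart_is_enabled is made up for illustration, while smartctl -s on is the regular smartctl switch for turning SMART on.

```shell
#!/bin/sh
# smart_is_enabled
# Read `smartctl -i` output on stdin; succeed if SMART is reported as enabled.
smart_is_enabled() {
    grep -q '^SMART support is:[[:space:]]*Enabled'
}

# Example use (as root): enable SMART only when it is currently off.
#   /usr/sbin/smartctl -i /dev/sda | smart_is_enabled || /usr/sbin/smartctl -s on /dev/sda
```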

To print as much information as possible about the hard disk health status:

[root@centos default]# /usr/sbin/smartctl -a /dev/sda1

smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST340014AS
Serial Number:    4MQ0LV3B
Firmware Version: 3.43
User Capacity:    40,020,664,320 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri Mar 15 15:14:53 2013 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:          ( 423) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (  19) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   052   045   006    Pre-fail  Always       -       172137473
  3 Spin_Up_Time            0x0002   098   098   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   096   096   020    Pre-fail  Always       -       4198
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   090   060   030    Pre-fail  Always       -       945095084
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22769
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   099   099   020    Pre-fail  Always       -       1084
194 Temperature_Celsius     0x0022   038   046   000    Old_age   Always       -       38 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   052   045   000    Old_age   Always       -       172137473
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 33 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 33 occurred at disk power-on lifetime: 21588 hours (899 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 77 c3 6a e0  Error: UNC at LBA = 0x006ac377 = 6996855

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 77 c3 6a e0 00      14:07:39.385  READ DMA
  ec 00 00 00 00 00 a0 00      14:07:35.553  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      14:07:35.550  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      14:07:35.547  IDENTIFY DEVICE
  c8 00 08 77 c3 6a e0 00      14:07:35.543  READ DMA

Error 32 occurred at disk power-on lifetime: 21588 hours (899 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 77 c3 6a e0  Error: UNC at LBA = 0x006ac377 = 6996855

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 77 c3 6a e0 00      14:07:23.940  READ DMA
  ec 00 00 00 00 00 a0 00      14:07:35.553  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      14:07:35.550  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      14:07:35.547  IDENTIFY DEVICE
  c8 00 08 77 c3 6a e0 00      14:07:35.543  READ DMA

Error 31 occurred at disk power-on lifetime: 21588 hours (899 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 77 c3 6a e0  Error: UNC at LBA = 0x006ac377 = 6996855

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 77 c3 6a e0 00      14:07:23.940  READ DMA
  ec 00 00 00 00 00 a0 00      14:07:23.937  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      14:07:20.071  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      14:07:20.057  IDENTIFY DEVICE
  c8 00 08 77 c3 6a e0 00      14:07:20.044  READ DMA

Error 30 occurred at disk power-on lifetime: 21588 hours (899 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 77 c3 6a e0  Error: UNC at LBA = 0x006ac377 = 6996855

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 77 c3 6a e0 00      14:07:23.940  READ DMA
  ec 00 00 00 00 00 a0 00      14:07:23.937  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      14:07:20.071  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      14:07:20.057  IDENTIFY DEVICE
  c8 00 08 77 c3 6a e0 00      14:07:20.044  READ DMA

Error 29 occurred at disk power-on lifetime: 21588 hours (899 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 77 c3 6a e0  Error: UNC at LBA = 0x006ac377 = 6996855

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 77 c3 6a e0 00      14:07:23.940  READ DMA
  ec 00 00 00 00 00 a0 00      14:07:23.937  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 00      14:07:20.071  SET FEATURES [Set transfer mode]
  ec 00 00 00 00 00 a0 00      14:07:20.057  IDENTIFY DEVICE
  c8 00 08 77 c3 6a e0 00      14:07:20.044  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
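In a report this long, the attributes I would look at first are the sector-related ones: Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable. A raw value above zero there usually deserves attention. The sketch below filters them out of the attribute table with awk. The function name nonzero_sector_attrs and the choice of these three attributes are my own assumptions, and attribute names can differ between drive vendors.

```shell
#!/bin/sh
# nonzero_sector_attrs
# Read `smartctl -A` (or -a) output on stdin and print "NAME RAW_VALUE"
# for the watched sector attributes whose raw value is above zero.
nonzero_sector_attrs() {
    awk '
        # Attribute rows start with a numeric ID; RAW_VALUE is column 10.
        $1 ~ /^[0-9]+$/ &&
        ($2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable") &&
        $10 + 0 > 0 { print $2, $10 }
    '
}

# Example use (as root):
#   /usr/sbin/smartctl -A /dev/sda | nonzero_sector_attrs
```

On the disk shown above all three raw values are 0, so the filter would print nothing, even though the error log already contains UNC read errors. It is one more signal to watch, not a replacement for reading the full report.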

4. Visualizing SMART data in GUI with gsmartcontrol

For people who prefer a graphical environment, SMART hard disk health data can be viewed in a nice graphical interface with the gsmartcontrol tool. Most Linux servers don't have a graphical environment, as running an X server with a desktop environment there is a waste of system resources, so installing gsmartcontrol on them doesn't make much sense. Where a GUI is available, however, gsmartcontrol is a good tool to have for monitoring and reporting upcoming hard disk issues.

a) To install gsmartcontrol on Debian and Ubuntu Linux:

debian:~# apt-get install --yes gsmartcontrol
....

 

b) Installing gsmartcontrol on CentOS, Fedora, RHEL and SuSE:

gsmartcontrol has binary package builds for all major Linux distributions except Slackware Linux. For any of the RPM based Linux distros, download the gsmartcontrol binaries for your distro type and version from here, then install the RPMs one by one with the usual:

[root@centos ~]# rpm -ivh glibmm*
....
[root@centos ~]# rpm -ivh libglademm*
....
[root@centos ~]# rpm -ivh libsigc*
....
[root@centos ~]# rpm -ivh cairomm*
....
[root@centos ~]# rpm -ivh gsmartcontrol*
....

Below are 2 screenshots of GSmartControl:

[Screenshot: gsmartcontrol on Debian stable Linux, monitoring hard disk health in a graphical environment]

[Screenshot: gsmartcontrol device information for /dev/sda (ST9160824AS) on a Lenovo Thinkpad]

If you get anything other than Overall health self-assessment test PASSED, the hard disk is most likely physically damaged and needs to be replaced ASAP. If the HDD hits I/O errors during normal operation and you can't afford a GUI environment just for gsmartcontrol, the errors also get logged by the kernel, so dmesg can give you useful info about a failing hard drive.
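To pick the disk-related messages out of a noisy kernel log, a simple grep filter is usually enough. A minimal sketch: the function name disk_errors_from_log is made up, and the patterns are only the ones I would start with (block layer I/O errors, libata failed commands, SCSI medium errors). The exact wording of kernel messages varies between kernel versions and drivers, so treat the list as an assumption to adjust.

```shell
#!/bin/sh
# disk_errors_from_log
# Read kernel log text on stdin and print only lines that look like
# disk I/O problems. The patterns are illustrative, not exhaustive.
disk_errors_from_log() {
    grep -E 'I/O error, dev [a-z]+|ata[0-9]+(\.[0-9]+)?: (failed command|error:)|Medium Error|Unrecovered read error'
}

# Example use:
#   dmesg | disk_errors_from_log
```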