Posts Tagged ‘nagios’

Monitoring Linux hardware Hard Drives / Temperature and Disk with lm_sensors / smartd / hddtemp and Zabbix Userparameter lm_sensors report script

Thursday, April 30th, 2020

monitoring-linux-hardware-with-software-temperature-disk-cpu-health-zabbix-userparameter-script

I'm part of a  SysAdmin Team that is partially doing some minor Zabbix imrovements on a custom corporate installed Zabbix in an ongoing project to substitute the previous HP OpenView monitoring for a bunch of Legacy Linux hosts.
As one of the necessery checks to have is regarding system Hardware, the task was to invent some simplistic way to monitor hardware with the Zabbix Monitoring tool.  Monitoring Bare Metal servers hardware of HP / Dell / Fujituse etc. servers  in Linux usually is done with a third party software provided by the Hardware vendor. But as this requires an additional services to run and sometimes is not desired. It was interesting to find out some alternative Linux native ways to do the System hardware monitoring.
Monitoring statistics from the system hardware components can be obtained directly from the server components with ipmi / ipmitool (for more info on it check my previous article Reset and Manage intelligent  Platform Management remote board article).
With ipmi
 hardware health info could be received straight from the ILO / IDRAC / HPMI of the server. However as often the Admin-Lan of the server is in a seperate DMZ secured network and available via only a certain set of routed IPs, ipmitool can't be used.

So what are the other options to use to implement Linux Server Hardware Monitoring?

The tools to use are perhaps many but I know of two which gives you most of the information you ever need to have a prelimitary hardware damage warning system before the crash, these are:
 

1. smartmontools (smartd)

Smartd is part of smartmontools package which contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology system (SMART) built into most modern ATA/SATA, SCSI/SAS and NVMe disks

Disk monitoring is handled by a special service the package provides called smartd that does query the Hard Drives periodically aiming to find a warning signs of hardware failures.
The downside of smartd use is that it implies a little bit of extra load on Hard Drive read / writes and if misconfigured could reduce the the Hard disk life time.

linux:~#  /usr/sbin/smartctl -a /dev/sdb2
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-5-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SA400S37240G
Serial Number:    50026B768340AA31
LU WWN Device Id: 5 0026b7 68340aa31
Firmware Version: S1Z40102
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Apr 30 14:05:01 2020 EEST
SMART support is: Available – device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       –       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       –       2820
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       –       21
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       0
167 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       –       0
169 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       0
170 Unknown_Attribute       0x0000   100   100   010    Old_age   Offline      –       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       –       0
173 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       0
181 Program_Fail_Cnt_Total  0x0032   100   100   000    Old_age   Always       –       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      –       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       –       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       –       16
194 Temperature_Celsius     0x0022   034   052   000    Old_age   Always       –       34 (Min/Max 19/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       –       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       –       0
218 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       –       0
231 Temperature_Celsius     0x0000   097   097   000    Old_age   Offline      –       97
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       –       2104
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       –       1857
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       –       1141
244 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       32
245 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       107
246 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      –       15940

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

2. hddtemp

Usually if smartd is used it is useful to also use hddtemp which relies on smartd data.
 The hddtemp program monitors and reports the temperature of PATA, SATA
 or SCSI hard drives by reading Self-Monitoring Analysis and Reporting
 Technology (S.M.A.R.T.)
information on drives that support this feature.
 

linux:~# /usr/sbin/hddtemp /dev/sda1
/dev/sda1: Hitachi HDS721050CLA360: 31°C
linux:~# /usr/sbin/hddtemp /dev/sdc6
/dev/sdc6: KINGSTON SV300S37A120G: 25°C
linux:~# /usr/sbin/hddtemp /dev/sdb2
/dev/sdb2: KINGSTON SA400S37240G: 34°C
linux:~# /usr/sbin/hddtemp /dev/sdd1
/dev/sdd1: WD Elements 10B8: S.M.A.R.T. not available

3. lm-sensors / i2c-tools 

 Lm-sensors is a hardware health monitoring package for Linux. It allows you
 to access information from temperature, voltage, and fan speed sensors.
i2c-tools
was historically bundled in the same package as lm_sensors but has been seperated cause not all hardware monitoring chips are I2C devices, and not all I2C devices are hardware monitoring chips.

The most basic use of lm-sensors is with the sensors command

linux:~# sensors
i350bb-pci-0600
Adapter: PCI adapter
loc1:         +55.0 C  (high = +120.0 C, crit = +110.0 C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +28.0 C  (high = +78.0 C, crit = +88.0 C)
Core 0:         +26.0 C  (high = +78.0 C, crit = +88.0 C)
Core 1:         +28.0 C  (high = +78.0 C, crit = +88.0 C)
Core 2:         +28.0 C  (high = +78.0 C, crit = +88.0 C)
Core 3:         +28.0 C  (high = +78.0 C, crit = +88.0 C)

 


On CentOS Linux useful tool is also  lm_sensors-sensord.x86_64 – A Daemon that periodically logs sensor readings to syslog or a round-robin database, and warns of sensor alarms.

In Debian Linux there is also the psensors-server (an HTTP server providing JSON Web service which can be used by GTK+ Application to remotely monitor sensors) useful for developers
psesors-server

psensor-linux-graphical-tool-to-check-cpu-hard-disk-temperature-unix

If you have a Xserver installed on the Server accessed with Xclient or via VNC though quite rare,
You can use xsensors or Psensora GTK+ (Widget Toolkit for creating Graphical User Interface) application software.

With this 3 tools it is pretty easy to script one liners and use the Zabbix UserParameters functionality to send hardware report data to a Company's Zabbix Sserver, though Zabbix has already some templates to do so in my case, I couldn't import this templates cause I don't have Zabbix Super-Admin credentials, thus to work around that a sample work around is use script to monitor for higher and critical considered temperature.
Here is a tiny sample script I came up in 1 min time it can be used to used as 1 liner UserParameter and built upon something more complex.

SENSORS_HIGH=`sensors | awk '{ print $6 }'| grep '^+' | uniq`;
SENSORS_CRIT=`sensors | awk '{ print $9 }'| grep '^+' | uniq`; ;SENSORS_STAT=`sensors|grep -E 'Core\s' | awk '{ print $1" "$2" "$3 }' | grep "$SENSORS_HIGH|$SENSORS_CRIT"`;
if [ ! -z $SENSORS_STAT ]; then
echo 'Temperature HIGH';
else 
echo 'Sensors OK';
fi 

Of course there is much more sophisticated stuff to use for monitoring out there


Below script can be easily adapted and use on other Monitoring Platforms such as Nagios / Munin / Cacti / Icinga and there are plenty of paid solutions, but for anyone that wants to develop something from scratch just like me I hope this
article will be a good short introduction.
If you know some other Linux hardware monitoring tools, please share.

Fix to Nagios is currently not checking for external commands

Wednesday, August 24th, 2011

While I was deploying a new Nagios install to Monitor some Windows hosts I’ve came across the following error in Nagios’s web interface:

Sorry, but Nagios is currently not checking for external commands, so your command will not be committed!
Read the documentation for information on how to enable external commands...

This error is caused by an option configuration for /etc/nagios/nrpe.cfg (part of the nrpe-nagios-server Debian package.

The config variable in nrpe.cfg causing the error is check_external_command=0 , the fix comes to changing the variable to:

check_external_command=1

As well as restart the /etc/init.d/nagios-nrpe-server and /etc/init.d/nagios3 services:

debian:~# /etc/init.d/nagios3 restart
...
debian:~# /etc/init.d/nagios-nrpe-server
...

This changes has work out the error Sorry, but Nagios is currently not checking for external commands, so your command will not be committed! , however immediately after another kind of error appared in Nagios web interface when I tried to use the send Nagios commands button. The error was:

Error: Could not stat() command file '/var/lib/nagios3/rw/nagios.cmd'!

This error is due to a deb package, which seems to be affecting the current deb versions of Nagios shipped with Debian 6 Squeeze stable, as well as the Latest Ubuntu release 11.04.

Thanksfully there is a work around to the problem I found online, to fix it up I had to execute the commands:

debian:~# /etc/init.d/nagios3 stop debian:~# dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
debian:~# dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
debian:~# /etc/init.d/nagios3 start

And hooray Thanks God the error is gone 😉

How to change mail sent from in Nagios on Debian GNU/Linux 6

Wednesday, August 24th, 2011

I’ve been playing with configuring a new nagios running on a Linux host which’s aim is to monitor few Windows servers.
The Linux host’s exim is configured to act as relay host to another SMTP server, so all email ending up in the Linux localhost on port 25 is forwarded to the remote SMTP.

The remote smtp only allows the Linux to send email only in case if a real existing username@theserverhostname.com is passed it, otherwise it rejects mail and does not sent properly the email.
As the newly configured Nagios installatio is supposed to do e-mail notification, I was looking for a way to change the default user with which Nagios sends mails, which is inherited directly after the username with which /usr/sbin/nagios3 and /usr/sbin/nrpe are running (on Debian this is nagios@theserverhostname.com).

Thanksfully, there is a work around, I’ve red some forum threads explaning that the username with whch nagios sends mail can be easily changed from /etc/nagios3/commands.cfg by passing the -a “From: custom_user@myserverhostname.com” to all occurance of /usr/bin/mail -s , its preferrable that the -a custom_user@myserverhostname.com is inserted before the -s “” subject option. Hence the occurance of mail command should be changed from:

| /usr/bin/mail -s "** $NOTIFICATIONTYPE$

To:

| /usr/bin/mail -a "From: custom_user@theserverhostname.com" -s "** $NOTIFICATIONTYPE$

Now to read it’s new configurations nagios requirs restart:

debian:~# /etc/init.d/nagios3 restart
...

Now in case of failed services or Hosts Down nagios will send it’s mail from the custom user custom_user@theserverhostname.com and nagios can can send mail properly via the remote relay SMTP host 😉

How to check Host is up with Nagios for servers with disabled ICMP (ping) protocol

Friday, July 15th, 2011

At the company where I administrate some servers, they’re running Nagios to keep track of the servers status and instantly report if problems with connectivity to certain servers occurs.

Now one of the servers which had configured UP host checks is up, but because of heavy ICMP denial of service attacks to the servers the ICMP protocol ping is completely disabled.

In Nagios this host was constantly showing as DOWN in the usual red color, so nagios reported issue even though all services on the client are running fine.

As this is quite annoying, I checked if Nagios supports host checking without doing the ICMP ping test. It appeared it does through something called in nagios Submit passive check result for host

Enabling the “Submit passive check result for this host” could be done straight from Nagios’s web interface (so I don’t even have to edit configurations! ;).
Here is how I did it. In Nagios I had to navigate to:

Hosts -> Click over my host (hosting1) which showed in red as down

Nagios disable ICMP ping report for hosts

You see my down host which I clicked over showing in red in above pic.

On next Nagios screen I had to select, Disable active checks of this host

Nagios Disable active ICMP checks of this host
and press on the Commit button.

Next following text appears on browser:

Your command request was successfully submitted to Nagios for processing.

Note: It may take a while before the command is actually processed.

Afterwards I had to click on Submit passive check result for this host and in:
Check Output to type in:

check_tcp -p 80

Here is the Screenshot of the Command Options dialog:

Nagios submit passive check with check TCP -p 80

That’s all now Nagious should start checking the down host by doing a query if the webserver on port 80 is up and running instead of pinging it.
As well as the server is no longer shown in the Nagio’s Down host list.