nagios Archives - ☩ Walking in Light with Christ - Faith, Computing, Diary ☩ Walking in Light with Christ

Posts Tagged ‘nagios’

Monitoring Linux hardware Hard Drives / Temperature and Disk with lm_sensors / smartd / hddtemp and Zabbix Userparameter lm_sensors report script

Thursday, April 30th, 2020

I'm part of a SysAdmin team that is making some minor Zabbix improvements on a custom corporate Zabbix installation, in an ongoing project to replace the previous HP OpenView monitoring for a bunch of legacy Linux hosts.
One of the necessary checks concerns system hardware, so the task was to come up with a simple way to monitor hardware with Zabbix. Hardware monitoring of bare-metal HP / Dell / Fujitsu etc. servers under Linux is usually done with third-party software provided by the hardware vendor. But this requires additional services to run, which is sometimes not desired, so it was interesting to look for alternative, Linux-native ways to do the hardware monitoring.
Statistics from the hardware components can be obtained directly from the server with ipmi / ipmitool (for more info check my previous article "Reset and Manage Intelligent Platform Management remote board").
With IPMI, hardware health info can be received straight from the ILO / IDRAC / HPMI of the server. However, the Admin-LAN of the server is often in a separate, secured DMZ network that is reachable only from a certain set of routed IPs, and in that case ipmitool can't be used.

So what other options are there for implementing Linux server hardware monitoring?

There are probably many tools to use, but I know of three that give you most of the information you need to have a preliminary hardware damage warning system before the crash. These are:

1. smartmontools (smartd)

smartd is part of the smartmontools package, which contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into most modern ATA/SATA, SCSI/SAS and NVMe disks.

Disk monitoring is handled by a service the package provides, smartd, which queries the hard drives periodically, looking for warning signs of hardware failure.
The downside of smartd is that it puts a little extra read / write load on the hard drive and, if misconfigured, could reduce the disk's lifetime.

linux:~# /usr/sbin/smartctl -a /dev/sdb2
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-5-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: KINGSTON SA400S37240G
Serial Number: 50026B768340AA31
LU WWN Device Id: 5 0026b7 68340aa31
Firmware Version: S1Z40102
User Capacity: 240,057,409,536 bytes [240 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Apr 30 14:05:01 2020 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 000 Old_age Always - 100
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2820
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 21
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
170 Unknown_Attribute 0x0000 100 100 010 Old_age Offline - 0
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0000 100 100 000 Old_age Offline - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 16
194 Temperature_Celsius 0x0022 034 052 000 Old_age Always - 34 (Min/Max 19/52)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
231 Temperature_Celsius 0x0000 097 097 000 Old_age Offline - 97
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 2104
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 1857
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 1141
244 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 32
245 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 107
246 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 15940

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported
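
For a Zabbix item it is usually handier to reduce this long report to a single number. Here is a minimal sketch, assuming the drive prints the "SMART overall-health self-assessment test result" line seen in the output above (SAS drives word it differently). The function name smart_health_status is made up for illustration.

```shell
#!/bin/sh
# smart_health_status: read "smartctl -H" (or "smartctl -a") output on stdin
# and print 1 if the overall health line says PASSED, otherwise 0.
# The function name is hypothetical; only the matched line comes from smartctl.
smart_health_status() {
    if grep -q 'overall-health self-assessment test result: PASSED'; then
        echo 1
    else
        echo 0
    fi
}

# Typical use (needs root and a real disk, so it is left commented out):
#   /usr/sbin/smartctl -H /dev/sdb | smart_health_status
```

A numeric 1 / 0 result is easy to put a trigger on, while the full report stays available by running smartctl by hand.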

2. hddtemp

If smartd is used, it is usually useful to also use hddtemp, which reads the same S.M.A.R.T. data from the drive.
The hddtemp program monitors and reports the temperature of PATA, SATA or SCSI hard drives by reading Self-Monitoring Analysis and Reporting Technology (S.M.A.R.T.) information on drives that support this feature.

linux:~# /usr/sbin/hddtemp /dev/sda1
/dev/sda1: Hitachi HDS721050CLA360: 31°C
linux:~# /usr/sbin/hddtemp /dev/sdc6
/dev/sdc6: KINGSTON SV300S37A120G: 25°C
linux:~# /usr/sbin/hddtemp /dev/sdb2
/dev/sdb2: KINGSTON SA400S37240G: 34°C
linux:~# /usr/sbin/hddtemp /dev/sdd1
/dev/sdd1: WD Elements 10B8: S.M.A.R.T. not available
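
To feed a monitoring item, only the number from such a line is needed. The sketch below assumes the "device: model: temperature" layout shown above; the function name hddtemp_value is hypothetical.

```shell
#!/bin/sh
# hddtemp_value: take one line of hddtemp output as the first argument and
# print just the temperature as an integer. Returns 1 when the line carries
# no reading (for example "S.M.A.R.T. not available").
# The function name is hypothetical.
hddtemp_value() {
    last=${1##*: }            # text after the final ": ", e.g. "31°C"
    temp=${last%%[!0-9]*}     # keep only the leading digits, e.g. "31"
    [ -n "$temp" ] || return 1
    echo "$temp"
}

# Typical use (needs root and a real disk, so it is left commented out):
#   hddtemp_value "$(/usr/sbin/hddtemp /dev/sda)"
```

hddtemp also has a -n (--numeric) option that prints only the number; the function is handy mainly when the full line is already being collected or logged.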

3. lm-sensors / i2c-tools

lm-sensors is a hardware health monitoring package for Linux. It allows you to access information from temperature, voltage, and fan speed sensors.
i2c-tools was historically bundled in the same package as lm_sensors but has been separated, because not all hardware monitoring chips are I²C devices, and not all I²C devices are hardware monitoring chips.

The most basic use of lm-sensors is with the sensors command:

linux:~# sensors
i350bb-pci-0600
Adapter: PCI adapter
loc1: +55.0 C (high = +120.0 C, crit = +110.0 C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +28.0 C (high = +78.0 C, crit = +88.0 C)
Core 0: +26.0 C (high = +78.0 C, crit = +88.0 C)
Core 1: +28.0 C (high = +78.0 C, crit = +88.0 C)
Core 2: +28.0 C (high = +78.0 C, crit = +88.0 C)
Core 3: +28.0 C (high = +78.0 C, crit = +88.0 C)

On CentOS Linux another useful tool is lm_sensors-sensord.x86_64, a daemon that periodically logs sensor readings to syslog or a round-robin database and warns of sensor alarms.

On Debian Linux there is also psensor-server (an HTTP server providing a JSON web service that a GTK+ application can use to monitor sensors remotely), which is useful for developers.

If you have an X server installed on the server and access it with an X client or via VNC (though that is quite rare), you can use xsensors or Psensor, a GTK+ (widget toolkit for creating graphical user interfaces) application.

With these 3 tools it is pretty easy to script one-liners and use the Zabbix UserParameter functionality to send hardware report data to a company's Zabbix server. Zabbix already has some templates for this, but in my case I couldn't import them because I don't have Zabbix Super-Admin credentials. To work around that, a simple option is a script that watches for temperatures considered high or critical.
Here is a tiny sample script I came up with in a minute; it can be used as a UserParameter and built upon into something more complex.

# Collect every CPU core whose current temperature has reached its "high" limit
# (the "crit" limit lies above "high", so it is covered as well).
SENSORS_STAT=$(sensors | awk '/^Core [0-9]+:.*high = / {
    high = $0
    sub(/.*high = /, "", high)     # keep what follows "high = ", e.g. "+78.0 C, ..."
    if (($3 + 0) >= (high + 0)) print $1, $2, $3   # awk uses the leading number
}')
if [ -n "$SENSORS_STAT" ]; then
    echo 'Temperature HIGH';
else
    echo 'Sensors OK';
fi
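
A text answer like the one above is fine for a quick check, but a numeric value lets Zabbix draw graphs and apply its own thresholds. A minimal sketch under the assumption that the CPU lines start with "Core N:" as in the sensors output shown earlier; the function name, the Zabbix key and the script path are all made up for illustration.

```shell
#!/bin/sh
# max_core_temp: read "sensors" output on stdin and print the highest
# "Core N:" temperature as a plain number (nothing if no core line is found).
# The function name and the Zabbix key below are hypothetical.
max_core_temp() {
    awk '/^Core [0-9]+:/ {
             t = $3 + 0                      # "+28.0" or "+28.0°C" is read as 28
             if (!seen || t > max) { max = t; seen = 1 }
         }
         END { if (seen) print max }'
}

# Typical use on a host with lm-sensors installed:
#   sensors | max_core_temp
# A matching line in zabbix_agentd.conf could then look like this, assuming a
# small wrapper script at the made-up path below that calls the function:
#   UserParameter=hw.cpu.maxtemp,sensors | /usr/local/bin/max_core_temp.sh
```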
Of course, there is much more sophisticated stuff out there to use for monitoring.

The script above can easily be adapted and used on other monitoring platforms such as Nagios / Munin / Cacti / Icinga, and there are plenty of paid solutions too, but for anyone who wants to develop something from scratch, just like me, I hope this article is a good short introduction.
If you know some other Linux hardware monitoring tools, please share.

Posted in Linux, Monitoring, System Administration

Fix to Nagios is currently not checking for external commands

Wednesday, August 24th, 2011

While I was deploying a new Nagios install to monitor some Windows hosts, I came across the following error in the Nagios web interface:

Sorry, but Nagios is currently not checking for external commands, so your command will not be committed! Read the documentation for information on how to enable external commands...

This error is caused by an option in the main Nagios configuration file, which on Debian is /etc/nagios3/nagios.cfg (shipped with the nagios3 packages).

The config variable causing the error is check_external_commands=0; the fix is to change it to:

check_external_commands=1

Then restart the nagios3 service (I restarted nagios-nrpe-server as well):

debian:~# /etc/init.d/nagios3 restart
...
debian:~# /etc/init.d/nagios-nrpe-server restart
...
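
The same change can be made from the shell. This is a minimal sketch, assuming the main configuration file is /etc/nagios3/nagios.cfg and that the option is written there as check_external_commands=0 at the start of a line; the function name is hypothetical.

```shell
#!/bin/sh
# enable_external_commands: switch check_external_commands from 0 to 1 in the
# Nagios main configuration file given as the first argument.
# The function name is hypothetical; on Debian the file is normally
# /etc/nagios3/nagios.cfg (an assumption, adjust it for your install).
enable_external_commands() {
    cfg=$1
    tmp=$(mktemp) || return 1
    # Rewrite only the line that sets the option to 0; keep everything else.
    sed 's/^check_external_commands=0[[:space:]]*$/check_external_commands=1/' "$cfg" > "$tmp" &&
        cat "$tmp" > "$cfg"       # overwrite in place so owner and mode are kept
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Typical use, as root:
#   enable_external_commands /etc/nagios3/nagios.cfg
#   /etc/init.d/nagios3 restart
```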

This change got rid of the error "Sorry, but Nagios is currently not checking for external commands, so your command will not be committed!"; however, immediately afterwards another kind of error appeared in the Nagios web interface when I tried to use the send Nagios commands button. The error was:

Error: Could not stat() command file '/var/lib/nagios3/rw/nagios.cmd'!

This error is due to a packaging bug that seems to affect the current deb versions of Nagios shipped with Debian 6 Squeeze stable, as well as the latest Ubuntu release, 11.04.

Thankfully there is a workaround for the problem, which I found online. To fix it I had to execute the commands:

debian:~# /etc/init.d/nagios3 stop
debian:~# dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
debian:~# dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
debian:~# /etc/init.d/nagios3 start
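
To confirm that the web interface can now reach the command file, a quick check like the following can be run as the web server user. It is only a sketch: the function name is hypothetical and the default path is taken from the error message above.

```shell
#!/bin/sh
# cmd_file_ok: succeed only if the path given as the first argument is a
# named pipe that the current user may write to. The function name is
# hypothetical; the default path is the one from the error message above.
cmd_file_ok() {
    f=${1:-/var/lib/nagios3/rw/nagios.cmd}
    [ -p "$f" ] && [ -w "$f" ]
}

# Typical use, run as the web server user (www-data on Debian):
#   cmd_file_ok && echo 'command file reachable' || echo 'command file NOT reachable'
```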

And hooray Thanks God the error is gone 😉

Posted in Linux, System Administration

How to change mail sent from in Nagios on Debian GNU/Linux 6

Wednesday, August 24th, 2011

I’ve been playing with configuring a new Nagios running on a Linux host whose aim is to monitor a few Windows servers.
The Linux host’s exim is configured to act as a relay to another SMTP server, so all email arriving at the Linux localhost on port 25 is forwarded to the remote SMTP.

The remote SMTP only accepts email from the Linux host if a real, existing username@theserverhostname.com is passed to it as the sender; otherwise it rejects the mail and it is not delivered.
As the newly configured Nagios installation is supposed to do e-mail notification, I was looking for a way to change the default sender with which Nagios sends mail. It is derived from the username under which /usr/sbin/nagios3 and /usr/sbin/nrpe are running (on Debian this gives nagios@theserverhostname.com).

Thankfully, there is a workaround. I’ve read some forum threads explaining that the sender with which Nagios sends mail can easily be changed in /etc/nagios3/commands.cfg by passing -a "From: custom_user@theserverhostname.com" to every occurrence of /usr/bin/mail -s. It is preferable to insert the -a "From: custom_user@theserverhostname.com" part before the -s "" subject option. Hence each occurrence of the mail command should be changed from:

| /usr/bin/mail -s "** $NOTIFICATIONTYPE$

To:

| /usr/bin/mail -a "From: custom_user@theserverhostname.com" -s "** $NOTIFICATIONTYPE$
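
With several notification commands in the file, the edit can be scripted. The following is a minimal sketch, assuming every notification command calls /usr/bin/mail -s exactly as shown above; the function name is hypothetical, and it is worth backing up commands.cfg first.

```shell
#!/bin/sh
# add_mail_from: insert -a "From: ADDRESS" in front of every "-s" that follows
# /usr/bin/mail in the given commands file. Lines that already carry a From
# header are left alone, so the function can be run more than once.
# The function name is hypothetical. ADDRESS must not contain "|" or "&",
# which are special in the sed expression used here.
add_mail_from() {
    cfg=$1
    from=$2
    tmp=$(mktemp) || return 1
    sed "/mail -a \"From:/!s|/usr/bin/mail -s|/usr/bin/mail -a \"From: $from\" -s|g" \
        "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Typical use, as root:
#   cp /etc/nagios3/commands.cfg /etc/nagios3/commands.cfg.bak
#   add_mail_from /etc/nagios3/commands.cfg custom_user@theserverhostname.com
```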

Now, to read its new configuration Nagios requires a restart:

debian:~# /etc/init.d/nagios3 restart ...

Now, in case of failed services or hosts down, Nagios will send its mail from the custom user custom_user@theserverhostname.com, and it can send mail properly via the remote relay SMTP host 😉

Posted in FreeBSD, Linux, System Administration

How to check Host is up with Nagios for servers with disabled ICMP (ping) protocol

Friday, July 15th, 2011

At the company where I administer some servers, Nagios is used to keep track of server status and to report instantly if connectivity problems with certain servers occur.

One of the servers with configured UP host checks is actually up, but because of heavy ICMP denial of service attacks against it, ICMP (ping) is completely disabled on it.

In Nagios this host was constantly showing as DOWN in the usual red color, so Nagios reported an issue even though all services on the host were running fine.

As this is quite annoying, I checked whether Nagios supports host checking without the ICMP ping test. It turned out that it does, through something Nagios calls Submit passive check result for host.

Enabling “Submit passive check result for this host” can be done straight from the Nagios web interface (so I didn’t even have to edit configuration files! ;).
Here is how I did it. In Nagios I had to navigate to:

Hosts -> Click on my host (hosting1), which showed in red as down

[Screenshot: the down host, shown in red, in the Nagios host list]

On the next Nagios screen I had to select Disable active checks of this host

and press the Commit button.

Next, the following text appears in the browser:

Your command request was successfully submitted to Nagios for processing.

Note: It may take a while before the command is actually processed.

Afterwards I had to click on Submit passive check result for this host and, in the Check Output field, type in:

check_tcp -p 80

[Screenshot: the Command Options dialog]

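That’s all. With active checks disabled Nagios no longer pings the host, and once the passive result is accepted the server is no longer shown in the Nagios Down host list.
Keep in mind that the Check Output field is only stored as status text. For Nagios itself to query regularly whether the web server on port 80 is up and running instead of pinging the host, the host’s check command has to be a check_tcp based one.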
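
The web form used above hands a PROCESS_HOST_CHECK_RESULT line to Nagios through its external command file, and the same line can be written from a script, for example after a real TCP test of port 80. This is a minimal sketch: the function name is hypothetical, and the plugin and command file paths in the usage comment are assumed to be the Debian defaults.

```shell
#!/bin/sh
# host_result_line: print one PROCESS_HOST_CHECK_RESULT external command.
#   $1 = host name as defined in Nagios, $2 = status (0 = UP, 1 = DOWN,
#   2 = UNREACHABLE), $3 = free text shown as the check output,
#   $4 = optional Unix timestamp (defaults to now).
# The function name is hypothetical; the command format is the Nagios one.
host_result_line() {
    ts=${4:-$(date +%s)}
    printf '[%s] PROCESS_HOST_CHECK_RESULT;%s;%s;%s\n' "$ts" "$1" "$2" "$3"
}

# Typical use from cron on the Nagios server (paths are assumptions;
# hosting1 is the host from the text above):
#   if /usr/lib/nagios/plugins/check_tcp -H hosting1 -p 80 >/dev/null; then
#       host_result_line hosting1 0 'TCP port 80 answers'
#   else
#       host_result_line hosting1 1 'TCP port 80 does not answer'
#   fi > /var/lib/nagios3/rw/nagios.cmd
```

Submitting results this way keeps the host status current without re-enabling ICMP, as long as the host accepts passive checks.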

Posted in Linux, System Administration
