Posts Tagged ‘log’

How to calculate connections from IP address with shell script and log to Zabbix graphic

Thursday, March 11th, 2021

Reading Time: 3minutes

We had to test the number of connections incoming IP sorted by its TCP / IP connection state.

For example:

TIME_WAIT, ESTABLISHED, LISTEN etc.


The reason behind is sometimes the IP address '192.168.0.1' does create more than 200 connections, a Cisco firewall gets triggered and the connection for that IP is filtered out. To be able to know in advance that this problem is upcoming. a Small userparameter script is set on the Linux servers, that does print out all connections from IP by its STATES sorted out.

 

The script is calc_total_ip_match_zabbix.sh is below:

#!/bin/bash
#  check ESTIMATED / FIN_WAIT etc. netstat output for IPs and calculate total
# UserParameter=count.connections,(/usr/local/bin/calc_total_ip_match_zabbix.sh)
CHECK_IP='192.168.0.1';
f=0; 

 

for i in $(netstat -nat | grep "$CHECK_IP" | awk '{print $6}' | sort | uniq -c | sort -n); do

echo -n "$i ";
f=$((f+i));
done;
echo
echo "Total: $f"

 

root@pcfreak:/bashscripts# ./calc_total_ip_match_zabbix.sh 
1 TIME_WAIT 2 ESTABLISHED 3 LISTEN 

Total: 6

 

root@pcfreak:/bashscripts# ./calc_total_ip_match_zabbix.sh 
2 ESTABLISHED 3 LISTEN 
Total: 5


images/zabbix-webgui-connection-check1

To make process with Zabbix it is necessery to have an Item created and a Depedent Item.

 

webguiconnection-check1

webguiconnection-check1
 

webgui-connection-check2-item

images/webguiconnection-check1

Finally create a trigger to trigger alarm if you have more than or eqaul to 100 Total overall connections.


images/zabbix-webgui-connection-check-trigger

The Zabbix userparameter script should be as this:

[root@host: ~]# cat /etc/zabbix/zabbix_agentd.d/userparameter_webgui_conn.conf
UserParameter=count.connections,(/usr/local/bin/webgui_conn_track.sh)

 

Some collleagues suggested more efficient shell script solution for suming the overall number of connections, below is less time consuming version of script, that can be used for the calculation.
 

#!/bin/bash -x
# show FIN_WAIT2 / ESTIMATED etc. and calcuate total
count=$(netstat -n | grep "192.168.0.1" | awk ' { print $6 } ' | sort -n | uniq -c | sort -nr)
total=$((${count// /+}))
echo "$count"
echo "Total:" "$total"

      2 ESTABLISHED
      1 TIME_WAIT
Total: 3

 


Below is the graph built with Zabbix showing all the fluctuations from connections from monitored IP. ebgui-check_ip_graph

 

How to enable HaProxy logging to a separate log /var/log/haproxy.log / prevent HAProxy duplicate messages to appear in /var/log/messages

Wednesday, February 19th, 2020

Reading Time: 9minutes

haproxy-logging-basics-how-to-log-to-separate-file-prevent-duplicate-messages-haproxy-haproxy-weblogo-squares
haproxy  logging can be managed in different form the most straight forward way is to directly use /dev/log either you can configure it to use some log management service as syslog or rsyslogd for that.

If you don't use rsyslog yet to install it: 

# apt install -y rsyslog

Then to activate logging via rsyslogd we can should add either to /etc/rsyslogd.conf or create a separte file and include it via /etc/rsyslogd.conf with following content:
 

Enable haproxy logging from rsyslogd


Log haproxy messages to separate log file you can use some of the usual syslog local0 to local7 locally used descriptors inside the conf (be aware that if you try to use some wrong value like local8, local9 as a logging facility you will get with empty haproxy.log, even though the permissions of /var/log/haproxy.log are readable and owned by haproxy user.

When logging to a local Syslog service, writing to a UNIX socket can be faster than targeting the TCP loopback address. Generally, on Linux systems, a UNIX socket listening for Syslog messages is available at /dev/log because this is where the syslog() function of the GNU C library is sending messages by default. To address UNIX socket in haproxy.cfg use:

log /dev/log local2 


If you want to log into separate log each of multiple running haproxy instances with different haproxy*.cfg add to /etc/rsyslog.conf lines like:

local2.* -/var/log/haproxylog2.log
local3.* -/var/log/haproxylog3.log


One important note to make here is since rsyslogd is used for haproxy logging you need to have enabled in rsyslogd imudp and have a UDP port listener on the machine.

E.g. somewhere in rsyslog.conf or via rsyslog include file from /etc/rsyslog.d/*.conf needs to have defined following lines:

$ModLoad imudp
$UDPServerRun 514


I prefer to use external /etc/rsyslog.d/20-haproxy.conf include file that is loaded and enabled rsyslogd via /etc/rsyslog.conf:

# vim /etc/rsyslog.d/20-haproxy.conf

$ModLoad imudp
$UDPServerRun 514​
local2.* -/var/log/haproxy2.log


It is also possible to produce different haproxy log output based on the severiy to differentiate between important and less important messages, to do so you'll need to rsyslog.conf something like:
 

# Creating separate log files based on the severity
local0.* /var/log/haproxy-traffic.log
local0.notice /var/log/haproxy-admin.log

 

Prevent Haproxy duplicate messages to appear in /var/log/messages

If you use local2 and some default rsyslog configuration then you will end up with the messages coming from haproxy towards local2 facility producing doubled simultaneous records to both your pre-defined /var/log/haproxy.log and /var/log/messages on Proxy servers that receive few thousands of simultanous connections per second.
This is a problem since doubling the log will produce too much data and on systems with smaller /var/ partition you will quickly run out of space + this haproxy requests logging to /var/log/messages makes the file quite unreadable for normal system events which are so important to track clearly what is happening on the server daily.

To prevent the haproxy duplicate messages you need to define somewhere in rsyslogd usually /etc/rsyslog.conf local2.none near line of facilities configured to log to file:

*.info;mail.none;authpriv.none;cron.none;local2.none     /var/log/messages

This configuration should work but is more rarely used as most people prefer to have haproxy log being written not directly to /dev/log which is used by other services such as syslogd / rsyslogd.

To use /dev/log to output logs from haproxy configuration in global section use config like:
 

global
        log /dev/log local2 debug
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

The log global directive basically says, use the log line that was set in the global section for whole config till end of file. Putting a log global directive into the defaults section is equivalent to putting it into all of the subsequent proxy sections.

Using global logging rules is the most common HAProxy setup, but you can put them directly into a frontend section instead. It can be useful to have a different logging configuration as a one-off. For example, you might want to point to a different target Syslog server, use a different logging facility, or capture different severity levels depending on the use case of the backend application. 

Insetad of using /dev/log interface that is on many distributions heavily used by systemd to store / manage and distribute logs,  many haproxy server sysadmins nowdays prefer to use rsyslogd as a default logging facility that will manage haproxy logs.
Admins prefer to use some kind of mediator service to manage log writting such as rsyslogd or syslog, the reason behind might vary but perhaps most important reason is  by using rsyslogd it is possible to write logs simultaneously locally on disk and also forward logs  to a remote Logging server  running rsyslogd service.

Logging is defined in /etc/haproxy/haproxy.cfg or the respective configuration through global section but could be also configured to do a separate logging based on each of the defined Frontend Backends or default section. 
A sample exceprt from this section looks something like:

#———————————————————————
# Global settings
#———————————————————————
global
    log         127.0.0.1 local2

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#———————————————————————
defaults
    mode                    tcp
    log                     global
    option                  tcplog
    #option                  dontlognull
    #option http-server-close
    #option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 7
    #timeout http-request    10s
    timeout queue           10m
    timeout connect         30s
    timeout client          20m
    timeout server          10m
    #timeout http-keep-alive 10s
    timeout check           30s
    maxconn                 3000

 

 

# HAProxy Monitoring Config
#———————————————————————
listen stats 192.168.0.5:8080                #Haproxy Monitoring run on port 8080
    mode http
    option httplog
    option http-server-close
    stats enable
    stats show-legends
    stats refresh 5s
    stats uri /stats                            #URL for HAProxy monitoring
    stats realm Haproxy\ Statistics
    stats auth hproxyauser:Password___          #User and Password for login to the monitoring dashboard

 

#———————————————————————
# frontend which proxys to the backends
#———————————————————————
frontend ft_DKV_PROD_WLPFO
    mode tcp
    bind 192.168.233.5:30000-31050
    option tcplog
    log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
    default_backend Default_Bakend_Name


#———————————————————————
# round robin balancing between the various backends
#———————————————————————
backend bk_DKV_PROD_WLPFO
    mode tcp
    # (0) Load Balancing Method.
    balance source
    # (4) Peer Sync: a sticky session is a session maintained by persistence
    stick-table type ip size 1m peers hapeers expire 60m
    stick on src
    # (5) Server List
    # (5.1) Backend
    server Backend_Server1 10.10.10.1 check port 18088
    server Backend_Server2 10.10.10.2 check port 18088 backup


The log directive in above config instructs HAProxy to send logs to the Syslog server listening at 127.0.0.1:514. Messages are sent with facility local2, which is one of the standard, user-defined Syslog facilities. It’s also the facility that our rsyslog configuration is expecting. You can add more than one log statement to send output to multiple Syslog servers.

Once rsyslog and haproxy logging is configured as a minumum you need to restart rsyslog (assuming that haproxy config is already properly loaded):

# systemctl restart rsyslogd.service

To make sure rsyslog reloaded successfully:

systemctl status rsyslogd.service


Restarting HAproxy

If the rsyslogd logging to 127.0.0.1 port 514 was recently added a HAProxy restart should also be run, you can do it with:
 

# /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)


Or to restart use systemctl script (if haproxy is not used in a cluster with corosync / heartbeat).

# systemctl restart haproxy.service

You can control how much information is logged by adding a Syslog level by

    log         127.0.0.1 local2 info


The accepted values are the standard syslog security level severity:

Value Severity Keyword Deprecated keywords Description Condition
0 Emergency emerg panic System is unusable A panic condition.
1 Alert alert   Action must be taken immediately A condition that should be corrected immediately, such as a corrupted system database.
2 Critical crit   Critical conditions Hard device errors.
3 Error err error Error conditions  
4 Warning warning warn Warning conditions  
5 Notice notice   Normal but significant conditions Conditions that are not error conditions, but that may require special handling.
6 Informational info   Informational messages  
7 Debug debug   Debug-level messages Messages that contain information normally of use only when debugging a program.

 

Logging only errors / timeouts / retries and errors is done with option:

Note that if the rsyslog is configured to listen on different port for some weird reason you should not forget to set the proper listen port, e.g.:
 

  log         127.0.0.1:514 local2 info

option dontlog-normal

in defaults or frontend section.

You most likely want to enable this only during certain times, such as when performing benchmarking tests.

(or log-format-sd for structured-data syslog) directive in your defaults or frontend
 

Haproxy Logging shortly explained


The type of logging you’ll see is determined by the proxy mode that you set within HAProxy. HAProxy can operate either as a Layer 4 (TCP) proxy or as Layer 7 (HTTP) proxy. TCP mode is the default. In this mode, a full-duplex connection is established between clients and servers, and no layer 7 examination will be performed. When in TCP mode, which is set by adding mode tcp, you should also add option tcplog. With this option, the log format defaults to a structure that provides useful information like Layer 4 connection details, timers, byte count and so on.

Below is example of configured logging with some explanations:

Log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq"

haproxy-logged-fields-explained
Example of Log-Format configuration as shown above outputted of haproxy config:

Log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r"

haproxy_http_log_format-explained1

To understand meaning of this abbreviations you'll have to closely read  haproxy-log-format.txt. More in depth info is to be found in HTTP Log format documentation


haproxy_logging-explained

Logging HTTP request headers

HTTP request header can be logged via:
 

 http-request capture

frontend website
    bind :80
    http-request capture req.hdr(Host) len 10
    http-request capture req.hdr(User-Agent) len 100
    default_backend webservers


The log will show headers between curly braces and separated by pipe symbols. Here you can see the Host and User-Agent headers for a request:

192.168.150.1:57190 [20/Dec/2018:22:20:00.899] website~ webservers/server1 0/0/1/0/1 200 462 – – —- 1/1/0/0/0 0/0 {mywebsite.com|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 } "GET / HTTP/1.1"

 

Haproxy Stats Monitoring Web interface


Haproxy is having a simplistic stats interface which if enabled produces some general useful information like in above screenshot, through which
you can get a very basic in browser statistics and track potential issues with the proxied traffic for all configured backends / frontends incoming outgoing
network packets configured nodes
 experienced downtimes etc.

haproxy-statistics-report-picture

The basic configuration to make the stats interface accessible would be like pointed in above config for example to enable network listener on address
 

https://192.168.0.5:8080/stats


with hproxyuser / password config would be:

# HAProxy Monitoring Config
#———————————————————————
listen stats 192.168.0.5:8080                #Haproxy Monitoring run on port 8080
    mode http
    option httplog
    option http-server-close
    stats enable
    stats show-legends
    stats refresh 5s
    stats uri /stats                            #URL for HAProxy monitoring
    stats realm Haproxy\ Statistics
    stats auth hproxyauser:Password___          #User and Password for login to the monitoring dashboard

 

 

Sessions states and disconnect errors on new application setup

Both TCP and HTTP logs include a termination state code that tells you the way in which the TCP or HTTP session ended. It’s a two-character code. The first character reports the first event that caused the session to terminate, while the second reports the TCP or HTTP session state when it was closed.

Here are some essential termination codes to track in for in the log:
 

Here are some termination code examples most commonly to see on TCP connection establishment errors:

Two-character code    Meaning
—    Normal termination on both sides.
cD    The client did not send nor acknowledge any data and eventually timeout client expired.
SC    The server explicitly refused the TCP connection.
PC    The proxy refused to establish a connection to the server because the process’ socket limit was reached while attempting to connect.


To get all non-properly exited codes the easiest way is to just grep for anything that is different from a termination code –, like that:

tail -f /var/log/haproxy.log | grep -v ' — '


This should output in real time every TCP connection that is exiting improperly.

There’s a wide variety of reasons a connection may have been closed. Detailed information about all possible termination codes can be found in the HAProxy documentation.
To get better understanding a very useful reading to haproxy Debug errors with  is in haproxy-logging.txt in that small file are collected all the cryptic error messages codes you might find in your logs when you're first time configuring the Haproxy frontend / backend and the backend application behind.

Another useful analyze tool which can be used to analyze Layer 7 HTTP traffic is halog for more on it just google around.

Why du and df reporting different on a filesystem / How to fix inconsistency between used space on FS and disk showing full strangeness

Wednesday, July 24th, 2019

Reading Time: 6minutes

linux-why-du-and-df-shows-different-result-inconsincy-explained-filesystem-full-oddity

If you're a sysadmin on a large server environment such as a couple of hundred of Virtual Machines running Linux OS on either physical host or OpenXen / VmWare hosted guest Virtual Machine, you might end up sometimes at an odd case where some mounted partition mount point reportsits file use different when checked with
df
cmd than when checked with du command, like for example:
 

root@sqlserver:~# df -hT /var/lib/mysql
Filesystem   Type  Size Used Avail Use% Mounted On
/dev/sdb5      ext4    19G  3,4G    14G  20% /var/lib/mysql

Here the '-T' argument is used to show us the filesystem.

root@sqlserver:~# du -hsc /var/lib/mysql
0K    /var/lib/mysql/
0K    total

 

1. Simple debug on what might be the root cause for df / du inconsistency reporting

 

Of course the basic thing to do when in that weird situation is to be totally shocked how this is possible and to investigate a bit what is the biggest first level sub-directories that eat up the space on the mounted location, with du:

 

# du -hkx –max-depth=1 /var/lib/mysql/|uniq|sort -n
4       /var/lib/mysql/test
8       /var/lib/mysql/ezmlm
8       /var/lib/mysql/micropcfreak
8       /var/lib/mysql/performance_schema
12      /var/lib/mysql/mysqltmp
24      /var/lib/mysql/speedtest
64      /var/lib/mysql/yourls
144     /var/lib/mysql/narf
320     /var/lib/mysql/webchat_plus
424     /var/lib/mysql/goodfaithair
528     /var/lib/mysql/moonman
648     /var/lib/mysql/daniel
852     /var/lib/mysql/lessn
1292    /var/lib/mysql/gallery

The given output is in Kilobytes so it is a little bit hard to read, if you're used to Mbytes instead, do

 

 # du -hmx –max-depth=1 /var/lib/mysql/|uniq|sort -n|less

 

I've also investigated on the complete /var directory contents sorted by size with:

 

 # du -akx ./ | sort -n
5152564    ./cache/rsnapshot/hourly.2/localhost
5255788    ./cache/rsnapshot/hourly.2
5287912    ./cache/rsnapshot
7192152    ./cache


Even after finding out the bottleneck dirs and trying to clear up a bit, continued facing that inconsistently shown in two commands and if you're likely to be stunned like me and try … to move some files to a different filesystem to free up space or assigned inodes with a hope that shown inconsitency output will be fixed as it might be caused  due to some kernel / FS caching ?? and this will eventually make the mounted FS to refresh …

But unfortunately, if you try it you'll figure out clearing up a couple of Megas or Gigas will make no difference in cmd output.

In my exact case /var/lib/mysql is a separate mounted ext4 filesystem, however same issue was present also on a Network Filesystem (NFS) and thus, my first thought that this is caused by a network failure problem or NFS bug turned to be wrong.

After further short investigation on the inodes on the Filesystem, it was clear enough inodes are available:
 

# df -i /var/lib/mysql
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
/dev/sdb5      1221600  2562 1219038   1% /var/lib/mysql

 

So the filled inodes count assumed issue also has been rejected.
P.S. (if you're not well familiar with them read manual, i.e. – man 7 inode).
 

– Remounting the mounted filesystem

To make sure the filesystem shown inconsistency between du and df is not due to some hanging network mount or bug, first logical thing I did is to remount the filesytem showing different in size, in my case this was done with:
 

# mount -o remount,rw -t ext4 /var/lib/mysql

For machines with NFS remote mounted storage locations, used:

# mount -o remount,rw -t nfs /var/www


FS remount did not solved it so I continued to ponder what oddity and of course I thought of a workaround (in case if this issues are caused by kernel bug or OS lib issue) reboot might be the solution, however unfortunately restarting the VMs was not a wanted easy to do solution, thus I continued investigating what is wrong …

Next check of course was to check, what kind of network connections are opened to the affected hosts with:
 

# netstat -tupanl


Did not found anything that might point me to the reported different Megabytes issue, so next step was to check what is the situation with currently opened files by running processes on the weird df / du reported systems with lsof, and boom there I observed oddity such as multiple files

 

# lsof -nP | grep '(deleted)'

COMMAND   PID   USER   FD   TYPE DEVICE    SIZE NLINK  NODE NAME
mysqld   2588  mysql    4u   REG 253,17      52     0  1495 /var/lib/mysql/tmp/ibY0cXCd (deleted)
mysqld   2588  mysql    5u   REG 253,17    1048     0  1496 /var/lib/mysql/tmp/ibOrELhG (deleted)
mysqld   2588  mysql    6u   REG 253,17       777884290     0  1497 /var/lib/mysql/tmp/ibmDFAW8 (deleted)
mysqld   2588  mysql    7u   REG 253,17       123667875     0 11387 /var/lib/mysql/tmp/ib2CSACB (deleted)
mysqld   2588  mysql   11u   REG 253,17       123852406     0 11388 /var/lib/mysql/tmp/ibQpoZ94 (deleted)

 

Notice that There were plenty of '(deleted)' STATE files shown in memory an overall of 438:

 

# lsof -nP | grep '(deleted)' |wc -l
438


As I've learned a bit online about the problem, I found it is also possible to find deleted unlinked files only without any greps (to list all deleted files in memory files with lsof args only):

 

# lsof +L1|less


The SIZE field (fourth column)  shows a number of files that are really hard in size and that are kept in open on filesystem and in memory, totally messing up with the filesystem. In my case this is temp files created by MYSQLD daemon but depending on the server provided service this might be apache's www-data, some custom perl / bash script executed via a cron job, stalled rsync jobs etc.
 

2. Check all the list open files with the mysql / root user as part of the the server filesystem inconsistency debugging with:

 

– Grep opened files on server by user

# lsof |grep mysql
mysqld    1312                       mysql  cwd       DIR               8,21       4096          2 /var/lib/mysql
mysqld    1312                       mysql  rtd       DIR                8,1       4096          2 /
mysqld    1312                       mysql  txt       REG                8,1   20336792   23805048 /usr/sbin/mysqld
mysqld    1312                       mysql  mem       REG               8,21      24576         20 /var/lib/mysql/tc.log
mysqld    1312                       mysql  DEL       REG               0,16                 29467 /[aio]
mysqld    1312                       mysql  mem       REG                8,1      55792   14886933 /lib/x86_64-linux-gnu/libnss_files-2.28.so

 

# lsof | grep root
COMMAND    PID   TID TASKCMD          USER   FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
systemd      1                        root  cwd       DIR                8,1       4096          2 /
systemd      1                        root  rtd       DIR                8,1       4096          2 /
systemd      1                        root  txt       REG                8,1    1489208   14928891 /lib/systemd/systemd
systemd      1                        root  mem       REG                8,1    1579448   14886924 /lib/x86_64-linux-gnu/libm-2.28.so

Other command that helped to track the discrepancy between df and du different file usage on FS is:
 

# du -hxa  / | egrep '^[[:digit:]]{1,1}G[[:space:]]*'
 

 

3. Fixing large files kept in memory filesystem problem


What is the real reason for ending up with this file handlers opened by running backgrounded programs on the Linux OS?
It could be multiple  but most likely it is due to exceeded server / client interactions or breaking up RAM or HDD drive with writing plenty of logs on the FS without ending keeping space occupied or Programming library bugs used by hanged service leaving the FH opened on storage.

What is the solution to file system files left in memory problem?

The best solution is to first fix custom script or hanged service and then if possible to simply restart the server to make the kernel / services reload or if this is not possible just restart the problem creation processes.

Once the process is identified like in my case this was MySQL on systemd enabled newer OS distros, just do:

 

 

# systemctl restart mysqld.service


or on older init.d system V ones:

# /etc/init.d/service restart


For custom hanged scripts being listed in ps axuwef you can grep the pid and do a kill -HUP (if the script is written in a good way to recognize -HUP and restart the sub-running process properly – BE EXTRA CAREFUL IF YOU'RE RESTARTING BROKEN SCRIPTS as this might cause your running service disruptions …).

# pgrep -l script.sh
7977 script.sh


# kill -HUP PID

 

Now finally this should either mitigate or at best case completely solve the reported disagreement between df and du, after which the calculated / reported disk space should be back to normal and show up approximately the same (note that size changes a bit as mysql service is writting data) constantly extending the size between the two checks.

 

# df -hk /var/lib/mysql; du -hskc /var/lib/mysql
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
/dev/sdb5        19097172 3472744 14631296  20% /var/lib/mysql
3427772    /var/lib/mysql
3427772    total

 

What we learned?

What I've explained in this article is why and how it comes that 'zoombie' files reside on a filesystem
appearing to be eating disk space on a mounted local or network partition, giving strange inconsistent
reports, leading to system service disruptions and impossibility to have correctly shown information on used
disk space on mounted drive.

I went through with some standard logic on debugging service / filesystem / inode issues up explainat, that led me to the finding about deleted files being kept in filesystem and producing the filesystem strange sized / showing not correct / filled even after it was extended with tune2fs and was supposed to have extra 50GBs.

Finally it was explained shortly how to HUP / restart hanging script / service to fix it.

Some few good readings that helped to fix the issue:

What to do when du and df report different usage is here
df in linux not showing correct free space after file removal is here
Why do “df” and “du” commands show different disk usage?
 

xorg on Toshiba Satellite L40 14B with Intel GM965 video hangs up after boot and the worst fix ever / How to reinstall Ubuntu by keeping the old personal data and programs

Wednesday, April 27th, 2011

Reading Time: 4minutes

black screen ubuntu troubles

I have updated Ubuntu version 9.04 (Jaunty) to 9.10 and followed the my previous post update ubuntu from 9.04 to Latest Ubuntu

I expected that a step by step upgrade from a release to release will work like a charm and though it does on many notebooks it doesn't on Toshiba Satellite L40

The update itself went fine, whether I used the update-manager -d and followed the above pointed tutorial, however after a system restart the PC failed to boot the X server properly, a completely blank screen with blinking cursor appeared and that was all.

I restarted the system into the 2.6.35-28-generic kernel rescue-mode recovery kernel in order to be able to enter into physical console.

Logically the first thing I did is to check /var/log/messages and /var/log/Xorg.0.log but I couldn't find nothing unusual or wrong there.

I suspected something might be wrong with /etc/X11/xorg.conf so I deleted it:

ubuntu:~# rm -f /etc/X11/xorg.conf

and attempted to re-create the xorg.conf X configuration with command:

ubuntu:~# dpkg-reconfigure xserver-xorg

This command was reported to be the usual way to reconfigure the X server settings from console, but in my case (for unknown reasons) it did nothing.

Next the command which was able to re-generate the xorg.conf file was:

ubuntu:~# X -configure

The command generates a xorg.conf sample file in /root/xorg.conf.* so I used the conf to put it in /etc/X11/xorg.conf X's default location and restarted in hope that this would fix the non-booting issue.

Very sadly again the black screen of death appeared on the notebook toshiba screen.
I further thought of completely wipe out the xorg.conf in hope that at least it might boot without the conf file but this worked out neither.

I attempted to run the Xserver with a xorg.conf configured to work with vesa as it's well known vesa X server driver is supposed to work on 99% of the video cards, as almost all of them nowdays are compatible with the vesa standard, but guess what in my case vesa worked not!

The only version of X I can boot in was the failsafe X screen mode which is available through the grub's boot menu recovery mode.

Further on I decided to try few xorg.conf which I found online and were reported to work fine with Intel GM965 internal video , and yes this was also unsucessful.

Some of my other futile attempts were: to re-install the xorg server with apt-get, reinstall the xserver-xorg-video-intel driver e.g.:

ubuntu:~# apt-get install --reinstall xserver-xorg xserver-xorg-video-intel

As nothing worked out I was completely pissed off and decided to take an alternative approach which will take a lot of time but at least will probably be succesful, I decided to completely re-install the Ubuntu from a CD after backing up the /home directory and making a list of available packages on the system, so I can further easily run a tiny bash one-liner script to install all the packages which were previously existing on the laptop before the re-install:

Here is how I did it:

First I archived the /home directory:

ubuntu:/# tar -czvf home.tar.gz home/
....

For 12GB of data with some few thousands of files archiving it took about 40 minutes.

The tar spit archive became like 9GB and I hence used sftp to upload it to a remote FTP server as I was missing a flash drive or an external HDD where I can place the just archived data.

Uploading with sftp can be achieved with a command similar to:

sftp user@yourhost.com
Password:
Connected to yourhost.com.
sftp> put home.tar.gz

As a next step to backup in a file the list of all current installed packages, before I can further proceed to boot-up with the Ubuntu Maverich 10.10 CD and prooceed with the fresh install I used command:

for i in $(dpkg -l| awk '{ print $2 }'); do
echo $i; done >> my_current_ubuntu_packages.txt

Once again I used sftp as in above example to upload my_current_update_packages.txt file to my FTP host.

After backing up all the stuff necessery, I restarted the system and booted from the CD-rom with Ubuntu.
The Ubuntu installation as usual is more than a piece of cake and even if you don't have a brain you can succeed with it, so I wouldn't comment on it 😉

Right after the installation I used the sftp client once again to fetch the home.tar.gz and my_current_ubuntu_packages.txt

I placed the home.tar.gz in /home/ and untarred it inside the fresh /home dir:

ubuntu:/home# tar -zxvf home.tar.gz

Eventually the old home directory was located in /home/home so thereon I used Midnight Commander ( the good old mc text file explorer and manager ) to restore the important user files to their respective places.

As a last step I used the my_current_ubuntu_packages.txt in combination with a tiny shell script to install all the listed packages inside the file with command:

ubuntu:~# for i in $(cat my_current_ubuntu_packagespackages.txt); do
apt-get install --yes $i; sleep 1;
done

You will have to stay in front of the computer and manually answer a ncurses interface questions concerning some packages configuration and to be honest this is really annoying and time consuming.

Summing up the overall time I spend with this stupid Toshiba Satellite L40 with the shitty Intel GM965 was 4 days, where each day I tried numerous ways to fix up the X and did my best to get through the blank screen xserver non-bootable issue, without a complete re-install of the old Ubuntu system.
This is a lesson for me that if I stumble such a shitty issues I will straight proceed to the re-install option and not loose my time with non-sense fixes which would never work.

Hope the article might be helpful to somebody else who experience some problems with Linux similar to mine.

After all at least the Ubuntu Maverick 10.10 is really good looking in general from a design perspective.
What really striked me was the placement of the close, minimize and maximize window buttons , it seems in newer Ubuntus the ubuntu guys decided to place the buttons on the left, here is a screenshot:

Left button positioning of navigation Buttons in Ubuntu 10.10

I believe the solution I explain, though very radical and slow is a solution that would always work and hence worthy 😉
Let me hear from you if the article was helpful.

How to disable tidy HTML corrector and validator to output error and warning messages

Sunday, March 18th, 2012

Reading Time: 2minutes

I've noticed in /var/log/apache2/error.log on one of the Debian servers I manage a lot of warnings and errors produced by tidy HTML syntax checker and reformatter program.

There were actually quite plenty frequently appearing messages in the the log like:

...
To learn more about HTML Tidy see http://tidy.sourceforge.net
Please fill bug reports and queries using the "tracker" on the Tidy web site.
Additionally, questions can be sent to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: plain text isn't allowed in <head> elements
line 1 column 1 - Info: <head> previously mentioned
line 1 column 1 - Warning: inserting implicit <body>
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML 3.2
4 warnings, 0 errors were found!
...

I did a quick investigation on where from this messages are logged in error.log, and discovered few .php scripts in one of the websites containing the tidy string.
I used Linux find + grep cmds find in all php files the "tidy "string, like so:

server:~# find . -iname '*.php'-exec grep -rli 'tidy' '{}' ;
find . -iname '*.php' -exec grep -rli 'tidy' '{}' ; ./new_design/modules/index.mod.php
./modules/index.mod.php
./modules/index_1.mod.php
./modules/index1.mod.php

Opening the files, with vim to check about how tidy is invoked, revealed tidy calls like:

exec('/usr/bin/tidy -e -ashtml -utf8 '.$tmp_name,$rett);

As you see the PHP programmers who wrote this website, made a bigtidy mess. Instead of using php5's tidy module, they hard coded tidy external command to be invoked via php's exec(); external tidy command invocation.
This is extremely bad practice, since it spawns the command via a pseudo limited apache shell.
I've notified about the issue, but I don't know when, the external tidy calls will be rewritten.

Until the external tidy invocations are rewritten to use the php tidy module, I decided to at least remove the tidy warnings and errors output.

To remove the warning and error messages I've changed:

exec('/usr/bin/tidy -e -ashtml -utf8 '.$tmp_name,$rett);

exec('/usr/bin/tidy --show-warnings no --show-errors no -q -e -ashtml -utf8 '.$tmp_name,$rett);

The extra switches meaning is like so:

q – instructs tidy to produce quiet output
-e – show only errors and warnings
–show warnings no && –show errors no, completely disable warnings and error output

Onwards tidy no longer logs junk messages in error.log Not logging all this useless warnings and errors has positive effect on overall server performance especially, when the scripts, running /usr/bin/tidy are called as frequently as 1000 times per sec. or more

Check linux install date / How do I find out how long a Linux server OS was installed?

Wednesday, March 30th, 2016

Reading Time: 2minutes

linux-check-install-date-howto-commands-on-debian-and-fedora-tux_the_linux_penguin_by_hello

To find out the Linux install date, there is no one single solution according to the Linux distribution type and version, there are some common ways to get the Linux OS install age.
Perhaps the most popular way to get the OS installation date and time is to check out when the root filesystem ( / ) was created, this can be done with tune2fs command

 

server:~# tune2fs -l /dev/sda1 | grep 'Filesystem created:'
Filesystem created:       Thu Sep  6 21:44:22 2012

 

server:~# ls -alct /|tail -1|awk '{print $6, $7, $8}'
sep 6 2012

 

root home directory is created at install time
 

 

server:~# ls -alct /root

 

root@server:~# ls -lAhF /etc/hostname
-rw-r–r– 1 root root 8 sep  6  2012 /etc/hostname

 

For Debian / Ubuntu and other deb based distributions the /var/log/installer directory is being created during OS install, so on Debian the best way to check the Linux OS creation date is with:
 

root@server:~# ls -ld /var/log/installer
drwxr-xr-x 3 root root 4096 sep  6  2012 /var/log/installer/
root@server:~# ls -ld /lost+found
drwx—— 2 root root 16384 sep  6  2012 /lost+found/

 

On Red Hat / Fedora / CentOS, redhat based Linuces , you can use:

 

rpm -qi basesystem | grep "Install Date"

 

basesystem is the package containing basic Linux binaries many of which should not change, however in some cases if there are some security updates package might change so it is also good to check the root filesystem creation time and compare whether these two match.

Howto Fix “sysstat Cannot open /var/log/sysstat/sa no such file or directory” on Debian / Ubuntu Linux

Monday, February 15th, 2016

Reading Time: 2minutes

sysstast-no-such-file-or-directory-fix-Debian-Ubuntu-Linux-howto
I really love sysstat and as a console maniac I tend to install it on every server however by default there is some <b>sysstat</b> tuning once installed to make it work, for those unfamiliar with <i>sysstat</i> I warmly recommend to check, it here is in short the package description:<br /><br />
 

server:~# apt-cache show sysstat|grep -i desc -A 15
Description: system performance tools for Linux
 The sysstat package contains the following system performance tools:
  – sar: collects and reports system activity information;
  – iostat: reports CPU utilization and disk I/O statistics;
  – mpstat: reports global and per-processor statistics;
  – pidstat: reports statistics for Linux tasks (processes);
  – sadf: displays data collected by sar in various formats;
  – nfsiostat: reports I/O statistics for network filesystems;
  – cifsiostat: reports I/O statistics for CIFS filesystems.
 .
 The statistics reported by sar deal with I/O transfer rates,
 paging activity, process-related activities, interrupts,
 network activity, memory and swap space utilization, CPU
 utilization, kernel activities and TTY statistics, among
 others. Both UP and SMP machines are fully supported.
Homepage: http://pagesperso-orange.fr/sebastien.godard/

 

If you happen to install sysstat on a Debian / Ubuntu server with:

server:~# apt-get install –yes sysstat


, and you try to get some statistics with sar command but you get some ugly error output from:

 

server:~# sar Cannot open /var/log/sysstat/sa20: No such file or directory


And you wonder how to resolve it and to be able to have the server log in text databases periodically the nice sar stats load avarages – %idle, %iowait, %system, %nice, %user, then to FIX that Cannot open /var/log/sysstat/sa20: No such file or directory

You need to:

server:~# vim /etc/default/sysstat


By Default value you will find out sysstat stats it is disabled, e.g.:

ENABLED="false"

Switch the value to "true"

ENABLED="true"


Then restart sysstat init script with:

server:~# /etc/init.d/sysstat restart

However for those who prefer to do things from menu Ncurses interfaces and are not familiar with Vi Improved, the easiest way is to run dpkg reconfigure of the sysstat:

server:~# dpkg –reconfigure


sysstat-reconfigure-on-gnu-linux

 

root@server:/# sar
Linux 2.6.32-5-amd64 (pcfreak) 15.02.2016 _x86_64_ (2 CPU)

0,00,01 CPU %user %nice %system %iowait %steal %idle
0,15,01 all 24,32 0,54 3,10 0,62 0,00 71,42
1,15,01 all 18,69 0,53 2,10 0,48 0,00 78,20
10,05,01 all 22,13 0,54 2,81 0,51 0,00 74,01
10,15,01 all 17,14 0,53 2,44 0,40 0,00 79,49
10,25,01 all 24,03 0,63 2,93 0,45 0,00 71,97
10,35,01 all 18,88 0,54 2,44 1,08 0,00 77,07
10,45,01 all 25,60 0,54 3,33 0,74 0,00 69,79
10,55,01 all 36,78 0,78 4,44 0,89 0,00 57,10
16,05,01 all 27,10 0,54 3,43 1,14 0,00 67,79


Well that's it now sysstat error resolved, text reporting stats data works again, Hooray! 🙂

No space left on device with free disk space / Why no space left on device while there is plenty of disk space on drive – Running out of Inodes

Tuesday, November 17th, 2015

Reading Time: 5minutes

no_space_left-on-device-while-there-is-disk-space-running-out-of-file-inodes-unix_linux_file_system_diagram.gif

 

On one of the servers, I'm administrating the websites started showing some Mysql database table corrup errors like:
 

 

Table './database_name/site_news_list_com' is marked as crashed and last (automatic?) repair failed

The server is using Oracle MySQL server community stable edition on Debian GNU / Linux 6.0, so I first thought during work the server crashed either due to some bug issue in MySQL or it crashed due to some PHP cron job that did something messy. Thus to solve the crashed tables, tried using mysqlcheck tool which helped pretty fine, at many times whether there were database / table corruptions. I've run the following set of mysqlcheck commands with root (superuser) in a bash shell after logging in through SSH:

:

server:~# /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–check –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log
server:~# /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf –analyze –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log
server:~# /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–auto-repair –optimize –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log
server:~# /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–optimize –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log


In order for above commands to work, I've created the /root/.my.cnf containing my root (mysql CLI) mysql username and password, e.g. file has content like below:

 

[client]
user=root
password=MySecretPassword8821238

 

Btw a good note here is its generally a good idea (if you want to have consistent mysql databases) to automatically execute via a cron job 2 times a month, I've in root cronjob the following:

 

crontab -u root -l |grep -i mysqlcheck
04 06 5,10,15,20,25,1 * * /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–check –all-databases –silent -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log 07 06 5,10,15,20,25,1 * * /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf –analyze –all-databases –silent -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log 12 06 5,10,15,20,25,1 * * /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–auto-repair –optimize –all-databases –silent -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log 17 06 5,10,15,20,25,1 * * /usr/bin/mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–optimize –all-databases –silent -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log


Strangely I got a lot of errors that some .MYI / .MYD .frm temp files, necessery for the mysql tables recovery can't be written inside /home/mysql/database_name

That was pretty weird and I thought there might be some issues with permissions, causing the inability to write, due to some bug or something so I went straight and checked /home/mysql/database_name permissions, e.g.::

 

server:/home/mysql/database_name# ls -ld soccerfame
drwx—— 2 mysql mysql 36864 Nov 17 12:00 soccerfame
server:/home/mysql/database_name# ls -al1|head -n 10
total 1979012
drwx—— 2 mysql mysql 36864 Nov 17 12:00 .
drwx—— 36 mysql mysql 4096 Nov 17 11:12 ..
-rw-rw—- 1 mysql mysql 8712 Nov 17 10:26 1_campaigns_diez.frm
-rw-rw—- 1 mysql mysql 14672 Jul 8 18:57 1_campaigns_diez.MYD
-rw-rw—- 1 mysql mysql 1024 Nov 17 11:38 1_campaigns_diez.MYI
-rw-rw—- 1 mysql mysql 8938 Nov 17 10:26 1_campaigns.frm
-rw-rw—- 1 mysql mysql 8738 Nov 17 10:26 1_campaigns_logs.frm
-rw-rw—- 1 mysql mysql 883404 Nov 16 22:01 1_campaigns_logs.MYD
-rw-rw—- 1 mysql mysql 330752 Nov 17 11:38 1_campaigns_logs.MYI


As seen from above output, all was perfect with permissions, so it should have been something else, so I decided to try to create a random file with touch command inside /home/mysql/database_name directory:

 

touch /home/mysql/database_name/somefile-to-test-writtability.txt touch: cannot touch ‘/scr1/data/somefile-to-test-writtability.txt‘: No space left on device


Then logically I thought the /home/mysql/ mounted ext4 partition got filled, because of crashed SQL database or a bug thus, checked with disk free command df whether there is enough space on server:

server:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md1 20G 7.6G 11G 42% /
udev 10M 0 10M 0% /dev
tmpfs 13G 1.3G 12G 10% /run
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md2 256G 134G 110G 55% /home

Well that's weird? Obviously only 55% of available disk space is used and available 134G which was more than enough so I got totally puzzled why, files can't be written.

Then very logically, I thought it might be that /home directory has remounted asread only, because the SSD memory disk on server is failing and checked for errors in dmesg, i.e.:

 

server:~# dmesg|grep -i error


Also checked how exactly was partition mounted, to check whether it is (RO) read-only:

 

server:~# mount -l|grep -i /home
/dev/md2 on /home type ext4 (rw,relatime,discard,data=ordered)


Now everything become even more weirder, as obviously the disk continued to be claiming no space left on device, while in reality there was plenty of disk space.

Then after running a quick research on the internet for the no space left on device with free disk space, I've come across this great superuser.com thread which let me realize the partition run out of inodes and that's why no new file inodes could be assigned and therefore, the linux kernel is refusing to write the file on ext4 partition.

For those who haven't heard of Linux Partition Inodes here is link to Wikipedia and a quick quote:

 

In a Unix-style file system, the inode is a data structure used to represent a filesystem object, which can be one of various things including a file or a directory. Each inode stores the attributes and disk block location(s) of the filesystem object's data.[1] Filesystem object attributes may include manipulation metadata (e.g. change,[2] access, modify time), as well as owner and permission data (e.g. group-id, user-id, permissions).[3]
Directories are lists of names assigned to inodes. The directory contains an entry for itself, its parent, and each of its children.


Once I understood it is the inodes, I checked how many of them are occupied with cmd:

 

server:~# df -i /home
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md2 17006592 17006592 0 100% /home


You see, there were 0 (zero) free file inodes on server and that was the reason for no space left on device while there was actually free disk space

To clean up (free) some inodes on partition, first thing I did is to delete all old logs which were inside /home and files I positively know not to be necessery, then to find which directories allocating most innodes used:

 

server:~# find . -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n


If you're on a regular old fashined IDE Hard Drive and not SSD or you have too much files inside this command will take really long …:

Therefore a better solution might be to frist:

a) Try to find root folders with large inodes count:

for i in /home/*; do echo $i; find $i |wc -l; done
Try to find specific folders:


You should get output like:

 

/home/new_website
606692
/home/common
73
/home/pcfreak
5661
/home/hipo
33
/home/blog
13570
/home/log
123
/home/lost+found
1

b) Then once you know the directory allocating most inodes, run the command again to see the sub-directories with most files (eating) partition innodes:

 

for i in /home/webservice/*; do echo $i; find $i |wc -l; done

 

One usual large folder which could free you some nodes is the linux source headers, but in my case it was simply a lot of tiny old logs being logged on the system for few years in the past without cleaning:

After deleting the log dirs and cache folder in my case /home/new_website/{log,cache}:

server:~# rm -rf /home/new_website/log/*
server:~# rm -rf /home/new_website/cache/*

 

 

a) Then, stopping Apache webserver to check prevent Apache to use MySQl databases while running database repair and restaring MySQL:
 

server:~# /etc/init.d/apache2 stop Restarting MySQL server
..
server:~# /etc/init.d/mysql restart
..


b) And re-issuing MySQL Check / Repair / Optimize database commands:
 

 

mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–check –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log

mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf –analyze –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log

mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–auto-repair –optimize –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log

mysqlcheck –defaults-extra-file=/etc/mysql/debian.cnf \–optimize –all-databases -u root -p`grep -i password /root/.my.cnf |sed -e 's#password=##g'`>> /var/log/cronwork.log

c) And finally starting the Apache Webserver again:
 

server:~# /etc/init.d/apache2 start


Some innodse got freed up:
 

server:~# df -i /home Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md2 17006592 16797196 209396 99% /home


And hooray by God's Grace and with help of prayers of The most Holy Theotokos (Virgin) Mary, websites started again !

Fix “Secure Connection Failed” – An error occured SSL received a record that exceeded the maximum permissible length howto

Monday, September 14th, 2015

Reading Time: 2minutes

secure-connection-failed-an-error-occured-during-connection-ssl-received-a-record-that-exceeds-the-maximum-permissible-length-fix-howto
When I was trying to establish a new Internal Business SSL certificate on one of the 6 months planned SPLIT projects (e.g. duplicate a range systems environment to another one), I've stumbled a very odd SSL issue. Once I've setup all the virtualhost SSL configurations properly (identical SSL configuration directives and Apache Webserver version to another host and testing in a browser I was getting the following error:
 

Secure Connection Failed

An error occurred during a connection to 10.253.39.93.

SSL received a record that exceeded the maximum permissible length.

(Error code: ssl_error_rx_record_too_long)


Below is a screenshot:

https://pc-freak.net/images/secure-connection-failed-an-error-occured-during-connection-ssl-received-a-record-that-exceeds-the-maximum-permissible-length.png

The page you are trying to view can not be shown because the authenticity of the received data could not be verified. Please contact the web site owners to inform them of this problem. Alternatively, use the command found in the help menu to report this broken site.

The first logical thing to do was to check the error.log but there was no any errors there that point me to anything meaningful, besides that the queries I was making to the Domain doesn't show off as requests neither in Apache access.log nor in error.log so this was puzzling.
I thought I might have messed up something during Key file / CSR generation time so I revoked old certificate and reissued it.

 

$ openssl x509 -text -in test-pegasusgas-eon.intranet.eon-vertrieb.com.crt |less ertificate: Data: Version: 3 (0x2) Serial Number:

Shows that all is fine with certificate Then when trying to test remote certificate with SSL command:

 

openssl s_client -CApath test-pegasusgas-eon.intranet.eon-vertrieb.com.crt -connect test-pegasusgas-eon.intranet.eon-vertrieb.com:443


: There was an error After plenty of research in Google I come to conclusion something is either wrong with Listen httpd.conf directive or NameVirtualHost is binded to port 80 or some other port different from 443, however surprisingly I did not used the NameVirtualHost at all in my apache config. After a lot of pondering I finally spot it. The whole certificate isseus were caused by:

< – Less than sign

which I missaw and forget to clean up from template during IP paste (obtained from /sbin/ifconfig |grep -i xx.xx.xx.xx). So finally in order to fix the SSL error I had to just delete <, e.g.:
 

<VirtualHost <10.253.39.35:443>

had to become:

 

<Virtualhost 10.253.39.35:443>

Such a minor thing took me 3 hours of pondering to resolve and thanksfully it is finally fixed! Then of course had to restart Apache to make fixed Vhost settings working:
 

# apachectl stop; sleep 2; apachectl start

So now the SSL works again, thanks God!

Resolving “nf_conntrack: table full, dropping packet.” flood message in dmesg Linux kernel log

Wednesday, March 28th, 2012

Reading Time: 5minutes

nf_conntrack_table_full_dropping_packet
On many busy servers, you might encounter in /var/log/syslog or dmesg kernel log messages like

nf_conntrack: table full, dropping packet

to appear repeatingly:

[1737157.057528] nf_conntrack: table full, dropping packet.
[1737157.160357] nf_conntrack: table full, dropping packet.
[1737157.260534] nf_conntrack: table full, dropping packet.
[1737157.361837] nf_conntrack: table full, dropping packet.
[1737157.462305] nf_conntrack: table full, dropping packet.
[1737157.564270] nf_conntrack: table full, dropping packet.
[1737157.666836] nf_conntrack: table full, dropping packet.
[1737157.767348] nf_conntrack: table full, dropping packet.
[1737157.868338] nf_conntrack: table full, dropping packet.
[1737157.969828] nf_conntrack: table full, dropping packet.
[1737157.969928] nf_conntrack: table full, dropping packet
[1737157.989828] nf_conntrack: table full, dropping packet
[1737162.214084] __ratelimit: 83 callbacks suppressed

There are two type of servers, I've encountered this message on:

1. Xen OpenVZ / VPS (Virtual Private Servers)
2. ISPs – Internet Providers with heavy traffic NAT network routers
 

I. What is the meaning of nf_conntrack: table full dropping packet error message

In short, this message is received because the nf_conntrack kernel maximum number assigned value gets reached.
The common reason for that is a heavy traffic passing by the server or very often a DoS or DDoS(Distributed Denial of Service) attack. Sometimes encountering the err is a result of a bad server planning (incorrect data about expected traffic load by a company/companeis) or simply a sys admin error…

– Checking the current maximum nf_conntrack value assigned on host:

linux:~# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
65536

– Alternative way to check the current kernel values for nf_conntrack is through:

linux:~# /sbin/sysctl -a|grep -i nf_conntrack_max
error: permission denied on key 'net.ipv4.route.flush'
net.netfilter.nf_conntrack_max = 65536
error: permission denied on key 'net.ipv6.route.flush'
net.nf_conntrack_max = 65536

– Check the current sysctlnf_conntrack active connections

To check present connection tracking opened on a system:

:

linux:~# /sbin/sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 12742

The shown connections are assigned dynamicly on each new succesful TCP / IP NAT-ted connection. Btw, on a systems that work normally without the dmesg log being flooded with the message, the output of lsmod is:

linux:~# /sbin/lsmod | egrep 'ip_tables|conntrack'
ip_tables 9899 1 iptable_filter
x_tables 14175 1 ip_tables

On servers which are encountering nf_conntrack: table full, dropping packet error, you can see, when issuing lsmod, extra modules related to nf_conntrack are shown as loaded:

linux:~# /sbin/lsmod | egrep 'ip_tables|conntrack'
nf_conntrack_ipv4 10346 3 iptable_nat,nf_nat
nf_conntrack 60975 4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4 1073 1 nf_conntrack_ipv4
ip_tables 9899 2 iptable_nat,iptable_filter
x_tables 14175 3 ipt_MASQUERADE,iptable_nat,ip_tables

 

II. Remove completely nf_conntrack support if it is not really necessery

It is a good practice to limit or try to omit completely use of any iptables NAT rules to prevent yourself from ending with flooding your kernel log with the messages and respectively stop your system from dropping connections.

Another option is to completely remove any modules related to nf_conntrack, iptables_nat and nf_nat.
To remove nf_conntrack support from the Linux kernel, if for instance the system is not used for Network Address Translation use:

/sbin/rmmod iptable_nat
/sbin/rmmod ipt_MASQUERADE
/sbin/rmmod rmmod nf_nat
/sbin/rmmod rmmod nf_conntrack_ipv4
/sbin/rmmod nf_conntrack
/sbin/rmmod nf_defrag_ipv4

Once the modules are removed, be sure to not use iptables -t nat .. rules. Even attempt to list, if there are any NAT related rules with iptables -t nat -L -n will force the kernel to load the nf_conntrack modules again.

Btw nf_conntrack: table full, dropping packet. message is observable across all GNU / Linux distributions, so this is not some kind of local distribution bug or Linux kernel (distro) customization.
 

III. Fixing the nf_conntrack … dropping packets error

– One temporary, fix if you need to keep your iptables NAT rules is:

linux:~# sysctl -w net.netfilter.nf_conntrack_max=131072

I say temporary, because raising the nf_conntrack_max doesn't guarantee, things will get smoothly from now on.
However on many not so heavily traffic loaded servers just raising the net.netfilter.nf_conntrack_max=131072 to a high enough value will be enough to resolve the hassle.

– Increasing the size of nf_conntrack hash-table

The Hash table hashsize value, which stores lists of conntrack-entries should be increased propertionally, whenever net.netfilter.nf_conntrack_max is raised.

linux:~# echo 32768 > /sys/module/nf_conntrack/parameters/hashsize
The rule to calculate the right value to set is:
hashsize = nf_conntrack_max / 4

– To permanently store the made changes ;a) put into /etc/sysctl.conf:

linux:~# echo 'net.netfilter.nf_conntrack_count = 131072' >> /etc/sysctl.conf
linux:~# /sbin/sysct -p

b) put in /etc/rc.local (before the exit 0 line):

echo 32768 > /sys/module/nf_conntrack/parameters/hashsize

Note: Be careful with this variable, according to my experience raising it to too high value (especially on XEN patched kernels) could freeze the system.
Also raising the value to a too high number can freeze a regular Linux server running on old hardware.

– For the diagnosis of nf_conntrack stuff there is ;

/proc/sys/net/netfilter kernel memory stored directory. There you can find some values dynamically stored which gives info concerning nf_conntrack operations in "real time":

linux:~# cd /proc/sys/net/netfilter
linux:/proc/sys/net/netfilter# ls -al nf_log/

total 0
dr-xr-xr-x 0 root root 0 Mar 23 23:02 ./
dr-xr-xr-x 0 root root 0 Mar 23 23:02 ../
-rw-r--r-- 1 root root 0 Mar 23 23:02 0
-rw-r--r-- 1 root root 0 Mar 23 23:02 1
-rw-r--r-- 1 root root 0 Mar 23 23:02 10
-rw-r--r-- 1 root root 0 Mar 23 23:02 11
-rw-r--r-- 1 root root 0 Mar 23 23:02 12
-rw-r--r-- 1 root root 0 Mar 23 23:02 2
-rw-r--r-- 1 root root 0 Mar 23 23:02 3
-rw-r--r-- 1 root root 0 Mar 23 23:02 4
-rw-r--r-- 1 root root 0 Mar 23 23:02 5
-rw-r--r-- 1 root root 0 Mar 23 23:02 6
-rw-r--r-- 1 root root 0 Mar 23 23:02 7
-rw-r--r-- 1 root root 0 Mar 23 23:02 8
-rw-r--r-- 1 root root 0 Mar 23 23:02 9

 

IV. Decreasing other nf_conntrack NAT time-out values to prevent server against DoS attacks

Generally, the default value for nf_conntrack_* time-outs are (unnecessery) large.
Therefore, for large flows of traffic even if you increase nf_conntrack_max, still shorty you can get a nf_conntrack overflow table resulting in dropping server connections. To make this not happen, check and decrease the other nf_conntrack timeout connection tracking values:

linux:~# sysctl -a | grep conntrack | grep timeout
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.ipv4.netfilter.ip_conntrack_generic_timeout = 600
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2 = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_udp_timeout = 30
net.ipv4.netfilter.ip_conntrack_udp_timeout_stream = 180
net.ipv4.netfilter.ip_conntrack_icmp_timeout = 30

All the timeouts are in seconds. net.netfilter.nf_conntrack_generic_timeout as you see is quite high – 600 secs = (10 minutes).
This kind of value means any NAT-ted connection not responding can stay hanging for 10 minutes!

The value net.netfilter.nf_conntrack_tcp_timeout_established = 432000 is quite high too (5 days!)
If this values, are not lowered the server will be an easy target for anyone who would like to flood it with excessive connections, once this happens the server will quick reach even the raised up value for net.nf_conntrack_max and the initial connection dropping will re-occur again …

With all said, to prevent the server from malicious users, situated behind the NAT plaguing you with Denial of Service attacks:

Lower net.ipv4.netfilter.ip_conntrack_generic_timeout to 60 – 120 seconds and net.ipv4.netfilter.ip_conntrack_tcp_timeout_established to stmh. like 54000

linux:~# sysctl -w net.ipv4.netfilter.ip_conntrack_generic_timeout = 120
linux:~# sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 54000

This timeout should work fine on the router without creating interruptions for regular NAT users. After changing the values and monitoring for at least few days make the changes permanent by adding them to /etc/sysctl.conf

linux:~# echo 'net.ipv4.netfilter.ip_conntrack_generic_timeout = 120' >> /etc/sysctl.conf
linux:~# echo 'net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 54000' >> /etc/sysctl.conf