Archive for the ‘Programming’ Category

Add line numbering to text file with prefix text on Linux with awk

Monday, March 25th, 2013

 

Recently I blogged a tiny article on how to add line numbering to ASCII text files with nl cmd. Today I needed to do the same line numbering, except I wanted to add a prefix text "R0" to each line number to match requirements of program which will later take use of .txt database file. I looked through nl (number lines) command manual but didn't find option with which I can insert a prefix string to each numbered line. I'm not a regular expression GURU, thus I asked for some help in irc.freenode.net on how to do it. Thanks to the kind guys in #bash I got it.

Here is how to add line numbering starting from 01onwards by adding a text prefix "R0":

$ awk '{ printf "R0%02d %s\n", NR, $0 }' text_file_to_number.txt > numbered_text_file_with_prefix.txt

The file to number in my case included a huge text file containing verses from Holy Bible, here is few lines from it before and after parsing and numbering with string prefix:

 

In the beginning God created the heaven and the earth.
                — Genesis 1:1
%
And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of
the waters.
                — Genesis 1:2
%
And God said, Let there be light: and there was light.
                — Genesis 1:3
 

 

R001 In the beginning God created the heaven and the earth.
R002            — Genesis 1:1
R003 %
R004 And the earth was without form, and void; and darkness was upon
R005 the face of the deep. And the Spirit of God moved upon the face of
R006 the waters.
R007            — Genesis 1:2
R008 %
R009 And God said, Let there be light: and there was light.
R010            — Genesis 1:3

 

 

To add any other prefix except R0 for example SAMPLE_PREFIX_STRING, substitute in above awk expression R0 with whatever expression;

 

awk '{ printf "SAMPLE_PREFIX_STRING%02d %s\n", NR, $0 }' text_file_to_number.txt > numbered_text_file_with_prefix.txt

Share this on

Linux: Delete empty lines from text file with sed, awk, grep and vim

Saturday, March 23rd, 2013

As a system administrator, sometimes is necessary to do basic plain text processing for various sysadmin tasks. One very common task I do to remove empty lines in file. There are plenty of ways to do it i.e. – with grep, sed, awk, bash, perl etc.

1. Deleting empty file lines with sed

The most standard way to do it is with sed, as sed was written to do in shell quick regexp. Here is how;

sed '/^\s*$/d' file_with_empty_lines.txt > output_no_empty_lines.txt

2. Deleting empty file lines with awk

It is less of writting with awk, but I always forget the syntax and thus I like more sed, anyways here is how with awk;

cat file_with_empty_lines.txt | awk 'NF' >
output_no_empty_lines.txt

3. Deleting empty lines with grep

Grep  regular expression can be used. Here is grep cmd to cut off empty lines from file;

grep -v '^\s*$' file_with_empty_lines.txt >
output_no_empty_lines.txt

4. Delete empty files with vi / vim text editor

Open vi / vim text editor

$ vim

Press Esc+: and if empty lines doesn't have empty space characters use command

g/^$/d

Whether, empty lines contain " " – space characters (which are not visible in most text editors), use vi cmd:
g/^ $/d

Share this on

A tiny minimalistic CHAT Client Program writen in C

Sunday, July 29th, 2012

A friend of mine (Dido) who is learning C programming, has written a tiny chat server / client (peer to peer) program in C. His program is a very good learning curve for anyone desiring to learn basic C socket programming.
The program is writen in a way so it can be easily modified to work over UDP protocol with code:

struct sockaddr_in a;
a_sin_family=AF_INET;
a_sin_socktype=SOCK_DGRAM;

Here are links to the code of the Chat server/client progs:

Tiny C Chat Server Client source code

Tiny C Chat Client source code

To Use the client/server compile on the server host tiny-chat-serer-client.c with:

$ cc -o tiny-chat-server tiny-chat-server.c

Then on the client host compile the client;

$ cc -o tiny-chat-client tiny-chat-client.c

On the server host tiny-chat-server should be ran with port as argument, e.g. ;

$ ./tiny-chat-server 8888

To chat with the person running tiny-chat-server the compiled server should be invoked with:

$ ./tiny-chat-client 123.123.123.123 8888

123.123.123.123 is the IP address of the host, where tiny-chat-server is executed.
The chat/server C programs are actually a primitive very raw version of talk.

The programs are in a very basic stage, there are no condition checks for incorrectly passed arguments and with wrongly passed arguments it segfaults. Still for C beginners its useful …

Share this on

How to count lines of PHP source code in a directory (recursively)

Saturday, July 14th, 2012

Count PHP and other programming languages lines of source code (source code files count) recursively

Being able to count the number of PHP source code lines for a website is a major statistical information for timely auditting of projects and evaluating real Project Managment costs. It is inevitable process for any software project evaluation to count the number of source lines programmers has written.
In many small and middle sized software and website development companies, it is the system administrator task to provide information or script quickly something to give info on the exact total number of source lines for projects.

Even for personal use out of curiousity it is useful to know how many lines of PHP source code a wordpress or Joomla website (with the plugins) contains.
Anyone willing to count the number of PHP source code lines under one directory level, could do it with:::

serbver:~# cd /var/www/wordpress-website
server:/var/www/wordpress-website:# wc -l *.php
17 index.php
101 wp-activate.php
1612 wp-app.php
12 wp-atom.php
19 wp-blog-header.php
105 wp-comments-post.php
12 wp-commentsrss2.php
90 wp-config-sample.php
85 wp-config.php
104 wp-cron.php
12 wp-feed.php
58 wp-links-opml.php
59 wp-load.php
694 wp-login.php
236 wp-mail.php
17 wp-pass.php
12 wp-rdf.php
15 wp-register.php
12 wp-rss.php
12 wp-rss2.php
326 wp-settings.php
451 wp-signup.php
110 wp-trackback.php
109 xmlrpc.php
4280 total

This will count and show statistics, for each and every PHP source file within wordpress-website (non-recursively), to get only information about the total number of PHP source code lines within the directory, one could grep it, e.g.:::

server:/var/www/wordpress-website:# wc -l *.php |grep -i '\stotal$'
4280 total

The command grep -i '\stotal$' has \s in beginning and $ at the end of total keyword in order to omit erroneously matching PHP source code file names which contain total in file name; for example total.php …. total_blabla.php …. blabla_total_bla.php etc. etc.

The \s grep regular expression meaning is "put empty space", "$" is placed at the end of tital to indicate to regexp grep only for words ending in string total.

So far, so good … Now it is most common that instead of counting the PHP source code lines for a first directory level to count complete number of PHP, C, Python whatever source code lines recursively – i. e. (a source code of website or projects kept in multiple sub-directories). To count recursively lines of programming code for any existing filesystem directory use find in conjunction with xargs:::

server:/var/www/wp-website1# find . -name '*.php' | xargs wc -l
1079 ./wp-admin/includes/file.php
2105 ./wp-admin/includes/media.php
103 ./wp-admin/includes/list-table.php
1054 ./wp-admin/includes/class-wp-posts-list-table.php
105 ./wp-admin/index.php
109 ./wp-admin/network/user-new.php
100 ./wp-admin/link-manager.php
410 ./wp-admin/widgets.php
108 ./wp-content/plugins/akismet/widget.php
104 ./wp-content/plugins/google-analytics-for-wordpress/wp-gdata/wp-gdata.php
104 ./wp-content/plugins/cyr2lat-slugs/cyr2lat-slugs.php
,,,,
652239 total

As you see the cmd counts and displays the number of source code lines encountered in each and every file, for big directory structures the screen gets floated and passing | less is nice, e.g.:

find . -name '*.php' | xargs wc -l | less

Displaying lines of code for each file within the directories is sometimes unnecessery, whether just a total number of programming source code line is required, hence for scripting purposes it is useful to only get the source lines total num:::

server:/var/www/wp-website1# find . -name '*.php' | xargs wc -l | grep -i '\stotal$'

Another shorter and less CPU intensive one-liner to calculate the lines of codes is:::

server:/var/www/wp-website1# ( find ./ -name '*.php' -print0 | xargs -0 cat ) | wc -l

Here is one other shell script which displays all file names within a directory with the respective calculated lines of code

For more professional and bigger projects using pure Linux bash and command line scripting might not be the best approach. For counting huge number of programming source code and displaying various statistics concerning it, there are two other tools – SLOCCount
as well as clock (count lines of code)

Both tools, are written in Perl, so for IT managers concerned for speed of calculating projects source (if too frequent source audit is necessery) this tools might be a bit sluggish. However for most projects they should be of a great add on value, actually SLOCCount was already used for calculating the development costs of GNU / Linux and other projects of high importance for Free Software community and therefore it is proven it works well with ENORMOUS software source line code calculations written in programming languages of heterogenous origin.

sloccount and cloc packages are available in default Debian and Ubuntu Linux repositories, so if you're a Debilian user like me you're in luck:::

server:~# apt-cache search cloc$
cloc - statistics utility to count lines of code
server:~# apt-cache search sloccount$
sloccount - programs for counting physical source lines of code (SLOC)

Well that's all folks, Cheers en happy counting ;)

Share this on

Color Psychology – Color Mind Programming or how big companies boost their sales and make up your mind

Thursday, June 21st, 2012

Colors Programming Color mind Programming, how big companies boost their sales and make up your mind

As I've pointed earlier there is plenty of "secretly" kept and less known by public research on how colors influence us daily. The biggest companies are heavily taking advantage of what is found and known for colors impact on our minds (psyche). Actually there is a whole branch in psychology which deals with impact of colors perception on us.Besides companies, many modern governments are well aware of the many facts on how citizens percept colors and use this in color 'installment' in government offices and government institutions.

There is no universal knowledge on how colors completely affect us as every human on earth is very unique and saying this or that color has this or that impact on indivirual or group is not 100% accurate. However there are general traits nowdays formed especially with globalization and unification of TV ads and big companies corporate image, a unification started on how different nationality people perceive colors.

Nowdays in developed countries there are more and more people who perceive certain colors in similar fashion. Therefore every serious top marketer should carefully study colors and their relation with ancient time people believes and understanding on what each of the 'rainbow' colors symbolize. Most likely because there is no completely unified understanding of colors between various individuals may companies like Google and Microsoft started using all the rainbow colors in their basic company logos and branding for more on this topic please check my previous blog post Color trick Microsoft and Google use to keep their users loyal

Another large industry area, where color programming is very heavy is Computer and Video Games. You certainly still remember large portions of the games like Sega’s Sonic the HendgeHog or Mario Super Bros. or even the old arcade machines with games like Punisher or Cadillacs and Dinosaurs, Street Fighter etc.
All this old arcade games have a big portion of Color programming embedded in and this is one of the main reasons we remember them for a long time and playing them evoked such a strong feelings in youth.

This trend of using colors to make up our minds is being observed for many other physical goods as well as is starting to get more and more heavy adoption by websites branding on the internet.
Actually those with most succesful businesses on the internet have already integrated some kind of color programming scheme. An example for this would be the Internet top domain names seller GoDaddy. The have adopted a green scheme as a primary color combined with some other ones to create in the customer a feeling of ecology, naturality, peace and solitude.

The study of color programming is one major field to be known by anyone truly willing to understand why certain big store chainslike Carrefour, Lidl, Billa, MediaMarkt – in western europe or TechnoMarket, TechnoPolis (MediaMarkt copied tech equipment by shops here in Bulgaria) are decorated inside the way they are. I personally didn't like the concept of color programming since from Spiritual point of view it is a big evil. Trying to manipulate people perception to do something you would like to in general is very evil from spiritual point of view. A mixture of rainbow colors in a natural environment for example flowers in the wood or wild mountain place is one thing, but making it artificial and placing it in certain pre-desired order is totally another. Besides that the colors in the natural environment are natural and therefore the impact on us even if colorful is very much better than if it is done with a certain intention like in the big supermarkets stores, fast just food companies – McDonalds, Burger King etc.

The research on color mind influcence – Color mind programming is a controversial science. Nowdays many big businesses however use this as a granted science, even whole business sects with some mambo-jambo believes universities, children garden and schools in modern countries have employed the use of some type of color programming aiming to influence their pupils, students (organizational members – you call it).
Color mind programming and heavy use on advertisements on the TV, the Internet, Stores and mostly everywhere are however starting to took their tolls. The high increase in mental problems and dumbness in developed and some undeveloped countries as well as the increased number of people who go insane because of too much color programming is reality. The believe that mental programming is one of the ultimate tools to influence somebody and push him to do things you want like consome more of a product or generally consume (buy) more goods creates another severe issue it makes people to constantly over-consume (eat more than the body needs) and this increases the number of over-consumption evoked diseases …

But color programming doesn't stop with just the material (physical) surrounding world it is a concept highly employed in online based marketing. Online business is seen on so many top used websites, social networks like take for instance (facebook). It is so spread that even the software primary vendors like Microsoft, search engines Google Inc. have already heavily employed the color programming as a basis of their products.

There is another reason why most vendors nowdays issue their physical or 'virtual' products so colorful using all the colors of the rainbow. The reason is the fact that as a kids through animation, cartoons, toys and surrounding environment we have been exposed already even from our very youth age to a kind of color programming through kids toys we've been given by our parents). Hence the young years color programming became a basis for a future time color programming. The colorfulness of our kids years are already sub-consciously stored in our minds, so almost naturally there is a feeling of joy to pop-up once we see something childishly colorful.
 

Share this on

PHP system(); hide command output – How to hide displayed output with exec();

Saturday, April 7th, 2012

I've recently wanted to use PHP's embedded system(""); – external command execute function in order to use ls + wc to calculate the number of files stored in a directory. I know many would argue, this is not a good practice and from a performance view point it is absolutely bad idea. However as I was lazy to code ti in PHP, I used the below line of code to do the task:

<?
echo "Hello, ";
$line_count = system("ls -1 /dir/|wc -l");
echo "File count in /dir is $line_count \n";
?>

This example worked fine for me to calculate the number of files in my /dir, but unfortunately the execution output was also visialized in the browser. It seems this is some kind of default behaviour in both libphp and php cli. I didn't liked the behaviour so I checked online for a solution to prevent the system(); from printing its output.

What I found as a recommendations on many pages is instead of system(); to prevent command execution output one should use exec();.
Therefore I used instead of my above code:

<?
echo "Hello, ";
$line_count = exec("ls -1 /dir/|wc -l");
echo "File count in /dir is $line_count \n";
?>

By the way insetad of using exec();, it is also possible to just use ` (backtick) – in same way like in bash scripting's .

Hence the above code can be also written for short like this:

<?
echo "Hello, ";
$line_count = `ls -1 /dir/|wc -l`;
echo "File count in /dir is $line_count \n";
?>

:)

Share this on

Tiny PHP script to dump your browser set HTTP headers (useful in debugging)

Friday, March 30th, 2012

While browsing I stumbled upon a nice blog article

Dumping HTTP headers

The arcitle, points at few ways to DUMP the HTTP headers obtained from user browser.
As I'm not proficient with Ruby, Java and AOL Server what catched my attention is a tiny php for loop, which loops through all the HTTP_* browser set variables and prints them out. Here is the PHP script code:

<?php<br />
foreach($_SERVER as $h=>$v)<br />
if(ereg('HTTP_(.+)',$h,$hp))<br />
echo "<li>$h = $v</li>\n";<br />
header('Content-type: text/html');<br />
?>

The script is pretty easy to use, just place it in a directory on a WebServer capable of executing php and save it under a name like:
show_HTTP_headers.php

If you don't want to bother copy pasting above code, you can also download the dump_HTTP_headers.php script here , rename the dump_HTTP_headers.php.txt to dump_HTTP_headers.php and you're ready to go.

Follow to the respective url to exec the script. I've installed the script on my webserver, so if you are curious of the output the script will be returning check your own browser HTTP set values by clicking here.
PHP will produce output like the one in the screenshot you see below, the shot is taken from my Opera browser:

Screenshot show HTTP headers.php script Opera Debian Linux

Another sample of the text output the script produce whilst invoked in my Epiphany GNOME browser is:

HTTP_HOST = www.pc-freak.net
HTTP_USER_AGENT = Mozilla/5.0 (X11; U; Linux x86_64; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Version/5.0 Safari/531.2+ Debian/squeeze (2.30.6-1) Epiphany/2.30.6
HTTP_ACCEPT = application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
HTTP_ACCEPT_ENCODING = gzip
HTTP_ACCEPT_LANGUAGE = en-us, en;q=0.90
HTTP_COOKIE = __qca=P0-2141911651-1294433424320;
__utma_a2a=8614995036.1305562814.1274005888.1319809825.1320152237.2021;wooMeta=MzMxJjMyOCY1NTcmODU1MDMmMTMwODQyNDA1MDUyNCYxMzI4MjcwNjk0ODc0JiYxMDAmJjImJiYm; 3ec0a0ded7adebfeauth=22770a75911b9fb92360ec8b9cf586c9;
__unam=56cea60-12ed86f16c4-3ee02a99-3019;
__utma=238407297.1677217909.1260789806.1333014220.1333023753.1606;
__utmb=238407297.1.10.1333023754; __utmc=238407297;
__utmz=238407297.1332444980.1586.413.utmcsr=pc-freak.net|utmccn=(referral)|utmcmd=referral|utmcct=/blog/

You see the script returns, plenty of useful information for debugging purposes:
HTTP_HOST – Virtual Host Webserver name
HTTP_USER_AGENT – The browser exact type useragent returnedHTTP_ACCEPT – the type of MIME applications accepted by the WebServerHTTP_ACCEPT_LANGUAGE – The language types the browser has support for
HTTP_ACCEPT_ENCODING – This PHP variable is usually set to gzip or deflate by the browser if the browser has support for webserver returned content gzipping.
If HTTP_ACCEPT_ENCODING is there, then this means remote webserver is configured to return its HTML and static files in gzipped form.
HTTP_COOKIE – Information about browser cookies, this info can be used for XSS attacks etc. :)
HTTP_COOKIE also contains the referrar which in the above case is:
__utmz=238407297.1332444980.1586.413.utmcsr=pc-freak.net|utmccn=(referral)
The Cookie information HTTP var also contains information of the exact link referrar:
|utmcmd=referral|utmcct=/blog/

For the sake of comparison show_HTTP_headers.php script output from elinks text browser is like so:

* HTTP_HOST = www.pc-freak.net
* HTTP_USER_AGENT = Links (2.3pre1; Linux 2.6.32-5-amd64 x86_64; 143x42)
* HTTP_ACCEPT = */*
* HTTP_ACCEPT_ENCODING = gzip,deflate * HTTP_ACCEPT_CHARSET = us-ascii, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16, windows-1250, windows-1251, windows-1252, windows-1256,
windows-1257, cp437, cp737, cp850, cp852, cp866, x-cp866-u, x-mac, x-mac-ce, x-kam-cs, koi8-r, koi8-u, koi8-ru, TCVN-5712, VISCII,utf-8 * HTTP_ACCEPT_LANGUAGE = en,*;q=0.1
* HTTP_CONNECTION = keep-alive
One good reason, why it is good to give this script a run is cause it can help you reveal problems with HTTP headers impoperly set cookies, language encoding problems, security holes etc. Also the script is a good example, for starters in learning PHP programming.

 

Share this on

Auto restart Apache on High server load (bash shell script) – Fixing Apache server temporal overload issues

Saturday, March 24th, 2012

I've written a tiny script to check and restart, Apache if the server encounters, extremely high load avarage like for instance more than (>25). Below is an example of a server reaching a very high load avarage:;

server~:# uptime
13:46:59 up 2 days, 18:54, 1 user, load average: 58.09, 59.08, 60.05
load average: 0.09, 0.08, 0.08

Sometimes high load avarage is not a problem, as the server might have a very powerful hardware. A high load numbers is not always an indicator for a serious problems. Some 16 CPU dual core (2.18 Ghz) machine with 16GB of ram could probably work normally with a high load avarage like in the example. Anyhow as most servers are not so powerful having such a high load avarage, makes the machine hardly do its job routine.

In my specific, case one of our Debian Linux servers is periodically reaching to a very high load level numbers. When this happens the Apache webserver is often incapable to serve its incoming requests and starts lagging for clients. The only work-around is to stop the Apache server for a couple of seconds (10 or 20 seconds) and then start it again once the load avarage has dropped to less than "3".

If this temporary fix is not applied on time, the server load gets increased exponentially until all the server services (ssh, ftp … whatever) stop responding normally to requests and the server completely hangs …

Often this server overloads, are occuring at night time so I'm not logged in on the server and one such unexpected overload makes the server unreachable for hours.
To get around the sudden high periodic load avarage server increase, I've written a tiny bash script to monitor, the server load avarage and initiate an Apache server stop and start with a few seconds delay in between.

#!/bin/sh
# script to check server for extremely high load and restart Apache if the condition is matched
check=`cat /proc/loadavg | sed 's/\./ /' | awk '{print $1}'`
# define max load avarage when script is triggered
max_load='25'
# log file
high_load_log='/var/log/apache_high_load_restart.log';
# location of inidex.php to overwrite with temporary message
index_php_loc='/home/site/www/index.php';
# location to Apache init script
apache_init='/etc/init.d/apache2';
#
site_maintenance_msg="Site Maintenance in progress - We will be back online in a minute";
if [ $check -gt "$max_load" ]; then>
#25 is load average on 5 minutes
cp -rpf $index_php_loc $index_php_loc.bak_ap
echo "$site_maintenance_msg" > $index_php_loc
sleep 15;
if [ $check -gt "$max_load" ]; then
$apache_init stop
sleep 5;
$apache_init restart
echo "$(date) : Apache Restart due to excessive load | $check |" >> $high_load_log;
cp -rpf $index_php_loc.bak_ap $index_php_loc
fi
fi

The idea of the script is partially based on a forum thread – Auto Restart Apache on High Loadhttp://www.webhostingtalk.com/showthread.php?t=971304Here is a link to my restart_apache_on_high_load.sh script

The script is written in a way that it makes two "if" condition check ups, to assure 100% there is a constant high load avarage and not just a temporal 5 seconds load avarage jump. Once the first if is matched, the script first tries to reduce the server load by overwritting a the index.php, index.html script of the website with a one stating the server is ongoing a maintenance operations.
Temporary stopping the index page, often reduces the load in 10 seconds of time, so the second if case is not necessery at all. Sometimes, however this first "if" condition cannot decrease enough the load and the server load continues to stay too high, then the script second if comes to play and makes apache to be completely stopped via Apache init script do 2 secs delay and launch the apache server again.

The script also logs about, the load avarage encountered, while the server was overloaded and Apache webserver was restarted, so later I can check what time the server overload occured.
To make the script periodically run, I've scheduled the script to launch every 5 minutes as a cron job with the following cron:

# restart Apache if load is higher than 25
*/5 * * * * /usr/sbin/restart_apache_on_high_load.sh >/dev/null 2>&1

I have also another system which is running FreeBSD 7_2, which is having the same overload server problems as with the Linux host.
Copying the auto restart apache on high load script on FreeBSD didn't work out of the box. So I rewrote a little chunk of the script to make it running on the FreeBSD host. Hence, if you would like to auto restart Apache or any other service on FreeBSD server get /usr/sbin/restart_apache_on_high_load_freebsd.sh my script and set it on cron on your BSD.

This script is just a temporary work around, however as its obvious that the frequency of the high overload will be rising with time and we will need to buy new server hardware to solve permanently the issues, anyways, until this happens the script does a great job :)

I'm aware there is also alternative way to auto restart Apache webserver on high server loads through using monit utility for monitoring services on a Unix system. However as I didn't wanted to bother to run extra services in the background I decided to rather use the up presented script.

Interesting info to know is Apache module mod_overload exists – which can be used for checking load average. Using this module once load avarage is over a certain number apache can stop in its preforked processes current serving request, I've never tested it myself so I don't know how usable it is. As of time of writting it is in early stage version 0.2.2
If someone, have tried it and is happy with it on a busy hosting servers, please share with me if it is stable enough?

Share this on

How to delete million of files on busy Linux servers (Work out Argument list too long)

Tuesday, March 20th, 2012

How to Delete million or many thousands of files in the same directory on GNU / Linux and FreeBSD

If you try to delete more than 131072 of files on Linux with rm -f *, where the files are all stored in the same directory, you will get an error:

/bin/rm: Argument list too long.

I've earlier blogged on deleting multiple files on Linux and FreeBSD and this is not my first time facing this error.
Anyways, as time passed, I've found few other new ways to delete large multitudes of files from a server.

In this article, I will explain shortly few approaches to delete few million of obsolete files to clean some space on your server.
Here are 3 methods to use to clean your tons of junk files.

1. Using Linux find command to wipe out millions of files

a.) Finding and deleting files using find's -exec switch:

# find . -type f -exec rm -fv {} \;

This method works fine but it has 1 downside, file deletion is too slow as for each found file external rm command is invoked.

For half a million of files or more, using this method will take "long". However from a server hard disk stressing point of view it is not so bad as, the files deletion is not putting too much strain on the server hard disk.
b.) Finding and deleting big number of files with find's -delete argument:

Luckily, there is a better way to delete the files, by using find's command embedded -delete argument:

# find . -type f -print -delete

c.) Deleting and printing out deleted files with find's -print arg

If you would like to output on your terminal, what files find is deleting in "real time" add -print:

# find . -type f -print -delete

To prevent your server hard disk from being stressed and hence save your self from server normal operation "outages", it is good to combine find command with ionice, e.g.:

# ionice -c 3 find . -type f -print -delete

Just note, that ionice cannot guarantee find's opeartions will not affect severely hard disk i/o requests. On  heavily busy servers with high amounts of disk i/o writes still applying the ionice will not prevent the server from being hanged! Be sure to always keep an eye on the server, while deleting the files nomatter with or without ionice. if throughout find execution, the server gets lagged in serving its ordinary client requests or whatever, stop the execution of the cmd immediately by killing it from another ssh session or tty (if physically on the server).

2. Using a simple bash loop with rm command to delete "tons" of files

An alternative way is to use a bash loop, to print each of the files in the directory and issue /bin/rm on each of the loop elements (files) like so:

for i in *; do
rm -f $i;
done

If you'd like to print what you will be deleting add an echo to the loop:

# for i in $(echo *); do \
echo "Deleting : $i"; rm -f $i; \

The bash loop, worked like a charm in my case so I really warmly recommend this method, whenever you need to delete more than 500 000+ files in a directory.

3. Deleting multiple files with perl

Deleting multiple files with perl is not a bad idea at all.
Here is a perl one liner, to delete all files contained within a directory:

# perl -e 'for(<*>){((stat)[9]<(unlink))}'

If you prefer to use more human readable perl script to delete a multitide of files use delete_multple_files_in_dir_perl.pl

Using perl interpreter to delete thousand of files is quick, really, really quick.
I did not benchmark it on the server, how quick exactly is it, but I guess the delete rate should be similar to find command. Its possible even in some cases the perl loop is  quicker …

4. Using PHP script to delete a multiple files

Using a short php script to delete files file by file in a loop similar to above bash script is another option.
To do deletion  with PHP, use this little PHP script:

<?php
$dir = "/path/to/dir/with/files";
$dh = opendir( $dir);
$i = 0;
while (($file = readdir($dh)) !== false) {
$file = "$dir/$file";
if (is_file( $file)) {
unlink( $file);
if (!(++$i % 1000)) {
echo "$i files removed\n";
}
}
}
?>

As you see the script reads the $dir defined directory and loops through it, opening file by file and doing a delete for each of its loop elements.
You should already know PHP is slow, so this method is only useful if you have to delete many thousands of files on a shared hosting server with no (ssh) shell access.

This php script is taken from Steve Kamerman's blog . I would like also to express my big gratitude to Steve for writting such a wonderful post. His post actually become  inspiration for this article to become reality.

You can also download the php delete million of files script sample here

To use it rename delete_millioon_of_files_in_a_dir.php.txt to delete_millioon_of_files_in_a_dir.php and run it through a browser .

Note that you might need to run it multiple times, cause many shared hosting servers are configured to exit a php script which keeps running for too long.
Alternatively the script can be run through shell with PHP cli:

php -l delete_millioon_of_files_in_a_dir.php.txt.

5. So What is the "best" way to delete million of files on Linux?

In order to find out which method is quicker in terms of execution time I did a home brew benchmarking on my thinkpad notebook.

a) Creating 509072 of sample files.

Again, I used bash loop to create many thousands of files in order to benchmark.
I didn't wanted to put this load on a productive server and hence I used my own notebook to conduct the benchmarks. As my notebook is not a server the benchmarks might be partially incorrect, however I believe still .they're pretty good indicator on which deletion method would be better.

hipo@noah:~$ mkdir /tmp/test
hipo@noah:~$ cd /tmp/test;
hiponoah:/tmp/test$ for i in $(seq 1 509072); do echo aaaa >> $i.txt; done

I had to wait few minutes until I have at hand 509072  of files created. Each of the files as you can read is containing the sample "aaaa" string.

b) Calculating the number of files in the directory

Once the command was completed to make sure all the 509072 were existing, I used a find + wc cmd to calculate the directory contained number of files:

hipo@noah:/tmp/test$ find . -maxdepth 1 -type f |wc -l
509072

real 0m1.886s
user 0m0.440s
sys 0m1.332s

Its intesrsting, using an ls command to calculate the files is less efficient than using find:

hipo@noah:/tmp/test$ time ls -1 |wc -l
509072

real 0m3.355s
user 0m2.696s
sys 0m0.528s

c) benchmarking the different file deleting methods with time

- Testing delete speed of find

hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f -delete
real 15m40.853s
user 0m0.908s
sys 0m22.357s

You see, using find to delete the files is not either too slow nor light quick.

- How fast is perl loop in multitude file deletion ?

hipo@noah:/tmp/test$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'real 6m24.669suser 0m2.980ssys 0m22.673s

Deleting my sample 509072 took 6 mins and 24 secs. This is about 3 times faster than find! GO-GO perl :)
As you can see from the results, perl is a great and time saving, way to delete 500 000 files.

- The approximate speed deletion rate of of for + rm bash loop

hipo@noah:/tmp/test$ time for i in *; do rm -f $i; done

real 206m15.081s
user 2m38.954s
sys 195m38.182s

You see the execution took 195m en 38 secs = 3 HOURS and 43 MINUTES!!!! This is extremely slow ! But works like a charm as the running of deletion didn't impacted my normal laptop browsing. While the script was running I was mostly browsing through few not so heavy (non flash) websites and doing some other stuff in gnome-terminal) :)

As you can imagine running a bash loop is a bit CPU intensive, but puts less stress on the hard disk read/write operations. Therefore its clear using it is always a good practice when deletion of many files on a dedi servers is required.

b) my production server file deleting experience

On a production server I only tested two of all the listed methods to delete my files. The production server, where I tested is running Debian GNU / Linux Squeeze 6.0.3. There I had a task to delete few million of files.
The tested methods tried on the server were:

- The find . type -f -delete method.

- for i in *; do rm -f $i; done

The results from using find -delete method was quite sad, as the server almost hanged under the heavy hard disk load the command produced.

With the for script all went smoothly. The files were deleted for a long long time (like few hours), but while it was running, the server continued with no interruptions..

While the bash loop was running, the server load avarage kept at steady 4
Taking my experience in mind, If you're running a production, server and you're still wondering which delete method to use to wipe some multitude of files, I would recommend you go  the bash for loop + /bin/rm way. Yes, it is extremely slow, expect it run for some half an hour or so but puts not too much extra load on the server..

Using the PHP script will probably be slow and inefficient, if compared to both find and the a bash loop.. I didn't give it a try yet, but suppose it will be either equal in time or at least few times slower than bash.

If you have tried the php script and you have some observations, please drop some comment to tell me how it performs.

To sum it up;

Even though there are "hacks" to clean up some messy parsing directory full of few million of junk files, having such a directory should never exist on the first place.

Frankly, keeping millions of files within the same directory is very stupid idea.
Doing so will have a severe negative impact on a directory listing performance of your filesystem in the long term.

If you know better (more efficient) ways to delete a multitude of files in a dir please share in comments.

Share this on

John McCarthy Creator and Father of Modern Artificial Intelligence and Lisp programming language creator passed away at 84

Wednesday, October 26th, 2011

John McCarthy Creator of Lisp programming language and Invetor of modern Artificial Intelligence

Yesterday night, one more Computer Genius – John McCarthy has passed away at the age of 84.
John McCarthy is mostly famous for the creation of Lisp Programming language, which was probably the most used programming language in the short past. There are plenty of corporate old iron hardwares which still run programs written in Lisp. Lisp is the language in which Richard Stallman has created his so famous EMACS text editor for GNU.

Computer Technology students, should have studied certainly Lisp in the form of Lisp Scheme.
Lisp is the the second oldest high level programming language only to be predeceded by Fortran .
Lisp gave birth to the so called Macro programming languages
and was invented by McCarthy in 1958, while he was in Massachusetts MIT university.
What is so important about Lisp is that it is de-facto the first language in the world which was written to be suitable for AI (Artificial Intelligence) researches. There is plenty of interesting information about Lisp as well as a number of forks and variations circulating for almost all the existing major operating systems nowdays.

Besides LISP creation McCarthy was in the first team who did a the first Remote Computer Chess game. The game played was among USSR and US scientists, where the moves were transferred by telegraph.
In 1972 MCCarthy was awarded with the Turing Award – (Today probably the most prestigious award for incredible technology achievements in the world).
McCarth's home website had a lot of great papers on programming languages, mathematical theory of computation and most importantly philosophical words and notes on Artificial Intelligence
His site has a lot of his essays as well as his personal views on the world and predictions (foreseen probabilities by him) on the world future.
McCarthy had even written a short Sci-Fi story (The Robot and The Baby), the story aim was to explore the question, whether robots should have simulated emotions.John McCarthy AI later days life picture

John McCarthy is among the brightest computer genius who ever live on this planet as well as a true "icon" for a computer hacker. The news for his death is quite shocking especially after the sudden death of the creator of C programming Language and UNIX Denis Ritchie , and a week earlier the pass of Steve Jobs
It seems like no coincidence, that the brightest computer minds are departuring this life, probably God is taking them one by one just like he gave them the gifts to invent and revolutionize the technology we use today.
Surely McCarthy has left a huge landmark on technology and his name will be in the books for the generations to come.

Share this on