Posts Tagged ‘printing’

How to delete millions of files on busy Linux servers (working around "Argument list too long")

Tuesday, March 20th, 2012

How to delete millions or many thousands of files in the same directory on GNU/Linux and FreeBSD

If you try to delete more than 131072 files on Linux with rm -f *, where the files are all stored in the same directory, you will get an error:

/bin/rm: Argument list too long.

I've blogged earlier on deleting multiple files on Linux and FreeBSD, and this is not my first time facing this error.
Anyway, as time passed, I've found a few other ways to delete large multitudes of files from a server.

In this article, I will briefly explain a few approaches to deleting a few million obsolete files to free up some space on your server.
Here are four methods you can use to clean out your tons of junk files.

1. Using Linux find command to wipe out millions of files

a.) Finding and deleting files using find's -exec switch:

# find . -type f -exec rm -fv {} \;

This method works fine, but it has one downside: file deletion is slow, because an external rm command is invoked for each file found.

For half a million files or more, using this method will take a long time. From a hard-disk-stress point of view, however, it is not so bad, since the slow deletion does not put too much strain on the server hard disk.
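A variant worth mentioning (I have not benchmarked it here) is find's + terminator, which is part of POSIX find: instead of invoking rm once per file, it batches as many file names as fit into each rm call, avoiding most of the per-file fork/exec overhead:

# find . -type f -exec rm -f {} +
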
b.) Finding and deleting a big number of files with find's -delete argument:

Luckily, there is a better way to delete the files: find's built-in -delete argument:

# find . -type f -delete

c.) Printing out the files being deleted with find's -print arg

If you would like to see on your terminal which files find is deleting in "real time", add -print:

# find . -type f -print -delete

To prevent your server hard disk from being overly stressed, and hence save yourself from "outages" in the server's normal operation, it is good to combine the find command with ionice, e.g.:

# ionice -c 3 find . -type f -print -delete

Just note that ionice (class 3 is the idle I/O scheduling class) cannot guarantee that find's operations will not severely affect hard disk I/O requests. On heavily busy servers with a high volume of disk writes, applying ionice will still not prevent the server from hanging! Be sure to always keep an eye on the server while deleting the files, no matter whether ionice is used or not. If, throughout find's execution, the server starts lagging in serving its ordinary client requests, stop the command immediately by killing it from another SSH session or TTY (if physically at the server).
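If you want to be extra cautious, the deletion can be started with lowered CPU priority as well as idle I/O priority, and paused or stopped from another session. A sketch (assuming this find is the only find process running on the box, since pkill matches by process name):

# nice -n 19 ionice -c 3 find . -type f -delete

# pkill -STOP -x find    # pause the deletion temporarily
# pkill -CONT -x find    # resume it
# pkill -TERM -x find    # terminate it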

2. Using a simple bash loop with rm command to delete "tons" of files

An alternative way is to use a bash loop that iterates over each of the files in the directory and issues /bin/rm on each loop element (file), like so:

for i in *; do
    rm -f "$i"    # quote $i so file names containing spaces are handled
done

If you'd like to print what you are deleting, add an echo to the loop:

# for i in *; do echo "Deleting : $i"; rm -f "$i"; done

The bash loop worked like a charm in my case, so I warmly recommend this method whenever you need to delete more than 500,000 files in a directory.

3. Deleting multiple files with perl

Deleting multiple files with perl is not a bad idea at all.
Here is a perl one-liner to delete all files contained within a directory (it globs every file name and calls unlink on each; the (stat)[9] comparison is merely a trick to fit the unlink call into an expression):

# perl -e 'for(<*>){((stat)[9]<(unlink))}'

If you prefer a more human-readable perl script to delete a multitude of files, use delete_multple_files_in_dir_perl.pl

Using the perl interpreter to delete thousands of files is quick, really, really quick.
I did not benchmark exactly how quick it is on the server, but I guess the deletion rate should be similar to the find command's. It is possible the perl loop is even quicker in some cases …

4. Using a PHP script to delete multiple files

Using a short PHP script to delete files one by one, in a loop similar to the bash script above, is another option.
To do the deletion with PHP, use this little PHP script:

<?php
$dir = "/path/to/dir/with/files";
$dh = opendir($dir);
$i = 0;
while (($file = readdir($dh)) !== false) {
    $file = "$dir/$file";
    if (is_file($file)) { // is_file() skips "." and ".." as well as subdirectories
        unlink($file);
        if (!(++$i % 1000)) {
            echo "$i files removed\n";
        }
    }
}
closedir($dh);
?>

As you can see, the script reads the directory defined in $dir and loops through it, deleting one file per loop iteration.
You should already know PHP is slow, so this method is only useful if you have to delete many thousands of files on a shared hosting server with no (SSH) shell access.

This PHP script is taken from Steve Kamerman's blog. I would also like to express my big gratitude to Steve for writing such a wonderful post; his post actually became the inspiration for this article.

You can also download the PHP delete-millions-of-files script sample here

To use it, rename delete_millioon_of_files_in_a_dir.php.txt to delete_millioon_of_files_in_a_dir.php and run it through a browser.

Note that you might need to run it multiple times, because many shared hosting servers are configured to kill a PHP script that keeps running for too long.
Alternatively, the script can be run through the shell with the PHP CLI:

php delete_millioon_of_files_in_a_dir.php.txt

5. So what is the "best" way to delete millions of files on Linux?

To find out which method is quicker in terms of execution time, I did some home-brewed benchmarking on my ThinkPad notebook.

a) Creating 509072 sample files

Again, I used a bash loop to create many thousands of files for the benchmark.
I didn't want to put this load on a production server, so I used my own notebook to conduct the benchmarks. As my notebook is not a server, the benchmarks might be partially inaccurate, but I believe they're still a pretty good indicator of which deletion method is better.

hipo@noah:~$ mkdir /tmp/test
hipo@noah:~$ cd /tmp/test;
hipo@noah:/tmp/test$ for i in $(seq 1 509072); do echo aaaa >> $i.txt; done

I had to wait a few minutes until the 509072 files were created. As you can see, each of the files contains the sample "aaaa" string.

b) Calculating the number of files in the directory

Once the command completed, to make sure all 509072 files really existed, I used find + wc to count the number of files contained in the directory:

hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f | wc -l
509072

real 0m1.886s
user 0m0.440s
sys 0m1.332s

It's interesting that using an ls command to count the files is less efficient than using find:

hipo@noah:/tmp/test$ time ls -1 |wc -l
509072

real 0m3.355s
user 0m2.696s
sys 0m0.528s
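
Part of ls's extra cost is that it sorts its output by default. A variant that should skip the sorting (not part of this benchmark) is ls's -f switch; note that -f also implies -a, so the count will include the . and .. entries and any hidden files:

hipo@noah:/tmp/test$ time ls -1f | wc -l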

c) Benchmarking the different file deletion methods with time

– Testing the deletion speed of find

hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f -delete
real 15m40.853s
user 0m0.908s
sys 0m22.357s

As you can see, using find to delete the files is neither too slow nor lightning quick.

– How fast is the perl loop at mass file deletion?

hipo@noah:/tmp/test$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'

real 6m24.669s
user 0m2.980s
sys 0m22.673s

Deleting my 509072 sample files took 6 minutes and 24 seconds. This is roughly two and a half times faster than find! GO-GO perl 🙂
As you can see from the results, perl is a great, time-saving way to delete 500,000 files.

– The approximate deletion rate of the for + rm bash loop

hipo@noah:/tmp/test$ time for i in *; do rm -f $i; done

real 206m15.081s
user 2m38.954s
sys 195m38.182s

You see, the execution took 206 minutes and 15 seconds of wall-clock time = about 3 HOURS and 26 MINUTES!!! This is extremely slow! But it works like a charm, as the deletion run didn't impact my normal laptop browsing. While the script was running I was mostly browsing through a few not-so-heavy (non-flash) websites and doing some other stuff in gnome-terminal 🙂

As you can imagine, running a bash loop is a bit CPU intensive, but it puts less stress on the hard disk's read/write operations. Therefore it's clearly good practice to use it whenever many files have to be deleted on a production (dedicated) server.

d) My production server file deletion experience

On a production server I tested only two of the listed methods. The production server where I tested is running Debian GNU/Linux Squeeze 6.0.3, and there I had the task of deleting a few million files.
The tested methods tried on the server were:

– The find . -type f -delete method.

– for i in *; do rm -f $i; done

The results of the find -delete method were quite sad, as the server almost hung under the heavy hard disk load the command produced.

With the for loop all went smoothly. The deletion ran for a long, long time (a few hours), but while it was running the server kept serving requests with no interruptions.

While the bash loop was running, the server load average stayed at a steady 4.
With my experience in mind: if you're running a production server and you're still wondering which deletion method to use to wipe a multitude of files, I would recommend you go the bash for loop + /bin/rm way. Yes, it is extremely slow; expect it to run for hours, but it does not put too much extra load on the server.

Using the PHP script will probably be slow and inefficient compared to both find and the bash loop. I haven't given it a try yet, but I suppose it will be either equal in time or a few times slower than bash.

If you have tried the PHP script and have some observations, please drop a comment to tell me how it performs.

To sum it up:

Even though there are "hacks" to clean up a messy directory full of a few million junk files, such a directory should never exist in the first place.

Frankly, keeping millions of files within the same directory is a very stupid idea.
Doing so will have a severe negative impact on your filesystem's directory listing performance in the long term.
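
If you control the application that creates the files, one common remedy is to shard them into subdirectories so that no single directory grows huge. A minimal bash sketch (the shard_ prefix and the two-character split are arbitrary choices for illustration):

for f in *; do
    d="shard_${f:0:2}"    # bucket named after the first two characters of the file name
    mkdir -p "$d"
    mv -- "$f" "$d/"
done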

If you know a better (more efficient) way to delete a multitude of files in a directory, please share it in the comments.

How to print simple text pages on the Linux console with an old LPR parallel-port-attached printer

Tuesday, February 7th, 2012

LPT parallel port pinout diagram with explanations

Many younger people might not know the lpr command; historically it was heavily used for printing in the early GNU/Linux days.
lpr ships the text to be printed to a printer physically attached to the LPT (Line Print Terminal) parallel port. Those who lived through the DOS era surely know that in those "ancient" days, everyone who wanted to print had to use the LPT parallel port.

These days, almost no modern printer attaches to the PC via the LPT port; rather, the USB port is used for communication between the printer and the computer.
Nevertheless, since USB printers on Linux are managed by CUPS, the lpr command is still functional, shipping the text to be printed via CUPS (the cups-lpd daemon). Before cups-lpd was introduced, the service managing print jobs was lpd.

To print a plain text file of one page with lpr on Linux:

linux:~# cat text-file-to-print.txt | lpr
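
Note that lpr can also take the file to print directly as an argument, so the cat pipe is optional:

linux:~# lpr text-file-to-print.txt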

To switch between multiple printers, there is the PRINTER shell variable:

linux:~# export PRINTER=printer-Name-and-Type
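
A per-job alternative to the PRINTER variable is lpr's -P switch, which selects the destination queue for a single print job only (the queue name below is just a placeholder):

linux:~# lpr -P printer-Name-and-Type text-file-to-print.txt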

To print a really long text file (a book in TXT), the pr command comes in handy. As you can read in its manual, pr converts text files for printing.

Let's say you would like to have 60 lines of text per printed page; the command to issue is:

linux:~# pr -l60 text-file-to-print.txt | lpr
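
pr can also stamp a custom header on every printed page with its -h option; for instance (the title string is just an example):

linux:~# pr -l60 -h "My Book Title" text-file-to-print.txt | lpr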

All queued print jobs can be reviewed with lpq; if you have a printer attached, try:

linux:~# lpq
lp is ready and printing
Rank Owner Job Files Total Size
active hipo 1 text-file-to-print.txt 62045 bytes

For some years now it has been pretty rare for people to use lpq; since most printing is managed by the CUPS server, what most people use nowadays to check the printer queue is lpstat, e.g.:

linux:~# lpstat
...

The printing status and all things related to queued print jobs get logged in /var/log/lpr.log

There is an even more simplistic way to print directly to the printer (if the printer is attached via an LPT port), through the kernel device /dev/lp, for example:

linux:~# cat text-file-to-print.txt >> /dev/lp

With more than one printer attached, the naming of /dev/lp will probably be /dev/lp0, /dev/lp1, etc.
The lprm command also exists, in case you would like to cancel a print job in the queue. Let's say I want to cancel the queued job with Job ID 5:

linux:~# lprm 5
...

To cancel a currently running job in the middle, the /usr/bin/cancel command exists.

An interesting historical fact is that nowadays, opening the manual of lpr, lpq or any of the other tools for simple text-mode printing, one sees Apple Inc. at the top of the page.

Let's clear this up: the CUPS (Common Unix Printing System) open source printing platform is not owned by Apple, since it is licensed under the GPL2 and LGPL. The reason Apple Inc. shows up in the man pages is that in 2007 Apple hired the founder of the CUPS printing server, Michael Sweet, "purchasing" the CUPS source. However, they did not really purchase the code in the usual sense, because the code already belonged to the community (licensed under the GPL2). Apple, however, used the fact that Sweet worked for them as a marketing trick and, probably as a matter of marketing, asked him to place Apple Inc. in the copyright and manual areas. This does not mean Apple Inc. holds exclusive copyright control over CUPS, as CUPS can be copied by anyone (it's open source) 😉

Most people will never print using these commands, since printing technology is now ages ahead. Anyway, for simple people (like me) who just need to print a text with no special fonts or graphics, text printing is just great.

Text printing is also a good learning experience for Linux novices, and is worth knowing as a piece of UNIX history.

How to install Samsung ML-2010 (ML-2010P) Mono Laser Printer on Xubuntu GNU/Linux

Wednesday, January 18th, 2012

I had to make an old Samsung ML-2010P laser printer work on Xubuntu Linux. I had some issues installing it and couldn't find any step-by-step tutorial online on how to make the printer work on Linux, so I took the time to experiment and see if I could make it work myself. Since the printer is old, not many people are interested in making it operational on Linux anymore, hence I couldn't find many relevant posts and sites on the net. Anyway, thank God, after a bit of pondering I finally succeeded in making the Samsung ML-2010P print on Linux. These are the exact steps to follow to make this old bunch of hardware play nice on Linux:

1. Use lsusb to identify the printer model

root@linux:~# lsusb |grep -i samsung
Bus 001 Device 003: ID 04e8:326c Samsung Electronics Co., Ltd ML-2010P Mono Laser Printer

You can see the printer reports itself as Samsung Electronics Co., Ltd ML-2010P Mono Laser Printer

2. Install the CUPS printing service required packages

root@linux:~# apt-get install cups cups-bsd cups-client cups-common
root@linux:~# apt-get install cups-driver-gutenprint ghostscript-cups
root@linux:~# apt-get install python-cups python-cupshelpers

3. Install foomatic packages

root@linux:~# apt-get install foomatic-db foomatic-db-engine foomatic-db-gutenprint
root@linux:~# apt-get install foomatic-filters python-foomatic

4. Install hpijs, hplip, printconf and other packages necessary for proper printer operation

root@linux:~# apt-get install hpijs hplip hplip-data ijsgutenprint
root@linux:~# apt-get install min12xxw openprinting-ppds printconf foo2zjs

P.S. Some of the packages I list might already have been installed as dependencies of other packages. As I'm writing this article a few days after I succeeded in installing the printer, I don't remember the exact install order.

5. Install splix (SPL Driver for Unix)

Here is a quote taken from SpliX's project website:

"SpliX is a set of CUPS printer drivers for SPL (Samsung Printer Language) printers.
If you have a such printer, you need to download and use SpliX. Moreover you will find documentation about this proprietary language.
"

root@linux:~# apt-get install splix

For more information on splix, check the SpliX (SPL driver for UNIX) website: http://splix.ap2c.org/

On the project's website you can check that the Samsung ML-2010 printer is marked as Working.
The next step is to configure the printer.

6. Go to the CUPS interface on localhost in a browser and add the Samsung printer

Use Firefox, SeaMonkey or any browser of choice to configure CUPS:

Type in the browser:

http://localhost:631

Next, a prompt will appear asking for a user/password. The credentials to use are the username and password of the user account you're logged in with.

Screenshot: adding the Samsung ML-2010 (ML-2010P) printer in the CUPS administration interface on Xubuntu

Click on the Add Printer button and choose to add the Samsung ML-2010.

Then restart the CUPS service (cupsd) to make it load the new settings:

root@linux:~# /etc/init.d/cups restart

Now give the printer a try by printing some page from SeaMonkey, Chrome or Firefox (the quickest way is by pressing CTRL + P).
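
A quick test print can also be sent straight from the terminal; the queue name below is a placeholder, so substitute whatever name you gave the printer when adding it (lpstat -p lists the configured queues):

root@linux:~# lpstat -p
root@linux:~# echo "CUPS test page" | lpr -P ML-2010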

Following these steps, I've managed to run the printer on Xubuntu Linux, though the same steps should most probably make the Samsung ML-2010 play nice with other Linux distributions with little or no adjustment.
I'll be glad to hear if someone has succeeded in making the printer work on another distribution; if so, please drop me a comment.
That's all folks! Enjoy printing 😉

How to convert any internet webpage to PDF from the command line on GNU/Linux

Friday, September 30th, 2011

Linux webpage HTML to PDF command line converter wkhtmltopdf

If you're looking for a command line utility to generate a PDF file out of any webpage located online, you are looking for wkhtmltopdf.
The tool converts webpages to PDF using Apple's open source WebKit rendering engine.
wkhtmltopdf is very useful for web developers, as some webpages have a requirement to dynamically produce PDFs of remote website locations.
wkhtmltopdf ships with Debian Squeeze 6 and the latest Ubuntu Linux versions, but has still not entered the Fedora and CentOS repositories.

To use wkhtmltopdf on Debian / Ubuntu distros, install it via apt:

linux:~# apt-get install wkhtmltopdf
...

Next, to convert a webpage of your choice, use the command:

linux:~$ wkhtmltopdf www.pc-freak.net www.pc-freak.net_website.pdf
Loading page (1/2)
Printing pages (2/2)
Done

If the webpage being snapshotted is a few pages long, wkhtmltopdf will generate a multi-page PDF.
wkhtmltopdf also supports creating the website snapshot with a specified orientation, Landscape or Portrait, by passing the -O option like so:

linux:~$ wkhtmltopdf -O Portrait www.pc-freak.net www.pc-freak.net_website.pdf

wkhtmltopdf has many useful options; here are some of them (a combined usage sketch follows the list):

  • Javascript disabling – disable javascript support for a website
  • Grayscale pdf generation – generate the PDF in grayscale
  • Low quality pdf generation – useful to shrink the size of the generated pdf
  • Set PDF page size (A4, Letter, etc.)
  • Add zoom to the generated pdf content
  • Support for HTTP password authentication
  • Support for use of the tool over a proxy
  • Generation of a Table of Contents based on titles (only in the static version)
  • Adding of headers and footers (only in the static version)
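
As a combined usage sketch of a few of the options above (option spellings can differ between wkhtmltopdf versions, so verify against wkhtmltopdf --help on your build):

linux:~$ wkhtmltopdf --grayscale --lowquality --disable-javascript --zoom 1.3 www.pc-freak.net www.pc-freak.net_website.pdf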

To generate an A4 page with wkhtmltopdf:

wkhtmltopdf -s A4 www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf

wkhtmltopdf looks promising but still seems a bit buggy; here is what happened when I tried to create a pdf without setting A4 page formatting:

linux:$ wkhtmltopdf www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf
Loading page (1/2)
OpenOffice path before fixup is '/usr/lib/openoffice' ] 71%
OpenOffice path is '/usr/lib/openoffice'
OpenOffice path before fixup is '/usr/lib/openoffice'
OpenOffice path is '/usr/lib/openoffice'
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
** (:12057): DEBUG: NP_Initialize
** (:12057): DEBUG: NP_Initialize succeeded
Printing pages (2/2)
Done
Printing pages (2/2)
Segmentation fault

The Debian and Ubuntu version of wkhtmltopdf does not support TOC generation or the adding of headers and footers; to get those features, one has to download and install the static version of wkhtmltopdf.
Using the static version of the tool is also the only option for anyone on Fedora or any other RPM-based Linux distro.
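
With the static build, generating a table of contents and a page header could look roughly like this (flag names as found in the static 0.9/0.10 builds of that time; newer releases moved the TOC into a separate toc object, so again check --help):

linux:~$ wkhtmltopdf --toc --header-center 'www.pc-freak.net' www.pc-freak.net/blog/ www.pc-freak.net_blog.pdf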