Archive for April 30th, 2010

Checking your website for broken Links on Linux with linkchecker and htcheck / How to find broken links on your website

Friday, April 30th, 2010

Have you wondered how could you check your website for broken links? Cause I did!
You might wonder why should I care so hard about the broken links?

Well it’s simple broken links on your webpage will have an influence on how Google indexesyour website, and how often does it bother to crawl on your website in other words, havinga eagly eye on your broken links interface is a vital for every self respecting web developer,as well as for system administrators.
From a web development perspective, it’s important that your website have as less 404 error pages possible since that is important for Webpages W3C Compliancy.
On other hand if you’re a SEO Specialist, having as less broken links on your domains is vital for Google Pageranking,Yahoo, Live.com, Altavista, Yandex etc, as well as for general Good Search Engine Indexing.
Having said all that you should already feel the topic is really interesting. I believe not many people has wrote stuff aboutit online.
That’s why I decided to share with you a possible way on how to track your broken web domain for broken pages on the Linux and possibly other Unix compatible architectures.

There are plenty of tools available that could be used for finding out the broken linkson your website using Linux operating system.

I used apt-get in order to look for a link checker software for Linux

noah:/home/hipo/Desktop# apt-cache search 'link check'
htcheck-php - Simple php interface to database generated by ht://Check
htcheck - Utility for checking web site for dead/external links
linkchecker - check websites and HTML documents for broken links
linklint - A fast link checker and web site maintenance tool

The 2 error link reporting tools I used were:

1. linkchecker

and

2. Htcheck

I’ll evaluate both of the tools and will share with you my impressions of the two really valuable,broken link checking tools for Linux.

Let me begin with a short introduction on what you could expect from the linkchecker broken links (error 404) links.

Here are the linkchecker Featuresrecursive and multithreaded checking
output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats
HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links support
restriction of link checking with regular expression filters for URLs
proxy support
username/password authorization for HTTP and FTP and Telnet
honors robots.txt exclusion protocol
Cookie support
HTML and CSS syntax check
Antivirus check
a command line interface
a GUI client interface
a (Fast)CGI web interface (requires HTTP server)

Luckily linchecker has a Debian package port so installing it comes as easily as executing:

root@noah:~# apt-get install linkchecker linkchecker-gui

However at the present moment on Debian Sid (Testing/Unstable) linkchecker-gui is missing some dependencies with libqt and python
so I was not able to install and test the Graphic User Interface for Linkchecker .Anyways here is a screenshot of the linkchecker GUI interface in order to give you a glipmse on what to expect if you succeed in installing it on Mac OS X or some other operating system.

Linkchecker GUI, Link Checker Graphic User Interface

Using linkchecker’s command line interface is really straight forward you just have too invoke the linkchecker command and pipe it too the tee shell command

Here is how:

root@noah:~# linkcheker https://www.pc-freak.net/ | tee -a www.pc-freak.net-broken-links-linkchecker.log

Though it’s simplicity to use from a first look checking the manual of linkchecker reveals quite many interesting usage parameters, so be sure also to take a look at the manual.
Of course it might be wise to combine linkchecker with some bash scripting in order to pereodically review your website or websites for broken links.

I intend to do that in the coming days so if I write some script that uses linkchecker and facilitates the search for a broken links I’ll post it on the blog.

Having said all that linkchecker goody, let me proceed further to Htcheck

HtCheck is really wondeful and it in a certain sense better than linkchecker, because it offers some extra possibilities like for instance generation of reports which could be stored in MySQL and could be visualized any time via a web browser.

Here is a descrition extracted from HtCheck’s website:

ht://Check is more than a link checker. It is a console application written for Linux systems in C++ and derived from ht://Dig.

It can retrieve information through HTTP/1.1 and store the information in a MySQL database, and it is particularly suitable for
small Internet domains or Intranet.

Its purpose is to help a webmaster manage one or more related sites: after a “crawl”, ht://Check gives back very useful summaries
and reports, including broken links, anchors not found, content-types and HTTP status codes summaries, etc.

From version 1.2.3, ht://Check also performs accessibility checks in accordance with the principles of the University of Toronto’s
Open Accessibility Checks (OAC) project, allowing users to discover site-wide barriers like images without proper alternatives,
missing titles, etc.

ht://Check can also be used for Web structure analysis, as it stores information regarding links between HTML documents.

I have to admit this htcheck-php is really handy! To use the extra php web interface to htcheck you’ll need the htcheck-php package installed.

To install both htcheck and it’s web interface on Debian you’ll need to issue the command:

root@noah:~# apt-get install htcheck htcheck-php

Now there are few more things to do before you could start using htcheck.
You’ll need to edit /etc/htcheck/htcheck.conf

There you will need to change at least the start_url variable.

Another necessery thing will be to use phpmyadmin or the console mysql client in order to create the required htcheck username and password and grant some relevant permissions to the htcheck user in MySQL.

Yet if you try to execute the htcheck binary (which by the way is written in C) to generate you will experience a problem with connecting to mysql’s database and you will most likely get the error message.

noah:/home/hipo# htcheck
Error (1045): Access denied for user 'root'@'localhost' (using password: NO)
! htcheck: Database error

That really pissed me off but anyways you’re lucky that I got it for you.
This whole issue is well documented in htcheck’s installation notes which you can read here

If you’re lazy reading the whole document just skip and read The Htcheck MySQL Connection Settings part

The solutions to the above pointed htcheck problem, where htcheck could not connect to the database is easily solvable, by creating a .my.cnf file in your home directory e.g. ~/ .

Let’s say you’re running with a root user the htcheck, all you need to do is edit /root/.my.cnf and place in it:

[client]
host=127.0.0.1
user=htcheck
password=yoursqlpassword

That’s it now issue again the htcheck command again, so that it could create the proper “htcheck” database (created by default) and store crawl your website for broken links and generate and store the reports in your MySQL server.

root@noah:~# htcheck -i

In the above example the “-i” option passed to htcheck will take care for “htcheck”‘s database to be rebuilt, that’s necessery especially if you made any changes in /etc/htcheck/htcheck.conf after the first time you have invoked on htcheck and you’d like the new configuration changes to reflected in the generated reports in MySQL.

However if you run the htcheck tool for a first time, you can start it without the “-i” flag.

In order to configure htcheck’s web reporting interface to be properly show website crawling statistics you’ll have also to edit /etc/htcheck/global.inc.php and set the username and password variables according to the ones you have previously choose while creating the htcheck’s MySQL username and password.

As a last step before you could use the htcheck’s Web gui interface through your browser is to either configure a virtualhost for htcheck in your Apache configuration or simply make an Apache Alias from your Apache configuration, on Debian, you’ll have to edit /etc/apache2/apache2.conf
Place the following Apache Alias in order to be able to access your htcheck’s statistics from your default configured Apache domain name.

Alias /usr/share/htcheck/php/ /htcheck/

In order to load the new Apache configurations as usual you’ll need an Apache WebServer restart

root@noah:~# /etc/init.d/apache2 restart

Here you can take a quick look what to expect from Htcheck’s PHP Web Error 404 reporting interface on a Debian GNU/Linux System:

htcheck php graphic user interface, htcheck gui interface

Now Enjoy htcheck neat Error page discover tool and it’s web statistics interface!