Posts Tagged ‘URLs’

How to make a mirror of website on GNU / Linux with wget / Few tips on wget site mirroring

Wednesday, February 22nd, 2012

Reading Time: 4minutes


Everyone who used Linux is probably familiar with wget or has used this handy download console tools at least thousand of times. Not so many Desktop GNU / Linux users like Ubuntu and Fedora Linux users had tried using wget to do something more than single files download.
Actually wget is not so popular as it used to be in earlier linux days. I've noticed the tendency for newer Linux users to prefer using curl (I don't know why).

With all said I'm sure there is plenty of Linux users curious on how a website mirror can be made through wget.
This article will briefly suggest few ways to do website mirroring on linux / bsd as wget is both available on those two free operating systems.

1. Most Simple exact mirror copy of website

The most basic use of wget's mirror capabilities is by using wget's -mirror argument:

# wget -m

Creating a mirror like this is not a very good practice, as the links of the mirrored pages will still link to external URLs. In other words link URL will not pointing to your local copy and therefore if you're not connected to the internet and try to browse random links of the webpage you will end up with many links which are not opening because you don't have internet connection.

2. Mirroring with rewritting links to point to localhost and in between download page delay

Making mirror with wget can put an heavy load on the remote server as it fetches the files as quick as the bandwidth allows it. On heavy servers rapid downloads with wget can significantly reduce the download server responce time. Even on a some high-loaded servers it can cause the server to hang completely.
Hence mirroring pages with wget without explicity setting delay in between each page download, could be considered by remote server as a kind of DoS – (denial of service) attack. Even some site administrators have already set firewall rules or web server modules configured like Apache mod_security which filter requests to IPs which are doing too frequent HTTP GET /POST requests to the web server.
To make wget delay with a 10 seconds download between mirrored pages use:

# wget -mk -w 10 -np --random-wait

The -mk stands for -m/-mirror and -k / shortcut argument for –convert-links (make links point locally), –random-wait tells wget to make random waits between o and 10 seconds between each page download request.

3. Mirror / retrieve website sub directory ignoring robots.txt "mirror restrictions"

Some websites has a robots.txt which restricts content download with clients like wget, curl or even prohibits, crawlers to download their website pages completely.

/robots.txt restrictions are not a problem as wget has an option to disable robots.txt checking when downloading.
Getting around the robots.txt restrictions with wget is possible through -e robots=off option.
For instance if you want to make a local mirror copy of the whole sub-directory with all links and do it with a delay of 10 seconds between each consequential page request without reading at all the robots.txt allow/forbid rules:

# wget -mk -w 10 -np -e robots=off --random-wait

4. Mirror website which is prohibiting Download managers like flashget, getright, go!zilla etc.

Sometimes when try to use wget to make a mirror copy of an entire site domain subdirectory or the root site domain, you get an error similar to:

Sorry, but the download manager you are using to view this site is not supported.
We do not support use of such download managers as flashget, go!zilla, or getright

This message is produced by the site dynamic generation language PHP / ASP / JSP etc. used, as the website code is written to check on the browser UserAgent sent.
wget's default sent UserAgent to the remote webserver is:

As this is not a common desktop browser useragent many webmasters configure their websites to only accept well known established desktop browser useragents sent by client browsers.
Here are few typical user agents which identify a desktop browser:

  • Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0
  • Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0
  • Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4
  • Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.2a1pre) Gecko/20110324 Firefox/4.2a1pre

etc. etc.

If you're trying to mirror a website which has implied some kind of useragent restriction based on some "valid" useragent, wget has the -U option enabling you to fake the useragent.

If you get the Sorry but the download manager you are using to view this site is not supported , fake / change wget's UserAgent with cmd:

# wget -mk -w 10 -np -e robots=off \
--referer="" \--user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20070725 Firefox/" \--header="Accept:text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \--header="Accept-Language: en-us,en;q=0.5" \--header="Accept-Encoding: gzip,deflate" \--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \--header="Keep-Alive: 300"

For the sake of some wget anonimity – to make wget permanently hide its user agent and pretend like a Mozilla Firefox running on MS Windows XP use .wgetrc like this in home directory.

5. Make a complete mirror of a website under a domain name

To retrieve complete working copy of a site with wget a good way is like so:

# wget -rkpNl5 -w 10 --random-wait

Where the arguments meaning is:
-r – Retrieve recursively
-k – Convert the links in documents to make them suitable for local viewing
-p – Download everything (inline images, sounds and referenced stylesheets etc.)
-N – Turn on time-stamping
-l5 – Specify recursion maximum depth level of 5

6. Make a dynamic pages static site mirror, by converting CGI, ASP, PHP etc. to HTML for offline browsing

It is often websites pages are ending in a .php / .asp / .cgi … extensions. An example of what I mean is for instance the URL You see the url page is tutorial.php once mirrored with wget the local copy will also end up in .php and therefore will not be suitable for local browsing as .php extension is not understood how to interpret by the local browser.
Therefore to copy website with a non-html extension and make it offline browsable in HTML there is the –html-extension option e.g.:

# wget -mk -w 10 -np -e robots=off \
--random-wait \

A good practice in mirror making is to set a download limit rate. Setting such rate is both good for UP and DOWN side (the local host where downloading and remote server). download-limit is also useful when mirroring websites consisting of many enormous files (documental movies, some music etc.).
To set a download limit to add –limit-rate= option. Passing by to wget –limit-rate=200K would limit download speed to 200KB.

Other useful thing to assure wget has made an accurate mirror is wget logging. To use it pass -o ./my_mirror.log to wget.

Remove URL from comments in WordPress Blogs and Websites to mitigate comment spam URLs in pages

Friday, February 20th, 2015

Reading Time: 5minutes

If you're running a WordPress blog or Website where you have enabled comments for a page and your article or page is well indexing in Google (receives a lot of visit / reads ) daily, your site posts (comments) section is surely to quickly fill in with a lot of "Thank you" and non-sense Spam comments containing an ugly link to an external SPAM or Phishing website.

Such URL links with non-sense message is a favourite way for SPAMmers to raise their website incoming (other website) "InLinks" and through thatincrease current Search Engine position. 

We all know a lot of comments SPAM is generally handled well by Akismet but unfortunately still many of such spam comments fail to be identified as Spam  because spam Bots (text-generator algorithms) becomes more and more sophisticated with time, also you can never stop paid a real-persons Marketers to spam you with a smart crafted messages to increase their site's SEO ).
In all those cases Akismet WP (Anti-Spam) plugin – which btw is among the first "must have"  WP extensions to install on a new blog / website will be not enough ..

To fight with worsening SEO because of spam URLs and to keep your site's SEO better (having a lot of links pointing to reported spam sites will reduce your overall SEO Index Rate) many WordPress based bloggers, choose to not use default WordPress Comments capabilities – e.g. use exnternal commenting systems such as Disqus – (Web Community of Communities), IntenseDebate, LiveFyre, Vicomi

However as Disqus and other 3rd party commenting systems are proprietary software (you don't have access to comments data as comments are kept on proprietary platform and shown from there), I don't personally recommend (or use) those ones, yes Disqus, Google+, Facebook and other comment external sources can have a positive impact on your SEO but that's temporary event and on the long run I think it is more advantageous to have comments with yourself.
A small note for people using Disqos and Facebook as comment platforms – (just imagine if Disqos or Facebook bankrupts in future, where your comments will be? 🙂 )

So assuming that you're a novice blogger and I succeeded convincing you to stick to standard (embedded) WordPress Comment System once your site becomes famous you will start getting severe amount of comment spam. There is plenty of articles already written on how to remove URL comment form spam in WordPress but many of the guides online are old or obsolete so in this article I will do a short evaluation on few things I tried to remove comment spam and how I finally managed to disable URL link spam to appear on site.

1. Hide Comment Author Link (Hide-wp-comment-author-link)

This plugin is the best one I found and I started using it since yesterday, I warmly recommend this plugin because its very easy, Download, Unzip, Activate and there you're anything typed in URL field will no longer appear in Posts (note that the URL field will stay so if you want to keep track on person's input URL you can get still see it in Wp-Admin). I'm using default WordPress WRC (Kubrick), but I guess in most newer wordpress plugins is supposed to work. If you test it on another theme please drop a comment to inform whether works for you.  Hide Comment Author Linkworks on current latest Wordpress 4.1 websites.

A similar plugin to hide-wp-author-link that works and you can use is  Hide-n-Disable-comment-url-field, I tested this one but for some reason I couldn't make it work.

Whatever I type in Website field in above form, this is wiped out of comment once submitted 🙂

2. Disable hide Comment URL (disable-hide-comment-url)

I've seen reports disable-hide-comment-url works on WordPress 3.9.1, but it didn't worked for me, also the plugin is old and seems no longer maintaned (its last update was 3.5 years ago), if it works for you please please drop in comment your WP version, on WP 4.1 it is not working.


3. WordPress Anti-Spam plugin

WordPress Anti-Spam plugin is a very useful addition plugin to install next to Akismet. The plugin is great if you don't want to remove commenter URL to show in the post but want to cut a lot of the annoying Spam Robots crawling ur site.

Anti-spam plugin blocks spam in comments automatically, invisibly for users and for admins.

  • no captcha, because spam is not users' problem
  • no moderation queues, because spam is not administrators' problem
  • no options, because it is great to forget about spam completely

Plugin is easy to use: just install it and it just works.

Anti bot works fine on WP 4.1

4. Stop Spam Comments

Stop Spam Comments is:

  • Dead simple: no setup required, just activate it and enjoy your spam-free website.
  • Lightweight: no additional database queries, it doesn't add script files or other assets in your theme. This means your website performance will not be affected and your server will thank you.
  • Invisible by design: no captchas, no tricky questions or any other user interaction required at all.

Stop Spam Comments works fine on WP 4.1.

I've mentioned few of the plugins which can help you solve the problem, but as there are a lot of anti-spam URL plugins available for WP its up to you to test and see what fits you best. If you know or use some other method to protect yourself from Comment Url Spam to share it please.

Import thing to note is it usually a bad idea to mix up different anti-spam plugins so don't enable both Stop Spam Comments and WordPress Anti Spam plugin.

5. Comment Form Remove Url field Manually 

This (Liberian) South) African blog describes a way how to remove URL field URL manually

In short to Remove Url Comment Field manually either edit function.php (if you have Shell SSH access) or if not do it via Wp-Admin web interface:

WordPress admin page –> Appearance –> Editor

Paste at the end of file following PHP code:


add_filter('comment_form_default_fields', 'remove_url');
 function remove_url($fields)
 return $fields;

Now to make changes effect, Restart Apache / Nginx Webserver and clean any cache if you're using a plugin like W3 Total Cache plugin etc.

Other good posts describing some manual and embedded WordPress ways to reduce / stop comment spam is here and here, however as it comes to my blog, none of the described manual (code hack) ways I found worked on WordPress v. 4.1.
Thus I personally stuck to using Hide and Disable Comment URL plugin  to get rid of comment website URL.

Preserve domain name after redirect with mod_rewrite and some useful mod rewrite redirect and other examples – Redirect domain without changing URL

Friday, July 11th, 2014

Reading Time: 4minutes

If you're a webhosting company sysadmin, sooner or later you will be asked by application developer or some client to redirect from an Apache webserver to some other webserver / URL's IP, in a way that the IP gets preserved after the redirect.

I'm aware of two major ways to do the redirect on webserver level:

1. To redirect From Apache host A to Webserver on host B using ReverseProxy mod_proxy

2. To use Mod Rewrite to redirect all client requests on host A to host B.

There is quite a lot to be said and is said and written online on using mod_rewrite to redirect URLs.
So in this article I will not say nothing new but just present some basic scenarios on Redirecting with mod rewrite and some use cases.
Hope this examples, will help some colleague sys-admin to solve some his crazy boss redirection tasks 🙂 I'm saying crazy boss because I already worked for a  start-up company which was into internet marketing and the CEO has insane SEO ideas, often impossible to achieve …

a) Dynamic URL Redirect from Apache host A to host B without changing domain name in browser URL and keeping everything after the query in

Lets say you want to redirect incoming traffic to DomainA to DomainB keeping whole user browser request, i.e.


Passthe the whole request including /whole/a/lot/of/sub/directory/query.php

so when Apache redirects to redirect to:

In browser 
To do it with Mod_Rewrite either you have to add in .htaccess mod_rewite rules:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^ [OR]
RewriteCond %{HTTP_HOST} ^
RewriteRule ^(.*)$1 [P]

or include this somewhere in VirtualHost configuration of your domain

Above mod_rewrite will make any request to to forward to while preserving the hostname in browser URL bar to old domain, however still contet will be served by

to redirect to

WARNING !!  If you're concerned about your SEO well positioning in search Engines, be sure to never ever use such redirects. Making such redirects will cause two domains to show up duplicate content
and will make Search Engines to reduce your Google, Yahoo, Yandex etc. Pagerank!!

Besides that such,redirect will use mod_rewrite on each and every redirect so from performance stand point it is a CPU killer (for such redirect using native mod_proxy ProxyPass is much more efficient – on websites with hundred of thousands of requests daily using such redirects will cause you to spend your  hardware badly  …)

P.S. ! Mod_Rewrite and Proxy modules needs to be previously enabled
On Debian Linux, make sure following links are existing and pointing to proper existing files from /etc/apache2/mods-available/ to /etc/apache2/mods-enabled

debian:~#  ls -al /etc/apache2/mods-available/*proxy*
-rw-r–r– 1 root root  87 Jul 26  2011 /etc/apache2/mods-available/proxy_ajp.load
-rw-r–r– 1 root root 355 Jul 26  2011 /etc/apache2/mods-available/proxy_balancer.conf
-rw-r–r– 1 root root  97 Jul 26  2011 /etc/apache2/mods-available/proxy_balancer.load
-rw-r–r– 1 root root 803 Jul 26  2011 /etc/apache2/mods-available/proxy.conf
-rw-r–r– 1 root root  95 Jul 26  2011 /etc/apache2/mods-available/proxy_connect.load
-rw-r–r– 1 root root 141 Jul 26  2011 /etc/apache2/mods-available/proxy_ftp.conf
-rw-r–r– 1 root root  87 Jul 26  2011 /etc/apache2/mods-available/proxy_ftp.load
-rw-r–r– 1 root root  89 Jul 26  2011 /etc/apache2/mods-available/proxy_http.load
-rw-r–r– 1 root root  62 Jul 26  2011 /etc/apache2/mods-available/proxy.load
-rw-r–r– 1 root root  89 Jul 26  2011 /etc/apache2/mods-available/proxy_scgi.load

debian:/etc/apache2/mods-avaialble:~# ls *proxy*
proxy.conf@  proxy_connect.load@  proxy_http.load@  proxy.load@

Ifit is is not enabledto enable proxy support in Apache on Debian / Ubuntu Linux, either create the symbolic links as you see them from above paste or issue with root:

a2enmod proxy_http
a2enmod proxy


b) Redirect Main Domain requests to other Domain specific URL

RewriteEngine On
RewriteCond %{HTTP_HOST} ^
RewriteRule ^(.*) [P]

Note that no matter what kind of subdirectory you request on (lets say you type in ) it will get redirected to:

Sometimes this is convenient for SEO, because it can make you to redirect any requests (including mistakenly typed requests by users or Bot Crawlers to real existing landing page).

c) Redirecting an IP address to a Domain Name

This probably a very rare thing to do as usually a Domain Name is redirected to an IP, however if you ever need to redirect IP to Domain Name:

RewriteCond %{HTTP_HOST} ^##.##.##.##
RewriteRule (.*)$1 [R=301,L]

Replace ## with digits of your IP address, the is used to escape the (.) – dots are normally interpreted by mod_rewrite.

d) Rewritting URL extensions from .htm to .php, doc to docx etc.

Lets say you're updating an old website with .htm or .html to serve .php files with same names as old .htmls use following rewrite rules:. Or all your old .doc files are convertedand replaced with .docx and you need to make Apache redirect all .doc requests to .docx.

Options +FollowSymlinks
RewriteEngine on
RewriteRule ^(.*).html$ $1.php [NC]

Options +FollowSymlinks
RewriteEngine on
RewriteRule ^(.*).doc$ $1.docx [NC]

The [NC] flag at the end means "No Case", or "case-insensitive"; Meaning it will not matter whether files are requested with capital or small letters, they will just show files if file under requested name is matched.

Using such a redirect will not cause Apache to redirect old files .html, .htm, .doc and they will still be accessible again creating duplicate content which will have a negavite impact on Search Engine Optimization.

The better way to do old extensioned files redirect is by using:

Options +FollowSymlinks
RewriteEngine on
RewriteRule ^(.+).htm$$1.php [R,NC]

[R] flag would tell make mod_rewrite send HTTP "MOVED TEMPORARILY" redirection, aka, "302" to browser. This would cause search engines and other spidering entities will automatically update their links to the new locations.

e) Grabbing content from URL with Mod Rewrite and passing it to another domain

Lets say you want zip files contained in directory files/ to be redirected from your current webserver on domainA to domainB's download.php script and be passed as argument to the script

Options +FollowSymlinks
RewriteEngine on
RewriteRule ^files/([^/]+)/([^/]+).zip$1&file=$2 [R,NC]

f) Shortening URLs with mod_rewrite

This is ueful If you have a long URL address accessible via some fuzzy long hard to remember URL address and you want to make it acessible via a shorter URL without phyisally moving the files within a short named directory, do:

Options +FollowSymlinks
RewriteEngine On
RewriteRule ^james-brown /james-brown/files/download/download.php

Above rule would make requests coming to be opened via http://mysite/public/james-brown/files/download/download.php?

g) Get rid of the www in your domain name

Nowdays many people are used to typing, if this annoys you and you want them not to see in served URLs the annoying www nonsense, use this:

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{http_host} ^ [NC]
RewriteRule ^(.*)$$1 [R=301,NC]

That's mostly some common uses of mod rewrite redirection, there are thousands of nice ones. If you know others, please share?

References and thanks to:

How to redirect domain without changing the URL

More .htaccess tips and tricks – part 2



How to fix bug with WordPress domain extra trailing slash (Double wordpress trailing slash)

Monday, July 9th, 2012

Reading Time: 2minutes

How to fix bug with wordpress extra slash, domain double slash issue pic

2 of the wordpress installations, I take care for had been reported an annoying bug today by some colleagues.
The bug consisted in double trailing slash at the end of the domain url e.g.;

As a result in the urls everywhere there was the double trailing slash appearing i.e.::


The bug was reported to happen in the multiolingual version of the wordpress based sites, as the Qtranslate plugin is used on this installations to achieve multiple languages it seemed at first logical that the double slash domain and url wordpress issues are caused for some reason by qTranslate.

Therefore, I initially looked for the cause of the problem, within the wordpress admin settings for qTranslate plugin. After not finding any clue pointing the bug to be related to qTranslate, I've then checked the settings for each individual wordpress Page and Post (There in posts usually one can manually set the exact url pointing to each post and page).
The double slash appeared also in each Post and Page and it wasn't possible to edit the complete URL address to remove the double trailin slashes. My next assumption was the cause for the double slash appearing on each site link is because of something wrong with the sites .htaccess, therefore I checked in the wp main sites directory .htaccess
Strangely .htacces seemed OKAY and there was any rule that somehow might lead to double slashes in URL. WP-sites .htaccess looked like so:

server:/home/wp-site1/www# cat .htaccess
RewriteEngine On
RewriteBase /

# Rewrite rules for new content and scripts folder
RewriteRule ^jscripts/(.*)$ wp-includes/js/$1
RewriteRule ^gallery/(.*)$ wp-content/uploads/$1
RewriteRule ^modules/(.*)$ wp-content/plugins/$1
RewriteRule ^gui/(.*)/(.*)$ wp-content/themes/$1/$2 [L]

# Disable direct acceees to wp files if referer is not valid
#RewriteCond %{THE_REQUEST} .wp-*
#RewriteCond %{REQUEST_URI} .wp-*
#RewriteCond %{REQUEST_URI} !.*media-upload.php.*
#RewriteCond %{HTTP_REFERER} !.*cadia.*
#RewriteRule . /error404 [L]

# Standard WordPress rewrite
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

Onwards, I thought a possible way to fix bug by adding mod_rewrite rules in .htaccess which would do a redirect all requests to to etc. like so:

RewriteRule ^/(.*)$ /$1

This for unknown reasons to me didn't worked either, finally thanks God I remembered to check the variables in wp-config.php (some month ago or so I added there some variables in order to improve the wordpress websites opening times).

I've figured out I did a mistake in one of the variables by adding an ending slash to the URL. The variable added was:


whether instead it should be without the ending trailing slash like so:


By removing the ending trailing slash:



fixed the issue.
Cheers 😉

How to list Files in a directory and generate web URLS with PHP

Tuesday, April 10th, 2012

Reading Time: < 1minuteI needed a short PHP script that reads all, my .html files in a directory and then generates html a hrefs links pointing to each of the html files stored in the directory.

Here is the short code I come up:

if ($handle = opendir(“$directory_to_open”)) {
while (false !== ($file = readdir($handle)) && $i <= $max_files)
if ($file != “.” && $file != “..”)
$thelist .= ‘| ‘.str_replace(“.html”,””,$file).’ |’;
echo “$thelist”; }

In my case the directories with html were planned to contain, less than 100 files a directory, so in order to show links to only the first 100 files in the directory, I used the $max_files=100 and a check if value is reached in the while loop. For anyone who want to build html you see in above while if $max_files is reached then the while loop exits.

Because by default the files returned contained the naming format file_name.html, whether I wanted to show only the file name without the .html extensions used str_replace(); to get rid of file extensions string.