Posts Tagged ‘php extension’

How to make a mirror of website on GNU / Linux with wget / Few tips on wget site mirroring

Wednesday, February 22nd, 2012

how-to-make-mirror-of-website-on-linux-wget

Everyone who used Linux is probably familiar with wget or has used this handy download console tools at least thousand of times. Not so many Desktop GNU / Linux users like Ubuntu and Fedora Linux users had tried using wget to do something more than single files download.
Actually wget is not so popular as it used to be in earlier linux days. I've noticed the tendency for newer Linux users to prefer using curl (I don't know why).

With all said I'm sure there is plenty of Linux users curious on how a website mirror can be made through wget.
This article will briefly suggest few ways to do website mirroring on linux / bsd as wget is both available on those two free operating systems.

1. Most Simple exact mirror copy of website

The most basic use of wget's mirror capabilities is by using wget's -mirror argument:

# wget -m http://website-to-mirror.com/sub-directory/

Creating a mirror like this is not a very good practice, as the links of the mirrored pages will still link to external URLs. In other words link URL will not pointing to your local copy and therefore if you're not connected to the internet and try to browse random links of the webpage you will end up with many links which are not opening because you don't have internet connection.

2. Mirroring with rewritting links to point to localhost and in between download page delay

Making mirror with wget can put an heavy load on the remote server as it fetches the files as quick as the bandwidth allows it. On heavy servers rapid downloads with wget can significantly reduce the download server responce time. Even on a some high-loaded servers it can cause the server to hang completely.
Hence mirroring pages with wget without explicity setting delay in between each page download, could be considered by remote server as a kind of DoS – (denial of service) attack. Even some site administrators have already set firewall rules or web server modules configured like Apache mod_security which filter requests to IPs which are doing too frequent HTTP GET /POST requests to the web server.
To make wget delay with a 10 seconds download between mirrored pages use:

# wget -mk -w 10 -np --random-wait http://website-to-mirror.com/sub-directory/

The -mk stands for -m/-mirror and -k / shortcut argument for –convert-links (make links point locally), –random-wait tells wget to make random waits between o and 10 seconds between each page download request.

3. Mirror / retrieve website sub directory ignoring robots.txt "mirror restrictions"

Some websites has a robots.txt which restricts content download with clients like wget, curl or even prohibits, crawlers to download their website pages completely.

/robots.txt restrictions are not a problem as wget has an option to disable robots.txt checking when downloading.
Getting around the robots.txt restrictions with wget is possible through -e robots=off option.
For instance if you want to make a local mirror copy of the whole sub-directory with all links and do it with a delay of 10 seconds between each consequential page request without reading at all the robots.txt allow/forbid rules:

# wget -mk -w 10 -np -e robots=off --random-wait http://website-to-mirror.com/sub-directory/

4. Mirror website which is prohibiting Download managers like flashget, getright, go!zilla etc.

Sometimes when try to use wget to make a mirror copy of an entire site domain subdirectory or the root site domain, you get an error similar to:

Sorry, but the download manager you are using to view this site is not supported.
We do not support use of such download managers as flashget, go!zilla, or getright

This message is produced by the site dynamic generation language PHP / ASP / JSP etc. used, as the website code is written to check on the browser UserAgent sent.
wget's default sent UserAgent to the remote webserver is:
Wget/1.11.4

As this is not a common desktop browser useragent many webmasters configure their websites to only accept well known established desktop browser useragents sent by client browsers.
Here are few typical user agents which identify a desktop browser:
 

  • Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0
  • Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0
  • Mozilla/6.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4
  • Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.2a1pre) Gecko/20110324 Firefox/4.2a1pre

etc. etc.

If you're trying to mirror a website which has implied some kind of useragent restriction based on some "valid" useragent, wget has the -U option enabling you to fake the useragent.

If you get the Sorry but the download manager you are using to view this site is not supported , fake / change wget's UserAgent with cmd:

# wget -mk -w 10 -np -e robots=off \
--random-wait
--referer="http://www.google.com" \--user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" \--header="Accept:text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" \--header="Accept-Language: en-us,en;q=0.5" \--header="Accept-Encoding: gzip,deflate" \--header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" \--header="Keep-Alive: 300"

For the sake of some wget anonimity – to make wget permanently hide its user agent and pretend like a Mozilla Firefox running on MS Windows XP use .wgetrc like this in home directory.

5. Make a complete mirror of a website under a domain name

To retrieve complete working copy of a site with wget a good way is like so:

# wget -rkpNl5 -w 10 --random-wait www.website-to-mirror.com

Where the arguments meaning is:
-r – Retrieve recursively
-k – Convert the links in documents to make them suitable for local viewing
-p – Download everything (inline images, sounds and referenced stylesheets etc.)
-N – Turn on time-stamping
-l5 – Specify recursion maximum depth level of 5

6. Make a dynamic pages static site mirror, by converting CGI, ASP, PHP etc. to HTML for offline browsing

It is often websites pages are ending in a .php / .asp / .cgi … extensions. An example of what I mean is for instance the URL http://php.net/manual/en/tutorial.php. You see the url page is tutorial.php once mirrored with wget the local copy will also end up in .php and therefore will not be suitable for local browsing as .php extension is not understood how to interpret by the local browser.
Therefore to copy website with a non-html extension and make it offline browsable in HTML there is the –html-extension option e.g.:

# wget -mk -w 10 -np -e robots=off \
--random-wait \
--convert-links http://www.website-to-mirror.com

A good practice in mirror making is to set a download limit rate. Setting such rate is both good for UP and DOWN side (the local host where downloading and remote server). download-limit is also useful when mirroring websites consisting of many enormous files (documental movies, some music etc.).
To set a download limit to add –limit-rate= option. Passing by to wget –limit-rate=200K would limit download speed to 200KB.

Other useful thing to assure wget has made an accurate mirror is wget logging. To use it pass -o ./my_mirror.log to wget.
 

Enable Apache libphp extension to interpret PHP scripts on FreeBSD 9.1

Saturday, January 12th, 2013

Enable php scripts to be interpreted / executed by PHP on freebsd
First you have to have installed and properly set up Apache from port, in my case this is Apache:

 

freebsd# pkg_info | grep -i apache
ap22-mod_fastcgi-2.4.6_3 A fast-cgi module for Apache
apache22-2.2.23_4   Version 2.2.x of Apache web server with prefork MPM.
apr-1.4.6.1.4.1_3   Apache Portability Library

I've installed it from source port /usr/ports/www/apache22, with:

freebsd# cd /usr/ports/www/apache22;
freebsd# make install clean
.....
Then to be able to start Apache from init script and make it run automatically on FBSD system reboot:

 

echo 'apache22_enable="YES"' >> /etc/rc.conf

I've also installed php5-extensions port;

freebsd# cd /usr/ports/lang/php5-extensions/
freebsd# make install clean
....
freebsd# cp -rpf /usr/local/etc/php.ini-production /usr/local/etc/php.ini

I had to select the exact Apache PHP library extensions I need, after selecting and installing, here is the list of PHP extensions installed on system:

freebsd# pkg_info | grep -i php5
php5-5.4.10         PHP Scripting Language
php5-bz2-5.4.10     The bz2 shared extension for php
php5-ctype-5.4.10   The ctype shared extension for php
php5-dom-5.4.10     The dom shared extension for php
php5-filter-5.4.10  The filter shared extension for php
php5-gd-5.4.10      The gd shared extension for php
php5-gettext-5.4.10 The gettext shared extension for php
php5-hash-5.4.10    The hash shared extension for php
php5-iconv-5.4.10   The iconv shared extension for php
php5-json-5.4.10    The json shared extension for php
php5-mbstring-5.4.10 The mbstring shared extension for php
php5-mcrypt-5.4.10  The mcrypt shared extension for php
php5-mysql-5.4.10   The mysql shared extension for php
php5-pdo-5.4.10     The pdo shared extension for php
php5-pdo_sqlite-5.4.10 The pdo_sqlite shared extension for php
php5-phar-5.4.10    The phar shared extension for php
php5-posix-5.4.10   The posix shared extension for php
php5-session-5.4.10 The session shared extension for php
php5-simplexml-5.4.10 The simplexml shared extension for php
php5-sqlite3-5.4.10 The sqlite3 shared extension for php
php5-tokenizer-5.4.10 The tokenizer shared extension for php
php5-xml-5.4.10     The xml shared extension for php
php5-xmlreader-5.4.10 The xmlreader shared extension for php
php5-xmlwriter-5.4.10 The xmlwriter shared extension for php
php5-zip-5.4.10     The zip shared extension for php
php5-zlib-5.4.10    The zlib shared extension for php

By default DirectoryIndex is not set to process index.php and .php, file extensions will not be interpreted by libphp, instead requests to .php, just opens them as plain text files.

In Apache config httpd.conf, libphp5 module should be displaying as loaded, like so:

freebsd# grep -i php5 /usr/local/etc/apache22/httpd.conf
LoadModule php5_module        libexec/apache22/libphp5.so

Next step find in /usr/local/etc/apache22/httpd.conf lines:

<IfModule dir_module>

DirectoryIndex index.html

Change

DirectoryIndex index.html

to

DirectoryIndex index.php index.html

(If you would like index.php to be processed as primary whether an Apache directory contains both .php and .html files.

After DirectoryIndex index.php, paste following;

<IfModule mod_dir.c>
    <IfModule mod_php3.c>
        <IfModule mod_php5.c>
            DirectoryIndex index.php index.php3 index.html
        </IfModule>
        <IfModule !mod_php4.c>
            DirectoryIndex index.php3 index.html
        </IfModule>
    </IfModule>
    <IfModule !mod_php3.c>
        <IfModule mod_php5.c>
            DirectoryIndex index.php index.html index.htm
        </IfModule>
        <IfModule !mod_php4.c>
            DirectoryIndex index.html
        </IfModule>
    </IfModule>
</IfModule>

Open /usr/local/etc/apache22/httpd.conf. I use vim so:

vim /usr/local/etc/apache22/httpd.conf

and press CTRL+g to go to last line of file. Last line is:

Include etc/apache22/Includes/*.conf

I press I to insert text and paste before the line:

AddType application/x-httpd-php .php
AddType application/x-httpd-php-source .phpS

AddType application/x-httpd-php .php .htm .html

And save with the usual vim esc+ :x! (exit, save and overwrite changes command).

Then restart Apache to load new settings, after testing Apache config is okay;

freebsd# apache2ctl -t
Syntax OK
freebsd# /usr/local/sbin/apachectl -k restart

To test php conduct the usual test if php is interpretting code with phpinfo(); by creating  file php_info.php and pasting inside:

<?php
phpinfo();
?>

One important note, to make here is if you try to use phpinfo(); test code like:

<?
phpinfo();
?>

You will get in your browser empty pages content – which usually appear only, if PHP execution fails. Even if you try to enable PHP errors to be displayed in browser by setting

display_errors = On in /usr/local/etc/php.ini, or configuring a separate php error_log file with setting variable error_log, i.e.:

error_log = /var/log/php_error.log

No error or warning is to be both displayed in browser or recorded in log. After consulting in irc.freenode.net #php, I was pointed out by nezZario that this unusual behavior is normal for PHP 5.4, as well as explained this behavior is controlled by var called:

Short Open Tags

To enable Short Open Tags to interpret PHP code inside <? set in /usr/local/etc/php.ini

short_open_tag = On