How Malcolm sets up his laptop for fairs

Real snappy title 8-) ... This is in three parts: first mirroring the web sites, second running the Apache web server (that's the easy part), and third setting up Apache to pretend to be 100 assorted web sites while working offline.

Just remember that I make the mirror copies at work (10Mbit/sec connection) and then take the files home on a CD (currently as one 100MB, one 257MB and one 264MB .zip file, because each .zip file can't hold more than 64k files!). This unpacks to pretty much a full GB of hard disk space. I don't recommend anything more than a refresh of one or two sites over the telephone line!

I'm not offering to make fresh mirror copies on CD for every fair, but if I make one for a fair myself then I'm prepared to circulate that CD while it is still passably fresh.

Mirroring the web sites

I make my mirror copies of web sites with a tool called wget. It is a Unix tool at heart but the Windows port has served me well. You can read about wget at http://sunsite.dk/wget/index.html and you can download the Windows binary from http://www.geocities.com/heiko_herold/. Make sure you get wget v1.6 (or later, eventually), because v1.5.3 does not fetch the .css files that some sites need to display properly.

I keep my wget settings in a file (mine is called mirror.wrc, contents shown below) and tell wget to use it (this is in a DOS box) with set wgetrc=mirror.wrc. Then I invoke wget with something like wget http://www.genuki.org.uk/ -o genuki.log.

mirror=on
convert_links=on
ignore_length=on
reclevel=99
no_parent=on
exclude_directories=/cgi-bin
reject=*\?*

I limit the recursion depth because I have seen some pathological (massively crosslinked) sites cause wget to blow its stack memory allocation.

The convert_links option is set on so that when all files have been fetched from a site, wget runs through all the files and changes all references within the mirrored site (only this one, not the whole collection!) to use relative paths rather than absolute paths. This ensures that the mirror behaves when it is served up with extra directories between it and the server root.
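To make the idea of "relative rather than absolute" concrete, here is a minimal Python sketch of the path arithmetic involved. It is not wget's own code, just an illustration of the kind of conversion described above, and the page and image paths in it are made up for the example.

```python
import posixpath

def relative_link(page_path, target_path):
    """Rewrite an absolute site path as a path relative to the linking page.

    page_path   -- site-absolute path of the page holding the link
    target_path -- site-absolute path the link points at
    """
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath(target_path, page_dir)

# Hypothetical paths: a page two directories down linking to a top-level image.
print(relative_link("/big/eng/index.html", "/images/logo.gif"))
```

Because the converted link climbs out of the page's own directory instead of starting at the server root, it keeps working when the whole site is parked in a subdirectory such as mirror/www.genuki.org.uk/.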

The convert_links option is, of course, liable to change the size of the file by a few bytes. Therefore the ignore_length option is turned on to prevent wget refetching the file on the basis of the local and remote copy being of differing sizes. (The mirror option has turned on the checking of timestamps.)

I run this in a 'mirror' directory and actually collect the data by leaving one or more of three small .bat files running when I leave work in the evening. I will make these files available when I have 'cleaned them up' a little. Note that after the first mirror run, subsequent runs will just refresh the copy on disk ... except for those nasty servers that don't deliver timestamps on what they serve up; for those the entire file set is refetched.
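Until the .bat files appear, the overnight run can be sketched roughly as follows. This is a stand-in written in Python, not the batch files mentioned above; the site list and the log naming scheme are my own assumptions, and only the wget command line and the mirror.wrc name come from the text.

```python
import os
import subprocess
from urllib.parse import urlparse

# Assumed site list: only the GENUKI URL is taken from the text above.
SITES = ["http://www.genuki.org.uk/"]

def log_name(url):
    """Name the log after the host name (an assumed convention)."""
    return urlparse(url).hostname + ".log"

def wget_command(url):
    """Build the command line shown earlier: wget URL -o LOGFILE."""
    return ["wget", url, "-o", log_name(url)]

def mirror_all(sites, wgetrc="mirror.wrc"):
    """Run wget once per site with WGETRC pointing at the settings file."""
    env = dict(os.environ, WGETRC=wgetrc)
    for url in sites:
        # check=False so that one failing site does not stop the rest
        subprocess.run(wget_command(url), env=env, check=False)

# To start a run from inside the 'mirror' directory:
#     mirror_all(SITES)
```

Running one wget per site keeps a separate log for each, which makes it easier to spot the servers that force a full refetch.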

At this point you have a load of sites individually mirrored into subdirectories under your 'mirror' directory. There are two practical problems with this setup, neither of which really prevents the mirror being usable as long as you are prepared to waffle about how you aren't working live but with a copy of about 100 web sites on your hard disk (that may impress some of the punters!). It all boils down to having to access the files using a 'file:' reference instead of an 'http:' reference to a web server.

The first problem is that while wget fixed up absolute paths to be relative paths it did not fix up directory URLs into explicit index.htm(l) references. When you follow a directory URL you will get a directory listing and will then have to pick the index file from the list.

The second problem is that links between the mirrored sites don't work. They have been left as 'http:' references and will fail. Again you have to select the correct target file by hand.

The Apache web server

The solution to the above problems is (of course!) to run a web server on the demonstration computer. Having now done this I can confirm it is simplicity itself. I went to http://www.apache.org and followed the 'Apache Server' link and then the 'download' link. The binaries directory leads to a win32 directory which leads to http://httpd.apache.org/dist/binaries/win32/apache_1.3.17-win32-no_src.msi which I downloaded and installed. The configuration requirements are minimal. You can quote your workstation name as 'localhost' and I set up an alias for my mirror directory (just following the commented example in the file) so that while http://localhost/ took me to the default Apache documentation, http://localhost/mirror/ took me to my mirror directory.

At this stage the problem with directory URLs has gone away: Apache returns the index file. However, the links between sites are still a problem.

By the way, Apache will be set up to start automatically when you start Windows. To stop that you'll have to dive into the registry and delete the Apache entry from 'RunServices' (I'll quote the full registry branch when I get a chance to remember it!).

Pretending to be the world

Apache has another trick up its sleeve ...

Before exploiting it you need to make use of a Windows TCP/IP trick. Normally your web browser would have to look up the numeric address of, say, www.genuki.org.uk using the internet's DNS (Domain Name System). It doesn't have to, though: you can define this address in your \windows\hosts file. You can also make use of the reserved IP address 127.0.0.1, which is always special and always loops the TCP/IP packets back to the sending machine. So you need a file defining all the sites you have mirrored as being at address 127.0.0.1. I took the sample (hosts.sam) file and added lots of lines to leave me with this hosts file. Remember to rename this to something else when you connect to the real internet!
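Typing 100+ lines by hand is tedious, so here is a small Python sketch that builds them. It assumes each mirrored site sits in a subdirectory of the mirror directory named after its host (which is how the mirror is laid out here) and that nothing else lives there; the function names are mine, not part of any tool.

```python
import os

LOOPBACK = "127.0.0.1"

def mirrored_sites(mirror_dir):
    """Treat every subdirectory of the mirror directory as a site name."""
    return [name for name in os.listdir(mirror_dir)
            if os.path.isdir(os.path.join(mirror_dir, name))]

def hosts_lines(site_names):
    """Return hosts-file lines pointing localhost and each site at loopback."""
    lines = [LOOPBACK + "\tlocalhost"]
    for name in sorted(site_names):
        lines.append(LOOPBACK + "\t" + name)
    return lines

# Example with the one site named in the text; for the real thing use
# hosts_lines(mirrored_sites("d:/mirror")) and save the output as \windows\hosts.
print("\n".join(hosts_lines(["www.genuki.org.uk"])))
```

Check the output before using it: any stray directory in the mirror that isn't a site would end up listed as a host name.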

So now you have 100+ site names which will be passed to your copy of Apache instead of trying to access the real site. To complete the act you need to do some more with the Apache configuration.

I have my mirror directory on drive D: and this appeared to cause some problems. So I swapped the mirror directory with the standard Apache documentation directory, so that my d:\mirror became the server document root and the Apache documentation became the server's /apache/ directory. Then I followed the documented instructions (well, almost) for mass virtual hosting.

Here is the default configuration file httpd.default.conf, and here is my final version httpd.conf. There is also a unix diff file in httpd.diff.conf. I'll outline the changes and their significance ...

LoadModule rewrite_module modules/mod_rewrite.so

This line has to be uncommented to activate the module that will allow the server to respond to lots of server names and select the correct mirror directory.

#DocumentRoot "c:/Program Files/Apache Group/Apache/htdocs"
DocumentRoot "d:/mirror"

That makes the mirror directory the starting point for requests to the server instead of the installed documentation tree.

# This should be changed to whatever you set DocumentRoot to.
#
#<Directory "c:/Program Files/Apache Group/Apache/htdocs">
<Directory "d:/mirror">

That has to change to match the changed document root.

UseCanonicalName Off

That has to be set off so that the server responds under whatever name it is addressed by, rather than insisting it is called 'localhost' or whatever.

#LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%t %{Host}i \"%r\" %>s %b" common

Just a luxury to add the host name to the server log. You will probably never bother looking at the log anyway!

Alias /apache "c:/Program Files/Apache Group/Apache/htdocs"

<Directory "c:/Program Files/Apache Group/Apache/htdocs">
Options Indexes MultiViews
AllowOverride None
Order allow,deny
Allow from all
</Directory>

Alias /mirror "d:/mirror"

<Directory "d:/mirror">
Options Indexes MultiViews
AllowOverride None
Order allow,deny
Allow from all
</Directory>

The first of this pair of similar sections makes the installed documentation available as /apache/ on your server. The second part isn't really needed. I put that in there at first but it became redundant when I made the mirror directory the server root.

#
# Virtual hosts for offline working
#

RewriteEngine On

#RewriteMap lowercase int:tolower

RewriteCond %{SERVER_NAME} !^localhost
RewriteCond %{SERVER_NAME} !^mda-portable
#RewriteCond %{REQUEST_URI} !^/icons
RewriteCond %{REQUEST_URI} !^/cgi-bin
RewriteCond %{REQUEST_URI} !^/mirror
RewriteCond %{REQUEST_URI} !^/apache
RewriteRule ^/(.*)$ /%{SERVER_NAME}/$1
#${lowercase:%{SERVER_NAME}}/$1

This was lifted pretty much from the documentation on mass virtual hosting. The lowercase translation seemed to give me trouble so I removed it on the grounds that the win32 filestore is case insensitive anyway. I added the first two conditions so that no clever mapping is done for those site names (mda-portable is my laptop when plugged into an ethernet port at work). I disabled the exclusion for /icons because we need each virtual site to use its own /icons directory. The mirrors don't include cgi-bin, so that is 'caught' and the directory installed with Apache is used. I also set things so that references to /mirror and /apache would not be subjected to the magic.

So what that means is that if the request isn't to any of the excluded servers and isn't to any of the excluded directories ... a request for http://sitename/path/filename is treated (invisibly to the browser) as http://sitename/sitename/path/filename, and hey presto, the cross links between the 100+ mirrored sites leap into life.
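If the rewrite rules still look like magic, this Python sketch mimics the decision they make. It is only a model of the conditions listed above, not how mod_rewrite is implemented, and the example paths are made up.

```python
import re

# The same exclusions as the RewriteCond lines above.
EXCLUDED_HOSTS = ("localhost", "mda-portable")
EXCLUDED_PREFIXES = ("/cgi-bin", "/mirror", "/apache")

def rewrite(server_name, request_uri):
    """Return the path served for a request, following the rules above."""
    # RewriteCond %{SERVER_NAME} !^localhost and !^mda-portable
    if server_name.startswith(EXCLUDED_HOSTS):
        return request_uri
    # RewriteCond %{REQUEST_URI} !^/cgi-bin, !^/mirror, !^/apache
    if request_uri.startswith(EXCLUDED_PREFIXES):
        return request_uri
    # RewriteRule ^/(.*)$ /%{SERVER_NAME}/$1
    match = re.match(r"^/(.*)$", request_uri)
    if match is None:
        return request_uri
    return "/" + server_name + "/" + match.group(1)

# Hypothetical request: a page on a mirrored site.
print(rewrite("www.genuki.org.uk", "/big/index.html"))
```

Since the document root is d:/mirror, the rewritten path lands in the subdirectory that wget created for that site.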

Of course, the request won't have been passed to your local Apache server unless the site is listed in \windows\hosts, so if you mirror an extra site make sure it is added to that file.

Sorry that that turns out to be rather impenetrable, it was written in haste!