To run a web service you need four things: a connection to the outside world (network), a machine to run the service from (hardware), a program to run it with (software) and people to maintain both the server and the data it serves (‘wetware’).

Your network needs to be a permanent connection and your server needs to have a constant IP address. You need to know what support is available for your network, who to contact in case of problems, when the vulnerable periods are, etc. (The CUDN has Monday to Saturday 0800–0930 and 1700–1900, and Sunday 0000 to Monday 0800, for its vulnerable periods.)

Your machine must have the power to support the number of hits the server will get. Note that it must be powerful enough to cope with the peak demand and not just the mean or modal demand. The most important element of your hardware for a web server is the network card; buy a good one. Next most important is the amount of RAM. The CPU comes last in the list. Unless you are planning to run very computationally expensive CGI programs you don't need the latest, greatest, fastest chip in the world.
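Once a server is running and logging, the gap between peak and mean demand can be measured rather than guessed. The following is a minimal sketch, assuming an access log in which each line begins with a bracketed timestamp such as [17/Apr/2000:10:10:25 +0100] (the format used for the access log later in these notes). The function name hourly_load is invented for this illustration, and the mean is taken only over hours that saw at least one hit.

```shell
# hourly_load: read access-log lines on standard input and report the
# number of distinct hours seen, the busiest hour's hit count and the
# mean hits per hour (over hours with at least one hit).
hourly_load() {
    awk '{
        hour = substr($1, 2, 14)      # e.g. 17/Apr/2000:10
        if (!(hour in hits)) nhours++
        hits[hour]++
        total++
    }
    END {
        for (h in hits) if (hits[h] > peak) peak = hits[h]
        if (nhours > 0)
            printf "hours=%d peak=%d mean=%.1f\n", nhours, peak, total / nhours
    }'
}

# Usage, with a hypothetical log file name:
#   hourly_load < /var/log/httpd/access_log
```

If the peak figure is several times the mean, it is the peak that the network card and RAM need to be sized for.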

Clearly you need a good web server program. The program described in this course is Apache. It has all the facilities you could want (and then some) and is free (and that's ‘free’ as in speech as well as in beer). It is also the most widely used web server on the Internet, being used by >65% of active sites. (Source: Netcraft Web Server Survey, February 2002.)

Finally, it is important to realise that to provide a web service rather than just a server you need people. Pages get dated, users' needs change, links pointing out of your site go stale, dump tapes need to be changed and error reports need to be addressed.

Other programs

Figure 6. Support tools

Editors
HTML checkers
Graphics manipulators
Scanners etc.
Log file analyser
CGI programs

Figure 7. Support tools: Text editors

Plain text editor
Configuration files
HTML data files
emacs, vi, pico

Figure 8. Deprecated support tools: HTML editors

There exist specialist HTML editors
Inflexible & incomplete
Poor quality HTML
Plain text editors still pretty good
Avoid MS Word like the plague

Figure 9. Support tools: HTML checkers

Check HTML syntax
Check HTML quality
Check links still work
weblint
cron job

Figure 10. Support tools: Graphics manipulators

Best all-rounder is gimp—the GNU Image Manipulation Program
Also ee—Electric Eyes
Both available as Red Hat packages.

Figure 11. Support tools: Scanners etc.

Flat bed scanners
Digital cameras

A web server serves out web pages. However, to populate the web site the pages need to be written and checked and log files may need to be analysed.

Firstly you will need to write the web pages, or possibly edit those submitted by others. The author still regards a plain text editor (emacs, vi or perhaps even pico -w) as the best tool for editing web pages. Contrary to popular belief, the dedicated web authoring tools are still not very good. Of the various authoring packages by far the worst is Microsoft Word's ‘save as HTML’ feature. The quality of HTML generated by this is appalling and it should be avoided like the plague.

The HTML in the page still needs to be checked for syntax, link integrity and accessibility. This is true whether or not a dedicated HTML authoring package was used; indeed, if one was used then a ‘second opinion’ is all the more important. The text itself should also be checked for spelling and grammar, but beware the rather over-simplistic grammar rules in some word processors.

In addition to the text there are all the other media formats, with static graphics the most common. The GNU Image Manipulation Program (the GIMP), gimp, is a massively powerful GUI image manipulator that starts at the level of Adobe Photoshop but takes things much further, including having a scripting language. For simple viewing, cropping, rescaling and format conversion the Electric Eyes program, ee, is considerably simpler to learn and use. Images can be initially created interactively (e.g. with the GIMP), with a digital camera or by scanning in photographs.

Figure 12. Support tools: CGI programs

Common Gateway Interface
Not covered in this course
SSI
SSIexec
PHP
perl CGI module
python CGI module

The other support you may need is for CGI programs. First you need to make a decision: are you going to permit the running of programs on the server? While this course is not going to review the technologies in any depth it should give some idea of the spectrum available and the dangers associated with them.

The author is aware of only one vulnerability of a system through the web server itself. He is, however, aware of many break-ins via the CGI programs run by web servers. Static pages are vastly more secure.

The simplest of this style of program is the ‘server side include’ facility. This allows you to add certain tags to a web page which are not valid HTML but which are transformed by the server into valid HTML with dynamic content. A common example is the SSI tag that says when the page was last modified. However, consider whether you want the ‘last updated’ tag to be the last time you fixed your spelling (automatic version) or the last time you changed the content (manual version). Slightly beyond this is the ‘server side include executable’ where the tag runs an external program to generate the content. It is at this point, where an extra program is run by the web server, that you need to start being very careful about security. (It is possible to turn off the SSI executable feature while retaining the weaker SSI functionality.)

PHP takes this one stage further, offering a scripting language embedded in the HTML to provide powerful functionality and logic. The Perl and Python CGI modules take the page author away from HTML altogether. The CGI modules are presented with a URL (and some input data for POST queries) and have to write their own HTTP as well as HTML, in the format described for the ‘as is’ pages in the section called “Writing HTTP rather than HTML”. The modules provide simple function calls for most of this though.
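To make concrete what writing ‘their own HTTP as well as HTML’ means, here is a minimal sketch of a CGI program written as a plain shell script rather than with the Perl or Python modules. The function name emit_page is invented for this illustration. The essential point is the order of the output: at least a Content-Type header, then a blank line, and only then the HTML body.

```shell
#!/bin/sh
# Minimal CGI sketch: the program writes the HTTP header itself,
# then a blank line, then the HTML body.
# emit_page is a name made up for this example.
emit_page() {
    printf 'Content-Type: text/html\n'
    printf '\n'
    printf '<html><head><title>Hello</title></head>\n'
    printf '<body><p>Generated at %s</p></body></html>\n' "$(date)"
}

emit_page
```

This example takes no input from the client. The break-ins mentioned above typically come from programs that do, and that pass that input on to a shell or another program without checking it.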

Figure 13. Support tools: Secure access

ssh: Replacement for rsh, rlogin, rcp
Matching daemon: sshd
Red Hat package
Unix Support's CD

Finally, unless you plan to work exclusively at the console (you don't) you will need secure network access to your server. Don't use telnet or the ‘r-commands’ (rlogin, rsh, rcp and rsync); use their secure analogues provided by the ‘ssh’ suite of programs instead.

Red Hat Linux version 7.0 and above ship with an SSH system. Also, Unix Support provides a CD with ssh clients for most platforms including a Red Hat Linux packaging of the software suite for the Intel platform. The CD is free from the CS Reception.

Installation

Figure 14. Example server

3Com 3c905B, 700MHz Athlon, 256MB RAM, 20GB disc
Red Hat Linux 7.3
Apache v1.3.23

The example server we are going to use for this course is a 700MHz Athlon with 256MB of RAM, a 1GB disc and a 3Com 3c905B card. This is adequate for a production server. If it were very heavily used I would increase the disc size. The RAM and the CPU are perfectly adequate.

We will be running Red Hat Linux 7.3. Typically we would not be running X on the web server but we will for this example because we will be our own client too. We will run with Apache 1.3.23 which is the version shipped with Red Hat Linux 7.3.

Figure 15. Apache installation

As root
Unix Support's NFS server
Mount Red Hat mirror
Locate Apache package
Install Apache package
Unmount Red Hat mirror

Figure 16. Apache installation: Mounting the mirror

Unix Support mirror: nfs-uxsup.csx.cam.ac.uk
Red Hat mirror: /linux/redhat

# mount -o ro nfs-uxsup.csx.cam.ac.uk:/linux/redhat /mnt
# cd /mnt/updates/7.3/en/os/i386/
# ls -l apache-*
-rw-r--r--    ...    apache-1.3.23-14.i386.rpm
-rw-r--r--    ...    apache-devel-1.3.23-14.i386.rpm
-rw-r--r--    ...    apache-manual-1.3.23-14.i386.rpm

Figure 17. Apache installation: Examining the package

# rpm --query --info --package apache-1.3.23-14.i386.rpm
Name        : apache                     Relocations: (not relocateable)
Version     : 1.3.23                          Vendor: Red Hat, Inc.
Release     : 14                          Build Date: Wed 19 Jun 2002 16:55:48
Install date: (not installed)             Build Host: daffy.perf.redhat.com
Group       : System Environment/Daemons  Source RPM: apache-1.3.23-14.src.rpm
Size        : 1248999                          License: Apache Software License
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Summary     : The most widely used Web server on the Internet.
Description :
Apache is a powerful, full-featured, efficient, and freely-available
Web server. Apache is also the most popular Web server on the
Internet.

Figure 18. Apache installation: Examining the package

# rpm --query --list --package apache-1.3.23-14.i386.rpm
/etc/httpd/conf
/etc/httpd/conf/httpd.conf
 ...
/etc/rc.d/init.d/httpd.init
 ...
/var/www
/var/www/html
/var/www/html/index.html
/var/www/icons
/var/www/icons/a.gif
 ...
/usr/man/man8/httpd.8
 ...
/usr/sbin/httpd
 ...

Figure 19. Apache installation: Installing the package

# rpm --install apache-1.3.23-14.i386.rpm
# cd
# umount /mnt

This has not started the server.
Please remember to unmount the mirror.

We install Apache as root and then configure it so that root will not be needed subsequently for the configuration or administration of the server except to shut it down or restart it.

We use the network file system (NFS) to mount Unix Support's mirror of the Red Hat distribution. Within it (/mnt/updates/7.3/en/os/i386/) are all the software packages, including Apache (apache-1.3.23-14.i386.rpm).

We examine the Apache package for information and a listing of its contents and finally we install it. Once we've done the installation we unmount the file server.

This installation has not started the server but has arranged that it will be started on the next reboot. (Though we don't need to and won't reboot just to start it.)

Figure 20. Apache installation: Configuration file layout

               +--- conf/ ---+--- *.conf
               |                               +--- access_log
/etc/httpd/ ---+--- logs -> /var/log/httpd/ ---+
               |                               +--- error_log
               +--- modules -> /usr/lib/apache

Figure 21. Apache installation: Data file layout


            +--- cgi-bin/                  empty
            |
/var/www/---+--- icons/  --- *.gif
            |
            +--- html/   --- index.html    default

Figure 22. Apache installation: System file layout

/usr/sbin: Binaries
/usr/man: Manual pages
/etc/rc.d: Startup/Shutdown scripts
/etc/logrotate.d: Log rotation

The files installed come in three classes: the configuration files (/etc/httpd/, /etc/logrotate.d/apache) that the server managers need access to, the data files (/var/www/) that the web page authors and editors need access to and the system files (everything else) that we aren't going to touch. We will define groups to keep these categories apart.

If you were prepared to do all the updates to the web pages as root and had no special requirements such as access controls then you could just run the program now. The website exists under /var/www/html/ and I wish you much happiness together. However, a small amount of work (typically about 15 minutes) will make everything a lot easier and safer.

Linux system configuration

Figure 23. Configuring the operating system

Package provides a user and group for the daemon
We need to add a group for the apache administrators
And at least one group for the web authors
Avoid use of root
Log rotation

Figure 24. Configuring the O/S: User & groups

# groupadd -r webadmins
# groupadd -r webeditor
# vi /etc/group

Figure 25. Configuring the O/S: File permissions as installed

# ls -ld /var/www /etc/httpd /var/log/httpd
drwxr-xr-x 3 root root     1024 Jun 27 12:09 /etc/httpd
drwxr-xr-x 5 root root     1024 Jun 27 12:09 /var/www
drwxr-xr-x 2 root root     1024 Jun 27 16:36 /var/log/httpd

Only root can make modifications.

Figure 26. Configuring the O/S: File permissions

Change the group to webadmins:

# chgrp -R webadmins /etc/httpd /var/log/httpd /etc/logrotate.d/apache
# chgrp -R webeditor /var/www

Let the group write to the directories:

# chmod -R g+w /var/www /etc/httpd /var/log/httpd /etc/logrotate.d/apache

Make the group ownership ‘setgid’:

# find /var/www /etc/httpd /var/log/httpd -type d -exec chmod g+s {} \;

Figure 27. Configuring the O/S: File permissions—as changed

# ls -ld /var/www /etc/httpd /var/log/httpd /etc/logrotate.d/apache
drwxrwsr-x 3 root webadmins 1024 Jun 27 12:09 /etc/httpd
-rw-rw-r-- 1 root webadmins  172 Jun 27 12:09 /etc/logrotate.d/apache
drwxrwsr-x 5 root webeditor 1024 Jun 27 12:09 /var/www
drwxrwsr-x 2 root webadmins 1024 Jun 27 12:09 /var/log/httpd

The daemon will run as user apache.
How can the daemon write its log files?
It starts life and opens the log files as user root.

Figure 28. Being a webadmin

A fresh login will pick up membership of group webadmins.
This gives access to existing webadmins-writable files.
Files created in setgid directories will be owned by group webadmins
Check your permissions mask

We create system groups for the administration of the server and the management of the web pages. On a small server these can be combined. On large servers you may well want multiple groups to manage various subsets of the web pages. We specifically want to avoid requiring root access to reconfigure the web server. root will be used to start or stop the server and nothing else.

We set the permissions on the data and configuration directories so that members of the relevant group can make changes (g+w on files and directories) and any files or subdirectories created will have matching group ownership (g+s on directories).
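Group write permission on existing files only helps if newly created files get it too, and that depends on the permissions mask (umask) of the user creating them. The sketch below, which can be run in a scratch directory as an ordinary user, shows the difference between a mask of 022 and one of 002. The function name mode_with_umask is invented for this illustration.

```shell
# mode_with_umask MASK: create a file under the given umask in a
# scratch directory and print its permission string.
mode_with_umask() {
    scratch=$(mktemp -d) || return 1
    (
        umask "$1"
        touch "$scratch/page.html"
    )
    ls -l "$scratch/page.html" | cut -c1-10
    rm -rf "$scratch"
}

mode_with_umask 022    # group gets read access only
mode_with_umask 002    # group gets read and write access
```

A webadmin or web editor whose mask is 022 will create files that colleagues in the same group cannot modify, so 002 is the value to look for when checking the mask.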

The chmod command (which changes the mode, i.e. the permissions, of a file system object) and the chgrp command (which changes its group ownership) both have a -R option to make them behave recursively, as does the chown command (which changes the user ownership, though we aren't using it here). Every file system object beneath the named directories will have its mode or group modified.

The find command is slightly trickier. We want to apply the g+s mode change to every directory beneath the named directories but we don't want to apply it to the files. The find command shown starts at each of the three directories listed and checks each file system element beneath them, testing to see if the element in question is a directory (‘-type d’). If it is then it executes a command (‘-exec ... \;’) and that command is ‘chmod g+s dir’. (‘{}’ is replaced by the name of the file system element being considered.)
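The effect of the find command can be tried safely on a throwaway tree before running it on the real directories. The following sketch builds a small tree under a scratch directory, applies the same find command to it and reports which entries ended up with the setgid bit. The function name setgid_dirs_demo and the file names are invented for this illustration.

```shell
# setgid_dirs_demo: build a small tree in a scratch directory, apply
# g+s to directories only, and report which entries carry the bit.
setgid_dirs_demo() {
    top=$(mktemp -d) || return 1
    mkdir -p "$top/html/images"
    touch "$top/html/index.html"

    # Same shape as the command in the notes, on the scratch tree.
    find "$top" -type d -exec chmod g+s {} \;

    for f in "$top/html" "$top/html/images" "$top/html/index.html"; do
        rel=${f#"$top"/}
        if [ -g "$f" ]; then          # -g tests for the setgid bit
            echo "setgid: $rel"
        else
            echo "plain:  $rel"
        fi
    done
    rm -rf "$top"
}

setgid_dirs_demo
```

Both directories should be reported as setgid and the file as plain, which is the distinction that a simple chmod -R g+s would not make.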

Figure 29. Starting the server

# /etc/rc.d/init.d/httpd start
Starting httpd:                                            [  OK  ]

While we're here, we shall describe the manual stopping of the server, which we will hardly ever need, and the manual restarting of the server which we will use frequently in this course to bring in a new configuration file. Restarting is just stopping and starting wrapped into a single command.

Figure 30. Restarting or stopping the server

# /etc/rc.d/init.d/httpd restart
Shutting down http:                                        [  OK  ]
Starting httpd:                                            [  OK  ]

# /etc/rc.d/init.d/httpd stop
Shutting down http:                                        [  OK  ]

General configuration

Figure 31. Configuring the service

As a webadmin, not as root!
Directory: /etc/httpd/conf/
Directory and contents are group-writable by webadmins
httpd.conf: Configuration file
srm.conf & access.conf: Obsolete & empty

Directory: /etc/logrotate.d/
apache: Controls the rotation of the log files.
File is writable by members of group webadmins.

Red Hat's packaging of Apache's configuration files echoes an obsolete format of having three distinct configuration files in the /etc/httpd/conf/ directory. In this course we will put all our configuration in the single file: /etc/httpd/conf/httpd.conf and we will write this file from scratch to better learn what it all means.

The only other configuration file we will need is the log rotation file in /etc/logrotate.d/apache. We need to change this only if we change either the log files being kept or the duration they are kept for. These two reasons take on extra significance given the lunacy of the 1998 Data Protection Act. The client machine names or addresses that appear in logs and your record of what they have fetched may constitute personal data. We will return to this file in the section called “Log rotation”.

Figure 32. httpd.conf: Running the daemon

ServerType standalone
ServerRoot /etc/httpd
DocumentRoot /var/www/html
Port 80
User apache
Group apache
ServerAdmin rjd4@cam.ac.uk
ServerName www.inst.cam.ac.uk
ErrorLog /var/log/httpd/error_log
LogLevel info
Options None

Figure 33. Syntax: Running the daemon

ServerType standalone
The daemon will not rely on inetd to launch it on demand but will run permanently.
ServerRoot /etc/httpd
Any files referred to in this configuration file will either be fully qualified or resolved relative to this directory.
DocumentRoot /var/www/html
The documents to be served are found in this directory.
Port 80
This is the standard port for WWW services. It is privileged on a Unix system so must be opened by root. Once opened, the port can be passed to unprivileged services (e.g. running as user apache). Ports 8000 and 8080 are commonly used ports for completely unprivileged servers.
User apache
Group apache
The package created a user and group specifically for the web server. These two lines tell the server to use them. The server can change its user and group ids only if it is started as root.
ServerAdmin rjd4@cam.ac.uk
Some error messages displayed to the client can contain a contact email address. This is where it is defined.
ServerName www.inst.cam.ac.uk
You may not need this line. If your machine's real name is boring.inst.cam.ac.uk but there is a DNS record pointing www.inst.cam.ac.uk to it as well then you want the server to identify itself as www.inst.cam.ac.uk. This is how you override the machine's host name.
ErrorLog /var/log/httpd/error_log
Any error messages will be logged to the file /var/log/httpd/error_log.
LogLevel info
An error in Apache comes with a severity rating. This directive specifies what the minimum level to log is.
Options None
Apache has various options, almost all of which default to ‘on’. We will turn them off so we are forced to meet them explicitly in this course.

Figure 34. Syntax: Suboptions to LogLevel

emerg
Emergencies: system is unusable. e.g. ‘Child cannot open lock file. Exiting.’
alert
Alert: action must be taken immediately. e.g. ‘getpwuid: couldn't determine user name from uid.’
crit
Critical condition. Any different from alert? e.g. ‘socket: Failed to get a socket, exiting child’
error
Error condition: affects a single transfer, not the system as a whole. e.g. ‘Premature end of script headers’
warn
Warning. e.g. ‘child process 1234 did not exit, sending another SIGHUP’
notice
Notice: normal but significant condition. e.g. ‘caught SIGTERM, shutting down’
info
Informational messages. e.g. ‘Server seems busy, (you may need to increase StartServers, or Min/Max SpareServers).’
debug
Debugging messages. e.g. ‘Opening config file /etc/httpd/conf/httpd.conf’

Figure 35. “Pool” of daemons

Single initially launched daemon.
- Runs as root
- Answers no requests
- Maintains a “pool” of child daemons
Pool of child daemons.
- These do the real work
- Run as user apache
- Answer a certain number of requests and then die
Parameters for experts only!

Figure 36. httpd.conf: Parameters for daemon pool

PidFile /var/run/httpd.pid
LockFile /var/lock/httpd.lock
ScoreBoardFile /var/run/httpd.scoreboard
Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
MinSpareServers 5
MaxSpareServers 20
StartServers 8
MaxClients 150
MaxRequestsPerChild 100

Figure 37. Apache's functionality

Our server has very little functionality.
It serves all documents as ‘text/plain’.
It can only log errors.
We can add functionality as we need it.
‘Modules’

We can run a web server with just the configuration lines we have met so far. It will not be very good, to say the least. Its principal failing is that it has no concept of the MIME content types of the objects it serves and dishes everything up as MIME content type ‘text/plain’. If we look at http://localhost/index.html we see the HTML source (because the browser has been told that the document is of type text/plain). We need to add some functionality to Apache: the ability to determine what MIME content type a document is.

Apache's functionality comes in a set of files called ‘modules’. We start by clearing the list of modules that would otherwise be active. Without the ClearModuleList line many modules would be available by default. Partly because this is a lesson and partly because all good system administrators are control freaks (regarding the systems, not the users!) the only modules used here will be the ones we explicitly add. The mod_so.c module is built in to the Apache binary, but because we have cleared the module list it is not turned on by default. This is the module that allows us to load extra modules that are not built into the binary.

Figure 38. httpd.conf: Initialising the modules

# Start with an empty module list

ClearModuleList
AddModule mod_so.c

Figure 39. Syntax: Starting up the module system

ClearModuleList
Lose all information about modules in use.
AddModule mod_so.c
Use the mod_so.c module. Because it is built in to the binary we don't need to specify the external file the module lives in.

Figure 40. httpd.conf: Following symbolic links

Options +FollowSymLinks

The server at the moment also doesn't respect symbolic links, refusing to follow them either for pages or directories. Following symbolic links is an option under Apache and, as you will recall, we turned off all options so that we would notice them. There are two options relevant to symbolic links: FollowSymLinks and SymLinksIfOwnerMatch.

Figure 41. Syntax: Option suboptions for symbolic links

Options +FollowSymLinks
The web server will follow symbolic links.
Options +SymLinksIfOwnerMatch
The web server will follow symbolic links if the owner of the link (typically its creator) and the owner of the target of the link are the same.

The Options directive has a catch. If we give the line

Options FollowSymLinks

then this completely overrides any previous Options lines and FollowSymLinks becomes the only option in force. For this reason, we use the modifier syntax

Options +FollowSymLinks

which adds the option to the set of options in force.

Web pages' MIME types

Figure 42. httpd.conf: Adding support for MIME types

LoadModule mime_module        modules/mod_mime.so
AddModule mod_mime.c

TypesConfig /etc/mime.types
DefaultType text/plain

AddEncoding x-compress Z
AddEncoding x-gzip gz tgz

Now we see our first use of an external module. The syntax for the process is rather obscure. This is unfortunate but nothing we can't handle.

Figure 43. Syntax: Loading an external module

LoadModule mime_module modules/mod_mime.so
This line says that the file modules/mod_mime.so (resolved relative to the ServerRoot definition at the start of the configuration file) contains a module called mime_module. This module is added to the list of modules that the server knows about. As yet the server won't use the module; it just knows where to get it should it be called upon to use it.
AddModule mod_mime.c
This line tells the server to look through all the modules it knows about (either built-in or located with LoadModule directives) looking for a module whose original source file was called mod_mime.c (stupid, but that's how they chose to do it) and activate it.

When a module is activated some commands are added to the set permitted in the configuration file. The three directives used here (‘TypesConfig’, ‘DefaultType’ and ‘AddEncoding’) are all provided by the mime_module module and would be invalid without the preceding LoadModule and AddModule lines.

Next we will consider those extra commands that the mod_mime module adds. Unless the module is loaded and added before these commands are used they will result in a syntax error.

Figure 44. mod_mime: Directives

TypesConfig /etc/mime.types
Red Hat ships with a file called /etc/mime.types (part of the mailcap package) which identifies the file name extensions used for various MIME content types on the system. This line instructs the web server to use that file to identify MIME content types of files.
DefaultType text/plain
This says that if the server cannot determine the MIME content type of the file it is about to send then it should presume text/plain.
AddEncoding x-compress Z
This declares that any file whose name ends in ‘.Z’ should be declared as having MIME encoding type ‘x-compress’ (i.e. it is compressed) and the file name without the .Z suffix should be used to determine the underlying MIME content type.

Figure 45. Some lines from /etc/mime.types

# MIME type                 Extension
application/activemessage
application/andrew-inset    ez
application/applefile
application/mac-binhex40    hqx
application/octet-stream    bin dms lha lzh exe class
application/postscript      ai eps ps
application/x-dvi           dvi
application/x-javascript    js
image/gif                   gif
image/jpeg                  jpeg jpg
image/x-xwindowdump         xwd
message/partial
message/rfc822
model/vrml                  wrl vrml
text/plain                  asc txt
text/html                   html htm
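The way an encoding suffix is peeled off before the content type is looked up can be sketched in a few lines of shell. This is a simplified illustration with a tiny built-in table, not Apache's own algorithm, and the function name guess_type is invented for it. It handles only the .Z and .gz suffixes from the AddEncoding lines and a handful of the extensions listed above.

```shell
# guess_type FILENAME: print "content-type [encoding]" using a tiny
# built-in table. Simplified illustration, not Apache's own algorithm.
guess_type() {
    name=$1
    encoding=
    # Peel off a recognised encoding suffix first (AddEncoding lines).
    case $name in
        *.Z)  encoding=x-compress; name=${name%.Z} ;;
        *.gz) encoding=x-gzip;     name=${name%.gz} ;;
    esac
    # Then map the remaining extension to a content type (mime.types).
    case $name in
        *.html|*.htm) ctype=text/html ;;
        *.txt|*.asc)  ctype=text/plain ;;
        *.ps|*.eps)   ctype=application/postscript ;;
        *.gif)        ctype=image/gif ;;
        *.jpeg|*.jpg) ctype=image/jpeg ;;
        *)            ctype=text/plain ;;   # plays the role of DefaultType
    esac
    echo "$ctype${encoding:+ $encoding}"
}

guess_type index.html
guess_type paper.ps.Z
guess_type notes.txt.gz
guess_type README
```

The file paper.ps.Z is reported as PostScript with the x-compress encoding, and README, which has no recognised extension, falls through to the default type.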

Access logs

At the moment we are only logging errors. There is an independent mechanism to log transfers and it comes as a module. Furthermore, we have no means to deal with the log files generated. This section will address the first issue and the following will address dealing with log files once we've got them.

Figure 46. httpd.conf: Logging transfers

LoadModule config_log_module  modules/mod_log_config.so
AddModule mod_log_config.c

HostnameLookups On
IdentityCheck Off

CustomLog /var/log/httpd/access_log "%t %h \"%r\" %>s %B"

Figure 47. mod_log_config: Directives

CustomLog filename "format"
Log to the file with the given format. Multiple log files may be defined.
HostnameLookups On
Convert IP addresses to hostnames.
IdentityCheck On
Do an ident lookup for each incoming request.

Figure 48. mod_log_config: Logging escape sequences

%t: Time of the request
%h: Remote hostname
%r: First line of the request
%s: Status code
%B: Data bytes sent

The CustomLog directive takes two arguments. The first is the file name to log to and the second is the format of the log itself. The format line consists of a series of ‘escape sequences’ (the things starting with percentage characters). Each of these is replaced by some piece of information about the request or the server's response to it. There is no reason why you should not have more than one log file; you just have multiple CustomLog lines each defining a different log file.

The simplest escape sequence is ‘%X’ for some value of ‘X’. See the figure for the most useful examples. It is possible to log an arbitrary header from the query or response. For the server it is usually of more use to see the incoming headers. See the syntax description for some examples. Of most use in log files is the referring page. For example, you could strip out just those log lines with status code 404 (page not found) and check the referring page. If it's an internal page you can fix the link and if it's external you can contact the webmaster responsible.
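Stripping out the 404 lines is a one-line job for awk. The following is a minimal sketch that assumes the log format defined in our configuration file ("%t %h \"%r\" %>s %B"), in which the status code is the next-to-last field and the requested URL is the fifth. The function name not_found_urls is invented for this illustration. To see the referring page as well, the format would need an extra field such as %{Referer}i, which is not part of the format used here.

```shell
# not_found_urls: read access-log lines on standard input and print each
# URL that produced a 404, with the number of times it was requested.
# Assumes the format "%t %h \"%r\" %>s %B", so the status is the
# next-to-last field and the URL is the fifth.
not_found_urls() {
    awk '$(NF-1) == 404 { count[$5]++ }
         END { for (url in count) print count[url], url }' | sort -rn
}

not_found_urls <<'EOF'
[17/Apr/2000:10:10:25 +0100] hostname "GET /index.html HTTP/1.0" 200 1316
[17/Apr/2000:10:11:00 +0100] hostname "GET /bogus.html HTTP/1.0" 404 0
[17/Apr/2000:10:30:23 +0100] hostname "GET /cgi-bin/phf?Qalias=x HTTP/1.0" 404 0
[17/Apr/2000:10:31:00 +0100] hostname "GET /bogus.html HTTP/1.0" 404 0
EOF
```

The most frequently missed URLs come out first, which is usually the order in which they are worth investigating.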

The %h code requires the server to perform a lookup in the DNS to turn the IP address of the incoming request into a name. This is not an expensive operation, but if your web site is very heavily used you may want to avoid it. There are two ways to go about this. You can use %a instead. This just logs the IP address and attempts no lookup. Alternatively you can use %h but set the directive HostnameLookups Off. Under these circumstances %h behaves like %a. However, if you want to do access control based on client host name you must have HostnameLookups On, hence the provision of %a.

The %l escape also requires some explanation. The ident protocol provides a means for the server to ask of the client the name of the user on the client (or some tag uniquely identifying the user) who is making the connection. This is only possible if the client system is running the corresponding ident server. This server is quite common on multi-user systems and almost unknown on single-user systems. Again, the load is small for a lightly loaded web server but potentially severe for a heavily loaded one. (Far more so than for the hostname lookups.)

Finally we need to explain the ‘%>s’ construction. We will see in a later section that some modules run a page through quite intricate processing. ‘%s’ is the status code for the processing of the query and ‘%>s’ is the status code finally returned to the client. The latter is typically what we really want. The figure below lists the most commonly seen status codes. The full set can be found in RFC 2616.

Figure 49. Common status codes

200
OK
301
Moved Permanently
307
Temporary Redirect
400
Bad Request
401
Unauthorized
403
Forbidden
404
Not Found
500
Internal Server Error
505
HTTP Version Not Supported

Figure 50. mod_log_config: Common logging escape sequences

%a: Client's IP address
%B: Bytes sent, excluding HTTP headers.
%f: The name of the file served.
%h: Remote hostname, or IP address if hostname lookups are off.
%l: Remote logname from identd if IdentityCheck is on.
%r: The first (typically only) line of the request.
%s: Status code of the request.
%T: Number of seconds taken to service the request.
%t: Time of the request.
%U: The URL requested.
%u: The userid used if this is a page that requires userid/password.
%{header}i: Argument of header in the incoming request
%{header}o: Argument of header in the outgoing response

The escape sequences can be more involved than this. Full details are in the Apache documentation.

The %{header}i logging option records the value of an incoming request header. The most commonly useful headers are given below.

Figure 51. HTTP request headers

Authorization: Access rights to restricted pages.
From: E-mail address of the user making the request. (Often blank.)
If-Modified-Since: Only send the data if necessary.
Referer: The URL of the referring page.
User-Agent: The web client. Many lie.

Figure 52. Some example log lines

[17/Apr/2000:10:10:25 +0100] hostname "GET /index.html HTTP/1.0" 200 1316
[17/Apr/2000:10:11:00 +0100] hostname "GET /bogus.html HTTP/1.0" 404 0
[17/Apr/2000:10:12:00 +0100] hostname \
 "GET http://elsewhere/index.html HTTP/1.0" 200 1316
[17/Apr/2000:10:30:23 +0100] hostname \
 "GET /cgi-bin/phf?Qalias=x%0a/bin/cat/%20/etc/passwd HTTP/1.0" 404 0

The figure has four example log lines in the format defined in our configuration file.

[17/Apr/2000:10:10:25 +0100] hostname "GET /index.html HTTP/1.0" 200 1316

The first line shows a successful transfer of the URL http://machine/index.html. Note that the client need only request the local part of the URL, having itself determined what machine to connect to.

[17/Apr/2000:10:11:00 +0100] hostname "GET /bogus.html HTTP/1.0" 404 0

The second line shows an unsuccessful transfer request. The file being looked for does not exist (status code 404). Note that the logged number of bytes sent back is 0.

[17/Apr/2000:10:12:00 +0100] hostname "GET http://elsewhere/index.html HTTP/1.0" 200 1316

The third line is an example of someone trying to use the server as a proxy server. If a request comes in for a fully qualified URL some servers (and Apache if you configure it appropriately) will act as a web client, fetch that URL and pass it back to you. By default Apache does not do this. Instead, it ignores the http://elsewhere component and treats it as a request for the local URL /index.html. Note that this request generates a status code 200 and returns 1316 bytes—exactly the same number as in line one.

[17/Apr/2000:10:30:23 +0100] hostname \
 "GET /cgi-bin/phf?Qalias=x%0a/bin/cat/%20/etc/passwd HTTP/1.0" 404 0

The fourth line is an example of an unsuccessful hacking attempt. The phf script has a hole permitting arbitrary shell commands to be run. Here the script is not present, hence the 404. Note that these commands would have run as the user apache, which has no special privilege, but it is still a way in.
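The four example lines can also be picked apart mechanically. The following is a minimal sketch in Python, assuming the log format shown in the figure (timestamp, host, quoted request, status, bytes). The names parse_log_line and looks_suspicious are illustrative only and are not part of Apache or of any log analysis tool.

```python
import re

# Pattern for the format in the figure:
# [timestamp] host "request line" status bytes
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] (?P<host>\S+) "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    # The request line has the shape "METHOD URL PROTOCOL".
    parts = fields["request"].split()
    fields["url"] = parts[1] if len(parts) >= 2 else ""
    return fields

def looks_suspicious(fields):
    """Flag proxy-style requests and probes for the phf script."""
    url = fields["url"]
    return url.startswith("http://") or "/cgi-bin/phf" in url

# The four lines from the figure, with the continuation lines joined up.
SAMPLE_LINES = [
    '[17/Apr/2000:10:10:25 +0100] hostname "GET /index.html HTTP/1.0" 200 1316',
    '[17/Apr/2000:10:11:00 +0100] hostname "GET /bogus.html HTTP/1.0" 404 0',
    '[17/Apr/2000:10:12:00 +0100] hostname '
    '"GET http://elsewhere/index.html HTTP/1.0" 200 1316',
    '[17/Apr/2000:10:30:23 +0100] hostname '
    '"GET /cgi-bin/phf?Qalias=x%0a/bin/cat/%20/etc/passwd HTTP/1.0" 404 0',
]

for line in SAMPLE_LINES:
    fields = parse_log_line(line)
    print(fields["status"], fields["url"], looks_suspicious(fields))
```

The two tests in looks_suspicious are deliberately crude. A real site would want a longer list of patterns, or a proper log analyser.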

The Data Protection Act (1998). The Data Protection Commissioner's office has advised us that machine names and IP addresses that can be used to identify an individual (e.g. that of the computer in a student's room) may constitute personal data in the meaning of the DPA(98). Until there is an expensive test case and some ignorant, senile, senior judge pronounces precedent we won't know for certain.

Log rotation

In this section we consider what we can do with the logs and, in particular, how to stop them growing out of control.

Figure 53. /etc/logrotate.conf

# rotate log files weekly
weekly

# keep 4 weeks worth of backlogs
rotate 4

# send errors to root
errors root

# create new (empty) log files after rotating old ones
create

# RPM packages drop log rotation information into this directory
include /etc/logrotate.d

Figure 54. /etc/logrotate.d/apache—as shipped

/var/log/httpd/access_log /var/log/httpd/error_log {
    missingok
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
    endscript
}

Red Hat Linux provides a service called ‘log rotation’ which provides a uniform mechanism to stop log files growing out of control over time. At regular intervals (nightly, weekly and monthly are all common) the log file error_log, say, is renamed to error_log.1. If there was a previously existing error_log.1 it is renamed to error_log.2, error_log.2 to error_log.3 and so on up to some limit. The default frequency of this operation is defined in the file /etc/logrotate.conf to be weekly and the number of old log files kept defaults to 4, so error_log.4 is discarded rather than renamed to error_log.5. A new error_log is created.

The directory /etc/logrotate.d/ contains the rotation instructions specific to the log files for a particular package. The instructions for the apache package are kept in the file /etc/logrotate.d/apache. The log files are given as /var/log/httpd/access_log and /var/log/httpd/error_log, and the block in braces that follows applies to both. The missingok line means that it is not an error for a log file to be absent. The sharedscripts line means that the script is run once for the whole set of log files rather than once per file. The lines from postrotate to endscript identify a (single line) shell script that should be run after the log files have been rotated. This sends the HUP signal to the web daemon, which causes it to reopen all its log files so that it is now logging to the newly created log files rather than the .1 versions.

While this course does not consider the analog log analysis program, we will remark that the log rotation script is a good place to run it from. Each time the system rotates a log file, analog gets to process it. We might also want to address the DPA(98) issues here by insisting that the log files not be world-readable. The create line in the modified file stipulates that when the log files are rotated each is replaced by a new, empty one which is read/write to root, read-only to members of group webadmins and not readable at all by anyone else.

Figure 55. /etc/logrotate.d/apache—as modified

/var/log/httpd/access_log /var/log/httpd/error_log {
    missingok
    sharedscripts
    create 0640 root webadmins
    postrotate
        /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
    endscript
}
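To check what a numeric mode such as the 0640 in the create line actually grants, the Python standard library can spell it out. This is only a small sketch: describe_mode and world_readable are illustrative helper names, not logrotate facilities.

```python
import stat

def describe_mode(mode):
    """Return the ls-style permission string for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)

def world_readable(mode):
    """True if users outside the owner and group can read the file."""
    return bool(mode & stat.S_IROTH)

print(describe_mode(0o640))   # owner read/write, group read, others nothing
print(world_readable(0o640))  # the mode used in the create line
print(world_readable(0o644))  # a common mode that is world-readable
```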

Aliases

Figure 56. Resolving a URL to a file via an alias

By default, the ‘local part’ of any URL is converted to a file name by simply resolving it as a file name relative to DocumentRoot, which is /var/www/html/ on a Red Hat Linux installation. So, for example, the URL http://server/wombat/index.html would resolve to the file /var/www/html/wombat/index.html.

However, sometimes we want a URL to point out of the DocumentRoot directory tree. For example we can see that the Red Hat Linux Apache installation puts a collection of GIF files in /var/www/icons/ which is not below /var/www/html/. We might want the URL http://server/icons/new.gif to resolve to the file /var/www/icons/new.gif, which it won't by default.

We can accomplish this in two ways: either we create a symbolic link from /var/www/html/icons to /var/www/icons/ or we tell Apache to override the DocumentRoot setting in certain regards. As this is an Apache course, we will do the latter.

Figure 57. httpd.conf: Aliases in Apache configuration

# Aliases

LoadModule alias_module       modules/mod_alias.so
AddModule mod_alias.c

Alias /icons/  /var/www/icons/

As before, to add functionality to Apache we need a module. In this case it is the mod_alias module. This module adds a number of keywords to the configuration syntax but we need only one for now. In the slide the Alias directive maps a set of URLs with local parts starting /icons/ to the directory /var/www/icons/.
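The effect of the Alias line can be sketched in a few lines of Python. This is a simplified model of the mapping, not Apache's own code: resolve, ALIASES and DEFAULT_ROOT are names invented for the illustration, and the real module does rather more (checking trailing slashes, for instance).

```python
# The default directory for web pages on a Red Hat Linux installation.
DEFAULT_ROOT = "/var/www/html/"

# (URL prefix, directory) pairs, as set up by Alias lines.
ALIASES = [("/icons/", "/var/www/icons/")]

def resolve(local_part, aliases=ALIASES, default_root=DEFAULT_ROOT):
    """Map the local part of a URL to a file name."""
    for url_prefix, directory in aliases:
        if local_part.startswith(url_prefix):
            # Replace the matching URL prefix with the aliased directory.
            return directory + local_part[len(url_prefix):]
    # No alias applies: resolve relative to the default directory.
    return default_root + local_part.lstrip("/")

print(resolve("/wombat/index.html"))
print(resolve("/icons/new.gif"))
```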

Handling directories

Figure 58. Access log: Failing to read a directory

[27/Apr/2000:15:47:11 +0100] hostname "GET /index.html HTTP/1.0" 200 2537
[27/Apr/2000:15:48:09 +0100] hostname "GET / HTTP/1.0" 404 0

http://server/index.html works
http://server/ doesn't

At the moment, while our web server can handle files, determine their MIME content and encoding types from their names' extensions and log their transfer, it still can't handle URLs that resolve to directories. Attempts to get such a URL (e.g. the top level URL for the site as a whole) give 404 errors. This is clearly unacceptable.

There are two ways to handle this and most sites implement both.

The first is to provide automatic indexing. Given a URL corresponding to a directory, the web server will create an HTML web page giving a list of all the entries in that directory. These can be annotated with icons (or their ALT text) to identify the corresponding MIME content types. They can be labelled with sizes, titles etc. or left completely plain. We will start with the basic functionality (and the relevant module) and slowly add in some flashier functions.

The other approach is to nominate one or more filenames so that if such a file exists within a directory then that file will be displayed instead. The name index.html is traditional for this, but is not compulsory.

Automatic indexing

Figure 59. httpd.conf: Module for automatic indexing

# Automatic indexing of directory URLs

LoadModule autoindex_module   modules/mod_autoindex.so
AddModule mod_autoindex.c

Options +Indexes

Figure 60. Browser's view of automatic indexing

                    Index of /
     * Parent Directory
     * index.html
     * poweredby.png

If we simply add the automatic indexing module and enable automatic indexing with an Options statement then we see lists of contents for directory URLs (including index.html). Notice that the three links shown are one directory, one HTML file and a graphic in PNG format but there is no indication of the MIME content type in the page shown. Each entry is simply preceded by a bullet.

Figure 61. httpd.conf: ‘Fancy’ indexing

IndexOptions +FancyIndexing

Figure 62. Browser's view of fancy indexing

                          Index of /

Name                Last modified       Size  Description
  __________________________________________________________________
 Parent Directory    25-Apr-2000 14:00      -
 index.html          25-Apr-2000 18:08     2k
 poweredby.png       01-Mar-2000 18:37     1k
     _____________________________________________________________

Figure 63. httpd.conf: Fancy indexing options

IndexOptions +SuppressLastModified +ScanHTMLTitles

Figure 64. Browser's view of fancy indexing options

                          Index of /

Name                Size  Description
  __________________________________________________________________

 Parent Directory        -
 index.html             2k  Test Page for the Apache Web Server on Re>
 poweredby.png          1k
     _____________________________________________________________

The mod_autoindex module adds a large number of directives to the allowed set. We'll start with just IndexOptions. This allows us to modify the displayed format. Almost always it is passed the FancyIndexing suboption which turns on the “long form” listing seen in the figure. In conjunction with this are a number of other options to modify this long form of the output, as shown in the figure. The figure below lists the more useful options to IndexOptions.

Figure 65. httpd.conf: Adding icons to the fancy listing

IndexOptions IconWidth IconHeight

AddIconByType (HTM,/icons/layout.gif)   text/html
AddIconByType (TXT,/icons/text.gif)     text/*
AddIconByType (IMG,/icons/image2.gif)   image/*
AddIconByType (MOD,/icons/world2.gif)   model/*
AddIconByType (SND,/icons/sound2.gif)   audio/*
AddIconByType (VID,/icons/movie.gif)    video/*

We can very usefully augment the automatic listings by adding icons (or the corresponding alternative text) to the lines of output depending on the MIME content types of the files. The directive AddIconByType is provided for this purpose. Its first argument is a pair: the ALT text and the icon. Its second argument is the MIME content type or types it should be used for. Note that wild cards can be used for the MIME content subtype.

Whenever an image is included in a page it should have its WIDTH and HEIGHT parameters explicitly specified. Apache doesn't have the facility to parse the image files it serves to determine these numbers automatically, so a compromise is made. All the icons shipped with Apache are the same size. The IndexOptions parameters IconHeight and IconWidth instruct Apache to include these values (which are wired in to the module's source). All the Apache icons have width 20 pixels and height 22 pixels. If you choose to replace the icons you are strongly recommended to make them all the same size and to use the line

IndexOptions IconWidth=X IconHeight=Y

in the httpd.conf file, to supply their values.

In this example I use one icon for HTML pages (by far the most common, we might expect) and another icon for all the other text subtypes. If the distribution of your MIME content types is different you might choose a different strategy. One place where this might make sense is with the application subtypes, where lumping them all together as “application content types” is not particularly useful.
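The way the specific text/html line sits ahead of the general text/* line can be modelled as a list of rules searched in order. The sketch below assumes that the first matching rule wins, which is what the ordering in the figure suggests. That is an assumption of this illustration rather than a statement about the module's internals, and icon_for is a made-up name.

```python
from fnmatch import fnmatchcase

# (ALT text, icon, MIME pattern) in the order used in the figure.
ICON_RULES = [
    ("HTM", "/icons/layout.gif", "text/html"),
    ("TXT", "/icons/text.gif", "text/*"),
    ("IMG", "/icons/image2.gif", "image/*"),
]

def icon_for(mime_type, rules=ICON_RULES):
    """Return (ALT text, icon) for the first rule matching mime_type."""
    for alt, icon, pattern in rules:
        if fnmatchcase(mime_type, pattern):
            return alt, icon
    return None  # no rule applies

print(icon_for("text/html"))
print(icon_for("text/plain"))
print(icon_for("image/png"))
```

If the two text rules were swapped in this model, text/html pages would pick up the general text icon, which is why the specific rule is listed first.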

Figure 66. httpd.conf: Application subtypes

AddIconByType (_PS,/icons/a.gif)        application/postscript
AddIconByType (PDF,/icons/a.gif)        application/pdf
AddIconByType (HQX,/icons/binhex.gif)   application/mac-binhex40
AddIconByType (DVI,/icons/dvi.gif)      application/x-dvi
AddIconByType (TEX,/icons/tex.gif)      application/x-tex
AddIconByType (TAR,/icons/tar.gif)      application/x-tar
AddIconByType (BIN,/icons/binary.gif)   application/octet-stream
AddIconByType (XXX,/icons/unknown.gif)  application/*

There is a vast array of application subtypes. Every application-specific data type can claim one using the “x-” extension subtypes. The mainstream applications have applied for “real” application subtypes. The application types you have on your website should be represented by useful icons (there are plenty) and the default (unknown.gif in our case) should only be used very rarely. The image in file /var/www/icons/icon.sheet.gif shows all of them in a single picture.

Figure 67. httpd.conf: Directories

AddIcon (_UP,/icons/back.gif)   ..
AddIcon (DIR,/icons/folder.gif) ^^DIRECTORY^^
AddIcon (---,/icons/blank.gif)  ^^BLANKICON^^

Directories don't have MIME types so we need to explicitly add an icon for these. To do this, we use AddIcon which associates icons with items either by name or by special controls. For example, we can match on the name “..” to provide an icon for the reference to the parent directory. There are also some special controls, written “^^DIRECTORY^^” and “^^BLANKICON^^”, which match directories and places where no icon would be used (to get the formatting right).

Figure 68. Browser's view of a fully labelled web page

                          Index of /
        Name                    Size  Description
  __________________________________________________________________________
 [_UP]  Parent Directory            -
 [HTM]  index.html                 2k  Test Page for the Apache Web Server on Re>
 [DIR]  manual/                     -
 [IMG]  poweredby.png              1k
     _________________________________________________________________

On the subject of formatting, we need to point out a few problems. Because Apache generates PRE formatted pages rather than tables it is important that all the icons be the same size and that all the ALT text be the same length (traditionally three characters). It doesn't appear possible to put spaces in the ALT text so I tend to use underscores for spaces and three dashes for the blank icon (because it precedes a horizontal rule, which in text browsers is written with a row of hyphens).

It is possible to modify the widths of the displayed columns. The IndexOptions directive has suboptions NameWidth=x and DescriptionWidth=y. The variables x and y can be either an explicit number of characters or an asterisk. In the latter case the name column is made as wide as its widest element and the description column is sized to make the whole thing 79 columns wide.

Figure 69. mod_autoindex: IndexOptions suboptions

FancyIndexing: Turns on the “long” format.
ScanHTMLTitles: Display the HTML title of web pages as their description. This can be intensive on the disc.
SuppressDescription: Turn off the description column altogether.
SuppressLastModified: Turn off the column for the last modification date and time.
SuppressSize: Turn off the column for the size of documents.
IconWidth[=X]: Specify the width of all the icons in pixels (defaults to 20).
IconHeight[=Y]: Specify the height of all the icons in pixels (defaults to 22).
NameWidth=X: Width in characters of the file name column. An asterisk means “as wide as the widest element”.
DescriptionWidth=Y: Width in characters of the “description” or “title scan” column. An asterisk means that the whole row should be 79 characters wide.

Figure 70. httpd.conf: Headers and footers

HeaderName HEADER.html
ReadmeName README.html

Figure 71. Browser's view of headers and footers

   This is some text to go at the top of the page above the listing.
        Name                    Size  Description
  __________________________________________________________________________
 [_UP]  Parent Directory            -
 [HTM]  HEADER.html                1k
 [HTM]  README.html                1k
 [HTM]  index.html                 2k  Test Page for the Apache Web Server on Re>
 [DIR]  manual/                     -
 [IMG]  poweredby.png              1k
     _________________________________________________________________

In addition to customising the listing itself, we can also append information to the top and bottom of the listing. The mod_autoindex module provides two directives HeaderName and ReadmeName for this purpose. The HeaderName directive specifies the name of a file whose contents are placed above the listing and the ReadmeName a file whose contents go beneath it.

The filenames must correspond to a MIME content text type. If it is text/html then they are included directly into the generated HTML directory listing. If they are text/plain then they are included within a PRE block.

Note that the text above the listing replaces the original text “Index of /”. Also note that the HEADER.html and README.html files themselves appear in the listing. The last directive from the mod_autoindex module we will consider, IndexIgnore, deals with this. It takes a list of shell-style wildcard patterns. Files that match one or more of these patterns are not listed in the index.

Figure 72. httpd.conf: Suppressing files from the listing

IndexIgnore .??* *~ *# HEADER* README* SCCS RCS CVS
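The patterns on the IndexIgnore line behave like shell-style wildcards, so their effect can be approximated with Python's fnmatch module. This is a sketch under that assumption: Apache's own matching may differ in detail, and is_listed and the sample file names are invented for the illustration.

```python
from fnmatch import fnmatchcase

# The patterns from the IndexIgnore line in the figure.
IGNORE = ".??* *~ *# HEADER* README* SCCS RCS CVS".split()

def is_listed(name, patterns=IGNORE):
    """True if name matches none of the ignore patterns."""
    return not any(fnmatchcase(name, pattern) for pattern in patterns)

names = [".htaccess", "index.html", "notes.txt~", "HEADER.html",
         "README.html", "RCS", "poweredby.png"]
print([name for name in names if is_listed(name)])
```

Note that the pattern .??* needs at least two characters after the dot, so it hides files such as .htaccess while still letting the “..” entry for the parent directory through.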

Default directory index files

Figure 73. httpd.conf: Default files

# Default files in directory URLs

LoadModule dir_module         modules/mod_dir.so
AddModule mod_dir.c

DirectoryIndex index.html index.htm

The other approach to dealing with directory URLs is to define a filename such that if that file appears within the directory it is displayed instead of the directory itself. The mod_dir module provides precisely this functionality.

It provides the DirectoryIndex directive, which gives a list of names to be tried in order. Note that an entry can also be an absolute local path. In the example, if a directory URL is requested then the directory's index.html file is used if it exists. If it doesn't exist then index.htm is used if that exists. Finally, if neither exists and the mod_autoindex module is loaded then the directory listing is given. If the module is not loaded then a 404 “file not found” error is given.
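The decision just described can be written out as a short function. This is a model of the behaviour described in these notes, not Apache's implementation, and the name resolve_directory and its return values are chosen purely for the illustration.

```python
def resolve_directory(entries, directory_index, autoindex_loaded):
    """Decide what to return for a URL that resolves to a directory.

    entries: the file names present in the directory
    directory_index: names from the DirectoryIndex line, in order
    autoindex_loaded: whether automatic indexing is available
    """
    for name in directory_index:
        if name in entries:
            return ("file", name)            # first default file that exists
    if autoindex_loaded:
        return ("listing", sorted(entries))  # fall back to a generated index
    return ("error", 404)                    # nothing available to serve

index_names = ["index.html", "index.htm"]
print(resolve_directory({"index.htm", "a.png"}, index_names, True))
print(resolve_directory({"a.png"}, index_names, True))
print(resolve_directory({"a.png"}, index_names, False))
```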

If you use both the mod_autoindex and mod_dir modules then, in the configuration file, mod_autoindex must precede mod_dir. If they are placed in the other order then mod_dir is ignored. The author has no idea why this is and assumes it is a bug.

Writing HTTP rather than HTML

We saw in the logging section that HTTP (the transfer protocol, not the language of the web pages) has the concept of status codes, with 200 being the ‘OK’ response and 404 being the ‘file not found’ response. From time to time, we may want to force the generation of a particular error message or status code. There are two ways to go about doing this.

The core Apache system has a directive called ErrorDocument. This lets us specify exactly what page will be sent back to accompany a 404, say, status code.

Figure 74. httpd.conf: Setting the 404 error document

ErrorDocument 404 /errors/404.html
ErrorDocument 500 "Oops, server goof."

Figure 75. Syntax: Specifying error messages

ErrorDocument nnn "text": If the server generates status code nnn then a text/plain page will be returned with that status code and text as the text.
ErrorDocument nnn URL: If the server generates status code nnn then the local web page at URL will be returned along with status code nnn.

This depends on the server generating a specific status code. You will recall that status code 403 corresponds to ‘forbidden’. We might want to indicate that trying to fetch a particular URL was expressly forbidden rather than just not present. For example, given a directory URL, we might want to display an index.html file if one exists but give a 403 status code if one does not. So we need a way to generate pages with status codes of other than 200. We could do this just by turning off or on the indexing option but the mechanisms described here provide more flexibility.

This functionality is provided by a module called mod_asis. This lets us provide web pages that aren't HTML or any other MIME type but which are the entire HTTP response to a query. This allows us to add status codes and other HTTP metadata beyond just the HTML content.

First let's see what a full HTTP session looks like.

Figure 76. Faking a browser with telnet

$ telnet draig.csi.cam.ac.uk 80
Trying 131.111.10.224...
Connected to draig.csi.cam.ac.uk.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Tue, 16 May 2000 08:54:29 GMT
Server: Apache/1.3.12 (Unix)  (Red Hat/Linux)
Last-Modified: Tue, 25 Apr 2000 17:08:10 GMT
ETag: "f242-9e9-3905d0fa"
Content-Length: 2537
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
  <HEAD>
 ...
 </BODY>
</HTML>

Figure 77. HTTP response headers

HTTP/1.1 200 OK: The HTTP protocol version number (our query was version 1.0 but the server is entitled to reply with version 1.1), followed by the status code and a text explanation of the status code.
Date: The timestamp of the response.
Server: A description of the responding server.
Last-Modified: When the page was last modified.
ETag: ‘Entity tag’: a key used to uniquely identify this version of the page for caches etc.
Content-Length: Number of bytes in the body of the response. (i.e. the HTML page, but not the HTTP headers.)
Connection: Whether the TCP connection should be kept open after this transfer to allow further requests.
Content-Type: The MIME content type of the following document.
Blank line: The separator between the headers and the body of the web page.

So, if we are going to generate a status code 403, say, then we will need to create that first line and perhaps some others. The module will assist us with many of them, though.
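To make the structure of a response concrete, here is a small Python sketch that splits a response into its status line, headers and body. The function parse_response is illustrative only, and the sample text is an abbreviated version of the session shown above with the body cut down to a placeholder.

```python
def parse_response(raw):
    """Split a raw HTTP response into status line, headers and body."""
    # The first blank line separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Status line: protocol version, status code, explanatory text.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"version": version, "status": int(code),
            "reason": reason, "headers": headers, "body": body}

raw = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Tue, 16 May 2000 08:54:29 GMT\r\n"
    "Content-Length: 2537\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<HTML>...</HTML>"
)
response = parse_response(raw)
print(response["status"], response["reason"])
print(response["headers"]["content-type"])
```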

The module works as follows: we create a new, fake MIME content type called httpd/send-as-is and associate it with files ending with one or more suffixes (.asis, traditionally). The module then causes the server to process these files as nearly raw HTTP rather than as HTML or some other MIME content type. Because httpd/send-as-is is not a true MIME type, we don't want to define it in the /etc/mime.types file, so we use the AddType directive of the mod_mime module to define it purely within the web server. This gives us a module dependency: the mod_asis module cannot be used without the mod_mime module already being added.

Figure 78. Adding the mod_asis module

# Send .asis files "as is"

AddType httpd/send-as-is asis

LoadModule asis_module        modules/mod_asis.so
AddModule mod_asis.c

Now, if we wanted to provide for forbidding directory indexing in certain directories as opposed to providing an index.html file, we could provide the DirectoryIndex line

DirectoryIndex index.html index.asis

Then, if a user creates an index.html file it is treated as usual. If there is no index.html file but there is an index.asis file it is used and sent ‘as is’. If there is neither then the directory is autoindexed.

Let's now look at constructing a plausible index.asis file.

Figure 79. A plausible index.asis file

Status: 403 Directory searching is prohibited
Content-Type: text/html

<!DOCTYPE HTML PUBLIC
 "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/strict.dtd">

<HTML><HEAD>
<TITLE>Security policy violation</TITLE>
</HEAD><BODY>
<H1>Security policy violation</H1>
<P>This web site's security policy prohibits the autoindexing of this
directory.  Your request has been logged.</P>
</BODY></HTML>

A more useful page would give links to a search engine and suchlike. More importantly, observe the headers at the start of the page, split from the body by the first blank line of the page. (The line is truly empty; there are no spaces or other whitespace characters in it.) The Status: header introduces the status code and the explanatory text message. We don't get to specify the HTTP version being spoken; the server will take care of that for us. Any following lines (before the blank line) that look like HTTP headers will be passed through untouched and must be valid HTTP header lines. The server will add the Server:, Date: and Connection: lines and we should not write these.
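The transformation from .asis file to response can be sketched as follows. This is a simplified model, not the module's code: asis_to_response is a made-up name, the Server:, Date: and Connection: lines that the real server adds are left out, and the fallback to 200 OK when no Status: header is present is an assumption made for the sketch.

```python
def asis_to_response(asis_text, protocol="HTTP/1.1"):
    """Turn the text of an .asis file into the start of an HTTP response."""
    # Headers and body are separated by the first truly empty line.
    head, _, body = asis_text.partition("\n\n")
    status = "200 OK"  # assumed default if the file has no Status: header
    passed_through = []
    for line in head.split("\n"):
        name, _, value = line.partition(":")
        if name.strip().lower() == "status":
            status = value.strip()       # becomes part of the first line
        else:
            passed_through.append(line)  # other headers pass through untouched
    return "\n".join([protocol + " " + status] + passed_through + ["", body])

asis_text = (
    "Status: 403 Directory searching is prohibited\n"
    "Content-Type: text/html\n"
    "\n"
    "<HTML>...</HTML>\n"
)
print(asis_to_response(asis_text))
```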

Figure 80. Faking a browser with telnet again

$ telnet draig.csi.cam.ac.uk 80
GET /two/ HTTP/1.0

Trying 131.111.10.224...
Connected to draig.csi.cam.ac.uk.
Escape character is '^]'.
Connection closed by foreign host.
HTTP/1.1 403 Directory searching is prohibited
Date: Tue, 16 May 2000 11:30:40 GMT
Server: Apache/1.3.12 (Unix)  (Red Hat/Linux)
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC
 "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/strict.dtd">

<HTML><HEAD>
<TITLE>Security policy violation</TITLE>
</HEAD><BODY>
<H1>Security policy violation</H1>
<P>This web site's security policy prohibits the autoindexing of this
directory.  Your request has been logged.</P>
</BODY></HTML>

If we inspect the access log file we will see the 403 lines there too.

[16/May/2000:12:06:30 +0100] hydra.csi.cam.ac.uk "GET /two/ HTTP/1.0" 403 345

This is where we get to see the difference between the logging code ‘%s’ and ‘%>s’. The former would log a status code of 200 because the .asis file was processed correctly. The latter shows 403 because that is the ultimate status code after all the internal reprocessing is complete.

Users' own web pages

Figure 81. httpd.conf: User directories

# Users' web pages

LoadModule userdir_module     modules/mod_userdir.so
AddModule mod_userdir.c

UserDir public_html

Apache contains a mechanism (read: module) that allows users to supply their own web pages. It is not uncommon for a web server to offer nothing but these and to have no ‘central’ web pages at all (except perhaps for a top level index.html file). The mod_userdir module provides a single directive, UserDir. It can be used in a number of ways, however.

Figure 82. user_dir: Remapping http://server/~user/index.html

UserDir public_html
Maps URL to ~user/public_html/index.html.
UserDir /home/userpages
Maps URL to /home/userpages/user/index.html.
UserDir /home/*/webstuff
Maps URL to /home/user/webstuff/index.html.
UserDir http://other/home/userpages
Maps URL to http://other/home/userpages/user/index.html
UserDir http://other/*/webstuff
Maps URL to http://other/user/webstuff/index.html
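The five forms in the figure can be captured in one small function. This is a sketch of the mapping as tabulated, not the module's code: map_user_url is an invented name, and the home directory is assumed to be /home/user, whereas the real server looks up the user's actual home directory.

```python
def map_user_url(local_part, userdir, home_root="/home"):
    """Map a local part of the form /~user/rest using a UserDir argument."""
    if not local_part.startswith("/~"):
        return None  # not a user URL
    user, _, rest = local_part[2:].partition("/")
    if "*" in userdir:
        # The asterisk marks where the user name goes.
        base = userdir.replace("*", user, 1)
    elif userdir.startswith("/") or userdir.startswith("http://"):
        # Absolute path or URL: the user name is appended.
        base = userdir.rstrip("/") + "/" + user
    else:
        # Relative name: a subdirectory of the user's home directory,
        # assumed here to be home_root/user.
        base = home_root + "/" + user + "/" + userdir
    return base + "/" + rest

for argument in ["public_html", "/home/userpages", "/home/*/webstuff",
                 "http://other/home/userpages", "http://other/*/webstuff"]:
    print(argument, "->", map_user_url("/~user/index.html", argument))
```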

Delegating the controls for certain pages

So far, our editing of the file httpd.conf has set parameters for the entire server. On occasion it is appropriate to have one set of parameters for one set of web pages and another for other parts. We need some way to pass directives applicable just to a certain set of pages. There are a number of ways to describe subsets of pages and Apache supports them all. We will restrict ourselves to just the simplest in this course.

The simplest is by considering subtrees of the web pages. We can tag a directory with some special options and have those apply to every web page beneath it. For example we might want to restrict access to everything under /var/www/html/restricted/.

We might want to tag multiple directory trees by applying these overrides to every directory that matches a regular expression, rather than by specifying its explicit name. For example, anything under any directory called restricted might get special rules.

We might just specify a regular expression that matches files and apply the rules to any files (as opposed to directories) that match. So any web page called special.html might get nonstandard rules.

Alternatively, rather than specify the restriction by file name (after the URL has been resolved) we might change the rules according to the URL quoted before this gets mapped onto a file (or directory) name. Again this could be a subtree of URLs or the set of URLs that match a regular expression.

Using the directory structure to control options also permits the placing of special files in the directory structure to control the trees beneath them (traditionally called .htaccess). These control files, in turn, might benefit from the filename matching rules to stop their being fetched by clients.

While these all make sense in isolation, the combination of rules governing directory trees, URLs, filename regular expressions and URL regular expressions is a recipe for trouble. We are going to approach this issue from the KISS (‘Keep It Simple, Stupid!’) standpoint and restrict ourselves to directory subtrees here.

Figure 83. A simple restriction example

By default:
index.html files to be respected.
Automatic indexing permitted.
Under /var/www/html/fubar/:
index.html files to be respected.
Automatic indexing forbidden.

Our configuration file will run with

DirectoryIndex index.html

but we need

Options +Indexes

for the default case and

Options -Indexes

for the /fubar/ subdirectory. The next element of the configuration file we will examine provides precisely this functionality.

Figure 84. httpd.conf: Restricting options to subdirectories

# Default
Options +Indexes

# Subdirectory restriction
<Directory /var/www/html/fubar/>
Options -Indexes
</Directory>

The <Directory> tag limits the application of parts of the configuration to just those files and directories beneath /var/www/html/fubar/.

We start to see a problem here, though. Inevitably, the directory structure will get larger and larger. The set of overrides and rules will get longer and longer. More and more people will need access to the httpd.conf file. More and more lines will get added to it. This is bad. What is needed is a way to delegate the controls over a directory tree to the directory itself. This facility exists, using control files in some directories. These are traditionally called .htaccess but the file can have any name we choose to give it. Before we start delegating control, though, we might want to restrict just what configurations in the httpd.conf file we are prepared to have overridden in the delegated control files.

Figure 85. httpd.conf: Delegation of (some) control

AccessFileName .config

<Directory /var/www/html>
AllowOverride AuthConfig FileInfo Indexes
</Directory>

The delegated control file was originally used to control access to subtrees of web pages (and we'll see how to do this soon) and the name of the directive that sets it (AccessFileName) reflects that history. It is a more general overriding facility, though, so to reinforce that, we'll use the AccessFileName directive to set the name of the delegated configuration file to be .config.

The AllowOverride line specifies what facets of the Apache configuration can be overridden in the .config files. This aspect of the Apache control mechanism is not as refined as it might be, unfortunately. Any directive that appears in the httpd.conf file and which ‘makes sense’ applied to a directory tree (more precisely, any directive that could appear in a <Directory>...</Directory> block) can be placed in this subconfiguration file.

Figure 86. Core functionality: Delegation of (some) control

AccessFileName fname
Within the document tree a file named fname will override the default behaviour with the behaviour specified within (insofar as is permitted).
AllowOverride suboptions
This directive specifies exactly what aspects of the configuration may and may not be overridden in the files named by the AccessFileName directive.

Figure 87. Core functionality: AllowOverride suboptions

AuthConfig
Control the mechanisms used for authenticating users for access to restricted documents. See the section on access control for more on this option.
FileInfo
This permits the use of the directives found in the MIME module to change or add MIME types.
Indexes
This permits the use of the directives found in the two directory modules.
Options
Allow the use of the Options directive in the delegated control files.
All
Permit all overrides.
None
Permit no overrides. Ignore the delegated control files.

Now let's return to the case study in the slide. We will drop the subdirectory directives entirely and instead specify what overrides we will permit and in what file. We then have a .config file that changes the options again.

Figure 88. httpd.conf: Restricting options to subdirectories

# Default
Options +Indexes
AccessFileName .config
<Directory /var/www/html>
AllowOverride Options
</Directory>

Figure 89. /var/www/html/fubar/.config contents

Options -Indexes

Now we start to see Apache creak at the seams. Note that to change the nature of indexes (using the IndexOptions directive) we would need to allow the override Indexes. However, because turning automatic indexing on or off (Options +Indexes or Options -Indexes) is handled by the Options directive we have to permit the Options override. This is unfortunate because there are other suboptions to Options that we might not want to delegate. Mercifully, indexing is the exception rather than the rule in this regard. In most other cases the controls over delegation do make sense.

The next question to address is how these files nest. If I have a default state of Options +Indexes and a file /var/www/html/fubar/.config containing Options -Indexes, what happens if I have a file /var/www/html/fubar/snafu/.config with Options +Indexes? As you might expect the fubar/snafu/.config overrides the fubar/.config file for the contents of fubar/snafu/.
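The nesting rule amounts to walking down from the top of the document tree and letting each control file found on the way replace the setting inherited from above. The sketch below models only the Indexes option and uses a dictionary in place of real .config files. indexes_enabled and the overrides table are illustrative names, and directory is assumed to lie under root.

```python
def indexes_enabled(directory, overrides, default=True, root="/var/www/html"):
    """Work out whether automatic indexing applies to a directory.

    overrides maps a directory path to True (Options +Indexes) or
    False (Options -Indexes), standing in for the .config files.
    """
    enabled = overrides.get(root, default)
    current = root
    # Walk from the root down to the requested directory, letting each
    # level override whatever was inherited from above.
    for part in directory[len(root):].split("/"):
        if not part:
            continue
        current = current + "/" + part
        enabled = overrides.get(current, enabled)
    return enabled

overrides = {
    "/var/www/html/fubar": False,       # fubar/.config: Options -Indexes
    "/var/www/html/fubar/snafu": True,  # fubar/snafu/.config: Options +Indexes
}
for path in ["/var/www/html", "/var/www/html/fubar",
             "/var/www/html/fubar/snafu"]:
    print(path, indexes_enabled(path, overrides))
```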

Access control by client IP address

There are basically two ways to restrict access to web pages: by the client's IP address or by making the client quote a userid and password. For the time being we will control the entire web site. We can then use the previous section to control just subsets of the site. In this section we will restrict by IP address and in the following section we will describe a superior system.

So, first of all, a warning about IP access restrictions: web proxies can really spoil your day. A web proxy is a system that forwards web requests on to another server. So if www.inst.cam.ac.uk restricts access to clients inside cam.ac.uk it is vulnerable to proxies within cam.ac.uk. If randompc.example.com makes a direct request it will be rejected. However, if randompc.example.com makes a request of a web proxy, proxy.college.cam.ac.uk, then the proxy forwards the request to www.inst.cam.ac.uk. The latter sees a query coming from within cam.ac.uk and honours it. The proxy then forwards the answer back to randompc.example.com. The moral of this tale is to use client address restriction only when you restrict to a set of machines you control (enough to restrict proxies on them). Don't use it blindly.

To give an example of how hard this is, the Computing Service discovered it was running an unintended proxy which allowed the CS minutes (restricted to the CS internal network by IP address) to be read from any machine in the world by anyone who knew about our proxy. The CS friendly probing suite now probes for web proxies so you won't be surprised by yours the way we were by ours.

And now, we will demonstrate how to add client address access restrictions in the Apache configuration file using the mod_access module.

Figure 90. httpd.conf: Access restrictions

# Access control by IP address

LoadModule access_module      modules/mod_access.so
AddModule mod_access.c

order deny,allow
allow from .csi.cam.ac.uk
deny from all
allow from .csx.cam.ac.uk

The order line is read first. The ‘deny,allow’ argument specifies that

initially all requests will be honoured
then all the deny lines will be applied
then all the allow lines will be applied

regardless of the order the lines appear in.

In the example given, access is permitted to the site from clients in the two domains csi.cam.ac.uk and csx.cam.ac.uk but no others. Note the use of the leading dot to indicate that, for example, .csx.cam.ac.uk is a domain and not a hostname. Also note that for access control by domain name to work you need to have HostnameLookups set to On.

Figure 91. Request from randompc.example.com

Initial state: Access allowed
deny from all: Access denied
allow from .csi.cam.ac.uk: Inapplicable—No change
allow from .csx.cam.ac.uk: Inapplicable—No change
Final state: Access denied

Figure 92. Request from ghoul.csi.cam.ac.uk

Initial state: Access allowed
deny from all: Access denied
allow from .csi.cam.ac.uk: Applicable—Access allowed
allow from .csx.cam.ac.uk: Inapplicable—No change
Final state: Access allowed
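The evaluation traced in Figures 91 and 92 can be reproduced with a short Python sketch. It is a simplified model of the logic described in the text rather than mod_access itself: it matches host names only (no IP addresses or netmasks), assumes names are already lower case, and the function names host_matches and access_allowed are hypothetical.

```python
def host_matches(pattern, hostname):
    """Match 'all', a leading-dot domain such as '.csi.cam.ac.uk', or an exact host name."""
    if pattern == "all":
        return True
    if pattern.startswith("."):
        return hostname.endswith(pattern)
    return hostname == pattern

def access_allowed(order, rules, hostname):
    """Evaluate allow/deny rules in the sequence given by the order line.

    order -- 'deny,allow' or 'allow,deny'
    rules -- (kind, pattern) pairs in any order, kind being 'allow' or 'deny'
    """
    first, second = order.split(",")
    # The kind named second is applied last, so the initial state is the
    # outcome that kind grants: 'deny,allow' starts with access allowed.
    allowed = (second == "allow")
    for kind in (first, second):
        for rule_kind, pattern in rules:
            if rule_kind == kind and host_matches(pattern, hostname):
                allowed = (kind == "allow")
    return allowed

# The rules from Figure 90, deliberately in the same scrambled order.
RULES = [
    ("allow", ".csi.cam.ac.uk"),
    ("deny", "all"),
    ("allow", ".csx.cam.ac.uk"),
]

for client in ("randompc.example.com", "ghoul.csi.cam.ac.uk"):
    print(client, access_allowed("deny,allow", RULES, client))
```

Swapping the order argument to 'allow,deny' with the same rules denies every client, because the deny from all line is then applied last.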

Figure 93. mod_access: “allow” directives

order deny,allow
Initially all access allowed,
then apply all deny lines,
then apply all allow lines.

order allow,deny
Initially all access denied,
then apply all allow lines,
then apply all deny lines.

allow from all
All requests are allowed.

allow from host.inst.cam.ac.uk
Requests from the host are allowed. Requires HostnameLookups On.

allow from .inst.cam.ac.uk
Requests from hosts within the domain are allowed. Requires HostnameLookups On.

allow from 131.111.11.84
Requests from the host are permitted.

allow from 131.111.11.0/255.255.255.0
Requests from any IP address starting 131.111.11. are allowed.

allow from 131.111.11.0/24
Requests from any IP address starting 131.111.11. are allowed. (The first three numbers correspond to the first 24 bits of the IP address quoted.)

Figure 94. mod_access: “deny” directives

deny from ...
As per allow from ...
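The two network forms in Figure 93 describe the same set of addresses. As a quick check, here is a sketch using Python's standard ipaddress module; it only illustrates the arithmetic and says nothing about how Apache parses these arguments. The helper name in_network is hypothetical.

```python
import ipaddress

# Netmask form and prefix-length form of the same network.
mask_form = ipaddress.ip_network("131.111.11.0/255.255.255.0")
cidr_form = ipaddress.ip_network("131.111.11.0/24")

def in_network(address, network):
    """True if the dotted-quad `address` falls inside `network`."""
    return ipaddress.ip_address(address) in network

print(mask_form == cidr_form)                    # both forms name one network
print(in_network("131.111.11.84", cidr_form))    # first 24 bits match
print(in_network("131.111.10.245", cidr_form))   # third number differs
```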

Access control by user authentication

As said before, the author advises very strongly against restricting access to .cam.ac.uk. You may be able to get away with restrictions to inst.cam.ac.uk if you rule your institution with an iron fist. Often it is far more useful to require authorised users to authenticate themselves against the server. The mechanisms for this are split over a variety of modules and core Apache functionality depending on how you want to run the authentication. We will take the simplest approach here and authenticate against a text password file. Equivalent modules exist for authenticating against more complex databases. These become necessary if the password file gets too big and a linear search through a text file becomes too slow.

Figure 95. httpd.conf: Restricting access to authenticated users

LoadModule auth_module modules/mod_auth.so
AddModule mod_auth.c

<Directory /var/www/html/restricted>
AuthType Basic
AuthName wombat
AuthUserFile /etc/httpd/conf/passwd
require valid-user
</Directory>

Figure 96. Creating an Apache password file

$ touch /etc/httpd/conf/passwd
$ ls -l /etc/httpd/conf/passwd
-rw-rw-r--    1 root     webadmin        0 Jun  1 10:12 passwd
$ htpasswd /etc/httpd/conf/passwd demouser
New password: dem0user
Re-type new password: dem0user
Adding password for user demouser

First let's consider what we've done to the httpd.conf file. We have included a module mod_auth whose function is to permit checking IDs against a plain text password file. This module provides us with the AuthUserFile directive which specifies the location of that password file. The AuthName and AuthType directives belong to the core Apache functionality because they are independent of the supporting database. The AuthType directive specifies the mechanism that is going to be used to transmit the ID and password. If we are going to use the mod_auth module we must specify Basic as the authentication type because this is the only one widely understood. This sends passwords unencrypted over HTTP.

Basic authentication is best illustrated by using telnet as our web client again.

Figure 97. Basic authentication uncovered—1

$ telnet hydra.csi.cam.ac.uk 80
Trying 131.111.11.148...
Connected to hydra.csi.cam.ac.uk.
Escape character is '^]'.
GET /restricted/ HTTP/1.0

HTTP/1.1 401 Authorization Required
Date: Thu, 01 Jun 2000 10:29:37 GMT
Server: Apache/1.3.12 (Unix)  (Red Hat/Linux)
WWW-Authenticate: Basic realm="wombat"
Connection: close
Content-Type: text/html; charset=iso-8859-1
 ...
Connection closed by foreign host.

So our attempt to get the /restricted/ URL fails with a status code 401 ‘Authorization Required’. Note the HTTP header line

WWW-Authenticate: Basic realm="wombat"

On receipt of this status code and header line a sensible browser will prompt the user for an ID and password for the server, quoting the realm ‘wombat’. The concept of realms allows us to split the web site into more than one distinctly controlled area. For one directory tree (/var/www/html/restricted/) we can demand IDs and passwords for one realm (wombat) and for another tree we can demand a different set of IDs and passwords.

The browser will then send back the same request as before but this time quoting the ID and password given, Base64 encoded. (The Base64 encoding of ‘demouser:dem0user’ is ‘ZGVtb3VzZXI6ZGVtMHVzZXI=’.)
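The encoding step is easy to reproduce with Python's standard base64 module. The sketch below uses hypothetical helper names (basic_credentials, decode_credentials); it also shows that the token can be decoded by anyone who sees it, which is why Basic authentication counts as sending the password unencrypted.

```python
import base64

def basic_credentials(userid, password):
    """Return the token a client sends after 'Authorization: Basic '."""
    raw = f"{userid}:{password}".encode("ascii")
    return base64.b64encode(raw).decode("ascii")

def decode_credentials(token):
    """Recover (userid, password) from the token. No key is needed."""
    text = base64.b64decode(token).decode("ascii")
    userid, _, password = text.partition(":")
    return userid, password

token = basic_credentials("demouser", "dem0user")
print("Authorization: Basic " + token)
print(decode_credentials(token))
```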

Figure 98. Basic authentication uncovered—2

$ telnet hydra.csi.cam.ac.uk 80
Trying 131.111.11.148...
Connected to hydra.csi.cam.ac.uk.
Escape character is '^]'.
GET /restricted/ HTTP/1.0
Authorization: Basic ZGVtb3VzZXI6ZGVtMHVzZXI=

HTTP/1.1 200 OK
Date: Thu, 01 Jun 2000 11:09:15 GMT
Server: Apache/1.3.12 (Unix)  (Red Hat/Linux)
Last-Modified: Thu, 01 Jun 2000 10:28:10 GMT
ETag: "6b543-144-39363aba"
Accept-Ranges: bytes
Content-Length: 324
Connection: close
Content-Type: text/html
 ...

The browser will typically remember the userid and password for realm ‘wombat’ and if challenged for the same realm again won't reprompt the user.

Figure 99. ID-based access restriction logic

Authenticate the ID
Is the ID allowed access?

So far we have just explained how Basic authentication authenticates a web user. We still haven't really explained why the user is subsequently let in. There are two sides to the permissions: first, the client must authenticate themselves to the server as a particular ID. Second, the ID must, of itself, have permission to access the pages. This second stage is covered by the require directive. The line in our example file

require valid-user

means that any user from the /etc/httpd/conf/passwd file is allowed access if they can quote the password.

Figure 100. An example /etc/httpd/conf/passwd file

demouser:RGMhGsfmvLQeE
bob:ylxjJ83Fx7p8E
tom:C6QeAIpNqz9IE
dick:yfPWrksACScys
harry:tXFkoaIYJqbrk

The password file maintained by the htpasswd program uses the same password hashing algorithm as the traditional Unix password file, but note that you cannot use the system password file for the Apache system. This file must be maintained separately. Also note that the IDs used in this file are not login names. There need be no relation at all between the IDs used for web authentication and the system's login names.

Figure 101. A more refined access control

/var/www/html/restricted/alpha: Any valid user
/var/www/html/restricted/beta: tom, dick, harry
/var/www/html/restricted/gamma: bob, tom

Figure 102. httpd.conf: Finer grained access control

LoadModule auth_module modules/mod_auth.so
AddModule mod_auth.c

<Directory /var/www/html/restricted>
AuthType Basic
AuthName wombat
AuthUserFile /etc/httpd/conf/passwd
</Directory>

<Directory /var/www/html/restricted/alpha>
require valid-user
</Directory>

<Directory /var/www/html/restricted/beta>
require user tom dick harry
</Directory>

<Directory /var/www/html/restricted/gamma>
require user bob tom
</Directory>

In the slide we see an alternative use of require. Here, we set up a single mechanism to authenticate the clients for the directory /var/www/html/restricted and three different schemes for determining who (once authenticated) is allowed in.

Figure 103. httpd.conf: Access control by groups

LoadModule auth_module modules/mod_auth.so
AddModule mod_auth.c

<Directory /var/www/html/restricted>
AuthType Basic
AuthName wombat
AuthUserFile /etc/httpd/conf/passwd
AuthGroupFile /etc/httpd/conf/group
</Directory>

<Directory /var/www/html/restricted/alpha>
require valid-user
</Directory>

<Directory /var/www/html/restricted/beta>
require group betagrp
</Directory>

<Directory /var/www/html/restricted/gamma>
require group gammagrp
</Directory>

Figure 104. An example /etc/httpd/conf/group file

betagrp: tom dick harry
gammagrp: bob tom

There is one level of sophistication above lists of users: lists of groups. In addition to the password file for web IDs to be authenticated there can be a group file assigning these web IDs to web groups. Again, these are completely independent of the Unix login groups and note that the web group file has a different format from the Unix group file.
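The second stage of the check (is this already authenticated ID allowed in?) can be modelled in Python as follows. This is a sketch of the logic described above, not of mod_auth's code: the function names parse_group_file and is_authorised are hypothetical, and the group file contents are those of Figure 104.

```python
def parse_group_file(text):
    """Parse lines of the form 'group: id1 id2 ...' into {group: set of IDs}."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = set(members.split())
    return groups

def is_authorised(user_id, requirement, groups):
    """Second stage only: `user_id` is assumed to have been authenticated already."""
    words = requirement.split()
    if words[:2] == ["require", "valid-user"]:
        return True
    if words[:2] == ["require", "user"]:
        return user_id in words[2:]
    if words[:2] == ["require", "group"]:
        return any(user_id in groups.get(g, set()) for g in words[2:])
    return False

GROUP_FILE = """\
betagrp: tom dick harry
gammagrp: bob tom
"""
GROUPS = parse_group_file(GROUP_FILE)

print(is_authorised("tom", "require group betagrp", GROUPS))
print(is_authorised("bob", "require group betagrp", GROUPS))
```

With these groups, tom passes both group checks while bob passes only the one for gammagrp, matching the table in Figure 101.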

It's worth recalling that anything that appears in a <Directory> block can also appear in the directory's corresponding .htaccess file (or whatever you chose to call it with the AccessFileName directive).

Figure 105. mod_auth: Directives

AuthType Basic: Specifies the ‘basic’ authentication mechanism.
AuthName realm: Specifies the ‘security realm’.
AuthUserFile file: Specifies the web ID password file.
AuthGroupFile file: Specifies the web group file.
require valid-user: Any authenticated ID may have access.
require user user1 user2: ID must be authenticated and be one of user1 or user2 to have access.
require group grp1 grp2: ID must be authenticated and be in group grp1 or grp2 to have access.

Virtual hosts

Figure 106. HTTP request headers

GET / HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/4.72 [en] (X11; U; Linux 2.2.14-6.1.1 i686)
Host: hydra.csi.cam.ac.uk
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: es, en
Accept-Charset: iso-8859-1,*,utf-8

A header that is part of the HTTP/1.1 spec. but has been a standard extension to HTTP/1.0 in all browsers is the Host: header. This identifies by name the host the browser was trying to connect to.

At first glance this is pointless. If the browser hadn't been trying to connect to the server it wouldn't have been connecting to this instance of Apache in the first place! However, it is possible to have multiple different names all pointing to the same IP address and hence the same instance of Apache.

There are several ways to do this in the DNS but the most common and easiest is to have a single real name for the IP address (its ‘canonical name’) specified in the DNS by an A record (so called because it looks up the address corresponding to a name) and one or more aliases. These aliases are other names defined to be equivalent to the real name by CNAME records in the DNS (so called because they look up the canonical name of the alias).

Figure 107. DNS entries

www-uxsup.csx.cam.ac.uk.  1D IN CNAME  nymph.csi.cam.ac.uk.
nymph.csi.cam.ac.uk.      1D IN A      131.111.10.245

By explicit inclusion of the originally requested host name it is possible to have a multiplicity of websites each corresponding to different names for the same host. This is managed in the configuration file with the <VirtualHost> directive.

Figure 108. httpd.conf: Setting up a virtual host

# Virtual host example
<VirtualHost cockatrice.csi.cam.ac.uk>
DocumentRoot /var/www/cock
</VirtualHost>

The slide shows the setting up of a virtual host that uses the system-wide definitions but has a different document root. You might want to create separate Unix user groups for the control of the content of the virtual host data and the ‘canonical’ host data.

On systems that run multiple virtual hosts, it is very common for the canonical document root to have nothing but a home page saying ‘go to one of these virtual hosts’ and for all the data to be under the document trees for the various virtual hosts.
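To make the role of the Host: header concrete, here is a small Python sketch of how a server could choose a document root from it. This is a toy model, not Apache's matching algorithm: the function names are hypothetical, ports in the Host: value are not handled, and the default root of /var/www/html is an assumption for illustration.

```python
def parse_host_header(request_text):
    """Return the value of the Host: header from raw request text, or None."""
    for line in request_text.splitlines()[1:]:   # skip the request line
        if not line:
            break                                # blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().lower()
    return None

def document_root(request_text, virtual_hosts, default_root):
    """Pick the document root for the named virtual host, else the default."""
    host = parse_host_header(request_text)
    return virtual_hosts.get(host, default_root)

# Virtual host from Figure 108; the default root is assumed.
VIRTUAL_HOSTS = {"cockatrice.csi.cam.ac.uk": "/var/www/cock"}
DEFAULT_ROOT = "/var/www/html"

request = "GET / HTTP/1.0\r\nHost: cockatrice.csi.cam.ac.uk\r\n\r\n"
print(document_root(request, VIRTUAL_HOSTS, DEFAULT_ROOT))
```

A request with no Host: header, or one naming a host that has no virtual host entry, falls through to the default document root in this model.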

The title of this document is: Web Server Management: Running Apache on Red Hat Linux
URL: http://www-uxsup.csx.cam.ac.uk/courses/apache/student.html