Making Unix Web Servers Case-Insensitive

If you run a web server on Windows NT with IIS, URLs are case-insensitive: requests for /News/Story.html and /news/story.html both reach the same file. On Unix, they do not. The file system is case-sensitive, so /News/Story.html and /news/story.html are two different paths, and if only one exists, a request for the other returns a 404 error.

This is a real problem. When you migrate content between NT and Unix servers, links break because of case mismatches. When users type URLs by hand, they get the case wrong. When other sites link to your pages, they sometimes get the capitalization wrong. On NT, none of this matters. On Unix, every one of these becomes a broken link.

Ranjit Bhatnagar and I wrote two scripts that solve this. They work together as a pair.

The first script is make-dbm-of-urls.pl. It walks the document tree on your web server and builds a DBM database that maps the lowercased form of every path to the path as it actually exists on disk. You run it periodically, or after publishing new content, to keep the database current. The database is a simple lookup table: given a lowercased URL, it returns the actual path with the correct capitalization.
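
The core of that idea fits in a few lines. What follows is a minimal sketch, not the full script: it walks a directory with File::Find and fills an ordinary in-memory hash instead of a DBM file, and it has no exclude list. The sub name build_case_map is a hypothetical name chosen for illustration.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $root and return a hash reference that maps
# each lowercased relative path to the path as it is spelled on disk.
sub build_case_map {
    my ($root) = @_;
    $root =~ s{/+\z}{} if length($root) > 1;      # drop any trailing slash
    my $prefix = $root eq '/' ? '/' : "$root/";
    my %map;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                my $path = $File::Find::name;
                return if $path eq $root;          # skip the root itself
                # Strip the root prefix to get the URL-relative path.
                my $rel = substr($path, length $prefix);
                $map{ lc $rel } = $rel;
            },
        },
        $root
    );
    return \%map;
}

# Example use: pass a document root on the command line.
if (@ARGV) {
    my $map = build_case_map($ARGV[0]);
    print "$_ => $map->{$_}\n" for sort keys %$map;
}
```

Swapping the plain hash for one tied to a DBM file is what makes the table available to a separate program later.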

The second script is not_found.cgi. You configure your web server to call this script whenever it would normally return a 404 error. The script takes the requested URL, lowercases it, and looks it up in the DBM database. If it finds a match, it sends a redirect to the correctly capitalized URL. The user never sees an error page. They just end up at the right place.
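
How you register the handler depends on the server. With Apache, for example, it is done with an ErrorDocument 404 directive that points at the script's URL. The lookup itself can be sketched as below. This is an illustration under assumptions, not the script itself: the sub name find_real_url, the sample table, and the host www.example.com are all made up, and a plain hash stands in for the DBM database.

```perl
use strict;
use warnings;

# Hypothetical helper: given a map of lowercased paths to real paths and
# the path the browser asked for, return the correctly capitalized path,
# or undef when nothing matches even after lowercasing.
sub find_real_url {
    my ($map, $requested) = @_;
    (my $key = $requested) =~ s{^/}{};   # the map keys carry no leading slash
    return $map->{ lc $key };
}

# Stand-in for the DBM database; this entry is invented for the example.
my %map = ('news/story.html' => 'News/Story.html');

my $real = find_real_url(\%map, '/NEWS/story.HTML');
if (defined $real) {
    # A CGI program asks for a redirect by printing a Location header.
    print "Location: http://www.example.com/$real\n\n";
}
else {
    print "Content-type: text/html\n\nNot found\n";
}
```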

What I like about not_found.cgi is that it does more than just redirect. It adapts its error message based on how the user arrived at the bad URL. The script checks the HTTP Referer header and responds differently depending on the situation:

If the user followed a link from another web site, the error page tells them that the link on the referring site is broken, and suggests they contact the webmaster of that site to fix it.

If the user typed the URL directly or came from a bookmark, the page tells them to double-check the address or update their bookmark.

If the user followed a link from within our own site, the page acknowledges that the broken link is ours and invites them to email us about it.
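
The three cases above come down to one decision on the Referer header. Here is a minimal sketch of that decision, assuming a hypothetical sub classify_referer and a placeholder domain example.com. It compares host names, which is a slightly stricter test than a simple pattern match against the whole referer string.

```perl
use strict;
use warnings;

# Hypothetical helper: classify how the visitor reached the missing page.
# $referer is the HTTP Referer header (may be empty or undefined) and
# $own_domain is this site's domain, e.g. "example.com" (a placeholder).
sub classify_referer {
    my ($referer, $own_domain) = @_;

    # No referer: the address was typed in or came from a bookmark.
    return 'typed_or_bookmark' unless defined $referer && length $referer;

    # Pull the host name out of the referring URL.
    my ($host) = $referer =~ m{^[A-Za-z]+://([^/:]+)};
    return 'external' unless defined $host;

    # Our own site: the domain itself or any host under it.
    return 'internal'
        if lc($host) eq lc($own_domain)
        || $host =~ /\.\Q$own_domain\E\z/i;

    return 'external';
}

for my $ref ('', 'http://www.example.com/index.html',
             'http://other.example.org/links.html') {
    printf "%-40s => %s\n", ($ref || '(none)'),
        classify_referer($ref, 'example.com');
}
```

Each of the three return values would then select one of the three messages.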

This is a small thing, but it makes the error experience much less confusing. A generic “page not found” message leaves the user wondering what went wrong. Our version tells them exactly what happened and what to do about it.

We use this at Philadelphia Newspapers on our Unix web servers. It handles the case sensitivity problem cleanly without requiring us to rename thousands of files or change how our publishing system works. The DBM lookup is fast, so there is no noticeable delay when a redirect happens.

The scripts are written in Perl and should work with any Unix web server that supports CGI and custom error handlers, which is most of them. You can get them from my web site at http://rajiv.org/free/.

Here is the full source code for the URL database builder:

#!/usr/local/bin/perl

# Program name: make-dbm-of-urls
# Installed in: /inet/programs/

# Version: 1.0 1996/Sep/22

# Author: Rajiv Pant (Betul)  betul@rajiv.org  http://rajiv.org
# & Ranjit Bhatnagar  http://moonmilk.volcano.org

# Note: You will need to adjust all folder locations in this program
# to suit your system.

# Purpose/Description:
# --------------------
#
# Makes a list (actually a dbm hash) of all the files under the web
# document root.
# The keys to the hash are the file paths in all lower case letters,
# the corresponding values are the actual pathnames which may be in
# mixed case.
#
# This list is used by:
# * A server API/CGI program to make the Unix based web server ignore
# upper/lower case when someone requests a url like NT does. When a page
# is not found, the server runs an api module or cgi program that converts
# the url to all lowercase and checks this hash table, if the page is
# found, it forwards the browser to it. If not, it gives the usual
# not found message.
# * The indexing program to make the site searchable.
#
#
# Q. Why would I want to make my Unix web server ignore case in URLs ?
# A. Several reasons. Many sites use a mixture of naming conventions
#    especially when many people work on the site. Also, when people
#    upload files to the unix servers from PCs or Macs, the case may
#    vary depending on the program used to transfer, its configuration,
#    and the file name itself.
#    Also, if your unix server shares a disk with an NT server using
#    Samba or NFS, and you want to make it searchable using MS Index
#    Server or NT based search program, this ensures that URLs will
#    always work.
#    It makes it easier for you to give out your urls without saying
#    "with an uppercase F and a lowercase o".
#    If NT web servers do not care about case in URLs, why should unix ?
#
# Author: Rajiv Pant (Betul)  betul@rajiv.org  http://rajiv.org



# ---- Libraries used ----

require 5.003 ;

use File::Find ;	# Part of standard perl distribution.

use Fcntl ;		# Part of standard perl distribution.

# Note: If you do not have Berkeley DB installed, any of the
# other Perl DBMish modules (GDBM_File, NDBM_File, ODBM_File, SDBM_File)
# will also suffice.

use DB_File ; 		# Part of standard perl distribution.


# ---- /Libraries used ----



# --- Directories and files ---

# This is the web server's document root. If you would like this
# program to handle some other virtual roots too, you should list
# them here.

$document_root	= '/disk2/web' ;


# $indices_dir is where the search indexes and some related files
# are stored.

$indices_dir	= '/datafiles/indices' ;


# $exclude_list is a list of folders under document root which
# should not be included in this list. Any folders inside these
# folders are also skipped. This plain text file follows a simple
# format which is explained below.

$exclude_list	= '/pin/pub/exclude-from-search.txt' ;


# $dbm_of_urls is the name of the dbm that will contain this hash
# table (associative array) of all lowercase urls to their real
# path names.

$dbm_of_urls	= $indices_dir . '/dbm_of_urls' ;


# --- /Directories and files ---




# ---- Reading the exclude list ----

# A short, sample exclude list file follows.
# The file can contain comments. Any line containing a # is considered
# a comment. To use the sample file below, you will have to remove the
# comment sign and space "# " that prefixes each entry.
#
# -- Sample begins in next line --
# ads
# clients/mohan
# clients/vic/adultpages
# messages/error
# test
# -- Sample ends in previous line --

open (EL, $exclude_list) ;
while (<EL>)
  {
  s/\s//g ;		# Removing spaces, tabs and newlines.
  next if /#/ ;		# Skipping comments.
  next unless /\w/ ;	# Skipping blank lines.

  push @not_to_be_indexed, $_ ;
  }
close (EL) ;

#print join "\n", @not_to_be_indexed ; exit ; # debug

# ---- /Reading the exclude list ----




# ---- main ----

# Note: Depending on how you set up your system, you may want to
# first remove the existing dbm file before adding urls to it here.

tie %dbm_of_urls, DB_File, $dbm_of_urls, O_RDWR|O_CREAT, 0644 ;

&find (\&add_url_to_dbm, $document_root) ;

untie %dbm_of_urls ;

# ---- /main ----




# The add_url_to_dbm subroutine is called by the find subroutine as
# it recurses the directory tree. When add_url_to_dbm sees a directory
# in the not to be indexed list, it tells find() not to recurse into
# that folder. find skips to the next folder and the list gets built,
# saving system resources that would have been wasted in a complete
# traversal.

sub add_url_to_dbm
{
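# $File::Find::name has no trailing slash for a directory, so compare
# it against the full path of each excluded folder.
if (-d and
    grep $File::Find::name eq "$document_root/$_" , @not_to_be_indexed)
  { $File::Find::prune = 1 }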

else
  {
  ($URL) = $File::Find::name =~ /^$document_root\/(.*)$/ ;
  ($in_lower_case = $URL) =~ tr/A-Z/a-z/ ;
  $dbm_of_urls{$in_lower_case} = $URL ;
  }

} # ---- end of sub add_url_to_dbm ----

# Author: Rajiv Pant (Betul)   betul@rajiv.org   http://rajiv.org

And here is the smart 404 handler:

#!/usr/local/bin/perl
# Program name:	not_found.cgi
# Installed in:	/programs/messages/
# Runs on:	all our unix web servers

# Author: Rajiv Pant (Betul)  http://rajiv.org   betul@rajiv.org

# Version: 2.0. (Updated for make-dbm-of-urls) 1996/Sep/22

# Note: You will need to adjust all folder locations in this program
# to suit your system.


BEGIN { $| = 1 }


# ---- Libraries used ----

require 5.003 ;

use File::Find ;        # Part of standard perl distribution.

use Fcntl ;             # Part of standard perl distribution.

# See the note about DB_File in the companion program make-dbm-of-urls.
use DB_File ;           # Part of standard perl distribution.


# ---- /Libraries used ----




# --- Directories and files ---

# This is the web server's document root. If you would like this
# program to handle some other virtual roots too, you should list
# them here.

$document_root  = '/inet/web' ;


# $indices_dir is where the search indexes and some related files
# are stored.

$indices_dir    = '/inet/index' ;


# $dbm_of_urls is the name of the dbm that contains the hash
# table (associative array) of all lowercase urls to their real
# path names. It must be the same file that make-dbm-of-urls writes.

$dbm_of_urls    = $indices_dir . '/dbm_of_urls' ;


# --- /Directories and files ---




# --- Reading CGI values ---
# In a server API version of this program, read in the corresponding
# values.

$referer		= $ENV{'HTTP_REFERER'} ;
$server_name		= $ENV{'SERVER_NAME'} ;
($referer_server)	= $referer =~ m%[A-Za-z]+://([^/]+)/% ;
$path_info		= $ENV{'PATH_INFO'} ;
$path_info		= $ENV{'REDIRECT_URL'} unless $path_info ;

($in_lower_case)	= $path_info =~ /^\/(.*)$/ ;
$in_lower_case		=~ tr/A-Z/a-z/ ;

# --- /Reading CGI values ---

# O_RDONLY is the Fcntl constant for read-only access.
tie %dbm_of_urls, DB_File, $dbm_of_urls, O_RDONLY, 0 ;

$url = $dbm_of_urls{$in_lower_case} ;

untie %dbm_of_urls ;


if ($url)
  {
  print <<EOM;
Location: http://$server_name/$url
Content-type: text/html


<html>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="0; URL=/$url">
<title>The page you requested is at: /$url</title>
</head>
<body>
The page you asked for, $path_info, is located at
<a href="/$url">/$url</a>. Your browser should have
taken you there automatically.
</body>
</html>
EOM
  exit ;
  }


# else continuing with regular error handling ...


print <<EOM;
Content-type: text/html

<html>

<head>
<title>
$path_info not found on $server_name
</title>
</head>

<!-- Technical Problems ? Contact Rajiv Pant (Betul)   http://rajiv.org -->
<!-- betul\@rajiv.org -->

<body><!-- default grey background for error message page -->

<center>
<h3>
The page
<font color="#de0031">
$path_info
</font>
you requested could not be found on $server_name
</h3>
</center>
<font size=+1>
EOM

if ($referer eq '')
  {
  print <<EOM;
Please check if the URL
<font color="#de0031">http://$server_name$path_info</font>
you have typed in is accurate.
<p>
If you have come here via an old bookmark you had made on
our site, it is possible that the page has been deleted or moved
or was a part of a virtual space. In that case, you may want to
look for it on our site again.
EOM
  }
elsif ($referer =~ /\.rajiv\.com/)
  {
print <<EOM;
If you like, you can
<a href="mailto:online.staff\@rajiv.com">send us an email</a>
that the <a href="$referer">$referer</a> page on our site contains
an incorrect link to <font color="#de0031">$path_info</font>.
EOM
  }
else
  {
print <<EOM;
You were referred to the incorrect <font color="#de0031">$path_info</font>
link from the site <font color="#de0031">$referer_server</font>.
<p>

We would appreciate it if you could email the maintainer of the
page <a href="$referer">$referer</a> about this incorrect link to
<font color="#de0031">http://$server_name$path_info</font>.
<p>

You may find the email address on that site, or you may try sending to
<a href="mailto:webmaster\@$referer_server">webmaster\@$referer_server</a>.

EOM
  }

print <<EOM;
<p>
<i>You may find the information you are looking for via our</i>
<ul>
<li><a href="/search/">Search System</a>
<li><a href="/help/contents.html">Table of Contents</a>
</ul>

</font>


<hr size=1 noshade>
<table width=100% border=0>
<tr>
<td align=right>
<a href="mailto:online.staff\@rajiv.com">
online.staff\@rajiv.com
</a>
</td>
</tr>
</table>

<!-- Rajiv Pant (Betul)  betul\@rajiv.org   http://rajiv.org -->

</body>

</html>
EOM