How the UK Caves database works

About the site

This site is run as a non-profit, private site. The data is manually maintained, using information gathered from various sources. It is not run as an official record or cave registry. You can help keep the data on this site up to date by sending in any corrections or additions.

Now, on to the technical details of how the site actually works:

Languages

This site is written using a combination of several Web based languages. The data is stored in tables in an SQL database, and queried using SQL. To produce the tables of caves from the data in the database, I have used PHP 4+, a runtime processing language, specifically designed to output HTML. The PHP creates the page out of semantically useful HTML. The colours, fonts and layout are set up using CSS style sheets. JavaScript is used only in a few places to produce notifications or helpful features. The original site also used JavaScript to open description windows, but these are no longer used.

Because of how the site is written, JavaScript is not required for any functionality (this was also the case for the old design that used JavaScript), frames are not needed (this was also the case for the old design that used framesets), images are not required, CSS support is not needed and no plugins are used. This site will work in even the lowest spec browsers, and with screen reading tools.

The site style (which should cause a lot less pain to your eyes than the original one) converts the semantic HTML into the current layout. Because this is done completely with CSS, it is possible to change the look of the entire site without needing to alter the markup, using a single stylesheet. In fact, if you decide you don't like the style, you can use a browser that allows you to disable the CSS or provide your own, and you can redesign the site for yourself.

The database server I have used is MySQL, although the PHP is written using a separate file that contains all of the SQL functions. As each PHP file includes this file, it is only needed once, and if the SQL server type is changed, for example to PostgreSQL or any compatible SQL databases, only one file will need changing to make all PHP scripts run properly. The user name, password, database name and hostname are all set up in a separate file, included by the file containing the SQL functions. This also will only need to be changed once. The SQL functions code looks like this:

$handler = mysql_connect( $host, $username, $password );

$database_active_handler = mysql_select_db( $database, $handler );

function sqlfunct_query( $query_string ) {
	global $handler;
	$result = mysql_query( $query_string, $handler );
	return( $result );
}

...

So in every PHP page, to make an SQL query I would type:
$my_answer = sqlfunct_query( " - Query String - " );

Language links

LanguageReference manualGoverning body
MySQLFullMySQL
PHPFullPHP.net
HTMLFullW3C
CSSFullW3C
JavaScriptFullECMA, W3C, WhatWG

The database

The entries in the database are in four tables.

The first table (caves) is in the format:

The second table (regions) is in the format:

The third table (parent regions or countries) is in the format:

The fourth table (news) is in the format:

Using this format, I can change a parent region name once in the parent region table and every cave in that parent region will now have the new parent region name. I can also do likewise with region names.

URL rewriting

The URLs used on the site are prepared in human readable format. To convert these into something more useful for PHP, Apache's mod_rewrite is used to rewrite the URLs internally into something much more messy, passing the parts of the human readable variables as GET parameters.

Mod_rewrite also handles some of the more simple redirects from the older URL formats into the new ones, to prevent search engines, bookmarks, and links from breaking when they try to use the new formats. For the more awkward older URLs, such as those used by the cave tables and description windows, the PHP files check the database for the old codes, and redirect to the new URL formats.

Page generation

In all cases, the descriptions and names are modified to suit their placement. For instance, if they are inside a text box or a relevant part of the page, then all standard HTML entities are used in code form, i.e. " becomes ". If they are to be used inside a link, then they are URL encoded.

I had been having problems with a server caching pages (storing old versions and not updating them, then sending out the old versions instead of the new ones) even though the the browsers were asking to be given a fresh copy of the page so they could see updates. For this reason, every PHP page on the site includes a script that sends headers telling browsers they can cache the pages, but servers may not.

Includes

Most of the pages on the site want the same basic HTML structure; the doctype, stylesheet link, navigation, footer, etc. Most of them need the same PHP functionality; SQL functions, database checks, variable initialisations, sanitisation routines, authentication, key checking, etc. To make this easy, parts of the structure and PHP that are reused, are contained within separate files, which are included by the pages that need them. Only the unique parts of a page are contained within the page itself. Parts of this could be done with SSI and templating systems, but templating systems are invariably bloated, clunky, and too limited for my needs. Since the rest of the site is written in PHP, it is done with regular PHP includes.

The index

The tree diagram of regions and parent regions down the left side of the page is generated using PHP. This is done by selecting every parent region from the table of parent regions and with each in turn then selecting every region whose parent region code matches the parent region code of the parent region. Got that?

The stree structure is created as a nested HTML list. CSS is then used to create a tree structure diagram. If an item is the last item in a list, then it receives the 'L' shaped background image. Otherwise it receives a left border, and a '-' shaped background image. The pictures and border are designed to work with each other to create a seamless effect that works even if the list item content wraps to multiple lines.

The news list on the index page is created by reading all non-hidden news items from the news table in descending order of date. The PHP prepares a friendly date format, and displays the news box only if it actually finds some non-hidden items.

The cave tables

The all regions table is produced by reading every value from the database in descending order of length or depth. The PHP adds a number to each cave according to its length/depth, e.g. 1, 2, 3, 4, etc. If this number is even, then the class of the row is changed to one with a slightly darker background. If the 'mine' column is blank, then the cell is given a value of 'All cave'. If there is a description, then a link is created to description, with the parameter of the cave short name.

If a region is selected then it requests a PHP file telling it the short name (which the PHP ensures is unique, even though the primary key is the reference number, allowing me to change the short name if needed without breaking any references). This can be taken directly from the database as a normal query using 'where regionshortname = SHORT_NAME'.

If a parent region is selected then it requests a PHP file telling it the short name, which receives the same treatment as above. This can be taken from a select query that combines the tables to attach the the region table onto the caves table, where the region has the parent region as its parent. This means that the region table line with a matching region code will be added on to each cave line. This is also done on both of the above queries in order to display the region name in the table.

On a search, there are three options; any words, all words, exact match. For the first two the input is split into component words. For the third, it is treated as one word. The caves table is then searched for cave names containing the words. If the first search scope (all words) is used, then the word 'AND' is used when querying the database. If the second is used, then the word 'OR' is used. If the third is used, it doesn't matter because there is only one "word".

For SQL database search, the syntax used is:

SELECT * FROM caves WHERE ( $string ) ORDER BY length DESC

$string is made up as follows (please note this is not correct syntax, it is for demonstration only):

for the first word
$string = name LIKE '%FIRST_WORD%'
for subsequent words
$string = $string AND/OR name LIKE '%NEXT_WORD%'

The table is then prepared as above.

The cave description

This is done with a single SQL query where the reference code is the short name passed to it.

It produces a title from the name field, statistics using the depth and length fields, then finally a description using the description field.

Redirected caves

Caves grow and are connected to each other to create larger cave systems. If two caves that are listed separately in the database are connected, then one of the cave entries is no longer needed. If the combined system has a new name, then one entry could be renamed. In both cases, simply removing the old name information from the database will mean that the old URLs referencing that cave's information no longer work, and anyone following a link to those caves would end up with a 404 error.

To avoid this problem, caves are able to redirect to each other. The cave description redirects to the new description if there is one, or the summary if not. The summary redirects to the new summary. The cave disappears from all listings and counts. To maintain the old name for searching, it must be included in the the name of the new system. Only one level of redirects is allowed at any time, so a cave may not redirect to another cave if that cave also redirects to another cave. This is enforced to avoid accidental infinite redirect loops.

The admin section

This bit has been designed with HTTP password authentication. For this reason, each page in the admin section includes a HTTP password authentication PHP script. Because this script is included by all the files, the user name or password can be changed once and it would work with all the admin pages.

The design of the administration section has had a major overhaul. In the original site, it was quite groundbreaking for its time (or so I like to think anyway). Originally created back in 2001, and based on the old frameset, it used to load the list of caves/regions/parent regions each on a single page, with a form at the top to create, edit, or delete. As I was the only user of that admin system, accessibility was not an issue, and the site made heavy use of JavaScript. Selecting any cave/region/parent region would trigger loading of a PHP page into a hidden frame in the frameset, passing it the code for the item. It would retrieve the details, bundle it into a JavaScript (prepared so that it would not break when it contained special characters, of course), and then the script would pass it back to the main frame when it loaded, filling out the form with the details. These days, the buzzword lovers like to name this sort of approach after that AJAX bathroom cleaner, and they progressed through iframes and DOMDocument to XMLHttpRequest to achieve the same result. Here's the thing though; given when it was written, mine was so reliable, it even worked in Netscape 2. Try saying that about GMail.

The current version of the site uses a more accessible approach, not because it is an issue for me, but just because I prefer regular HTTP to XMLHttpRequest. Each section has a page listing the contents of the database table. There is also a form for adding new entries, which uses Web Forms 2.0 to validate on the client (no JavaScript required), and PHP to back those checks up on the server. Since some of the form fields allow HTML content (the cave and region descriptions, and the news items), there is also a routine for automatically performing markup validation using the W3C service (only if the description has actually been changed). There is also a slightly more picky function that insists on optional closing tags, case sensitivity, and quoted attributes, simply to follow my style guide, not because it is needed. The validation step ensures that all descriptions validate, without needing to individually validate each page.

Each item in the list can be deleted (via an extra "are you really really really sure" page), or edited. Editing uses the same form as adding (though it appears on a separate page), but also has a field to allow the unique id to be changed to allow for any old redirects or table linking that may need to be changed. Changing this code also triggers a change in any related tables to ensure that the linking does not break. Similarly, the cave editing form has a field to select a cave to redirect to (if the cave is eligible for redirection).

For hardcore manipulation, there is also a direct SQL interaction facility, that takes SQL queries and returns the results, like an SQL client, without needing an SQL client.

The admin section includes a backup facility, allowing the database to be backed up, emptied and restored. This is done with a custom PHP script, and not the database engine's own dump routines, because it means that shell access is not required, it is not tied to any specific database, and it can be made more intelligent. It will not, for example, back up the database if it is empty. It will not attempt to restore into a database that is already full. And for extra geek points, the way it works is to have the PHP script read the database contents, then create a new PHP script that contains the database contents and can be run in order to restore it. PHP that writes PHP.

Since the site is designed to reference external Web sites when appropriate information about caves or regions is published, it is important to ensure that links to those sites continue to work. Unfortunately, caving sites are particularly bad at survival. Often they are hosted by the author's ISP, or under some unrelated domain name. When the author changes ISP or obtains a new domain name, the sites tend to be abandoned and are soon removed by the host. Pages are also deleted from sites without offering replacements. In a few cases, redirects are set up to point to newer versions of the page.

To combat this problem, the admin section of this site has a tool that checks for broken links. It loads all descriptions of caves and regions, uses PHP's DOM support to create virtual documents for them in memory, searches those documents for links, and then attempts to make HEAD requests for the linked pages. It reports all failures to connect, misbehaving servers, links to pages that no longer exist, and redirects. Typically, old copies of lost pages can be found through the excellent archive.org archiving service. So the tool also says if archive.org already appears to be referenced in the description.

All actions that modify the database are protected from XSRF by embedding a unique key in each form, and checking for it before making modifications. At no point is data passed in a URL written on the page without being safely sanitised first to ensure it cannot contain harmful code (converting to HTML entities). This protects against XSS.