When creating and deploying a web-site for a client, there are a number of sanity checks you should always run through before delivering the site. Spell checking the site is a good start (SprySpell), as is testing it in a number of web browsers. Checking for 404 errors and ensuring the structure of your site is what you expected are also essential, and for that I introduce 'Integrity' in this article.

Introduction

When you've spent days and weeks creating the perfect site for your client, and you've been desperately trying to impress, getting an e-mail from the client after you have proudly delivered it, saying that they have discovered a broken link, is totally demoralising. This looks terribly unprofessional, and it only gets worse if the client misses it as well and the site goes live with the error. A broken link is a very simple mistake to make, and can easily get lost among all the complex front- and back-end code that has to be created for a project.

Rather than manually clicking every single link on the site, you can readily automate this process. Using this method, images, style-sheets and JavaScript files can also be checked to ensure they load correctly. This builds up a useful overview of how your web-site's pages link together and where problems lie. Creating a visual representation of this information not only allows quick resolution of the issues found, but also gives a way to check the overall information architecture of a site. This is vitally important, as the structure of the site's content must make sense to your end users. To perform these tasks I have created 'Integrity'.

Usage

Using Integrity is as trivial as giving the program a URL, which will be used as the base location to start the integrity check. If you wish to jump straight in and give it a whirl, please feel free to do so. Integrity is available from this site:

Integrity starts by creating an internal map of the target web-site, pulling the content of each page and looking for new links in it. As each page is scanned, this is reported on screen. Following this scan, a map of the site is generated using SVG. To get information about a specific page, simply click on either the page title label or the dot in the map which represents the page. The map will update to highlight the inter-page links which affect this page, and detailed information will be shown below the map. This detailed information is broken down into incoming links (pages which link to this page) and outgoing links (links from this page). Outgoing links also have a server response code, as Integrity checks all outbound links. Any errors are indicated in the map and in the detailed information by a red highlight.

There are a few things to note about this online version of Integrity. Firstly, because SVG is used to draw the map, only Opera and Firefox (or other Gecko-based browsers) will work as expected. For further details about browser issues, please refer to the Methodology section of this article.

In addition to requiring an SVG-capable browser, the online version of Integrity is limited to checking up to 25 pages per site. This is because Integrity can really eat bandwidth when used a lot, as each page is pulled to the SpryMedia server for parsing. Furthermore, links to external web-pages (pages not in your web-site's domain) are not checked for validity, because these pages can considerably increase the latency of a check.

If you wish to use a version of Integrity which does not have these limitations, please get in touch with me.

Methodology

The idea behind Integrity is very simple, and indeed it has been done several times before, although never quite like this. Integrity utilises two basic classes, 'webLink' and 'webPage'. webLink stores information about individual links in an organised manner (an array of webLink objects is used); for example, the HTTP response code from a GET request (found using cURL) is stored here. The webPage class stores information about a specific web page, making use of the webLink class by storing the index of each link that is found on the page. Integrity itself is simply a spider program which fills out this information in the classes. As each page is scanned for links, a snippet of JavaScript is written to the browser which tells it to update the display, showing the user which page is now being scanned.
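To make the two classes a little more concrete, here is a rough sketch of the idea. It is not the Integrity source: the real spider is PHP and uses cURL, whereas this is JavaScript, and the names WebLink, WebPage, extractLinks, spider and the injected fetchPage function (assumed to return an object with a status and a body for a URL) are all made up for illustration. It only follows anchor links, leaving images, style-sheets and scripts out, and it finds links with a regular expression where a proper HTML parser would be more robust.

```javascript
// Illustrative stand-ins for the 'webLink' and 'webPage' classes.
class WebLink {
  constructor(url) {
    this.url = url;
    this.status = null; // HTTP response code; stays null for unchecked links
  }
}

class WebPage {
  constructor(url) {
    this.url = url;
    this.links = []; // indexes into the shared array of WebLink objects
  }
}

// Find <a href="..."> targets in raw HTML and resolve them against the page URL.
function extractLinks(html, pageUrl) {
  const found = [];
  const pattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const resolved = new URL(match[1], pageUrl);
      resolved.hash = ""; // #fragments point at the same page
      found.push(resolved.href);
    } catch (e) {
      // Skip anything that cannot be resolved to a URL
    }
  }
  return found;
}

// fetchPage(url) is a hypothetical helper returning { status, body }.
// maxPages defaults to 25 to mirror the limit of the online version.
function spider(startUrl, fetchPage, maxPages = 25) {
  const start = new URL(startUrl);
  const links = [];            // every WebLink found so far
  const linkIndex = new Map(); // url -> position in links
  const pages = [];            // every WebPage scanned so far
  const responses = new Map(); // cache so each URL is requested only once
  const get = (url) => {
    if (!responses.has(url)) responses.set(url, fetchPage(url));
    return responses.get(url);
  };
  const queue = [start.href];
  const queued = new Set(queue);

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift();
    const page = new WebPage(url);
    pages.push(page);
    for (const target of extractLinks(get(url).body, url)) {
      if (!linkIndex.has(target)) {
        linkIndex.set(target, links.length);
        links.push(new WebLink(target));
      }
      const index = linkIndex.get(target);
      if (!page.links.includes(index)) page.links.push(index);
      if (new URL(target).host !== start.host) continue; // external: left unchecked
      links[index].status = get(target).status;
      if (links[index].status === 200 && !queued.has(target)) {
        queued.add(target);
        queue.push(target);
      }
    }
  }
  return { pages, links };
}
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP library, and means it can be tried against a small in-memory "site" before being pointed at a real one.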

When a site has been fully spidered, the PHP script creates a site hierarchy using the information that has been found, by classifying pages into levels. If a page is linked from the first page entered then it is first level, if a page is linked from a first level page then it is second level, etc. The information contained in the site hierarchy array and the webPage and webLink arrays in PHP is then sent to a JavaScript front end which creates the site map using DOM SVG techniques. This is done using PHP 5.2's json_encode function, which simply serialises the entire object array as JSON that JavaScript can read directly as an object. This is phenomenally useful for exchanging information between PHP and JavaScript. The JavaScript then parses through this information to calculate the co-ordinates where each page should be painted. With this information the paths between pages can also be drawn to the display.
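The level classification and the co-ordinate calculation could be sketched along the following lines. Again this is only an illustration under my own assumptions, not the Integrity code: in Integrity the hierarchy is built in PHP, the function names classifyLevels and layout are invented here, the data shape (pages holding indexes into a links array) simply follows the description above, and spreading each level evenly across a row is one guess at a layout rule.

```javascript
// pages: [{ url, links: [index into links] }], links: [{ url }]
// Returns hierarchy[n] = URLs of every page on level n (the start page is level 0).
function classifyLevels(pages, links, startUrl) {
  const byUrl = new Map(pages.map((page) => [page.url, page]));
  const level = new Map([[startUrl, 0]]);
  const queue = [startUrl];
  while (queue.length > 0) {
    const url = queue.shift();
    for (const index of byUrl.get(url).links) {
      const target = links[index].url;
      // A page takes the level at which it is first reached
      if (byUrl.has(target) && !level.has(target)) {
        level.set(target, level.get(url) + 1);
        queue.push(target);
      }
    }
  }
  const hierarchy = [];
  for (const [url, depth] of level) {
    (hierarchy[depth] = hierarchy[depth] || []).push(url);
  }
  return hierarchy;
}

// One row per level, with the pages of a level spaced evenly across the width.
function layout(hierarchy, width, rowHeight) {
  const coords = {};
  hierarchy.forEach((row, depth) => {
    row.forEach((url, i) => {
      coords[url] = {
        x: Math.round(((i + 1) * width) / (row.length + 1)),
        y: (depth + 1) * rowHeight,
      };
    });
  });
  return coords;
}

// The string stands in for what json_encode would hand to the browser.
const payload = '{"pages":[{"url":"/","links":[0,1]},{"url":"/a","links":[2]},' +
  '{"url":"/b","links":[]},{"url":"/c","links":[3]}],' +
  '"links":[{"url":"/a"},{"url":"/b"},{"url":"/c"},{"url":"/"}]}';
const data = JSON.parse(payload);
const hierarchy = classifyLevels(data.pages, data.links, "/");
console.log(hierarchy);
console.log(layout(hierarchy, 300, 50));
```

Working breadth-first means a page that is reachable by several routes is filed under the shallowest one, which matches the "linked from the first page is first level" rule.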

Once the map co-ordinates have all been calculated, createElement and appendChild are used to add SVG paths and circles to the SVG container in the document. I have used HTML DIVs to display the title of each page rather than SVG text, as creating a coloured background with a border in SVG is a bit of a nightmare. Using simple 'onclick' event handlers, the display functions can be used to show further information about each page.
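As a rough idea of what that drawing step might look like, here is a small sketch. The function name drawMap, its parameters and the colours are my own inventions for illustration, and it leaves out the DIV labels. Note that it uses createElementNS with the SVG namespace, which is what the DOM needs when creating SVG elements from script; the document and the SVG container are passed in so that the function does not depend on a particular page.

```javascript
const SVG_NS = "http://www.w3.org/2000/svg";

// doc:      the document (or anything offering createElementNS)
// svg:      the SVG container element to draw into
// coords:   { url: { x, y } } as calculated for each page
// edges:    [[fromUrl, toUrl], ...] links between pages
// errors:   Set of URLs which returned an error
// onSelect: called with the URL of the page whose dot was clicked
function drawMap(doc, svg, coords, edges, errors, onSelect) {
  // Draw the links first so the page dots sit on top of them
  for (const [from, to] of edges) {
    const a = coords[from];
    const b = coords[to];
    const path = doc.createElementNS(SVG_NS, "path");
    path.setAttribute("d", `M ${a.x} ${a.y} L ${b.x} ${b.y}`);
    path.setAttribute("stroke", "#999");
    path.setAttribute("fill", "none");
    svg.appendChild(path);
  }
  for (const url of Object.keys(coords)) {
    const dot = doc.createElementNS(SVG_NS, "circle");
    dot.setAttribute("cx", coords[url].x);
    dot.setAttribute("cy", coords[url].y);
    dot.setAttribute("r", 5);
    // Errors are picked out in red, everything else in a neutral colour
    dot.setAttribute("fill", errors.has(url) ? "red" : "#336");
    dot.onclick = () => onSelect(url);
    svg.appendChild(dot);
  }
}
```

In a browser this would be called with the real document and the SVG element from the page, with onSelect being whatever function shows the detailed information for a page.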

So why PHP, JavaScript and SVG? Well, I used PHP simply because I can rapidly prototype in it. Originally I didn't actually intend to have any JavaScript interaction; it was just going to be a page with a huge amount of information rendered by PHP. However, as I was developing this I realised that the information found by Integrity could be used to map a site. So I used PHP to generate some basic JavaScript (all the heavy lifting was done in PHP) which would create a site map using Canvas. This worked quite nicely, but I wanted to have the map clickable so that information about each page would only be shown when required. Using Canvas this would mean redrawing the entire map for every click. So I plumped for SVG instead, and using simple setAttribute calls got the desired effect. This led to a distinct PHP layer and a JavaScript layer. Integrity as a core package doesn't need the mapper, and likewise the mapper can map information from any source which outputs the correct format. Possibly useful in future...

One thing I have done in Integrity, which is for my own personal satisfaction rather than for a technical reason, is to show the URL in the page label rather than the title tag from the web page. I think this is important from an IA point of view, as the URL should (must?) mean something to your user. It is on the page the whole time, so they see it and register it (since it is in the most prominent position in the browser), but so many sites misuse it. For example, have a look at this URL from Amazon:

http://www.amazon.com/Weaving-Web-Tim-Berners-Lee/dp/0752820907/ref=pd_bbs_2/104-5803230-2997551?ie=UTF8&s=books&qid=1182689656&sr=8-2

What the hell is that? How about using:

http://www.amazon.com/Books/Tim-Berners-Lee/Weaving-The-Web

This would actually mean something to the user. All the other information in the URL is irrelevant (it should be in a cookie - tough if you have cookies off, you get a basic version without all the whizzy features). Let's take this a step further. Let's say we have the URL http://www.sprymedia.co.uk/index.php. Why would the end user care if it is php, asp, jsp or anything else? They don't, it means nothing to them - so get rid of it (as I have done on this site). /index or /contact etc. actually means something. Having Integrity show the URL rather than the title lets you see if the URL mapping on your site makes any sense.

A few notes on browser compatibility: since Integrity uses SVG, an SVG-capable browser is required to view the generated maps. This means hard luck for those using IE. Indeed, the original Canvas version wouldn't have helped IE either, since it doesn't support Canvas. However, as I have said, Integrity is not a commercial product; it is a technical demo and used for internal testing, so this was not a consideration. Furthermore, I found that WebKit unfortunately does not have the level of SVG DOM scripting support required for Integrity, although it does appear to be getting there. Firefox and Opera, however, do work well with Integrity. The Gecko (Firefox) rendering engine is unbelievably slow, both for SVG and HTML in general. This is really disappointing, but hopefully it will improve with future revisions. Opera, on the other hand, is blazingly fast, showing exactly what I would expect from Integrity, and drawing the map (and modifying it) very quickly. Therefore I strongly suggest that, for the moment, if you fancy using Integrity, use Opera.

Examples

Finally, in this section I present a few examples of Integrity in action. The first of these is this site, which is relatively complex as there are a lot of pages (most of them dynamically generated from SpryPanel content...). Shown below is the generated site map.

As can be seen, the interaction between the pages is fairly logical; however, there is an error on all of my portfolio pages. It turned out that this was a call to a JavaScript library, used to allow IE6 to display alpha-channel PNGs, which I deleted a while ago. One other thing which comes out of this map is that I should look at my URL scheme for the portfolio items and come up with a better way of indicating what page the user is looking at.

The next example is of SpryPanel.com which is structurally a very simple site, with no sub-pages and a single top level navigation on all pages. The map for SpryPanel.com is shown below:

This map shows that there is an error on the purchase page. Below is the detailed information Integrity gives for this page:

As we can see, there is an error in the link to the contact page. The false contact page actually shows up in Integrity because it is a soft 404 page. This means that the server responds with a valid page and a success status, rather than simply returning a 404 error.
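Soft 404s are awkward precisely because the response code looks healthy. One common way a checker could try to flag them, and I stress this is a general heuristic rather than something described as part of Integrity, is to request a URL which almost certainly does not exist and compare the result with the page under suspicion. In the sketch below the function name isSoft404, the probe path and the injected fetchPage function (assumed to return a status and a body) are all hypothetical.

```javascript
// fetchPage(url) is a hypothetical helper returning { status, body }.
function isSoft404(url, fetchPage) {
  const page = fetchPage(url);
  // A real error code is an ordinary broken link, not a soft 404
  if (page.status !== 200) return false;

  // Ask the same server for a page which should not exist
  const probeUrl = new URL("/integrity-probe-no-such-page-93517", url).href;
  const probe = fetchPage(probeUrl);

  // If the server happily serves the bogus URL, and serves the very same
  // content for the page being checked, that page is probably a soft 404
  return probe.status === 200 && probe.body === page.body;
}
```

Comparing the bodies for exact equality is deliberately crude: error pages which echo the requested URL back would slip through, so a real check would need a looser comparison.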

So that is how Integrity works. Please give it a go and let me know what you think.
