The html_analyzer-1.00 README file
OVERVIEW:
This file contains information outlining the types of processing performed by the
html_analyzer software as well as copyright, disclaimer, and funding information.
Please read the file Installation in this directory for information on installing the
software. To walk through an example run of the analyzer, see the file Example.
MOTIVATION:
The intent of the html_analyzer is to assist in the maintenance of HyperText Markup
Language (HTML) databases. As the number of HTML databases increases, the
potential for hyperlinks that point to files or servers that no longer exist also
increases. This results in the need for an automated hyperlink validation program.
This is exactly what the html_analyzer does. The program also explores the
relationship between hyperlinks and the contents of those hyperlinks.
PROCESSING:
This directory contains the software to perform analysis of HTML databases.
Specifically, the following tasks are performed:
Extract all hyperlinks (a.k.a. anchors) from all *.html files within a given
directory hierarchy. The HREF values are allowed to be either quoted or not.
The following types of hyperlinks are not processed:
HREF=""
HREF=" "
HREF="#foo"
HREF=#foo
HREF="telnet"
HREF=telnet
HREF="rlogin" and
HREF=rlogin
Note: Within-document hyperlinks are pointless to verify; either the
hyperlink goes to the intended section or it does not. The telnet and rlogin
hyperlinks only require that the intended machine is alive. If the machine is
alive, the user must proceed to enter information. Since this user interaction
defeats the automated goal of the software, these access methods are not
processed.
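As an illustration only (the analyzer itself is written in C; this sketch and its names are hypothetical), the skip rules above amount to a small filter on the HREF value:

```python
def should_skip(href):
    """Return True for HREF values the analyzer does not process:
    empty or whitespace values, within-document fragments (#foo),
    and the interactive telnet/rlogin access methods."""
    value = href.strip().strip('"')   # HREF values may be quoted or not
    if value.strip() == "":           # HREF="" and HREF=" "
        return True
    if value.startswith("#"):         # within-document hyperlink
        return True
    if value.startswith(("telnet", "rlogin")):
        return True
    return False
```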
Create non-html versions of the files. These files are by default placed in
/var/tmp/html_analyzer. These files are used to examine the relationship
between hyperlinks and the contents of the hyperlinks. To change the location
of this repository, place the desired directory as the last command line
argument, e.g.
shell_prompt> html_analyzer . /users/pitkow/swap
This creates the directory html_analyzer under /users/pitkow/swap and places the
non-html files there.
Note: The path must already exist in order for successful execution. The
html_analyzer creates a directory within this directory; it does not create the
directory itself.
Validates the availability of the documents pointed to by the hyperlinks. This
test is called validation. This is accomplished via routines from Mosaic's
modified WWWLibrary_2.09a.
Looks for hyperlink contents that occur in the database but are not themselves
hyperlinks (See Example). This test is termed completeness.
Looks for a one-to-one relation between hyperlinks and the contents of the
hyperlinks (See Example). This test is called consistency.
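To picture the completeness and consistency tests, here is a hypothetical sketch (not the analyzer's actual C code) operating on (anchor text, target) pairs:

```python
def consistency_violations(links):
    """links: (anchor_text, target) pairs extracted from the database.
    The one-to-one relation is violated when one anchor text points to
    several targets, or one target is reached by several anchor texts."""
    by_text, by_target = {}, {}
    for text, target in links:
        by_text.setdefault(text, set()).add(target)
        by_target.setdefault(target, set()).add(text)
    one_to_many = {t: sorted(v) for t, v in by_text.items() if len(v) > 1}
    many_to_one = {t: sorted(v) for t, v in by_target.items() if len(v) > 1}
    return one_to_many, many_to_one

def completeness_violations(links, unlinked_phrases):
    """Phrases that occur as plain (unlinked) text in the database even
    though the same phrase is used elsewhere as a hyperlink's contents."""
    anchor_texts = {text for text, _ in links}
    return sorted(set(unlinked_phrases) & anchor_texts)
```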
RATIONALE:
We believe that there ought to exist a one-to-one correspondence between
hyperlinks and the hyperlink's contents, such that every occurrence of the hyperlink
points to only one document (or section of a document). This means that every time a
user sees a hyperlink, it will always point to the same section of a document. It also
means that each section of a document will have only one hyperlink pointing to it. We
hypothesize that such a correspondence is necessary for the user to form a clear
internal representation of the connections in the HTML database.
RUNNING:
To run the html_analyzer after it has been installed (Please read the file Installation
in this directory for information on installing the software), type:
html_analyzer [-val] [-com] [-con] directory [path of repository]
The -val, -com, and -con flags turn off the validation, completeness, and consistency
tests, respectively. Only a directory, not individual files, can be specified for
checking; all *.html files within the directory hierarchy will be processed. The path
of the temporary repository (default is /var/tmp) can be used if /var/tmp is full or not
desirable. A directory (/html_analyzer) is created in this directory to store the
temporary files generated by execution. The program does not create the
temporary repository.
COPYRIGHT:
The libwww-2.09a directory is the modified WWW library that accompanies
xmosaic-pre4. The libhtmlw directory is also from the prerelease. Mosaic was
developed by Marc Andreessen at the National Center for Supercomputing
Applications. This code is available from ftp.ncsa.uiuc.edu in the /Web directory.
The original WWWLibrary_2.09a library was developed by Tim Berners-Lee at the
European Laboratory for Particle Physics (CERN). This code is available from
ftp.info.ch in the /pub/www/src directory. Please see the file Copyrights in this
directory for more information on the copyrights that exist to these portions of code.
The Regents of the University of Colorado claim copyright on the other portions of
the distribution.
This distribution of the software may be freely distributed, used, and modified but
may not be sold as a whole nor in parts without permission of the copyright owners
of the parts.
DISCLAIMER:
This software is provided as is. The Laboratory for Atmospheric and Space Physics
(LASP) and the author are not responsible for support of this distribution.
FUNDING:
Development of this software was funded by the NASA Earth Observing System
Project under NASA contract NAS5-32392.
CHANGES:
Version 0.10 from 0.02:
0) made the analyzer dependent on Mosaic's libwww-2.09a and libhtmlw; this means that all
valid Mosaic files are now valid html_analyzer files.
1) removed unnecessary temporary files created by extract_links();
extract_links() now loads the skiplists directly.
2) enabled validation of other access methods, e.g. gopher, wais, etc.
Version 0.02 from 0.01:
0) converted CHECK_HTML_DB and GET_ANCHORS to C code.
1) added verification of relatively addressed hyperlinks.
2) added one-to-many check of the hyperlink's contents to documents pointed
to (previously: many-to-one check of hyperlinks to the hyperlink's contents)
3) cleaned up
ENHANCEMENTS:
Here's a list of things that could be done to improve the html_analyzer:
0) create a program to automatically prune hyperlinks that no longer point to
valid files. This entails some tricky questions as to how automated this
process needs to be. In other words, it might be nice for the user to have the
option of specifying the correct location of the file and have the software
make the changes to the HREFs as needed AS WELL as provide the user with
the option of having the software remove all anchors pointing to the
no-longer existent file. Let me know if you're interested in this option; it
seems like the next logical addition to the software.
1) add a linked list to the data structure of the skiplist that points to a list of other
files that have the same hyperlink and hyperlink content. This will enable
more sophisticated analysis, e.g. enable option 0) above by producing a list of
files that point to a document for pruning purposes, etc.
2) add statistical analysis of the HTML database, e.g. number of hyperlinks per
document, number of links to a document, list of files that point to a
document, etc.
3) perform an empirical study to confirm the hypothesized importance of a
one-to-one correspondence between hyperlinks and their contents. [I might do
this this fall if time allows].
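The statistics proposed in item 2) above could be sketched as follows (a hypothetical illustration in Python, not part of the current distribution), given (source file, target) pairs for every hyperlink in the database:

```python
def link_statistics(links):
    """links: (source_file, target) pairs.
    Returns the number of hyperlinks per document and, for each target,
    the list of files pointing to it (the list lengths give the number
    of links to a document)."""
    hyperlinks_per_doc, pointed_to_by = {}, {}
    for source, target in links:
        hyperlinks_per_doc[source] = hyperlinks_per_doc.get(source, 0) + 1
        pointed_to_by.setdefault(target, []).append(source)
    return hyperlinks_per_doc, pointed_to_by
```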
COMMENTS:
The purpose of this distribution is to further the development of HTML database
creation and maintenance utilities. Comments, questions, and REVISIONS are indeed
welcome.
To be added to the html_analyzer mailing list, mail
pitkow@cc.gatech.edu with the subject: html_analyzer add
James E. Pitkow
Graphics, Visualization and Usability Laboratory
Georgia Institute of Technology
pitkow@cc.gatech.edu