The html_analyzer-1.00 README file
OVERVIEW:
This file contains information outlining the types of processing performed by the
html_analyzer software as well as copyright, disclaimer, and funding information.
Please read the file Installation in this directory for information on installing the
software. To walk through an example run of the analyzer, see the file Example.
MOTIVATION:
The intent of the html_analyzer is to assist in the maintenance of HyperText Markup
Language (HTML) databases. As the number of HTML databases increases, the
potential for hyperlinks that point to files or servers that no longer exist also
increases. This results in the need for an automated hyperlink validation program.
This is exactly what the html_analyzer does. The program also explores the
relationship between hyperlinks and the contents of those hyperlinks.
PROCESSING:
This directory contains the software to perform analysis of HTML databases.
Specifically, the following tasks are performed:
Extract all hyperlinks (a.k.a. anchors) from all *.html files within a given
directory hierarchy. The HREF values are allowed to be either quoted or not.
The following types of hyperlinks are not processed:
HREF=""
HREF=" "
HREF="#foo"
HREF=#foo
HREF="telnet"
HREF=telnet
HREF="rlogin" and
HREF=rlogin
Note: Within-document hyperlinks are pointless to verify; either the
hyperlink goes to the intended section or it does not. The telnet and rlogin
hyperlinks only require that the intended machine is alive. If the machine is
alive, the user must proceed to enter information. Since this user interaction
defeats the automated goal of the software, these access methods are not
processed.
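As an illustration only (the analyzer itself is written in C; this sketch and its names are hypothetical), the skip rules above amount to a small filter on the HREF value:

```python
def should_skip(href):
    """Return True for HREF values the analyzer does not process:
    empty or whitespace values, within-document fragments (#foo),
    and the interactive telnet/rlogin access methods."""
    value = href.strip().strip('"')   # HREF values may be quoted or not
    if value.strip() == "":           # HREF="" and HREF=" "
        return True
    if value.startswith("#"):         # within-document hyperlink
        return True
    if value.startswith(("telnet", "rlogin")):
        return True
    return False
```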
Create non-html versions of the files. These files are by default placed in
/var/tmp/html_analyzer. These files are used to examine the relationship
between hyperlinks and the contents of the hyperlinks. To change the location
of this repository, place the desired directory as the last command line
argument, e.g.
shell_prompt> html_analyzer . /users/pitkow/swap
This creates the directory html_analyzer under /users/pitkow/swap and places the
non-html files there.
Note: The path must already exist in order for successful execution. The
html_analyzer creates a directory within this directory; it does not create the
directory itself.
Validates the availability of the documents pointed to by the hyperlinks. This
test is called validation. This is accomplished via routines from Mosaic's
modified WWWLibrary_2.09a.
Looks for hyperlink contents that occur in the database but are not themselves
hyperlinks (See Example). This test is termed completeness.
Looks for a one-to-one relation between hyperlinks and the contents of the
hyperlinks (See Example). This test is called consistency.
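To picture the completeness and consistency tests, here is a hypothetical sketch (not the analyzer's actual C code) operating on (anchor text, target) pairs:

```python
def consistency_violations(links):
    """links: (anchor_text, target) pairs extracted from the database.
    The one-to-one relation is violated when one anchor text points to
    several targets, or one target is reached by several anchor texts."""
    by_text, by_target = {}, {}
    for text, target in links:
        by_text.setdefault(text, set()).add(target)
        by_target.setdefault(target, set()).add(text)
    one_to_many = {t: sorted(v) for t, v in by_text.items() if len(v) > 1}
    many_to_one = {t: sorted(v) for t, v in by_target.items() if len(v) > 1}
    return one_to_many, many_to_one

def completeness_violations(links, unlinked_phrases):
    """Phrases that occur as plain (unlinked) text in the database even
    though the same phrase is used elsewhere as a hyperlink's contents."""
    anchor_texts = {text for text, _ in links}
    return sorted(set(unlinked_phrases) & anchor_texts)
```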
RATIONALE:
We believe that there ought to exist a one-to-one correspondence between
hyperlinks and the hyperlink's contents, such that every occurrence of the hyperlink
points to only one document (or section of a document). This means that every time a
user sees a hyperlink, it will always point to the same section of a document. It also
means that each section of a document will have only one hyperlink pointing to it. We
hypothesize that such a correspondence is necessary for the user to form a clear
internal representation of the connections in the HTML database.
RUNNING:
To run the html_analyzer after it has been installed (Please read the file Installation
in this directory for information on installing the software), type:
html_analyzer [-val] [-com] [-con] directory [path of repository]
The -val, -com, and -con flags turn off the validation, completeness, and consistency
tests, respectively. Only a directory, not individual files, can be specified for
checking; all *.html files within the directory hierarchy will be processed. The path
of the temporary repository (default is /var/tmp) can be used if /var/tmp is full or not
desirable. A directory (/html_analyzer) is created in this directory to store the
temporary files generated by execution. The program does not create the
temporary repository.
COPYRIGHT:
The libwww-2.09a directory is the modified WWW library that accompanies
xmosaic-pre4. The libhtmlw directory is also from the prerelease. Mosaic was
developed by Marc Andreessen at the National Center for Supercomputing
Applications. This code is available from ftp.ncsa.uiuc.edu in the /Web directory.
The original WWWLibrary_2.09a library was developed by Tim Berners-Lee at the
European Laboratory for Particle Physics (CERN). This code is available from
ftp.info.ch in the /pub/www/src directory. Please see the file Copyrights in this
directory for more information on the copyrights that exist to these portions of code.
The Regents of the University of Colorado claim copyright on the other portions of
the distribution.
This distribution of the software may be freely distributed, used, and modified but
may not be sold as a whole nor in parts without permission of the copyright owners
of the parts.
DISCLAIMER:
This software is provided as is. The Laboratory for Atmospheric and Space Physics
(LASP) and the author are not responsible for support of this distribution.
FUNDING:
Development of this software was funded by the NASA Earth Observing System
Project under NASA contract NAS5-32392.
CHANGES:
Version 0.10 from 0.02:
0) made the analyzer dependent on Mosaic's libwww-2.09a and libhtmlw; this means that all
valid Mosaic files are now valid html_analyzer files.
1) removed unnecessary temporary files created by extract_links();
extract_links() now loads the skiplists directly.
2) enabled validation of other access methods, e.g. gopher, wais, etc.
Version 0.02 from 0.01:
0) converted CHECK_HTML_DB and GET_ANCHORS to C code.
1) added verification of relatively addressed hyperlinks.
2) added one-to-many check of the hyperlink's contents to documents pointed
to (previously: many-to-one check of hyperlinks to the hyperlink's contents)
3) cleaned up
ENHANCEMENTS:
Here's a list of things that could be done to improve the html_analyzer:
0) create a program to automatically prune hyperlinks that no longer point to
valid files. This entails some tricky questions as to how automated this
process needs to be. In other words, it might be nice for the user to have the
option of specifying the correct location of the file and have the software
make the changes to the HREFs as needed AS WELL as provide the user with
the option of having the software remove all anchors pointing to the
no-longer existent file. Let me know if you're interested in this option; it
seems like the next logical addition to the software.
1) add a linked list to the data structure of the skiplist that points to a list of other
files that have the same hyperlink and hyperlink content. This will enable
more sophisticated analysis, e.g. enable option 0) above by producing a list of
files that point to a document for pruning purposes, etc.
2) add statistical analysis of the HTML database, e.g. number of hyperlinks per
document, number of links to a document, list of files that point to a
document, etc.
3) perform an empirical study to confirm the hypothesized importance of a
one-to-one correspondence between hyperlinks and their contents. [I might do
this this fall if time allows].
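The statistics proposed in item 2) above could be sketched as follows (a hypothetical illustration in Python, not part of the current distribution), given (source file, target) pairs for every hyperlink in the database:

```python
def link_statistics(links):
    """links: (source_file, target) pairs.
    Returns the number of hyperlinks per document and, for each target,
    the list of files pointing to it (the list lengths give the number
    of links to a document)."""
    hyperlinks_per_doc, pointed_to_by = {}, {}
    for source, target in links:
        hyperlinks_per_doc[source] = hyperlinks_per_doc.get(source, 0) + 1
        pointed_to_by.setdefault(target, []).append(source)
    return hyperlinks_per_doc, pointed_to_by
```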
COMMENTS:
The purpose of this distribution is to further the development of HTML database
creation and maintenance utilities. Comments, questions, and REVISIONS are indeed
welcome.
To be added to the html_analyzer mailing list, mail
pitkow@cc.gatech.edu with the subject: html_analyzer add
James E. Pitkow
Graphics, Visualization and Usability Laboratory
Georgia Institute of Technology
pitkow@cc.gatech.edu