Note: This package is still useful and will be maintained.
However, new functions will go into the Perl module HTML::TagReader,
which is available from http://cpan.org/authors/id/G/GU/GUS/
------------------------------------------------------------------

The webgrep tool box consists of 9 utilities for the web-master:


wchck
---------
This is a cgi-bin program to check html pages. It is written in perl.
Take a look at the top of the wchck file for detailed information
about this program and how to install it.

lshtmlref
---------
This is a nice utility to build tar archives from web pages and
include all the necessary GIFs, text files etc.
This can of course only work for relative links. A good web editor
uses only relative links anyway, as this makes the pages
re-locatable :-).

blnkcheck
--------
blnkcheck checks web-pages for broken links. It examines only the
relative links and is therefore very fast and not dependent on a
web-server. On a 75 MHz Pentium with a local disk it can, for example,
check 5000 links in 3 seconds. It is ideal for verifying that your
complete web-server is consistent.

The idea is that after editing your page in an html or plain text
editor you just type "blnkcheck the_page_you_changed.html". This
tells you if all links are correct.

blnkcheck checks relative links (things like href="../info.html" or
src="point.jpg" etc.). You can however list the absolute links with
-a and run httpcheck as a post processor on them.
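
The relative-link check described above can be sketched in a few lines
of Python. This is only an illustration of the idea, not how blnkcheck
is implemented (blnkcheck is a C program). The function name
find_broken_relative_links and the regular expression are assumptions
made for this example. It skips absolute links, root-relative links
and pure "#anchor" links, and it does not check named anchors.

```python
import os
import re
from urllib.parse import urlsplit, unquote

# Matches href=, src= and background= attribute values (quoted or bare).
ATTR_RE = re.compile(
    r'\b(?:href|src|background)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.IGNORECASE,
)


def find_broken_relative_links(html_path):
    """Return the relative links in html_path whose target file is missing."""
    with open(html_path, encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    base_dir = os.path.dirname(os.path.abspath(html_path))
    broken = []
    for match in ATTR_RE.finditer(text):
        link = next(g for g in match.groups() if g is not None)
        parts = urlsplit(link)
        # Skip absolute links (http://..., mailto:...), root-relative
        # links and links that consist only of "#anchor".
        if parts.scheme or parts.netloc or link.startswith("/") or not parts.path:
            continue
        # Resolve the link against the directory of the page itself.
        target = os.path.join(base_dir, unquote(parts.path))
        if not os.path.exists(target):
            broken.append(link)
    return broken
```

Because only the local file system is consulted, no web-server is
needed, which is the reason this kind of check is fast.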

NOTE:
 blnkcheck is designed as a fast checker for web masters that have shell
 and file system level access to their web-pages. It can also be used
 if you are able to keep a mirror of the web site on your local disk.
 Other programs such as curl (http://curl.haxx.nu/)
 can be used if you want to check your web-pages only remotely via a web
 server. curl comes with a dead link checker called checklinks.pl.


httpcheck
---------
httpcheck is a post processor for "blnkcheck -a" and can be used to check
absolute links of protocol type http. It does the checks by sending
HEAD requests to the web servers for the pages in question. This
is a lot faster than fetching the whole web-page but still quite slow.
httpcheck is written in perl. It requires perl 5 in /usr/bin/perl.
As of version 1.6 httpcheck can also handle proxies.
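
As a rough illustration of the HEAD-request technique, here is a
minimal Python sketch. It is not the httpcheck source (which is perl).
The names head_status and check_urls, the "OK" label and the rule that
any status below 400 counts as reachable are assumptions made for this
example. Proxy support is left out.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10.0):
    """Send a HEAD request for url and return the HTTP status code."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "http":
        raise ValueError("only http:// URLs are handled in this sketch")
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80,
                                      timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)  # http.client adds the Host: header
        return conn.getresponse().status
    finally:
        conn.close()


def check_urls(urls, fetch=head_status):
    """Map each URL to "OK" or "ERROR", asking the server once per URL."""
    cache = {}
    for url in urls:
        if url not in cache:
            try:
                # Assumption: every status below 400 counts as reachable.
                cache[url] = "OK" if fetch(url) < 400 else "ERROR"
            except (OSError, ValueError, http.client.HTTPException):
                cache[url] = "ERROR"
    return cache
```

The timeout guards against servers that connect but never respond, and
the dictionary avoids repeated requests for the same URL.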

taggrep
-------
taggrep is a program to grep for html tags, for example to search for
meta tags or to list the titles of a number of web pages.
To quickly see which file is which web-page you may type
taggrep -c title title `find . -name '*.htm*' -print`
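
For readers who want to see what such a title listing involves, the
following Python sketch extracts the title and its line number from one
page. It only approximates the idea and does not reproduce taggrep's
options or output format. TitleParser and title_of are names invented
for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element and its line number."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.line = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.line is None:
            self.in_title = True
            self.line = self.getpos()[0]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)


def title_of(html_text):
    """Return (line, title) for the first title tag, or None if absent."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    if parser.line is None:
        return None
    # Collapse line breaks and runs of blanks inside the title.
    return parser.line, " ".join("".join(parser.chunks).split())
```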

The command
> taggrep -c li,ol li doc 
applied to a web page that looks like this:

<ol>
<li>item 1
<li>item 2
<li type=disc> three
<li type=square>four
</ol>

produces:
doc:12: <li>item 1 <li>
doc:13: <li>item 2 <li type=disc>
doc:14: <li type=disc> three <li type=square>
doc:15: <li type=square>four </ol>

webfgrep
--------
webfgrep is a web search engine that works well for websites with
up to 1 MB of html pages. It uses memory-mapped file access and is therefore
quite fast. It is best used from a perl cgi-bin wrapper that produces and
evaluates the form. Make sure you write a secure cgi-bin! One solution
is to escape all meta characters with quotemeta() before passing the
search string to the shell and webfgrep, but the simpler and more secure
one may be to use the -s option of webfgrep.

There are 2 sample cgi-bin programs called websearch and websearch-s
available in this distribution. They are made for English web-pages
and do not support any special characters that you might have
in other languages.

webfgrep as such is basically a very fast fgrep program that excludes
tags from the search text.
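
The "fgrep that excludes tags" idea can be illustrated with a small
Python sketch. This is not the webfgrep implementation, which is
written in C and uses memory-mapped files. The function text_matches
and its ignore_case parameter are hypothetical names for this example.

```python
import re

# A simple pattern for tags; it does not handle ">" inside attribute values.
TAG_RE = re.compile(r"<[^>]*>")


def text_matches(html_text, needle, ignore_case=False):
    """Return True if needle occurs in the page text with tags removed."""
    # Replace every tag by a blank, then collapse runs of whitespace so
    # that words separated only by tags can still be found as a phrase.
    text = " ".join(TAG_RE.sub(" ", html_text).split())
    if ignore_case:
        text, needle = text.lower(), needle.lower()
    return needle in text
```

With this approach a search for "search.html" does not match a page
that merely links to that file, because the link target is inside a tag.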

srcgrep
-------
srcgrep searches web-pages for <img ... src=...> or <body ... background=...>
and displays the data contained in the tag in a nice readable
format. 
This is useful if you need to re-work web-pages (e.g. to check
which images are included).

hrefgrep 
--------
hrefgrep is like srcgrep except that it searches for <a href=...>...</a>
or an area tag of the form <area ... href=...>.
It takes otherwise exactly the same options.
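
What srcgrep and hrefgrep look for can be sketched in Python as
follows. The two tables list exactly the tag/attribute pairs named in
the descriptions above. RefCollector and collect_refs are names
invented for this example, and the result format is an assumption, not
the output format of the real tools.

```python
from html.parser import HTMLParser

# Tag/attribute pairs named in the descriptions of srcgrep and hrefgrep.
SRC_ATTRS = {"img": "src", "body": "background"}
HREF_ATTRS = {"a": "href", "area": "href"}


class RefCollector(HTMLParser):
    """Record (line, tag, value) for every wanted tag/attribute pair."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attr = self.wanted.get(tag)  # None if the tag is not of interest
        for name, value in attrs:
            if name == attr and value is not None:
                self.refs.append((self.getpos()[0], tag, value))


def collect_refs(html_text, wanted):
    """Return the references found in html_text for the given table."""
    parser = RefCollector(wanted)
    parser.feed(html_text)
    parser.close()
    return parser.refs
```

Calling collect_refs(page, SRC_ATTRS) gives the srcgrep-style view and
collect_refs(page, HREF_ATTRS) the hrefgrep-style view of the same page.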

htmlpp 
------
htmlpp removes line breaks in html tags that contain one of
href=, name=, background=, src= and compensates for the removed newlines
later on by adding them after the next newline outside a tag.
This way all tags start at the same line number as in the original file.
This makes it possible to post-process tags with programs that
work best in a line oriented mode (sed, awk, perl....).
This program does not edit the file. All output goes to stdout.
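
The newline compensation can be sketched in Python like this. It is a
simplified illustration, not the htmlpp source. The function name
join_tag_lines is invented, and the sketch assumes that a removed line
break is replaced by a single blank and that comments need no special
treatment.

```python
import re

# Tags containing one of these attributes get their line breaks removed.
ATTR_RE = re.compile(r"\b(?:href|name|background|src)\s*=", re.IGNORECASE)


def join_tag_lines(html_text):
    """Join multi-line tags and give the removed newlines back later."""
    out = []
    tag = []         # characters of the tag currently being read, if any
    in_tag = False
    pending = 0      # newlines removed from tags and not yet given back
    for ch in html_text:
        if in_tag:
            tag.append(ch)
            if ch == ">":
                text = "".join(tag)
                if ATTR_RE.search(text):
                    pending += text.count("\n")
                    text = text.replace("\n", " ")
                out.append(text)
                tag, in_tag = [], False
        elif ch == "<":
            tag, in_tag = [ch], True
        else:
            out.append(ch)
            if ch == "\n" and pending:
                # Add the removed newlines after the next newline
                # outside a tag, so later lines keep their numbers.
                out.append("\n" * pending)
                pending = 0
    out.append("".join(tag))  # unterminated tag at end of input: unchanged
    return "".join(out)
```

For the input '<a\nhref="x.html">text\n<b>next</b>\n' the result is
'<a href="x.html">text\n\n<b>next</b>\n': the anchor tag now sits on one
line and the <b> tag is still on line 3.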

At the moment htmlpp is not used for anything. Scripts will follow. 

------------------------
See the INSTALL file for a description on how to install
this software.
------------------------
History:

1.0: first usable c-version.

1.1: 1999-02-27
     hrefgrep added, documentation improved.

1.2: 1999-03-01
     hrefgrep, srcgrep: Now you can list each file name only once 
     with the option -t

1.3: 1999-03-19
     area tag added to hrefgrep

1.4: 1999-04-08
     -added webfgrep, a poor man's web search engine
     -webfgrep -i option.

1.5: 1999-04-30
     comments are now handled. blnkcheck, httpcheck and lshtmlref added

1.6: 1999-05-05
     timeout added to httpcheck. Some servers connect but do not respond.
     blnkcheck: it is now ok to have a space or \n in the path of a link
     httpcheck: can now handle proxies
     httpcheck: incompatible change with 1.5, option -b removed
     lshtmlref: incompatible change with 1.5, option -w removed

1.7: 1999-05-10
     Some corrections in documentation and help functions. taggrep added.
     blnkcheck,lshtmlref,srcgrep:check also the "background="

1.8: 1999-05-17
     Documentation updates.
     httpcheck: keep a cache of already requested pages in ram to speed
                up repeated checks for the same URL.
1.9: statistic format of blnkcheck changed

2.0: 1999-11-22 
     A complete re-write: - hrefgrep and srcgrep no longer have the
                            options -t and -a as they were obsolete anyway.
                          - blnkcheck has been enhanced significantly. For
                            references to named anchors ("page.html#anchor")
                            it now also checks that the anchor really
                            exists in that file.
                          - The hash tables have been improved and allow
                            for fast random access to ordered data tables.
2.1: 1999-11-24 
    -cgi-bin's are now correctly checked if they are included
     in the same directory tree as html pages and referenced
     by a relative link (e.g href=../qq.pl?xx=1).
    -code clean up to use more regexp matching. httpcheck now always
     prints ERROR if it could not verify a page.

2.2: 1999-12-19
    - option -a for hrefgrep
    - print error for anchors that are terminated
      with a new anchor: <a href=....>...<a href=...
    - print error for unterminated anchor tags
2.3: 2000-01-12
    - delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV','PATH'}; added to
      cgi-bin
    - lshtmlref -Wa file.html did append "index.html" if the links 
      did not exist.
    - lshtmlref: new options -A and -i, bug fix for option -L
    - blnkcheck: option -n and -w are now case insensitive
    - blnkcheck: option -O added. Bug fix for image tags nested inside anchor
2.4: 2000-01-29
    -bug in httpcheck: Url in all upper case could not be checked, HTTP://WWW...
    -bug in httpcheck: The cache did not work correctly for urls that were
     not broken
2.5: 2000-04-09
    -httpcheck: some web servers want to have a Host: ...\n\r in the
     HEAD request.
    -htmlpp added
2.6: -make it compile on HPUX
2.7: 2000-10-18
     -some comments added in httpcheck
     -bug reported by Matthijs Hollemans : broken ref to named
      anchors in other files are only reported once.
2.8: 2000-11-05
     -it is possible to have a tag like <a href=... name=...> ... </a>
2.9: 2000-12-20
     -updates to lshtmlref httpcheck blnkcheck
     -new cgi-bin wchck
     -blnkcheck should also check for file://
2.9b: 2002-01-28
     - changed httpcheck to work with more recent perl versions
2.9d: 2002-10-07 
     - editorial updates for upload to 
       www.ibiblio.org/pub/Linux/apps/www/misc/
2.10: 2002-10-08 
     - updates to httpcheck
2.11: 2003-01-06
     - better rpm spec file
2.12: 2004-04-15
     - added ebuild file for gentoo
------------------------
Author: Guido Socher, guido at linuxfocus.org
Copyright: GPL (see http://www.gnu.org/copyleft/gpl.html)
------------------------
This program is available from:
http://linuxfocus.org/~guido
------------------------