HP UX Archive Centre

This directory contains awkpretty, a prettyprinter for the awk
programming language.

It is based on minor modifications of Brian Kernighan's awk code in
../19990305 (== /u/sy/beebe/src/bwkawk/19990305), following the
observation made during a study of the code that awk operates by first
calling yyparse() to build a parse tree in memory, then calling run() to
execute that parse tree.  yyparse() in turn calls yylex() to get
lexical tokens.

The idea is that by wrapping yylex() inside another function, wwlex(),
we can arrange to output the lexical token stream while grammar
conformance is being checked by yyparse(), and then we can just skip
the normal run() step by providing a dummy version of that function.
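
The wrapping idea can be sketched in a few lines of C.  This is only
an illustration, not the code in wwlex.c: the yylex() below is a
hypothetical stand-in that hands back a fixed token sequence, the
output format is invented for the example, and the argument type of
wwrun() is an assumption.

```c
#include <stdio.h>

/* Hypothetical stand-in for awk's real lexer: it returns three fixed
   token codes and then 0 (end of input) forever after. */
static const int demo_tokens[] = { 261, 123, 125, 0 };
static int demo_pos = 0;

int yylex(void)
{
    int tok = demo_tokens[demo_pos];

    if (tok != 0)
        demo_pos++;
    return tok;
}

/* Wrapper: fetch one token from yylex(), report it on stdout, and
   hand it on unchanged, so the caller sees the same token stream. */
int wwlex(void)
{
    int tok = yylex();

    printf("%3d\ttoken %d\n", tok, tok);
    return tok;
}

/* Dummy replacement for run(): the parse tree is never executed.
   The void* argument is an assumption made for this sketch. */
void wwrun(void *parse_tree)
{
    (void)parse_tree;
}
```

A parser that calls wwlex() in place of yylex() receives exactly the
same return values, which is what lets the grammar check proceed
undisturbed while the tokens are being printed.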

Here are the changes made:

	* completely new Makefile

	* two new token types (COMMENT and WHITESPACE) added to
	  awkgram.y

	* modification of the two lex rules for COMMENT and WHITESPACE
	  in lex.c to actually return those tokens, instead of
	  discarding them

	* new function, wwlex() in wwlex.c, wrapping yylex(), allowing
	  the output of a lexical token before returning to yyparse()

	* dummy function, wwrun(), in wwlex.c, so that main.c needs no
	  modifications

	* compilation of main.c with -Drun=wwrun, so that the run()
	  step is dummied out

	* compilation of ytab.c (the yacc output from awkgram.y) with
	  -Dyylex=wwlex, so that yyparse() calls the wwlex() wrapper,
	  instead of yylex().  It sees exactly the same token stream
	  in either case.

	* compilation with -Dtrue=ctrue -Dfalse=cfalse so that C++
	  compilers can be used

	* (char*) typecasts added to calls to malloc() and realloc()
	  in run.c to allow C++ compilation

	* three () argument lists changed to (void) in lex.c to allow
	  C++ compilation
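
The renaming done by the -D options above can be imitated inside a
single file with #define, which may make the mechanism easier to see.
The following is a sketch under that assumption: count_tokens(),
remaining, and wrapper_calls are hypothetical names invented for the
example, and the wwlex() shown is a toy, not the one in wwlex.c.

```c
/* Imitates compiling the parser source with -Dyylex=wwlex. */
#define yylex wwlex

int wwlex(void);        /* the wrapper, defined further down */

/* Stand-in for the parser: the source text says yylex(), but the
   preprocessor turns every such call into a call to wwlex(). */
int count_tokens(void)
{
    int n = 0;

    while (yylex() != 0)
        n++;
    return n;
}

#undef yylex

/* Toy wrapper: returns 3, 2, 1, then 0, and counts how often it
   has been called. */
static int remaining = 3;
static int wrapper_calls = 0;

int wwlex(void)
{
    wrapper_calls++;
    return (remaining > 0) ? remaining-- : 0;
}
```

After count_tokens() returns, wrapper_calls is nonzero even though the
parser source never mentions wwlex() by name, so the parser file
itself needs no editing.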

The lexical token stream, similar to that produced by bibclean and
biblex, is piped into a completely separate, and relatively simple,
prettyprinter, written in awk itself for compactness and ease of
modification.

This approach guarantees that the prettyprinter sees exactly the
tokens that awk sees and that the token sequence will have been
verified to conform to the awk grammar.  Best of all, it requires only
the addition of two lines to awkgram.y and about 50 lines to lex.c,
modification of about 10 lines in awklex.l, and a 140-line file,
wwlex.c: under 200 new lines of C, lex, and yacc code.  This is a huge
bonus compared to writing almost 8,500 lines from scratch:

	% cat ../19990305/*.[chly] | wc -l
	  8463

Here is a sample of the lexer output:

	% ./awklex 'BEGIN {print "hello, world"}' /dev/null
	# line 1 "/dev/stdin"
	261     XBEGIN  BEGIN
	337     WHITESPACE
	123     token 123       {
	319     PRINT   print
	337     WHITESPACE
	334     STRING  "hello, world"
	 59     token 59        }
	125     token 125       }
	  0     token 0 }

Fields are tab-separated, but only the first two tabs on a line are
significant; all others are just data.
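
A reader of this format therefore has to split each line at the first
two tabs only.  The following C sketch shows one way to do that; the
struct token_line type and the split_token_line() function are
hypothetical names for illustration, not part of the distribution.

```c
#include <string.h>

/* Hypothetical record for one line of lexer output. */
struct token_line {
    char *code;     /* numeric token code, still as text */
    char *name;     /* token name, or NULL if absent */
    char *text;     /* token text, or NULL if absent; may contain tabs */
};

/* Split line in place at its first two tabs.  Any later tabs are
   left alone as part of the text field.  Returns the number of
   fields found (1, 2, or 3). */
int split_token_line(char *line, struct token_line *out)
{
    char *tab;

    out->code = line;
    out->name = NULL;
    out->text = NULL;

    tab = strchr(line, '\t');
    if (tab == NULL)
        return 1;
    *tab = '\0';
    out->name = tab + 1;

    tab = strchr(out->name, '\t');
    if (tab == NULL)
        return 2;
    *tab = '\0';
    out->text = tab + 1;
    return 3;
}
```

Because the search stops after the second tab, a string token whose
text contains a literal tab still arrives intact in the text field.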

Since awkpretty has relatively complex logic, and the awk language has
some `dark corners', it is possible that a bug in awkpretty could
result in a change in the meaning of its input program.  One simple
such case would be the incorrect introduction of a non-backslashed
line break at the space in the program "pattern {action}".

To increase confidence in awkpretty, extensive regression testing has
been carried out, using a test body of about 500,000 lines of real awk
programs in the large file systems from several major UNIX vendors at
the author's site.  The regression tests compare the token stream
produced by awklex on the original programs with that from the
prettyprinted programs: there should be no differences, except in line
number directives and horizontal and vertical spacing.  The only such
tests found to fail were those where there were syntax errors in the
original programs.  
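
The comparison just described can be sketched in C as follows.  This
is not the test harness actually used; is_significant() and
same_token_stream() are hypothetical names.  The sketch assumes that
directive lines begin with "# line" and that spacing tokens carry the
name WHITESPACE in the second field, as in the sample output above.
It does not attempt to handle changes in vertical spacing.

```c
#include <string.h>

/* Return 1 if this line of lexer output should take part in the
   comparison, 0 if it is a line directive or a WHITESPACE token. */
int is_significant(const char *line)
{
    const char *name;

    if (strncmp(line, "# line", 6) == 0)
        return 0;
    name = strchr(line, '\t');
    if (name == NULL)
        return 1;
    name++;
    if (strncmp(name, "WHITESPACE", 10) == 0 &&
        (name[10] == '\0' || name[10] == '\t' || name[10] == '\n'))
        return 0;
    return 1;
}

/* Compare two token streams, given as arrays of lines, skipping the
   ignorable lines in each.  Returns 1 if the remaining lines match
   one for one, 0 otherwise. */
int same_token_stream(const char **a, size_t na,
                      const char **b, size_t nb)
{
    size_t i = 0, j = 0;

    for (;;) {
        while (i < na && !is_significant(a[i]))
            i++;
        while (j < nb && !is_significant(b[j]))
            j++;
        if (i == na || j == nb)
            return i == na && j == nb;
        if (strcmp(a[i], b[j]) != 0)
            return 0;
        i++;
        j++;
    }
}
```

Two streams that differ only in where WHITESPACE tokens fall, or in
the file name of the line directive, compare as equal; a missing or
altered token of any other kind makes the comparison fail.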

The validation suite included with the awkpretty distribution, and run
by "make check", tests the formatting of more than 3300 lines in sample
files that have been devised to attempt to expose problems for
prettyprinting, and to exhibit uses of all possible language
constructs.  The results are compared against results obtained at the
author's site, and believed to be correct.  There is another make
target intended primarily for the developer: "make maintainer-check"
runs a regression test of the type described above, using all of the
check*in files used by "make check", plus all of the awk programs
included in the distribution (another 1200+ lines of code).