This is the first beta release of an AWK-to-C Translator that I am currently
developing.  It is based on GAWK 2.15.6 and supports various Unix systems,
as well as Windows NT/95 and DOS16/DOS32 on PC systems.  Development is
being done on a Linux system.


Features of AWK/GAWK not yet supported (I have probably forgotten some;
do let me know if you discover a "missing feature"):

1) All the builtin functions (e.g. printf, gsub, split, etc).  Note however
   that the 'print' statement is supported.
2) String concatenations
3) The 'for XX in YY' statement.  Associative arrays are otherwise fully
   supported.
4) I/O redirection within the AWK program
   (e.g.  print "fjklfj" > "out.dat")
5) Variable initializations via the command line
   (e.g.  awk -v foo=2.3 -f test.awk test.in)
6) User functions 

(These are listed in order of priority for support in the next release.)


1. Instructions for building the AWK-to-C Translator
====================================================

1) Change to the src/ subdirectory

2) Unix systems:  Run 'configure', specifying the option for your system
                  (run it with no arguments to get a list)
   PC systems:    Change to the src/pc/ subdirectory, run config.bat
                  with the option for your OS/compiler combination, then
                  change back to the main src/ subdirectory.
                  OS choices: Windows NT, Win95, DOS, 32-bit DOS
                  Compiler choices: MSC or Visual C++, Watcom C/C++, GCC
                  (run it with no arguments to get a list)

   The configuration process will produce a Makefile and config.h tailored
   to your system and copy any special source files to the src/ directory.

3) Compiling the translator and static library:
     Run 'make all' (PC compilers like MSC use 'nmake', Watcom C/C++ uses
     'wmake', etc).
     On Unix systems, the make will produce the 'awk2c' translator
     executable and the 'libawk2c.a' static library.  On PC systems, the
     files will be 'awk2c.exe' and 'awk2c.lib'.  You can ignore the
     warnings given by the PC compilers during the build.

    NOTE:  You might want to add any compiler flags you like to
           the CFLAGS variable in the Makefile.

4) Copy 'awk2c', 'a2c', and 'a2cb' to some directory in your path
   (e.g. /usr/local/bin).  (awk2c.exe, a2c.bat, and a2cb.bat for PC)

5) Copy 'libawk2c.a' or 'awk2c.lib' to your favourite library directory
   where the linker can find it (e.g. /usr/local/lib)

6) Copy 'awk.h' and 'config.h' to your favourite include directory where
   the compiler can find them (e.g. /usr/local/include)

NOTE:  If you want, you can keep everything in the src directory and can
       compile the converted C program there too (see section 3)  
       

2. Using the 'awk2c' translator
===============================      

The translator behaves just like the standard gawk program except that
it doesn't expect any input files or stdin.  So you can translate an
AWK program in 2 ways:   

  1) awk2c '<YOUR AWK PROGRAM GOES HERE>'  (awk2c "..." for PC)
  2) awk2c -f <AWK PROGRAM FILE>
  (run 'awk2c' by itself or 'awk2c --help' to get a synopsis)

The converted C program is written to standard output.  (The raw output
from awk2c is not indented or pretty-formatted.)  The translator also
accepts and understands any gawk options that are not specific to
the runtime and execution of the AWK program.  (Note that most of the
options are severely under-tested.)

You can also use the 'a2c' Unix script (or 'a2c.bat' for PC), which takes
two arguments: the AWK program filename and the destination C filename.
The a2c script passes awk2c's output to 'indent' to pretty-format the
code.

For example, to translate ~/awk/foo.awk to test.c do:
'a2c ~/awk/foo.awk test.c'

Run 'a2c' with no arguments to get a synopsis. 


3. Compiling the converted C program
====================================

The scheme currently used (at a very high level) is as follows.  The
translator produces a C file containing the AWK user variable declarations
and three functions: one for the BEGIN block, one for the pattern/action
blocks, and one for the END block.  This file is compiled and linked with
other objects to produce the final executable.  One of these "other" objects
is 'driver.o', which contains the main() function for driving the compiled
AWK program.
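
As an illustration only, the scheme above might look roughly like the
following C sketch for the AWK program
'BEGIN { total = 0 } { total += length($0) } END { print total }'.
Every name here (awk_begin, awk_pattern_actions, awk_end, run_awk_program)
is hypothetical and not taken from awk2c's real output or driver, and the
fixed-size line buffer is a simplification.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: names and layout are assumptions, not awk2c output. */

/* AWK user variable, declared at file scope by the (imagined) translator. */
static double total;

/* BEGIN { total = 0 } */
static void awk_begin(void)
{
    total = 0.0;
}

/* { total += length($0) }   (runs once per input record) */
static void awk_pattern_actions(const char *record)
{
    total += (double)strlen(record);
}

/* END { print total } */
static void awk_end(FILE *out)
{
    fprintf(out, "%g\n", total);
}

/* Stand-in for the record loop that a driver's main() would run.
 * Returns the number of records read.  Lines longer than the buffer
 * would be split into several records in this simplified version. */
static long run_awk_program(FILE *in, FILE *out)
{
    char line[4096];
    long nrecords = 0;

    awk_begin();
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the record separator */
        nrecords++;
        awk_pattern_actions(line);
    }
    awk_end(out);
    return nrecords;
}
```

A driver's main() would then amount to little more than a call such as
run_awk_program(stdin, stdout), which is why the generated file and the
driver can be compiled separately and linked afterwards.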

There are two methods for compiling the converted C program:

1) Name your converted C program 'awk2c_cprogram.c' ('a2c_cprg.c'
   on PC systems) and compile it in the main src/ subdirectory by
   running 'make driver'.  This uses the Makefile, and the make will
   produce the compiled program in a file called 'driver'.
   (Obviously this method requires you to keep the various object files, 
   header files, and Makefile lying around)

2) Use the 'a2cb' script (or a2cb.bat on PC systems).  This script takes
   two arguments: the source C filename and the destination executable
   name.  It also expects to find 'libawk2c.a' at link time, and 'awk.h'
   and 'config.h' at compile time.
   (Run a2cb with no arguments to get a synopsis.)

   This method is cleaner than method 1) in that it only requires you
   to keep 'awk2c'/'a2c'/'a2cb', libawk2c.a (or awk2c.lib) in your
   favourite library directory, and awk.h & config.h in an include directory
   of your choice.  Everything else can be deleted after the translator is
   built.
  
   You might want to modify the 'CC' and 'CFLAGS' environment variables set
   in the 'a2cb' script to your compiler and options (default is CC=gcc,
   CFLAGS=-O2).
 
   A typical unix setup:
   'awk2c', 'a2c' and 'a2cb' go in /usr/local/bin
   'libawk2c.a' goes in /usr/local/lib
   'awk.h' and 'config.h' go in /usr/local/include

   *NOTE:  For some weird reason, method 1) seems to produce slightly
           faster executables on my Linux 1.2.13 system running gcc 2.6.3
           (the order in which the objects are linked seems to affect the
            runtime performance)


4. Using the compiled AWK program
=================================

You use the compiled AWK program just as you would use gawk but now the
AWK program related options don't come into play.  You can use the
'--help' option to get a synopsis.  The program will run on data input
from standard input or pipes.  To run it on input files just pass
the filename(s) as arguments as you would with gawk.

e.g.:  'program < data.in',  'program data1.in data2.in', etc (where 'program'
       is the filename of the compiled executable)
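
The input selection described above (standard input when no filenames are
given, otherwise each named file in turn) can be sketched in C as follows.
This is only a guess at the general shape, not the actual driver code;
count_records and process_inputs are hypothetical names, and counting
records stands in for the real per-record work.

```c
#include <stdio.h>

/* Hypothetical sketch of input selection; not taken from awk2c's driver. */

/* Count newline-terminated records on one stream.  A final record with
 * no trailing newline is still counted. */
static long count_records(FILE *in)
{
    long n = 0;
    int c, last = '\n';

    while ((c = getc(in)) != EOF) {
        if (c == '\n')
            n++;
        last = c;
    }
    if (last != '\n')
        n++;
    return n;
}

/* With no filenames, read default_in (stdin in a real program);
 * otherwise read each named file in order.
 * Returns the total record count, or -1 if a file cannot be opened. */
static long process_inputs(int nfiles, char **files, FILE *default_in)
{
    long total = 0;
    int i;

    if (nfiles == 0)
        return count_records(default_in);

    for (i = 0; i < nfiles; i++) {
        FILE *fp = fopen(files[i], "r");
        if (fp == NULL) {
            perror(files[i]);
            return -1;
        }
        total += count_records(fp);
        fclose(fp);
    }
    return total;
}
```

A main() built on this would call process_inputs with the non-option
arguments left on the command line and stdin as the default stream.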


5. Runtime Performance
======================

My tests show that the speed improvement of translator-produced executables
over gawk ranges from minimal to more than 2000% (20+ times as fast).  To be
fair, I compiled both GAWK and the converted C programs with the same
optimization options.  Generally, you can win big if your AWK program uses
loops, moderate to heavy expression processing or calculations, etc.  As the
complexity of the AWK program increases, the speed improvement factor
increases.

One other observation is that as the number of input records increases, the
speed improvement for the same AWK program also increases.

Generally, if your AWK program uses variables that keep a single type
throughout the program, or undergo only a few number<->string conversions,
then the optimizer can detect this and emit more efficient C code.
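
To give a rough feel for why this matters, the C sketch below contrasts a
loop whose accumulator is converted between string and number on every
iteration with the same loop using a plain double.  It is an illustration
under assumed names (awk_cell, sum_generic, sum_specialized) and does not
reflect how gawk or awk2c actually represent values.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical illustration; not the representation used by gawk/awk2c. */

/* A value held as a string and re-parsed whenever a number is needed. */
typedef struct {
    char str[32];
} awk_cell;

static double cell_to_num(const awk_cell *c)
{
    return strtod(c->str, NULL);            /* string -> number */
}

static void cell_set_num(awk_cell *c, double v)
{
    snprintf(c->str, sizeof c->str, "%.6g", v);   /* number -> string */
}

/* Generic version of: for (i = 0; i < n; i++) sum += i
 * Two conversions per iteration. */
static double sum_generic(int n)
{
    awk_cell sum;
    int i;

    cell_set_num(&sum, 0.0);
    for (i = 0; i < n; i++)
        cell_set_num(&sum, cell_to_num(&sum) + i);
    return cell_to_num(&sum);
}

/* Specialized version: sum is known to be numeric throughout, so it can
 * be a plain C double with no conversions at all. */
static double sum_specialized(int n)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

Both functions compute the same result for small n, but the second gives
the C compiler a simple arithmetic loop to optimize.  Note that the "%.6g"
round trip in the generic version loses precision once values need more
than six significant digits.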

See src/test/report/perf.rep for the performance numbers.
(Some testcases are shown in "a" and "b" series.  The former is the testcase
run on a small input set, while the latter is the same testcase run on a
larger input set.)
  

6. Testcases
============

I've developed a suite of verification and performance testcases in the
src/test/fvt/ and src/test/performance/ subdirectories.  You can look over these
AWK programs (*test.*.awk) and their C counterparts (*test.*.c) to get an
idea of how the translator works.

To make my life easier, I developed a testing tool (the 'tst' script in
the main src/ subdirectory) which can run the verification and
performance tests on a given set of AWK programs and inputs.
The performance tests are only meaningful if you're on a Unix system where
activity is low and you have most/all of the processor cycles.  The
verification tests simply check that gawk and the compiled AWK program
produce the same output (and are therefore meaningless if the testcase
has no output).

To run the entire FVT suite, run 'tst v'; for performance, run 'tst p'.
(Run 'tst' without arguments to get a synopsis.)  You can also run a single
testcase or a select group of testcases, as long as each testcase filename
has a corresponding input file.

e.g.: If you're in the src/ directory, you can do
     'tst v test/fvt/test.[1-4].awk'
     'tst p test/performance/test.1*.awk'
     
     If you have a testcase in ~/test/foo.awk and an input file ~/test/foo.in,
     you can do 'tst v ~/test/foo.awk'.

  
Misc
====

I hope you find this translator useful.  I'll be interested in any comments
or problem reports you have.  If you have an idea for making something better,
or want to help in porting it to an unsupported system,
let me know or post to the comp.lang.awk usenet newsgroup.
 

Leonard Theivendra
IBM Toronto Software lab
------------------------------------------------------------------------------
E-Mail:  theiven@skule.ecf.toronto.edu (<= 08/31/96)
         theivend@torolab6.vnet.ibm.com (>= 07/01/96)

Standard Disclaimer:  Any opinions expressed are solely my own and not
                      that of IBM Corp.