








                        TS -- A Simple Token Scanning Library

                                     Release 1.06


                                     Paul DuBois
                               dubois@primate.wisc.edu

                      Wisconsin Regional Primate Research Center
                           Revision date:  18 October 1993





             Applications often wish to pull strings apart into individual
             tokens.  This document describes TS, a library consisting of
             an unsophisticated set of routines providing simple token
             scanning operations.

             String tokenizing can often be done satisfactorily using
             strtok() or an equivalent function from the C library.  When
             such routines are insufficient, the routines described here
             may be useful.  They offer, for example, quote and escape
             character parsing, and configurability of underlying scanning
             properties, within the confines of a fixed interface.  TS
             provides a simple built-in scanner, which may be replaced by
             alternate routines as desired.  Applications may switch back
             and forth between scanners on the fly.

             1.  Installation

             This release of TS is configured using imake and the WRPRC2
             configuration files, so you also need to obtain the WRPRC2
             configuration distribution if you want to build it the usual
             way.  (If you want to avoid imake, the Makefile is simple
             enough that you should be able to tweak it by hand.)

             There is one library to be built, libtokenscan.a.  That
             library should be installed in a system library directory.
             The header file tokenscan.h should be installed in a system
             header file directory.

             2.  Example

             The canonical method of tokenizing a string with TS is as
             follows:









             Revision date:  18 October 1993 Printed:  19 February 192002



                     char    buf[size], *p;

                     /* ...initialize contents of buf here... */
                     TSScanInit (buf);       /* initialize scanner */
                     while ((p = TSScan ()) != (char *) NULL)
                     {
                             /* ...do something with the token pointed to by p... */
                     }

             The scanner is initialized  by  passing  the  string  to  be
             scanned  to  TSScanInit()  and  TSScan()  is  called  to get
             pointers to successive tokens.  TSScan() returns  NULL  when
             there are no more.

             3.  Behavior of the Default Scanner

             The default scanner is destructive in that it modifies the
             string scanned (it writes nulls at the end of each token
             found), so make a copy of the scanned string if you need to
             maintain an intact version.

             The scanner is controlled by delimiter, quote, escape, and
             end-of-string (EOS) characters.  The defaults for each of
             these are given below.

                     delimiter    space tab
                     quote        " '
                     escape       \
                     EOS          null linefeed carriage-return

             In the simplest case, tokens  are  sequences  of  characters
             between  delimiters.   Since  the default delimiters are the
             whitespace characters space and tab, tokens are sequences of
             non-whitespace characters.

                     This is a line      ->   <This> <is> <a> <line>

             Quotes may be used to include whitespace within a token.
             Quotes must match; hence one quote character may be used to
             quote another kind of quote character, if there is more than
             one.

                     "This is" a line    ->   <This is> <a> <line>
                     This" "is a line    ->   <This is> <a> <line>
                     "'" '"'             ->   <'> <">
                     "'"'"'              ->   <'">

             The escape character turns off any special meaning of the
             next character, including another escape character.

                     What\'s up          ->   <What's> <up>
                     \\ is the escape    ->   <\> <is> <the> <escape>
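
             The delimiter, quote, and escape rules described so far can
             be sketched in a few lines of C.  This is not the TS source;
             it is a minimal illustration that assumes space and tab as
             delimiters, " and ' as quotes, backslash as the escape
             character, and null, linefeed, and carriage return as EOS
             characters.  The name sketch_scan() and the sk_ helpers are
             hypothetical, and honoring the escape character inside
             quotes is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Character sets assumed by this sketch (not read from TS). */
static const char SK_DELIMS[] = " \t";
static const char SK_QUOTES[] = "\"'";
static const char SK_ESCAPE = '\\';

/* Non-zero if c is a member of set; the null byte is never a member. */
static int sk_in_set(const char *set, char c)
{
    return c != '\0' && strchr(set, c) != NULL;
}

/* Non-zero if c ends the scan. */
static int sk_is_eos(char c)
{
    return c == '\0' || c == '\n' || c == '\r';
}

/*
 * Return the next token at or after *pos, or NULL when none remain.
 * Destructive: the token is compacted in place (quotes and escapes
 * are dropped) and terminated with a null byte.  *pos is advanced.
 */
char *sketch_scan(char **pos)
{
    char *in, *out, *start;
    char quote = '\0';          /* currently open quote character, if any */

    assert(pos != NULL && *pos != NULL);
    in = *pos;
    while (sk_in_set(SK_DELIMS, *in))   /* skip leading delimiters */
        in++;
    if (sk_is_eos(*in)) {
        *pos = in;
        return NULL;
    }
    start = out = in;
    while (!sk_is_eos(*in)) {
        char c = *in;
        if (c == SK_ESCAPE) {
            in++;                       /* drop the escape itself */
            if (*in == '\0')
                break;                  /* null always ends the scan */
            *out++ = *in++;             /* next character taken literally */
        } else if (quote != '\0') {
            if (c == quote)
                quote = '\0';           /* matching close quote */
            else
                *out++ = c;
            in++;
        } else if (sk_in_set(SK_QUOTES, c)) {
            quote = c;                  /* open quote */
            in++;
        } else if (sk_in_set(SK_DELIMS, c)) {
            in++;                       /* step past the delimiter */
            break;
        } else {
            *out++ = c;
            in++;
        }
    }
    *out = '\0';
    *pos = in;
    return start;
}
```

             To use the sketch, point a char * at a writable buffer and
             call sketch_scan() with its address until NULL is returned.
             Because quotes and escapes are removed in place, the buffer
             must not be a string literal.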





             The EOS characters tell the scanner when to quit scanning.
             A null character always terminates the scan.  In the default
             case, linefeed and carriage return do as well.

             You can replace the delimiter, quote, escape, or EOS
             character sets.  This changes the particular characters that
             trigger the above behaviors, without changing the way the
             default scan algorithm works.  Or you can replace the scan
             routine to make the scanner behave in entirely different
             ways.

             By default, multiple consecutive delimiter characters are
             treated as a single delimiter.  A flag may be set in the
             scanner structure to suppress delimiter concatenation, so
             that every delimiter character is significant.  This is
             useful for tokenizing strings in which empty fields are
             allowed: two consecutive delimiters are considered to have
             an empty token between them, and delimiters appearing at the
             beginning or end of a string signify an empty token at the
             beginning or end of the string.

             The difference in treatment of strings when delimiters are
             concatenated versus when they are not is illustrated below.
             Suppose the delimiter is colon (:) and the string to be
             tokenized is:

                     :a:b::c:

             When delimiters are concatenated, the string contains  three
             tokens:

                     :a:b::c:       -> <a> <b> <c>

             When all delimiters are significant, the string contains
             three empty tokens in addition:

                     :a:b::c:       -> <> <a> <b> <> <c> <>
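
             The two treatments of delimiters can be compared with a
             small standalone splitter.  This is a sketch and not the TS
             implementation: sketch_split() is a hypothetical name, it
             handles a single delimiter character with no quote or
             escape processing, and returning no tokens for an empty
             input string is a choice made here rather than documented
             TS behavior.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split buf in place on delim, storing up to max token pointers in
 * tokens[] and returning how many were stored.  If concat is
 * non-zero, runs of delimiters act as one delimiter; if zero, every
 * delimiter is significant and empty tokens are produced.
 */
size_t sketch_split(char *buf, char delim, int concat,
                    char **tokens, size_t max)
{
    size_t n = 0;
    char *p = buf;

    assert(buf != NULL && tokens != NULL && delim != '\0');
    if (*p == '\0')
        return 0;               /* sketch's choice for empty input */
    while (n < max) {
        char *start;

        if (concat) {
            while (*p == delim) /* swallow a run of delimiters */
                p++;
            if (*p == '\0')
                break;          /* nothing but delimiters remained */
        }
        start = p;
        while (*p != '\0' && *p != delim)
            p++;
        tokens[n++] = start;    /* may be an empty token when !concat */
        if (*p == '\0')
            break;
        *p++ = '\0';            /* terminate token, step past delimiter */
    }
    return n;
}
```

             Called on a writable copy of :a:b::c: with ':' as the
             delimiter, the sketch stores three tokens when concat is
             non-zero and six (three of them empty) when concat is zero,
             matching the two listings above.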


             4.  Programming Interface

             Source files using TS routines should include tokenscan.h
             and executables should be linked with -ltokenscan.

             A scanner is described by a data structure:















                     typedef struct TSScanner TSScanner;
                     struct TSScanner
                     {
                             void    (*scanInit) ();
                             char    *(*scanScan) ();
                             char    *scanDelim;
                             char    *scanQuote;
                             char    *scanEscape;
                             char    *scanEos;
                             int     scanFlags;
                     };

             Scanner structures may be obtained or installed with
             TSGetScanner() and TSSetScanner().

             For each string to be scanned, the application passes a
             pointer to it to TSScanInit(), which takes care of scan
             initialization.  If the application requires initialization
             to be performed in addition to that done internally by TS, a
             pointer to a routine that does so should be installed in the
             scanInit field of the scanner data structure.  It takes one
             argument, a pointer to the string to be scanned.  The
             default scanInit is NULL, which does nothing.

             scanDelim, scanQuote, scanEscape, and scanEos are pointers
             to null-terminated strings consisting of the set of
             characters to be considered delimiter, quote, escape, and EOS
             characters, respectively.  The default values were described
             previously.

             scanScan points to the routine that does the actual
             scanning.  It is called by TSScan() and should be declared to
             take no arguments and return a character pointer to the next
             token in the current scan buffer.  Normally, this routine
             does the following: call TSGetScanPos() to get the current
             scan position, scan the token, call TSSetScanPos() to update
             the scan position, then return a pointer to the beginning of
             the token.  If there are no more tokens in the scan buffer,
             the routine should return NULL, and should continue to do so
             until TSScanInit() is called again.

             scanFlags contains flags that modify the scanner's behavior.
             For the default scanner, the default is zero.  If the
             tsNoConcatDelims flag is set, the scanner stops on every
             delimiter rather than treating sequences of contiguous
             delimiters as a single delimiter.
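
             The get-position, scan, set-position sequence expected of a
             scanScan routine might look like the following sketch.  The
             routine SketchWordScan() is hypothetical; it treats runs of
             letters and digits as tokens and ignores the delimiter,
             quote, and escape sets entirely.  So that the example
             compiles without the library, it defines stand-ins for
             TSGetScanPos() and TSSetScanPos() with the argument types
             documented in this section; when building against
             libtokenscan, delete the stand-ins and include tokenscan.h
             instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-ins for the library routines, for this sketch only. */
static char *sketchScanPos = NULL;
void TSGetScanPos(char **p) { *p = sketchScanPos; }
void TSSetScanPos(char *p)  { sketchScanPos = p; }

/*
 * Hypothetical scan routine: a token is a maximal run of letters
 * and digits.  Takes no arguments and returns a pointer to the next
 * token, or NULL when the buffer is exhausted.
 */
char *SketchWordScan(void)
{
    char *p, *start;

    TSGetScanPos(&p);                   /* 1. get current position */
    assert(p != NULL);                  /* a scan must be in progress */

    while (*p != '\0' && !isalnum((unsigned char) *p))
        p++;                            /* skip anything not in a word */
    if (*p == '\0') {
        TSSetScanPos(p);                /* stay at the end, so later */
        return NULL;                    /* calls keep returning NULL */
    }

    start = p;                          /* 2. scan the token */
    while (isalnum((unsigned char) *p))
        p++;
    if (*p != '\0')
        *p++ = '\0';                    /* terminate token destructively */

    TSSetScanPos(p);                    /* 3. update the position */
    return start;                       /* 4. return start of token */
}
```

             With the real library, such a routine would be installed by
             fetching the current scanner with TSGetScanner(), storing
             the routine's address in the scanScan field, and calling
             TSSetScanner().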

             The public routines in the TS library are described below.

             void TSScanInit (p)
             char    *p;

             Initializes the scanning  routines  to  make  the  character
             string pointed to by p the current scan buffer.







             char *TSScan ()

             Returns a pointer to the next token in the current scan
             buffer, NULL if there are no more.  The token is terminated
             by a null byte.  Scan behavior may be modified by
             substituting alternate scan routines.

             Once TSScan() returns NULL, it continues to do so until the
             scanner is reinitialized with another call to TSScanInit().

             void TSGetScanner (p)
             TSScanner       *p;

             Gets the current  scanner  information  (initialization  and
             scan procedures; delimiter, quote, escape, and EOS character
             sets; and scanner flags) into the structure pointed to by p.

             void TSSetScanner (p)
             TSScanner       *p;

             Installs a scanner.  If p itself is NULL, all default
             scanner values are reinstalled.  Otherwise, any pointer
             field in p with a NULL value causes the corresponding value
             from the default scanner to be reinstalled, and if
             p->scanFlags is zero, the scanner flags are set to the
             default (also zero).

             void TSGetScanPos (p)
             char    **p;

             Puts the current position within  the  current  scan  buffer
             into  the argument, which should be passed as the address of
             a character pointer.  This is useful when you want  to  scan
             only enough of the buffer to partially classify it, then use
             the rest in some other way.

             void TSSetScanPos (p)
             char    *p;

             Sets the current scan position to p.

             int TSIsScanDelim (c)
             char    c;

             Returns non-zero if c is a member of the  current  delimiter
             character set, zero otherwise.

             int TSIsScanQuote (c)
             char    c;

             Returns non-zero if c is a member of the current quote char-
             acter set, zero otherwise.








             int TSIsScanEscape (c)
             char    c;

             Returns non-zero if c is a  member  of  the  current  escape
             character set, zero otherwise.

             int TSIsScanEos (c)
             char    c;

             Returns non-zero if c is an  end-of-string  character,  zero
             otherwise.

             int TSTestScanFlags (flags)
             int     flags;

             Returns non-zero if all  bits  in  flags  are  set  for  the
             current scanner, zero otherwise.

             4.1.  Overriding Scanning Routines

             It is possible to switch back and forth between scan
             procedures on the fly, even in the middle of scanning a
             string.  The general procedure is to use TSGetScanner() to
             get the current scanner information, and TSSetScanner() to
             install a new one and reinstall the old one when done with
             the new.  If you switch between more than two scanners,
             another method may be necessary.

             It is possible to modify the default scanner without
             replacing it.  For instance, you could change the default
             delimiter set but leave everything else the same, as follows:

                     TSScanner scanStruct;

                     TSGetScanner (&scanStruct);
                     scanStruct.scanDelim = " :;?,!";
                     TSSetScanner (&scanStruct);
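
             To get a feel for what the delimiter set " :;?,!" does to a
             line of text without TS installed, the C library's strtok()
             can serve as a rough approximation.  This is only a sketch:
             sketch_punct_split() is a hypothetical name, and the
             comparison holds only for input with no quote or escape
             characters, since strtok() gives those no special meaning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split buf in place on the characters " :;?,!", storing up to max
 * token pointers in tokens[] and returning how many were stored.
 * Like the default TS behavior, strtok() treats a run of delimiter
 * characters as a single delimiter.
 */
size_t sketch_punct_split(char *buf, char **tokens, size_t max)
{
    static const char delims[] = " :;?,!";
    size_t n = 0;
    char *tok;

    assert(buf != NULL && tokens != NULL);
    for (tok = strtok(buf, delims);
         tok != NULL && n < max;
         tok = strtok(NULL, delims))
        tokens[n++] = tok;
    return n;
}
```

             Given a writable buffer holding "Hello, world! How are
             you?", the sketch stores the five words with the
             punctuation stripped.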


             5.  Miscellaneous

             A scanner can be nondestructive with respect to the line
             being scanned by using a scan routine that copies characters
             out of the scanned line into a second buffer and returns a
             pointer to the second buffer.  The second buffer must be
             large enough to hold the largest possible token, of course.
             If the second buffer is a fixed area, the host application
             must be careful not to call TSScan() again until it is done
             with the current token, or else make a copy of it first.  If
             the second buffer is dynamically allocated, the application
             must be ready to do storage management of the tokens
             returned.

             Some scanners might not need delimiter, quote, escape, or
             EOS characters at all, particularly if token boundaries are
             context sensitive.
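
             The copy-out approach to nondestructive scanning can be
             sketched as follows.  The function sketch_copy_scan() is
             hypothetical and stands apart from the TS interface: it
             takes its scan position and its second buffer as arguments,
             uses only space and tab as delimiters, and silently
             truncates a token that does not fit, which is a
             simplification made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy the next whitespace-delimited token at or after *pos into
 * out (capacity outsize) and return out, or return NULL when no
 * tokens remain.  The scanned line itself is never modified.
 */
char *sketch_copy_scan(const char **pos, char *out, size_t outsize)
{
    const char *in;
    size_t n = 0;

    assert(pos != NULL && *pos != NULL && out != NULL && outsize > 0);
    in = *pos;
    while (*in == ' ' || *in == '\t')   /* skip leading delimiters */
        in++;
    if (*in == '\0') {
        *pos = in;
        return NULL;
    }
    while (*in != '\0' && *in != ' ' && *in != '\t') {
        if (n + 1 < outsize)            /* leave room for the null */
            out[n++] = *in;
        in++;
    }
    out[n] = '\0';
    *pos = in;
    return out;
}
```

             Because every call reuses the same output buffer, the
             caller has to finish with (or copy) each token before
             asking for the next one, which is the fixed-area caveat
             described in this section.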

             6.  Distribution and Update Availability

             The TS distribution may be freely circulated and is
             available for anonymous FTP access in the /pub/TS directory
             on host ftp.primate.wisc.edu.  Updates appear there as they
             become available.

             The WRPRC2 imake configuration file distribution is
             available on ftp.primate.wisc.edu as well, in
             /pub/imake-stuff.

             If you do not have FTP access, send requests to
             software@primate.wisc.edu.  Bug reports, questions,
             suggestions and comments may be sent to this address as well.











































