TS -- A Simple Token Scanning Library
Release 1.06

Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center

Revision date: 18 October 1993
Printed: 19 February 192002

Applications often wish to pull strings apart into individual
tokens. This document describes TS, a library consisting of an
unsophisticated set of routines providing simple token scanning
operations.

String tokenizing can often be done satisfactorily using strtok()
or an equivalent function from the C library. When such routines
are insufficient, the routines described here may be useful. They
offer, for example, quote and escape character parsing, and
configurability of underlying scanning properties, within the
confines of a fixed interface.

TS provides a simple built-in scanner, which may be replaced by
alternate routines as desired. Applications may switch back and
forth between scanners on the fly.

1. Installation

This release of TS is configured using imake and the WRPRC2
configuration files, so you also need to obtain the WRPRC2
configuration distribution if you want to build it the usual way.
(If you want to avoid imake, the Makefile is simple enough that
you should be able to tweak it by hand.)

There is one library to be built, libtokenscan.a. That library
should be installed in a system library directory. The header
file tokenscan.h should be installed in a system header file
directory.

2. Example

The canonical method of tokenizing a string with TS is as follows:

     char buf[size], *p;

     /* ...initialize contents of buf here... */
     TSScanInit (buf);       /* initialize scanner */
     while ((p = TSScan ()) != (char *) NULL)
     {
          /* ...do something with token pointed to by p here... */
     }

The scanner is initialized by passing the string to be scanned to
TSScanInit(), and TSScan() is called to get pointers to
successive tokens. TSScan() returns NULL when there are no more.

3. Behavior of the Default Scanner

The default scanner is destructive in that it modifies the string
scanned (it writes nulls at the end of each token found), so make
a copy of the scanned string if you need to maintain an intact
version.

The scanner is controlled by delimiter, quote, escape, and
end-of-string (EOS) characters. The defaults for each of these
are given below.

     delimiter    space tab
     quote        " '
     escape       \
     EOS          null linefeed carriage-return

In the simplest case, tokens are sequences of characters between
delimiters. Since the default delimiters are the whitespace
characters space and tab, tokens are sequences of non-whitespace
characters.

     This is a line    ->  <This> <is> <a> <line>

Quotes may be used to include whitespace within a token. Quotes
must match; hence one quote character may be used to quote
another kind of quote character, if there is more than one.

     "This is" a line  ->  <This is> <a> <line>
     This" "is a line  ->  <This is> <a> <line>
     "'" '"'           ->  <'> <">
     "'"'"'            ->  <'">

The escape character turns off any special meaning of the next
character, including another escape character.

     What\'s up        ->  <What's> <up>
     \\ is the escape  ->  <\> <is> <the> <escape>

The EOS characters tell the scanner when to quit scanning. A null
character always terminates the scan. In the default case,
linefeed and carriage return do as well.

You can replace the delimiter, quote, escape, or EOS character
sets. This changes the particular characters that trigger the
above behaviors, without changing the way the default scan
algorithm works. Or you can replace the scan routine to make the
scanner behave in entirely different ways.

By default, multiple consecutive delimiter characters are treated
as a single delimiter. A flag may be set in the scanner structure
to suppress delimiter concatenation, so that every delimiter
character is significant.
This is useful for tokenizing strings in which empty fields are
allowed: two consecutive delimiters are considered to have an
empty token between them, and delimiters appearing at the
beginning or end of a string signify an empty token at the
beginning or end of the string.

The difference in treatment of strings when delimiters are
concatenated versus when they are not is illustrated below.
Suppose the delimiter is colon (:) and the string to be tokenized
is:

     :a:b::c:

When delimiters are concatenated, the string contains three
tokens:

     :a:b::c:  ->  <a> <b> <c>

When all delimiters are significant, the string contains three
empty tokens in addition:

     :a:b::c:  ->  <> <a> <b> <> <c> <>

4. Programming Interface

Source files using TS routines should include tokenscan.h, and
executables should be linked with -ltokenscan.

A scanner is described by a data structure:

     typedef struct TSScanner TSScanner;

     struct TSScanner
     {
          void (*scanInit) ();
          char *(*scanScan) ();
          char *scanDelim;
          char *scanQuote;
          char *scanEscape;
          char *scanEos;
          int  scanFlags;
     };

Scanner structures may be obtained or installed with
TSGetScanner() and TSSetScanner().

For each string to be scanned, the application passes a pointer
to it to TSScanInit(), which takes care of scan initialization.
If the application requires initialization to be performed in
addition to that done internally by TS, a pointer to a routine
that does so should be installed in the scanInit field of the
scanner data structure. It takes one argument, a pointer to the
string to be scanned. The default scanInit is NULL, meaning that
no additional initialization is performed.

scanDelim, scanQuote, scanEscape, and scanEos are pointers to
null-terminated strings consisting of the set of characters to be
considered delimiter, quote, escape, and EOS characters,
respectively. The default values were described previously.

scanScan points to the routine that does the actual scanning.
It is called by TSScan() and should be declared to take no
arguments and return a character pointer to the next token in the
current scan buffer. Normally, this routine does the following:
call TSGetScanPos() to get the current scan position, scan the
token, call TSSetScanPos() to update the scan position, then
return a pointer to the beginning of the token. If there are no
more tokens in the scan buffer, the routine should return NULL,
and should continue to do so until TSScanInit() is called again.

scanFlags contains flags that modify the scanner's behavior. For
the default scanner, the default is zero. If the tsNoConcatDelims
flag is set, the scanner stops on every delimiter rather than
treating sequences of contiguous delimiters as a single
delimiter.

The public routines in the TS library are described below.

     void TSScanInit (p)
     char *p;

Initializes the scanning routines to make the character string
pointed to by p the current scan buffer.

     char *TSScan ()

Returns a pointer to the next token in the current scan buffer,
or NULL if there are no more. The token is terminated by a null
byte. Scan behavior may be modified by substituting alternate
scan routines. Once TSScan() returns NULL, it continues to do so
until the scanner is reinitialized with another call to
TSScanInit().

     void TSGetScanner (p)
     TSScanner *p;

Gets the current scanner information (initialization and scan
procedures; delimiter, quote, escape, and EOS character sets; and
scanner flags) into the structure pointed to by p.

     void TSSetScanner (p)
     TSScanner *p;

Installs a scanner. If p itself is NULL, all default scanner
values are reinstalled. Otherwise, any pointer field in p with a
NULL value causes the corresponding value from the default
scanner to be reinstalled, and if p->scanFlags is zero, the
scanner flags are set to the default (also zero).
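The protocol a scanScan routine is expected to follow (get the
scan position, scan a token, update the position, return the
token, and keep returning NULL once the buffer is exhausted) can
be sketched without linking against TS. The sketch below is only
an illustration under stated assumptions: demo_pos, demo_delims,
demo_init(), and demo_scan() are hypothetical names invented
here, not part of the library. In a real scanScan routine, the
reads and writes of demo_pos would presumably be calls to
TSGetScanPos() and TSSetScanPos(), and the delimiter test would
be TSIsScanDelim(). Every delimiter is treated as significant,
mimicking the effect described for tsNoConcatDelims.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, NOT part of TS: the scan position that TS
   keeps internally, and a delimiter set in the style of scanDelim.
   A NULL position means "no more tokens until the next demo_init()". */
static char *demo_pos = NULL;
static const char *demo_delims = ":";

/* Plays the role that TSScanInit() plays for a real scanner. */
static void demo_init(char *buf)
{
    assert(buf != NULL);
    demo_pos = buf;
}

/* Field scanner in which every delimiter is significant.  Like the
   default TS scanner it is destructive: it writes a null over the
   delimiter that ends each token.  Note that this sketch returns a
   single empty token for an empty buffer. */
static char *demo_scan(void)
{
    char *start, *p;

    if (demo_pos == NULL)               /* exhausted: keep returning NULL */
        return NULL;

    start = p = demo_pos;               /* 1. get the current scan position */

    /* 2. scan to the end of the token; test for the terminator first,
          because strchr() would also match the null byte itself */
    while (*p != '\0' && strchr(demo_delims, *p) == NULL)
        p++;

    if (*p != '\0') {                   /* stopped on a delimiter */
        *p = '\0';                      /* terminate the token in place */
        demo_pos = p + 1;               /* 3. update the scan position */
    } else {
        demo_pos = NULL;                /* end of buffer: last token */
    }
    return start;                       /* 4. return the token */
}
```

Calling demo_init() on a writable copy of ":a:b::c:" and then
calling demo_scan() repeatedly yields the six tokens shown in the
section on the default scanner for the case where all delimiters
are significant, followed by NULL on every later call.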

     void TSGetScanPos (p)
     char **p;

Puts the current position within the current scan buffer into the
argument, which should be passed as the address of a character
pointer. This is useful when you want to scan only enough of the
buffer to partially classify it, then use the rest in some other
way.

     void TSSetScanPos (p)
     char *p;

Sets the current scan position to p.

     int TSIsScanDelim (c)
     char c;

Returns non-zero if c is a member of the current delimiter
character set, zero otherwise.

     int TSIsScanQuote (c)
     char c;

Returns non-zero if c is a member of the current quote character
set, zero otherwise.

     int TSIsScanEscape (c)
     char c;

Returns non-zero if c is a member of the current escape character
set, zero otherwise.

     int TSIsScanEos (c)
     char c;

Returns non-zero if c is an end-of-string character, zero
otherwise.

     int TSTestScanFlags (flags)
     int flags;

Returns non-zero if all bits in flags are set for the current
scanner, zero otherwise.

4.1. Overriding Scanning Routines

It is possible to switch back and forth between scan procedures
on the fly, even in the middle of scanning a string. The general
procedure is to use TSGetScanner() to get the current scanner
information, then TSSetScanner() to install a new scanner and to
reinstall the old one when done with the new. If you switch
between more than two scanners, another method may be necessary.

It is possible to modify the default scanner without replacing
it. For instance, you could change the default delimiter set but
leave everything else the same, as follows:

     TSScanner scanStruct;

     TSGetScanner (&scanStruct);
     scanStruct.scanDelim = " :;?,!";
     TSSetScanner (&scanStruct);

5. Miscellaneous

A scanner can be nondestructive with respect to the line being
scanned by using a scan routine that copies characters out of the
scanned line into a second buffer and returns a pointer to the
second buffer.
The second buffer must be large enough to hold the largest
possible token, of course. If the second buffer is a fixed area,
the host application must be careful not to call TSScan() again
until it is done with the current token, or else make a copy of
it first. If the second buffer is dynamically allocated, the
application must be ready to do storage management of the tokens
returned.

Some scanners might not need delimiter, quote, escape, or EOS
characters at all, particularly if token boundaries are context
sensitive.

6. Distribution and Update Availability

The TS distribution may be freely circulated and is available for
anonymous FTP access in the /pub/TS directory on host
ftp.primate.wisc.edu. Updates appear there as they become
available.

The WRPRC2 imake configuration file distribution is available on
ftp.primate.wisc.edu as well, in /pub/imake-stuff.

If you do not have FTP access, send requests to
software@primate.wisc.edu. Bug reports, questions, suggestions,
and comments may be sent to this address as well.
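As an illustration of the nondestructive approach mentioned under
Miscellaneous, the following is a minimal sketch of a scan
routine that copies each token into a fixed second buffer instead
of writing nulls into the scanned line. It is written without the
TS library so that it stands alone; nd_pos, nd_delims, nd_token,
nd_init(), nd_scan(), and the DEMO_TOKEN_MAX limit are all
hypothetical names and values chosen for this example. Delimiters
are concatenated, and quote and escape handling is omitted for
brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TOKEN_MAX 64               /* assumed limit on token length */

/* Hypothetical stand-ins, NOT part of TS. */
static const char *nd_pos = NULL;       /* current position in the line */
static const char *nd_delims = " \t";   /* delimiter set */
static char nd_token[DEMO_TOKEN_MAX];   /* the fixed second buffer */

/* Start scanning a line; the line itself is never modified. */
static void nd_init(const char *line)
{
    assert(line != NULL);
    nd_pos = line;
}

/* Return the next token, copied into nd_token, or NULL when there
   are no more.  Each call overwrites the previous token. */
static char *nd_scan(void)
{
    size_t n = 0;

    if (nd_pos == NULL)                 /* exhausted: keep returning NULL */
        return NULL;

    /* Skip a run of delimiters (delimiters are concatenated). */
    while (*nd_pos != '\0' && strchr(nd_delims, *nd_pos) != NULL)
        nd_pos++;

    if (*nd_pos == '\0') {              /* nothing left in the line */
        nd_pos = NULL;
        return NULL;
    }

    /* Copy token characters; anything beyond the buffer capacity is
       silently dropped, so the buffer must be sized for the input. */
    while (*nd_pos != '\0' && strchr(nd_delims, *nd_pos) == NULL) {
        if (n < DEMO_TOKEN_MAX - 1)
            nd_token[n++] = *nd_pos;
        nd_pos++;
    }
    nd_token[n] = '\0';
    return nd_token;
}
```

Because nd_token is a fixed area, a caller that needs an earlier
token after calling nd_scan() again has to copy it first, which
is the constraint noted for fixed second buffers in the
Miscellaneous section.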