TS -- A Simple Token Scanning Library
Release 1.06
Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 18 October 1993
Applications often need to pull strings apart into individual
tokens. This document describes TS, a library consisting of
an unsophisticated set of routines providing simple token
scanning operations. String tokenizing can often be done
satisfactorily using strtok() or an equivalent function from
the C library. When such routines are insufficient, the
routines described here may be useful. They offer, for
example, quote and escape character parsing, and
configurability of underlying scanning properties, within the
confines of a fixed interface.
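To make the limitation concrete, the following sketch splits a
string with the standard strtok() routine. It is not part of TS;
the helper name SplitWithStrtok is made up for this illustration.
Because strtok() knows nothing about quotes or escapes, a quoted
phrase is broken at the embedded space and the quote characters
stay in the tokens.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustration only (not a TS routine): split buf in place on the
 * characters in delims using strtok(), storing up to max token
 * pointers in tokens[].  Returns the number of tokens stored.
 */
int
SplitWithStrtok (char *buf, const char *delims, char *tokens[], int max)
{
    int n = 0;
    char *p;

    assert (buf != NULL && delims != NULL && tokens != NULL);
    for (p = strtok (buf, delims); p != NULL && n < max;
         p = strtok (NULL, delims))
        tokens[n++] = p;
    return (n);
}
```

Given the input "This is" a line (with the double quotes present
in the string), this helper yields four tokens, the first two
being "This and is", rather than the three tokens a quote-aware
scanner would produce.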
TS provides a simple built-in scanner, which may be replaced
by alternate routines as desired. Applications may switch
back and forth between scanners on the fly.
1. Installation
This release of TS is configured using imake and the WRPRC2
configuration files, so you also need to obtain the WRPRC2
configuration distribution if you want to build it the usual
way. (If you want to avoid imake, the Makefile is simple
enough that you should be able to tweak it by hand.)
There is one library to be built, libtokenscan.a. That
library should be installed in a system library directory.
The header file tokenscan.h should be installed in a system
header file directory.
2. Example
The canonical method of tokenizing a string with TS is as
follows:
Revision date: 18 October 1993 Printed: 19 February 192002
    char buf[size], *p;

    /* ...initialize contents of buf here... */
    TSScanInit (buf);                 /* initialize scanner */
    while ((p = TSScan ()) != (char *) NULL)
    {
        /* ...do something with the token pointed to by p... */
    }
The scanner is initialized by passing the string to be
scanned to TSScanInit(), and TSScan() is called to get
pointers to successive tokens. TSScan() returns NULL when
there are no more tokens.
3. Behavior of the Default Scanner
The default scanner is destructive in that it modifies the
string scanned (it writes nulls at the end of each token
found), so make a copy of the scanned string if you need to
maintain an intact version. The scanner is controlled by
delimiter, quote, escape, and end-of-string (EOS) characters.
The defaults for each of these are given below.
    delimiter    space tab
    quote        " '
    escape       \
    EOS          null linefeed carriage-return
In the simplest case, tokens are sequences of characters
between delimiters. Since the default delimiters are the
whitespace characters space and tab, tokens are sequences of
non-whitespace characters.
This is a line -> <This> <is> <a> <line>
Quotes may be used to include whitespace within a token.
Quotes must match; hence one quote character may be used to
quote another kind of quote character, if there is more than
one.
"This is" a line -> <This is> <a> <line>
This" "is a line -> <This is> <a> <line>
"'" '"' -> <'> <">
"'"'"' -> <'">
The escape character turns off any special meaning of the
next character, including another escape character.
What\'s up -> <What's> <up>
\\ is the escape -> <\> <is> <the> <escape>
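The delimiter, quote, and escape rules above can be modeled in a
few lines of C. The sketch below is not the TS source. It is a
minimal in-place scanner written from the description in this
section, using the default character sets. The name ModelScan and
its helpers are made up for the illustration. Two points are
assumptions rather than documented behavior: escapes are honored
inside quotes, and an EOS character ends the scan even inside an
open quote.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default character sets as listed in this section. */
static const char *delims = " \t";
static const char *quotes = "\"'";
static const char *escapes = "\\";
static const char *eos = "\n\r";

/* Non-zero if c is a non-null member of set. */
static int
InSet (const char *set, char c)
{
    return (c != '\0' && strchr (set, c) != NULL);
}

/* A null character always ends the scan. */
static int
IsEos (char c)
{
    return (c == '\0' || InSet (eos, c));
}

/*
 * Model of the default scan rules.  Scans the next token starting
 * at *pos, strips quotes and escapes in place, null-terminates the
 * token, advances *pos, and returns the token (NULL when no tokens
 * remain).
 */
char *
ModelScan (char **pos)
{
    char *src, *dst, *start;
    char quote = '\0';          /* open quote character, if any */
    int endIsDelim;

    assert (pos != NULL && *pos != NULL);
    src = *pos;
    while (InSet (delims, *src))        /* skip leading delimiters */
        ++src;
    if (IsEos (*src))
    {
        *pos = src;
        return (NULL);
    }
    start = dst = src;
    while (!IsEos (*src))
    {
        if (InSet (escapes, *src))
        {
            ++src;
            if (*src == '\0')           /* dangling escape: drop it */
                break;
            *dst++ = *src++;            /* next character is literal */
        }
        else if (quote != '\0')
        {
            if (*src == quote)
                quote = '\0';           /* matching quote closes */
            else
                *dst++ = *src;          /* anything else is literal */
            ++src;
        }
        else if (InSet (quotes, *src))
            quote = *src++;             /* open a quote */
        else if (InSet (delims, *src))
            break;                      /* unquoted delimiter ends token */
        else
            *dst++ = *src++;
    }
    endIsDelim = !IsEos (*src);         /* test before overwriting */
    *dst = '\0';
    *pos = endIsDelim ? src + 1 : src;
    return (start);
}
```

Called repeatedly on the example strings in this section, the
model yields the tokens shown there. For instance, What\'s up
gives What's followed by up, and "'"'"' gives the single
two-character token '".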
The EOS characters tell the scanner when to quit scanning.
A null character always terminates the scan. In the default
case, linefeed and carriage return do as well.
You can replace the delimiter, quote, escape, or EOS charac-
ter sets. This changes the particular characters that
trigger the above behaviors, without changing the way the
default scan algorithm works. Or you can replace the scan
routine to make the scanner behave in entirely different
ways.
By default, multiple consecutive delimiter characters are
treated as a single delimiter. A flag may be set in the
scanner structure to suppress delimiter concatenation, so
that every delimiter character is significant. This is useful
for tokenizing strings in which empty fields are allowed: two
consecutive delimiters are considered to have an empty token
between them, and a delimiter appearing at the beginning or
end of a string signifies an empty token at the beginning or
end of the string. The difference in treatment of strings when
delimiters are concatenated versus when they are not is
illustrated below.
Suppose the delimiter is colon (:) and the string to be tok-
enized is:
:a:b::c:
When delimiters are concatenated, the string contains three
tokens:
:a:b::c: -> <a> <b> <c>
When all delimiters are significant, the string contains three
empty tokens in addition:
:a:b::c: -> <> <a> <b> <> <c> <>
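The two treatments can be compared with a small stand-alone
splitter. This is a sketch written from the description above,
not TS code. SplitFields and its noConcat argument are
hypothetical names, and quotes and escapes are deliberately
ignored to keep the difference in delimiter handling visible.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustration only: split buf in place on the characters in
 * delims, storing up to max token pointers in tokens[].  If
 * noConcat is non-zero every delimiter is significant; otherwise
 * runs of delimiters count as one.  Returns the number of tokens
 * stored.
 */
int
SplitFields (char *buf, const char *delims, int noConcat,
             char *tokens[], int max)
{
    int n = 0;
    char *p = buf;

    assert (buf != NULL && delims != NULL && tokens != NULL);
    if (noConcat)
    {
        /* k delimiters always separate k+1 fields, possibly empty */
        for (;;)
        {
            char *end = p + strcspn (p, delims);
            int last = (*end == '\0');

            if (n < max)
                tokens[n++] = p;
            *end = '\0';
            if (last)
                break;
            p = end + 1;
        }
    }
    else
    {
        for (;;)
        {
            p += strspn (p, delims);    /* skip a run of delimiters */
            if (*p == '\0')
                break;
            if (n < max)
                tokens[n++] = p;
            p += strcspn (p, delims);   /* find the end of the token */
            if (*p == '\0')
                break;
            *p++ = '\0';
        }
    }
    return (n);
}
```

With the string :a:b::c: and a colon delimiter, the helper
returns 3 tokens when noConcat is zero and 6 tokens (three of
them empty) when it is non-zero, matching the two cases shown
above.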
4. Programming Interface
Source files using TS routines should include tokenscan.h
and executables should be linked with -ltokenscan.
A scanner is described by a data structure:
    typedef struct TSScanner TSScanner;

    struct TSScanner
    {
        void (*scanInit) ();
        char *(*scanScan) ();
        char *scanDelim;
        char *scanQuote;
        char *scanEscape;
        char *scanEos;
        int scanFlags;
    };
Scanner structures may be obtained or installed with
TSGetScanner() and TSSetScanner().
For each string to be scanned, the application passes a
pointer to it to TSScanInit(), which takes care of scan ini-
tialization. If the application requires initialization to
be performed in addition to that done internally by TS, a
pointer to a routine that does so should be installed in the
scanInit field of the scanner data structure. It takes one
argument, a pointer to the string to be scanned. The
default scanInit is NULL, which does nothing.
scanDelim, scanQuote, scanEscape, and scanEos are pointers
to null-terminated strings consisting of the set of charac-
ters to be considered delimiter, quote, escape, and EOS
characters, respectively. The default values were described
previously.
scanScan points to the routine that does the actual scanning.
It is called by TSScan() and should be declared to take no
arguments and return a character pointer to the next token in
the current scan buffer. Normally, this routine does the
following: call TSGetScanPos() to get the current scan
position, scan the token, call TSSetScanPos() to update the
scan position, then return a pointer to the beginning of the
token. If there are no more tokens in the scan buffer, the
routine should return NULL, and should continue to do so until
TSScanInit() is called again.
scanFlags contains flags that modify the scanner's behavior.
For the default scanner, the default is zero. If the
tsNoConcatDelims flag is set, the scanner stops on every
delimiter rather than treating sequences of contiguous
delimiters as a single delimiter.
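The get-position, scan, set-position, return sequence described
for a scanScan routine might look like the sketch below. So that
it compiles without the library, it uses two hypothetical
stand-ins, DemoGetScanPos() and DemoSetScanPos(), in place of
TSGetScanPos() and TSSetScanPos(). A real replacement routine
would call the TS functions and be installed in the scanScan
field. CommaScan is likewise a made-up name. It splits on commas
only and treats runs of commas as one delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the scan position that TS keeps internally. */
static char *demoScanPos = NULL;

void
DemoSetScanPos (char *p)
{
    demoScanPos = p;
}

void
DemoGetScanPos (char **p)
{
    *p = demoScanPos;
}

/*
 * Example scan routine: takes no arguments and returns the next
 * comma-separated token, or NULL when none remain.
 */
char *
CommaScan (void)
{
    char *p, *start;

    DemoGetScanPos (&p);                /* 1. get current position */
    assert (p != NULL);                 /* position must be set first */
    p += strspn (p, ",");               /* 2. scan the token */
    if (*p == '\0')
    {
        DemoSetScanPos (p);             /* stay at the end, so later */
        return (NULL);                  /* calls also return NULL */
    }
    start = p;
    p += strcspn (p, ",");
    if (*p != '\0')
        *p++ = '\0';                    /* terminate token in place */
    DemoSetScanPos (p);                 /* 3. update the position */
    return (start);                     /* 4. return the token */
}
```

Leaving the position on the terminating null once the buffer is
exhausted is what makes the routine keep returning NULL until
the position is reset for a new string.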
The public routines in the TS library are described below.
void TSScanInit (p)
char *p;
Initializes the scanning routines to make the character
string pointed to by p the current scan buffer.
char *TSScan ()
Returns a pointer to the next token in the current scan
buffer, NULL if there are no more. The token is terminated
by a null byte. Scan behavior may be modified by substituting
alternate scan routines. Once TSScan() returns NULL, it
continues to do so until the scanner is reinitialized with
another call to TSScanInit().
void TSGetScanner (p)
TSScanner *p;
Gets the current scanner information (initialization and
scan procedures; delimiter, quote, escape, and EOS character
sets; and scanner flags) into the structure pointed to by p.
void TSSetScanner (p)
TSScanner *p;
Installs a scanner. If p itself is NULL, all default
scanner values are reinstalled. Otherwise, any pointer
field in p with a NULL value causes the corresponding value
from the default scanner to be reinstalled, and if
p->scanFlags is zero, the scanner flags are set to the
default (also zero).
void TSGetScanPos (p)
char **p;
Puts the current position within the current scan buffer
into the argument, which should be passed as the address of
a character pointer. This is useful when you want to scan
only enough of the buffer to partially classify it, then use
the rest in some other way.
void TSSetScanPos (p)
char *p;
Sets the current scan position to p.
int TSIsScanDelim (c)
char c;
Returns non-zero if c is a member of the current delimiter
character set, zero otherwise.
int TSIsScanQuote (c)
char c;
Returns non-zero if c is a member of the current quote char-
acter set, zero otherwise.
int TSIsScanEscape (c)
char c;
Returns non-zero if c is a member of the current escape
character set, zero otherwise.
int TSIsScanEos (c)
char c;
Returns non-zero if c is an end-of-string character, zero
otherwise.
int TSTestScanFlags (flags)
int flags;
Returns non-zero if all bits in flags are set for the
current scanner, zero otherwise.
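The character-class and flag predicates above are simple enough
to model directly. The sketch below shows one plausible way such
tests can be written. It is not taken from the TS source, and
all names in it (DemoIsMember, DemoIsEos, DemoTestFlags, and the
two flag values) are hypothetical. The real tsNoConcatDelims
value is defined in tokenscan.h and is not stated in this
document.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag values, for illustration only. */
#define demoNoConcatDelims 0x01
#define demoOtherFlag      0x02

/*
 * Non-zero if c is in the null-terminated set.  strchr() treats
 * the terminating null as part of the string, so the null
 * character is excluded explicitly.
 */
int
DemoIsMember (const char *set, char c)
{
    assert (set != NULL);
    return (c != '\0' && strchr (set, c) != NULL);
}

/* A null character always counts as end-of-string. */
int
DemoIsEos (const char *eosSet, char c)
{
    return (c == '\0' || DemoIsMember (eosSet, c));
}

/* Non-zero only if every bit in flags is set in current. */
int
DemoTestFlags (int current, int flags)
{
    return ((current & flags) == flags);
}
```

Under this reading, testing for demoNoConcatDelims and
demoOtherFlag together fails when only one of them is set, which
is what "all bits in flags are set" requires.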
4.1. Overriding Scanning Routines
It is possible to switch back and forth between scan
procedures on the fly, even in the middle of scanning a string.
The general procedure is to use TSGetScanner() to get the
current scanner information, and TSSetScanner() to install a
new one and reinstall the old one when done with the new.
If you switch between more than two scanners, another method
may be necessary.
It is possible to modify the default scanner without
replacing it. For instance, you could change the default
delimiter set but leave everything else the same, as follows:
    TSScanner scanStruct;

    TSGetScanner (&scanStruct);
    scanStruct.scanDelim = " :;?,!";
    TSSetScanner (&scanStruct);
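The get, modify, set pattern relies on the rule that NULL
pointer fields passed to TSSetScanner() fall back to the
defaults. The following self-contained model illustrates that
rule. It is a sketch of the documented behavior, not the library
implementation. DemoScanner, DemoGetScanner(), DemoSetScanner(),
and DemoDefaultScan() are hypothetical stand-ins, and the string
fields are declared const here for safety with string literals.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mirroring the TSScanner layout shown earlier. */
typedef struct DemoScanner DemoScanner;
struct DemoScanner
{
    void (*scanInit) (char *);
    char *(*scanScan) (void);
    const char *scanDelim;
    const char *scanQuote;
    const char *scanEscape;
    const char *scanEos;
    int scanFlags;
};

/* Placeholder for the built-in scan routine. */
static char *
DemoDefaultScan (void)
{
    return (NULL);
}

#define DEMO_DEFAULTS \
    { NULL, DemoDefaultScan, " \t", "\"'", "\\", "\n\r", 0 }

static const DemoScanner defScanner = DEMO_DEFAULTS;
static DemoScanner curScanner = DEMO_DEFAULTS;

/* Copy the current scanner settings into *p. */
void
DemoGetScanner (DemoScanner *p)
{
    assert (p != NULL);
    *p = curScanner;
}

/* Install *p; NULL (or a NULL field) selects the default value. */
void
DemoSetScanner (const DemoScanner *p)
{
    if (p == NULL)
    {
        curScanner = defScanner;
        return;
    }
    curScanner.scanInit = p->scanInit;  /* default is NULL anyway */
    curScanner.scanScan =
        p->scanScan ? p->scanScan : defScanner.scanScan;
    curScanner.scanDelim =
        p->scanDelim ? p->scanDelim : defScanner.scanDelim;
    curScanner.scanQuote =
        p->scanQuote ? p->scanQuote : defScanner.scanQuote;
    curScanner.scanEscape =
        p->scanEscape ? p->scanEscape : defScanner.scanEscape;
    curScanner.scanEos =
        p->scanEos ? p->scanEos : defScanner.scanEos;
    curScanner.scanFlags = p->scanFlags; /* zero is the default */
}
```

In this model, setting scanDelim to NULL and reinstalling the
structure restores the space and tab delimiters while leaving
the other fields as they were, and passing NULL itself resets
everything.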
5. Miscellaneous
A scanner can be nondestructive with respect to the line
being scanned by using a scan routine that copies characters
out of the scanned line into a second buffer and returning a
pointer to the second buffer. The second buffer must be
large enough to hold the largest possible token, of course.
If the second buffer is a fixed area, the host application
must be careful not to call TSScan() again until it is done
with the current token, or else make a copy of it first. If
the second buffer is dynamically allocated, the application
must be ready to do storage management of the tokens
returned.
Some scanners might not need delimiter, quote, escape, or EOS
characters at all, particularly if token boundaries are
context sensitive.
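As an illustration of the nondestructive approach, the sketch
below copies each token out of the scanned line into a
caller-supplied second buffer and never writes to the line
itself. It is a stand-alone example rather than a TS scan
routine. The name CopyScan and its explicit position argument
are assumptions made to keep it self-contained, and it splits on
space and tab only. Tokens longer than the buffer are truncated,
which is one way of handling the buffer-size concern mentioned
above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustration only: copy the next whitespace-delimited token
 * from *pos into out (at most outSize - 1 characters), advance
 * *pos past the token, and return out.  Returns NULL when no
 * tokens remain.  The scanned line is never modified.
 */
char *
CopyScan (const char **pos, char *out, size_t outSize)
{
    const char *p;
    size_t len;

    assert (pos != NULL && *pos != NULL);
    assert (out != NULL && outSize > 0);
    p = *pos + strspn (*pos, " \t");    /* skip leading delimiters */
    if (*p == '\0')
    {
        *pos = p;
        return (NULL);
    }
    len = strcspn (p, " \t");           /* length of the token */
    *pos = p + len;
    if (len >= outSize)
        len = outSize - 1;              /* truncate to fit buffer */
    memcpy (out, p, len);
    out[len] = '\0';
    return (out);
}
```

Because every call reuses the same output buffer, the caller
must finish with one token (or copy it) before asking for the
next, exactly as cautioned for fixed second buffers.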
6. Distribution and Update Availability
The TS distribution may be freely circulated and is available
for anonymous FTP access in the /pub/TS directory on host
ftp.primate.wisc.edu. Updates appear there as they become
available.
The WRPRC2 imake configuration file distribution is available
on ftp.primate.wisc.edu as well, in /pub/imake-stuff.
If you do not have FTP access, send requests to
software@primate.wisc.edu. Bug reports, questions, sugges-
tions and comments may be sent to this address as well.