packages icon
Dict Library README
~~~~~~~~~~~~~~~~~~~

$Id$

Introduction
------------

Libdict is a compact, ANSI C library which provides access to a set of generic
and flexible ``dictionary'' data structures, including a hash table, a
red-black tree, a splay tree, a weight-balanced tree, a path-reduction tree,
and a height-balanced (or AVL) tree. These structures can be used to
efficiently store and retrieve key-data pairs. Each of these structures can be
accessed using its direct API, or it can be accessed using a dictionary
abstraction. All data structures have the following operations defined:

- Insert: insert a key-data pair into the dictionary.
- Probe: search the data structure for the given key. If found, return it's
  associated data, otherwise, insert the given key-data pair.
- Search: search the data structure for the given key, and return the
  associated data, if found.
- Const Search: same as search but accepts a const pointer to the data
  structure instead and returns a const pointer to the associated data for the
  key.
- Remove: If found, remove the key-data pair from the data structure.
- Count: return the number of key-data pairs stored in the data structure.
- Empty: remove all key-data pairs from the data structure.
- Walk: pass each key-data pair to a callback function, stopping when the
  callback functions returns 0.
- Destroy: destroy the data structure and release its storaage.
- Make Iterator: allocate and return an iterator, which allows one to iterate
  over each key-data pair in the structure efficiently. Iterators are both
  random and sequential (bi-directional).

In addition, various structures can provide statistical information. For
example, each tree structure can provide the minimum and maximum path lengths
from the root to a leaf, and the hash table can provide the number of buckets
used, etc.  These statistical operations are not provided in the generic
dictionary interface because they are not applicable to each data structure.

The great advantage to using the generic interface is, by changing only the
line of code which creates the dictionary, the underlying data structure and
therefore the performance characteristics of the tree can be changed.

Using the Library
-----------------
The library allows the client to store keys and ``satellite'' data associated
with each key in the tree. The client must provide the library with a
comparison function which will then be used to determine the relationship
between two keys. This is passed to the library as a ``dict_cmp_func'', which is
defined as:

typedef int		(*dict_cmp_func)(const void *key1, const void *key2);

This function should return 0 if the two keys are determined to be equal, less
then zero if key1 < key2, and greater than zero if key1 > key2. This function
must be deterministic and must always return the same result for the same keys.

In addition, the client may optionally provide functions which can be used for
automatically freeing the key and / or the data which is stored in the tree.
These are passed to the library as ``dict_del_func'', which is defined as:

typedef void	(*dict_del_func)(void *);

For example, if a tree was being used to store blocks of memory, indexed by
allocated strings, dict_cmp_func could be the ``strcmp'', and the routines
passed in for freeing the key and data could both be ``free''. The deallocation
function is not required to do anything, and it should not be used as a
``notification'' in the client application that the data item is no longer in
the tree. This is because these routines are invoked when ``overwriting'' the
data of an item already in the tree to free the old key and data. If the
deallocation mechanism is abused and is used as a notification that the item is
no longer in the key, the application will think the item is no longer in the
tree when in fact it is still present but only had its data changed.

Red-Black Trees
---------------
A red-black tree is a balanced tree structure. This structure guarantees that
insertion, deletion, and search for an item has a time complexity O(lg N). A
red-black tree with N data items has a height of at most 2lg(N + 1). More
specifically, the insertion and deletion algorithms guarantee that no leaf has
a distance from the root that is more than twice the distance of another leaf.
Thus the tree remains roughly balanced and manipulation of the tree has
logarithmic time complexity.  This is unlike ``plain'' unbalanced search trees,
where performance may, and often does, degenerate into linear time.

Although the maximum number of comparisons for an insert, search, or delete for
a given item in a red-black tree is bounded by 2lg N + 1, empirical studies on
red-black trees which are filled with random data show that the average number
of comparisons for these operations is about 1.002lg N.

All data structures provide identical interfaces (apart from function name
prefixes) and identical semantics for all operations defined. Thus, by
presenting the interface for a particular structure, the reader will also learn
how to use the rest. In this section, I will explain the interface to the
red-black tree. Every data structure is opaque to the user; only a pointer to
it may be stored by the client program. The function which creates a new
dictionary structure is the abbreviated name of the structure, followed by a
``_new''. For example, to allocate and return a new red-black tree, the
function to use would be:

rb_tree *rb_tree_new(dict_cmp_func key_cmp, dict_del_func key_del,
					 dict_del_func dat_del);

The ``key_cmp'' function is REQUIRED and may not be NULL. Either ``key_del'' or
``dat_del'' may be NULL, in which case they will not be called :-).

Once the tree is created, it may be used until it is no longer needed, in which
case it must be destroyed with

void rb_tree_destroy(rb_tree *tree, int del);

This function will release all storage used by the tree and, if del is set, it
will invoke the user-defined deallocation functions to free the key and / or
the data, if applicable.

To insert key and data pairs into the tree, use:

int rb_tree_insert(rb_tree *tree, void *key, void *dat, int overwrite);

This will attempt to insert the given key and data pair into ``tree''. If
the item is already in the tree and ``overwrite'' is non-zero, the new data
value will overwrite the old one, and the function will return a positive
value. If the node is not present in the tree, it is inserted and the function
will return 0. If, however, the function cannot allocate memory to perform the
insertion, it will return -1.

This is a more general function which also doubles as a tree lookup function:

int rb_tree_probe(rb_tree *tree, void *key, void **dat);

This function will search the given tree for a match for ``key''; if it is
found, the data associated with the given key is stored at the location
``dat'', and the function will return 0. Otherwise, the given key and data pair
is inserted into the tree and the function will return 1.  ``dat''. However, if
it the function was unable to allocate memory it will return -1. Probe
operations provides an interface similar to, for example, the programming
language awk's associative arrays, in which the first reference to a given key
will bring that key and data pair into the array, and subsequent accesses will
return the item originally placed into the array.

Searching the tree is done using:

void *rb_tree_search(rb_tree *tree, const void *key);
const void *rb_tree_csearch(const rb_tree *tree, const void *key);

Both will return the data associated with the given key if it present in the
tree, or NULL. ``rb_tree_csearch'' will return a const pointer to the data. It
is meant for sections of code which are not allowed to modify the tree, and
therefore are not allowed to modify any data held in the tree.

Removing an item from the tree is done using:

int rb_tree_remove(rb_tree *tree, const void *key, int del);

This function will return 0 if the item was found and successfully removed. If
the item was not found it will return -1. Any user-defined deallocators will
only be invoked if ``del'' is non-zero.

To remove all key-data pairs from a tree, use:

void rb_tree_empty(rb_tree *tree, int del);

The tree will be completely emptied. Any user-defined deallocators will only be
invoked if ``del'' is non-zero.

It is possible to have the tree call a given function for each item in the
tree. The callback is defined as:

int	(*dict_vis_func)(const void *key, void *dat);

This function will be called for each key-data pair in the tree. The tree will
call this function for each item in order, until the function returns zero.
This can be done with:

void rb_tree_walk(rb_tree *tree, dict_vis_func visit);

Finally, to find out how many key-data pairs are held in the tree, use:

unsigned rb_tree_count(rb_tree *tree);

This sums up the all the operations which can be done on the dict data
structures, using the red-black tree as an example. There are, however, more
operations which can be performed on trees.

unsigned rb_tree_height(const rb_tree *tree);
unsigned rb_tree_mheight(const rb_tree *tree);

``rb_tree_height'' will return the length of the longest path from the root to
a leaf. ``rb_tree_mheight'' will return the length of the shortest path from
the root to a leaf.

const void *rb_tree_min(const rb_tree *tree);
const void *rb_tree_max(const rb_tree *tree);

These functions will return the minimum and maximum keys in the tree,
respectively. Be careful, as these may will NULL if the tree is empty.

Suppose you decide you would like to use the red-black tree structure provided
by this library in your application. It would be perfectly OK to use the
``rb_tree_*'' functions described here. However, suppose at a later time you
decide you'd rather use the splay tree. This would require you to go through
your source code and change every instance of ``rb_tree'' to ``sp_tree.'' There
is a Better Way (tm) to access this library which relieves the programmer from
having to know about the underlying data in use.

The generic interface is defined in the dict.h file. It includes all operations
such as dict_insert(), dict_probe(), dict_remove(), dict_empty(), etc. Each of
these routines requires a pointer to a ``dict'' structure. However,
there is no function defined which creates a new dict structure. The function
which will create the generic dict structure is defined in the header file for
that particular structure. For example, earlier we used the rb_tree_new()
function to create a new red-black tree. Instead, we now use the rb_dict_new()
function, which will create a red-black tree, but instead of returning the tree
itself, it will return a dict structure which encapsulates all of the
operations on the structure. We can now use the dict structure returned by
rb_dict_new() with all of the functions such as dict_insert(), dict_remove(),
etc, from dict.h. Thus, the programmer may change the underlying data structure
used, and hence the performance characteristics of the application, by simply
changing a single line of code - the line which creates the dict structure.
Throughout the rest of the application, the programmer no longer has to worry
about which actual structure is being used. All these advantages are available
for only the most infinitesmal performance hit (the overhead of one extra
function call).

As mentioned earlier, iterators are also available which allow both random and
sequential, bi-directional access. A routine which creates a new iterator would
be something like:

rb_itor *rb_itor_new(rb_tree *tree);

All iterators have the following operations defined:

- Valid: is the iterator positioned at a meaningful location?
- Invalidate: invalidate the iterator, or position it at a sentinel location
  which signifies that it cannot be ``dereferenced.''
- Next: move the iterator to the next location. If it was invalid, move it to
  the first location in its associated dictionary.
- Next N: move the iterator forward N times. If it was invalid, first move it
  the first location in its associated dictionary, then move it forward from
  there N - 1 times.
- Prev: move the iterator to the previous location. If it was invalid, move it
  to the last location in its associated dictionary.
- Prev N: move the iterator backwards N times. If it was invalid, first move it
  the last location in its associated dictionary, then move it backwards from
  there N - 1 times.
- First: move the iterator to the first location in its associated dictionary.
- Last: move the iterator to the last location in its associated dictionary.
- Key: return the key stored at the current location of the iterator.
- Data: return the data stored at the current location of the iterator.
- Set Data: change the data stored the current location of the iterator.

The following code fragment illustrates the use of the iterator:

	rb_tree *tree;
	rb_itor *itor;

	tree = rb_tree_new((dict_cmp_func), free, free);

	/* Insert some items, keyed by character string, into the tree here... */

	/* Print every character string, in order, that is in the tree. */
	itor = rb_itor_new(tree);
	for (rb_itor_first(itor); rb_itor_valid(itor); rb_itor_next(itor))
		printf("%s\n", rb_itor_key(itor));
	rb_itor_destroy(itor);

Farooq Mela <fmela@sm.socccd.cc.ca.us>
Mar 27, 2001