packages icon
README - grappe 3.0:
********************

Introduction
************
grappe is a pattern matching program. 
It looks, in a text, for a set of patterns containing 'don't care
symbols' (wildcards) of unbounded or bounded length. The size of
patterns, as well as their length, is unlimited. grappe is expected to 
be particularly efficient for searching for a big number of patterns,
each containing don't cares. However, grappe is competitive with
respect to other programs (egrep, agrep) on small pattern sets too.
Patterns may occur with substitution errors. The number of errors and
their occurrences in a pattern have to be specified by the user (this
feature is restricted in grappe 3.0 by one error per keyword, see
below). 
A number of other useful options are present. As an example, the user
can specify letters occurring at a given position in a pattern by
using multiple choice, or negation.
A special option allows to compile a grappe version for DNA
treatment. 

How to use grappe
*****************
The simplest usage of grappe is to call 
  grappe <patterns> <textfile>

<patterns> is a set of patterns specified according to the following
syntax:
  patterns = pattern  OR pattern|patterns
  pattern  = fragment OR fragment#pattern
  fragment = keyword  OR keyword(p,q)fragment
  keyword  = word     OR keyword[set-of-letters] OR keyword[^set-of-letters] OR
             keyword word OR ?keyword
  set-of-letters = letter   OR letter set-of-letters
  word     = letter   OR letter word

Given a set of patterns, separated by '|' sign, grappe is looking if
one of them (and not each of them) occurs in the text.

Don't care symbol '#', also called unlimited (or arbitrary) wildcard,
matches arbitrary number of caracters. Notation '(p,q)'
(bounded-length wildcard) is used to bound the number of matched
caracters between p and q. 

Letters in square brackets '[]' match anyone of them. If they are
preceeded by '^' sign, this matches any other letter, i.e. anyone
which is not listed.

Sign '?' before a keyword means that the keyword may occur with one
substitution error, i.e. that one letter may get replaced with respect 
to an "exact" match of the keyword.

Note that if you wish to specify a pattern containing 
#,(,),[,],|,\,^,? then you have to protect these symbols with '\' (
'\#', '\(', '\)', '\[', '\]', '\|', '\\', '\^', '\?' ).

Examples
********
'hello' matches any text that containing an occurence of 'hello'.
'hello#world' matches any text containing an occurence of 'hello'
   followed, within any distance, by an occurence of 'world'. 
'hello(1,2)world' matches any text containing an occurence of 'hello'
   followed by an occurence of 'world' such that the two occurrences are
   separated by one or two letters.
'h[ea]llo' matches any text containing an occurence of 'hello' or an
   occurence of 'hallo'.
'h[^a]llo' matches any text containing an occurence of 'h' directly
   followed by any letter except 'a' and directly followed by an
   occurence of 'llo'. For example, 'hello' and 'hjllo' are matched. 
'?hello' matches any text containing an occurence of
   'hello' or an occurence of any subwords at a distance 1 from
   'hello'. For example, 'hello', 'hallo', 'yellow' are matched, but
   'gallo' is not matched. 

Warnings
********
grappe 3.0 can handle at most one error per keyword. Besides, handling 
errors has an overhead in memory use. For some big examples, this
memory blow-up may be prohibitive. In general, we recommend to use
this option with the DNA version of grappe and with small keywords (of
order of 5 letters) unless you have a lot of memory on your computer ...

Installation
************
Installing grappe is easy.
If you got the grappe archive file which ends with gz (like
grappe-3.0.tar.gz) you have first to decompress it with the command 
   gunzip grappe-3.0.tar.gz
Then you have to extract the archive with the command 
   tar xvf grappe-3.0.tar
Now you go to the grappe directory by typing 
   cd grappe-3.0
and then run two commands:
   ./configure
   make
You are done.

If you want to have a version of grappe tuned to working with DNA
files, you type 
   ./configure --enable-DNA
instead of just ./configure. Actually, "configure" can be invoked with 
other options, to know them type 
   ./configure --help

Moreover, using "configure" allows to simply create different versions 
of grappe (for different architectures, for handling DNA files, ...)
without duplicating source files. For exemple, you might want to
create two versions of grappe -- generic and DNA. You may then type
the following commands:
   mkdir generic; cd generic
   ../configure
   make
   cd ..; mkdir dna; cd dna;
   ../configure --enable-DNA
   make

Options
*******
The usage of grappe is the following:
grappe [-r] [-l] [-c] [-d <eor_chars>] ([-e] <patterns> | -f <pattern-file>) [<textfiles>]

   -r   : considers "end of record" characters as usual characters
   -l   : matches each line of the text independently
   -c   : counts the number successfully matched records (lines) of the text
   -d <separators> : redefines the "end of record" characters
                     (end-of-line by default)
   -i   : forces grappe to be case insensitive
   -e <patterns> : introduces the patterns (this option is useful when
                   the patterns begin with '-', othewise it is unnecessary). 
   -f <pattern-file> : specifies the file with patterns
You should specify exactly one set of patterns. 
If no text file is specified then grappe tries to match the
standard input. You can specify a set of text files by listing them or 
using shell
regular expressions. For example, 
   grappe 'void' *.c
looks for pattern 'void' in all c-files in the directory.

DNA version
***********
The DNA version of grappe understands the IUPAC nucleid acid codes in
the patterns:
	A = adenine
	C = cytosine
	G = guanine
	T = thymine
	U = uracil
	R = purine	(same as [GA])
	Y = pyrimidine	(same as [TC]) 
	K = keto	(same as [GT])
	M = amino	(same as [AC])
	S =         	(same as [GC])
	W =        	(same as [AT])
	B =        	(same as [GTC])
	D =        	(same as [GAT])
	H =         	(same as [ACT])
	V =         	(same as [GCA])
	N = any 	(same as [ACGT])
Note that the basic alphabet is {A,C,G,T,U}. 'T' and 'U' are
synonyms. The other letters {R,Y,K,M,S,W,B,D,H,V,N} are only aliases
to the corresponding multiple choice letter. 
Option '-i' does not work with the DNA version of grappe, which is
**always** case insensitive and skips spaces. 
Option '-r' is not supposed to be used with the DNA version.

Examples
********
	grappe 'hello' /usr/dict/*
	grappe -c -l 'hello' /usr/dict/words
	grappe 'CAATCT#TATA' yeastI.pure
	grappe -f patterns yeastI.pure

Bug reports
***********
Although we have made lots of tests in extremal conditions, bugs might 
still occur. If you happen to find one, please make a bug report to
kucherov@loria.fr and sbriais@ens-lyon.fr, with the subject "grappe
[Bug report]". Don't forget to indicate in your mail the command line
that provoked the bug (in particular, send us the patterns and if
possible the text), the type of bug (assertion failed [see note
below], segmentation fault, ...), the computer and the operating
system you use. 
Note that the 'assertion failed' messages of the form '.... != NULL'
(where .... is a name of a variable) are probably out of memory
problems and can't be fixed. Others are more interesting ... 

References
**********
grappe is based on the algorithm presented in the following paper:

G. Kucherov, M. Rusinowitch
Matching a Set of Strings with Variable Length Don't Cares
Theoretical Computer Science, 178 (1997), pp 129-154.

An extended abstract of this paper appeared in
6th Symposium on Combinatorial Pattern Matching
Helsinki, July 1995
Lecture Notes in Computer Science, vol. 937, (1995), pp 230-247.

You can find more informations at the following url:
	http://www.loria.fr/~kucherov/
or follow links from this one:
	http://www.ens-lyon.fr/~sbriais/

Contributors
************
grappe-1  G.Kucherov (with help of P.Zimmermann and M.Rusinowitch)
grappe-2  E.Queffelec
grappe-3  S.Briais