README - grappe 3.0:
********************
Introduction
************
grappe is a pattern matching program.
It looks, in a text, for a set of patterns containing 'don't care
symbols' (wildcards) of unbounded or bounded length. The size of
patterns, as well as their length, is unlimited. grappe is expected to
be particularly efficient for searching for a big number of patterns,
each containing don't cares. However, grappe is competitive with
respect to other programs (egrep, agrep) on small pattern sets too.
Patterns may occur with substitution errors. The number of errors and
their occurrences in a pattern have to be specified by the user (this
feature is restricted in grappe 3.0 by one error per keyword, see
below).
A number of other useful options are present. As an example, the user
can specify letters occurring at a given position in a pattern by
using multiple choice, or negation.
A special option allows to compile a grappe version for DNA
treatment.
How to use grappe
*****************
The simplest usage of grappe is to call
grappe <patterns> <textfile>
<patterns> is a set of patterns specified according to the following
syntax:
patterns = pattern OR pattern|patterns
pattern = fragment OR fragment#pattern
fragment = keyword OR keyword(p,q)fragment
keyword = word OR keyword[set-of-letters] OR keyword[^set-of-letters] OR
keyword word OR ?keyword
set-of-letters = letter OR letter set-of-letters
word = letter OR letter word
Given a set of patterns, separated by '|' sign, grappe is looking if
one of them (and not each of them) occurs in the text.
Don't care symbol '#', also called unlimited (or arbitrary) wildcard,
matches arbitrary number of caracters. Notation '(p,q)'
(bounded-length wildcard) is used to bound the number of matched
caracters between p and q.
Letters in square brackets '[]' match anyone of them. If they are
preceeded by '^' sign, this matches any other letter, i.e. anyone
which is not listed.
Sign '?' before a keyword means that the keyword may occur with one
substitution error, i.e. that one letter may get replaced with respect
to an "exact" match of the keyword.
Note that if you wish to specify a pattern containing
#,(,),[,],|,\,^,? then you have to protect these symbols with '\' (
'\#', '\(', '\)', '\[', '\]', '\|', '\\', '\^', '\?' ).
Examples
********
'hello' matches any text that containing an occurence of 'hello'.
'hello#world' matches any text containing an occurence of 'hello'
followed, within any distance, by an occurence of 'world'.
'hello(1,2)world' matches any text containing an occurence of 'hello'
followed by an occurence of 'world' such that the two occurrences are
separated by one or two letters.
'h[ea]llo' matches any text containing an occurence of 'hello' or an
occurence of 'hallo'.
'h[^a]llo' matches any text containing an occurence of 'h' directly
followed by any letter except 'a' and directly followed by an
occurence of 'llo'. For example, 'hello' and 'hjllo' are matched.
'?hello' matches any text containing an occurence of
'hello' or an occurence of any subwords at a distance 1 from
'hello'. For example, 'hello', 'hallo', 'yellow' are matched, but
'gallo' is not matched.
Warnings
********
grappe 3.0 can handle at most one error per keyword. Besides, handling
errors has an overhead in memory use. For some big examples, this
memory blow-up may be prohibitive. In general, we recommend to use
this option with the DNA version of grappe and with small keywords (of
order of 5 letters) unless you have a lot of memory on your computer ...
Installation
************
Installing grappe is easy.
If you got the grappe archive file which ends with gz (like
grappe-3.0.tar.gz) you have first to decompress it with the command
gunzip grappe-3.0.tar.gz
Then you have to extract the archive with the command
tar xvf grappe-3.0.tar
Now you go to the grappe directory by typing
cd grappe-3.0
and then run two commands:
./configure
make
You are done.
If you want to have a version of grappe tuned to working with DNA
files, you type
./configure --enable-DNA
instead of just ./configure. Actually, "configure" can be invoked with
other options, to know them type
./configure --help
Moreover, using "configure" allows to simply create different versions
of grappe (for different architectures, for handling DNA files, ...)
without duplicating source files. For exemple, you might want to
create two versions of grappe -- generic and DNA. You may then type
the following commands:
mkdir generic; cd generic
../configure
make
cd ..; mkdir dna; cd dna;
../configure --enable-DNA
make
Options
*******
The usage of grappe is the following:
grappe [-r] [-l] [-c] [-d <eor_chars>] ([-e] <patterns> | -f <pattern-file>) [<textfiles>]
-r : considers "end of record" characters as usual characters
-l : matches each line of the text independently
-c : counts the number successfully matched records (lines) of the text
-d <separators> : redefines the "end of record" characters
(end-of-line by default)
-i : forces grappe to be case insensitive
-e <patterns> : introduces the patterns (this option is useful when
the patterns begin with '-', othewise it is unnecessary).
-f <pattern-file> : specifies the file with patterns
You should specify exactly one set of patterns.
If no text file is specified then grappe tries to match the
standard input. You can specify a set of text files by listing them or
using shell
regular expressions. For example,
grappe 'void' *.c
looks for pattern 'void' in all c-files in the directory.
DNA version
***********
The DNA version of grappe understands the IUPAC nucleid acid codes in
the patterns:
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = purine (same as [GA])
Y = pyrimidine (same as [TC])
K = keto (same as [GT])
M = amino (same as [AC])
S = (same as [GC])
W = (same as [AT])
B = (same as [GTC])
D = (same as [GAT])
H = (same as [ACT])
V = (same as [GCA])
N = any (same as [ACGT])
Note that the basic alphabet is {A,C,G,T,U}. 'T' and 'U' are
synonyms. The other letters {R,Y,K,M,S,W,B,D,H,V,N} are only aliases
to the corresponding multiple choice letter.
Option '-i' does not work with the DNA version of grappe, which is
**always** case insensitive and skips spaces.
Option '-r' is not supposed to be used with the DNA version.
Examples
********
grappe 'hello' /usr/dict/*
grappe -c -l 'hello' /usr/dict/words
grappe 'CAATCT#TATA' yeastI.pure
grappe -f patterns yeastI.pure
Bug reports
***********
Although we have made lots of tests in extremal conditions, bugs might
still occur. If you happen to find one, please make a bug report to
kucherov@loria.fr and sbriais@ens-lyon.fr, with the subject "grappe
[Bug report]". Don't forget to indicate in your mail the command line
that provoked the bug (in particular, send us the patterns and if
possible the text), the type of bug (assertion failed [see note
below], segmentation fault, ...), the computer and the operating
system you use.
Note that the 'assertion failed' messages of the form '.... != NULL'
(where .... is a name of a variable) are probably out of memory
problems and can't be fixed. Others are more interesting ...
References
**********
grappe is based on the algorithm presented in the following paper:
G. Kucherov, M. Rusinowitch
Matching a Set of Strings with Variable Length Don't Cares
Theoretical Computer Science, 178 (1997), pp 129-154.
An extended abstract of this paper appeared in
6th Symposium on Combinatorial Pattern Matching
Helsinki, July 1995
Lecture Notes in Computer Science, vol. 937, (1995), pp 230-247.
You can find more informations at the following url:
http://www.loria.fr/~kucherov/
or follow links from this one:
http://www.ens-lyon.fr/~sbriais/
Contributors
************
grappe-1 G.Kucherov (with help of P.Zimmermann and M.Rusinowitch)
grappe-2 E.Queffelec
grappe-3 S.Briais