README - grappe 3.0: ******************** Introduction ************ grappe is a pattern matching program. It looks, in a text, for a set of patterns containing 'don't care symbols' (wildcards) of unbounded or bounded length. The size of patterns, as well as their length, is unlimited. grappe is expected to be particularly efficient for searching for a big number of patterns, each containing don't cares. However, grappe is competitive with respect to other programs (egrep, agrep) on small pattern sets too. Patterns may occur with substitution errors. The number of errors and their occurrences in a pattern have to be specified by the user (this feature is restricted in grappe 3.0 by one error per keyword, see below). A number of other useful options are present. As an example, the user can specify letters occurring at a given position in a pattern by using multiple choice, or negation. A special option allows to compile a grappe version for DNA treatment. How to use grappe ***************** The simplest usage of grappe is to call grappe <patterns> <textfile> <patterns> is a set of patterns specified according to the following syntax: patterns = pattern OR pattern|patterns pattern = fragment OR fragment#pattern fragment = keyword OR keyword(p,q)fragment keyword = word OR keyword[set-of-letters] OR keyword[^set-of-letters] OR keyword word OR ?keyword set-of-letters = letter OR letter set-of-letters word = letter OR letter word Given a set of patterns, separated by '|' sign, grappe is looking if one of them (and not each of them) occurs in the text. Don't care symbol '#', also called unlimited (or arbitrary) wildcard, matches arbitrary number of caracters. Notation '(p,q)' (bounded-length wildcard) is used to bound the number of matched caracters between p and q. Letters in square brackets '[]' match anyone of them. If they are preceeded by '^' sign, this matches any other letter, i.e. anyone which is not listed. Sign '?' before a keyword means that the keyword may occur with one substitution error, i.e. that one letter may get replaced with respect to an "exact" match of the keyword. Note that if you wish to specify a pattern containing #,(,),[,],|,\,^,? then you have to protect these symbols with '\' ( '\#', '\(', '\)', '\[', '\]', '\|', '\\', '\^', '\?' ). Examples ******** 'hello' matches any text that containing an occurence of 'hello'. 'hello#world' matches any text containing an occurence of 'hello' followed, within any distance, by an occurence of 'world'. 'hello(1,2)world' matches any text containing an occurence of 'hello' followed by an occurence of 'world' such that the two occurrences are separated by one or two letters. 'h[ea]llo' matches any text containing an occurence of 'hello' or an occurence of 'hallo'. 'h[^a]llo' matches any text containing an occurence of 'h' directly followed by any letter except 'a' and directly followed by an occurence of 'llo'. For example, 'hello' and 'hjllo' are matched. '?hello' matches any text containing an occurence of 'hello' or an occurence of any subwords at a distance 1 from 'hello'. For example, 'hello', 'hallo', 'yellow' are matched, but 'gallo' is not matched. Warnings ******** grappe 3.0 can handle at most one error per keyword. Besides, handling errors has an overhead in memory use. For some big examples, this memory blow-up may be prohibitive. In general, we recommend to use this option with the DNA version of grappe and with small keywords (of order of 5 letters) unless you have a lot of memory on your computer ... Installation ************ Installing grappe is easy. If you got the grappe archive file which ends with gz (like grappe-3.0.tar.gz) you have first to decompress it with the command gunzip grappe-3.0.tar.gz Then you have to extract the archive with the command tar xvf grappe-3.0.tar Now you go to the grappe directory by typing cd grappe-3.0 and then run two commands: ./configure make You are done. If you want to have a version of grappe tuned to working with DNA files, you type ./configure --enable-DNA instead of just ./configure. Actually, "configure" can be invoked with other options, to know them type ./configure --help Moreover, using "configure" allows to simply create different versions of grappe (for different architectures, for handling DNA files, ...) without duplicating source files. For exemple, you might want to create two versions of grappe -- generic and DNA. You may then type the following commands: mkdir generic; cd generic ../configure make cd ..; mkdir dna; cd dna; ../configure --enable-DNA make Options ******* The usage of grappe is the following: grappe [-r] [-l] [-c] [-d <eor_chars>] ([-e] <patterns> | -f <pattern-file>) [<textfiles>] -r : considers "end of record" characters as usual characters -l : matches each line of the text independently -c : counts the number successfully matched records (lines) of the text -d <separators> : redefines the "end of record" characters (end-of-line by default) -i : forces grappe to be case insensitive -e <patterns> : introduces the patterns (this option is useful when the patterns begin with '-', othewise it is unnecessary). -f <pattern-file> : specifies the file with patterns You should specify exactly one set of patterns. If no text file is specified then grappe tries to match the standard input. You can specify a set of text files by listing them or using shell regular expressions. For example, grappe 'void' *.c looks for pattern 'void' in all c-files in the directory. DNA version *********** The DNA version of grappe understands the IUPAC nucleid acid codes in the patterns: A = adenine C = cytosine G = guanine T = thymine U = uracil R = purine (same as [GA]) Y = pyrimidine (same as [TC]) K = keto (same as [GT]) M = amino (same as [AC]) S = (same as [GC]) W = (same as [AT]) B = (same as [GTC]) D = (same as [GAT]) H = (same as [ACT]) V = (same as [GCA]) N = any (same as [ACGT]) Note that the basic alphabet is {A,C,G,T,U}. 'T' and 'U' are synonyms. The other letters {R,Y,K,M,S,W,B,D,H,V,N} are only aliases to the corresponding multiple choice letter. Option '-i' does not work with the DNA version of grappe, which is **always** case insensitive and skips spaces. Option '-r' is not supposed to be used with the DNA version. Examples ******** grappe 'hello' /usr/dict/* grappe -c -l 'hello' /usr/dict/words grappe 'CAATCT#TATA' yeastI.pure grappe -f patterns yeastI.pure Bug reports *********** Although we have made lots of tests in extremal conditions, bugs might still occur. If you happen to find one, please make a bug report to kucherov@loria.fr and sbriais@ens-lyon.fr, with the subject "grappe [Bug report]". Don't forget to indicate in your mail the command line that provoked the bug (in particular, send us the patterns and if possible the text), the type of bug (assertion failed [see note below], segmentation fault, ...), the computer and the operating system you use. Note that the 'assertion failed' messages of the form '.... != NULL' (where .... is a name of a variable) are probably out of memory problems and can't be fixed. Others are more interesting ... References ********** grappe is based on the algorithm presented in the following paper: G. Kucherov, M. Rusinowitch Matching a Set of Strings with Variable Length Don't Cares Theoretical Computer Science, 178 (1997), pp 129-154. An extended abstract of this paper appeared in 6th Symposium on Combinatorial Pattern Matching Helsinki, July 1995 Lecture Notes in Computer Science, vol. 937, (1995), pp 230-247. You can find more informations at the following url: http://www.loria.fr/~kucherov/ or follow links from this one: http://www.ens-lyon.fr/~sbriais/ Contributors ************ grappe-1 G.Kucherov (with help of P.Zimmermann and M.Rusinowitch) grappe-2 E.Queffelec grappe-3 S.Briais