mg(1) CITRI mg(1)
1994/09/19
NAME
mg - full-text inverted index support
DESCRIPTION
mg is a suite of programs that can be used to create and query full-
text inverted indexes for a collection of documents.
An inverted index is essentially a list of pointers to occurrences of
every word in a collection of documents, so that, for example, in a
collection of Shakespeare's works, one can pose questions like How
often is a fool mentioned in the plays? and Are Romeo and Juliet ever
mentioned in the same sentence? While search utilities like the UNIX
grep(1) family can be used for simple queries, they suffer from
several limitations:
+ Each invocation requires a separate pass over the documents, which
becomes impossibly slow if the document collection is very large.
+ It is difficult to compose Boolean queries like Caesar AND Brutus
AND NOT Antony.
+ They provide no mechanism for ranking matches according to
importance, which is extremely important when queries are complex
and numerous matches are found.
+ They provide no mechanism for displaying surrounding context
(although the Free Software Foundation implementations for the GNU
Project remedy this with their -A num (after) and -B num (before)
switches).
+ They provide no mechanism for searching for occurrences in the same
paragraph, unless preprocessing is done on the files to wrap
paragraphs into long lines. Even that will often fail, because
input buffer sizes may be limited to a few hundred characters.
+ They provide no easy way to deal with grammatical word-ending
variants, except by explicit enumeration, such as searching for
compress, compressed, compresses, compressing, and compression.
Most computer users have been faced with the problem of finding
information from a large collection of files, such as electronic mail,
on-line documentation, or source code. Inverted indices provide a
highly-effective solution to this problem.
The mg software is described in Appendix A of the book
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
Managing Gigabytes: Compressing and Indexing Documents and Images
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
- 1 - Formatted: December 14, 2025
mg(1) CITRI mg(1)
1994/09/19
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994
Many of the algorithms implemented in mg represent significant
advances over previous work, both in speed, and in storage
requirements. On a fast workstation, in tens of minutes, or a few
hours, mgbuild(1) can create an index to all of the words in hundreds
of thousands of documents occupying hundreds of megabytes, or even
more than a gigabyte, of disk space. mgquery(1) can then be used to
answer complex queries, with responses often returned in a second or
less. mg also contains algorithms to deal with images, so that with a
small amount of descriptive text for each image, it is possible to do
searches in collections of images, and to have retrievals display the
images using a viewer like xv(1).
mg can deal with compressed text and image files and surprisingly, it
usually runs faster than it would if the files were not compressed!
Thus, the considerable disk space savings possible from file
compression are not lost because of the need for fast document search
and retrieval.
The Free Software Foundation GNU Project compression utilities gzip(1)
and gunzip(1) are recommended for general use over older alternatives,
like compress(1), because of their speed and high compression ratios.
AVAILABILITY
The mg software can be obtained via anonymous ftp to the Australian
archive host munnari.oz.au [128.250.1.21] from the directory /pub/mg.
TYPICAL USE OF mg
Although mg consists of more than 20 separate programs, many of which
have complicated command-line options, take heart: most users require
only two or three of these programs, and nothing more than a document
name on the command line.
A document for mg is a fragment of text suitable for retrieval as a
unit when it is found to contain a requested word, or words. In a
collection of poetry, a document might be a stanza, while in a novel,
it could be a paragraph. In an index of first lines of poems, a
document would likely be just a single line.
Just what constitutes a document is decided by a user-modifiable UNIX
shell script, mg_get(1). The default script provided with the mg
source distribution knows about these named document collections:
alice Lewis Carroll's Alice in Wonderland book.
allfiles all mail files in the directory tree $HOME/Mail, including
all of its nested subdirectories.
- 2 - Formatted: December 14, 2025
mg(1) CITRI mg(1)
1994/09/19
mailfiles individual mail messages in $HOME/mbox and $HOME/.sentmail.
davinci A small collection of text and images from the work of
Leonardo da Vinci.
A document collection name is used by mg as a csh(1) case statement
selector, and as a subdirectory name in the $MGDATA directory, or the
current directory, if MGDATA is not defined.
EXTENDING mg_get
This section describes how to extend mg_get(1) to handle a new
document collection.
Let us take two examples: all BibTeX .bib files, and all TeX files,
contained in subdirectories under the login directory.
For BibTeX, each bibliographic entry will be considered a separate
document. In order to facilitate easy identification of entries, we
shall require them to begin at the start of a line; the bibclean(1)
utility can be used to standardize the format of .bib files, and to
validate their string values, so that this requirement is met.
For TeX, each paragraph will be a separate document, and we assume
that paragraphs are separated by blank lines. We assume that files
with extensions .atx, .ltx, .stx, and .tex contain input to common TeX
macro package variants.
Make a personal copy of the mg_get script, using the one in the mg
source distribution (mg-1.0/mg/mg_get.sh), or the one in the local
binary program directory, at many sites called /usr/local/bin/mg_get.
Examination of the mg_get script shows that each document collection
name is used in a csh(1) case statement selector, and that most of
work is done by very simple awk(1) programs that extract documents
from files. In your private copy of the mg_get file, after the line
breaksw #davinci
and before the line
default:
insert this new code:
case bibfiles:
# Takes a list of files that contain BibTeX entries, and splits them up
# by putting ^B after each entry. Assumes that each entry
# begins with a line '^@'.
switch ($flag)
case '-init':
breaksw
- 3 - Formatted: December 14, 2025
mg(1) CITRI mg(1)
1994/09/19
case '-text':
find $HOME -name '*.bib' -print | \
sort | \
xargs -l100 awk \
'/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
breaksw #-text
case '-cleanup':
breaksw #-cleanup
endsw #flag
breaksw #bibfiles
case texfiles:
# Takes a list of TeX files and split them up
# by putting ^B after each paragraph. Assumes that each entry
# begins with a line '^@'.
switch ($flag)
case '-init':
breaksw
case '-text':
find $HOME -name '*x' -print | \
egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \
sort | \
xargs -l100 nawk ' /^ *$/ {if (b!=1) printf "^B";b=1} \
\!/^ *$/ {print;b=0} \
END {printf "^B"}'
breaksw #-text
case '-cleanup':
breaksw #-cleanup
endsw #flag
breaksw #texfiles
The ^B characters here are Control-B characters, not caret-B pairs.
If you have a large number of BibTeX or TeX files, it is likely that a
list of them would be too long for the UNIX shell to hold in a single
variable, or on a single command line. Thus, instead of storing the
output of find(1) in a variable, we proceed more cautiously, and
employ it to produce a list of the required files, then pipe them to
xargs(1), which in turn passes up to 100 filenames at a time to
nawk(1) for document selection.
Install this modified mg_get script in your private directory for
executable programs (e.g. $HOME/bin), create a directory $HOME/mgdata
to hold the index, issue a rehash command if you are using csh(1) or
tcsh(1), ensure that mg_get occurs in your search path before any
system-wide one (the command which mg_get will tell you which version
- 4 - Formatted: December 14, 2025
mg(1) CITRI mg(1)
1994/09/19
will be selected), then create the inverted indexes by mgbuild
bibfiles and mgbuild texfiles. These commands may take several
minutes to run if you have a lot of BibTeX or TeX files, or a large
home directory tree. Once they are complete, you can then query the
index with the commands mgquery bibfiles and mgquery texfiles. These
should respond very rapidly.
In order to keep your index up-to-date, you should arrange for it to
be recreated automatically and regularly, probably every night. You
can do this with cron(1). Use the command crontab -e to edit your
crontab file and add two lines like this:
00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1
Save the file and exit the editor. Now, every night at 4am and
4:15am, mgbuild(1) will reconstruct your inverted indexes, and the
results of the builds will be saved in log files in your $HOME/mgdata
directory.
SEE ALSO
awk(1), bibclean(1), bibtex(1), compress(1), csh(1), grep(1),
gunzip(1), gzip(1), mg_compression_dict(1), mg_fast_comp_dict(1),
mg_get(1), mg_invf_dict(1), mg_invf_dump(1), mg_invf_rebuild(1),
mg_passes(1), mg_perf_hash_build(1), mg_text_estimate(1),
mg_weights_build(1), mgbilevel(1), mgbuild(1), mgdictlist(1),
mgfelics(1), mgquery(1), mgstat(1), mgtic(1), mgticbuild(1),
mgticdump(1), mgticprune(1), mgticstat(1), nawk(1), tcsh(1), tex(1),
xargs(1), xmg(1), xv(1).
- 5 - Formatted: December 14, 2025