packages icon



 mg(1)                              CITRI                              mg(1)
                                 1994/09/19



 NAME
      mg - full-text inverted index support

 DESCRIPTION
      mg is a suite of programs that can be used to create and query full-
      text inverted indexes for a collection of documents.

      An inverted index is essentially a list of pointers to occurrences of
      every word in a collection of documents, so that, for example, in a
      collection of Shakespeare's works, one can pose questions like How
      often is a fool mentioned in the plays? and Are Romeo and Juliet ever
      mentioned in the same sentence? While search utilities like the UNIX
      grep(1) family can be used for simple queries, they suffer from
      several limitations:

      +  Each invocation requires a separate pass over the documents, which
         becomes impossibly slow if the document collection is very large.

      +  It is difficult to compose Boolean queries like Caesar AND Brutus
         AND NOT Antony.

      +  They provide no mechanism for ranking matches according to
         importance, which is extremely important when queries are complex
         and numerous matches are found.

      +  They provide no mechanism for displaying surrounding context
         (although the Free Software Foundation implementations for the GNU
         Project remedy this with their -A num (after) and -B num (before)
         switches).

      +  They provide no mechanism for searching for occurrences in the same
         paragraph, unless preprocessing is done on the files to wrap
         paragraphs into long lines.  Even that will often fail, because
         input buffer sizes may be limited to a few hundred characters.

      +  They provide no easy way to deal with grammatical word-ending
         variants, except by explicit enumeration, such as searching for
         compress, compressed, compresses, compressing, and compression.

      Most computer users have been faced with the problem of finding
      information from a large collection of files, such as electronic mail,
      on-line documentation, or source code.  Inverted indices provide a
      highly-effective solution to this problem.

      The mg software is described in Appendix A of the book
           Ian H. Witten, Alistair Moffat, and Timothy C. Bell
           Managing Gigabytes: Compressing and Indexing Documents and Images
           Van Nostrand Reinhold
           1994
           xiv + 429 pages
           US$54.95



                                    - 1 -         Formatted:  April 20, 2024






 mg(1)                              CITRI                              mg(1)
                                 1994/09/19



           ISBN 0-442-01863-0
           Library of Congress catalog number TA1637 .W58 1994

      Many of the algorithms implemented in mg represent significant
      advances over previous work, both in speed, and in storage
      requirements.  On a fast workstation, in tens of minutes, or a few
      hours, mgbuild(1) can create an index to all of the words in hundreds
      of thousands of documents occupying hundreds of megabytes, or even
      more than a gigabyte, of disk space.  mgquery(1) can then be used to
      answer complex queries, with responses often returned in a second or
      less.  mg also contains algorithms to deal with images, so that with a
      small amount of descriptive text for each image, it is possible to do
      searches in collections of images, and to have retrievals display the
      images using a viewer like xv(1).

      mg can deal with compressed text and image files and surprisingly, it
      usually runs faster than it would if the files were not compressed!
      Thus, the considerable disk space savings possible from file
      compression are not lost because of the need for fast document search
      and retrieval.

      The Free Software Foundation GNU Project compression utilities gzip(1)
      and gunzip(1) are recommended for general use over older alternatives,
      like compress(1), because of their speed and high compression ratios.

 AVAILABILITY
      The mg software can be obtained via anonymous ftp to the Australian
      archive host munnari.oz.au [128.250.1.21] from the directory /pub/mg.

 TYPICAL USE OF mg
      Although mg consists of more than 20 separate programs, many of which
      have complicated command-line options, take heart: most users require
      only two or three of these programs, and nothing more than a document
      name on the command line.

      A document for mg is a fragment of text suitable for retrieval as a
      unit when it is found to contain a requested word, or words.  In a
      collection of poetry, a document might be a stanza, while in a novel,
      it could be a paragraph.  In an index of first lines of poems, a
      document would likely be just a single line.

      Just what constitutes a document is decided by a user-modifiable UNIX
      shell script, mg_get(1).  The default script provided with the mg
      source distribution knows about these named document collections:

      alice      Lewis Carroll's Alice in Wonderland book.

      allfiles   all mail files in the directory tree $HOME/Mail, including
                 all of its nested subdirectories.





                                    - 2 -         Formatted:  April 20, 2024






 mg(1)                              CITRI                              mg(1)
                                 1994/09/19



      mailfiles  individual mail messages in $HOME/mbox and $HOME/.sentmail.

      davinci    A small collection of text and images from the work of
                 Leonardo da Vinci.

      A document collection name is used by mg as a csh(1) case statement
      selector, and as a subdirectory name in the $MGDATA directory, or the
      current directory, if MGDATA is not defined.

 EXTENDING mg_get
      This section describes how to extend mg_get(1) to handle a new
      document collection.

      Let us take two examples: all BibTeX .bib files, and all TeX files,
      contained in subdirectories under the login directory.

      For BibTeX, each bibliographic entry will be considered a separate
      document.  In order to facilitate easy identification of entries, we
      shall require them to begin at the start of a line; the bibclean(1)
      utility can be used to standardize the format of .bib files, and to
      validate their string values, so that this requirement is met.

      For TeX, each paragraph will be a separate document, and we assume
      that paragraphs are separated by blank lines.  We assume that files
      with extensions .atx, .ltx, .stx, and .tex contain input to common TeX
      macro package variants.

      Make a personal copy of the mg_get script, using the one in the mg
      source distribution (mg-1.0/mg/mg_get.sh), or the one in the local
      binary program directory, at many sites called /usr/local/bin/mg_get.

      Examination of the mg_get script shows that each document collection
      name is used in a csh(1) case statement selector, and that most of
      work is done by very simple awk(1) programs that extract documents
      from files.  In your private copy of the mg_get file, after the line

        breaksw #davinci

      and before the line

        default:

      insert this new code:

        case bibfiles:
      # Takes a list of files that contain BibTeX entries, and splits them up
      #   by putting ^B after each entry. Assumes that each entry
      #   begins with a line '^@'.
        switch ($flag)
          case '-init':
          breaksw



                                    - 3 -         Formatted:  April 20, 2024






 mg(1)                              CITRI                              mg(1)
                                 1994/09/19



          case '-text':
            find $HOME -name '*.bib' -print | \
              sort | \
              xargs -l100 awk \
                '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
          breaksw #-text

          case '-cleanup':
          breaksw #-cleanup

        endsw #flag
        breaksw #bibfiles

        case texfiles:
      # Takes a list of TeX files and split them up
      #   by putting ^B after each paragraph. Assumes that each entry
      #   begins with a line '^@'.
        switch ($flag)
          case '-init':
          breaksw

          case '-text':
            find $HOME -name '*x' -print | \
              egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \
              sort | \
              xargs -l100 nawk '   /^ *$/ {if (b!=1) printf "^B";b=1} \
                  \!/^ *$/ {print;b=0} \
                  END {printf "^B"}'
          breaksw #-text

          case '-cleanup':
          breaksw #-cleanup

        endsw #flag
        breaksw #texfiles

      The ^B characters here are Control-B characters, not caret-B pairs.

      If you have a large number of BibTeX or TeX files, it is likely that a
      list of them would be too long for the UNIX shell to hold in a single
      variable, or on a single command line.  Thus, instead of storing the
      output of find(1) in a variable, we proceed more cautiously, and
      employ it to produce a list of the required files, then pipe them to
      xargs(1), which in turn passes up to 100 filenames at a time to
      nawk(1) for document selection.

      Install this modified mg_get script in your private directory for
      executable programs (e.g.  $HOME/bin), create a directory $HOME/mgdata
      to hold the index, issue a rehash command if you are using csh(1) or
      tcsh(1), ensure that mg_get occurs in your search path before any
      system-wide one (the command which mg_get will tell you which version



                                    - 4 -         Formatted:  April 20, 2024






 mg(1)                              CITRI                              mg(1)
                                 1994/09/19



      will be selected), then create the inverted indexes by mgbuild
      bibfiles and mgbuild texfiles.  These commands may take several
      minutes to run if you have a lot of BibTeX or TeX files, or a large
      home directory tree.  Once they are complete, you can then query the
      index with the commands mgquery bibfiles and mgquery texfiles.  These
      should respond very rapidly.

      In order to keep your index up-to-date, you should arrange for it to
      be recreated automatically and regularly, probably every night.  You
      can do this with cron(1).  Use the command crontab -e to edit your
      crontab file and add two lines like this:

      00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
      15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1

      Save the file and exit the editor.  Now, every night at 4am and
      4:15am, mgbuild(1) will reconstruct your inverted indexes, and the
      results of the builds will be saved in log files in your $HOME/mgdata
      directory.

 SEE ALSO
      awk(1), bibclean(1), bibtex(1), compress(1), csh(1), grep(1),
      gunzip(1), gzip(1), mg_compression_dict(1), mg_fast_comp_dict(1),
      mg_get(1), mg_invf_dict(1), mg_invf_dump(1), mg_invf_rebuild(1),
      mg_passes(1), mg_perf_hash_build(1), mg_text_estimate(1),
      mg_weights_build(1), mgbilevel(1), mgbuild(1), mgdictlist(1),
      mgfelics(1), mgquery(1), mgstat(1), mgtic(1), mgticbuild(1),
      mgticdump(1), mgticprune(1), mgticstat(1), nawk(1), tcsh(1), tex(1),
      xargs(1), xmg(1), xv(1).

























                                    - 5 -         Formatted:  April 20, 2024