                                    - 1 -      Formatted:  November 15, 2024






 mifluz(3)                                                         mifluz(3)
                                    local



 NAME
      mifluz - C++ library to use and manage inverted indexes



 SYNOPSIS
      #include <mifluz.h>

      int main()
      {
         Configuration* config = WordContext::Initialize();

         WordList* words = new WordList(*config);

         ...

         delete words;

         WordContext::Finish();
      }


 DESCRIPTION
      The purpose of mifluz is to provide a C++ library to build and query a
      full text inverted index. It is dynamically updatable, scalable (indexes
      of up to 1 TB), uses a controlled amount of memory, shares index files
      and memory cache among processes or threads and compresses index files
      to 50% of the raw data. The structure of the index is configurable at
      runtime and allows inclusion of relevance ranking information. The
      query functions do not require loading all the occurrences of a
      searched term.  They consume very few resources and many searches can
      be run in parallel.

      The file management library used in mifluz is a modified Berkeley DB
      (www.sleepycat.com) version 3.1.14.



 CLASSES AND COMMANDS
      Configuration

           reads the configuration file and manages it in memory.

      WordContext

           reads the configuration and sets up the mifluz context.

      WordCursor

           abstract class to search and retrieve entries in a WordList
           object.

      WordCursorOne

           search and retrieve entries in a WordListOne object.

      WordDBInfo
           inverted index usage environment.

      WordDict

           manage and use an inverted index dictionary.

      WordKey
           inverted index key.

      WordKeyInfo
           information on the key structure of the inverted index.

      WordList

           abstract class to manage and use an inverted index file.

      WordListOne

           manage and use an inverted index file.

      WordMonitor
           monitoring classes activity.

      WordRecord
           inverted index record.

      WordRecordInfo
           information on the record structure of the inverted index.

      WordReference
           inverted index occurrence.

      WordType
           defines a word in terms of allowed characters, length, etc.

      htdb_dump

           dump the content of an inverted index in Berkeley DB fashion

      htdb_load

           load the content of an inverted index in Berkeley DB fashion.

      htdb_stat

           displays statistics for Berkeley DB environments.

      mifluzdict

           dump the dictionary of an inverted index.

      mifluzdump

           dump the content of an inverted index.

      mifluzload

           load the content of an inverted index.

      mifluzsearch
           search the content of an inverted index.

 CONFIGURATION
      The format of the configuration file read by WordContext::Initialize
      is:
      keyword: value
      Comments may be added on lines starting with a #. The default
      configuration file is read from the file pointed to by the
      MIFLUZ_CONFIG environment variable, or ~/.mifluz, or /etc/mifluz.conf,
      in this order. If no configuration file is available, built-in
      defaults are used.  Here is an example configuration file:
      wordlist_extend: true
      wordlist_cache_size: 10485760
      wordlist_page_size: 32768
      wordlist_compress: 1
      wordlist_wordrecord_description: NONE
      wordlist_wordkey_description: Word/DocID 32/Flags 8/Location 16
      wordlist_monitor: true
      wordlist_monitor_period: 30
      wordlist_monitor_output: monitor.out,rrd

      wordlist_allow_numbers {true|false} (default false)
           A digit is considered a valid character within a word if this
           configuration parameter is set to true; otherwise it is an error
           to insert a word containing digits.  See the Normalize method for
           more information.

      wordlist_cache_inserts {true|false} (default false)
           If true, all Insert calls are cached in memory. When the WordList
           object is closed or a different access method is called, the
           cached entries are flushed to the inverted index.

      wordlist_cache_max <bytes> (default 0)
           Maximum size of the cumulated cache files generated when doing
           bulk insertion with the BatchStart() function. When this limit is
           reached, the cache files are all merged into the inverted index.
           The value 0 means infinite size allowed.  See WordList(3) for the
           rationale behind cache file handling.

      wordlist_cache_size <bytes> (default 500K)
           Berkeley DB cache size (see Berkeley DB documentation). The cache
           makes a huge difference in performance. It must be at least 2% of
           the expected total data size. Note that if compression is
           activated the data size is eight times larger than the actual
           file size. In this case the cache must be scaled to 2% of the
           data size, not 2% of the file size. See Cache tuning in the
           mifluz guide for more hints.  See WordList(3) for the rationale
           behind cache file handling.
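
           The sizing rule above is plain arithmetic and can be sketched
           as follows. The function name min_cache_bytes is hypothetical;
           the 2% ratio and the factor of eight are the figures quoted in
           the text, not values read from the library.

```cpp
#include <cstdint>

// Figures taken from the description above: the cache should be at least
// 2% of the data size, and with compression enabled the data size is
// eight times the size of the index file on disk.
std::uint64_t min_cache_bytes(std::uint64_t index_file_bytes, bool compressed) {
    const std::uint64_t data_bytes =
        compressed ? index_file_bytes * 8 : index_file_bytes;
    return data_bytes / 50;  // 2% of the data size, rounded down
}
```

           For a compressed index file of 1,000,000,000 bytes this gives
           160,000,000 bytes, against 20,000,000 bytes for an
           uncompressed file of the same size.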

      wordlist_compress {true|false} (default false)
           Activate compression of the index. The resulting index is eight
           times smaller than the uncompressed index.

      wordlist_env_dir <directory> (default .)
           Only valid if wordlist_env_share is set to true. Specify the
           directory in which the sharable environment will be created. All
           inverted indexes specified with a non-absolute pathname will be
           created relative to this directory.

      wordlist_env_share {true|false} (default false)
           If true, a sharable environment is opened, or created if none
           exists.

      wordlist_env_skip {true|false} (default false)
           If true, no environment is created at all. This must never be used
           if a WordList object is created. It may be useful if only WordKey
           objects are used, for instance.

      wordlist_extend {true|false} (default false)
           If true, maintain a reference count of unique words. The
           Noccurrence method gives access to this count.

      wordlist_locale <locale> (default C)
           Set the locale of the program to locale. See setlocale(3) for
           more information.

      wordlist_lowercase {true|false} (default true)
           If a word contains upper case letters, it is converted to
           lowercase if this configuration parameter is true; otherwise it
           is left untouched.

      wordlist_maximum_word_length <number> (default 25)
           The maximum length of a word.  See the Normalize method for more
           information.

      wordlist_minimum_word_length <number> (default 3)
           The minimum length of a word.  See the Normalize method for more
           information.

      wordlist_monitor {true|false} (default false)
           If true create a WordMonitor instance to gather statistics and
           build reports.

      wordlist_monitor_output <file>[,{rrd|readable}] (default stderr)
           Print reports to file instead of the default stderr. If type is
           set to rrd the output is fit for the benchmark-report script.
           Otherwise it is a (hardly :-) readable string.

      wordlist_monitor_period <sec> (default 0)
           If the value sec is a positive integer, set a timer to print
           reports every sec seconds. The timer is set using the ALRM signal
           and will fail if the calling application already has a handler on
           that signal.
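
           Because the timer relies on the ALRM signal, an application may
           want to check for an existing handler before enabling the
           monitor. The following POSIX-only sketch shows one way to do
           that; alarm_handler_installed is a hypothetical name, and
           treating an ignored signal as "no handler" is an assumption.

```cpp
#include <signal.h>

// POSIX-only sketch: queries the current SIGALRM disposition without
// changing it and reports whether a user handler is installed.
bool alarm_handler_installed() {
    struct sigaction current;
    if (sigaction(SIGALRM, nullptr, &current) != 0) return false;  // query failed
    if (current.sa_flags & SA_SIGINFO) return true;  // sa_sigaction style handler
    return current.sa_handler != SIG_DFL && current.sa_handler != SIG_IGN;
}
```

           An application could call this before initializing mifluz and
           leave wordlist_monitor_period at 0 when it returns true.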

      wordlist_page_size <bytes> (default 8192)
           Berkeley DB page size (see Berkeley DB documentation)

      wordlist_truncate {true|false} (default true)
           If a word is too long according to
           wordlist_maximum_word_length, it is truncated if this
           configuration parameter is true; otherwise it is considered an
           invalid word.

      wordlist_valid_punctuation [characters] (default none)
           A list of punctuation characters that may appear in a word. These
           characters will be removed from the word before insertion in the
           index.
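
           The word related parameters (wordlist_allow_numbers,
           wordlist_lowercase, the minimum and maximum word length,
           wordlist_truncate and wordlist_valid_punctuation) interact with
           each other. The sketch below shows one plausible way to combine
           them. It is not the library's Normalize method: WordRules and
           normalize_word are hypothetical names, the order of the steps
           is an assumption, and characters other than digits and the
           listed punctuation are passed through unchecked.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical settings holder using the defaults listed on this page.
struct WordRules {
    bool allow_numbers;
    bool lowercase;
    bool truncate;
    std::size_t minimum_length;
    std::size_t maximum_length;
    std::string valid_punctuation;
    WordRules()
        : allow_numbers(false), lowercase(true), truncate(true),
          minimum_length(3), maximum_length(25), valid_punctuation() {}
};

// Returns true and stores the normalized word in `out` when the word is
// acceptable, false otherwise. The step order is an assumption.
bool normalize_word(const std::string& word, const WordRules& rules,
                    std::string& out) {
    std::string result;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(word[i]);
        // Valid punctuation is removed before insertion.
        if (rules.valid_punctuation.find(word[i]) != std::string::npos) continue;
        // Digits are an error unless explicitly allowed.
        if (std::isdigit(c) && !rules.allow_numbers) return false;
        result += rules.lowercase ? static_cast<char>(std::tolower(c)) : word[i];
    }
    if (result.size() > rules.maximum_length) {
        if (!rules.truncate) return false;  // too long and truncation disabled
        result.resize(rules.maximum_length);
    }
    if (result.size() < rules.minimum_length) return false;  // too short
    out = result;
    return true;
}
```

           With the default rules, "Hello" becomes "hello", "ab" is
           rejected as too short, and "abc123" is rejected because it
           contains digits.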

      wordlist_verbose <number> (default 0)
           Set the verbosity level of the WordList class.


           1 walk logic


           2 walk logic details


           3 walk logic lots of details

      wordlist_wordkey_description <desc> (no default)
           Describe the structure of the inverted index key.  In the
           following explanation of the <desc> format, Word is a mandatory
           literal keyword, while name and bits are values to be replaced.


           Word bits/name bits [/...]


           The name is an alphanumerical symbolic name for the key field.
           The bits is the number of bits required to store this field.
           Note that all values are stored in unsigned integers (unsigned
           int).  Example:
           Word 8/Document 16/Location 8
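
           To make the <desc> format concrete, the sketch below splits a
           description into its fields and computes the largest value a
           field of a given width can hold. KeyField,
           parse_key_description and max_field_value are hypothetical
           names, not the WordKeyInfo API; reporting 0 bits for a field
           written without a bit count is an assumption.

```cpp
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation of one key field.
struct KeyField {
    std::string name;
    unsigned bits;  // 0 when the description gives no bit count
};

// Splits "Word 8/Document 16/Location 8" on '/' and reads "name bits"
// from each segment. Empty segments are skipped.
std::vector<KeyField> parse_key_description(const std::string& desc) {
    std::vector<KeyField> fields;
    std::istringstream segments(desc);
    std::string segment;
    while (std::getline(segments, segment, '/')) {
        std::istringstream item(segment);
        KeyField field;
        field.bits = 0;
        if (!(item >> field.name)) continue;
        unsigned bits = 0;
        if (item >> bits) field.bits = bits;  // the bit count is optional here
        fields.push_back(field);
    }
    return fields;
}

// Largest value that fits in an unsigned field of the given width.
unsigned max_field_value(unsigned bits) {
    const unsigned width = std::numeric_limits<unsigned>::digits;
    if (bits >= width) return std::numeric_limits<unsigned>::max();
    return (1u << bits) - 1u;
}
```

           For the example above this yields three fields (Word, Document,
           Location) of 8, 16 and 8 bits, so a Document value may range
           from 0 to 65535.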

      wordlist_wordkey_document [field ...] (default none)
           A white space separated list of field numbers that define a
           document.  The field number list must not contain gaps. For
           instance 1 2 3 is valid but 1 3 4 is not valid.  This
           configuration parameter is not used by the mifluz library but may
           be used by a query application to define the semantics of a
           document. In response to a query, the application will return a
           list of results in which only distinct documents will be shown.
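
           A query application that reads this parameter may want to
           validate the "no gaps" rule itself. A minimal sketch follows;
           fields_are_contiguous is a hypothetical name, and requiring
           ascending order and at least one field number are assumptions.

```cpp
#include <sstream>
#include <string>

// Returns true when the white space separated field numbers increase by
// exactly one from each entry to the next, e.g. "1 2 3". Returns false
// for gaps ("1 3 4"), non-numeric input, or an empty list.
bool fields_are_contiguous(const std::string& list) {
    std::istringstream in(list);
    int previous = 0;
    int current = 0;
    bool first = true;
    while (in >> current) {
        if (!first && current != previous + 1) return false;
        previous = current;
        first = false;
    }
    // The loop must have ended at end of input, not on unreadable text.
    return !first && in.eof();
}
```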

      wordlist_wordkey_location field (default none)
           A single field number that contains the position of a word in a
           given document.  This configuration parameter is not used by the
           mifluz library but may be used by a query application.

      wordlist_wordrecord_description {NONE|DATA|STR} (no default)
           NONE: the record is empty


           DATA: the record contains an integer (unsigned int)


           STR: the record contains a string (String)

 ENVIRONMENT
      MIFLUZ_CONFIG file name of configuration file read by WordContext(3).
      Defaults to ~/.mifluz or /usr/etc/mifluz.conf.
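
      The lookup order (MIFLUZ_CONFIG first, then ~/.mifluz, then the
      system-wide file) can be sketched as an ordered list of candidates.
      config_candidates is a hypothetical helper, not a mifluz function.
      The system-wide path is passed in as a parameter because it depends
      on the installation.

```cpp
#include <string>
#include <vector>

// Builds the ordered list of configuration file candidates. An empty
// argument means "not set" and is skipped.
std::vector<std::string> config_candidates(const std::string& mifluz_config_env,
                                           const std::string& home_dir,
                                           const std::string& system_path) {
    std::vector<std::string> candidates;
    if (!mifluz_config_env.empty()) candidates.push_back(mifluz_config_env);
    if (!home_dir.empty()) candidates.push_back(home_dir + "/.mifluz");
    if (!system_path.empty()) candidates.push_back(system_path);
    return candidates;
}
```

      A caller would typically fill the first two arguments from
      getenv("MIFLUZ_CONFIG") and getenv("HOME"), taking care to map a
      null result to an empty string, and then use the first candidate
      that can be opened.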



 AUTHORS
      Loic Dachary loic@gnu.org

      The Ht://Dig group http://dev.htdig.org/



 SEE ALSO
      htdb_dump(1), htdb_stat(1), htdb_load(1), mifluzdump(1),
      mifluzload(1), mifluzsearch(1), mifluzdict(1), WordContext(3),
      WordList(3), WordDict(3), WordListOne(3), WordKey(3), WordKeyInfo(3),
      WordType(3), WordDBInfo(3), WordRecordInfo(3), WordRecord(3),
      WordReference(3), WordCursor(3), WordCursorOne(3), WordMonitor(3),
      Configuration(3)