 RELS(1)                                                             RELS(1)
                              February 22, 1998



 NAME
      rels - order text documents by relevance to phonetic search
      criteria

 SYNOPSIS
      rels [options] patterns paths ...

 DESCRIPTION
      Rels is a program that determines the relevance of text documents to a
      set of keywords expressed in boolean infix notation.  The relevance is
      determined by comparing the phonetic representation of the keywords
      with the phonetic representation of every word in a document.
      (Phonetic searching has some degree of tolerance to misspelled words.)
      The names of the relevant files are printed to the standard output,
      in order of relevance. The boolean operators supported are logical
      or, logical and, and logical not, represented by the symbols "|",
      "&", and "!", respectively; left and right parentheses, "(" and ")",
      are the grouping operators. The paths can be files, directories, or
      both; if a path is a directory, the program recursively descends
      into it, searching all files and directories it contains.

      For example, the command:

          rels "(directory & listing)" /usr/share/man/cat1

      (i.e., rank by relevance all files in the catman directory that
      contain both of the words "directory" and "listing") will list a few
      tens of files, out of the hundreds of catman files, with "ls.1" among
      the most relevant. In other words, to find the command that lists
      directories in a Unix system, the "literature search" was reduced, on
      average, by about 98%, a considerable saving over browsing through
      the files in the directory. Although this example is elementary, a
      similar saving can be demonstrated in searching for documents in
      email repositories and text archives.

      Additional applications include information robots (i.e., "mailbots"
      or "infobots"), where the disposition (i.e., delivery, filing, or
      viewing) of text documents can be determined dynamically, based on
      the relevance of the document to a set of criteria framed in boolean
      infix notation. In other words, the program can be used to order, or
      rank, text documents based on a "context" specified in a general
      mathematical language, similar to that used in calculators.

      The words in the query are case insensitive, and either upper or lower
      case can be used.

      Associativity of operators is left to right, and the precedence of
      operators is identical to 'C':

          precedence      operator

          high            ! = not
          middle          & = and
          lowest          | = or
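
      As an illustration only, the precedence rules above can be sketched
      as a small recursive-descent evaluator in Python. This is not the
      parser used by rels: the names tokenize and matches are hypothetical,
      backslash escapes are not handled, and a keyword is tested with a
      plain case-insensitive substring match rather than the phonetic
      comparison that rels performs.

```python
import re

OPERATORS = {"(", ")", "&", "|", "!"}

def tokenize(query):
    # Operators and parentheses are single tokens; anything else that is
    # not whitespace is part of a keyword.
    return re.findall(r"[()&|!]|[^\s()&|!]+", query)

def matches(query, text):
    """Return True if text satisfies the boolean infix query."""
    tokens = tokenize(query)
    haystack = text.lower()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def parse_or():                  # lowest precedence: |
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():                 # middle precedence: &
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():                 # highest precedence: !
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token in OPERATORS:
            raise SyntaxError("unexpected " + token)
        return token.lower() in haystack

    result = parse_or()
    if pos != len(tokens):
        raise SyntaxError("unexpected " + tokens[pos])
    return result
```

      With these rules, a query such as "a | b & c" groups as
      "a | (b & c)", and "!x | y" groups as "(!x) | y".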

      The operator symbols can be escaped with the "\" character to include
      the symbol in a search pattern. The "escape space" character sequence
      represents one or more space characters in search patterns; each
      instance matches one or more consecutive whitespace characters (as
      defined by isspace(3); see ctype.h and locale.h), which allows
      phrases to be searched for. The "many to one" whitespace translation
      is applied to both the keyword arguments and the text document(s).
      Multiple consecutive instances of the "escape space" sequence should
      not be used in keyword search phrases, and single instances are
      appropriate only when a consecutive sequence of keywords must be
      specified; the logical and operator is the preferred construct for
      finding documents that contain a set of keywords.
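
      The "many to one" whitespace translation described above can be
      sketched in Python as follows. This is a minimal illustration, not
      the rels implementation: the function names are hypothetical, the
      comparison is a literal one rather than a phonetic one, and the
      regular expression class \s only approximates isspace(3).

```python
import re

def collapse_whitespace(text):
    # Map every run of whitespace characters to a single space.
    return re.sub(r"\s+", " ", text)

def contains_phrase(phrase, document):
    # Apply the same translation to the phrase and to the document, then
    # compare without regard to case.
    needle = collapse_whitespace(phrase).lower()
    return needle in collapse_whitespace(document).lower()
```

      Because both sides are translated, a phrase that is broken across
      lines in the document still matches.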

      Note that the logical or operator (|) is useful in conjunction with a
      thesaurus. For example, the thesaurus entry for the word "complexity"
      is:

          Complexity. -- N. complexity; complexness &c. adj.;
          complexus; complication, implication; intricacy,
          intrication; perplexity; network, labyrinth;
          wilderness, jungle; involution, raveling,
          entanglement; coil &c.  (convolution) 248; sleave,
          tangled skein, knot, Gordian knot, wheels within
          wheels; kink, knarl; webwork.

               Adj.  knarled. complex, complexed; intricate,
               complicated, perplexed, involved, raveled,
               entangled, knotted, tangled, inextricable;
               irreducible.

      implying that a reasonable context for a search for things that are
      complex would be:

          rels '(complex | complic | implicat | intric |
               perplex | labyrinth | involut | convolut |
               involv | tangl | inextric | irreduc)' ...

      which would probably return too many document names. The number of
      documents can be reduced iteratively, using the logical and (&) and
      not (!) operators to reject documents of little interest.

 DOCUMENT FORMAT ISSUES
      Hyphenation is handled by deleting each hyphen and any sequence of
      whitespace characters (as defined by isspace(3)) that follows it, in
      both the keyword arguments and the text document(s).
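
      A minimal Python sketch of this hyphen handling, assuming a
      hypothetical helper named dehyphenate (the rels sources may do this
      differently):

```python
import re

def dehyphenate(text):
    # Delete each hyphen together with any whitespace that follows it.
    return re.sub(r"-\s*", "", text)
```

      A word split across lines is rejoined, and a hyphenated compound
      loses its hyphens; since the same translation is applied to the
      keywords, both forms still compare equal.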




      Backspace characters are handled by overwriting the character before
      the backspace with the character after the backspace, so that only
      the last character of a run of consecutive backspace/character
      combinations remains. This is specifically for catman pages, which
      use underscore/backspace/character combinations for underlining and
      backspace/character combinations for bold (overstrike)
      representation. Note that for this process to be successful, a single
      underscore (used for underlining) must precede a single character in
      the sequence.
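
      The backspace handling can be sketched in Python as below. The name
      strip_overstrikes is hypothetical, and the sketch assumes that every
      backspace is followed by a character; a trailing backspace would
      simply delete the character before it.

```python
def strip_overstrikes(text):
    out = []
    for ch in text:
        if ch == "\b":
            if out:
                out.pop()       # the next character overwrites this one
        else:
            out.append(ch)
    return "".join(out)
```

      Both the underlined form "_\bl_\bs" and the bold form "l\bls\bs"
      reduce to "ls".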

 PHONETIC TRANSLATION
      This program is a derivative work based on the rel(1) program,
      available from sunsite.unc.edu in
      /pub/Linux/utils/text/rel-1.3.tar.gz. The sources were modified to
      include a soundex search algorithm.

      The soundex algorithm is a mechanical phonetic translation system for
      the English language, and converts English words into a corresponding
      phonetic code for the word. The algorithm is as follows:

          for each character in a word:

              if the character is the first character of a word

                  1) do nothing

              else

                  2) replace consecutive sequences of the
                  labials (i.e., the characters B, F, P, V)
                  with the character '1'

                  3) replace consecutive sequences of the
                  gutturals and sibilants (i.e., the characters
                  C, G, J, K, Q, S, X, Z) with the character
                  '2'

                  4) replace consecutive sequences of the
                  dentals (i.e., the characters D, T) with the
                  character '3'

                  5) replace consecutive sequences of the
                  longliquids (i.e., the character L) with the
                  character '4'

                  6) replace consecutive sequences of the
                  nasals (i.e., the characters M, N) with the
                  character '5'

                  7) replace consecutive sequences of the
                  shortliquids (i.e., the character R) with
                  the character '6'

                  8) and omit all other characters (i.e., the
                  characters A, E, H, I, O, U, W, Y)

                  9) if the soundex translation of the word is
                  larger than 4 characters, truncate to 4
                  characters.
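
      The steps above can be sketched in Python as follows. This is an
      illustration of the algorithm as described here, not the code in the
      rels sources; the name soundex and the max_len parameter are
      hypothetical. It assumes that only directly adjacent characters of
      the same class form a "consecutive sequence", so an omitted character
      between two of them starts a new sequence.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word, max_len=4):
    word = word.upper()
    if not word:
        return ""
    out = [word[0]]                    # step 1: keep the first character
    previous = None
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch)   # None for A, E, H, I, O, U, W, Y
        if code is not None and code != previous:
            out.append(code)           # steps 2-7: one code per sequence
        previous = code                # step 8: other characters omitted
    result = "".join(out)
    # Step 9: truncate, unless max_len is None (unlimited length).
    return result if max_len is None else result[:max_len]
```

      For the word "conover" this yields C516, as in the example that
      follows.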

      For example, the soundex translation of the word "conover" is C516.
      Unfortunately, there are two related issues in using the soundex
      algorithm as a search mechanism: interior keyword search is
      impossible, and there is no practical strategy to handle hyphenation.

      As a heuristic, simply eliminating 1), above, would permit interior
      keyword searches, and would handle hyphenation by concatenating the
      characters on each side of a '-' character, at the expense of
      erroneous matches. In practice the expense is small (depending on the
      point of view), particularly if the requirement in 9), above, is
      removed, permitting soundex keyword translations of more syllables.

      Note that this heuristic returns soundex translated words that consist
      only of digits. Since numerical data can be a valid search criterion,
      the ambiguity can be avoided by using letters of the alphabet in place
      of the digits, making the algorithm as follows:

          1) replace consecutive sequences of the labials (i.e.,
          the characters B, F, P, V) with the character 'B'

          2) replace consecutive sequences of the gutturals and
          sibilants (i.e., the characters C, G, J, K, Q, S, X,
          Z) with the character 'G'

          3) replace consecutive sequences of the dentals (i.e.,
          the characters D, T) with the character 'D'

          4) replace consecutive sequences of the longliquids
          (i.e., the character L) with the character 'L'

          5) replace consecutive sequences of the nasals (i.e.,
          the characters M, N) with the character 'N'

          6) replace consecutive sequences of the shortliquids
          (i.e., the character R) with the character 'S'

          7) and omit all other characters (i.e., the
          characters A, E, H, I, O, U, W, Y)

      which turns out to be implementable as a simple, direct character
      mapping that is many-to-one and onto. It is also a very fast phonetic
      search method; there is no speed penalty.
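
      A minimal Python sketch of the modified algorithm, using the letter
      codes listed above (including 'S' for R, as given in the list). The
      name modified_soundex is hypothetical, and passing digits through
      unchanged is an assumption made here so that numerical data remains
      searchable; the rels sources may differ.

```python
MODIFIED_CODES = {}
for letters, code in (("BFPV", "B"), ("CGJKQSXZ", "G"), ("DT", "D"),
                      ("L", "L"), ("MN", "N"), ("R", "S")):
    for letter in letters:
        MODIFIED_CODES[letter] = code

def modified_soundex(word):
    out = []
    previous = None
    for ch in word.upper():
        if ch in "0123456789":
            out.append(ch)             # assumption: digits are kept as is
            previous = None
            continue
        code = MODIFIED_CODES.get(ch)  # None for omitted characters
        if code is not None and code != previous:
            out.append(code)           # one code per consecutive sequence
        previous = code
    return "".join(out)
```

      Because the first character is no longer treated specially, the
      translation of "tangl" is a substring of the translation of
      "entangled", which is what makes interior keyword searches possible.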

      The two methodologies (standard soundex vs. modified soundex) were
      compared on a text version of Webster's dictionary (mine has 234,932
      words) as to the number of different words recognized, with both
      unlimited soundex word length and a word length of 4:

             standard soundex               modified soundex

          length = 4    unlimited        length = 4    unlimited

            4,335        61,408             932         31,983

      Although the modified soundex with unlimited length is inferior to the
      standard soundex with unlimited word length in its ability to
      distinguish words, it is superior to the standard soundex with a word
      length of 4, which is the way the algorithm is usually used. It would
      seem that the modified soundex algorithm is a reasonable compromise
      (depending on the point of view) for implementing a phonetic search
      algorithm.

      There are additional issues with the soundex algorithm for phonetic
      keyword searches:

          1) it only works for the English language

          2) a syntax error will be returned for keywords made
          up of ONLY the characters A, E, I, H, O, U, W, and Y
          (there is nothing to search for, since these
          characters are ignored by the soundex algorithm)

          3) extreme care must be exercised when using the
          algorithm to reject documents with the logical not
          operator (!), since it will probably reject more
          documents than expected.

      meaning that the algorithm should be considered as an adjunct to,
      instead of a replacement for, a strict keyword search.

      Tests on large email archives and on the HTML pages from WWW servers
      (each about 15 Mbytes) tend to indicate that, in practice, the
      algorithm returns not quite twice as many keyword matches as a strict
      keyword search. (The output of this program was compared to the output
      of the rel(1) program.)

 OPTIONS
      -v   Print the version and copyright banner of the program.

 WARNINGS
      In the interest of performance, memory is allocated to hold the entire
      file to be searched.  Large files may create resource issues.

      The "not" boolean operator, '!', can NOT be used to find the list of
      documents that do NOT contain a keyword or phrase (unless used in
      conjunction with a preceding boolean construct that syntactically
      defines an intermediate acceptance criterion for the documents). The
      rationale is that the relevance of a set of documents that do NOT
      contain a phrase or keyword is ambiguous and has no meaning; i.e.,
      how can documents be ordered by something they do not contain?
      Whether this is a bug, or not, depends on one's point of view.

      The phonetic translation only works for the English language.

      A syntax error will be returned for keywords made up of ONLY the
      characters A, E, I, H, O, U, W, and Y (there is nothing to search
      for, since these characters are ignored by the soundex algorithm).

      Extreme care must be exercised when using the algorithm to reject
      documents with the logical not operator (!), since it will probably
      reject more documents than expected.

 SEE ALSO
      egrep(1), agrep(1), rel(1)

 DIAGNOSTICS
      Error messages for illegal or incompatible search patterns, for non-
      regular, missing or inaccessible files and directories, or for
      (unlikely) memory allocation failure, and signal errors.

 AUTHORS
      ----------------------------------------------------------------------

      A license is hereby granted to reproduce this software source code and
      to create executable versions from this source code for personal,
      non-commercial use.  The copyright notice included with the software
      must be maintained in all copies produced.

      THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES
      WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF
      MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE.  THE
      AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE THE
      INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.

      Copyright (c) 1995, 1996, 1997, 1998 John Conover, All Rights Reserved.

      Comments and/or bug reports should be addressed to:

          john@johncon.com (John Conover)

      ----------------------------------------------------------------------





                                    - 6 -       Formatted:  January 15, 2025