RELS(1) RELS(1)
February 22, 1998
NAME
rels - order the relevance of text documents to a phonetic search
criteria
SYNOPSIS
rels [options] patterns paths ...
DESCRIPTION
Rels is a program that determines the relevance of text documents to a
set of keywords expressed in boolean infix notation. The relevance is
determined by comparing the phonetic representation of the keywords
with the phonetic representation of every word in a document.
(Phonetic searching has some degree of tolerance to misspelled words.)
The list of file names that are relevant are printed to the standard
output, in order of relevance. The boolean operators supported are
logical or, logical and, and logical not. These operators are
represented by the symbols, "|", "&", and, "!", respectively, and left
and right parenthesis, "(" and ")", are used as the grouping
operators. The paths can be files and/or directories-if it is a
directory, the program will recursively descend into the directory,
searching all files and directories contained in the directory.
For example, the command:
rels "(directory & listing)" /usr/share/man/cat1
(ie., find the order of relevance of all files that contain both of
the words "directory" and "listing" in the catman directory) will list
a few tens of files, out of the hundreds of catman files, of which
"ls.1" is the among the most relevant-meaning that to find the command
that lists directories in a Unix system, the "literature search" was
reduced, on average, by about 98%, which is a considerable expediency
in relation to browsing through the files in the directory. Although
this example is remedial, a similar expediency can be demonstrated in
searching for documents in email repositories and text archives.
Additional applications include information robots, (ie., "mailbots,"
or "infobots,") where the disposition (ie., delivery, filing, or
viewing,) of text documents can be determined dynamically, based on
the relevance of the document to a set of criteria, framed in boolean
infix notation. Or, in other words, the program can be used to order,
or rank, text documents based on a "context," specified in a general
mathematical language, similar to that used in calculators.
The words in the query are case insensitive, and either upper or lower
case can be used.
Associativity of operators is left to right, and the precedence of
operators is identical to 'C':
precedence operator
- 1 - Formatted: October 29, 2025
RELS(1) RELS(1)
February 22, 1998
high ! = not
middle & = and
lowest | = or
The operator symbols can be escaped with the "\" character to include
the symbol in a search pattern. The "escape space" character sequence
represents one or more instances of space character(s) in search
patterns, and each instance will match one or more consecutive
whitespace characters, (as defined by isspace(3) in ctype.h and/or
locale.h,) and allows phrases to be searched for. The "many to one"
whitespace character translation occurs in both the keyword arguments
and the text document(s). Multiple consecutive instances of the
"escape space" character sequence in keyword search phrases should not
be used, and single instances are appropriate only when necessary to
specify a consecutive sequence of keywords-the logical and operator is
the preferred searching construct when searching documents that
contain set(s) of keywords.
Note that the logical or operator, (|) is useful in conjunction with a
thesaurus. For example, the thesaurus entry for the word "complexity"
is:
Complexity. -- N. complexity; complexness &c. adj.;
complexus; complication, implication; intricacy,
intrication; perplexity; network, labyrinth;
wilderness, jungle; involution, raveling,
entanglement; coil &c. (convolution) 248; sleave,
tangled skein, knot, Gordian knot, wheels within
wheels; kink, knarl; webwork.
Adj. knarled. complex, complexed; intricate,
complicated, perplexed, involved, raveled,
entangled, knotted, tangled, inextricable;
irreducible.
implying that a reasonable context for a search for things that are
complex would be:
rels '(complex | complic | implicat | intric |
perplex | labyrinth | involut | convolut |
involv | tangl | inextric | irreduc)' ...
which would probably return too many document names. The number of
documents can be reduced with the logical and (&) and not (!) operator
in an iterative fashion to reject documents of little interest.
DOCUMENT FORMAT ISSUES
Hyphenation issues are addressed by deleting hyphens and any following
sequence of instances of whitespace characters, (as defined by
isspace(3),) in both the keyword arguments and the text document(s).
- 2 - Formatted: October 29, 2025
RELS(1) RELS(1)
February 22, 1998
Backspace character issues are addressed by overwriting the character
before the backspace with the character after the backspace, which
will instantiate the character of the last instance of of consecutive
backspace/character combinations. This is specifically for catman
pages which utilize underscore/backspace/character combinations for
underlining, in addition to backspace/character combinations for bold
(overstrike,) representation-note that for this process to be
successful, a single underscore (used for underlining,) must preceed a
single character in the sequence.
PHONETIC TRANSLATION
This program is a derivative works based on the rel(1) program,
available from sunsite.unc.edu in /pub/Linux/utils/text/rel-
1.3.tar.gz. The sources were modified to include a soundex search
algorithm.
The soundex algorithm is a mechanical phonetic translation system for
the English language, and converts English words into a corresponding
phonetic code for the word. The algorithm is as follows:
for each character in a word:
if the character is the first character of a word
1) do nothing
else
2) replace consecutive sequences of the
labials, (ie., the characters, B, F, P, V,)
with the character '1'
3) replace consecutive sequences of the
gutterals and sibilants, (ie., the characters,
C, G, J, K, Q, S, X, Z,) with the character
'2'
4) replace consecutive sequences of the
dentals, (ie., the characters, D, T,) with the
character '3'
5) replace consecutive sequences of the
longliquids, (ie., the character, L,) with the
character '4'
6) replace consecutive sequences of the
nasals, (ie., the characters, M, N,) with the
character '5'
7) replace consecutive sequences of the
shortliquids, (ie., the character, R,) with
- 3 - Formatted: October 29, 2025
RELS(1) RELS(1)
February 22, 1998
the character '6'
8) and, omit all other characters, (ie., the
characters, A, E, H, I, O, U, W, Y,)
9) if the soundex translation of the word is
larger than 4 characters, truncate to 4
characters.
For example, the soundex translation of the word "conover" is C516.
Unfortunately, there are two related issues in using the soundex
algorithm as a search mechanism; interior keyword search is
impossible, and, there is no practical strategy to handle hyphenation.
As a heuristic, simply eliminating 1), above, would permit interior
keyword searches and hyphenation through concatenation of characters
on each side of a '-' character, at the expense of erroneous matches.
In practice, the expense is small-depending on the point of view-
particularly if the requirement in 9), above, is removed, permitting
soundex keyword translations of more syllables.
Note that this heuristic returns soundex translated words that consist
only of numbers. Since numerical data can be a valid search criteria,
the ambiguity can be avoided by using letters from the alphabet for
the numbers, making the algorithm as follows:
1) replace consecutive sequences of the labials, (ie.,
the characters, B, F, P, V,) with the character 'B'
2) replace consecutive sequences of the gutterals and
sibilants, (ie., the characters, C, G, J, K, Q, S, X,
Z,) with the character
'G'
3) replace consecutive sequences of the dentals, (ie.,
the characters, D, T,) with the character 'D'
4) replace consecutive sequences of the longliquids,
(ie., the character, L,) with the character 'L'
5) replace consecutive sequences of the nasals, (ie.,
the characters, M, N,) with the character 'N'
6) replace consecutive sequences of the shortliquids,
(ie., the character, R,) with the character 'S'
7) and, omit all other characters, (ie., the
characters, A, E, H, I, O, U, W, Y,)
which turns out to be implementable as a direct, many-to-one, and on-
to simple character mapping. It is, also, a very fast phonetic search
- 4 - Formatted: October 29, 2025
RELS(1) RELS(1)
February 22, 1998
methodology-there is no speed penalty.
Comparing the two methodologies, (standard soundex vs. modified
soundex,) on a text version of the Webster's dictionary, (mine has
234,932 words,) as to the number of different words recognized, with
both unlimited soundex word length, and a word length of 4:
standard soundex modified soundex
length = 4 unlimited length = 4 unlimited
4,335 61,408 932 31,983
Although the modified soundex with unlimited length is inferior to the
standard soundex with unlimited word length in capability of
recognizing differences in words, it is superior to the standard
soundex with a word length of 4, which is the way the algorithm is
usually used. It would seem that the modified soundex algorithm is a
reasonable, (depending on the point of view,) compromise for
implementing a phonetic search algorithm.
There are additional issues with the soundex algorithm for phonetic
keyword searches:
1) it only works for the English language
2) a syntax error will be returned for keywords made
up of ONLY the characters A, E, I, H, O, U, W, and Y,
(there is nothing to search for-these characters are
ignored by the soundex algorithm)
3) Extreme care must be exercised when using the
algorithm to reject documents with the logical not
operator (!) since it will reject more documents than
probably expected.
meaning that the algorithm should be considered as an adjunct to,
instead of a replacement for, a strict keyword search.
Tests on large email archives, and the HTML pages from WWW servers
(each about 15 Mbytes,) tend to indicate that, in practice, the
algorithm returns not quite twice as many keyword matches as a strict
keyword search. (The output of this program was compared to the output
of the rel(1) program.)
OPTIONS
-v Print the version and copyright banner of the program.
WARNINGS
In the interest of performance, Memory is allocated to hold the entire
file to be searched. Large files may create resource issues.
- 5 - Formatted: October 29, 2025
RELS(1) RELS(1)
February 22, 1998
The "not" boolean operator, '!', can NOT be used to find the list of
documents that do NOT contain a keyword or phrase, (unless used in
conjunction with a preceeding boolean construct that will
syntactically define an intermediate accept criteria for the
documents.) The rationale is that the relevance of a set of documents
that do NOT contain a phrase or keyword is ambiguous, and has no
meaning-ie., how can documents be ordered that do not contain
something? Whether this is a bug, or not, depends on one's point of
view.
The phonetic translation only works for the English language.
A syntax error will be returned for keywords made up of ONLY the
characters A, E, I, H, O, U, W, and Y, (there is nothing to search
for-these characters are ignored by the soundex algorithm).
Extreme care must be exercised when using the algorithm to reject
documents with the logical not operator (!) since it will reject more
documents than probably expected.
SEE ALSO
egrep(1), agrep(1), rel(1)
DIAGNOSTICS
Error messages for illegal or incompatible search patterns, for non-
regular, missing or inaccessible files and directories, or for
(unlikely) memory allocation failure, and signal errors.
AUTHORS
----------------------------------------------------------------------
A license is hereby granted to reproduce this software source code and
to create executable versions from this source code for personal,
non-commercial use. The copyright notice included with the software
must be maintained in all copies produced.
THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES
WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF
MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE
AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE THE
INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.
Copyright (c) 1995, 1996, 1997, 1998 John Conover, All Rights Reserved.
Comments and/or bug reports should be addressed to:
john@johncon.com (John Conover)
----------------------------------------------------------------------
- 6 - Formatted: October 29, 2025