WordCluster -- A Text-retrieval Program
Help Document (version 1.0)

Author:
   Dr. Thomas B. Horton
   Department of Computer Science and Engineering
   Florida Atlantic University
   Boca Raton, FL 33431  USA        Phone:  407/367-2674   FAX: 407/367-2800
   Internet:  tom@cse.fau.edu       Bitnet: HortonT@fauvax
Date:  June 18, 1992


Contents of this Document:
   General Description
   General Concepts
      Mark-Up
      Passage
      Cluster Description
      Index
      How Passages Are Found and 'Scored'
   Description of Output
   How to use WordCluster (User Interface)
   File Formats
   Limitations of the Current Release


GENERAL DESCRIPTION:
====================

The approach to text retrieval used in the WordCluster program was conceived as
an effort to find image clusters in Shakespeare.  Scholars found that words
from certain narrow or broad categories (for example: death, the kite as a
bird, sleep, food) occur in relatively close proximity in a number of
different texts.  For some clusters, certain categories or words always
occurred, while other categories usually occurred.  The clusters sometimes
span hundreds of words, and are spread over a number of sentences, verses,
speeches, etc.

To find these, a program must be able to look at a large number of
arbitrarily-defined 'chunks' of a text, count word occurrences in these
sections, and somehow select the 'best' set of chunks that match the cluster
description.  WordCluster attempts to do this in a highly flexible manner,
which is achieved by having a simple but powerful definition of a cluster and
by having several parameters to control the search algorithm.  This
flexibility allows the program to find genuine image clusters in Shakespeare,
and it appears that it can also be used to find smaller passages or allusions
(e.g. sentences, verses, phrases).


GENERAL CONCEPTS:
=================

Because of its unusual approach, to use WordCluster one must understand the
following concepts.

Mark-Up:
========
Scholars often add certain symbols in a text that do not occur in the printed
version; these symbols and the system that defines them serve as an encoding
system to define things like titles, speech headings, act/scene headings,
chapter headings or boundaries, etc.  The symbols and the system that governs
their use are often referred to as a 'mark-up system'.  WordCluster understands
the same kinds of mark-up as the program TACT.  [NOTE: this is not quite true
for the first version; it understands most of the COCOA-reference system.]

Passage:
========
The word 'passage' is the term used for a chunk or section of text.  Passages
are defined by a starting and ending point in the text.  The starting and
ending points may be given in terms of mark-up in the text, line number, word
number in the text, etc.  The concept of a passage is very general; sentences,
speeches, paragraphs, acts and scenes are all passages.  WordCluster finds and
optionally displays passages.

Cluster Description:
====================
To describe a cluster, one provides a list of categories.  Users define any
categories they need, along with the words that make up the categories.  (Note:
a given word can only belong to one category.)  Users can also just supply a
simple list of words, in which case each is treated as its own category.
Normally, for a given text and a given cluster description, there are a huge
number of passages that match to some extent.  WordCluster provides two ways of
eliminating passages that we don't want to consider at all, and then it has a
method (described later) of choosing the best passages from those that were not
eliminated.

The first method of eliminating possible matches is to specify which categories
are 'required' and which are 'optional'.  If a passage does not have at least
one word from each required category, it is rejected no matter how many
other matches it contains.  The second method is to specify that any displayed
passage must contain matches from at least a certain number of categories; in
other words, at least "N" distinct categories must occur.  These two methods
can be combined.
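
The two elimination rules can be sketched in a few lines of Python.  This is
only an illustration, not the program's actual code; the function name and the
data structures are invented for the example.

```python
def passes_filters(matched_words, word_to_category, required, num_distinct=0):
    """Return True if a passage survives both elimination rules.

    matched_words    -- words from the passage that are in the cluster
    word_to_category -- maps each cluster word to its single category
    required         -- set of category names marked as required
    num_distinct     -- minimum number of distinct categories ("N")
    """
    found = {word_to_category[w] for w in matched_words if w in word_to_category}
    # Rule 1: every required category must supply at least one word.
    if not required <= found:
        return False
    # Rule 2: at least N distinct categories must occur.
    return len(found) >= num_distinct


word_to_category = {"kite": "kite", "kites": "kite", "bed": "bed",
                    "sheets": "bed", "death": "death", "die": "death"}
required = {"kite"}

print(passes_filters(["kites", "bed", "sheets"], word_to_category, required, 2))  # True
print(passes_filters(["bed", "death", "die"], word_to_category, required, 2))     # False
```

The second passage matches two categories but is still rejected, because the
required category 'kite' is missing.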

Users describe a cluster for the program by creating a 'cluster description
file'.  This file contains lists of the categories (and the words that belong
to them), plus optionally an indication of whether some passages will be
eliminated as described above.  Samples of cluster description files are
provided with the program.


Index:
======
There are really three separate activities that may take place during a
WordCluster run: building an index; searching that index for the best passages;
and then finally displaying the set of passages that were found.

A WordCluster index is simply a file containing information on words and where
they occur in the text file.  Given a cluster description file and a text file
to search, WordCluster creates the index file, giving it a name it can
recognize later.  Subsequent searches can make use of the same index file,
unless you have modified the text file or cluster file, in which case
WordCluster recognizes this and rebuilds the index.  Index files are named with
extension ".cid" ("cluster index"), and the first part of the name is a
combination of the first letters of the cluster description file and the first
few letters of the text file.

WordCluster index files are plain ASCII files, which can be viewed or edited in
the normal manner.  In fact, they're just a list of the words in the cluster
that are found in the text, with some extra information.  This means that if
you add words to the cluster file, you have to rebuild the index.  Other text
retrieval programs, like TACT or Digital Librarian on the NeXT, have more
sophisticated and efficient indexes that contain all or most of the words in
the text.  Indexes like these would have the advantage of not having to be
rebuilt to match each different cluster file, but they have the disadvantage of
being larger.  Maybe someday WordCluster will be able to use one of these
programs' index files, or have the ability to create better indexes.
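
To make the idea concrete, here is a minimal Python sketch of an index that
records only the cluster words and where they occur.  The real ".cid" layout
and its "extra information" are not shown; the function name, the simple
tokenizing rule and the 1-based word numbers are assumptions made for the
example.

```python
import re
from collections import defaultdict


def build_index(text, cluster_words):
    """Map each cluster word found in the text to its word numbers (1-based)."""
    index = defaultdict(list)
    # Assumed tokenizing rule: runs of letters, digits, hyphens and apostrophes.
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    for number, word in enumerate(tokens, start=1):
        if word in cluster_words:
            index[word].append(number)
    return dict(index)


sample = "More pity that the eagles should be mewed, while kites and buzzards prey"
print(build_index(sample, {"kite", "kites", "eagles"}))
# {'eagles': [5], 'kites': [10]}
```

Because only cluster words are stored, a word added to the cluster file later
is simply absent from such an index, which is why the index must be rebuilt.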

For each text file, there will be a separate index file for each cluster
description file used with that text.  It's possible to search many texts in
one run; the program will automatically use the right index file for each.  To
create an index file, the user must specify which mark-up information should
be included in the index (no, an index doesn't include all the markup used in
the text).

How Passages Are Found and 'Scored':
====================================
First, let's mention that there are three parameters to the program that
control the search for passages:  the number of passages to report; the 'window
width'; and the type of function used to weight matches.   The first of these
is easy to explain.  The program evaluates each passage and assigns it a
numerical score according to how well it matches the cluster description;
since a very large number of passages may match to some extent, the program
only keeps track of the best matches.  The user specifies how many to store
and display.  This value can be made larger or smaller and the search run
again fairly efficiently (assuming the index file does not need rebuilding).

As for the other two parameters, WordCluster uses an algorithm that effectively
slides a fixed-size 'window' through the text, counting the number of word
matches within that section of text.  At any point in the process the window is
treated as being centered at a given word and having a radius of specified size
(in number of words).  The window width parameter is what determines the size
of this window.  If the width were 101 words, then the program would look at a
given word, plus the 50 words before it, plus the 50 words after it.  (Note
that since all windows are centered on a word, the window width is always an
odd number; if you specify an even number, it is increased by one
automatically.)
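
The arithmetic can be sketched as follows in Python.  The function names are
invented, word positions are counted from 0, and clipping the window at the
ends of the text is an assumption about how the edges are handled.

```python
def normalize_width(width):
    """Even widths are increased by one so the window has a center word."""
    return width if width % 2 == 1 else width + 1


def window_bounds(center, width, text_length):
    """Return the (first, last) word positions covered by the window.

    Positions are 0-based and inclusive, clipped to the text.
    """
    radius = normalize_width(width) // 2
    return max(0, center - radius), min(text_length - 1, center + radius)


print(normalize_width(100))            # 101
print(window_bounds(500, 101, 2000))   # (450, 550)
print(window_bounds(10, 101, 2000))    # (0, 60)
```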

WordCluster moves such a window one word at a time through the text.  At each
point, it determines how many of the words in the window are in the cluster
description.  Which of the large number of such windows could be considered the
best passage?  One simple technique would be to just use the total number
of words in the window that are in the cluster categories as a window's score.
(WordCluster can do this; just use the 'rectangular' function.)  But recall
that one of the goals was to find concentrated clusters of words.  If most of
the matching words in a window are close to the central word, then we might
wish to give that a better score than if the words are evenly spread out across
the window.  So WordCluster uses a function to weight matches within the window
based on their distance from the center point.  The weighted values of all
matches within the window are summed to give a total score for that position of
the window.

The shape of the weighting function may vary.  For example, if a function like
the bell-shaped curve associated with the normal distribution is used, then
word matches occurring near the center of the window will contribute more than
those farther away.  Thus a much larger radius for the window could be used to
take some account of matches that may occur many lines or sentences away from
the window's center without presenting the user with a large number of
undesired matches.  This approach might be appropriate for searching for image
clusters.  On the other hand, a uniform or 'rectangular' function (all matches
within the window contribute the same weight) could be used with a smaller
radius to find tight clusters involving a few words.  
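
Here is a minimal Python sketch of three possible weighting functions and of
the score for one window position.  It is an illustration under assumptions,
not WordCluster's code: distances are scaled so that the window edge is at
1.0, the weights are not normalized, and the normal curve is cut off three
standard deviations from the center (an arbitrary choice).  The Epanechnikov
function is a parabola that reaches zero at the window edge.

```python
import math


def rectangular(u):
    """Uniform weight: every match inside the window counts the same."""
    return 1.0 if abs(u) <= 1.0 else 0.0


def epanechnikov(u):
    """Parabolic weight that falls to zero at the window edge."""
    return 1.0 - u * u if abs(u) <= 1.0 else 0.0


def normal(u, edge_sigmas=3.0):
    """Bell-shaped weight; the window edge lies edge_sigmas deviations out."""
    return math.exp(-0.5 * (u * edge_sigmas) ** 2) if abs(u) <= 1.0 else 0.0


def window_score(center, radius, match_positions, weight):
    """Sum the weights of all matching words within radius of center."""
    return sum(weight((p - center) / radius)
               for p in match_positions if abs(p - center) <= radius)


matches = [1049, 1054, 1056, 1100, 1120]
print(window_score(1056, 50, matches, rectangular))    # 4.0
print(window_score(1056, 50, matches, epanechnikov))   # smaller: word 1100 counts little
print(window_score(1056, 50, matches, normal))         # smaller still
```

With the rectangular function the word at 1100 counts as much as the three
words near the center; with the bell-shaped functions it contributes far less.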

So windows are scored using the weighting function, and the best ones are
retained.  The first and last matching words in the window define the passage
that is displayed.  (Note that this means you might have a window width of 201
words but end up with the best passage made up of 50 words.)  Also
the program must be able to handle passages that overlap or are identical.  As
the window moves one word at a time, it may be that the exact same set of
matching words is found in the window but their positions relative to the
center have just changed a little.  WordCluster tries to remove as much
redundancy as possible in its results.  For passages with the same set of
words, it only remembers and reports the best score.  Also, only the best
passage is reported when one of two almost identical passages is wholly
contained in another (because the window slides past the earliest matching word
without adding a new matching word).  However, when two good passages differ on
both ends (one has an early word the other doesn't have, and the other has a
later word the first doesn't have), both are reported.  This feature is very
evident when a cluster is much larger than the selected window width.  When
users encounter this situation, they may need to increase the number of
passages to be found and displayed in order to avoid just seeing several
variants of the same large cluster.
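
A simplified Python sketch of the search loop follows.  It handles only the
first kind of redundancy (windows holding the same set of matching words keep
just their best score); the rule about one passage being wholly contained in
another is not implemented.  The names are invented, and the parabolic weight
is just one possible choice.

```python
def epanechnikov(u):
    """Parabolic weight: 1 at the center, 0 at the window edge."""
    return max(0.0, 1.0 - u * u)


def best_passages(match_positions, text_length, radius, keep):
    """Slide a window over the text and keep the best-scoring passages.

    match_positions -- sorted word numbers of all cluster words in the text
    Returns up to `keep` tuples of (score, first_match, last_match).
    """
    best = {}  # matches in the window -> best (score, first, last) seen
    for center in range(text_length):
        inside = tuple(p for p in match_positions if abs(p - center) <= radius)
        if not inside:
            continue
        score = sum(epanechnikov((p - center) / radius) for p in inside)
        # Windows holding the same matching words are the same passage;
        # remember only the best score seen for it.
        if inside not in best or score > best[inside][0]:
            best[inside] = (score, inside[0], inside[-1])
    return sorted(best.values(), reverse=True)[:keep]


for score, first, last in best_passages([10, 12, 15], 40, 5, 2):
    print(f"{score:.2f}  words {first}-{last}")
# 2.48  words 10-15
# 1.60  words 10-12
```

Note that the second result is a variant of the first, which is the situation
described above: asking for more passages is the way to see past such variants.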


To sum up, there are three parameters to the program that control the search
for passages: the number of passages to report; the 'window width'; and the
type of function used to weight matches.  If you're looking for a phrase, a
sentence, or a verse, use a small window width, perhaps with the rectangular
function.  In this case, the width should be the size of the largest passage
you think you might want to find.  If you're hoping to identify larger sections
of text, use a large window width and one of the bell-shaped functions like the
normal curve or the Epanechnikov function.  (I like the latter, perhaps just
because of the name.  It's also bell-shaped, but doesn't tail off quite so
quickly as the normal curve.)  I also have the impression, gained from my
experiences, that using a bell-shaped function and a larger window size (say,
twice what you were using with the rectangular function) is better for finding
small passages.  Note that because the bell-shaped functions have small values
far away from the center, having a seemingly ridiculously large window size
will not really result in unwanted matches; it just helps avoid the 'horizon
effect', where there is some extra information just beyond the edge of your
window.  You'll just have to experiment to find which approach works best for
what you're looking for.  It's efficient to re-run a search on the same index
with slightly different parameters; rebuilding the index is much more
expensive.


DESCRIPTION OF OUTPUT:
======================

WordCluster always prints a short 'listing' that summarizes the passages it
finds.  There are two lines for each passage, and passages are presented in
order, best passage first.  Here is an example listing entry for one passage:

 3.42 r3.t1 R3/1/1/133   R3/1/1/143B   5
      eagles(1049) kites(1054) buzzards(1056) diet(1100) bed(1120)

The first line contains the passage's score (3.42 in this example), the name of
the text file (r3.t1), and the COCOA-references for the start and end of the
passage.  The last number on the first line tells how many matching words were
found in this passage.  The second line lists the matching words, each followed
by its position counted from the start of the text.  (For example, 'eagles'
above is the 1049th word in the text.)
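
If you want to post-process a listing, an entry like the one above can be
picked apart with a few lines of Python.  This is a sketch based only on the
example entry; the function name is invented, and it assumes the fields on the
first line never contain spaces.

```python
import re


def parse_listing_entry(first_line, second_line):
    """Split one two-line listing entry into its fields."""
    score, text_file, start_ref, end_ref, count = first_line.split()
    # Each match looks like word(position), e.g. eagles(1049).
    words = [(word, int(number))
             for word, number in re.findall(r"(\S+?)\((\d+)\)", second_line)]
    return {"score": float(score), "file": text_file,
            "start": start_ref, "end": end_ref,
            "count": int(count), "words": words}


entry = parse_listing_entry(
    " 3.42 r3.t1 R3/1/1/133   R3/1/1/143B   5",
    "      eagles(1049) kites(1054) buzzards(1056) diet(1100) bed(1120)")
print(entry["score"], entry["start"], entry["end"])   # 3.42 R3/1/1/133 R3/1/1/143B
print(entry["words"][0])                              # ('eagles', 1049)
```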

Unless you specify that you only want the listing by using the -l option, you
will also see the passage presented in a more readable format together with the
text of the passage.

Also, you can process multiple texts (and indexes) using the same cluster
file with just one run of the program.  The best passages in each text can
be output separately (use the -s option) or combined to find the best
matches in all of the texts (the default).  (The texts can have different
mark-up definitions, too.)


HOW TO USE WORDCLUSTER (USER INTERFACE):
========================================

WordCluster has a command-line interface that is similar to other commands on
the UNIX operating system.  Options are specified using a 'dash' combined with
an identifying letter;  some options are then followed by an argument.
Defaults are provided in most cases.

Since there are many parameters that must be specified to run the program, and
since many of these don't vary from run to run, WordCluster is equipped with a
'memory' ability that remembers what you did during the last execution.  When
called again, it will do the same thing again unless you change one or more of
the options on the command line.  In effect all the parameters you used the
last time become the new defaults.   These values are stored in a file
(probably "wdclust.mem") that can be viewed and even edited; the options are
specified exactly as on the command line.  (In addition, comments after the
character # are allowed.)  To do nothing but see what options were used last
time and are now in effect, use the display option;  type: "wdcluster -d"

(Note:  the -l option, the -o option and the -s option are not remembered.)
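
Since the memory file holds options written exactly as on the command line,
with comments after #, a script could read it along these lines.  This is a
Python sketch, not the program's own code; the sample contents are made up,
and whether the real file spreads its options over several lines is an
assumption.

```python
import shlex


def read_memory_options(text):
    """Return the options in a memory file as a flat argument list."""
    args = []
    for line in text.splitlines():
        # shlex handles the quoting and drops everything after a '#'.
        args.extend(shlex.split(line, comments=True))
    return args


sample = '''-c kite.clu          # cluster description file
-r "t:3,line:5" -n 10 -w 101
-m shak.mkp'''
print(read_memory_options(sample))
# ['-c', 'kite.clu', '-r', 't:3,line:5', '-n', '10', '-w', '101', '-m', 'shak.mkp']
```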


The following list shows the possible options and provides a description for
each one.  Most of this information is printed as a "usage" message when the
user makes a mistake.  (The number of options may confuse you at first, so look
at some of the examples given below this section.)

Usage:
   (on UNIX) wdcluster [options] <txt-file> [ [-m <file>] txt-file]...
   (on DOS)  wdclust   [options] <txt-file> [ [-m <file>] txt-file]...

Where options are as follows:

-c  <file>  	Name of cluster description file.
-r  <string>	List of reference ids and widths; only this information will be
    	    	stored in the index and output when passages are displayed.

-w <num>    	An integer that specifies the window width to examine.  Because
    	    	of the way the algorithm works, this value must be an odd
    	    	number, so even numbers will be increased by one.
-f [val]    	Specifies which weighting function to use.
-n <num>    	How many passages to find and display.  If 0, then the index
    	    	will be created (if it doesn't exist) but not searched.

-m <file>   	Name of markup description file to be used with subsequent
    	    	files given as arguments; these include cluster description
    	    	files and text files.  Can be specified more than once in order
    	    	to alter markup for different text files.  The first file given
    	    	will be applied to the cluster description file.
<txt-file>  	Text file to be indexed or whose index should be searched.


-l  	    	Only display a short "listing" that summarizes the passages
    	    	found and not the actual text of the passages.
-s  	    	If multiple text files are searched, this flag indicates that
    	    	output should be generated for each one individually.
    	    	Otherwise the overall best passages in all texts are displayed.
-d  	    	Display all the options to be used from defaults, command line
    	    	options and memory file; do not update memory file.
-o <file>    	Send all output (except errors) to this file instead of
    	    	terminal.

Example uses:

wdclust -c kite.clu -r "t:3,line:5" -n 10 -w 101 -m shak.mkp r3.t ham.t
wdclust -c kite.clu -r "t:3,line:5" -m shak.mkp r3.t ham.t -m jonson.mkp vol.t

where "kite.clu" is a cluster description file, the ".mkp" files are mark-up
description files, and the ".t" files are text files.
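
The -r argument in these examples is a comma-separated list of "id:width"
pairs.  A small Python sketch of splitting such a string (the function name is
invented, and no error checking is attempted):

```python
def parse_reference_spec(spec):
    """Turn a string like "t:3,line:5" into [(id, width), ...]."""
    pairs = []
    for item in spec.split(","):
        ref_id, _, width = item.strip().partition(":")
        pairs.append((ref_id, int(width)))
    return pairs


print(parse_reference_spec("t:3,line:5"))   # [('t', 3), ('line', 5)]
```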


FILE FORMATS:
=============

Cluster description files contain categories and the words that are members of
a category.  (A category can be a single word on its own, with no members.)
Here's a typical cluster description file (between the dashed lines):

----------------------------------------------------------------------------   
   <numdistinct 2>
   <optional>
   
   spirit <members> devils devil spirit spirits soul souls ghost
   ghosts <members>
   
   death <members> die dies death deaths dying deadly kill kills murder
   murders murdered monuments monument <members>
   
   <required>
   kite <members> kites kite <members>

   <optional>
   bed <members> bolster bolsters coverlet coverlets canopy canopies bed beds
   sheet sheets linen pillow pillows sleep <members>
----------------------------------------------------------------------------   

COCOA-style references are used in several ways:
  A)	The "numdistinct" reference takes a value that specifies how many
    	distinct categories must occur for a passage to be retained;
  B)	The "optional" and "required" references are used to specify
    	whether a category must occur for a passage to be retained.
    	"Optional" is the default, and these references can be placed as many
    	times in the file as you please;
  C)	If a category is made up of several words, then use the "members"
    	reference to enclose the list of member words.  Note that in this
    	case the word at the beginning becomes a category title and must
    	also be explicitly placed in the list if you want it included.
The mark-up used to read the category description file is the first mark-up
file specified anywhere on the command-line (or in the memory file).
Also, I use the ".clu" extension in naming my cluster files, but this is
not required.
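
To show how these rules fit together, here is a Python sketch of a reader for
a cluster description file.  It is not the program's parser: it assumes the
references are always written with < and > (the real program takes them from
the mark-up file), and the function name and returned structure are invented.

```python
import re


def parse_cluster_description(text):
    """Return (num_distinct, categories) for a cluster description.

    categories maps a category title to {'required': bool, 'words': [...]}.
    """
    tokens = re.findall(r"<[^>]*>|[^\s<>]+", text)
    num_distinct, required, categories = 0, False, {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        i += 1
        if token.startswith("<"):
            parts = token[1:-1].split()
            name = parts[0].lower() if parts else ""
            if name == "numdistinct":
                num_distinct = int(parts[1])
            elif name in ("required", "optional"):
                required = (name == "required")   # stays in force until changed
            continue
        # A plain word starts a category; alone, it is its own only member.
        words = [token]
        if i < len(tokens) and tokens[i].lower() == "<members>":
            i += 1
            words = []                            # title is not a member by default
            while i < len(tokens) and tokens[i].lower() != "<members>":
                words.append(tokens[i])
                i += 1
            i += 1                                # skip the closing <members>
        categories[token] = {"required": required, "words": words}
    return num_distinct, categories


sample = """
<numdistinct 2>
<optional>
spirit <members> devils devil spirit spirits soul souls ghost
ghosts <members>
<required>
kite <members> kites kite <members>
<optional>
food
"""
num_distinct, categories = parse_cluster_description(sample)
print(num_distinct)            # 2
print(categories["kite"])      # {'required': True, 'words': ['kites', 'kite']}
print(categories["food"])      # {'required': False, 'words': ['food']}
```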


Mark-up description files attempt to define symbols etc. in the same way
that the program TACT does.  [NOTE: this is currently incomplete.  See
"Limitations" section below.]  If you are familiar with how TACT does this,
the following example file will make sense (I hope).  For now, please see
the TACT manual for more information.   Here's an example mark-up file used
for the Oxford Shakespeare texts:

----------------------------------------------------------------------------   
alphabet: a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 
diacritics-retained: - '
diacritics-nonretained: \
continuation: +
ignore-pair: 
reference-pair: < >
label-pair: 
word-seperator: _
----------------------------------------------------------------------------   

There are keywords that end in colons that are followed by symbols (single
characters for now).  Upper- and lower-case letters are both defined even
though you only enter one of the cases.  Any characters not defined in the
mark-up file are treated as word separators.  I use the ".mkp" extension in
naming my mark-up files, but this is not required.
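
As an illustration of how such a file might drive word-splitting, here is a
Python sketch.  Several things in it are assumptions rather than documented
behavior: that an indented line continues the previous keyword, that retained
diacritics stay in the word while non-retained ones are silently dropped, and
that text inside a reference pair is skipped.  The function names and the
sample text are invented.

```python
def parse_markup(text):
    """Read 'keyword: symbols' lines into a dict of symbol lists."""
    settings, key = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, colon, rest = line.partition(":")
        if colon and not line[0].isspace():
            key = head.strip()
            settings[key] = rest.split()
        elif key is not None:
            settings[key].extend(line.split())   # assumed continuation line
    return settings


def split_words(text, settings):
    """Split text into words; undefined characters act as separators."""
    word_chars = set(settings.get("alphabet", []))
    word_chars |= set(settings.get("diacritics-retained", []))
    dropped = set(settings.get("diacritics-nonretained", []))
    pair = settings.get("reference-pair", [])
    ref_open, ref_close = pair if len(pair) == 2 else (None, None)

    words, current, in_reference = [], [], False
    for ch in text.lower():                      # one case stands for both
        if in_reference:
            in_reference = ch != ref_close
        elif ch in word_chars:
            current.append(ch)
        elif ch in dropped:
            continue
        else:
            if current:
                words.append("".join(current))
                current = []
            in_reference = ch == ref_open
    if current:
        words.append("".join(current))
    return words


SAMPLE_MARKUP = r"""alphabet: a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9
diacritics-retained: - '
diacritics-nonretained: \
continuation: +
reference-pair: < >
word-seperator: _
"""
settings = parse_markup(SAMPLE_MARKUP)
print(split_words("<line 133>More pity that the eagles should be mew'd,", settings))
# ['more', 'pity', 'that', 'the', 'eagles', 'should', 'be', "mew'd"]
```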

LIMITATIONS OF THE CURRENT RELEASE:
===================================

There are many limitations of this, the first release of WordCluster.

(A) First, since I am rushing to release this before going on holiday for a
couple of weeks, I have not tested as thoroughly as I wished.  The
algorithm for finding matches existed in a previous program; it was
tested thoroughly at that time.  The code for handling mark-up, reading
words based on the mark-up, etc. has been tested a moderate amount.  The
code for handling the options and the memory file has been tested
rigorously.  Problems may exist from integrating these various components.
Let me know!

(B) The passage display functions are primitive.  They do not understand
mark-up, and they should provide more context.  (Text is displayed from the
first matching word to the last matching word in the passage.)  I suspect
they may blow up if a passage runs to the end of a text.

(C) While the mark-up attempts to emulate TACT capabilities, the code does
not handle multi-character symbols yet.  Nor does it allow several symbols
to have the same internal value for sorting etc.  It only handles
COCOA-style references; it recognizes "labels" but just ignores them for
now.  Labels, ignored text and references are always treated as word
breaks.  There is no way to specify that the occurrence of a specific
reference or symbol resets a counter associated with another reference in
the way that TACT and OCP do.  Many of these features were planned for in
the design of WordCluster, and will not be hard to implement in the near
future.

(D) Even though different texts can have different markup descriptions, each
must have the same set of reference ids for output.  In other words, there is
no way to have different "-r" settings for different texts searched by a single
run of WordCluster.  Users must use separate runs and then compare/merge the
output.  For example, consider the following (and assume that arguments not
present come from the memory file):
   wdclust -c kite.clu -r "text:3,line:5" -m shak.mkp r3.t ham.t
   wdclust -c kite.clu -r "txt:3,l:5" -m jonson.mkp vol.t

Note in this example that the file kite.clu had better be consistent with
both of the .mkp files.  If not -- say it was defined using shak.mkp --
then one should insert "-m shak.mkp" before the "-c" argument in the run of
WordCluster.

(E) The -s option could optionally take a reference id to cause the best
passages to be output when that reference changes value.  (In effect a "split"
within a file in addition to splits between files.)  Anyone with a need to find
the best passages "by section" within a single file (marked off by a change in
a reference value) should contact the author.  (It's not a major change.)


(F) There should be an option to specify a specific filename for the index file
rather than using the name automatically generated by WordCluster.  This would
allow users to rename, save and reuse index files under a name that is more
meaningful to them.  This would involve associating an index with each text
file, the same way markup files are handled.

(G) The first markup file specified must also describe the markup used in the
cluster description file (-c option) and the output reference string (-r
option).  Usually this is the markup associated with the first text to be
indexed or searched.  But if this markup file does not also describe these
other two items, then use the -m option before the -c and -r options to
associate another markup file with these.  Consider the following examples:
   wdclust -c kite.clu -r "txt:3,l:5" -m jonson.mkp vol.t
   wdclust -m general.mkp -c kite.clu -r "txt:3,l:5" -m jonson.mkp vol.t
   wdclust -c kite.clu -r "txt:3,l:5" -m general.mkp -m jonson.mkp vol.t

  In the first example, the file "kite.clu" and the string "txt:3,l:5" will be
interpreted using the markup description in "jonson.mkp".
  In the second example, the markup description in "general.mkp" will be used
for the file "kite.clu" and the string "txt:3,l:5".
  In the third example, the markup description in "general.mkp" will NOT be
used for the file "kite.clu" and the string "txt:3,l:5", as you might have
expected, since the markup file associated with the first file is "jonson.mkp".


-30-