WordCluster -- A Text-retrieval Program

Help Document (version 1.0)

Author: Dr. Thomas B. Horton
        Department of Computer Science and Engineering
        Florida Atlantic University
        Boca Raton, FL 33431 USA
        Phone: 407/367-2674   FAX: 407/367-2800
        Internet: tom@cse.fau.edu
        Bitnet: HortonT@fauvax

Date: June 18, 1992

Contents of this Document:
    General Description
    General Concepts
        Mark-Up
        Passage
        Cluster Description
        Index
        How Passages Are Found and 'Scored'
    Description of Output
    How to use WordCluster (User Interface)
    File Formats
    Limitations of the Current Release

GENERAL DESCRIPTION:
====================

The approach to text retrieval used in the WordCluster program was conceived as an effort to find image clusters in Shakespeare. Scholars found that words from certain narrow or broad categories (for example: death, the bird the kite, sleep, food) occur in relatively close proximity in a number of different texts. For some clusters, certain categories or words always occurred, while other categories usually occurred. The clusters sometimes span hundreds of words, and are spread over a number of sentences, verses, speeches, etc.

To find these, a program must be able to look at a large number of arbitrarily-defined 'chunks' of a text, count word occurrences in these sections, and somehow select the 'best' set of chunks that match the cluster description.

WordCluster attempts to do this in a highly flexible manner, which is achieved by having a simple but powerful definition of a cluster and by having several parameters to control the search algorithm. This flexibility allows the program to really find image clusters in Shakespeare, and it appears that it can also be used to find smaller passages or allusions (e.g. sentences, verses, phrases).

GENERAL CONCEPTS:
=================

Because of its unusual approach, to use WordCluster one must understand the following concepts.

Mark-Up:
========

Scholars often add certain symbols to a text that do not occur in the printed version; these symbols and the system that defines them serve as an encoding system to define things like titles, speech headings, act/scene headings, chapter headings or boundaries, etc. The symbols and the system that governs their use are often referred to as a 'mark-up system'. WordCluster understands the same kinds of mark-up as the program TACT. [NOTE: this is not quite true for the first version; it understands most of the COCOA-reference system.]

Passage:
========

The word 'passage' is the term used for a chunk or section of text. Passages are defined by a starting and ending point in the text. The starting and ending points may be given in terms of mark-up in the text, line number, word number in the text, etc. The concept of a passage is very general; sentences, speeches, paragraphs, acts and scenes are all passages. WordCluster finds and optionally displays passages.

Cluster Description:
====================

To describe a cluster, one provides a list of categories. Users define any categories they need, along with the words that make up the categories. (Note: a given word can only belong to one category.) Users can also just supply a simple list of words, in which case each is treated as its own category.

Normally, for a given text and a given cluster description, there are a huge number of passages that match to some extent. WordCluster provides two ways of eliminating passages that we don't want to consider at all, and then it has a method (described later) of choosing the best passages from those that were not eliminated.

The first method of eliminating possible matches is to specify which categories are 'required' and which are 'optional'. If a passage does not have at least one word from each of the required categories, it is rejected no matter how many other matches it contains.

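The 'required' test just described can be sketched in a few lines of Python. This is only an illustration of the rule, not WordCluster's own code; the function names, the dictionary layout and the sample categories are all assumptions made for the example.

```python
def build_word_to_category(categories):
    """Map each member word to its category.

    `categories` is a hypothetical dict: category name -> list of words.
    A word may belong to only one category, so a clash is an error.
    """
    word_to_cat = {}
    for cat, words in categories.items():
        for word in words:
            if word_to_cat.get(word, cat) != cat:
                raise ValueError(f"{word!r} is in more than one category")
            word_to_cat[word] = cat
    return word_to_cat


def has_required_categories(passage_words, word_to_cat, required):
    """True if the passage has at least one word from every required category."""
    present = {word_to_cat[w] for w in passage_words if w in word_to_cat}
    return set(required) <= present


# Illustrative categories only (not a real cluster description file).
categories = {
    "kite": ["kite", "kites"],
    "bed": ["bed", "beds", "pillow"],
    "death": ["death", "die"],
}
word_to_cat = build_word_to_category(categories)
print(has_required_categories(["the", "kites", "bed"], word_to_cat, ["kite", "bed"]))
print(has_required_categories(["kites", "die", "death"], word_to_cat, ["kite", "bed"]))
```

The second passage is rejected even though it has three matching words, because no word from the required 'bed' category occurs in it.
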
The second method is to specify that any displayed passage must contain matches from at least a certain number of categories; in other words, at least "N" distinct categories must occur. These two methods can be combined.

Users describe a cluster for the program by creating a 'cluster description file'. This file contains lists of the categories (and the words that belong to them), plus optionally an indication of whether some passages will be eliminated as described above. Samples of cluster description files are provided with the program.

Index:
======

There are really three separate activities that may take place during a WordCluster run: building an index; searching that index for the best passages; and then finally displaying the set of passages that were found.

A WordCluster index is simply a file containing information on words and where they occur in the text file. Given a cluster description file and a text file to search, WordCluster creates the index file, giving it a name it can recognize later. Subsequent searches can make use of the same index file, unless you have modified the text file or cluster file, in which case WordCluster recognizes this and rebuilds the index. Index files are named with extension ".cid" ("cluster index"), and the first part of the name is a combination of the first letters of the cluster description file and the first few letters of the text file.

WordCluster index files are plain ASCII files, which can be viewed or edited in the normal manner. In fact, they're just a list of the words in the cluster that are found in the text, with some extra information. This means that if you add words to the cluster file, you have to rebuild the index.

Other text retrieval programs, like TACT or Digital Librarian on the NeXT, have more sophisticated and efficient indexes that contain all or most of the words in the text.
Indexes like these would have the advantage of not having to be rebuilt to match each different cluster file, but they have the disadvantage of being larger. Maybe someday WordCluster will be able to use one of these programs' index files, or have the ability to create better indexes.

For each text file, there will be a separate index file for each cluster description file used with that text. It's possible to search many texts in one run; the program will automatically use the right index file for each.

To create an index file, the user must specify which mark-up information should be included in the index (no, an index doesn't include all the markup used in the text).

How Passages Are Found and 'Scored':
====================================

First, let's mention that there are three parameters to the program that control the search for passages: the number of passages to report; the 'window width'; and the type of function used to weight matches.

The first of these is easy to explain. The program evaluates each passage and assigns it a numerical score according to how well it matches the cluster description; since a very large number of passages may match to some extent, the program only keeps track of the best matches. The user specifies how many to store and display. This value can be made larger or smaller and the search run again fairly efficiently (assuming the index file does not need rebuilding).

As for the other two parameters, WordCluster uses an algorithm that effectively slides a fixed-size 'window' through the text, counting the number of word matches within that section of text. At any point in the process the window is treated as being centered at a given word and having a radius of specified size (in number of words). The window width parameter is what determines the size of this window. If the width were 101 words, then the program would look at a given word, plus the 50 words before it, plus the 50 words after it.
(Note that since all windows are centered on a word, the window width is always an odd number; if you specify an even number, it is increased by one automatically.)

WordCluster moves such a window one word at a time through the text. At each point, it determines how many of the words in the window are in the cluster description. Which of the large number of such windows could be considered the best passage? One simple technique would be to just use the total number of words in the window that are in the cluster categories as a window's score. (WordCluster can do this; just use the 'rectangular' function.)

But recall that one of the goals was to find concentrated clusters of words. If most of the matching words in a window are close to the central word, then we might wish to give that a better score than if the words are evenly spread out across the window. So WordCluster uses a function to weight matches within the window based on their distance from the center point. The weighted values of all matches within the window are summed to give a total score for that position of the window.

The shape of the weighting function may vary. For example, if a function like the bell-shaped curve associated with the normal distribution is used, then word matches occurring near the center of the window will contribute more than those farther away. Thus a much larger radius for the window could be used to take some account of matches that may occur many lines or sentences away from the window's center without presenting the user with a large number of undesired matches. This approach might be appropriate for searching for image clusters. On the other hand, a uniform or 'rectangular' function (all matches within the window contribute the same weight) could be used with a smaller radius to find tight clusters involving a few words.

So windows are scored using the weighting function, and the best ones are retained.
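
The scoring scheme described above can be illustrated with a short Python sketch. It is not WordCluster's actual code: the function names are hypothetical, and the way distances are scaled before being passed to the weighting function (dividing by radius + 1, and placing the window edge at three standard deviations for the normal curve) is an assumption made only for this example.

```python
import math


def odd_width(width):
    """Even widths are increased by one so the window has a central word."""
    return width if width % 2 == 1 else width + 1


def rectangular(u):
    """Every match inside the window counts the same."""
    return 1.0


def epanechnikov(u):
    """Standard Epanechnikov kernel, 0.75 * (1 - u^2), for |u| <= 1."""
    return 0.75 * (1.0 - u * u)


def normal(u, sigmas=3.0):
    """Bell curve; assumes the window edge lies `sigmas` deviations out."""
    z = u * sigmas
    return math.exp(-0.5 * z * z)


def window_scores(match_positions, n_words, width, weight):
    """Score the window centered on each word of an n_words-long text.

    match_positions: word numbers (0-based here) of words in the cluster.
    Returns a list with one summed, weighted score per window center.
    """
    radius = odd_width(width) // 2
    scores = []
    for center in range(n_words):
        total = 0.0
        for pos in match_positions:
            distance = abs(pos - center)
            if distance <= radius:
                # Scale the distance into [0, 1) before weighting it.
                total += weight(distance / (radius + 1))
        scores.append(total)
    return scores


matches = [10, 12, 13, 40]  # made-up match positions
rect = window_scores(matches, 60, 11, rectangular)
epan = window_scores(matches, 60, 11, epanechnikov)
print(max(range(60), key=epan.__getitem__))
```

With these made-up positions, the rectangular function gives the same score of 3.0 to every center from word 8 to word 15, while the Epanechnikov weighting singles out word 12, the center around which the three nearby matches are most tightly packed.
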
The two matching words in the window that are farthest from the center, one on each side, define the passage that is displayed. (Note that this means you might have a window width of 201 words but end up with the best passage made up of 50 words.)

Also the program must be able to handle passages that overlap or are identical. As the window moves one word at a time, it may be that the exact same set of matching words is found in the window but their positions relative to the center have just changed a little. WordCluster tries to remove as much redundancy as possible in its results. For passages with the same set of words, it only remembers and reports the best score. Also, only the best passage is reported when one of two almost identical passages is wholly contained in another (because the window slides past the earliest matching word without adding a new matching word). However, when two good passages differ on both ends (one has an early word the other doesn't have, and the other has a later word the first doesn't have), both are reported. This feature is very evident when a cluster is much larger than the selected window width. When users encounter this situation, they may need to increase the number of passages to be found and displayed in order to avoid just seeing several variants of the same large cluster.

To sum up, there are three parameters to the program that control the search for passages: the number of passages to report; the 'window width'; and the type of function used to weight matches. If you're looking for a phrase, a sentence, or a verse, use a small window width, perhaps with the rectangular function. In this case, the width should be the size of the largest passage you think you might want to find. If you're hoping to identify larger sections of text, use a large window width and one of the bell-shaped functions like the normal curve or the Epanechnikov function. (I like the latter, perhaps just because of the name.
It's also bell-shaped, but doesn't tail off quite so quickly as the normal curve.)

I also have the impression, gained from my experiences, that using a bell-shaped function and a larger window size (say, twice what you were using with the rectangular function) is better for finding small passages. Note that because the bell-shaped functions have small values far away from the center, having a seemingly ridiculously large window size will not really result in unwanted matches; it just helps avoid the 'horizon effect', where there is some extra information just beyond the edge of your window.

You'll just have to experiment to find which approach works best for what you're looking for. It's efficient to re-run a search on the same index with slightly different parameters; rebuilding the index is much more expensive.

DESCRIPTION OF OUTPUT:
======================

WordCluster always prints a short 'listing' that summarizes the passages it finds. There are two lines for each passage, and passages are presented in order, best passage first. Here is an example listing entry for one passage:

    3.42 r3.t1 R3/1/1/133 R3/1/1/143B 5
    eagles(1049) kites(1054) buzzards(1056) diet(1100) bed(1120)

The first line contains the passage's score (3.42 in this example), the name of the text file (r3.t1), and the COCOA-references for the start and end of the passage. The last number on the first line tells how many matching words were found in this passage. The second line lists the matching words, along with their number from the start of the text. (For example, 'eagles' above is the 1049th word in the text.)

Unless you specify that you only want the listing by using the -l option, you will also see the passage presented in a more readable format together with the text of the passage.

Also, you can process multiple texts (and indexes) using the same cluster file with just one run of the program.
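
For readers who want to post-process the two-line listing entries described above, here is a small Python sketch of a parser. It assumes the layout is exactly as in the example entry (five whitespace-separated fields on the first line, then word(number) pairs on the second); the names `ListingEntry` and `parse_listing_entry` are invented for this illustration and are not part of WordCluster.

```python
import re
from typing import List, NamedTuple, Tuple


class ListingEntry(NamedTuple):
    score: float
    text_file: str
    start_ref: str
    end_ref: str
    n_matches: int
    matches: List[Tuple[str, int]]  # (word, word number in the text)


# A matching word followed by its word number in parentheses.
_MATCH = re.compile(r"(\S+?)\((\d+)\)")


def parse_listing_entry(first_line, second_line):
    """Parse one listing entry, assuming the two-line layout shown above."""
    score, text_file, start_ref, end_ref, count = first_line.split()
    matches = [(word, int(num)) for word, num in _MATCH.findall(second_line)]
    return ListingEntry(float(score), text_file, start_ref, end_ref,
                        int(count), matches)


entry = parse_listing_entry(
    "3.42 r3.t1 R3/1/1/133 R3/1/1/143B 5",
    "eagles(1049) kites(1054) buzzards(1056) diet(1100) bed(1120)",
)
print(entry.score, entry.start_ref, entry.end_ref, entry.matches[0])
```

A parser like this makes it easy to merge or re-sort the listings from separate runs.
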
The best passages in each text can be output separately (use the -s option) or combined to find the best matches in all of the texts (the default). (The texts can have different mark-up definitions, too.)

HOW TO USE WORDCLUSTER (USER INTERFACE):
========================================

WordCluster has a command-line interface that is similar to other commands on the UNIX operating system. Options are specified using a 'dash' combined with an identifying letter; some options are then followed by an argument. Defaults are provided in most cases.

Since there are many parameters that must be specified to run the program, and since many of these don't vary from run to run, WordCluster is equipped with a 'memory' ability that remembers what you did during the last execution. When called again, it will do the same thing again unless you change one or more of the options on the command line. In effect all the parameters you used the last time become the new defaults. These values are stored in a file (probably "wdclust.mem") that can be viewed and even edited; the options are specified exactly as on the command line. (In addition, comments after the character # are allowed.)

To do nothing but see what options were used last time and are now in effect, use the display option; type: "wdcluster -d"

(Note: the -l option, the -o option and the -s option are not remembered.)

The following list shows the possible options and provides a description for each one. Most of this information is printed as a "usage" message when the user makes a mistake. (The number of options may confuse you at first, so look at some of the examples given below this section.)

Usage:
    (on UNIX) wdcluster [options] [ [-m ] txt-file]...
    (on DOS)  wdclust [options] [ [-m ] txt-file]...

Where options are as follows:

    -c  Name of cluster description file.

    -r  List of reference ids and widths; only this information will be stored in the index and output when passages are displayed.

    -w  An integer that specifies the window width to examine. Because of the way the algorithm works, this value must be an odd number, so even numbers will be increased by one.

    -f [val]  Specifies which weighting function to use.

    -n  How many passages to find and display. If 0, then the index will be created (if it doesn't exist) but not searched.

    -m  Name of markup description file to be used with subsequent files given as arguments; these include cluster description files and text files. Can be specified more than once in order to alter markup for different text files. The first file given will be applied to the cluster description file.

    txt-file  Text file to be indexed or whose index should be searched.

    -l  Only display a short "listing" that summarizes the passages found and not the actual text of the passages.

    -s  If multiple text files are searched, this flag indicates that output should be generated for each one individually. Otherwise the overall best passages in all texts are displayed.

    -d  Display all the options to be used from defaults, command line options and memory file; do not update memory file.

    -o  Send all output (except errors) to this file instead of terminal.

Example uses:

    wdclust -c kite.clu -r "t:3,line:5" -n 10 -w 101 -m shak.mkp r3.t ham.t
    wdclust -c kite.clu -r "t:3,line:5" -m shak.mkp r3.t ham.t -m jonson.mkp vol.t

where "kite.clu" is a cluster description file, the ".mkp" files are mark-up description files, and the ".t" files are text files.

FILE FORMATS:
=============

Cluster description files contain categories and the words that are members of a category. (A category can be a single word on its own, with no members.)
Here's a typical cluster description file (between the dashed lines):

----------------------------------------------------------------------------
spirit devils devil spirit spirits soul souls ghost ghosts death die dies death deaths dying deadly kill kills murder murders murdered monuments monument kite kites kite bed bolster bolsters coverlet coverlets canopy canopies bed beds sheet sheets linen pillow pillows sleep
----------------------------------------------------------------------------

COCOA-style references are used in several ways:

A) The "numdistinct" reference takes a value that specifies how many distinct categories must occur for a passage to be retained;

B) The "optional" and "required" references are used to specify whether a category must occur for a passage to be retained. "Optional" is the default, and these references can be placed as many times in the file as you please;

C) If a category is made up of several words, then use the "members" reference to enclose the list of member words. Note that in this case the word at the beginning becomes a category title and must also be explicitly placed in the list if you want it included.

The mark-up used to read the category description file is the first mark-up file specified anywhere on the command-line (or in the memory file).

Also, I use the ".clu" extension in naming my cluster files, but this is not required.

Mark-up description files attempt to define symbols etc. in the same way that the program TACT does. [NOTE: this is currently incomplete. See "Limitations" section below.] If you are familiar with how TACT does this, the following example file will make sense (I hope). For now, please see the TACT manual for more information.
Here's an example mark-up file used for the Oxford Shakespeare texts:

----------------------------------------------------------------------------
alphabet: a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9
diacritics-retained: - '
diacritics-nonretained: \
continuation: +
ignore-pair:
reference-pair: < >
label-pair:
word-seperator: _
----------------------------------------------------------------------------

Each keyword ends in a colon and is followed by symbols (single characters for now). Upper- and lower-case letters are both defined even though you only enter one of the cases. Any characters not defined in the mark-up file are treated as word separators.

I use the ".mkp" extension in naming my mark-up files, but this is not required.

LIMITATIONS OF THE CURRENT RELEASE:
===================================

There are many limitations of this, the first release of WordCluster.

(A) First, since I am rushing to release this before going on holiday for a couple of weeks, I have not tested as thoroughly as I wished. The algorithm for finding matches existed in a previous program; it was tested thoroughly at that time. The code for handling mark-up, reading words based on the mark-up, etc. has been tested a moderate amount. The code for handling the options and the memory file has been tested rigorously. Problems may exist from integrating these various components. Let me know!

(B) The passage display functions are primitive. They do not understand mark-up, and they should provide more context. (Text is displayed from the first matching word to the last matching word in the passage.) I suspect they may blow up if a passage runs to the end of a text.

(C) While the mark-up attempts to emulate TACT capabilities, the code does not handle multi-character symbols yet. Nor does it allow several symbols to have the same internal value for sorting etc. It only handles COCOA-style references; it recognizes "labels" but just ignores them for now.
Labels, ignored text and references are always treated as word breaks. There is no way to specify that the occurrence of a specific reference or symbol resets a counter associated with another reference in the way that TACT and OCP do. Many of these features were planned for in the design of WordCluster, and will not be hard to implement in the near future.

(D) Even though different texts can have different markup descriptions, each must have the same set of reference ids for output. In other words, there is no way to have different "-r" settings for different texts searched by a single run of WordCluster. Users must use separate runs and then compare/merge the output. For example, consider the following (and assume that arguments not present come from the memory file):

    wdclust -c kite.clu -r "text:3,line:5" -m shak.mkp r3.t ham.t
    wdclust -c kite.clu -r "txt:3,l:5" -m jonson.mkp vol.t

Note in this example that the file kite.clu had better be consistent with both of the .mkp files. If not -- say it was defined using shak.mkp -- then one should insert "-m shak.mkp" before the "-c" argument in the run of WordCluster.

(E) The -s option could optionally take a reference id to cause the best passages to be output when that reference changes value. (In effect a "split" within a file in addition to splits between files.) Anyone with a need to find the best passages "by section" within a single file (marked off by a change in a reference value) should contact the author. (It's not a major change.)

(F) There should be an option to specify a specific filename for the index file rather than using the name automatically generated by WordCluster. This would allow users to rename, save and reuse index files under a name that is more meaningful to them. This would involve associating an index with each text file, the same way markup files are handled.

(G) The first markup file specified must also describe the markup used in the cluster description file (-c option) and the output reference string (-r option). Usually this is the markup associated with the first text to be indexed or searched. But if this markup file does not also describe these other two items, then use the -m option before the -c and -r options to associate another markup file with these. Consider the following examples:

    wdclust -c kite.clu -r "txt:3,l:5" -m jonson.mkp vol.t
    wdclust -m general.mkp -c kite.clu -r "txt:3,l:5" -m jonson.mkp vol.t
    wdclust -c kite.clu -r "txt:3,l:5" -m general.mkp -m jonson.mkp vol.t

In the first example, the file "kite.clu" and the string "txt:3,l:5" will be interpreted using the markup description in "jonson.mkp". In the second example, the markup description in "general.mkp" will be used for the file "kite.clu" and the string "txt:3,l:5". In the third example, the markup description in "general.mkp" will NOT be used for the file "kite.clu" and the string "txt:3,l:5", contrary to what you might expect, since the markup file associated with the first file is "jonson.mkp".

-30-