Addition to SLEUTH
A New Look to SLEUTH's Search Engine
by Colleen DeJong

1 Introduction

Documentation is a very important part of a software system; however, it is not easy to create, maintain, or access. Software engineers dislike writing or maintaining documentation. The documentation for a software system embodies a large number of documents of different types, which are difficult to link in a consistent manner. These documents are traditionally statically structured, so they do not respond well to the many different ways they will be used. Although documentation is a valuable resource in the development and maintenance of software, its full potential has not been realized because of the lack of a system to support its creation, maintenance, and access. A prototype system to support software documentation, SLEUTH, was developed at the University of Virginia to investigate the usefulness of information retrieval techniques in the domain of software documentation. SLEUTH employs hypertext links for both navigating and searching the documentation collection. One feature of this system is that it incorporates documents of many types. Another is that it provides two types of links, static and dynamic. The static links are defined by the author(s), maintained by the system, and useful in navigation. The dynamic links are created in response to a user query.

While the static links are created when the documents are written, the dynamic links are created when the user, viewing the documentation, poses a question to the system in the form of a query composed of keywords. In the last version of the SLEUTH system, the response to a user's query was a ranked list of links to paragraphs in the documents containing the query keywords. In order to find the information relevant to the question, the user had to follow each link to the original document. This was a tedious process involving numerous links and many documents. A more useful response from SLEUTH would be to return the text of the paragraphs containing the keywords. These paragraphs would then compose a fact-sheet for the query. Ideally, the answer to the user's question can be found on that response page. If not, the links to the original documents would still allow the user to pursue the question further.

As a user of the SLEUTH system, it was beneficial for me to implement the fact-sheet response to queries. However, since I was not involved in any way with the development of the system, I first had to study SLEUTH's implementation before I could make changes to the system. Thus, the goals of this project were to:
· Gain an understanding of the SLEUTH system
· Modify the response to a query to return a fact-sheet
The methods used to gain an understanding of SLEUTH were to read the documentation available, study the source code, experiment with the system, and question the developers. At least half of the time spent on this project was devoted to this objective. The approach taken to modifying the query response was stepwise. The first step in the process was to write a string on the response page. The next was to create the text database automatically from the collection. Only after the database was created correctly was the problem of inserting the paragraphs into the response page addressed. As difficulties arose, it was necessary to go back and modify the database generation code.

The SLEUTH system was developed to explore the effectiveness of information retrieval techniques in the domain of software documentation. It is only a prototype, but it has facilitated experimentation with hypertext links. The project described here has provided an alternative to the previous form of the query response page. This type of experimentation will help us achieve more effective software documentation systems.

2 Understanding SLEUTH

SLEUTH (Software Literacy Enhancing Usefulness to Humans) is a prototype system developed at the University of Virginia for software documentation management. It was developed in order to explore the applicability of information retrieval principles in the domain of software documentation. Specifically, the effectiveness of the use of hypertext techniques was examined. A brief description of the problem domain, characteristics of the SLEUTH system, and its implementation are given here. For more information, please refer to the references at the end of this paper.

2.1 Problem Domain: Software Documentation

Documentation is an immensely important part of a software system. The code alone cannot express every aspect of the system. During the lifespan of a software system, many people will want answers to a wide variety of questions about the system. Examples of such people are a manager, an implementor, a maintainer, or a user. In order to be effective in answering this wide array of questions, the documentation must be easy to create, easy to use, complete, accurate, and easy to maintain.

While there are many editing tools available for the composition of documents that include text, figures, and code, software documentation usually requires a large collection of documents and the editing tools do not aid in structuring such a collection of documents or searching the entire collection. So, while editing tools make it easy to create documents, they do not make it easy to reference them.

In order to quickly find information in a set of documents about a software system, the documents should be structured in a logical manner. There should be links between related topics that are easy to trace. A static representation of the documentation is not flexible enough to service every demand, which could involve abstract principles of the entire system or detailed design of a small piece. In addition to obtaining answers to specific requests, the structure of the documentation should accommodate the desire to read the full documentation in a linear form.

In order to be useful in either answering specific questions or in getting a general overview of the system, the documentation has to be complete and accurate. "Completeness ensures that if the author determines that a link should be present to alert the user to other information in the document, then it will be present. Accuracy ensures that all links throughout the document that are supposed to be to a particular point will indeed all be to that point."[1] This requires that it be carefully written when the system is being built and modified when changes are made to the system. However, engineers strongly object to creating and maintaining documentation. If there is no commitment to the documentation process, it will often be haphazard or ignored altogether. Incompleteness and inaccuracies decrease the utility of the documentation and, the less useful it is, the more objection there will be to its creation and maintenance. Since documentation is so vital, the documentation process should be as automated as possible, so that it is not an enormous burden on the engineer.

Software documentation can be an invaluable tool to the many people involved in the development of a software system. An editing tool facilitates document composition, but this is not enough. The documents need to be structured and linked together. If this is all done manually, then it is demanding for the software engineer to create and maintain the documentation. To ease this process, it should be automated as much as possible. Automation will also alleviate human omissions that lead to inaccuracies and incompleteness.

2.2 Characteristics of SLEUTH System

The SLEUTH system was constructed to explore the effectiveness of information retrieval techniques in the domain of software documentation. This domain places many demands on the documentation system. Some of the characteristics of the SLEUTH system that were developed to meet those demands are:
· SLEUTH accommodates any type of text, from requirements to source code.
All of the documents are accessible from the same interface and are used in the static linking system. However, the source code is not searchable at this time.
· In order to aid navigation, SLEUTH has a set of static hypertext links.
Keywords are specified by the author(s) and links involving those keywords are maintained automatically by SLEUTH throughout further additions or modifications to the documentation. This is to ensure completeness and accuracy.
· A second set of hypertext links is created dynamically in response to a user query.
These links access the files in their current state; they do not depend upon the storage of a separate database.
· The collection of documents is structured so that they can be flattened into a hard copy.
The static links are then indicated by color-coded underlining and page references.

2.3 System Overview

2.3.1 Framemaker

The front end of SLEUTH is Framemaker. Framemaker provides a powerful WYSIWYG editor for the creation and viewing of the documentation, including navigational features, a customization toolkit, and the ability to produce a hardcopy of the documentation. It also allows for the creation of hypertext links to the top of documents or to markers within them. Framemaker can be used to manipulate many types of documents, including specification and code. Using an editor like Framemaker as the front end eases the composition of documents in SLEUTH.

2.3.2 Static Links

In order to create the static links that aid navigation throughout the documentation, during composition the author(s) are required to keep a list of keywords. While this is some extra work for the author(s), SLEUTH does not require the author(s) to maintain these or other links during further modification of the documentation. This design upholds the rule of letting the human do what he does best (identification of the keywords) and the computer do what it does best (mechanical installation and maintenance of the links).

Using the keywords, a filter modifies the MIF (Maker Interchange Format) representation of the documents in the collection to create the specified links. It parses the MIF and inserts the link information at the appropriate location within the documents. The links are color-coded to indicate links to glossary definitions, figures, other parts of the documentation, appendices, or source code. In order to avoid inconsistencies due to multiple copies of the documents, the filter creating the links uses the original documents, and does all conversions necessary. Therefore, changes to the documentation require the linking filter to be re-executed.

2.3.3 The Search Engine

The static keywords that the author(s) designate are not sufficient for all questions about the software system. SLEUTH also provides a keyword searching capability that is integrated into the Framemaker environment as a menu option. The search engine used to provide this capability is a WAIS (Wide Area Information Server). During the initialization of the collection for viewing, a filter parses each of the documents, formatting them for the WAIS indexer, which creates its database for the collection. Then, the submission of a query in Framemaker by a user of the documentation causes another utility in the SLEUTH system to call the WAIS search engine with the query, parse the results, and create a response Framemaker document. In the last version, this response document contained a ranked list of links to paragraphs in the documentation that contain the keywords of the query.

3 Modification of the Query Response

3.1 Specification

Rather than returning a ranked list of links to paragraphs, as was the response to a query in the last version of the SLEUTH system, the purpose here is to modify SLEUTH's response so that it provides a fact-sheet for the query. The hope is that the answer to the user's question can be found on that response page. This will avoid the need to open numerous documents. Additionally, the user can quickly become aware of all references to the topic by skimming the fact-sheet; this would not have happened in the previous version if only half the links were followed.

In light of the goals and implementation of the SLEUTH system, the following decisions were made about the implementation of the fact-sheet response:
· A database of text from the collection will be stored in such a way that it can easily be imported into the Framemaker response document.
· This database must be derived from the original Framemaker documents as was the WAIS database, not from a separately stored collection.
· The database will be constructed during initialization of the collection for viewing--the same time that the documents are sent to the WAIS indexer to create its database.
· A separate database must be constructed for each collection using the same programs, so the database will be stored in a directory with the collection name.
· In order to allow quick and easy access to the paragraphs, each paragraph of a document will be contained in a separate file and these files will be grouped in directories by document name. The document directories will be in the collection directory.
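The directory layout chosen above can be illustrated with a short sketch. It is written in Python purely for illustration (SLEUTH itself uses Korn shell scripts and C programs), and the function name, the root directory argument, and the file name pattern "Paragraph<N>" are assumptions made for this example rather than the names used in SLEUTH.

```python
from pathlib import Path


def paragraph_file(db_root, collection, document, number):
    """Return the path of one paragraph file in the assumed layout.

    Layout taken from the decisions above:
        <db_root>/<collection>/<document>/<paragraph file>
    The file name "Paragraph<N>" is a hypothetical naming scheme.
    """
    return Path(db_root) / collection / document / f"Paragraph{number}"
```

With such a layout, retrieving a paragraph requires only the collection name, the document name, and the paragraph number; no index of the text database has to be kept.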

3.2 Implementation

3.2.1 Creating the text database

The script that initializes the collection for viewing is doIndex. The doIndex script calls another script, index, on every document in the collection. The index script does the following: saves the document as text, parses the text, saves it in paragraph format, and calls the WAIS indexer. One of the inputs to index is the name of the collection. This is the name given to the database created by the WAIS indexer. The same name was used to create the collection directory for the text database. In order to parse the text and convert it to paragraph format, index calls doc_parse, which is written in C. It is this program that had to be modified to fill the collection directory with the text database. Inputs to doc_parse are the textfile, the pathname of the Framemaker file, and now the pathname for the text database files. This pathname is constructed by index from the collection directory name and the document name. In doc_parse, upon detection of the beginning of each paragraph in the input textfile, a new filename had to be constructed from the output pathname and the paragraph number. This file was then opened for writing and the paragraph text copied into it.
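The per-paragraph file creation performed in doc_parse can be sketched as follows. This is an illustrative Python version, not the C code of doc_parse. It assumes that paragraphs in the saved textfile are separated by blank lines, which may not match the way doc_parse actually detects a paragraph boundary, and the file name pattern "Paragraph<N>" is likewise hypothetical.

```python
from pathlib import Path


def write_paragraph_files(text, out_dir):
    """Write each paragraph of `text` to its own file under `out_dir`.

    Assumption: paragraphs are separated by one or more blank lines.
    Returns the number of paragraph files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    current = []
    # The extra "" at the end flushes the final paragraph.
    for line in text.splitlines() + [""]:
        if line.strip():
            current.append(line)
        elif current:
            count += 1
            # File name built from the output path and the paragraph number.
            target = out_dir / f"Paragraph{count}"
            target.write_text("\n".join(current) + "\n")
            current = []
    return count
```

Numbering the files in the order the paragraphs are encountered keeps the file names aligned with the paragraph numbers, provided the same numbering is used for the labels sent to the WAIS indexer.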

3.2.2 Constructing the fact-sheet

When a document in a collection is opened by the user for viewing, the query script specific to that collection is copied into the generic query file. This query script knows the name of the collection, so that the appropriate database is used by the WAIS searcher and now the appropriate text database is used to retrieve the text paragraphs. The query script takes the keywords that were entered by the user in the Framemaker interface and calls the WAIS searcher. The WAIS searcher puts a ranked list of the paragraphs containing the keywords in a file, query.out. This file is parsed by the C program result_parse. Previously, result_parse simply cut the appropriate string from query.out to put in the response page as the link to the paragraph. In order to retrieve the correct text file, the string had to be further broken down and reconstructed into the path of the file containing the text. Finally, each link and its paragraph text are written to the response page in the tagged language of Framemaker and the new document containing the fact-sheet is opened for the user.
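The step of turning each ranked label into a file path and pairing it with its paragraph text can be sketched as below. This is an illustrative Python version, not the C code of result_parse. The label form "<document> Paragraph <N>", the file name pattern, and the plain-text output are all assumptions; the real program parses the format of query.out and writes Framemaker's tagged language, neither of which is reproduced here.

```python
from pathlib import Path


def build_fact_sheet(labels, collection_dir):
    """Return plain text pairing each ranked label with its paragraph text.

    Assumption: each label has the hypothetical form
    "<document> Paragraph <N>", and the paragraph text is stored in
    <collection_dir>/<document>/Paragraph<N>.
    """
    collection_dir = Path(collection_dir)
    entries = []
    for label in labels:  # labels arrive in ranked order
        # Break the label down and rebuild it as a file path.
        document, _, number = label.rsplit(" ", 2)
        path = collection_dir / document / f"Paragraph{number}"
        try:
            text = path.read_text().strip()
        except FileNotFoundError:
            text = "(paragraph text not found)"
        entries.append(f"{label}\n{text}")
    return "\n\n".join(entries)
```

Keeping the label above each paragraph preserves the link back to the original document, so the user can still pursue a question further when the fact-sheet alone does not answer it.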

4 Difficulties Faced

While the concept of this modification is fairly straightforward, the implementation involved a few challenges. The greatest challenge was to understand the SLEUTH system. While the system was designed to support software documentation, the system itself was not well documented. There exists a System User's Guide [2]. It is intended to help a new user of the system install SLEUTH on their machine and describes the use of the system for software documentation. However, there is no documentation aimed at maintainers of SLEUTH that describes the system itself.

Documentation was painfully missed because of the variety of notations, formats, and tools used in the system. The scripts were written in Korn shell, which is not my scripting language of choice. The parsing programs were written in C and their executables called by the scripts. The documents were saved as text, then converted into paragraph format. The WAIS searcher also returns the query results in a file with a certain format. Each of these formats has certain characteristics which were not documented, and the temporary files were constantly overwritten and then deleted at the end of their usage. While this is a good practice for SLEUTH, it made it difficult to ascertain their formatting conventions. Besides the file formats dictated by the WAIS indexer and searcher, Framemaker also has file formats. For example, the query response page had to be written in MML. Additionally, both of these tools have commands that must be invoked with meaningful parameters. To overcome the lack of documentation, the scripts had to be modified to save the temporary files, and a significant amount of experimentation was required to determine the conventions of the formats and which temporary files contained which formats.

Another challenge arose from a quirk of the system that had not been corrected because it previously had no effect. This quirk involved the labelling of the paragraphs in the documents. The word "paragraph" was used in the string that labels the paragraphs as they go to the WAIS indexer. This string is what is returned by the searcher to indicate which paragraphs contained the keywords of the query. It is also now used to reference the files in the text database. In assigning these labels, usually "paragraph" was capitalized, but sometimes it was not. The files in the text database were labelled with a capital "p", thus filenames derived from the strings containing the lowercase "p" were unable to be found.
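One way to tolerate the inconsistent capitalization is to normalize the label before the filename is derived from it. The sketch below is in Python for illustration only and is not necessarily the correction that was applied in SLEUTH; the label form "paragraph <N>" and the file name pattern "Paragraph<N>" are assumptions.

```python
def paragraph_filename(label):
    """Map "paragraph 7" or "Paragraph 7" to the file name "Paragraph7".

    The label form and the file name pattern are hypothetical.
    """
    word, number = label.split()
    if word.lower() != "paragraph":
        raise ValueError(f"unexpected label: {label!r}")
    # Always emit the capitalized form used by the text database files.
    return f"Paragraph{number}"
```

Normalizing on lookup leaves the labels given to the WAIS indexer untouched, so the existing WAIS database does not have to be rebuilt.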

Finally, the text version of the documents saved by Framemaker contained some unexpected characters. In the original documents, the labels of links often contained several words that were separated by spaces. In saving the file as text, these spaces were written as "\x11", rather than as a space. When given to Framemaker as part of the body of text in the fact-sheet, this notation was read as an illegal hexadecimal number and insertion of the paragraph aborted. Similarly, bullets in the original text were saved as an anomalous character value that appeared as a bullet in the textfiles but, when inserted in the fact-sheet, caused Framemaker to print the Greek letter sigma. The obvious solution was to run the textfile through an extra filter after it was saved as text and before either of the databases was created. However, for an undetermined reason, the removal of these characters resulted in WAIS being unable to create its database. The solution implemented, therefore, was to allow the two databases to be created with the anomalous characters intact, since they would not affect the keyword searching, and to clean up the text database after it was created. This cleanup was additionally difficult because the bullet character is not on the keyboard and its representation was elusive.
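The cleanup pass over the text database can be sketched as follows. This Python version is illustrative only and is not the filter used in SLEUTH. It assumes that "\x11" appears in the files as the literal four-character sequence (backslash, x, 1, 1), and it takes the bullet character as a parameter because its actual value is not recorded here; the replacement character "*" is also an arbitrary choice.

```python
from pathlib import Path


def clean_paragraph(text, bullet_char, replacement="*"):
    """Replace the two anomalous notations described above.

    Assumption: spaces inside link labels appear as the literal
    characters backslash, x, 1, 1. `bullet_char` is supplied by the
    caller because its real value is not known here.
    """
    text = text.replace("\\x11", " ")
    return text.replace(bullet_char, replacement)


def clean_text_database(db_dir, bullet_char):
    """Clean every paragraph file under `db_dir` in place."""
    for path in Path(db_dir).rglob("*"):
        if path.is_file():
            # latin-1 maps every byte to a character, so unusual byte
            # values survive the read/write round trip unchanged.
            raw = path.read_text(encoding="latin-1")
            path.write_text(clean_paragraph(raw, bullet_char), encoding="latin-1")
```

Running the cleanup only on the text database, after both databases exist, mirrors the ordering described above and leaves the input to the WAIS indexer unchanged.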

5 Evaluation

The goal of understanding the SLEUTH system was achieved despite the lack of documentation. Although extensive documentation was not added, comments were inserted to explain especially convoluted sections of the code. Furthermore, this paper describes the functions of the utilities that make up SLEUTH more carefully than previous documents have. The properties of the SLEUTH design were also maintained throughout this modification.

The goal of modifying the system was also achieved. The fact-sheet response provides information about the query and assists in the identification of useful documents to pursue further. The links to these documents are included as well. The new implementation also revealed idiosyncrasies of the WAIS search engine. Because shorter paragraphs were favored, headings were often the top responses. In addition, the headline that was inserted to identify the paragraphs was itself used by WAIS in the searching. This resulted in lines containing no text, only a newline, being returned because their headline contained a query keyword.

A future modification to the query response page might be to maintain the links to other documents within the paragraphs. This would further the sense that this page, while generated dynamically, is an integrated part of the documentation. Another useful feature would be to concatenate adjacent paragraphs when they appear consecutively in the ranked list. A study should first be conducted to measure the frequency of this occurrence because it may be quite low. Such a modification would require further interpretation of the response from WAIS, but would enhance the usefulness of the query response as a fact-sheet. One further possibility would be to convert the interface from Framemaker to a WWW interface. This option was explored early in the development of SLEUTH, but discarded because the WWW was not well developed at that time. Now that it is more mature, the possibility should be revisited.
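The proposed concatenation of adjacent paragraphs could be approached along the following lines. This Python sketch covers a feature that has not been implemented in SLEUTH; it assumes the ranked list has already been reduced to (document, paragraph number) pairs, which is a hypothetical intermediate form.

```python
def merge_adjacent(ranked):
    """Group consecutive hits that are neighbouring paragraphs of one document.

    `ranked` is a list of (document, paragraph_number) pairs in ranked
    order. Returns a list of (document, [paragraph_numbers]) groups, with
    the numbers inside each group in document order.
    """
    groups = []
    for document, number in ranked:
        if groups and groups[-1][0] == document:
            numbers = groups[-1][1]
            if number == numbers[-1] + 1:   # next paragraph follows the group
                numbers.append(number)
                continue
            if number == numbers[0] - 1:    # previous paragraph precedes it
                numbers.insert(0, number)
                continue
        groups.append((document, [number]))
    return groups
```

Comparing the number of groups returned with the number of hits gives a simple measure of how often adjacent paragraphs occur consecutively, which is the frequency the suggested study would need.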