Chair: Worthy Martin, Chair; James French
Advisor: Alfred Weaver
OLSSON 236D, 09:00:00
A Master's Thesis Presentation
ABSTRACT
Keyword search is the mechanism of choice for information discovery and retrieval due to the enormous success of Internet search engines. In fact, nearly half of Internet users perform at least one search daily. The keyword search paradigm regrettably does not extend to similar forms of content, particularly semi-structured and relational data. Searching structured content is difficult because structured data challenges important assumptions of scoring models developed for unstructured text. First, a complete document may not provide the correct granularity for search resultsâ~@~Tit may be too coarse (that is, containing irrelevant information) or too fine (that is, lacking context). Second, the relationships among data must be considered, for the correct interpretation of structured data depends on understanding its relationships. Searching relational data is particularly difficult due to the data normalization process that guides the design of relational databases. Although a number of systems have been proposed to remedy this situation, none have met with considerable success. The search techniques described in this thesis apply to both semi-structured and relational data. The key contributions include an intuitive scoring model for scoring related information and a novel hybrid indexing scheme. The intuitive scoring model allows reuse of existing weighting models developed for unstructured text collections; reusing these models is critical due to the extensive testing they have already seen. Also included in the scoring model is 1) an innovative score normalization that blurs the distinction between unstructured text and information connected by relationships and 2) a factor that recalibrates the raw score produced by the weighting model so that the final score adheres to usersâ~@~Y expectations. The hybrid indexing scheme practically eliminates data redundancy by exploiting the full-text indexing capabilities of the underlying relational database. A small data graph stored in main memory allows efficient graph traversals to identify related information. Evaluation on two widely different datasets demonstrates the effectiveness of the techniques.