AFTER: Adaptive Feedback Testing & Evaluation enviRonment

Version: 0.9b (November 08, 2005)

Relevance Feedback (RF)

For any IR task, it is usually difficult to obtain a satisfying result from the initial search alone, so a feedback process is employed to refine the search result. During this process, a questionnaire is prepared according to the current search result, the user answers it, and the answers are delivered back to the system to refine the result. This process can repeat until the user is satisfied or some pre-set criterion is met.

Relevance feedback is a special form of this question-answer process, in which relevance questions are asked to capture the user's information need. Usually such a question asks the user to judge whether a provided item is relevant to his/her information need; "Are you interested in documents like this one?" is a typical example. Early work on relevance feedback in IR can be traced back to the 1970s, such as the classic Rocchio method [Rocchio 71]. Many new relevance feedback approaches have been proposed during the last decade with the booming demand for multimedia retrieval applications.

We briefly summarize relevance feedback algorithms according to how they select samples for the user to judge (sampling) and how the judged samples are used to refine the search result (refinement). These approaches are put under a single framework, so that each approach can be treated as a special case and their intrinsic relations can be studied.

Click here to read this short summary document.
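As a concrete illustration of the refinement side, the classic Rocchio update mentioned above can be sketched as follows. This is a minimal vector-space sketch, not AFTER's implementation; the function name and the alpha/beta/gamma defaults are conventional illustration choices.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio refinement [Rocchio 71]: move the query vector
    toward the centroid of relevant documents and away from the
    centroid of irrelevant ones. Vectors are plain lists of floats."""
    def centroid(docs, dim):
        # mean vector of a document set; zero vector if the set is empty
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]

    dim = len(query)
    cr = centroid(relevant, dim)
    ci = centroid(irrelevant, dim)
    return [alpha * q + beta * r - gamma * s
            for q, r, s in zip(query, cr, ci)]
```

With the default weights, one relevant document pulls the query three-quarters of the way toward it, while irrelevant documents push it away only weakly.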

Problems of Evaluation of RF in Multimedia Retrieval

Many different communities have conducted research on the efficacy of relevance feedback in information retrieval systems. Unlike text IR, performance evaluation of multimedia IR systems tends to conform to the accepted standards of the community within which the work is conducted. This leads to idiosyncratic performance evaluations and hampers the ability to compare different techniques fairly. In general, relevance feedback research has been undertaken by:

Computer Vision groups [Rui 1998][Porkaew 1999],
Text-IR groups [Rocchio 1971][Williamson 1978],
Data Mining groups [Ishikawa 1998][Wu 2000][Kim 2003],
HCI & Psychology groups, etc.

These groups follow different traditions and apply different evaluation standards, which makes it hard to study the relations among their approaches and to compare their performance fairly. We also seldom see sustained evaluation work that tests ideas in a practical environment, whether by the same group or by others. This makes it extremely difficult to assess the usability of these approaches for real applications. We now list several common evaluation problems that appear in the published literature on relevance feedback evaluation in multimedia retrieval.

  • Lacking a Common Testbed
    For example, the COREL image collection is commonly employed to evaluate the performance of CBIR (Content-Based Image Retrieval) systems. However, "Every evaluation is done on a different image subset thus making comparison impossible." [Muller 2003]. Moreover, even when the same document collection is used, different groundtruth may be employed. The groundtruth can be acquired by:

    (1) Purely manual labeling, such as the COREL groundtruth used in [Kim 2003]
    (2) Pooling followed by manual labeling, such as TRECVid
    (3) Automatic judgment by a reference retrieval system, such as the MARS evaluation work in [Rui 1998] [Porkaew 1999]
    (4) Semi-automatic judgment via system-assisted human labeling, such as the MiAlbum evaluation work in [Liu 2001]

  • Unverified Assumptions
    Each algorithm is proposed under certain assumptions. For example, re-weighting [Rui 1998] requires that "ellipse" queries exist, and MindReader [Ishikawa 1998] requires that diagonal queries exist. In the MARS work [Rui 1998] [Porkaew 1999], groundtruth is generated by an arbitrary distance function that simulates an "ellipse" query. It is not astonishing that re-weighting outperforms the Rocchio method in such an environment. But we still do not know how well the proposed technique works in a real multimedia retrieval application, since we do not know how often such "ellipse" queries actually occur. The performance evaluation therefore tells us very little (since it reports a high-probability event).

  • Without Rank Normalization
    The retrieval result is usually presented as a ranked list, and normalization issues arise when we compare such results directly. On one hand, when we compare different feedback approaches, some (such as the fuzzy-OR merge [Wu 2000]) automatically shift the feedback examples to the head of the result list, while others (such as the Rocchio method [Rocchio 1971]) do not. Comparing them directly is unfair because any approach can shift the feedback examples in post-processing. On the other hand, when we compare different feedback iterations, it is unclear how much of the performance improvement comes from the user and how much from the retrieval system, since the training samples also take part in the evaluation. Rank normalizations, such as "fluid" and "frozen" [Williamson 1978], are techniques for dealing with these problems. Without rank normalization, we would exaggerate the improvement of some feedback approaches. Although rank normalization is generally accepted in the text IR community, it is often neglected in multimedia retrieval.

    Click here to see our implementation of rank normalizations.
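As a rough illustration of the idea (not our actual implementation), a "frozen"-style normalization can be sketched in Python: judged documents keep their positions from the initial ranking, and only the unjudged documents are re-ordered according to the refined ranking, so judged samples cannot inflate the measured improvement. The function name is hypothetical, and the sketch assumes both rankings cover the same document set.

```python
def frozen_rank(initial, refined, judged):
    """Frozen-rank normalization sketch: documents in `judged` stay at
    their positions in the `initial` ranking; the remaining positions
    are filled with unjudged documents in `refined` order.
    Assumes `refined` contains every unjudged document of `initial`."""
    # unjudged documents in the order the refined ranking proposes
    unjudged = iter(d for d in refined if d not in judged)
    return [d if d in judged else next(unjudged) for d in initial]
```

For example, if the user judged `a` and `b`, those two stay put while the rest of the list adopts the refined order.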

  • Improper Baseline
    This is analogous to proposing a modification of the QuickSort algorithm and comparing it against BubbleSort instead of against QuickSort itself. [Kim 2003] is an example that misuses the baseline in its comparison.

  • System-oriented vs. User-oriented
    Some evaluations employ real human users to judge relevance, such as TREC's HARD track. Others use a machine to judge relevance automatically against pre-acquired groundtruth. Machine-based evaluation tends to assume a perfect user (one who never makes a judgment mistake), so the performance gain is usually exaggerated. In our experience, a human user achieves only about 80% accuracy in document-level relevance judgment [Jin 2005]. This can lead to unfair comparisons in which a claimed advanced technique actually makes it harder for the user to judge correctly [Jin 2005].
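The gap between the perfect-user assumption and real judgment behavior can be simulated with a simple user model that flips the true label with some probability. The ~80% accuracy figure is from [Jin 2005]; the flip model itself and the function name are illustrative assumptions, not part of AFTER.

```python
import random

def noisy_judge(true_relevance, accuracy=0.8, rng=None):
    """Illustrative imperfect-user model: return the true binary
    relevance (0 or 1) with probability `accuracy`, otherwise flip it.
    Pass a seeded random.Random as `rng` for reproducible runs."""
    rng = rng or random.Random()
    return true_relevance if rng.random() < accuracy else 1 - true_relevance
```

Plugging such a model into an otherwise automatic evaluation gives a rough sense of how robust a feedback algorithm is to judgment mistakes.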

The AFTER Project

The motivation of the AFTER project is to build a general evaluation environment for relevance feedback study, so that approaches can be compared fairly across different testbeds. The AFTER environment will integrate several prevailing sampling algorithms and refinement algorithms. Several evaluation metrics and rank normalizations will be implemented. A content-based image retrieval (CBIR) testbed and a text-IR testbed will be used and compared as test collections, and new testbeds can easily be added to the environment. Instead of providing a compiled binary, we provide a set of Java APIs; a main class is also provided to run the demo evaluation.

Assumptions and Restrictions

  • Document-level Feedback
    In general, the relevance question can be asked at different granularities: a term, a passage, an example document, or even a group of documents. In the following, we focus on document-level relevance feedback. This is because, for multimedia retrieval, it is hard to give a general interpretation of what a "term" or a "passage" means; the diversity of media formats would make such techniques neither general nor comparable. Moreover, most testbeds today only have document-level groundtruth, which makes it hard to apply automatic evaluation to non-document-level relevance feedback.

  • Distance-based IR
    Most current IR systems are distance-based (or dissimilarity-based). In a distance-based IR system, both the documents and the queries are represented as points in some space, and a distance is defined for each pair of points. The search process can be interpreted as finding the nearest neighbors of the query point in this space, and the search result is usually output as a ranked list. Even when the retrieval system is treated as a black box whose internal implementation of the retrieval space is not accessible, we can still extract the pairwise distances in the following way. First, we issue a document as a query (assuming the retrieval system supports a query-by-example interface) and get the search results. Then each document's distance to the query document can be estimated from the corresponding similarity (if it is output as part of the search result) or directly from the rank. Documents that do not appear in the retrieved set can simply be assigned an arbitrarily large distance score.

    The AFTER environment assumes that the base IR system, over which the relevance feedback algorithms are implemented, is distance-based, and that nothing can be extracted from the base IR system other than the pairwise distances. For example, we cannot extract feature vectors from the base IR system for later feedback purposes. This assumption is extremely useful for implementing relevance feedback algorithms externally rather than internally, when the base IR system is sealed like a black box and its internal representation cannot be accessed from outside.
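Under this black-box assumption, the distance-extraction step just described might look like the following sketch. The function name and the 1 − similarity mapping are illustrative assumptions (the mapping presumes similarities in [0, 1]); when no similarity is reported, the rank itself serves as a distance surrogate.

```python
from collections import defaultdict

def distances_from_result(result, default=1e9):
    """Estimate each document's distance to a query from one ranked
    search result. `result` is a list of (doc_id, similarity) pairs
    sorted by decreasing similarity; `similarity` may be None when the
    system reports only ranks. Documents absent from the list get an
    arbitrarily large default distance."""
    dist = defaultdict(lambda: default)
    for rank, (doc_id, sim) in enumerate(result):
        # prefer the similarity score when available, else fall back to rank
        dist[doc_id] = 1.0 - sim if sim is not None else float(rank)
    return dist
```

Looking up a document that never appeared in the result simply yields the large default distance.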

  • Binary Judgment
    Although the user could express the degree to which a document is relevant, we simplify the relevance judgment to a binary process: a judged document is either relevant or irrelevant. We make this simplification because multi-level judgment requires much more user effort and is difficult to implement; moreover, most current evaluation testbeds are based on binary groundtruth.

What data is needed?

To avoid the complexity of dealing with the different interfaces of retrieval systems, relevance feedback is implemented in an external manner, and intermediate results are needed to simulate the relevance feedback process. The following data should be provided to the AFTER environment.

  • The Initial Retrieval Result
    For each query, an initial search result is needed. The result should be provided as a ranked list in the following format. The list does not need to be complete, i.e., a truncated list is also acceptable. The list is sorted in descending order of document similarity to the query.

    #rank (start from 0) doc-ID similarity

    Each query has its own initial search result file, named by its unique query-ID. A list of query-IDs should also be provided in a multi-line plain text file.

  • The Channels
    The channels provide pairwise distance information among documents; each channel can be interpreted as the search result for an example document. First, the top K documents of each initial search result are combined to form a pool. Each document in the pool is then issued as a query to search for its neighbors. The search result is in the same format as the initial retrieval result file, but this time it is named after the query example's unique doc-ID. This part can be tricky in terms of computational cost. For our CBIR testbed, we set K to its maximum of 3400, and all image documents were issued as query examples independently. For the TREC testbed, however, the whole database contains 1M documents in total, so it is impossible to run at full scale; we take K to be 300 to generate the channels. On the other hand, channel generation is fully independent, so the computations can be run in parallel; a computational cluster will no doubt linearly boost the computational speed.
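The pooling step can be sketched as follows (the function name is illustrative; each pooled document would then be issued as a query to produce its channel file):

```python
def build_pool(initial_results, k):
    """Combine the top-K documents of every initial search result into
    a single pool of example documents. `initial_results` is a list of
    ranked doc-ID lists, one per query; duplicates across queries are
    merged by the set."""
    pool = set()
    for ranked in initial_results:
        pool.update(ranked[:k])
    return pool
```

Since each pooled document's channel is computed independently, the per-document searches parallelize trivially.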

  • Groundtruth
    The groundtruth should be provided in the classic four-column TRECEval format.

    #Query-ID Q0 #doc-ID 0/1
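A reader for this four-column format might be sketched as follows (the function name is hypothetical; it assumes whitespace-separated columns and binary 0/1 relevance, per the format line above):

```python
def load_qrels(lines):
    """Parse four-column groundtruth lines of the form
    'query-ID Q0 doc-ID relevance' into a nested dict:
    qrels[query_id][doc_id] -> 0 or 1. Malformed lines are skipped."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        qid, _, doc_id, rel = parts
        qrels.setdefault(qid, {})[doc_id] = int(rel)
    return qrels
```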

What will AFTER generate?

After setting up an .ini file specifying the system parameters, a report is generated automatically. To generate the graphs and the report, gnuplot and LaTeX must be installed; the final evaluation results are produced as a .tex file, which requires LaTeX to compile. We plan to include the following items in the report:

  • Baseline performance
    The initial retrieval performance. Different evaluation metrics can be chosen.
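One such metric, non-interpolated average precision over a single ranked list, can be sketched as follows. This is an illustrative sketch; the metrics actually implemented in AFTER may differ.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query: the mean of
    precision@k taken at every rank k where a relevant document
    appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0
```

Averaging this value over all queries gives the familiar mean average precision (MAP).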

  • System performance after relevance feedback
    The refined retrieval performance. Variations include sampling strategies, refinement algorithms, user models (how the user chooses relevant documents), and rank normalization methods.

  • Dataset analysis
    Query difficulty analysis and relevant-document distribution analysis.

  • Conclusions
    Conclusions about which feedback algorithms and parameter settings are suitable for this application (dataset and retrieval system).

Future Work

Currently, it is not feasible for us to generate all possible channels for a large testbed. For example, in the TREC collection, 1M documents would yield 1M channels, which is intractable. At the current stage, AFTER only allows a single feedback iteration. In the future, we will extend the framework to handle multiple iterations. One possible solution is to use a larger K to form the pool and, when selecting candidate feedback examples, simply exclude documents outside the pool. Another possible solution is to allow more complex inter-system interactions: the retrieval system runs as a service, and the AFTER environment calls it to request results on demand. This framework would be much more complex and would also incur a much larger computational cost.