Change history:

V0.9, Aug 03, 2005, 3:03 PM



Since it is very costly to perform user-in-the-loop evaluation, the prevailing evaluation methodology for relevance feedback usually employs a machine-simulated user instead of a real one. Under this methodology, the user is often assumed to be perfect, i.e., always consistent with the assessor who defined the groundtruth. However, this is hard to achieve in practice due to user interface constraints or refinement of the information need during the feedback loop. It is unknown how often such judging inconsistency occurs and how it affects the refined retrieval performance. Neglecting this problem may result in exaggerated performance gains and unfair comparisons among relevance feedback algorithms. The HARD (High Accuracy Retrieval from Documents) track of TREC (Text REtrieval Conference) gives us an opportunity to quantitatively analyze judging inconsistency and its impact on relevance feedback. We study several cases and find that the practical effectiveness of relevance feedback is comparable to pseudo-relevance feedback, although in theory it should perform much better. The work is reported in [TREC05] and [SIGIR06].




The Text REtrieval Conference (TREC) was started in 1992 as part of the TIPSTER Text program. Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies.


The goal of HARD is to achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context, through techniques such as passage retrieval and targeted interaction with the searcher. The HARD track first ran in TREC 2003. For this year's HARD track, please refer to the official HARD05 guidelines.



Corpus & Queries & Search Engine


The test collection for HARD05 is the AQUAINT collection. The whole collection contains 1,033,461 documents, and the average document length is 425.4 terms. All documents are indexed with Lucene's Snowball analyzer (with Porter stemming).


50 old/new topics are selected or constructed. They are listed here. The legal topic numbers are:

303, 307, 310, 314, 322, 325, 330, 336, 341,
344, 345, 347, 353, 354, 362, 363, 367, 372,
374, 375, 378, 383, 389, 393, 394, 397, 399,
401, 404, 408, 409, 416, 419, 426, 427, 433,
435, 436, 439, 443, 448, 622, 625, 638, 639,
648, 650, 651, 658, 689

These numbers can be entered in the textbox below to view the results of the CFs and runs.


Lucene is employed as our base search engine. A BM25 ranking is implemented on top of it. To make the re-ranking more efficient, we only re-rank the top 2000 results from Lucene's retrieval output. We believe this is sufficient, since most relevant documents (if found at all) will appear in this subset.
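The re-ranking step can be sketched as follows. This is a minimal illustration of Okapi BM25 over the top-2000 subset; the k1 and b values below are the common defaults, not necessarily the exact parameters used in our runs, and the helper names are our own.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document (a token list) against a bag-of-words query."""
    doc_len = len(doc_terms)
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Robertson-Sparck Jones IDF with +0.5 smoothing
        idf = math.log((num_docs - doc_freq[term] + 0.5) /
                       (doc_freq[term] + 0.5) + 1.0)
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * num / den
    return score

def rerank(query_terms, top_docs, doc_freq, num_docs, avg_doc_len):
    """Re-rank only the top retrieved documents (the top-2000 subset)."""
    return sorted(top_docs,
                  key=lambda d: bm25_score(query_terms, d, doc_freq,
                                           num_docs, avg_doc_len),
                  reverse=True)
```

Only the candidate subset is scored, so the cost per query is bounded regardless of collection size.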


Each TREC topic (title + desc + narr) is indexed, with terms in the title counted as appearing three times. A standard TF*IDF (here TF is the term's raw frequency, and IDF is the log version of inverse document frequency) is used to rank the terms, and the top 20 are extracted to form a vector query. In order not to exploit information from the other topics, we use IDF statistics from the TREC-3 topics instead, which still assigns very low importance to terms such as "document", "relevant", and "irrelevant".
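The term-weighting step above can be sketched like this; the title-tripling and log-IDF follow the description, while the function name and the toy IDF table are our own illustration.

```python
from collections import Counter

def topic_query_terms(title, desc, narr, idf, k=20):
    """Rank topic terms by TF*IDF, counting title terms three times,
    and keep the top-k as the vector query.

    title/desc/narr are lists of (stemmed) tokens; idf maps term -> log-IDF.
    """
    tf = Counter(title * 3 + desc + narr)
    scored = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

A low-IDF term like "document" gets a near-zero score and falls out of the top-k automatically.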


We implement the vector-query search in Lucene as follows:

For example, for Topic 307 ("ELECTRIC PROJECTS"), the actual query is:

project^1.0 hydroelectr^0.5200539566526735 propos^0.4160431653221388 locat^0.31203237399160405 exist^0.223208701414889 plan^0.223208701414889 against^0.2080215826610694 facil^0.2080215826610694 reservoir^0.2080215826610694 under^0.2080215826610694 construct^0.1818181818181818 minimum^0.1818181818181818 statement^0.1818181818181818 new^0.15601618699580203 acr^0.1040107913305347 call^0.1040107913305347 clear^0.1040107913305347 consequ^0.1040107913305347 decis^0.1040107913305347 dismantl^0.1040107913305347
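The boosted query string above can be produced directly from the weighted term vector using Lucene's `term^boost` syntax; `to_lucene_query` below is a hypothetical helper for illustration, not code from our actual system.

```python
def to_lucene_query(weighted_terms):
    """Render (term, weight) pairs as a Lucene boosted-term query string."""
    return " ".join(f"{term}^{weight}" for term, weight in weighted_terms)

# A few of the top-weighted terms for Topic 307:
query = to_lucene_query([("project", 1.0),
                         ("hydroelectr", 0.52),
                         ("propos", 0.416)])
# -> "project^1.0 hydroelectr^0.52 propos^0.416"
```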



Baseline Runs


We submitted two baseline runs, with run tags SAICBASE1 and SAICBASE2.


SAICBASE1 uses the BM25 re-ranked Lucene search result directly.


SAICBASE2 takes the top 30 documents from SAICBASE1; each document contributes its top 40 terms, and automatic query expansion is performed (blind feedback on the top 30 documents). The Rocchio method is used, and the initial query is omitted when forming the refined query (alpha equals 0). The purpose is to make the difference between the initial and refined searches more obvious.
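The expansion step can be sketched as below. With alpha = 0 and no negative documents, the Rocchio refined query reduces to the centroid of the feedback documents' top terms; the function name and the representation of documents as TF*IDF dicts are our own assumptions.

```python
from collections import Counter

def rocchio_expand(doc_term_vectors, terms_per_doc=40):
    """Blind-feedback expansion with Rocchio, alpha = 0: the refined
    query is the centroid of the feedback documents' top terms.

    doc_term_vectors is a list of {term: tf*idf weight} dicts, one per
    feedback document (here, the top 30 from the baseline run).
    """
    centroid = Counter()
    for vec in doc_term_vectors:
        top = sorted(vec, key=vec.get, reverse=True)[:terms_per_doc]
        for t in top:
            centroid[t] += vec[t]
    n = len(doc_term_vectors)
    return {t: w / n for t, w in centroid.items()}
```

Because the initial query is dropped entirely, the refined run reflects only what the feedback documents contribute, which is exactly what makes the initial/refined comparison sharper.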


The results are listed below; the top 25 retrieved documents are shown.



Clarification Forms with Assessor Judgment


There are two clarification forms: SAIC1 and SAIC2.


SAIC1 is the traditional relevance feedback user interface. First, we select the top 16 documents without duplication. Here we set a threshold on pairwise document similarity: if a document is very similar to an existing one, we skip it and continue with the following ones. After we get 16 documents, we keep only the odd-ranked ones to increase diversity. Each document's title, source, creation time, and abstract (less than 70 terms and within 3 sentences) are shown for the user to judge. Possible judgments are "relevant", "non-relevant", and "perhaps". Finally, we offer a choice asking how the user feels about the initial retrieval's effectiveness. Unfortunately, this option does not seem very useful, since for most topics "some are relevant" was selected.


SAIC2 is what we call "feedback by future". This is motivated by the fact that sometimes an irrelevant sample can be quite a good query for the information need; on the other hand, a relevant document may also retrieve many irrelevant ones. In this CF, we offer the same set of documents in the same order for the user to judge, but we provide different information: the document's neighborhood (i.e., what might happen if this document were fed back). From each document we extract its top 20 terms and issue a search, and the top 8 retrieved documents are combined to extract an abstract. This abstract is offered to the user for relevance judgment.
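The neighborhood preview can be sketched as below. For illustration we score neighbors by simple term overlap rather than the actual Lucene search, and the function name is our own; in the real CF the returned documents' text would be combined into the abstract shown to the user.

```python
def neighborhood_preview(doc_terms, corpus, n_terms=20, n_docs=8):
    """Preview a document's neighborhood: use its top terms as a query,
    retrieve the closest documents, and return them so their text can be
    combined into an abstract for the user.

    doc_terms: the document's terms, assumed pre-ranked by weight.
    corpus: a list of documents, each a list of terms.
    """
    query = set(doc_terms[:n_terms])
    scored = sorted(corpus,
                    key=lambda d: len(query & set(d)),
                    reverse=True)
    return scored[:n_docs]
```

The user thus judges the *effect* of feeding back a document rather than the document itself, which is the point of "feedback by future".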


Input the topic number to view CFs



CF1 (with document's information)

CF2 (with document's neighbor's information)



As we expected, the assessor actually gives quite different judgments for the same document. Below are the comparisons of judgments for the listed candidate documents' IDs (not TREC IDs but their Lucene index IDs). 1.0 means the user judged the document relevant, -1.0 irrelevant, and 0 no opinion. We can see that many documents actually receive contradictory judgments from the same assessor.
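Judging inconsistency between the two forms can be quantified as below; the judgment vectors in the example are made-up illustrations in the 1.0 / -1.0 / 0 coding described above, not actual assessor data.

```python
def inconsistency_rate(judg_cf1, judg_cf2):
    """Fraction of documents judged in both forms (non-zero in both)
    that received opposite labels (+1.0 vs -1.0)."""
    both = [(a, b) for a, b in zip(judg_cf1, judg_cf2)
            if a != 0 and b != 0]
    if not both:
        return 0.0
    flips = sum(1 for a, b in both if a != b)
    return flips / len(both)

# Hypothetical judgments for five documents in CF1 and CF2:
rate = inconsistency_rate([1.0, -1.0, 1.0, 0, 1.0],
                          [1.0, 1.0, -1.0, 1.0, 1.0])
# -> 0.5: two of the four doubly judged documents flipped
```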




Final Runs


There are six final runs, named SAICFINAL1 to SAICFINAL6. They vary along two dimensions:


Generate from: CF1, CF2, or CF1 & CF2

Query refinement: standard or conservative



Evaluation & Comparison


The result for each run is listed below; the top 25 retrieved documents are shown.



Input topic number here


SAICBASE1 (without blind feedback)

SAICBASE2 (with blind feedback)


SAICFINAL1 (CF1, Standard)

SAICFINAL2 (CF2, Standard)

SAICFINAL3 (CF1, Conservative)

SAICFINAL4 (CF2, Conservative)

SAICFINAL5 (CF1&CF2, Standard)

SAICFINAL6 (CF1&CF2, Conservative)