MP2—Retrieval Evaluations

In this assignment, we will study retrieval evaluation, building on the retrieval pipeline we built with the Lucene toolkit in MP1. This assignment must be completed by each student individually.

The assignment is composed of the following parts consisting of 100 total points:

  • Evaluation functions (40 points)
    • Precision at k
    • Mean Average Precision
    • Mean Reciprocal Rank
    • Normalized Discounted Cumulative Gain
  • Questions (60 points)

The 40 points for the evaluation functions are for completing the coding correctly; you also need to explain your implementation in your report (paste the modified code sections).

The 60 points are for correct, justified answers to the questions at the end.

Please download the provided sample project here, inside which you will find all the necessary files. You have to use the provided Lucene library, because the latest version of Lucene might not be compatible with our provided sample implementation.

Running Lucene

Creating an index

There is a small data set distributed with this assignment. It is the NPL dataset of physics paper titles located in the data/ folder in the root of the project.

Two different main functions are provided in the runner.Main.java file: one lets you interact with the retrieval pipeline and search in the built index, and the other helps you evaluate the ranking functions.

Searching in the index

Keep in mind that the documents you are searching are physics paper titles. We provide two standard retrieval functions to compare against in this assignment:

--ok     Okapi BM25
--tfidf  TFIDF Dot Product

To control the total number of returned documents, you can change defaultNumResults in edu.virginia.cs.index.SearchQuery.java. By default, the searcher returns at most 50 ranked documents.
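
For example, if defaultNumResults is a plain integer field (an assumption about how SearchQuery.java is laid out; check the actual file), raising the cap is a one-line change:

    // In edu.virginia.cs.index.SearchQuery.java (hypothetical declaration):
    // increase this if you want more than 50 ranked documents back per query
    public static int defaultNumResults = 100;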

Search Evaluation Functions

You need to implement the required search evaluation functions in edu.virginia.cs.evaluator.Evaluate.java. A partial sample implementation of Mean Average Precision is given in this file; you need to complete it according to the comments.

Based on this sample implementation, you need to implement Precision at k (P@K), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).
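
For reference, here is a rough standalone sketch of the per-query quantities behind P@K and MRR. The argument types (a ranked list of document ids plus a set of relevant ids) are an assumption made for illustration; the real Evaluate.java works with its own result objects, so adapt the logic rather than copying it verbatim.

    import java.util.List;
    import java.util.Set;

    // Standalone sketch of the per-query quantities; MRR is simply the mean of
    // reciprocalRank over all queries, just as MAP averages average precision.
    public class EvalSketch {

        // Precision at k: fraction of the top-k results that are relevant.
        public static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
            int cutoff = Math.min(k, ranked.size());
            int hits = 0;
            for (int i = 0; i < cutoff; i++) {
                if (relevant.contains(ranked.get(i))) {
                    hits++;
                }
            }
            return (double) hits / k; // divide by k even if fewer than k results came back
        }

        // Reciprocal rank: 1 / (rank of the first relevant result), or 0 if none is retrieved.
        public static double reciprocalRank(List<String> ranked, Set<String> relevant) {
            for (int i = 0; i < ranked.size(); i++) {
                if (relevant.contains(ranked.get(i))) {
                    return 1.0 / (i + 1);
                }
            }
            return 0.0;
        }
    }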

In this assignment, we will only evaluate Precision at 10 (P@10) and NDCG@10. We will implement DCG according to the following formula: $$DCG@k = \sum^k_{i=1} \frac{2^{rel_i}-1}{\log_2 (1+i)}$$

Although we only provide binary relevance judgments in this assignment, NDCG can still be applied: the gain $2^{rel_i}-1$ is simply 1 for relevant documents and 0 for non-relevant ones.
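
The sketch below, in the same standalone style as the one above, computes NDCG@k by normalizing DCG@k against the DCG of an ideal ranking in which all relevant documents appear at the top. As before, the argument types are assumptions for illustration, not the real Evaluate.java signatures.

    import java.util.List;
    import java.util.Set;

    // Standalone sketch of NDCG@k with binary relevance, following the DCG formula above.
    public class NdcgSketch {

        public static double ndcgAtK(List<String> ranked, Set<String> relevant, int k) {
            double dcg = 0.0;
            int cutoff = Math.min(k, ranked.size());
            for (int i = 0; i < cutoff; i++) {
                // i is 0-based here, so rank = i + 1 and the discount is log2(1 + rank).
                int rel = relevant.contains(ranked.get(i)) ? 1 : 0;
                dcg += (Math.pow(2, rel) - 1) / (Math.log(i + 2) / Math.log(2));
            }
            // Ideal DCG: every relevant document ranked at the top, truncated at k.
            double idcg = 0.0;
            for (int i = 0; i < Math.min(k, relevant.size()); i++) {
                idcg += 1.0 / (Math.log(i + 2) / Math.log(2)); // 2^1 - 1 = 1 for binary relevance
            }
            return idcg == 0.0 ? 0.0 : dcg / idcg;
        }
    }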

To test your retrieval functions, use the main functions provided in runner.Main.java. Note the option for selecting a specific retrieval function; it is the same option as in the interactive search engine.

For each query, the corresponding retrieval results and performance are printed. Once you have implemented the required evaluation functions, you should get a MAP of at least 0.10 for each of the provided standard ranking algorithms.

Questions

  1. Copy and paste your implementation of each evaluation function into your report, together with the corresponding final MAP/P@10/MRR/NDCG@10 performance you get from each ranking function. (40pts + 20pts)

  2. In edu.virginia.cs.index.SpecialAnalyzer.java, we define a special document analyzer to process documents and queries for retrieval purposes. Basically, we build a pipeline with the filters LowerCaseFilter, LengthFilter, StopFilter, and PorterStemFilter. Please disable some of the filters, e.g., stopword removal or stemming, and test the new analyzer with the BM25 model (a sketch is given after this question list). What is your conclusion about the effect of the document analyzer on retrieval effectiveness? (20pts) Note: the analyzer has to be used at both indexing time and query time!

  3. In question 1, we compared the ranking models BM25 and TFIDF only by the mean values of the four evaluation metrics. As we already know, a statistical test is necessary when we only have a small evaluation data set (93 queries in our case). Compute and report the p-values from the paired t-test and the Wilcoxon signed-rank test for the comparison over all four metrics (a sketch is given after this question list). Based on your statistical test results, which is the better ranking algorithm? (20pts) Note: you do not need to implement these tests yourself; you can use any Java/Python/Matlab implementation and just prepare the required input for it.
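
For question 2, one approach is to make a copy of SpecialAnalyzer with some stages left out and use it for both indexing and searching. The class below is only a sketch: it assumes a Lucene 4.x-style API in which tokenizers and many filters take a Version argument, and it keeps only whitespace tokenization and lowercasing (so LengthFilter is also dropped in this particular variant). If the provided library's constructors differ, mirror the calls the existing SpecialAnalyzer already makes and simply comment out the filters you want to disable.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.util.Version;

    // Sketch: variant of SpecialAnalyzer with StopFilter and PorterStemFilter disabled,
    // so the index keeps stopwords and unstemmed word forms. Remember to rebuild the
    // index with this analyzer and to use it at query time as well.
    public class NoStopNoStemAnalyzer extends Analyzer {

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Use the Version constant matching the provided Lucene library.
            Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, reader);
            TokenStream stream = new LowerCaseFilter(Version.LUCENE_46, source);
            // StopFilter and PorterStemFilter are intentionally omitted here.
            return new TokenStreamComponents(source, stream);
        }
    }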

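For question 3, one off-the-shelf option in Java is Apache Commons Math 3 (treating it as an added dependency is an assumption; the question equally allows Python or Matlab, e.g. scipy.stats.ttest_rel and scipy.stats.wilcoxon). In the sketch below, the hypothetical arrays bm25 and tfidf hold the per-query values of a single metric for the two ranking functions, in the same query order; run it once per metric, feeding in per-query scores rather than already-averaged values.

    import org.apache.commons.math3.stat.inference.TTest;
    import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

    // Sketch: paired significance tests on per-query metric values for BM25 vs. TFIDF.
    public class SignificanceSketch {

        public static void report(double[] bm25, double[] tfidf) {
            // Paired t-test p-value over the 93 per-query differences.
            double tTestP = new TTest().pairedTTest(bm25, tfidf);
            // Wilcoxon signed-rank p-value; "false" requests the normal approximation,
            // which is appropriate for 93 paired samples.
            double wilcoxonP = new WilcoxonSignedRankTest()
                    .wilcoxonSignedRankTest(bm25, tfidf, false);
            System.out.printf("paired t-test p = %.4f, Wilcoxon signed-rank p = %.4f%n",
                    tTestP, wilcoxonP);
        }
    }
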
Deadlines & How to submit

This assignment is due on April 14th at 11:59pm, so you have 14 days in total to finish this MP. The late policy for all our homework is discussed in the course syllabus.

The Collab assignment page has been created for this MP. Please submit your written report strictly following the requirements specified above. The report must be in PDF format.