Resources

Textbooks

The following books are for your reference. The first book is our required textbook.

Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze, Cambridge University Press, 2008.
Modern Information Retrieval (2nd Edition). Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 2011.
Search Engines: Information Retrieval in Practice. Bruce Croft, Donald Metzler, and Trevor Strohman, Pearson Education, 2009.
Statistical Language Models for Information Retrieval. ChengXiang Zhai, Morgan & Claypool Publishers, 2008.
Information Retrieval: Implementing and Evaluating Search Engines. Stefan Buttcher, Charlie Clarke, and Gordon Cormack, MIT Press, 2010.
Information Retrieval: Algorithms and Heuristics (2nd Edition). David A. Grossman and Ophir Frieder, Springer, 2004.
Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Richard K. Belew, Cambridge University Press, 2001.
Managing Gigabytes: Compressing and Indexing Documents and Images. I. Witten, A. Moffat, and T. Bell, Morgan Kaufmann, 1999.
Foundations of Statistical Natural Language Processing. C. Manning and H. Schutze, MIT Press, 1999.
Mining the Web: Analysis of Hypertext and Semi Structured Data (The Morgan Kaufmann Series in Data Management Systems). Soumen Chakrabarti, Morgan Kaufmann, 2002.
Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Bing Liu, Springer, 2006.

IR Courses at Other Universities

It is beneficial to be aware of how IR is taught at other top universities, especially by top researchers in the field. Here is a list of wonderful IR courses selected by the instructor.

UIUC: CS 410: Introduction to Text Information Systems, by Dr. ChengXiang Zhai. Disclaimer: the material for the basic IR concepts is borrowed from this course, but the organization and the content for the modern concepts differ. The instructor has prepared a list of example course projects from this UIUC course for your reference. IMPORTANT: all copyrights belong to their original authors.
CMU 11-741: by Dr. Jamie Callan and Dr. Yiming Yang.
Stanford CS276: Information Retrieval and Web Search, by Dr. Christopher Manning and Dr. Pandu Nayak.
UMass CS646: Information Retrieval, by Dr. James Allan.
Purdue CS-54701: Information Retrieval, by Dr. Luo Si.
UT Austin CS 371R: Information Retrieval and Web Search, by Dr. Raymond J. Mooney.

Top Conferences and Journals in the IR Field

The following list and comments represent only the instructor's personal opinion.

SIGIR: One of the most important and influential conferences in the IR field (it attracts more attention from academia). Its proceedings can be found here.
WWW: Another of the most important and influential conferences in the IR field (it attracts more attention from industry). Its proceedings can be found here.
WSDM: A new but quickly rising conference in the field, attracting attention from both industry and academia. Its proceedings can be found here.
CIKM: A major conference in the IR field. Its proceedings can be found here.
TOIS: One of the major journals in the IR field.
If you are interested in rankings or indices of those conferences and journals, you may take a look at Google Scholar's Metrics. IR is under this category.

IR Toolkits

Lucene (Apache)
Lemur & Indri (CMU/Univ. of Massachusetts)
Terrier (Glasgow)
MeTA (University of Illinois)
RankLib (A collection of learning-to-rank algorithms, University of Massachusetts Amherst)
General Information Retrieval Systems

NLP-related Resources

Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
Stanford NLP parser (Stanford University NLP group)
OpenNLP (Apache)
LingPipe (Java-based)
NLTK (Python-based)

Machine Learning Toolkits

Weka (A rich collection of machine learning algorithms, Machine Learning Group at the University of Waikato)
Mallet (An alternative package to Weka, developed by Andrew McCallum at University of Massachusetts Amherst)
LibSVM (A collection of SVMs, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University)
SVM-light (Another collection of SVMs, developed by Thorsten Joachims at Cornell University)
GraphLab (Large-scale machine learning package)
Mahout (Apache large-scale machine learning package)
Topic Models (David Blei's collection of various topic models)

Data Repository

TREC: a long-running IR conference for evaluating different tasks. Various IR tasks have been proposed, and the corresponding data sets are available.
- TREC Web, Terabyte & Blog Tracks: Queries and relevance assessments are available for these collections.
- Blog Track: Evaluation of retrieval methods for blog search.
- Enterprise Track: Evaluation of enterprise search.
- Million Query Track: Evaluation of a large variety of incompletely judged topics.
- Session Track: Evaluation of session search.
Twitter: Twitter data is currently open to the public. Tweet streams can be accessed via Twitter's APIs, and some crawled Twitter data sets are also available, e.g., the Stanford SNAP Twitter data set and the TREC Microblog collection.
Microsoft Learning to Rank Datasets: a collection of annotated queries and URLs for learning-to-rank studies, with a handful of practical IR ranking features.
LETOR: another collection of annotated data sets for learning-to-rank studies.
ClueWeb09: 1 billion web pages in ten languages that were collected in January and February 2009.
UCI Machine Learning Repository: a standard machine learning benchmark repository (a bit small and old).
AOL search log: a collection of around 20M web queries collected from about 650k users over three months on the AOL web search engine. There is a famous privacy leak scandal related to this search log (see wiki), which is one of the major reasons search engines do not share their search log data. There is also a search engine built to inspect users' privacy in this data set: Search-ID.
Yelp Dataset Challenge: A large set of Yelp reviews and entities provided by Yelp. Also, "If you are a student and come up with an appealing project, you’ll have the opportunity to win one of ten Yelp Dataset Challenge awards for $5,000."

Related Courses at UVA

CS 4720: Web and Mobile Systems
CS 4501: Introduction to Machine Learning and Data Mining

LaTeX

Here are the LaTeX files necessary to write the project report.
We want everyone to use the same format so we can grade each paper fairly.
Additionally, LaTeX is a skill we feel you should learn if you haven't already!
- Official LaTeX website: http://www.latex-project.org/
- TeX editors for Windows: WinEdt, LEd
- TeX editors for macOS: Texpad, Latexian
- Please share your favorite TeX editors or integrated solutions with the class via Piazza.