Resources

Textbooks

The following books are provided for your reference. The first book is our required textbook.

IR Courses at Other Universities

It is beneficial to be aware of how IR is taught at other top universities, especially by leading researchers in the field. Here is a list of excellent IR courses selected by the instructor.

Top Conferences and Journals in the IR Field

The following list and comments represent only the instructor's personal opinion.

  • SIGIR: One of the most important and influential conferences in the IR field (it attracts more attention from academia). Proceedings can be found here.
  • WWW: Another of the most important and influential conferences in the IR field (it attracts more attention from industry). Proceedings can be found here.
  • WSDM: A new but quickly rising conference in the field, attracting attention from both industry and academia. Proceedings can be found here.
  • CIKM: A major conference in the IR field. Proceedings can be found here.
  • TOIS: One of the major journals in the IR field.
  • If you are interested in rankings or indices of these conferences and journals, you may take a look at Google Scholar's Metrics. IR is under this category.

IR Toolkits

NLP-related Resources

Machine Learning Toolkits

  • Weka (A rich collection of machine learning algorithms, developed by the Machine Learning Group at the University of Waikato)
  • Mallet (An alternative package to Weka, developed by Andrew McCallum at the University of Massachusetts Amherst)
  • LibSVM (A collection of SVMs, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University)
  • SVM-light (Another collection of SVMs, developed by Thorsten Joachims at Cornell University)
  • GraphLab (Large-scale machine learning package)
  • Mahout (Apache large-scale machine learning package)
  • Topic Models (David Blei's collection of various topic models)

Data Repositories

  • TREC: a long-running IR conference organized around evaluation tasks. Various IR tasks have been proposed, and the corresponding data sets are available.
  • Twitter: Twitter data is currently open to the public. Tweet streams can be accessed via its APIs, and some crawled Twitter collections are also available, e.g., the Stanford SNAP Twitter data set and the TREC Microblog collection.
  • Microsoft Learning to Rank Datasets: a collection of annotated queries and URLs for learning-to-rank studies, with a handful of practical IR ranking features.
  • LETOR: another collection of annotated data sets for learning-to-rank studies.
  • ClueWeb09: 1 billion web pages in ten languages that were collected in January and February 2009.
  • UCI Machine Learning Repository: a standard machine learning benchmark repository (a bit small and old).
  • AOL search log: a collection of around 20M web queries collected from about 650k users over three months on the AOL web search engine. A famous privacy leak scandal is associated with this search log (wiki), and it is one of the major reasons search engines are reluctant to share their search log data. There is also a search engine built to inspect users' privacy in this data set: Search-ID.
  • Yelp Dataset Challenge: A large set of Yelp reviews and entities provided by Yelp. Also, "If you are a student and come up with an appealing project, you’ll have the opportunity to win one of ten Yelp Dataset Challenge awards for $5,000."

Related Courses at UVA

  • CS 4720: Web and Mobile Systems
  • CS 4501: Introduction to Machine Learning and Data Mining

LaTeX

  • Here are the LaTeX files necessary to write the project report.
  • We want everyone to use the same format so we can grade each paper fairly.
  • Additionally, LaTeX is a skill we feel you should learn if you haven't already!

General Advice on Computer Science Research