CS 6501: Text Mining Spring 2019 · CS@UVa

Resources

Textbooks

There is no official textbook for this course. The following books are for your reference.

Text Mining Courses at Other Universities and MOOCs

It is beneficial to be aware of how text mining is taught at other top universities, especially by the leading researchers in the field. Here is a list of excellent text mining courses selected by the instructor.

Top Conferences and Journals Related to Text Mining Research

The following list and comments represent only the instructor's personal opinion.

  • KDD: One of the most important and influential conferences in the field of data mining. Proceedings of publications can be found here.
  • SIGIR: One of the most important and influential conferences in the field of information retrieval (attracts more attention from academia). Proceedings of publications can be found here.
  • WWW: Another of the most important and influential conferences in the IR field (attracts more attention from industry). Proceedings of publications can be found here.
  • WSDM: A new but quickly rising conference in the field, attracting attention from both industry and academia. Proceedings of publications can be found here.
  • CIKM: A major conference in the fields of data mining and information retrieval. Proceedings of publications can be found here.
  • ACL: A major conference for computational linguistics research. A digital archive of research papers in computational linguistics is available at the ACL Anthology.
  • TOIS: One of the major journals in the fields of information retrieval and data mining.
  • If you are interested in rankings or indices of those conferences and journals, you may take a look at Google Scholar's Metrics.

Text Mining Toolkits

  • Lucene: Apache Lucene is a free, open-source information retrieval software library. While suitable for any application that requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching.
  • MeTA: MeTA is a modern C++ data sciences toolkit developed by the Timan group at the University of Illinois. It implements various text mining and machine learning algorithms.
  • RankLib (a collection of learning-to-rank algorithms, University of Massachusetts Amherst)
  • Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
  • Stanford NLP parser (Stanford University NLP group)
  • OpenNLP (Apache)
  • LingPipe (Java-based)
  • NLTK (Python-based)
  • Weka: A rich collection of machine learning algorithms, developed by the Machine Learning Group at the University of Waikato.
  • Mallet: An alternative package to Weka, developed by Andrew McCallum at the University of Massachusetts Amherst
  • LibSVM: A collection of SVMs, developed by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University
  • SVM-light: Another collection of SVMs, developed by Thorsten Joachims at Cornell University
  • GraphLab: Large-scale machine learning package
  • Mahout: Apache's large-scale machine learning package
  • Spark: A fast and general engine for large-scale data processing.
  • Topic Models (David Blei's collection of various topic models)

Data Repository

  • Twitter: Twitter is currently open to the public, and Twitter streams can be accessed via its APIs. Some crawled Twitter data sets are also available, e.g., the Stanford SNAP Twitter data set and the TREC microblog collection.
  • UCI Machine Learning Repository: a standard machine learning benchmark repository (a bit small and old).
  • Yelp Dataset Challenge: A large set of Yelp reviews and entities provided by Yelp. Also, "If you are a student and come up with an appealing project, you’ll have the opportunity to win one of ten Yelp Dataset Challenge awards for $5,000."

Related Courses at UVa

LaTeX

  • Here are the LaTeX files necessary to write the project report. You are required to use the "ACM Standard" or "ACM Large" format for your report.
  • We want everyone to use the same format so we can grade each paper fairly.
  • Additionally, LaTeX is a skill we feel you should learn if you haven't already!

Tips on Presentation

General Advice on Computer Science Research