CS 6501: Text Mining Spring 2019 · CS@UVa

Course Project

The course project gives students hands-on experience in solving novel text mining problems. The project thus emphasizes either research-oriented problems or "deliverables." Ideally, the outcome of your project will be either publishable, e.g., your (unique) solution to some (interesting/important/new) problem, or tangible, e.g., some kind of prototype system that can be demonstrated. Teamwork is required.

General steps

  • Pick a topic
  • Form a team
  • Survey related work
  • Write a project proposal
  • Work on the project
  • Write a report
  • Present the project

Your project will be graded based on the following required components:

  • Project proposal (20%)
    • State your motivation, research problem, and expected outcome of your course project.
    • Discussion with instructor prior to deadline is encouraged.
    • Submit a proposal of at most two pages in the required LaTeX template.
    • Due by the end of the 4th week.
  • Project presentation (40%)
    • A 10-minute presentation on what you have done for this course project, plus at most 2 minutes for questions. The format can be tailored to the nature of the project, e.g., a slide presentation and/or a system demo.
    • Performance will be graded by both instructor and peer students (no self-grading).
  • Project report (40%)
    • A detailed written report of your project.
    • The quality requirement is the same as for research papers, i.e., formal written English and a rigorous paper format.
    • At most six pages (with unlimited references) in the required LaTeX template.
    • Due in the last week of the course, before the project presentation.

An official rubric for the final report and rubric for the project presentation are provided for your reference.

Note that you are required to use the provided templates for your project proposal and final report. See the Resources page for the template and example file. Please name your submitted document "CompID[-CompID]+-Proposal.PDF" or "CompID[-CompID]+-Report.PDF" accordingly, where "CompID[-CompID]+" refers to the list of your group members' computing IDs. Each team needs to provide only one submission on Collab; unless specifically required, the same grade will be applied to all team members.
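As an illustration of the naming rule above, here is a minimal Python sketch that assembles a file name from a team's computing IDs. The helper name `submission_name` and the sample IDs are hypothetical; the text does not specify the ID format, so the function only joins whatever IDs it is given.

```python
def submission_name(comp_ids, kind):
    """Build a submission file name such as 'id1-id2-Proposal.PDF'.

    comp_ids: list of the team members' computing IDs (hypothetical values below).
    kind: either 'Proposal' or 'Report'.
    """
    if kind not in ("Proposal", "Report"):
        raise ValueError("kind must be 'Proposal' or 'Report'")
    if not comp_ids:
        raise ValueError("at least one computing ID is required")
    # IDs are joined with hyphens, followed by the document kind and extension.
    return "-".join(comp_ids) + f"-{kind}.PDF"


# Example with made-up computing IDs.
print(submission_name(["abc1d", "xy2z"], "Proposal"))
```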

Pick a topic

You can either pick from a list of sample topics provided by the instructor or choose your own topic. You are encouraged to start thinking about the topic for your course project from the first day of class and to discuss it with your fellow students. This is a good way to identify opportunities for collaboration.

Leveraging existing resources is especially encouraged, as it allows you to minimize the amount of work that you have to do and focus on developing your own ideas.

When picking a topic, try to ask yourself the following questions:

  • What exactly is the (research) problem that you want to solve? Will it matter if nobody realizes this problem exists?
  • What kind of difference could your project make to others?
  • Is there any existing alternative? If so, why do you still want to do it? How is your idea different from theirs? Would people appreciate the difference?
  • What would be the major challenge(s) in this problem? Do you have any specific background or resources to solve the identified problem?
  • What is the minimum goal to be achieved during this semester? (Try to drop everything non-essential and only keep the part that is truly novel.)
  • How do you plan to demonstrate that the method to be developed indeed solves the problem? Empirical experimentation and/or a demo is required, unless you are doing purely theoretical work.

Keep in mind, you are required to address the above questions in your project proposal and final report.

Form a team

You are required to work with other students as a team. Teams may consist of up to four students in total; three students per team is recommended. Teamwork not only gives you some experience in working with others, but also allows you to work on a larger (presumably more important) topic.

Note that it is your responsibility to figure out how to contribute to your group project, so you will need to act proactively and in a timely manner if your group leader has not assigned a task to you. The instructor will assume that all team members contribute actively to the project, and the same grade will be applied to all group members (unless special treatment is requested by the group members).

Survey related work

While choosing a topic, it is very important to be aware of whether the problem you would like to tackle has already been solved. If so, you need to figure out exactly where your novelty lies and whether that novelty brings any benefit to others. Your goal is to go beyond, rather than simply duplicate, the existing work. To minimize your effort, you are encouraged to leverage existing algorithms, toolkits, and other useful resources as much as possible. The instructor can also help you check related work. Please feel free to discuss your plan with the instructor before finalizing your proposal.

Write a project proposal

You are required to write a two-page proposal before you actually go in depth on a topic. In the proposal, you should address the following questions and include the names of all the team members as authors. The order of the authors' names does not matter.

  • What is the problem identified in the project?
  • Why is this problem important?
  • Is there any related work? How different is your idea from theirs?
  • What techniques/algorithms will you use/develop to solve the problem?
  • How will you evaluate your work?
  • What are the potential contributions of this work?

Intuitively, the proposal should read like the introduction of a regular research paper. Briefly state the background/motivation, what has been done, what is missing, how you plan to solve the problem, and how you plan to demonstrate the usefulness of your method, then summarize your contribution(s).

Work on the project

You should leverage existing tools and methods as much as possible. For example, consider using the Lucene toolkit for indexing and searching a large text corpus; the Stanford NLP parser or the OpenNLP toolkit for text analysis; and MALLET or WEKA for classification or clustering. There are also many tools available on the Internet. See the Resources page for some useful pointers. Discuss any problems or issues with your teammates or classmates. If you need special support, please let the instructor know.

Consider documenting your work regularly. This way, you will already have a lot of things written down by the end of the semester. In addition, we strongly suggest using version control for your project! Nothing is more frustrating than losing a lot of your hard work, especially if it's close to a deadline.

To help you better manage your time on the course project, every team is required to send an email to the instructor every month to briefly report its progress. In the email, please briefly summarize your achievements in the past month, the milestones you have reached, and your plan for the next month. Please feel free to discuss any difficulties and challenges you have encountered during the project with the instructor and TAs.

Present the course project

At the end of the semester, each project team is expected to present its project in class. The purpose of this presentation is to:

  • Let you know about others' projects.
  • Give you some opportunity to practice presentation skills, which are very important for a successful career in both academia and industry.
  • Obtain some feedback from others about your project.

In general, your presentation should be structured like a conference presentation. It should therefore touch on all of the following aspects (the text in parentheses states the instructor's expectation):

  • What is the background/motivation of your work? What research question will you address? (Learn how to attract public attention.)
  • Why is this problem important? (Learn how to persuade others.)
  • Is there any existing work? How novel is yours? (Learn how to sell your ideas.)
  • How did you solve this problem? (Learn how to deliver your solution.)
  • How good was your method? (Learn how to quantitatively/qualitatively evaluate your work.)
  • Any ideas for further improvement? (Learn how to look ahead.)

Think about how you can best present your work so as to make it as easy as possible for your audience to understand your main messages. Try to be concise and to the point. Pictures, illustrations, and examples are generally more effective than text for explaining your project. Try to show screenshots and/or plots of your experimental results. Watching some top conference presentations (e.g., KDD, SIGIR, ICML) on VideoLectures will be beneficial.

In order to be fair to all members in the same group, the instructor will randomly pick team members for question answering during the presentation.

Write a project report

You should write your report as if you were writing a regular conference paper. You should address the same questions as those you addressed in the proposal and presentation, only in more detail. Pay special attention to the challenges that you have solved and your detailed solutions. The basic sections of the report should be the same as those in a conference paper, e.g., abstract, introduction, related work, method, experiment, and conclusion. If you are developing a demo system or toolkit, your report should follow the format of a demo paper.

You are required to use LaTeX for your project report. See the Resources page for the template and example file. The project report must be at most six pages in the required template (there is no minimum length, as long as you feel it is sufficient to prove the merit of your work, and no page limit on the references).

Topics proposed by the instructor

  • Automatic tutor for English writing: Improving English writing skills is a significant challenge for non-native speakers. It can be stressful even for native speakers in specific scenarios, e.g., formal scientific writing. This project aims to develop automatic tools based on language models to polish an amateur's English writing. For example, language models trained on Twitter data could make an ordinary user's tweet look more like it was written by an experienced Twitter user, and language models trained on scientific publications could make an amateur's paper read like an expert's work. In Gmail, neural language models are used to help complete messages, but they can only suggest completions of a sentence, not revise the sentence entirely based on the context. Can we do better there?

  • Spatial-temporal analysis of opinions: Social opinions provide a gold mine for researchers to understand and explore the public's opinion towards a specific entity, e.g., products and celebrities, or a service, e.g., hotels and restaurants. This project aims to extend an existing aspect-based opinion mining system, ReviewMiner, to support spatial-temporal analysis of opinions. Specifically, we want to visualize the opinions: display the temporal dynamics of opinions across different entities (e.g., from a Twitter stream or reviews), render the opinions on a map, and support user interaction with such spatial-temporal analysis of opinions.

  • Temporal topic analysis: Documents generated over time, although possibly large in volume, are never independent of each other. Both temporal and textual information strongly manifest the underlying dependency structure of document streams. Effectively modeling and analyzing these unstructured document streams is becoming increasingly important for service providers to improve users' experience and maximize their service utility. This project focuses on developing a systematic solution to perform temporal analysis of topics in document streams and capture the temporal and semantic dependency among the documents.

  • Social influence vs. homophily: Users on Yelp write reviews about businesses and make friends who share similar tastes and preferences. However, it is unknown whether users become friends because they visited the same restaurants before (i.e., homophily), or they visited the same restaurants because they were friends (i.e., influence). Distinguishing these two factors is very important for social network based recommendations.

  • Query intent classification: Current product search systems can support only simple keyword search, e.g., "canon 5d3". It would be preferable if the system could support some simple semantic search, e.g., "cheap digital camera with high resolution." The system should be able to correctly map the specifiers "cheap" and "high" to the corresponding aspects of the product, e.g., price and image quality, and return all the results matching such criteria. One can imagine this as a translation process, and opinionated review documents provide a good resource for estimating such a translation model.

  • Active learning for sequential text labeling: Manually annotating text documents for supervised machine learning is generally time-consuming and expensive. The situation becomes even worse for sequential text labeling, e.g., part-of-speech labeling and named entity recognition. However, the availability and quality of manual labels directly limit the effectiveness of the learnt models. Active learning is a natural remedy for this challenge. Traditional work in active learning mostly focuses on simple learning tasks, e.g., multi-class classification or regression, while little attention has been paid to structured prediction problems, e.g., sequential text labeling. Instead of selecting a whole sequence for labeling, can we actively label only a subsequence of the input to improve model training? How should a structured prediction model be updated when only partial labeling is available?

  • Learning a text classifier with unreliable annotations only: Oftentimes, complete and fully trustworthy manual annotations are hard to obtain, but partial and noisy annotations, e.g., those referred to as weak or remote supervision, can be easily acquired at scale. How to model and take advantage of such weak supervision is an important and emerging research topic. In text mining, especially when handling social media data, being able to handle weak supervision becomes extremely important. How can we identify the reliability of the weak annotations and model the dependency between the weak and true labels? If we can acquire the true labels on the fly, how should we design the query strategy to best improve the classifier over time?
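To make the subsequence question in the active learning topic above more concrete, the following Python sketch shows one possible (and deliberately simple) selection heuristic: given per-token label distributions from some sequence model, pick the contiguous window of tokens with the highest total entropy and request labels only for that window. This is an illustrative assumption, not a method prescribed by the course; the function names and the example probabilities are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain_span(marginals, width):
    """Return (start, end) of the contiguous window with the highest total entropy.

    marginals: list of per-token label distributions (assumed to come from
               some sequence labeling model; made up in the example below).
    width:     number of consecutive tokens to send for annotation.
    The returned span is half-open: tokens start .. end-1.
    """
    if width <= 0 or width > len(marginals):
        raise ValueError("width must be between 1 and the sequence length")
    entropies = [token_entropy(p) for p in marginals]
    # Sliding-window sum over token entropies.
    window = sum(entropies[:width])
    best, best_start = window, 0
    for start in range(1, len(entropies) - width + 1):
        window += entropies[start + width - 1] - entropies[start - 1]
        if window > best:
            best, best_start = window, start
    return best_start, best_start + width


# Hypothetical two-label marginals for a four-token sentence.
example = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.99, 0.01]]
print(most_uncertain_span(example, 2))
```

Total entropy is only one of many possible uncertainty scores; how to score and update a structured model under such partial labels is exactly the open question the topic poses.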

Peer-evaluation website

We will use the same evaluation system page for peer evaluation of the project presentations. Please note that you will not evaluate your own presentation, so do not be surprised if you cannot find your name in the evaluation system.

Presentation Schedule

We will use the following schedule and presentation order for our project presentations on April 30th and May 2nd.

Name | Date | Project Title
Xinzuo Wang, Jiayang Liu and Hao Gu | April 30 | Hotel Recommendation Based on Opinion Analysis
Wanyu Du and Xinyu Yang | April 30 | Alter the style of the text: A neural style transfer model
Austin Chen, Quinn Dawkins, Danial Hussain and Jihyeong Lee | May 2 | Predicting Lines In Movie Scripts
Zheng Chen, Yumeng Jiang, Runze Yan and Yingying Chen | May 2 | Yelp Recommendation System Based On Sequence Tagging
Jinyu Chen, Runnan Yang, Xiaoxi Lin and Jie Yang | May 2 | Personalized Recommendation System for Restaurant
Yu Du and Haochuan Zhang | May 2 | Restaurant Recommendation System Based on Yelp
Chuanhao Li, Ruizhong Miao, Rongrong Liu and Mengyu Gong | May 2 | Recommendation System Using Knowledge Graph
Wen Ying and Teng Li | May 2 | Query Intent Classification In Online-shopping
Yinqiao Xiong, Aobo Yang, Zixi Qi and Kechen Liu | May 2 | Text Summarization System for Articles
Anna Baglione and Abraham Gebru Tesfay | May 2 | Mining Text From Twitter Users To Identify Political Affiliation
Akanksha Nichrelay and Arjun Malhotra | May 2 | Detecting questions with same intent in Question-Answering Platforms
Guangxu Xun, Mengdi Huai, Jianhui Sun and Kishlay Jha | May 2 | Knowledge-Base Enriched Word Embeddings for Biomedical Domain
Yichen Jiang, Wen Ding, Shenghao Ye and Xiang Guo | May 2 | Emoji Usage Prediction and Its Application in Sentimental Analysis
Andrea Zhang, Eamon Collins, Mike Song and Monique Mezher | May 2 | Cross-lingual Sentiment Comparison in Wikipedia
Tyler Handley, Hunter Murphy, Sile Shu and Rachel Wicks | May 2 | Word Sense Disambiguation for Double Meanings