CS 6501: Text Mining Spring 2019 · CS@UVa

This assignment is designed to help you get familiar with basic document representation and analysis techniques. It consists of two parts, totaling 100 points.

The deadline and submission guidance are described at the end of this page.

Data Set

The instructor has prepared a small collection of Yelp restaurant reviews (38,688 review documents for training and 19,803 for testing). The data set can be downloaded here.

The files are named and organized in the following manner:

  1. Each file contains all the review documents for a specific business on Yelp, and it is named by its unique ID on Yelp, e.g., FAhx3UZtXvqNiEAd-GNruQ.json;
  2. All the files are in json format. Each json file contains a json array of reviews ('Reviews') and a json object about the information of the restaurant ('RestaurantInfo').

    2.1 The json object for user review is defined as follows:

    {
        "Author": "author name (string)",
        "ReviewID": "unique review id (string)",
        "Overall": "numerical rating of the review (float)",
        "Content": "review text content (string)",
        "Date": "date when the review was published",
        "Author_Location": "author's registered location"
    }
    

    2.2 The json object for restaurant info is defined as follows:

    {
        "RestaurantID": "unique business id in Yelp (string)",
        "Name": "name of the business (string)",
        "Price": "Yelp defined price range (string)",
        "RestaurantURL": "actual URL to the business page on Yelp (string)",
        "Longitude": "longitude of the business (double)",
        "Latitude": "latitude of the business (double)",
        "Address": "address of the business (string)",
        "ImgURL": "URL to the business's image on Yelp (string)"
    }
    

WARNING: some collected json files might not strictly follow the above json object definitions, e.g., some fields may be missing or empty. As a result, you must handle exceptions properly when parsing the json files (see the sketch below).
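For example, in Python the parsing can be guarded as in the following minimal sketch; the function and folder handling are only illustrative, while the field names follow the definitions above:

import json
import os

def load_reviews(folder):
    """Load all review documents in a folder, skipping malformed entries."""
    reviews = []
    for filename in os.listdir(folder):
        if not filename.endswith('.json'):
            continue
        try:
            with open(os.path.join(folder, filename)) as f:
                data = json.load(f)
            for r in data.get('Reviews', []):
                if r.get('Content'):  # skip reviews with missing or empty text
                    reviews.append(r)
        except ValueError:  # raised by json.load on malformed json
            print('skipping malformed file:', filename)
    return reviews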

Sample implementation

A sample Java project has been provided along with this assignment for demonstration purposes. It includes sample code for loading JSON files, tokenization with OpenNLP, and stemming with the Snowball Stemmer.

We recommend that you follow this sample implementation to finish this assignment. You may also use Python or any other programming language if you are more comfortable with it; however, no sample implementation will be provided in those languages (some Python package usage examples are given below for your reference).

NOTE: please carefully read the comments in the sample project. The sample implementation is for demonstration purposes only; please revise it carefully to maximize the efficiency of your own implementation.

Preparation: Basic Text Processing Techniques

Terminologies

Before describing the specifications of this assignment, several terms that will be used repeatedly below are defined here.

Next, let's go over several important concepts and techniques for basic text analysis.

Tokenization

Tokenization is the process of breaking a stream of text into meaningful units. In this assignment, we will show you how to use the tokenizers in OpenNLP (in Java) and NLTK (in Python) to perform tokenization.

1. Tokenizer in OpenNLP

The detailed documentation for this tokenizer can be found here. You can download the library here and the trained model file here (please choose the English tokenizer).

Once you have properly loaded the model from the file, tokenization can be simply performed by,

String[] tokens = tokenizer.tokenize("An input sample sentence.");

2. Tokenizer in NLTK

NLTK provides several implementations of tokenization modules, many of which are actually based on regular expressions.

Their usage is the same across modules and very simple,

>>> import nltk
>>> tokenizer = nltk.tokenize.treebank.TreebankWordTokenizer()
>>> tokenizer.tokenize("I've practiced for 30 years in pediatrics, and I've never seen anything quite like this.")
['I', "'ve", 'practiced', 'for', '30', 'years', 'in', 'pediatrics', ',', 'and', 'I', "'ve", 'never', 'seen', 'anything', 'quite', 'like', 'this', '.']

Stemming

Stemming refers to the process of reducing inflected (or sometimes derived) words to their stem, base or root form. For example, "ladies" would be mapped to "ladi" as a result of stemming (although "lady" would be a more desirable result).

1. Stemmers in Java

Unfortunately, OpenNLP does not currently support stemming. It does have a module called "Lemmatizer," which can be understood as a more advanced version of a stemmer. Several existing implementations of stemmers are available in Java, including the Snowball Stemmer and the Porter Stemmer; the Snowball Stemmer package contains both of these popularly used stemmers.

The usage of the stemmers in the Snowball package is very simple,

SnowballStemmer stemmer = new englishStemmer(); // get an instance of SnowballStemmer for English
stemmer.setCurrent(token); // set the input
stemmer.stem(); //perform stemming
String stem = stemmer.getCurrent(); // get the output

2. Stemmers in NLTK

NLTK provides several implementations of stemming modules, including the Porter Stemmer and the Snowball Stemmer.

The usage of either stemmer in NLTK is very simple. For example, to use Snowball Stemmer,

>>> from nltk.stem.snowball import EnglishStemmer # load the stemmer module from NLTK
>>> stemmer = EnglishStemmer() # get an instance of SnowballStemmer for English
>>> stemmer.stem('ladies') # call stemmer to stem the input
u'ladi'

Stopword removal

In basic text analysis, stopwords are words that are filtered out before or after processing of natural language text data, based on the assumption that such words do not carry specific semantic meanings. However, there is no single definitive list of stopwords applicable to all scenarios, and the definition of stopwords is always domain-specific.

Here is a popularly used list of stopwords: the SMART system's stopword list. We will use it as our initial stopword list for this assignment.

N-gram

An N-gram is a contiguous sequence of n items from a given sequence of text or speech. For example, the bigram (2-gram) representation of the sentence "Text mining is helpful for everyone." would be ["text-mining", "mining-is", "is-helpful", "helpful-for", "for-everyone"].

To generate the N-grams, simply scan through the list of tokens and concatenate consecutive tokens into N-grams, as in the sketch below.
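A minimal Python sketch, assuming tokens is the token list produced by the pre-processing steps:

def ngrams(tokens, n=2):
    """Concatenate every n consecutive tokens, delimited by '-'."""
    return ['-'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# ngrams(['text', 'mining', 'is', 'helpful'], 2)
# -> ['text-mining', 'mining-is', 'is-helpful']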

Pre-processing

There are several general pre-processing steps you need to take to finish this assignment, i.e., applying the techniques discussed above (tokenization, normalization, stemming, and stopword removal) to every review document; a sketch of such a pipeline is given below.
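For reference, a minimal Python sketch of such a pipeline, built from the NLTK modules shown above; the stopword file name, the punctuation handling, and the mapping of digit-bearing tokens to a 'NUM' placeholder are all assumptions, so follow the comments in the sample project for the required choices:

import re
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.stem.snowball import EnglishStemmer

tokenizer = TreebankWordTokenizer()
stemmer = EnglishStemmer()
stopwords = set(open('stopwords.txt').read().split())  # hypothetical file holding the initial stopword list

def preprocess(text):
    """Tokenize, normalize, stem, and remove stopwords for one document."""
    result = []
    for t in tokenizer.tokenize(text.lower()):
        if re.fullmatch(r'\W+', t):  # drop pure punctuation (an assumption)
            continue
        t = 'NUM' if re.search(r'\d', t) else stemmer.stem(t)  # digit handling is an assumption
        if t not in stopwords:
            result.append(t)
    return result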

Part I: Vector Space Model (50pts)

Based on the above pre-processing steps, we are ready to get some statistics about this corpus and represent the documents in vector space.

1.1 Understand Zipf's Law (20pts)

First, let's validate Zipf's law with the provided Yelp review data set. This can be achieved by the following steps:

  1. Process the text documents according to the steps discussed above.
  2. For each token, go over all the review documents containing it (in both the train and test folders), and accumulate its frequency in those documents, i.e., its total term frequency (TTF).
  3. Order the tokens by their TTF in descending order.
  4. Create a dot plot with each token's rank on the x-axis and its TTF on the y-axis. Please use a log-log scale for this plot.

Hint: you can maintain a look-up table in memory while scanning through the documents, so that you only need to go through the corpus once to get the counts for all the tokens.

From the resulting plot, can we find a strong linear relationship between the x and y axes in the log-log scale? If so, what is the slope of this linear relationship? To answer these questions, you can dump the data into Excel and use its plot function. (You can also use a scientific computing package for curve fitting, as sketched below.)
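For instance, a minimal sketch with collections.Counter, numpy, and matplotlib; all_reviews and preprocess are hypothetical names carried over from the earlier sketches:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

ttf = Counter()
for review in all_reviews:  # reviews from both train and test folders
    ttf.update(preprocess(review['Content']))

freqs = np.array(sorted(ttf.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# fit log10(TTF) = slope * log10(rank) + intercept
slope, intercept = np.polyfit(np.log10(ranks), np.log10(freqs), 1)

plt.loglog(ranks, freqs, '.')
plt.xlabel('rank')
plt.ylabel('total term frequency')
plt.show()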

What to submit:

  1. Copy and paste your implementation of the text normalization module.
  2. The generated curve for Zipf's law in log-log space, with the corresponding slope and intercept of the fitted line.
  3. Follow the same procedure as above, but this time collect the document frequency (DF) of each word and create the curve again. Which curve do you think better fits Zipf's law, and why?

1.2 Construct a Controlled Vocabulary (10pts)

According to the procedure illustrated in our lecture slides, generate all the bigrams based on the resulting tokens from the pre-processing step (train folder only), and mix those bigrams with the unigrams as our initial controlled vocabulary.

Collect the DF statistics for all the N-grams (i.e., unigrams and bigrams) in this initial controlled vocabulary (train folder only), and find the top 100 N-grams by DF. Compare this list with our initial stopword list: can you find some restaurant-review-specific stopwords? Merge those top 100 N-grams into the initial stopword list to form our final stopword list.

In the meantime, filter out the N-grams with DF smaller than 50; all remaining N-grams will be used as our controlled vocabulary. (A sketch of the DF counting and filtering is given below.)
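A minimal Python sketch of the DF counting and filtering, again using the hypothetical preprocess and ngrams helpers from the earlier sketches:

from collections import Counter

df = Counter()
for review in train_reviews:  # hypothetical list of reviews from the train folder
    tokens = preprocess(review['Content'])
    # count each unigram/bigram at most once per document
    df.update(set(tokens) | set(ngrams(tokens, 2)))

top100 = [w for w, _ in df.most_common(100)]  # candidate domain-specific stopwords
stopwords |= set(top100)                      # final stopword list
vocabulary = {w for w, c in df.items() if c >= 50 and w not in stopwords}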

What to submit:

  1. List of new stopwords you found specific to restaurant reviews.
  2. The size of the resulting controlled vocabulary (i.e., total N-grams in it).
  3. Top 50 and bottom 50 N-grams according to DF in the resulting controlled vocabulary, and their corresponding IDFs (should be estimated only based on the train folder).

1.3 Compute similarity between documents (20pts)

With the above automatically constructed controlled vocabulary, each review document can be represented as an N-gram vector. Specifically, each dimension of this vector space is defined by the mix of unigrams and bigrams in the controlled vocabulary, while the weight of each unigram/bigram in a review document is given by TF-IDF. In particular, we will use "sub-linear TF scaling" to compute the normalized TF of each unigram/bigram in a review document. Note that the IDF statistics should be computed based on the train folder only.
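As a reminder, the common textbook form of sub-linear TF scaling is $TF(t,d) = 1+\log c(t,d)$ when $c(t,d)>0$ and $0$ otherwise, with $IDF(t) = 1+\log\frac{N}{DF(t)}$, where $c(t,d)$ is the raw count of N-gram $t$ in document $d$ and $N$ is the number of documents in the train folder. These forms are stated here only as an assumption; please follow the exact definitions in the lecture slides.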

Use this document representation to encode all the review documents in the test folder.

We have prepared a special json file named "query.json", which was manually crafted by the instructor. It contains five carefully selected restaurant reviews, drawn from both inside and outside the provided corpus. Construct the vector space representations for these five reviews (as you did for the review documents in the test folder) and find the most similar reviews to them in the test folder, where the similarity metric is cosine similarity (a sketch is given below).
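A minimal sketch of the cosine similarity between two documents, assuming each TF-IDF vector is stored sparsely as a dict from N-gram to weight (an assumption about your representation, not a requirement):

import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0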

What to submit:

  1. Paste your implementation of cosine similarity computation.
  2. For each document in the "query.json" file, list the top 3 most similar review documents (including Author, Content and Date) from the test folder and the corresponding cosine similarity.
  3. Without reading the actual content in the "query.json" file, and simply judging from the content of the retrieved most similar review documents, can you tell which types of restaurants are described in the query.json file (e.g., Indian food or Korean food)?

Part II: Statistical Language Models (50pts)

2.1 Maximum likelihood estimation for statistical language models with proper smoothing (20pts)

Use all the review documents in the train folder to estimate a unigram language model $p_u(w)$ and two bigram language models (with the different smoothing methods specified below). When estimating the bigram language models, use linear interpolation smoothing and absolute discount smoothing based on the unigram language model $p_u(w)$ to obtain two different bigram language models, i.e., $p^L_b(w_i|w_{i-1})$ and $p^A_b(w_i|w_{i-1})$. In linear interpolation smoothing, set the parameter $\lambda=0.9$; and in absolute discount smoothing, set the parameter $\delta=0.1$.
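For reference, the usual textbook forms of these two smoothing methods are

$p^L_b(w_i|w_{i-1}) = \lambda\, p_{ML}(w_i|w_{i-1}) + (1-\lambda)\, p_u(w_i)$

$p^A_b(w_i|w_{i-1}) = \frac{\max(c(w_{i-1}w_i)-\delta,\ 0)}{c(w_{i-1})} + \frac{\delta\,|\{w:\ c(w_{i-1}w)>0\}|}{c(w_{i-1})}\, p_u(w_i)$

where $p_{ML}$ denotes the maximum likelihood estimate and $c(\cdot)$ denotes counts in the train folder. These forms are stated here only as an assumption; in particular, whether $\lambda$ weights the bigram or the unigram component is a convention you should take from the lecture slides.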

From the resulting two bigram language models, find the top 10 words that are most likely to follow the word "good", i.e., rank the words in descending order of $p^L_b(w|\text{"good"})$ and $p^A_b(w|\text{"good"})$ and output the top 10 words. Are those top 10 words the same under these two bigram language models? Explain your observations.

HINT: to reduce space complexity, you do not need to maintain an actual $V\times V$ array to store the counts and probabilities for the bigram language models. You can use a sparse data structure, e.g., a hash map, to store the seen words/bigrams, and perform the smoothing on the fly, i.e., evoke function calls that return the values of $p^L_b(w|\text{"good"})$ and $p^A_b(w|\text{"good"})$ only when needed, as in the sketch below.
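A minimal Python sketch of this on-the-fly smoothing, using Counter objects as the sparse storage; the parameter conventions follow the assumed formulas above, so adjust them to the lecture slides' definitions (train_reviews and preprocess are the hypothetical helpers from the earlier sketches):

from collections import Counter

uni = Counter()     # unigram counts: word -> count
bi = Counter()      # bigram counts: (prev, word) -> count
follow = Counter()  # number of distinct words observed after each word

for review in train_reviews:
    tokens = preprocess(review['Content'])
    uni.update(tokens)
    for prev, w in zip(tokens, tokens[1:]):
        if bi[(prev, w)] == 0:
            follow[prev] += 1
        bi[(prev, w)] += 1

total = sum(uni.values())

def p_u(w):
    return uni[w] / total

def p_linear(w, prev, lam=0.9):
    """Linear interpolation smoothing (assuming lambda weights the ML bigram)."""
    p_ml = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    return lam * p_ml + (1 - lam) * p_u(w)

def p_absolute(w, prev, delta=0.1):
    """Absolute discount smoothing with the unigram model as the reference."""
    if uni[prev] == 0:
        return p_u(w)
    return (max(bi[(prev, w)] - delta, 0) / uni[prev]
            + delta * follow[prev] / uni[prev] * p_u(w))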

What to submit:

  1. Paste your implementation of the linear interpolation smoothing and absolute discount smoothing.
  2. The top 10 words selected from the corresponding two bigram language models.
  3. Your explanation of the observations about the top words under those two bigram language models.

2.2 Generate text documents from a language model (15pts)

Fixing the sentence length at 15, generate 10 sentences by sampling words from $p_u(w)$, $p^L_b(w_i|w_{i-1})$ and $p^A_b(w_i|w_{i-1})$, respectively.

HINT: you can use $p_u(w)$ to generate the first word of a sentence and then sample from the corresponding bigram language model when generating from $p^L_b(w_i|w_{i-1})$ and $p^A_b(w_i|w_{i-1})$; see the sketch below.
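A minimal sampling sketch, assuming the model is exposed as a function prob(w, prev) like the hypothetical p_linear and p_absolute above; sampling by scanning the whole vocabulary is slow but simple, and is only one way to do it:

import random

def sample_word(prob, prev, vocab):
    """Draw one word from the conditional distribution prob(w | prev)."""
    r = random.random()
    cumulative = 0.0
    for w in vocab:
        cumulative += prob(w, prev)
        if cumulative >= r:
            return w
    return w  # guard against rounding leftovers: return the last word

def sample_sentence(prob, vocab, length=15):
    words = [sample_word(lambda w, _: p_u(w), None, vocab)]  # first word from p_u
    while len(words) < length:
        words.append(sample_word(prob, words[-1], vocab))
    return ' '.join(words)

# e.g., sample_sentence(p_linear, list(uni), 15)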

What to submit:

  1. Paste your implementation of the sampling procedure from a language model.
  2. The 10 sentences generated from $p_u(w)$, $p^L_b(w_i|w_{i-1})$ and $p^A_b(w_i|w_{i-1})$ accordingly, and the corresponding likelihood given by the used language model.

2.3 Language model evaluation (15pts)

Compute the perplexity of the previously estimated language models, i.e., $p_u(w)$, $p^L_b(w_i|w_{i-1})$ and $p^A_b(w_i|w_{i-1})$, on all the review documents in the test folder.

The perplexity metric defined in our lecture slides is for one document (copied below). Follow that definition to compute the perplexity of every review document in the test folder, and then compute the mean and standard deviation of the resulting perplexities.

$PP(w_1,w_2,\dots,w_n) = \sqrt[n]{\frac{1}{\prod_{i=1}^n p(w_i|w_{i-1},\dots,w_{i-N+1})}}$

where $d=w_1,w_2,\dots,w_n$, i.e., a text sequence from testing document $d$ of length $n$; and the likelihood is computed under an N-gram language model.

NOTE: to smooth the unseen words in the test folder, use additive smoothing to smooth the unigram language model $p_u(w)$, setting the parameter $\delta=0.1$. Then use the smoothed $\hat p_u(w)$ to smooth $p^L_b(w_i|w_{i-1})$ and $p^A_b(w_i|w_{i-1})$. A log-space sketch of the perplexity computation is given below.
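Because the product of many small probabilities underflows, compute the perplexity in log space. A minimal sketch, assuming the counters from the earlier bigram sketch and that you substitute $\hat p_u(w)$ for $p_u(w)$ inside the smoothed bigram functions so that no probability is ever zero (all names here are assumptions):

import math

V = len(uni) + 1  # vocabulary size, plus one pseudo-type for unseen words (an assumption)

def p_u_hat(w, delta=0.1):
    """Additively smoothed unigram probability."""
    return (uni[w] + delta) / (total + delta * V)

def perplexity(tokens, prob):
    """PP(d) in log space: exp(-(1/n) * sum_i log p(w_i | w_{i-1}))."""
    log_sum = math.log(p_u_hat(tokens[0]))  # first word scored by the smoothed unigram model
    for prev, w in zip(tokens, tokens[1:]):
        log_sum += math.log(prob(w, prev))
    return math.exp(-log_sum / len(tokens))

# for the unigram model, pass prob = lambda w, prev: p_u_hat(w); then aggregate, e.g.,
# pps = [perplexity(preprocess(r['Content']), p_linear) for r in test_reviews]
# and report statistics.mean(pps) and statistics.stdev(pps)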

What to submit:

  1. Paste your implementation of the perplexity calculation of a language model.
  2. Report the mean and standard deviation of the perplexities for $\hat p_u(w)$, $\hat p^L_b(w_i|w_{i-1})$ and $\hat p^A_b(w_i|w_{i-1})$ on the test folder.
  3. Can you conclude which language model predicts the data in the test folder better, and why?

Deadlines & How to submit

This homework is due on Feb. 15th at 11:59pm; you therefore have 14 days to finish it. The late policy for all our homework is discussed in detail in the course syllabus.

A Collab assignment page has been created for this MP. Please submit your written report strictly following the requirements specified above. The report must be in PDF format; name your submission with your computing ID, a dash, and MP1, e.g., "hw5x-MP1.PDF".

Reference solution

We have prepared the following sample solution for your reference; hopefully it helps answer some of your questions about MP1. Sample Solution