CS 6501: Text Mining Spring 2019 · CS@UVa

This assignment is designed to help you practice the general steps in building a text categorization system. It consists of five tasks, including feature selection, building Naive Bayes and kNN classifiers, and classification performance evaluation, and totals 100 points.

The deadline and submission guidance are explicitly described here.

Data Set

We will use the same small Yelp review data set as we used in MP1. More details of this review data set can be found in our previous MP1 description.

Please reuse your implementation for MP1 as much as possible. Based on that, you should be able to finish this assignment without too much extra effort.

Preparation: Creating a classification task

Terminologies

We will keep using the same terminologies as we used in MP1.

Basic text processing techniques

We will use the same text processing pipeline as we explored in MP1, including tokenization, stemming, normalization, and stopword removal (please use the standard SMART system's stopword list). More details can be found in our previous description of MP1. Note that in all the following tasks, we will only use unigrams as our features, and only the json files in both the train and test folders.

Defining the classes

Each review document in this data set is associated with a numeric overall rating in the range of 1 to 5. We will treat the review documents with an overall rating below 4 as negative (labeled as 0) and the rest as positive (labeled as 1). As a result, we have created a binary text classification problem.

Task 1: Feature Selection (15pts)

Based on the basic text processing steps we took in MP1, we have created a vector representation for every review document in our collection. Next we will implement two of the most effective text feature selection methods, i.e., Information Gain and Chi Square, to further refine this representation.

The definition of the Information Gain feature selector is as follows (please also refer to our lecture slides),

$$ IG(t) = -\sum_{y\in\{0,1\}} p(y)\log p(y) + p(t)\sum_{y\in\{0,1\}} p(y|t)\log p(y|t) + p(\bar t)\sum_{y\in\{0,1\}} p(y|\bar t)\log p(y|\bar t) $$ where $p(y=1|t)$ is the probability that we get a positive document where we observe the term $t$, and $p(y=1|\bar t)$ is the probability that we get a positive document where we do not observe the term $t$.

Note: here we need to define $0\log 0=0$ to avoid zero probabilities in entropy computation.
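
For concreteness, here is a minimal Python sketch of the $IG(t)$ computation from document counts. The count variables (`df_pos`, `df_neg`, `n_pos`, `n_neg`) are illustrative names for quantities you would tally from the corpus:

```python
import math

def information_gain(df_pos, df_neg, n_pos, n_neg):
    """IG(t) from document counts for a term t (illustrative names):
      df_pos / df_neg -- positive / negative documents containing t
      n_pos  / n_neg  -- total positive / negative documents
    Uses the convention 0*log(0) = 0 as required."""
    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    n = n_pos + n_neg
    p_t = (df_pos + df_neg) / n                # p(t)
    p_not_t = 1.0 - p_t                        # p(t-bar)

    # class prior entropy: -sum_y p(y) log p(y)
    ig = -(plogp(n_pos / n) + plogp(n_neg / n))

    # + p(t) * sum_y p(y|t) log p(y|t)
    if df_pos + df_neg > 0:
        ig += p_t * (plogp(df_pos / (df_pos + df_neg)) +
                     plogp(df_neg / (df_pos + df_neg)))

    # + p(t-bar) * sum_y p(y|t-bar) log p(y|t-bar)
    rest_pos, rest_neg = n_pos - df_pos, n_neg - df_neg
    if rest_pos + rest_neg > 0:
        ig += p_not_t * (plogp(rest_pos / (rest_pos + rest_neg)) +
                         plogp(rest_neg / (rest_pos + rest_neg)))
    return ig
```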

Accordingly, the definition of Chi Square feature selector is as follows,

$$\chi^2(t) = \frac{(A+B+C+D)(AD-BC)^2}{(A+C)(B+D)(A+B)(C+D)} $$ where A is the number of positive documents containing term t, B is the number of positive documents not containing term t, C is the number of negative documents containing term t, and D is the number of negative documents not containing term t.
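
This maps directly onto the four contingency-table counts; a minimal sketch, with the variables named exactly as defined above:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for a term t from the 2x2 contingency table:
      A: positive docs containing t      B: positive docs without t
      C: negative docs containing t      D: negative docs without t"""
    numerator = (A + B + C + D) * (A * D - B * C) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator > 0 else 0.0
```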

Before we apply these two feature selection methods, we need to filter the features by document frequency (DF) first. This time we will filter out the words with DF smaller than 10.

For Chi Square, we will use the threshold 3.841 to filter out insignificant terms, i.e., remove the terms with $\chi^2(t)<3.841$.

After these filtering steps, rank the resulting features by their $IG(t)$ and $\chi^2(t)$ values in descending order, respectively. We will choose the top 5000 words from each of these two ordered lists as our candidate features. If fewer than 5000 features remain after feature selection, return whatever you have after filtering.

Before moving on to the next task, merge the feature lists you have obtained from Information Gain and Chi Square, i.e., take the union of these two sets of selected features, and use the resulting list as the finalized features (i.e., controlled vocabulary) for the following tasks. Then use this controlled vocabulary to construct a vector representation for all the review documents in both the train and test folders, and ignore the documents with five or fewer non-zero features (i.e., the length of the resulting sparse vector is at most 5). We will refer to the resulting document collection as the corpus in our following descriptions.
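
One possible way to carry out this merging and filtering step, assuming `ig_scores` and `chi2_scores` map each surviving term to its score and `all_documents` (an illustrative name) is a list of (term counts, label) pairs:

```python
# Take the top 5000 terms from each ranked list and merge by set union.
top_ig  = sorted(ig_scores,   key=ig_scores.get,   reverse=True)[:5000]
top_chi = sorted(chi2_scores, key=chi2_scores.get, reverse=True)[:5000]
vocabulary = set(top_ig) | set(top_chi)        # union = controlled vocabulary

# Keep only documents with more than five non-zero features.
corpus = []
for term_counts, label in all_documents:
    features = {w: c for w, c in term_counts.items() if w in vocabulary}
    if len(features) > 5:
        corpus.append((features, label))
```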

What to submit:

  1. List the top 20 words selected by Information Gain and Chi Square, respectively. What is the size of your finalized controlled vocabulary after merging? How many review documents are there in the resulting corpus after feature selection? (5 pts).
  2. Paste your code for Information Gain and Chi Square computation (2$\times$5 pts).

Task 2: Naive Bayes (20pts)

2.1 Training Naive Bayes with Maximum a Posterior estimator (10pts)

As we discussed in the course lecture, a Naive Bayes classifier for a text categorization problem can be considered as a set of independent N-gram language models, one for each class. In this assignment, we will use unigram language models with additive smoothing (set $\delta=0.1$) to estimate the parameters of a Naive Bayes text classifier.

We suggest you reuse your MP1 implementation of unigram language models for this task.

HINT: you only need to estimate two unigram language models, one for the positive class and one for the negative class, and $p(y)$ is simply the proportion of each class in the corpus.
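
Below is a minimal sketch of such an estimator, assuming the corpus layout from the Task 1 sketch (a list of (term counts, label) pairs); the function and variable names are illustrative:

```python
import math
from collections import Counter

def train_naive_bayes(corpus, vocabulary, delta=0.1):
    """Estimate the class prior p(y) and two additively smoothed
    unigram language models p(w|y)."""
    tf = {0: Counter(), 1: Counter()}          # word counts per class
    n_docs = Counter()
    for features, y in corpus:
        n_docs[y] += 1
        tf[y].update(features)
    prior = {y: n_docs[y] / sum(n_docs.values()) for y in (0, 1)}

    V = len(vocabulary)
    total = {y: sum(tf[y].values()) for y in (0, 1)}
    def log_p(w, y):                           # additive smoothing, delta = 0.1
        return math.log((tf[y][w] + delta) / (total[y] + delta * V))

    # log p(w|y=1)/p(w|y=0), used for the ranking deliverable below
    log_ratio = {w: log_p(w, 1) - log_p(w, 0) for w in vocabulary}
    return prior, log_p, log_ratio
```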

What to submit:

  1. Use the whole corpus to estimate a Naive Bayes classifier. Compute the log ratio between $p(w|y=1)$ and $p(w|y=0)$, i.e., $\log\frac{p(w|y=1)}{p(w|y=0)}$, where $w$ is a particular unigram in your controlled vocabulary, and rank the features in descending order with respect to this log ratio. List the top 20 and bottom 20 words from this ranked list. Do they make sense to you in terms of distinguishing positive from negative opinions? (10 pts).

2.2 Naive Bayes as a linear classifier (10pts)

As discussed in our lecture, a Naive Bayes classifier for binary classification boils down to a linear classification model:

$$f(X)=\log\frac{p(y=1|X)}{p(y=0|X)}=\log\frac{p(y=1)}{p(y=0)} + \sum_{x\in X}\left(\log p(x|y=1) - \log p(x|y=0)\right)$$ and $\bar y=sgn(f(X))$, where $sgn(x)=1$ if $x\ge0$ and 0 otherwise.

Implement this prediction function with your Naive Bayes classifier. It should take a review document as input and output its predicted class label.
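
A minimal sketch of this prediction function, building on the illustrative `train_naive_bayes` sketch from Task 2.1:

```python
import math

def predict(features, prior, log_p):
    """Return f(X) and the predicted label; `features` maps each term of
    the input document (restricted to the controlled vocabulary) to its
    raw count, so every occurrence of a term contributes once."""
    f = math.log(prior[1] / prior[0])
    for w, c in features.items():
        f += c * (log_p(w, 1) - log_p(w, 0))
    return f, (1 if f >= 0 else 0)             # sgn(f) with sgn(x)=1 iff x>=0
```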

What to submit:
1. Compute $f(X)$ over all documents in the testing set and rank them in descending order with respect to $f(X)$. Plot the Precision-Recall curve with respect to this ranked list (one way to trace the curve is sketched after this list). Please refer to the lecture slides for instructions on plotting this curve.
2. Vary the smoothing parameter $\delta$ and redraw the precision-recall curve. Do you see different precision/recall tradeoffs? Try to explain why.
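
A sketch of tracing the curve from the ranked list; `ranked_labels` is an assumed list of true labels sorted by descending $f(X)$, with at least one positive document:

```python
def pr_curve(ranked_labels):
    """One (recall, precision) point per rank position in the list."""
    total_pos = sum(ranked_labels)
    points, tp = [], 0
    for i, y in enumerate(ranked_labels, start=1):
        tp += y
        points.append((tp / total_pos, tp / i))   # (recall, precision)
    return points
```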

Task 3: k Nearest Neighbor (25pts)

In our lecture discussion, we introduced a highly efficient approximation method for retrieving the nearest neighbors of a given input instance, i.e., random projection. We will implement this method to speed up our kNN computation.

In our kNN classifier, we will use TF-IDF as the feature values in the vector representation and cosine similarity between vectors as the similarity metric. We will follow the same definition of TF-IDF as we used in MP1, i.e., sub-linear TF scaling.
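
As a reference, one common form of sub-linear TF scaling is $1+\log tf$; the sketch below assumes that form together with a precomputed IDF table, so double-check it against the exact TF-IDF definition you used in MP1:

```python
import math

def tfidf_vector(term_counts, idf):
    """Sub-linear TF scaling times IDF; `idf` maps term -> IDF value."""
    return {w: (1 + math.log(c)) * idf[w] for w, c in term_counts.items() if c > 0}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```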

Note: in the Naive Bayes classifier, we only need the raw TF as the feature value.

Denoting the size of our final controlled vocabulary as $m$, we will create an $l$-bit hash code for every review document in the corpus via random projection. The details of how to perform random projection have been discussed in our lecture slides. Briefly, we will create $l$ different $m$-dimensional random vectors, where each dimension of these random vectors is drawn uniformly from the range $[-1,1]$. For an input review document, we will compute the inner product of its vector representation with each of those $l$ random vectors, and use the sign function to encode each bit as 0 or 1, i.e., $H(x)=[sgn(r_1^Tx),sgn(r_2^Tx),\dots,sgn(r_l^Tx)]$.

Please keep in mind that those $l$ random vectors should be fixed and shared among all the review documents during training and testing. In other words, you should first create those $l$ random vectors, store them, and keep using them for all incoming documents when computing their hash values. In addition, random projection only helps you narrow down the search space: unless there are fewer than k documents in the resulting hash bucket, you still need to compute the cosine similarity between the input document and all the documents in that bucket to retrieve the k nearest neighbors of the input document.
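
Putting the pieces together, here is a minimal sketch of random projection hashing plus bucket-restricted kNN. It assumes `vocabulary` and the `cosine` helper from the sketches above, and `train_vectors` (an illustrative name) holding (TF-IDF vector, label) pairs:

```python
import random
from collections import defaultdict

class RandomProjection:
    """l fixed m-dimensional random vectors, generated once and reused
    for every document during both training and testing."""
    def __init__(self, vocabulary, l=5, seed=42):
        self.index = {w: i for i, w in enumerate(sorted(vocabulary))}
        rng = random.Random(seed)              # fixed seed => reproducible vectors
        self.vectors = [[rng.uniform(-1, 1) for _ in range(len(self.index))]
                        for _ in range(l)]

    def hash(self, vec):
        """Map a sparse TF-IDF vector (term -> weight) to an l-bit code."""
        bits = 0
        for r in self.vectors:
            dot = sum(wt * r[self.index[w]] for w, wt in vec.items()
                      if w in self.index)
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

# Index the training documents into hash buckets once.
rp = RandomProjection(vocabulary, l=5)
buckets = defaultdict(list)
for vec, label in train_vectors:
    buckets[rp.hash(vec)].append((vec, label))

def knn(query_vec, k=5):
    """Search only the query's bucket; fall back to brute force when the
    bucket holds fewer than k documents."""
    candidates = buckets[rp.hash(query_vec)]
    if len(candidates) < k:
        candidates = [d for b in buckets.values() for d in b]
    return sorted(candidates, key=lambda d: cosine(query_vec, d[0]),
                  reverse=True)[:k]
```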

What to submit:
1. Copy and paste your implementation of random projection. (15pts)
2. Use your implemented random projection (set $l=5$) to retrieve the k nearest documents (set $k=5$) for the five sample review documents listed in the query json file from MP1. Report the running time of brute-force k nearest neighbor search and of approximate k nearest neighbor search by random projection. Can we still find reasonable results with this approximation, i.e., do the retrieved documents help us determine the cuisine mentioned in the query document? (10pts)

Task 4: Classification performance evaluation (25pts)

In all our previous task descriptions, we used the same corpus for both training and testing. Now we will perform classification evaluation in a more rigorous way, i.e., using cross-validation.

We will randomly split the corpus into 10 folds; in each round, we will reserve one fold for testing and the rest for training. Detailed procedures of cross-validation can be found in our lecture slides.

In each round, train and test Naive Bayes and kNN on the same train/test separation. In kNN, set $l=5$ and $k=5$.
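
A minimal sketch of producing the 10 folds, assuming the corpus layout from the earlier sketches:

```python
import random

def ten_fold_splits(corpus, seed=0):
    """Yield (train, test) pairs for 10-fold cross-validation."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)          # random split of the corpus
    folds = [docs[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]                        # one fold reserved for testing
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        yield train, test
```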

What to submit:
1. Report the average precision, recall, and F1 measure of Naive Bayes and kNN over the 10 folds of cross-validation. (15pts)
2. Perform a paired two-sample t-test to verify whether one classifier is significantly better than the other (set the confidence level to 95%). If so, which one is better? A sketch of the test appears after this list. (10pts)
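
A minimal sketch of the paired test, assuming SciPy is available and that `f1_nb` / `f1_knn` (illustrative names) hold the per-fold F1 scores of the two classifiers, aligned so entry $i$ comes from the same fold:

```python
from scipy import stats

t_stat, p_value = stats.ttest_rel(f1_nb, f1_knn)   # paired two-sample t-test
if p_value < 0.05:                                 # 95% confidence level
    better = "Naive Bayes" if sum(f1_nb) > sum(f1_knn) else "kNN"
    print(f"Significant difference (p={p_value:.4f}); {better} is better.")
else:
    print(f"No significant difference at the 95% level (p={p_value:.4f}).")
```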

Task 5: Explore the best configuration of $l$ and $k$ (15pts)

In our random projection based kNN implementation, we have two hyper-parameters, i.e., length of projection hash code $l$ and number of nearest neighbors $k$. We set them arbitrarily. Please use cross-validation as a model selection method to choose the best configuration of $l$ and $k$.

HINT: this is an open question and you could explore solutions that might not even be mentioned in our class.
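
One straightforward (but not the only valid) design is a grid search over $(l, k)$ that reuses the Task 4 cross-validation; `cross_validate_knn` is an assumed helper returning the mean F1 for a given configuration, and the candidate grids are illustrative:

```python
# Grid search: pick the (l, k) pair with the best average cross-validated F1.
best = max(
    ((l, k) for l in (3, 5, 8, 10, 12) for k in (1, 3, 5, 10, 20)),
    key=lambda cfg: cross_validate_knn(corpus, l=cfg[0], k=cfg[1])
)
print("best (l, k):", best)
```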

What to submit:
1. Your experiment design that helps you find the best configuration of $l$ and $k$. And the optimal $l$ and $k$ you have found. (15pts)

Deadlines & How to submit

MP3 is due on April 12th 11:59pm. You have in total 14 days to finish this MP. The late policy for all our homework assignments has been carefully discussed in the course syllabus.

A Collab assignment page has been created for this MP. Please submit your written report strictly following the requirements specified above. The report must be in PDF format; name your submission with your computing ID dash MP3, e.g., "hw5x-MP3.PDF".

Reference solution

We have prepared the following sample solution for your reference; hopefully it helps answer some of your questions about MP3. Sample Solution