MP1: Getting Familiar with Text Processing & Inverted Index

This assignment is designed to help you get familiar with basic document representation and analysis techniques. You will learn the basics of tokenization, stemming, normalization, constructing bag-of-words representations for text documents, and building an inverted index structure for efficient access.

This assignment has in total 100 points. The deadline and submission guidance are described in the "Deadlines & How to submit" section below.

Data Set

The instructor has prepared a small collection of Yelp restaurant reviews (included in the provided sample Java project and separated into three folders) for this assignment. The review files are named and organized as follows:

  1. Each file contains all the crawled review documents for a specific business on Yelp, and it is named by the business's unique ID on Yelp, e.g., FAhx3UZtXvqNiEAd-GNruQ.json;
  2. All the files are in JSON format. Each JSON file contains a JSON array of reviews ('Reviews') and a JSON object with information about the restaurant ('RestaurantInfo').

    2.1 The JSON object for a user review is defined as follows:
    {
    'Author':'author name (string)', 'ReviewID':'unique review id (string)',
    'Overall':'numerical rating of the review (float)', 'Content':'review text content (string)',
    'Date':'date when the review was published',
    'Author_Location':'author's registered location'
    }

    2.2 The JSON object for the restaurant info is defined as follows:
    {
    'RestaurantID':'unique business id in Yelp (string)',
    'Name':'name of the business (string)',
    'Price':'Yelp defined price range (string)',
    'RestaurantURL':'actual URL to the business page on Yelp (string)',
    'Longitude':'longitude of the business (double)',
    'Latitude':'latitude of the business (double)',
    'Address':'address of the business (string)',
    'ImgURL':'URL to the business's image on Yelp (string)'
    }

In the following discussion, we will refer to each individual user review as a document.

WARNING: some collected JSON files might not strictly follow the above object definitions, e.g., some fields may be missing or empty. As a result, properly handling exceptions during JSON parsing is necessary.
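
For illustration, here is a minimal defensive-parsing sketch. It assumes the org.json library and a hypothetical ReviewLoader class (the sample project may use a different JSON library and structure); malformed reviews are skipped instead of crashing the loader.

import java.nio.file.Files;
import java.nio.file.Paths;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class ReviewLoader {
    // Parse one review file; skip any review whose fields are missing or malformed.
    public static void loadFile(String path) {
        try {
            String raw = new String(Files.readAllBytes(Paths.get(path)), "UTF-8");
            JSONObject root = new JSONObject(raw);
            JSONArray reviews = root.getJSONArray("Reviews");
            for (int i = 0; i < reviews.length(); i++) {
                try {
                    JSONObject review = reviews.getJSONObject(i);
                    String id = review.getString("ReviewID");
                    String content = review.getString("Content");
                    // hand 'content' to the text processing pipeline here
                } catch (JSONException e) {
                    // a required field is missing or empty; skip this review
                }
            }
        } catch (Exception e) {
            System.err.println("Failed to load " + path + ": " + e.getMessage());
        }
    }
}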

Sample implementation

A sample Java project has been provided along with this assignment for demonstration purposes. It includes sample code for loading JSON files, tokenization with OpenNLP, stemming with the Snowball Stemmer, and building an inverted index with Lucene. Most of the implementation has been provided, and you only need to make the necessary modifications to complete the required tasks.

NOTE: please carefully read the comments in the sample project (especially those marked "INSTRUCTOR'S NOTE"). The sample implementation is for demonstration purposes only; please revise it carefully to maximize the efficiency of your implementation.

Preparation: Basic Text Processing Techniques

Let's first go over several important concepts and techniques for basic text analysis.

1. Tokenization

Tokenization is the process of breaking a stream of text into meaningful units. Simple tokenization can be achieved with regular expressions. For example, the following statement in Java splits the input string into tokens:

"I've practiced for 30 years in pediatrics, and I've never seen anything quite like this.".split("[\\W]+")

In this statement, we define the boundary of a token to be any non-word character sequence. The corresponding output of this statement is,

*I*, *ve*, *practiced*, *for*, *30*, *years*, *in*, *pediatrics*, *and*, *I*, *ve*, *never*, *seen*, *anything*, *quite*, *like*, *this*

where the surrounding asterisks indicate the boundaries of each token.

A more advanced solution is to use statistical machine learning based approaches. In this assignment, we will show you how to use the tokenizer in OpenNLP (in Java) to perform tokenization. The detailed documentation for this tokenizer can be found here. You can download the library here and the trained model file here (please choose the English tokenizer).

Once you have properly loaded the model from the file, tokenization can be performed simply by,

String tokens[] = tokenizer.tokenize("An input sample sentence.");
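
For reference, loading the model and creating the tokenizer can be done as follows (a sketch; the model file name en-token.bin and its location are assumptions):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// load the trained English tokenizer model once, then reuse it for all documents
InputStream modelIn = new FileInputStream("en-token.bin"); // assumed path to the downloaded model
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);
modelIn.close();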

2. Stemming

Stemming refers to the process of reducing inflected (or sometimes derived) words to their stem, base or root form. For example, "ladies" would be mapped to "ladi" as a result of stemming (although "lady" would be a more desirable result).

Unfortunately, OpenNLP does not currently provide a stemming function. Several Java implementations of stemmers are available, including the Snowball Stemmer and the Porter Stemmer. The Snowball Stemmer package contains both of these popularly used stemmers.

The usage of a stemmer from the Snowball package is very simple,

import org.tartarus.snowball.SnowballStemmer;
import org.tartarus.snowball.ext.englishStemmer;

SnowballStemmer stemmer = new englishStemmer(); // get an instance of SnowballStemmer for English
stemmer.setCurrent(token); // set the input token
stemmer.stem(); // perform stemming
String stem = stemmer.getCurrent(); // get the output

3. Stopword removal

In basic text analysis, stopwords are words that are filtered out before or after processing natural language text, based on the assumption that such words do not carry specific semantic meaning. However, there is no single definitive list of stopwords applicable to all scenarios, and the choice of stopwords is usually domain specific.

In the provided implementation, a list of manually selected stopwords has been included.
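
A minimal sketch of applying such a list (assuming the stopwords have already been read into a HashSet; the provided project organizes this differently) is:

import java.util.HashSet;
import java.util.Set;

Set<String> stopwords = new HashSet<>(); // fill with the provided stopword list

// returns true if the (normalized, lower-cased) token should be kept
boolean keep(String token) {
    return !token.isEmpty() && !stopwords.contains(token);
}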

4. Pre-processing

There are several general pre-processing steps you need to take to finish this assignment.

  • Tokenization: tokenize the review content of each document into tokens.
  • Normalization: normalize the tokens from step 1 by removing individual punctuation marks (here is a list of English punctuation marks), converting tokens to lower case, and recognizing digit numbers, e.g., integers and floats, and mapping them to the special symbol "NUM" (see the sketch after this list).
  • Stemming: stem the tokens back to their root form.
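
As a minimal sketch of the normalization step (the exact rules are up to you; the regular expressions below are only an assumed illustration):

// normalize one token: strip punctuation marks, lower-case, and map numbers to "NUM"
public static String normalize(String token) {
    token = token.replaceAll("\\p{Punct}+", ""); // remove punctuation characters
    token = token.toLowerCase();                 // case folding
    if (token.matches("\\d+(\\.\\d+)?"))         // integers and floats
        return "NUM";
    return token;                                // may be empty after stripping; filter such tokens out later
}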

Homework Problems

1. Understand Zipf's Law (50pts)

First, let's validate Zipf's law with the provided Yelp review data sets. This can be achieved by the following steps:

  1. Process the text documents according to the pre-processing steps discussed above.
  2. For each token, go over all the review documents containing it and accumulate its frequency in those documents, i.e., its total term frequency (TTF).
  3. Order the tokens by their TTF in descending order.
  4. Create a dot plot with each word's rank on the x-axis and its TTF on the y-axis. Please use a log-log scale for this plot.

Hint: you can maintain a look-up table in memory while you scan through the documents, so that you only need to go through the corpus once to collect the counts for all tokens. In the provided implementation, this look-up table is maintained in the "Corpus" class and named "m_dictionary".
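
A minimal sketch of such a look-up table, using plain HashMaps instead of the project's "Corpus" class (whose exact interface is not reproduced here), could look like:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TokenStats {
    private final Map<String, Long> ttf = new HashMap<>(); // token -> total term frequency
    private final Map<String, Long> df  = new HashMap<>(); // token -> document frequency

    // call once per document with that document's normalized and stemmed tokens
    public void countDocument(String[] tokens) {
        Set<String> seen = new HashSet<>();
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            ttf.merge(t, 1L, Long::sum);    // accumulate TTF across the whole corpus
            if (seen.add(t))                // first time this token appears in this document
                df.merge(t, 1L, Long::sum); // accumulate DF (needed later in this problem)
        }
    }
}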

From the resulting plot, can we find a strong linear relationship between the x and y axes in the log-log scale? If so, what is the slope of this linear relationship? To answer these questions, you can dump the data into Excel and use its plotting function. (You can also use a scientific computing package, or a simple least-squares fit such as the sketch below, for this purpose.)
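
If you prefer to compute the slope and intercept yourself instead of relying on Excel, an ordinary least-squares fit in log-log space is only a few lines (a sketch; ranks are assumed to start at 1 and counts to be sorted in descending order):

// least-squares fit of log(count) against log(rank); returns {slope, intercept}
static double[] fitLogLog(long[] sortedCounts) {
    int n = sortedCounts.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        double x = Math.log(i + 1);           // log(rank), ranks start at 1
        double y = Math.log(sortedCounts[i]); // log(TTF) or log(DF)
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[]{slope, intercept};
}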

Then change the counting statistic in the previous experiment from total term frequency (TTF) to document frequency (DF), i.e., how many documents contain this specific term, and perform the experiment again. Based on the new curve and the corresponding slope and intercept of the linear fit, can you conclude which counting statistic, TTF or DF, fits Zipf's law better on this data set? Can you give any explanation of why it fits the law better?

In the provided Yelp review data set, the instructor has separated the reviews into three folders, named "20", "40" and "60", which include 20, 40 and 60 restaurants respectively. You should collect statistics from these three folders separately and create the corresponding plots (3*2 curves in total). Does an increased number of documents help us better verify Zipf's law?

What to submit:

  1. Paste your implementation of the text normalization module. (10pts)
  2. The six curves in log-log space generated above, with the corresponding slopes and intercepts of the linear fits. (20pts)
  3. Your answers and thoughts on the questions above. (20pts)

2. Building An Inverted Index for Accelerating Text Access (50pts)

An inverted index accelerates access to a text corpus by building a term-to-document mapping. In this assignment, we will measure the efficiency speed-up provided by such a specialized data structure.

First, let's redo the experiments in problem one using an inverted index. In the provided sample implementation, a standard Lucene index can be easily constructed with the "edu.virginia.cs.index.Indexer" class. The corresponding text processing pipeline has been implemented in the "edu.virginia.cs.index.SpecialAnalyzer" class, and sample code for using the indexer can be found in the "runners.Main" class. Please modify the implementation in the "runners.Main" class to build an index for the documents in the "Yelp/60" folder.

To get a list of indexed terms in Lucene, you can use the following piece of code,

import java.io.File;

import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath))); // use your own index path
Terms terms = MultiFields.getTerms(reader, "content"); // get a reference to all the indexed terms in the content field
TermsEnum termsEnum = terms.iterator(null);
while (termsEnum.next() != null) { // iterate through all terms
    Term t = new Term("content", termsEnum.term()); // map it to the corresponding field
    System.out.format("%s\t%d\t%d\n", t, termsEnum.docFreq(), reader.totalTermFreq(t)); // print term text, DF and TTF
}

Now you are fully equipped to collect all the statistics you need to create the curves for validating Zipf's law. Please record the total running time it takes to create the Zipf's law curves with the Lucene index (including the index building time). Compare this with the time your implementation for problem one took (on the Yelp/60 data set); is it much faster?
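
Recording the running time can be as simple as wrapping your code with a wall-clock measurement, for example:

long start = System.currentTimeMillis();
// build the index and collect the TTF/DF statistics here
long elapsed = System.currentTimeMillis() - start;
System.out.println("Total running time: " + elapsed + " ms");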

Now we are ready to do some simple text retrieval with and without an inverted index. In the provided implementation, searching an inverted index is demonstrated in the "edu.virginia.cs.index.Searcher" class, and sample code to run it is provided in the "runners.Main" class. Meanwhile, the instructor has provided a simple implementation of the Bag-of-Words representation of documents, using a HashMap structure, together with the corresponding brute-force search method in the "edu.virginia.cs.searcher.DocSearcher" class. You are encouraged to design your own Bag-of-Words data structure to maximize efficiency.
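
For reference, a minimal Bag-of-Words sketch (independent of the provided "DocSearcher" class, whose internals are not reproduced here) stores per-document term counts in a HashMap and, under an assumed conjunctive matching rule, reports a match only if every query token appears in the document:

import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // term -> count within this single document
    private final Map<String, Integer> counts = new HashMap<>();

    // build the representation from the document's processed tokens
    public BagOfWords(String[] tokens) {
        for (String t : tokens)
            counts.merge(t, 1, Integer::sum);
    }

    // brute-force match: the document matches if it contains every query token
    public boolean matches(String[] queryTokens) {
        for (String q : queryTokens)
            if (!counts.containsKey(q))
                return false;
        return true;
    }
}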

Use these two retrieval methods, i.e., search over the inverted index and brute-force search over your Bag-of-Words document representation, to count the total number of documents that match each of the 10 queries listed below,

general chicken
fried chicken
BBQ sandwiches
mashed potatoes
Grilled Shrimp Salad
lamb Shank
Pepperoni pizza
brussel sprout salad
FRIENDLY STAFF
Grilled Cheese

Please record the total running time of each retrieval method, together with the number of returned documents.

What to submit:

  1. Paste your implementation of the Bag-of-Words document representation (e.g., how you construct it from raw document content, and how you search for a particular term in it). (15pts)
  2. The two curves in log-log space generated by using the Lucene inverted index to collect the TTF and DF statistics, with the corresponding running time statistics compared against your implementation in problem one. (15pts)
  3. The running time and total number of returned documents for the two retrieval methods. If the two methods give you different numbers of matched documents, do you have any explanation for it? (20pts)

Deadlines & How to submit

This assignment is due on Friday, March 19th, at 11:59pm. You therefore have 15 days in total to finish it. The late policy for all our homework is discussed in detail in the course syllabus.

A Collab assignment page has been created for this MP. Please submit your written report strictly following the requirements specified above. The report must be in PDF format, and you can use any template you like for it.