This is the implementation for the paper “Clustered Model Adaptation for Personalized Sentiment Analysis”. We provide the source code for our algorithm and all related baselines.

Quick Start (For Linux and Mac)

1. Download the compressed file cLinAdapt.tar.gz to your local machine.


2. Extract the archive.


3. Compile the whole project with the provided compile script.


4. Run the cLinAdapt algorithm with the default settings via the provided run script.
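The original commands for these steps are not reproduced in this copy. A typical session might look like the following; the tar invocation and the extracted directory name are assumptions, while compile and run are the scripts shipped in the package:

```shell
tar -zxvf cLinAdapt.tar.gz   # step 2: extract the downloaded archive
cd cLinAdapt                 # assumed top-level directory name
./compile                    # step 3: compile the whole project
./run                        # step 4: run cLinAdapt with default settings
```
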

Questions regarding running cLinAdapt and Baselines

Q1: What’s inside the folder /cLinAdapt?

There are four folders and two files inside the folder:


Four folders:

/src folder provides all the source code for our algorithms and baselines.

/libs folder has all the jar files needed for the project. If you import the project into an IDE, you may need to add these libraries to the build path as well.

/bin folder is where the compiled class files are copied to; it is refreshed each time you compile the project.

/data folder has all the data needed for the experiments reported in the paper, including both the Amazon data (/data/CoLinAdapt/Amazon/) and the Yelp data (/data/CoLinAdapt/YelpNew/).


/data/CoLinAdapt/Amazon contains all the files needed for running on the Amazon dataset:

-SelectedVocab.csv lists the 5000 features we selected for training on the Amazon dataset.

-GlobalWeights.txt contains the feature weights trained on a separate dataset.

-Users contains the data for 9760 users.

-CrossGroups_400.txt contains the feature group indexes assigning all features to 400 feature groups. Similarly, CrossGroups_800.txt and CrossGroups_1600.txt assign the features to 800 and 1600 feature groups, respectively. At most we can have 5000 feature groups, in which case each feature forms its own group.

Two files:

compile is the script that compiles the project.

run is the script that executes the project.
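Putting the descriptions above together, the extracted package should look roughly like this (layout reconstructed from the folder descriptions above, not copied from the archive itself):

```
cLinAdapt/
├── src/      # source code for cLinAdapt and the baselines
├── libs/     # jar dependencies for the project
├── bin/      # compiled class files (updated on each compile)
├── data/     # Amazon (/data/CoLinAdapt/Amazon/) and Yelp (/data/CoLinAdapt/YelpNew/) datasets
├── compile   # compilation script
└── run       # execution script
```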

Q2: How to run the algorithm cLinAdapt with different parameters?

We use ‘-model’ to select among the different algorithms; the default is cLinAdapt in batch mode.

The following table lists all the parameters for cLinAdapt:


One sample command is as follows:
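The sample command itself is missing from this copy. A hypothetical invocation is sketched below; only the -model option is documented here, so the flag value is an assumption and any further parameters would come from the table above:

```shell
./run -model cLinAdapt   # hypothetical: run cLinAdapt in batch mode
```
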


Q3: How to run baselines?

As reported in the paper, we have six baselines in batch mode. Use “-model” to select a baseline; the corresponding parameters are specified in the table above. For example, to run GlobalSVM on the Amazon dataset, compile the project and then enter a command line like the following:
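The exact command was not preserved in this copy. A hypothetical form, assuming the baseline name is passed through the -model option described above:

```shell
./run -model GlobalSVM   # hypothetical: select the GlobalSVM baseline
```
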


Q4: What does the output mean?

The expected output consists of three parts. The first part is as follows:


The first seven lines of text report the training setup: 5000 features, 2-gram features, 9760 users in training, and 143161 reviews in total. The feature group size for the user groups is 801, where the additional dimension is for the bias term, while the feature group size for the global part is 5001. There are 69235 training reviews with a positive-review ratio of 0.75, and 73911 testing reviews with a positive ratio of 0.73. M is the size of the auxiliary variables introduced in the posterior inference of the group indicators; alpha is the concentration parameter of the Dirichlet priors; #Iter is the number of sampling iterations; and N1 and N2 are the normal distributions from which the two sets of linear operations, i.e., shifting and scaling, are drawn. During optimization, a circle means the function value decreased while a cross means it increased; since the optimization involves a line search, the function value does increase occasionally. This part of the output indicates that the algorithm is in its burn-in period.

The second part shows the information for each iteration. The circles and crosses again indicate that the algorithm is executing the M-step, i.e., maximizing the likelihood over the model parameters. We also print out the cluster information; in an entry such as 840(0.75, 7.0), 840 is the number of reviews inside the cluster, 0.75 is the positive ratio of those reviews, and 7.0 is the average review length inside the cluster. The log-likelihood is also printed, together with its delta relative to the previous step.


The third part reports the sentiment classification performance of cLinAdapt. The first line repeats the parameters used in the current run, as introduced in the first part. We also print out the confusion matrix, together with Micro F1 and Macro F1 for both classes. Micro F1 is calculated by aggregating all users’ reviews, while Macro F1 is the average performance over all users. Class 0 is the negative class and Class 1 is the positive class. Due to the random initialization of the model weights, the final sentiment performance may vary across runs. More details about the experimental settings can be found in our paper.