This assignment is designed for you to practice classical bandit algorithms with simulated environments.

  • Part 1: Multi-armed Bandit Problem (42+10 points): get the basic idea of the multi-armed bandit problem, implement classical algorithms such as Upper Confidence Bound (UCB), Thompson Sampling (TS) and Perturbed-history Exploration (PHE), and compare their performance in the simulated environment;
  • Part 2: Contextual Linear Bandit Problem (58+10 points): get the basic idea of the contextual linear bandit problem, implement the linear counterparts of the algorithms in Part 1, including LinUCB, LinTS and LinPHE, compare their performance in the simulated environments, and summarize the influence of the shape of the action set on performance.

A Collab assignment page has been created for this assignment. Please submit your written report strictly following the requirements specified below. The report must be in PDF format; name your submission as your computing ID followed by “-assignment1”, e.g., “cl5ev-assignment1.PDF”.

Simulation Environment

We have provided starter code for this assignment in the attachment, which contains scripts for running the simulation and example implementations of bandit algorithms. The script Simulation.py will generate:

  • action set “articles”, which is a set of Article objects (defined in Articles.py) storing the id and feature vector of an article
  • user set “users”, which is a set of User objects (defined in Users.py) storing the id and the unobservable linear parameter (which parameterizes the reward function) of a user

Then, in each time step, we iterate over the users, make a recommendation to each of them, and receive a reward for the recommended article. After “testing_iterations” iterations, the accumulated regret and the parameter estimation error (if the bandit algorithm estimates the linear parameter), averaged over all users, will be plotted.

The important hyper-parameters that govern the simulation (an example configuration is sketched after this list):

  • testing_iterations: total number of time steps to run
  • NoiseScale: standard deviation of the Gaussian noise in the reward
  • n_articles: total number of articles in the articles set
  • n_users: total number of users
  • poolArticleSize: If it is set to None, the action set in each time step contains all articles. Otherwise, a random subset of “poolArticleSize” articles is sampled in each time step as the action set, which is the setting many linear bandit works adopt.
  • actionset: Set to “basis_vector” or “random”. “basis_vector” constrains the articles' feature vectors to be basis vectors such as e_0 = [1, 0, 0, …, 0], e_1 = [0, 1, 0, …, 0]. The feature vectors are therefore orthogonal, so an observation of one article's reward brings no information about another article's reward. “random” means the feature vectors are randomly sampled from the l2 unit ball.
  • context_dimension: dimension of article feature vector and user linear parameter.
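
For example, a small run could use values like the following. The variable names match the list above; where exactly they are set inside Simulation.py (e.g., as script variables or command-line arguments) may differ, so treat this as a sketch of plausible values rather than exact code to paste:

    testing_iterations = 1000   # total number of time steps to run
    NoiseScale = 0.1            # std of the Gaussian reward noise
    n_articles = 10             # total number of articles
    n_users = 10                # number of simulated users
    poolArticleSize = None      # None: the action set is all articles
    actionset = "basis_vector"  # "basis_vector" or "random"
    context_dimension = 10      # dimension of feature vectors / user parameters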

Part 1: Multi-armed Bandit Problem (42 points + 10 bonus points)

In the multi-armed bandit problem, the rewards of different arms are assumed to be independent. The learner can only observe the index of an article, not its feature vector. Therefore, the learner should maintain sufficient statistics and construct a reward estimator / update a posterior for each arm.

To simulate a K-armed bandit environment, set

  • actionset = “basis_vector”
  • n_articles = context_dimension = K

Then you can try different values of:

  • number of arms: K
  • standard deviation of Gaussian noise: NoiseScale

1.1 Implement Multi-armed Bandit Algorithms

We are going to try out algorithms following the three principles below for the multi-armed bandit with Gaussian reward noise:

  • Upper Confidence Bound
  • Thompson Sampling
  • Perturbed-history Exploration (Bonus)

Note that these three principles are general and work with different reward distributions. Here our simulation experiment assumes the reward is Gaussian with unknown mean and known standard deviation, so you need to first figure out what the algorithms should look like under this particular reward assumption before you implement them. You will only get partial points if you use algorithms that assume a different reward distribution.
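
For reference, below is a minimal sketch of what the Gaussian-specific quantities could look like for Thompson Sampling and PHE, assuming a known noise standard deviation sigma (NoiseScale) and per-arm statistics n (number of pulls, at least 1) and mean (empirical mean reward). This is only one possible instantiation under these assumptions, not the required form; derive and justify your own version in the report.

    import numpy as np

    def ts_sample(mean, n, sigma, prior_var=1.0):
        # Thompson Sampling with a N(0, prior_var) prior on the arm's mean and a
        # known-variance Gaussian likelihood: the posterior is also Gaussian.
        post_var = 1.0 / (1.0 / prior_var + n / sigma ** 2)
        post_mean = post_var * (n * mean / sigma ** 2)
        return np.random.normal(post_mean, np.sqrt(post_var))

    def phe_score(mean, n, sigma, a=1.0):
        # One Gaussian variant of PHE: add independent N(0, a * sigma^2) pseudo-noise
        # to each of the n observed rewards and recompute the mean, which is
        # equivalent to perturbing the empirical mean by N(0, a * sigma^2 / n).
        return mean + np.random.normal(0.0, sigma * np.sqrt(a / n))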

Apart from the lecture slides, you may also find the following materials useful:

See EpsilonGreedyMultiArmedBandit.py under the lib directory for an example of how to implement a multi-armed bandit algorithm. Remember that in this environment the algorithm should not use article.featureVector, only article.id.
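
As a structural illustration only, a UCB-style learner in this setting could keep its statistics keyed by article.id, along the lines of the sketch below. The class name, the decide / updateParameters method names, and the constructor arguments are assumptions made for illustration; match your implementation to the actual interface used in EpsilonGreedyMultiArmedBandit.py and Simulation.py.

    import numpy as np

    class UCBMultiArmedBandit:
        # Sketch only; verify the expected interface against the starter code.
        def __init__(self, sigma, alpha=1.0):
            self.sigma = sigma      # known reward noise std (NoiseScale)
            self.alpha = alpha      # exploration weight
            self.counts = {}        # article.id -> number of pulls
            self.means = {}         # article.id -> empirical mean reward
            self.t = 0              # number of decisions made so far

        def decide(self, pool_articles):
            self.t += 1
            # Pull every unseen arm once, then pick the arm with the largest UCB index.
            for a in pool_articles:
                if self.counts.get(a.id, 0) == 0:
                    return a
            def ucb(a):
                bonus = self.alpha * self.sigma * np.sqrt(2.0 * np.log(self.t) / self.counts[a.id])
                return self.means[a.id] + bonus
            return max(pool_articles, key=ucb)

        def updateParameters(self, article, reward):
            n = self.counts.get(article.id, 0) + 1
            old = self.means.get(article.id, 0.0)
            self.counts[article.id] = n
            self.means[article.id] = old + (reward - old) / n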

1.2 Comparison in Different Environment Settings

Play with different hyper-parameter settings of the simulation environment:

  • Experiment with different numbers of articles K and see how this influences performance
  • Experiment with different amounts of Gaussian noise in the reward (standard deviation: NoiseScale) and see how this influences performance

By instantiating your implemented learner, putting it in the dictionary “algorithms” and setting “plot=True”, you will be able to run these algorithms simultaneously and see the plot showing their accumulated regrets and parameter estimation errors.
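
For example (the class names and constructor arguments below are placeholders for your own implementations):

    algorithms = {}
    algorithms['UCB'] = UCBMultiArmedBandit(sigma=NoiseScale)                 # hypothetical constructor
    algorithms['TS'] = ThompsonSamplingMultiArmedBandit(sigma=NoiseScale)     # hypothetical constructor
    # then run the simulation with plot=True to compare their regret curves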

Deliverables:

  • (10 points + 5 bonus points for PHE) In your report, describe the two (or three, including the bonus) algorithms you use (clearly list the procedure: what the inputs are and what the steps are). Note that you should explicitly write out the equations you use, e.g., for computing the UCB score.
  • (20 points + 5 bonus points for PHE) In your report, copy and paste the important components of your implementation, such as the arm selection and parameter update steps of the algorithms.
  • (12 points) In your report, include the plots and summarize your findings for the experiments you conduct for Section 1.2, e.g., under different K and NoiseScale. You should also list the values of the other environment parameters you set for the experiments, e.g., testing_iterations, n_users.

Part 2: Contextual Linear Bandit Problem (58 points + 10 bonus points)

In the contextual linear bandit problem, each user has an unobservable linear parameter that parameterizes a linear reward function taking the articles' feature vectors as input. In each time step, the learner observes the feature vectors of the action set, selects an arm, and observes the reward associated with the selected arm. Therefore, the learner maintains sufficient statistics and constructs a linear regression estimator / updates a posterior for the unknown linear parameter.
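
Concretely, if x_a denotes the feature vector of article a and theta_u the unobservable linear parameter of user u, the reward generated by the simulator can be thought of as r = theta_u^T x_a + eta, where eta is zero-mean Gaussian noise with standard deviation NoiseScale (this matches the environment description above; check Simulation.py for the exact form).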

To simulate a linear bandit environment, set

  • actionset = “random”

Then you can try different values of:

  • dimension of feature vector and linear parameter: context_dimension
  • standard deviation of Gaussian noise: NoiseScale
  • total number of articles in the articles set: n_articles
  • poolArticleSize: the size of a random subset of articles the learner can observe in each time step

2.1 Implement Contextual Linear Bandit Algorithms

For this section, we are going to see how the three principles mentioned above work in the linear bandit problem:

  • LinUCB
  • LinTS
  • LinPHE (Bonus)

Apart from the lecture slides, you may also find the following materials useful:

See EpsilonGreedyLinearBandit.py under the lib directory for an example of how to implement a linear bandit algorithm. Remember that in this environment the algorithm can use article.featureVector.
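
As a structural illustration only, a LinUCB-style learner could maintain ridge-regression statistics per user and score arms by their feature vectors, along the lines of the sketch below. The class name, method names, and constructor arguments are assumptions made for illustration; match your implementation to the actual interface used in EpsilonGreedyLinearBandit.py and Simulation.py.

    import numpy as np

    class LinUCBUserStruct:
        # Per-user ridge-regression statistics (sketch only).
        def __init__(self, d, lambda_=1.0, alpha=0.5):
            self.A = lambda_ * np.identity(d)   # sum of x x^T plus regularizer
            self.b = np.zeros(d)                # sum of reward * x
            self.AInv = np.linalg.inv(self.A)
            self.theta = np.zeros(d)            # ridge estimate of the user parameter
            self.alpha = alpha                  # exploration weight

        def score(self, x):
            # LinUCB index: estimated reward plus an optimism bonus that shrinks
            # as the estimator becomes confident in the direction of x.
            return self.theta.dot(x) + self.alpha * np.sqrt(x.dot(self.AInv).dot(x))

        def update(self, x, reward):
            self.A += np.outer(x, x)
            self.b += reward * x
            self.AInv = np.linalg.inv(self.A)
            self.theta = self.AInv.dot(self.b)

    # Arm selection (per user): pick the article with the largest score, e.g.
    # best = max(pool_articles, key=lambda a: user_struct.score(a.featureVector))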

2.2 Comparison in Different Environment Settings

Play with different hyper-parameter settings of the simulation environment:

  • Experiment with different dimensions of the feature vectors and see how this influences performance
  • Experiment with different amounts of Gaussian noise in the reward (standard deviation: NoiseScale) and see how this influences performance
  • Experiment with different sizes of the action set and see how this influences performance, e.g., the learner observes the whole set of articles as well as random subsets of different sizes.

By instantiating your implemented learner, putting it in the dictionary “algorithms” and setting “plot=True”, you will be able to run these algorithms simultaneously and see the plot showing their accumulated regrets and parameter estimation errors.

2.3 Influence of the Shape of Action Set

In this section, we investigate when and why linear bandit algorithms (algorithms that utilize the feature vectors of articles) have an advantage over multi-armed bandit algorithms (algorithms that only know the indices of articles).

First, we need to set the hyper-parameters of the simulation environment to make a fair comparison:

  • poolArticleSize = None
  • n_articles = context_dimension = K

Then run the algorithms under: 1) actionset = “basis_vector” 2) actionset = “random”

For the linear bandit algorithms, under which setting is the accumulated regret smaller, and why? (Hint: what is the relationship between the linear bandit setting where the feature vectors are all basis vectors and the multi-armed bandit setting?) Can you explain this using the regret upper bound results of the bandit algorithms?

Try different values of K and see how the performance gap between the two settings changes.

Deliverables:

  • (10 points + 5 bonus points for LinPHE) In your report, describe the two (or three, including the bonus) algorithms you use (clearly list the procedure: what the inputs are and what the steps are). Note that you should explicitly write out the equations you use, e.g., for computing the UCB score.
  • (20 points + 5 bonus points for LinPHE) In your report, copy and paste the important components of your implementation, such as the arm selection and parameter update steps of the algorithms.
  • (18 points) In your report, include the plots and summarize your findings for the experiments you conduct for Section 2.2, e.g., under different context_dimension, NoiseScale and poolArticleSize. You should also list the values of the other environment parameters you set for the experiments, e.g., testing_iterations, n_users, n_articles.
  • (10 points) In your report, include the plots and summarize your findings for the experiments you conduct for Section 2.3.