This assignment is designed for you to practice the policy gradient method for reinforcement learning, one of the most widely used families of RL algorithms.

  • Part 1: Implementing policy gradient algorithms (50 points): implement the REINFORCE and Actor-Critic algorithms and test your implementations on the provided grid world environment;

  • Part 2: Fine-tuning and comparisons (50 points + 10 bonus points): fine-tune the hyperparameters of both algorithms to obtain the best performance, and compare the results with the value-based algorithms you implemented in MP2.

A Collab assignment page has been created for this assignment. Please submit your written report strictly following the requirements specified below. The report must be in PDF format; name your submission with your computing ID followed by “-assignment3”, e.g., “cl5ev-assignment3.PDF”.

Grid world environment

For this assignment, you will work with the same 4-by-4 grid world environment as provided in MP2.

Part 1: Implementing policy gradient algorithms (50 points)

Policy gradient (PG) algorithms directly improve the policy via gradient-based optimization. Your task is to implement two PG algorithms, namely REINFORCE and Actor-Critic, to solve the grid world problem. The starter code is provided in PG.py, where you need to fill in the methods named reinforce and actorCritic. You can add new methods to the ReinforcementLearning class when needed. The inputs to both algorithms should include theta (i.e., the initial policy parameters) and any other necessary hyperparameters.
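As a point of reference for the deliverables below, here is a minimal standalone sketch of a tabular softmax policy with a REINFORCE update and a one-step Actor-Critic update. It does not follow the PG.py / ReinforcementLearning interface; all names (softmax_policy, grad_log_pi, reinforce_update, actor_critic_step), the assumed shapes (theta as a num_states-by-num_actions float array, V as a float array over states), and the default hyperparameter values are illustrative assumptions, not the required implementation.

    import numpy as np

    def softmax_policy(theta, state):
        # pi(a|s) proportional to exp(theta[s, a]); shift by the max for numerical stability
        prefs = theta[state] - np.max(theta[state])
        exp_prefs = np.exp(prefs)
        return exp_prefs / exp_prefs.sum()

    def grad_log_pi(theta, state, action):
        # gradient of log pi(a|s) w.r.t. theta: indicator(a) - pi(.|s) in the row for s, zero elsewhere
        grad = np.zeros_like(theta)
        grad[state] = -softmax_policy(theta, state)
        grad[state, action] += 1.0
        return grad

    def reinforce_update(theta, episode, alpha=0.01, gamma=0.95):
        # episode: list of (state, action, reward) tuples from one rollout
        # REINFORCE update: theta <- theta + alpha * gamma^t * G_t * grad log pi(a_t|s_t)
        G = 0.0
        total = np.zeros_like(theta)
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G                      # return from step t onward
            total += alpha * (gamma ** t) * G * grad_log_pi(theta, s, a)
        return theta + total

    def actor_critic_step(theta, V, s, a, r, s_next, done,
                          alpha_theta=0.01, alpha_v=0.1, gamma=0.95):
        # one-step Actor-Critic: the TD error delta drives both the critic (V) and the actor (theta)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        V[s] += alpha_v * delta                                          # critic update
        theta = theta + alpha_theta * delta * grad_log_pi(theta, s, a)   # actor update
        return theta, V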

Deliverables:

  • (30 points - Implementation) Report your algorithm implementations and explain the key steps in both algorithms. Specifically, write down the math equations for the parameterized policy, the computed gradient, and the update rule.
  • (20 points - Evaluation) Evaluate the optimized policy by producing a figure where the x-axis indicates the number of episodes (your total number of episodes should be large enough for the algorithm to converge) and the y-axis indicates the cumulative reward per episode starting from state-0 (averaged over at least 10 runs), for both algorithms; a plotting sketch follows this list. Copy and paste the main parts of your code into your report.
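The sketch below shows one way to produce the averaged learning curve described above. The wrappers run_reinforce and run_actor_critic in the commented usage are hypothetical: they are assumed to train your own reinforce and actorCritic implementations for a given number of episodes and return the per-episode cumulative reward from state-0.

    import numpy as np
    import matplotlib.pyplot as plt

    def average_learning_curve(run_fn, num_runs=10, num_episodes=3000):
        # run the trainer several times; each run is assumed to return an array of
        # length num_episodes holding the cumulative reward of every episode (from state-0)
        curves = np.array([run_fn(num_episodes) for _ in range(num_runs)])
        return curves.mean(axis=0)

    # Hypothetical usage:
    # reinforce_curve = average_learning_curve(run_reinforce)
    # ac_curve = average_learning_curve(run_actor_critic)
    # episodes = np.arange(1, len(reinforce_curve) + 1)
    # plt.plot(episodes, reinforce_curve, label="REINFORCE")
    # plt.plot(episodes, ac_curve, label="Actor-Critic")
    # plt.xlabel("Episode")
    # plt.ylabel("Cumulative reward from state-0 (avg over 10 runs)")
    # plt.legend()
    # plt.show()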

Part 2: Fine-tuning and comparisons (50 points + 10 bonus points)

You have probably learned from MP2 that the exploration rate needs to be carefully tuned to secure acceptable performance. For policy-based algorithms, you will face a similar situation and are thus encouraged to fine-tune the critical hyperparameters in search of better results. In this part, you are asked to compare the performance of all RL algorithms implemented in MP2 and MP3, using the evaluation protocol defined in Part 1. You should fine-tune the hyperparameters of all RL algorithms you have implemented so far (MC, TD, and PG) to report their best performance in this simple environment. Based on the experimental results, reflect on the advantages and disadvantages of these RL algorithms.

Deliverables:

  • (20 points - learning rate tuning) Fix the total number of episodes to 3000 and run the REINFORCE and Actor-Critic algorithms with different learning rates. Report the best learning rate, i.e., the one that ensures the fastest convergence of each algorithm, and explain how the learning rate affects performance.
  • (30 points - changing environment) In MDP.py, you can see that the transition function is parameterized by b, which controls the randomness of the environment. Run your code under different values of b and observe how b affects the performance of your algorithms (still with the total number of episodes set to 3000). Create a figure where the x-axis indicates the different values of b and the y-axis indicates the cumulative reward starting from state-0 after the fixed number of episodes (averaged over at least 10 runs); each algorithm's curve should cover the four values b = 0.0, 0.1, 0.2, and 0.3 (a sketch of such a sweep follows this list). Explain the impact of b on the cumulative rewards during training.
  • (10 bonus points - comparison between value-based and policy-based algorithms) Compare the two policy-based methods with the value-based methods you implemented in MP2 (i.e., off-policy MC and Q-learning) by producing a figure where the x-axis indicates the number of episodes (maximum 3000) and the y-axis indicates the cumulative reward per episode starting from state-0 (averaged over at least 10 runs). Identify the best-performing algorithm under different values of b and explain your findings.
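As a sketch of the b-sweep described in the changing-environment deliverable, the snippet below averages the final cumulative reward from state-0 over several independent runs for each value of b. The function train_and_evaluate and the per-algorithm wrappers in the commented usage are hypothetical assumptions; how b is actually passed to the environment depends on your MDP.py setup.

    import numpy as np
    import matplotlib.pyplot as plt

    B_VALUES = [0.0, 0.1, 0.2, 0.3]

    def sweep_b(train_and_evaluate, num_runs=10, num_episodes=3000):
        # for each b: average, over independent runs, the cumulative reward from
        # state-0 obtained after training for num_episodes episodes
        return [np.mean([train_and_evaluate(b, num_episodes) for _ in range(num_runs)])
                for b in B_VALUES]

    # Hypothetical usage; train_reinforce_and_evaluate / train_actor_critic_and_evaluate
    # are assumed to rebuild the grid world with noise parameter b, train one algorithm,
    # and return the final cumulative reward from state-0:
    # plt.plot(B_VALUES, sweep_b(train_reinforce_and_evaluate), marker="o", label="REINFORCE")
    # plt.plot(B_VALUES, sweep_b(train_actor_critic_and_evaluate), marker="o", label="Actor-Critic")
    # plt.xlabel("b (environment randomness)")
    # plt.ylabel("Cumulative reward from state-0 after 3000 episodes (avg over 10 runs)")
    # plt.legend()
    # plt.show()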