This assignment is designed for you to practice classical solution methods for Markov Decision Processes (MDPs).

  • Part 1: Dynamic Programming (50 points): implement value iteration and policy iteration for MDPs, and test your code using the provided grid world environment;
  • Part 2: Model-free Control (50 points): implement off-policy Monte Carlo control and off-policy TD control (Q-learning) algorithms and test your code using the provided grid world environment.

A Collab assignment page has been created for this assignment. Please submit your written report strictly following the requirements specified below. The report must be in PDF format; name your submission with your computing ID followed by "-assignment2", e.g., "cl5ev-assignment2.PDF".

Grid world environment

For this assignment, you will work with this simple 4-by-4 grid world environment. The agent's goal is to reach the goal (cell 15) as quickly as possible while avoiding the pits (cells 5 and 9). This environment is defined in MDP.py.

---------------------
|  0 |  1 |  2 |  3 |
---------------------      -1 reward for each step
|  4 |  5 |  6 |  7 |      -70 reward for reaching cell 5 or 9
---------------------      +100 reward for reaching cell 15
|  8 |  9 | 10 | 11 |      
---------------------
| 12 | 13 | 14 | 15 |
---------------------

Components:

  • Action Space: 4 actions [0: up, 1: down, 2: left, 3: right]
  • State Space: 17 states (including an absorbing state that the agent transitions to after reaching cell 15)
  • Transition function T: A x S x S’ array
  • Reward function R: A x S array
  • Discount factor: a scalar in [0, 1)
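
For orientation, here is a minimal sketch of the component shapes listed above. The variable names mirror the description, but the actual arrays and any class interface live in MDP.py, so treat the names and the example discount value as assumptions:

    import numpy as np

    # Shapes implied by the component list: 4 actions, 17 states
    # (grid cells 0-15 plus one absorbing state).
    nActions, nStates = 4, 17

    T = np.zeros((nActions, nStates, nStates))  # T[a, s, s'] = P(s' | s, a)
    R = np.zeros((nActions, nStates))           # R[a, s] = expected immediate reward
    discount = 0.95                             # example value only; see MDP.py

    def check_transitions(T):
        """Sanity check once T is filled in: each (a, s) row must sum to 1 over s'."""
        return np.allclose(T.sum(axis=2), 1.0)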

Part 1: Dynamic Programming (50 points)

Dynamic programming solutions for MDPs assume a known environment, so that expectations can be taken over all possible next states and rewards. They then iteratively compute the optimal policy and value function of the given MDP.
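
As a concrete reference point, below is a minimal value-iteration sketch over the T/R/discount arrays described above. It shows only the core Bellman optimality backup; it is not the interface expected by the starter code in DP.py, and the function and variable names are illustrative.

    import numpy as np

    def value_iteration(T, R, discount, tol=0.01, V0=None):
        """Sketch of synchronous value iteration: repeat the backup
        V(s) <- max_a [ R(a, s) + discount * sum_s' T(a, s, s') * V(s') ]
        until the largest change in V drops below tol."""
        nStates = R.shape[1]
        V = np.zeros(nStates) if V0 is None else np.array(V0, dtype=float)
        nIterations = 0
        while True:
            Q = R + discount * (T @ V)   # Q[a, s], shape (nActions, nStates)
            V_new = Q.max(axis=0)
            nIterations += 1
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        policy = (R + discount * (T @ V)).argmax(axis=0)  # greedy policy w.r.t. final V
        return policy, V, nIterations

The same Q = R + discount * (T @ V) backup reappears inside policy iteration, where the max over actions is replaced by the action chosen by the current policy.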

Your task is to implement policy iteration and value iteration. Starter code is provided in DP.py, where you need to fill in the methods named policyIteration and valueIteration. You can add new methods to the class DynamicProgramming if you need to. For policyIteration, implement both ways of evaluating the policy: 1) solving a system of linear equations, and 2) iterative updates.
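
For the linear-equation variant of policy evaluation, a hedged sketch of just the evaluation step is below: it solves (I - discount * T_pi) V = R_pi directly, with the policy represented as an array of action indices. The function name and argument layout are illustrative, not the DP.py interface.

    import numpy as np

    def evaluate_policy_exact(policy, T, R, discount):
        """Exact policy evaluation: solve (I - discount * T_pi) V = R_pi, where
        T_pi and R_pi pick out, for every state, the action the policy prescribes."""
        nStates = R.shape[1]
        idx = np.arange(nStates)
        T_pi = T[policy, idx, :]   # (nStates, nStates) transition matrix under the policy
        R_pi = R[policy, idx]      # (nStates,) reward vector under the policy
        return np.linalg.solve(np.eye(nStates) - discount * T_pi, R_pi)

Policy improvement then sets, for each state, the policy to the argmax over actions of R + discount * (T @ V), and the two steps alternate until the policy stops changing.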

Apart from the lecture slides, you can also find detailed discussions of policy iteration and value iteration in Sections 4.3 and 4.4 of our RL textbook.

Deliverables:

  • (15 points - value iteration) Report the computed policy, value function and number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states. Copy and paste your value iteration code into your report.
  • (15 points - policy iteration v1) Report the computed policy, value function, and number of iterations needed by policy iteration (with policy evaluation performed by solving a system of linear equations), starting from the policy that chooses action 0 in all states. Copy and paste your policy iteration code into your report.
  • (20 points - policy iteration v2) Report the number of iterations needed for policy iteration to converge (with policy evaluation performed by iterative updates) when varying the number of policy-evaluation iterations from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states, and start with the value function that assigns 0 to all states. Discuss the impact of the number of policy-evaluation iterations on the results and relate the results to the previous two methods. Copy and paste your policy iteration code into your report (only the code snippet for the policy evaluation part if the other parts are the same as in the previous method).

Part 2: Model-free Control (50 points)

Unlike dynamic programming methods, which require full knowledge of the environment, model-free learning methods only require experience, i.e., sampled sequences of states, actions, and rewards from actual or simulated interaction with an environment, which makes them much more flexible and practical.

Your task is to implement off-policy Monte Carlo control (with weighted importance sampling) and off-policy TD control (Q-learning). Starter code is provided in RL.py, where you need to fill in the methods named OffPolicyMC and OffPolicyTD. For off-policy Monte Carlo control, choose the behavior policy to be epsilon-soft.
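
For the weighted-importance-sampling part, a minimal sketch of the per-episode backward pass is below, following the structure of the algorithm in Section 5.7 of the textbook. The table names Q and C, the (state, action, reward) episode format, and the epsilon-greedy form of the epsilon-soft behavior policy are assumptions; generating episodes against the grid world is omitted.

    import numpy as np

    def off_policy_mc_episode_update(episode, Q, C, discount, epsilon, nActions):
        """Backward pass of off-policy MC control with weighted importance sampling.
        episode: list of (state, action, reward) triples from the behavior policy.
        Q: action-value table; C: cumulative importance-sampling weights."""
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = discount * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            if a != np.argmax(Q[s]):
                break  # greedy target policy would not take a; remaining weight is 0
            # epsilon-soft behavior policy gives the greedy action this probability
            W /= (1.0 - epsilon + epsilon / nActions)
        return Q, C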

Again, apart from the lecture slides, you can also find detailed discussions of these two methods in Sections 5.7 and 6.5 of our RL textbook.
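
For off-policy TD control, the heart of the method is a single one-step update that bootstraps from the greedy (target-policy) value of the next state, regardless of which action the epsilon-greedy behavior policy actually takes. A hedged sketch, with helper names that are illustrative rather than the RL.py interface:

    import numpy as np

    def epsilon_greedy(Q, s, epsilon, rng):
        """Behavior policy: random action with probability epsilon, greedy otherwise."""
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    def q_learning_step(Q, s, a, r, s_next, alpha, discount):
        """One Q-learning update toward r + discount * max_a' Q(s', a')."""
        target = r + discount * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        return Q

During training you would call epsilon_greedy (with, e.g., numpy.random.default_rng()) to pick each action, and q_learning_step after every environment transition.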

Deliverables:

  • (25 points - off-policy MC control) Produce a figure where the x-axis indicates the # episodes (the total number of episodes should be large enough so that the algorithm converges) and the y-axis indicates the cumulative reward per episode (averaged over at least 10 runs); a plotting sketch for this kind of figure appears after this list. Copy and paste your off-policy MC control code into your report.
  • (25 points - off-policy TD control) Produce a figure where the x-axis indicates the # episodes (the total number of episodes should be large enough so that the algorithm converges) and the y-axis indicates the cumulative reward per episode (averaged over at least 10 runs). The figure should contain 4 curves corresponding to the exploration probability epsilon=0.05, 0.1, 0.3, and 0.5. Explain the impact of the exploration probability epsilon on the cumulative rewards per episode earned during training, as well as on the resulting Q-values and policy. Copy and paste your off-policy TD control code into your report.
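
For the two figures above, a hedged plotting sketch is below. The result arrays are placeholders to be filled with your own per-run, per-episode cumulative rewards; the epsilon values match the ones required in the TD-control deliverable.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder results: rewards[run, episode] = cumulative reward of one
    # training episode in one run (replace with your own recorded values).
    nRuns, nEpisodes = 10, 500
    rewards_by_eps = {eps: np.zeros((nRuns, nEpisodes)) for eps in (0.05, 0.1, 0.3, 0.5)}

    plt.figure()
    for eps, rewards in rewards_by_eps.items():
        plt.plot(np.arange(nEpisodes), rewards.mean(axis=0), label="epsilon = %g" % eps)
    plt.xlabel("# episodes")
    plt.ylabel("cumulative reward per episode (mean over runs)")
    plt.legend()
    plt.savefig("td_control_rewards.png")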