In the previous tutorials, we walked through how you can use existing planning and learning algorithms in BURLAP on domains in BURLAP and how to create your own domains on which you can use those algorithms. However, you may also want to extend or create your own planning or learning algorithm in BURLAP that can either be used on existing domains or novel BURLAP domains and be easily compared against existing algorithms. In this tutorial, we will show you how to create both a planning algorithm and a learning algorithm. In particular, we will be reimplementing versions of Value Iteration and Q-learning since they are conceptually simple algorithms to implement, but will expose the various properties of BURLAP you would want to use. In general, however, you should defer to using the existing implementations of these algorithms in BURLAP since they will support more features than we will cover here.

Value Iteration (VI) is an algorithm that finds the optimal value function (the expected discounted future reward of being in a state an behaving optimally from it), and consequentially the optimal policy, for an MDP's entire state space. Central to the idea of VI is the Bellman Equation, which states that the optimal value of a state is the value of the action with the maximum expected discounted future return (the action with the maximum Q-value) where the Q-value for a state-action pair is defined as the expected value over all possible state transitions of the immediate reward summed with the discounted value of the resulting state. In math: $$ \large V(s) = \max_a Q(s,a) $$ $$ \large Q(s,a) = \sum_{s'} T(s' | s,a) \left[ R(s, a, s') + \gamma V(s') \right], $$

where $T(s' | s, a)$ is the probability of transitioning to state $s'$ when taking action $a$ in state $s$, $R(s, a, s')$ is the reward received for transitioning to state $s'$ after taking action $a$ in state $s$, and $\gamma$ is a discount factor affecting how much immediate rewards are preferred to later ones.

This equation presents some issues in that if an MDP has cycles, it's unclear what the values should be. However, the Value Iteration algorithm will converge to the optimal Value function if you simply initialize the value for each state to some arbitrary value, and then iteratively use the Bellman equation to update the the value for each state.

Planning algorithms that make use of the Bellman equation to estimate the Value function are known as Dynamic Programming (DP) planning algorithms. Different DP algorithms specify different priorities for when the estimated value of each state is updated with the Bellman equation. In the case of VI, Bellman updates are performed in entire sweeps of the state space. That is, at the start, the value for all states is initialized to some arbitrary value. Then, for each state in the state space, the Bellman equation is used to update the value function estimate. Sweeps over the entire state space are repeated for some fixed number of iterations or until the maximum change in the value function is small. The algorithm is summarized in the below pseudocode.

- Initialize value function $V(s)$ arbitrarily for all states $s$.
- Repeat until convergence...
- For each state $s$
- $V(s) := \max_a \sum_{s'} T(s' | s, a) \left[R(s,a,s') + γ V(s')\right]$

Since there are a number of different DP algorithms that can be implemented, BURLAP includes a class called DynamicProgramming that includes a number of helpful methods for automatically performing Bellman Updates on states and which is extended by many of the DP algorithms included in BURLAP. However, to give a better sense of how to use the more fundamental parts of a planning algorithm in BURLAP, we will instead write our VI algorithm from scratch without using the DynamicProgramming class. However, we will extend the MDPSolver class since it provides a number of useful datamemb ers and setter and getting methods that you will commonly want to have for algorithms that solve MDPs.