This tutorial will cover three topics. First, we will review a little of the theory behind Markov Decision Processes (MDPs), which is the typical decision-making problem formulation that most planning and learning algorithms in BURLAP use. Next, we will discuss how you can implement an MDP definition in BURLAP. Finally, we will show you some basics with how to interact with your MDP and visualize it.

Beyond MDPs, BURLAP also supports stochastic games and partially observable MDPs. The BURLAP example code repository has some examples with these problems, but a core understanding of the MDP representation will cover a lot of the basics that are shared in those problem types. BURLAP also has first class support for the object-oriented MDP (OO-MDP) state representation. OO-MDPs have a lot of nice properties, but it's easier to first describe how to define general MDP states, and OO-MDPs are not necessary for every problem. Therefore, we will leave the discussion about OO-MDPs for a subsequent tutorial.

If you are already familiar with MDPs, or just want to get down to coding, feel free to skip the first section that discuss their mathematical description.

For more information on how to use planning and learning algorithms on the MDP domains that you create or that are already in BURLAP, see the Basic Planning and Learning tutorial.

Note that all of the code in this tutorial is listed at the end and is also available in the burlap_examples github repository.

To implement agents that learn how to behave or plan out behaviors for an environment, a formal description of the environment and the decision-making problem must first be defined. One of the most common formalisms used by learning and planning algorithms is the Markov Decisions Process (MDP), which considers the agent making decisions at discrete time steps to maximize a reward signal. In this tutorial, we will formalize a grid world as an MDP. A grid world is a 2D environment in which an agent can move north, south, east or west by one unit each time step, provided there are no walls in the way. The below image shows a simple grid world with the agent's position represented by a gray circle and walls of the environment painted black. Typically, the goal in a grid world is for the agent to navigate to some location, but there are a number of variants.

**Figure:** An example grid world.

All MDPs are defined by four components that we will need to specify for our grid world: a set of possible states of the environment ($S$); a set of actions that the agent can take ($A$); a definition of how actions change the state of the environment, known as the transition dynamics ($T$); and the rewards the agent receives for each of its actions ($R$), known as the reward function, which will determine what the best behavior is (that is, the agent will want to act in a way that maximizes the reward it receives).

The transition dynamics are formulated as a probabilistic function $T(s' | s, a)$, which defines the probability of the environment changing to state $s'$ in the next discrete time step when the agent takes action $a$ in the current state $s$. The fact that the environment can change stochastically is one of the unique properties of an MDP compared to more classic AI deterministic planning/decision making problems.

The reward function is a function of the last state, the action taken in that last state, and the next state to which the environment transitioned as a result of that action: $R(s, a, s')$.

Notice how both the transition dynamics and reward function are temporally independent from everything predating the most recent
state and action? Requiring this temporal independence from everything earlier than the last event makes this a *Markov* system and is why this formalism is called a *Markov* decision process.

In our grid world, the set of states are the possible locations of the agent. The actions are north, south, east, and west. We will define the transition dynamics to be stochastic so that with high probability (0.8) the agent will move in the intended direction, and with some low probability (0.2) move in a different direction. You can imagine this stochasticity being a model for a robot with slightly unreliable driving capabilities (which, as it turns out, is often the norm!). The transition dynamics will also encode the walls of our grid world by specifying that movement that would lead into a wall will result in returning to the same state. Finally, we will define a reward function that returns a high reward when the agent reaches the top right corner of the environment and zero everywhere else, to motivate movement to that corner.

For various kinds of problems, the concept of terminal states is often useful. A terminal state is a state that once reached causes all further action of the agent to cease. This is a useful concept for defining goal-directed tasks (i.e., action stops once the agent achieves a goal), failure conditions, or any number of other reasons. In our grid world, we'll want to specify the top right corner the agent is trying to get to as a terminal state. An MDP definition does not include a specific element for specifying terminal states because they can be implicitly defined in the transition dynamics and reward function. That is, a terminal state can be encoded in an MDP by being a state in which every action causes a deterministic transition back to itself with zero reward. However, for convenience and other practical reasons, terminal states in BURLAP are specified directly with a function (more on that later).

The goal of planning or learning in an MDP is to find the behavior that maximizes the reward the agent receives over time. More formally, we'd say that the goal is to find a *policy* ($\pi$), that is a mapping from states in the MDP to actions that the agent takes ($\pi : S \rightarrow A$). Sometimes, the policy can also be defined as a probability distribution over action selection in each state (and BURLAP supports this represenation), but for the moment we will consider the case when it is a direct mapping from states to actions. To find the policy that maximizes the reward over time, we must first define what it means to maximize reward *over time* and there are a number of different temporal objectives we
can imagine that change what the optimal policy is.

Perhaps the most intuitive way to define the temporal reward maximization is to maximize the the expected total future reward; that is, the best policy is the one that results in largest possible sum of all future rewards. Although this metric is sometimes appropriate, it's often problematic because it can result in policies that are evaluated to have infinite value or which do not discriminate between policies that receive the rewards more quickly. For example, in our grid world, any path the agent took to reach the top right would have a value 1 (because they all would eventually reach the goal); however, what we'd probably want instead is for the agent to prefer a policy that reaches the goal as fast as possible. Moreover, if we didn't set the top right corner to be a terminal state, the agent could simply move away from it and back into it, resulting in any policy that reaches the goal having an infinite value, which makes comparisons between different policies even more difficult!

A common
alternative that is more robust is to define the objective to be
to maximize the expected *discounted* future reward. That is, the
agent is trying to maximize expected sum
$$\large \sum_{t=0}^\infty \gamma^t r_t,$$
where $r_t$ is the reward received at time $t$ and $\gamma$ is the discount factor that dictates how much preference an agent has for more immediate rewards; in other words, the agent's patience. With $\gamma = 1$, the agent values distant
rewards that will happen much further in the future as much as the reward the agent will receive in the next time step; therefore, $\gamma = 1$ results in the same objective of summing all future rewards together and will have the same problems previously mentioned. With $\gamma = 0$ all the agent values only the next reward and does not care about anything that happens afterwards, thereby resulting in all policies having a finite value. This setting has the inverse effect of the agent never caring about our grid world goal location unless it's one step away. However, a $\gamma$ value somewhere in between 0 and 1 often results in what we want. That is, for all values of $0 \lt \gamma \lt 1$, the expected future discounted reward in an MDP when following any given policy is finite, while still considering events that happen in the distant future, and with a preference for more immediate satisfaction. If we set $\gamma$ to 0.99 in our grid world, the result is that the agent would want to get to the goal as fast as possible, because waiting longer would result in the eventual +1 goal reward being more heavily discounted. Although there are still other ways to define the temporal objective of an MDP, current learning and planning algorithms in BURLAP are based on using
the discounted reward, with the geometric discount factor $\gamma$ left as a parameter that the user can specify.