Tutorial: Basic Planning and Learning




You are viewing the tutorial for BURLAP 3; if you'd like the BURLAP 2 tutorial, go here.

Introduction

The purpose of this tutorial is to get you familiar with using some of the planning and learning algorithms in BURLAP. Specifically, this tutorial will cover instantiating a grid world domain bundled with BURLAP and solving it with Q-learning, Sarsa learning, BFS, DFS, A*, and Value Iteration. This tutorial will also show you how to visualize these results in various ways using tools in BURLAP. The take-home message you should get from this tutorial is that using different planning and learning algorithms largely amounts to changing the algorithm Java object you instantiate, with everything else staying the same. You are encouraged to extend this tutorial on your own using some of the other planning and learning algorithms in BURLAP.

The complete set of code written in this tutorial is available at the end, and in the burlap_examples GitHub repository.

Creating the class shell

For this tutorial, we will start by making a class that has data members for all of the domain- and task-relevant properties. In this tutorial we will call this class "BasicBehavior", but feel free to name it whatever you like. Since we will also be running the examples from this class, we'll include a main method. For convenience, we have also included at the start of the class all of the imports that you will need for this tutorial. If you use a good IDE, like IntelliJ or Eclipse, it can auto-import the classes as you go so that you never have to write an import line yourself.

	
import burlap.behavior.policy.GreedyQPolicy;
import burlap.behavior.policy.Policy;
import burlap.behavior.policy.PolicyUtils;
import burlap.behavior.singleagent.Episode;
import burlap.behavior.singleagent.auxiliary.EpisodeSequenceVisualizer;
import burlap.behavior.singleagent.auxiliary.StateReachability;
import burlap.behavior.singleagent.auxiliary.performance.LearningAlgorithmExperimenter;
import burlap.behavior.singleagent.auxiliary.performance.PerformanceMetric;
import burlap.behavior.singleagent.auxiliary.performance.TrialMode;
import burlap.behavior.singleagent.auxiliary.valuefunctionvis.ValueFunctionVisualizerGUI;
import burlap.behavior.singleagent.auxiliary.valuefunctionvis.common.ArrowActionGlyph;
import burlap.behavior.singleagent.auxiliary.valuefunctionvis.common.LandmarkColorBlendInterpolation;
import burlap.behavior.singleagent.auxiliary.valuefunctionvis.common.PolicyGlyphPainter2D;
import burlap.behavior.singleagent.auxiliary.valuefunctionvis.common.StateValuePainter2D;
import burlap.behavior.singleagent.learning.LearningAgent;
import burlap.behavior.singleagent.learning.LearningAgentFactory;
import burlap.behavior.singleagent.learning.tdmethods.QLearning;
import burlap.behavior.singleagent.learning.tdmethods.SarsaLam;
import burlap.behavior.singleagent.planning.Planner;
import burlap.behavior.singleagent.planning.deterministic.DeterministicPlanner;
import burlap.behavior.singleagent.planning.deterministic.informed.Heuristic;
import burlap.behavior.singleagent.planning.deterministic.informed.astar.AStar;
import burlap.behavior.singleagent.planning.deterministic.uninformed.bfs.BFS;
import burlap.behavior.singleagent.planning.deterministic.uninformed.dfs.DFS;
import burlap.behavior.singleagent.planning.stochastic.valueiteration.ValueIteration;
import burlap.behavior.valuefunction.QProvider;
import burlap.behavior.valuefunction.ValueFunction;
import burlap.domain.singleagent.gridworld.GridWorldDomain;
import burlap.domain.singleagent.gridworld.GridWorldTerminalFunction;
import burlap.domain.singleagent.gridworld.GridWorldVisualizer;
import burlap.domain.singleagent.gridworld.state.GridAgent;
import burlap.domain.singleagent.gridworld.state.GridLocation;
import burlap.domain.singleagent.gridworld.state.GridWorldState;
import burlap.mdp.auxiliary.stateconditiontest.StateConditionTest;
import burlap.mdp.auxiliary.stateconditiontest.TFGoalCondition;
import burlap.mdp.core.TerminalFunction;
import burlap.mdp.core.state.State;
import burlap.mdp.core.state.vardomain.VariableDomain;
import burlap.mdp.singleagent.common.GoalBasedRF;
import burlap.mdp.singleagent.common.VisualActionObserver;
import burlap.mdp.singleagent.environment.SimulatedEnvironment;
import burlap.mdp.singleagent.model.FactoredModel;
import burlap.mdp.singleagent.model.RewardFunction;
import burlap.mdp.singleagent.oo.OOSADomain;
import burlap.statehashing.HashableStateFactory;
import burlap.statehashing.simple.SimpleHashableStateFactory;
import burlap.visualizer.Visualizer;

import java.awt.*;
import java.util.List;


public class BasicBehavior {

	
	
	GridWorldDomain gwdg;
	OOSADomain domain;
	RewardFunction rf;
	TerminalFunction tf;
	StateConditionTest goalCondition;
	State initialState;
	HashableStateFactory hashingFactory;
	SimulatedEnvironment env;
	
	
	public static void main(String[] args) {
	
		//we'll fill this in later
	
	}
	
	
}
				

If you're already familiar with MDPs in general, the importance of some of these data members will be obvious. However, we will walk through in detail what each data member is and why we're going to need it.

GridWorldDomain gwdg
A GridWorldDomain is a DomainGenerator implementation for creating grid worlds.
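
As a quick preview, the generator can be configured before the domain is generated. The snippet below is an illustrative sketch: the 11x11 size, four-rooms map, and 0.8 action success probability are example choices, not requirements.

GridWorldDomain gwdg = new GridWorldDomain(11, 11); //an 11x11 grid world
gwdg.setMapToFourRooms(); //use the standard four-rooms layout
gwdg.setProbSucceedTransitionDynamics(0.8); //movement actions succeed with probability 0.8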

OOSADomain domain
An OOSADomain object is a fundamental class for defining problem domains that have OO-MDP state representations (although we will not focus on the OO-MDP aspects here). In short, a Domain object contains all of the elements of an MDP except the full state space, which is not included because for many domains it may be infinite.
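
Once a generator such as GridWorldDomain is configured, the domain itself is produced with its generateDomain method; for example (continuing the sketch above):

OOSADomain domain = gwdg.generateDomain(); //the generated domain holds the actions and transition model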

TerminalFunction tf
By default, our grid world will use a UniformCostRF, a reward function that returns -1 everywhere. But if we want to specify a goal state, we need to tell our grid world generator which states are terminal, which we do with a TerminalFunction. TerminalFunction is an interface with a boolean method that defines which states are terminal.
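
For grid worlds, BURLAP includes a GridWorldTerminalFunction that marks a chosen cell as terminal. For example, to make cell (10, 10) terminal (the coordinates here are just an illustration):

TerminalFunction tf = new GridWorldTerminalFunction(10, 10); //states with the agent at (10, 10) are terminal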

StateConditionTest goalCondition
Not all planning algorithms are designed to maximize reward functions. Many are instead defined as search algorithms that seek action sequences that will cause the agent to reach specific goal states. A StateConditionTest is an interface with a boolean method that takes a state as an argument, similar to a TerminalFunction, only we use it to specify arbitrary state conditions rather than just terminal states. We will use a StateConditionTest to specify the goal state(s) for search-based planning algorithms.
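
When the goal states are exactly the terminal states, the included TFGoalCondition wrapper turns a TerminalFunction into a StateConditionTest, for example:

StateConditionTest goalCondition = new TFGoalCondition(tf); //goal states are the states tf marks as terminal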

State initialState
Since domains are not required to enumerate entire state spaces, we will need to define at least the initial state of our problem, which we hold in this data member.
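
For a grid world, an initial state can be defined with a GridWorldState holding a GridAgent position and any location objects. As an illustration, the line below places the agent in the bottom-left corner with a location object at (10, 10); the positions and the name "loc0" are example values.

State initialState = new GridWorldState(new GridAgent(0, 0), new GridLocation(10, 10, "loc0"));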

HashableStateFactory hashingFactory
In this tutorial we will cover tabular algorithms: algorithms that learn or plan with tabular identifiers for states (in the later Solving Continuous Domains tutorial, we will cover how to use BURLAP to solve continuous domains). Typically, for fast access, tabular algorithms associate values with states in a HashMap, which means tabular methods need some way to compute hash codes and test equality of states. The obvious solution is for State implementations to implement the Java equals and hashCode methods. However, it is not uncommon that different scenarios require different ways of computing hash codes or state equality that the creator of the State did not anticipate, such as state abstraction or variable discretization. Therefore, BURLAP makes use of HashableStateFactory objects, which allow a client to specify how to hash and check equality for states. BURLAP also provides a number of default implementations.
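
For the tabular methods in this tutorial, the provided SimpleHashableStateFactory, which derives hash codes and equality from the state's variable values, is typically all you need:

HashableStateFactory hashingFactory = new SimpleHashableStateFactory();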

SimulatedEnvironment env
Learning algorithms address a problem in which the agent observes its environment, makes a decision, and then observes how the environment changes. This is a challenging problem because the agent initially does not know how the environment works or what a good decision is, but must live with the consequences of its decisions. To facilitate the construction of learning problems, all single-agent learning algorithms in BURLAP (algorithms that implement the LearningAgent interface) interact with an implementation of the Environment interface. One of the included concrete implementations is SimulatedEnvironment, which you can use to construct an Environment for a BURLAP domain that has an included model; BURLAP and its library extensions also provide other concrete Environment implementations.
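
For example, once the domain and initial state sketched above are defined, a SimulatedEnvironment can be constructed from them:

SimulatedEnvironment env = new SimulatedEnvironment(domain, initialState); //simulated environment that starts in initialState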

Since Environment is an interface, you can also easily implement your own version if you need a BURLAP agent to interact with external code or systems for which BURLAP does not already provide an Environment.