Tutorial: Basic Planning and Learning


Initializing the data members

Now that we have the structure of our class, we'll need to initialize our data members to instances that will create our domain and define our task. First, create a default constructor. The first thing we'll do in the constructor is create our domain.

public BasicBehavior(){
	gwdg = new GridWorldDomain(11, 11);
	gwdg.setMapToFourRooms(); 
	domain = gwdg.generateDomain();
	
	//more to come...
}
				

The first line will create an 11x11 deterministic grid world. The second line sets the map to a pre-canned layout: the four rooms layout. This layout was used in the options work of Sutton, Precup, and Singh (1999) and presents a simple environment for us to do some tests. Alternatively, you could define your own map layout, either by passing the constructor a 2D integer array (with 1s specifying the cells with walls and 0s specifying open cells), or by simply specifying the size of the domain like we did and then using the GridWorldDomain object's horizontalWall and verticalWall methods to place walls on it. The GridWorldDomain also supports 1-dimensional walls between cells that you can set, if you'd prefer that kind of domain. For simplicity, we'll stick with the four rooms layout.
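For illustration, configuring a custom layout (rather than calling setMapToFourRooms on the second line) might look something like the sketch below. The wall placements and the exact parameter ordering of horizontalWall and verticalWall are assumptions for illustration only; consult the GridWorldDomain documentation for the precise signatures.

//a sketch of a hand-built layout; the parameter ordering of the wall
//methods is an assumption, so check the GridWorldDomain documentation
GridWorldDomain customGwdg = new GridWorldDomain(11, 11);
customGwdg.horizontalWall(0, 4, 5); //assumed: wall segment spanning columns 0-4 in row 5
customGwdg.verticalWall(6, 10, 5);  //assumed: wall segment spanning rows 6-10 in column 5

//or supply the full map yourself as a 2D int array
int[][] map = new int[11][11];
map[5][5] = 1; //1 marks a wall cell, 0 an open cell
GridWorldDomain customGwdg2 = new GridWorldDomain(map);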

The third line will produce the Domain object for the grid world. Recall from the previous part of the tutorial that Domain objects hold references to all the attributes, object classes, propositional functions, and actions (along with the actions' transition dynamics).
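If you want to see what ended up in the generated Domain, you can iterate over its contents. The accessor names below (getActions and getPropFunctions) are assumptions based on the BURLAP 2 Domain API, so verify them against your version.

//a quick sketch that prints the actions and propositional functions the
//generated Domain holds; getActions/getPropFunctions are assumed accessors
for(Action a : domain.getActions()){
	System.out.println("action: " + a.getName());
}
for(PropositionalFunction pf : domain.getPropFunctions()){
	System.out.println("propositional function: " + pf.getName());
}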

How a GridWorldDomain Domain is defined

A GridWorldDomain has two primary attributes, an X attribute and a Y attribute. There are also two classes: an AGENT class and a LOCATION class, each of which is defined by the X and Y attributes. While there could potentially be any number of AGENT object instantiations in a state, in this domain we expect only one to ever be defined. The location objects will be used for points of interest. Specifically, we will use a single location object to represent a goal location.

The GridWorld domain also defines five propositional functions: atLocation(AGENT, LOCATION), wallToNorth(AGENT), wallToSouth(AGENT), wallToEast(AGENT), and wallToWest(AGENT). The first returns true when the specified AGENT object is at the same position as the specified LOCATION object. The latter four return true when there is a wall immediately adjacent to the specified AGENT object in the corresponding direction.

Finally, the GridWorld domain defines four actions that move the agent north, south, east, or west of its current position. Although we could have told the GridWorldDomain generator to make these movements stochastic (that is, to specify a probability with which the agent moves in an unintended direction), in our example we have left them as the default deterministic actions. If an agent moves into a wall, its position does not change.
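Had we wanted noisy movement, we could have configured the generator before calling generateDomain. A minimal sketch is shown below; the mutator name setProbSucceedTransitionDynamics is an assumption, so check the GridWorldDomain documentation for the exact method.

//a sketch of making movement stochastic before generating the domain;
//the mutator name is an assumption to verify against your BURLAP version
gwdg.setProbSucceedTransitionDynamics(0.8); //intended direction succeeds 80% of the time
domain = gwdg.generateDomain();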

Although this domain instantiation has settings specific to grid worlds, different domains in BURLAP follow similar conventions. That is, you create an instance of a domain generator, specify the parameters of the domain through its mutators, and finally extract the Domain with a call to its generateDomain method. For example, the LunarLander domain generator lets you set properties like the maximum velocity and the force of gravity.
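As a sketch of that same pattern applied to a different domain, the code below configures a LunarLander generator. The mutator name setGravity is an assumption used only to illustrate the convention; consult the LunarLanderDomain documentation for its actual API.

//the same generator pattern with another domain; setGravity is an assumed
//mutator name used for illustration only
LunarLanderDomain lld = new LunarLanderDomain();
lld.setGravity(-0.2); //assumed mutator for the force of gravity
Domain llDomain = lld.generateDomain();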


Next we will want to define the actual task to be solved for this domain. We will do this by specifying a reward function, a terminal function, and a goal condition; the last of these we will use exclusively for the search-based deterministic planners that this tutorial covers. In general, you can always define your own reward functions, terminal functions, and goal conditions by implementing the RewardFunction, TerminalFunction, and StateConditionTest interfaces, respectively. However, BURLAP also comes packaged with a number of standard implementations (as well as various domain-specific ones) that we will use here. If you want to know more about defining your own, consult the Building a Domain tutorial.

rf = new UniformCostRF(); 
tf = new SinglePFTF(domain.getPropFunction(GridWorldDomain.PFATLOCATION)); 
goalCondition = new TFGoalCondition(tf);
				

The first line will create a reward function that always returns -1 for every state-action-state transition. The next line defines a terminal function that identifies terminal states as states in which the agent is at the same position as a location object. When the reward function (-1 everywhere) is taken together with this terminal function (everything ends when the agent reaches a location), it motivates the agent to reach a location object as soon as possible.

How SinglePFTF works

The TerminalFunction is created as an instance of SinglePFTF. SinglePFTF is a terminal function that takes as input a PropositionalFunction and treats any state in which the propositional function is true for any valid objects as a terminal state. In this case, we provided it the GridWorldDomain's atLocation propositional function. This propositional function operates on two objects: an agent and a location object. For example, we might see the function evaluated on agent0 and location2: atLocation(agent0, location2). The propositional function returns true whenever the input agent is at the same position as the input location object. Therefore, for any input state, SinglePFTF finds all possible agent and location pairs and checks whether atLocation is true for any of them. If it is, then SinglePFTF evaluates that state as a terminal state.
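To make that concrete, the check SinglePFTF performs is roughly equivalent to the hand-rolled helper sketched below, which you could add to our class. The getAllGroundedPropsForState and isTrue calls are assumptions based on the BURLAP 2 PropositionalFunction/GroundedProp API, so verify them against your version.

//a rough, hand-rolled equivalent of the check SinglePFTF performs; the
//grounded proposition methods used here are assumptions to verify
public boolean isAtAnyLocation(State s){
	PropositionalFunction atLoc = domain.getPropFunction(GridWorldDomain.PFATLOCATION);
	for(GroundedProp gp : atLoc.getAllGroundedPropsForState(s)){
		if(gp.isTrue(s)){ //true when this agent-location pairing coincides
			return true;
		}
	}
	return false;
}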

Note that we are using UniformCostRF and SinglePFTF to illustrate some of the more general approaches in BURLAP to defining reward functions and terminal functions. However, GridWorldDomain also has a specific reward function (GridWorldRewardFunction) and terminal function (GridWorldTerminalFunction) designed for grid world problems that, in practice, you may want to use instead.

The final line sets up the goal condition to be synonymous with the termination function. Note that goal states will not always be the same as terminal states (there may be terminal states that are not goal states, such as failure states), but in this example they are.
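If you did want to implement these interfaces yourself rather than use the packaged classes, the sketch below shows a hand-written equivalent of UniformCostRF and a terminal function that delegates to the isAtAnyLocation helper sketched in the aside above. The reward(State, GroundedAction, State) and isTerminal(State) signatures assume the BURLAP 2 interface definitions; a StateConditionTest can be implemented the same way.

//a minimal hand-written equivalent of UniformCostRF; the interface
//signature shown assumes the BURLAP 2 RewardFunction definition
RewardFunction myRF = new RewardFunction() {
	@Override
	public double reward(State s, GroundedAction a, State sprime) {
		return -1.; //uniform cost of -1 for every transition
	}
};

//and a terminal function that reuses the helper sketched earlier; again a
//sketch, not the packaged SinglePFTF implementation
TerminalFunction myTF = new TerminalFunction() {
	@Override
	public boolean isTerminal(State s) {
		return isAtAnyLocation(s);
	}
};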


The next step will be to define the initial state of this task. We could either do this by creating an empty State object and then manually adding object instantiations for each object class, or we could use some methods of the GridWorldDomain class to facilitate the process. We will do the latter for brevity, but if you want a more complete description of creating a state object by hand, consider looking in the Building a Domain tutorial.

initialState = GridWorldDomain.getOneAgentNLocationState(domain, 1);
GridWorldDomain.setAgent(initialState, 0, 0);
GridWorldDomain.setLocation(initialState, 0, 10, 10);	
				

The first line will return a state object with a single instance of the AGENT class and a single instance of the LOCATION class. The second line then sets the agent to be at position 0,0. The third line sets the location to be at position 10,10. The first 0 in the third line's parameters indicates which LOCATION object to set, by index. Since there is only one LOCATION object, we are setting the position of the LOCATION object at index 0.
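If you want to sanity check the state you just built, you can pull the agent object back out of it and read its position. The getFirstObjectOfClass and getIntValForAttribute accessors below are assumptions based on the BURLAP 2 OO-MDP State and ObjectInstance API.

//a sketch for inspecting the state we just built; the accessor names are
//assumptions based on the BURLAP 2 State/ObjectInstance API
ObjectInstance agent = initialState.getFirstObjectOfClass(GridWorldDomain.CLASSAGENT);
int ax = agent.getIntValForAttribute(GridWorldDomain.ATTX);
int ay = agent.getIntValForAttribute(GridWorldDomain.ATTY);
System.out.println("agent starts at (" + ax + ", " + ay + ")"); //expect (0, 0)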


Next we will instantiate the HashableStateFactory that we wish to use. Since we are not doing anything fancy like state abstraction, we will use SimpleHashableStateFactory, which, when given no constructor arguments, also uses object identifier independence, as discussed previously.

hashingFactory = new SimpleHashableStateFactory();
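If you ever do want object identifier dependence instead, the factory is assumed to take a boolean constructor argument for that choice, as sketched below; verify the SimpleHashableStateFactory constructors in your version.

//a sketch: passing false is assumed to select identifier-dependent hashing;
//verify the SimpleHashableStateFactory constructor before relying on this
HashableStateFactory identifierDependentHashing = new SimpleHashableStateFactory(false);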
				

Finally, we will instantiate an Environment with which the agent will interact in our learning algorithm demonstrations. Since we will be using BURLAP's simulation of the environment, we will use a SimulatedEnvironment, which along with the domain, needs to be told about the reward function, terminal function and initial state for the environment.

env = new SimulatedEnvironment(domain, rf, tf, initialState);
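Later in the tutorial, learning algorithms will drive this environment for us, but you can also step it by hand to see how it behaves. Every call in the sketch below (getCurrentObservation, getAllApplicableGroundedActions, executeAction, the EnvironmentOutcome r field, isInTerminalState, and resetEnvironment) is an assumption about the BURLAP 2 Environment and Action API, so verify the names before relying on it.

//a rough sketch of stepping the simulated environment by hand; the method
//and field names here are assumptions about the BURLAP 2 API to verify
GroundedAction north = domain.getAction(GridWorldDomain.ACTIONNORTH)
		.getAllApplicableGroundedActions(env.getCurrentObservation()).get(0);
EnvironmentOutcome outcome = env.executeAction(north);
System.out.println("reward received: " + outcome.r);
System.out.println("in terminal state: " + env.isInTerminalState());
env.resetEnvironment(); //put the environment back at the initial state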
				

At this point you should have initialized all of the data members for the class and the final constructor will look something like the below.

public BasicBehavior(){
	
	//create the domain
	gwdg = new GridWorldDomain(11, 11);
	gwdg.setMapToFourRooms();
	domain = gwdg.generateDomain();

	//define the task
	rf = new UniformCostRF();
	tf = new SinglePFTF(domain.getPropFunction(GridWorldDomain.PFATLOCATION));
	goalCondition = new TFGoalCondition(tf);

	//set up the initial state of the task
	initialState = GridWorldDomain.getOneAgentNLocationState(domain, 1);
	GridWorldDomain.setAgent(initialState, 0, 0);
	GridWorldDomain.setLocation(initialState, 0, 10, 10);

	//set up the state hashing system for tabular algorithms
	hashingFactory = new SimpleHashableStateFactory();

	//set up the environment for learning algorithms
	env = new SimulatedEnvironment(domain, rf, tf, initialState);

}
				

Setting up a result visualizer

Before we get to actually running planning and learning algorithms, we're going to want a way to visualize the results that they generate. While there are a few different ways to do this, for now we will define an offline visualizer that lets us run planning or learning to completion and then visualize the results afterwards. Offline visualization has the advantage of not bogging down the runtime of planning/learning algorithms with time allocated for visualization. To create our offline visualizer, we will need to define a state Visualizer and pass it to an EpisodeSequenceVisualizer. A Visualizer is a Java canvas that can render State objects. An EpisodeSequenceVisualizer lets you view and explore the episodes (state-action-reward sequences) that an agent generated, and it can either load the episodes from files or be provided them programmatically. In this example, we will save results to file and load them back up. To handle this kind of result visualization, create the below method.

public void visualize(String outputPath){
	Visualizer v = GridWorldVisualizer.getVisualizer(gwdg.getMap());
	new EpisodeSequenceVisualizer(v, domain, outputPath);
}
				

Note that the outputPath parameter specifies the directory where our planning/learning results were stored (we'll get to this when we actually apply a planning/learning algorithm).

The state Visualizer we will use is the one designed for rendering grid world states. It takes as input the map of the world (a 2D int array), which we retrieve from our GridWorldDomain instance. Note that other domains included in BURLAP have their own Visualizers that you can use for them.


Before moving on to the next part of the tutorial, let's also hook up our class constructor and visualizer method to the main method.

public static void main(String[] args) {


	BasicBehavior example = new BasicBehavior();
	String outputPath = "output/"; //directory to record results
	
	//we will call planning and learning algorithms here
	
	
	//run the visualizer
	example.visualize(outputPath);

}
				
				

Note that you can set the output path to whatever you want. If it doesn't already exist, the code that saves the results will automatically create it (more on that next).