Now that we have the structure of our class, we'll need to initialize our data members to instances that will create our domain and define our task. First, create a default constructor. The first thing we'll do in the constructor is create our domain.
public BasicBehavior(){
    gwdg = new GridWorldDomain(11, 11);
    gwdg.setMapToFourRooms();
    domain = gwdg.generateDomain();
    //more to come...
}
The first line will create an 11x11 deterministic GridWorld. The second line sets the map to a pre-canned layout: the four rooms layout. This layout was used in the options work of Sutton, Precup, and Singh (1999) and it presents a simple environment for us to do some tests. Alternatively, you could define your own map layout by either passing the constructor a 2D integer array (with 1s specifying the cells with walls and 0s specifying open cells), or you could simply specify the size of the domain like we did and then use the GridWorldDomain generator's horizontalWall and verticalWall methods to place walls on it. The GridWorld domain also supports one-dimensional walls between cells that you can set, if you'd prefer that kind of domain. For simplicity, we'll stick with the four rooms layout.
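If you do want to lay out your own walls, a minimal sketch is below. It assumes the horizontalWall and verticalWall methods mentioned above take a start index, an end index, and the row or column to fill; treat the exact argument order as an assumption and check the GridWorldDomain Javadoc.

GridWorldDomain customGwdg = new GridWorldDomain(11, 11);

//assumed semantics: horizontalWall(xi, xf, y) fills cells (xi,y)..(xf,y) with walls
//and verticalWall(yi, yf, x) fills cells (x,yi)..(x,yf)
customGwdg.horizontalWall(0, 4, 5);
customGwdg.verticalWall(6, 10, 5);

Domain customDomain = customGwdg.generateDomain();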
The third line will produce the Domain object for the GridWorld. Recall from the previous part of the tutorial that Domain objects hold references to all the attributes, object classes, propositional functions, and actions (along with the actions' transition dynamics). In our GridWorld's case, there are two attributes, an X attribute and a Y attribute. There are also two classes: an AGENT class and a LOCATION class, each of which is defined by the X and Y attributes. While there could potentially be any number of AGENT object instantiations in a state, in this domain we expect only one to ever be defined. The location objects will be used for points of interest. Specifically, we will use a single location object to represent a goal location.
The GridWorld domain also defines five propositional functions: agentAtLocation(AGENT, LOCATION), wallToNorth(AGENT), wallToSouth(AGENT), wallToEast(AGENT), and wallToWest(AGENT). The first returns true when the specified AGENT object is at the same position as the specified LOCATION object. The latter four return true when there is a wall in the cell immediately adjacent to the specified AGENT object in the corresponding direction.
Finally, the GridWorld domain defines four actions to move north, south, east, or west of the agent's current position. Although we could have told the GridWorldDomain generator to make these movements stochastic (that is, specify a probability in which the agent moves in an unintended direction), in our specific example we have left them as the default deterministic actions. If an agent moves into a wall, then its position does not change.
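If you would rather have stochastic movement, the generator can be configured before generateDomain() is called. The sketch below assumes the mutator is named setProbSucceedTransitionDynamics; consult the GridWorldDomain Javadoc to confirm.

GridWorldDomain stochasticGwdg = new GridWorldDomain(11, 11);
stochasticGwdg.setMapToFourRooms();

//assumed mutator: the agent moves in the intended direction with probability 0.8,
//with the remaining probability spread over the other directions
stochasticGwdg.setProbSucceedTransitionDynamics(0.8);

Domain stochasticDomain = stochasticGwdg.generateDomain();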
Although this domain instantiation has settings specific to grid worlds, other domains in BURLAP follow similar conventions: you create an instance of a domain generator, specify the parameters of the domain through mutators, and then extract the domain with a call to the generator's generateDomain method. For example, the LunarLander domain generator lets you set properties like the maximum velocity and the force of gravity.
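For instance, a hedged sketch of the same pattern with the LunarLander generator might look like the following; the mutator names (setGravity, setVmax) are assumptions based on the properties described above, so check the LunarLanderDomain Javadoc for the actual API.

//assumes import of burlap.domain.singleagent.lunarlander.LunarLanderDomain
LunarLanderDomain lld = new LunarLanderDomain();
lld.setGravity(-0.2); //assumed mutator for the force of gravity
lld.setVmax(4.);      //assumed mutator for the maximum velocity
Domain lunarDomain = lld.generateDomain();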
Next we will create a state parser:
sp = new GridWorldStateParser(domain);
This state parser is a custom parser that is part of the GridWorld package. We could also have used the UniversalStateParser, StateYAMLParser, or StateJSONParser, all of which can provide state parsing for any possible OO-MDP domain; however, the UniversalStateParser is verbose in its String output, which can be undesirable if you are recording thousands of learning results. In such cases, a custom parser, such as the one we used here, can be more compact.
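As a quick illustration of what a StateParser provides, it converts a State to and from a String representation so that episodes can be written to and read from files. The sketch below assumes the interface's standard stateToString and stringToState method names and some already-constructed State s.

String serialized = sp.stateToString(s);       //write a state out as a String (e.g., to a results file)
State restored = sp.stringToState(serialized); //recover the State when reading the file back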
The next thing we will want to do is define the actual task to be solved for this domain. We will do this by specifying a reward function, a termination function, and a goal condition, the latter of which we will use exclusively for search-based deterministic planners that this tutorial will cover. In general, you can always define your own reward functions, terminal functions, and goal conditions by implementing the RewardFunction, TerminalFunction, and StateConditionTest interfaces, respectively, yourself. However, BURLAP also comes packaged with a bunch of standard instances (as well as various domain-specific functions) that we will use here. If you want to know more about defining your own, consult the Building a Domain tutorial.
rf = new UniformCostRF();
tf = new SinglePFTF(domain.getPropFunction(GridWorldDomain.PFATLOCATION));
goalCondition = new TFGoalCondition(tf);
The first line creates a reward function that returns -1 for every state-action-state transition. This might seem like a problem: shouldn't we define a reward function that returns a greater reward when the agent reaches the goal (there are existing reward function classes in the burlap.oomdp.singleagent.common package to do so)? However, the next line, which defines the terminal function, makes a larger goal reward unnecessary. Before explaining why, let's examine the creation of the TerminalFunction, which is specified as an instance of the SinglePFTF class ("single propositional function terminal function"). This class is provided a single propositional function of the domain; any state for which some object binding of that propositional function is true is marked as a terminal state. In the constructor parameters, the propositional function is retrieved by querying the domain for the propositional function named GridWorldDomain.PFATLOCATION, a constant field of the GridWorldDomain class holding the name of the atLocation propositional function. Note that if there were multiple LOCATION objects in the world, this TerminalFunction would mark any state in which the agent was at any of the locations as a terminal state.
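To make concrete what UniformCostRF is doing, the sketch below shows a hand-written reward function that behaves the same way. It assumes the RewardFunction interface's reward(State, GroundedAction, State) signature and import paths that match the package names used in this tutorial.

import burlap.oomdp.core.State;
import burlap.oomdp.singleagent.GroundedAction;
import burlap.oomdp.singleagent.RewardFunction;

//behaves like UniformCostRF: every transition receives a reward of -1
public class MinusOneRF implements RewardFunction {
    @Override
    public double reward(State s, GroundedAction a, State sprime) {
        return -1.;
    }
}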
We are using a terminal function and set of states that include location objects to highlight the OO-MDP way of working with objects and propositional functions in BURLAP. If you'd prefer a more classic method of defining grid world reward functions and terminal functions based on the location of the agent, you can do so using the GridWorldRewardFunction and GridWorldTerminalFunction classes that are part of the burlap.domain.singleagent.gridworld package.
The final line sets up the goal condition to be synonymous with the termination function. Note that goal states will not always be the same as terminal states (there may be terminal states that are not goal states), but in this example they are.
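Roughly speaking, TFGoalCondition just treats every terminal state as a goal state. A hedged sketch of an equivalent hand-written goal condition is below; it assumes the StateConditionTest interface exposes a satisfies(State) method and that the import paths match the package names used in this tutorial.

import burlap.behavior.singleagent.planning.StateConditionTest;
import burlap.oomdp.core.State;
import burlap.oomdp.core.TerminalFunction;

//treats every terminal state as a goal state, as TFGoalCondition does for us
public class TerminalGoalCondition implements StateConditionTest {

    TerminalFunction tf;

    public TerminalGoalCondition(TerminalFunction tf){
        this.tf = tf;
    }

    @Override
    public boolean satisfies(State s) {
        return this.tf.isTerminal(s);
    }
}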
Let us now return to why our reward function does not need to return a greater reward at the goal. Because the terminal function marks goal states as terminal, once the agent reaches the goal it can no longer receive any reward. Since every transition yields a reward of -1, the best way to maximize the return is to reach the goal as quickly as possible, after which no more negative reward is accumulated.
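Concretely, with no discounting, a trajectory that reaches the goal in 20 steps accumulates a return of -20, while one that takes 30 steps accumulates -30, so maximizing the return is the same as minimizing the number of steps to the goal.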
The next step will be to define the initial state of this task. We could either do this by creating an empty State object and then manually adding object instantiations for each object class, or we could use some methods of the GridWorldDomain class to facilitate the process. We will do the latter for brevity, but if you want a more complete description of creating a state object by hand, consider looking in the Building a Domain tutorial.
initialState = GridWorldDomain.getOneAgentOneLocationState(domain);
GridWorldDomain.setAgent(initialState, 0, 0);
GridWorldDomain.setLocation(initialState, 0, 10, 10);
The first line returns a state object with a single instance of the AGENT class and a single instance of the LOCATION class. The second line sets the agent's position to (0,0). The third line sets the location's position to (10,10); the first 0 in the setLocation parameters is the index of the LOCATION object to set, and since there is only one LOCATION object, we are setting the position of the 0th-indexed LOCATION object.
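For comparison, building the same state by hand would look roughly like the sketch below. It assumes the ObjectInstance/State API covered in the Building a Domain tutorial and that GridWorldDomain defines CLASSLOCATION, ATTX, and ATTY name constants alongside CLASSAGENT; verify the exact names against the Javadoc.

State handBuiltState = new State();

//assumed API: ObjectInstance(ObjectClass, name), setValue(attributeName, value), State.addObject
ObjectInstance agent = new ObjectInstance(domain.getObjectClass(GridWorldDomain.CLASSAGENT), "agent0");
agent.setValue(GridWorldDomain.ATTX, 0);
agent.setValue(GridWorldDomain.ATTY, 0);
handBuiltState.addObject(agent);

ObjectInstance location = new ObjectInstance(domain.getObjectClass(GridWorldDomain.CLASSLOCATION), "location0");
location.setValue(GridWorldDomain.ATTX, 10);
location.setValue(GridWorldDomain.ATTY, 10);
handBuiltState.addObject(location);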
The last part of the constructor is to define a method for computing state hash codes that can be used to efficiently look up state objects in various planning and learning algorithm data structures. Since this domain is discrete, we will use the common DiscreteStateHashFactory class.
hashingFactory = new DiscreteStateHashFactory();
hashingFactory.setAttributesForClass(GridWorldDomain.CLASSAGENT,
        domain.getObjectClass(GridWorldDomain.CLASSAGENT).attributeList);
Note that the second line is optional. If we did not include it, state hash codes would be computed with respect to all attributes of all objects. In this task, however, the location object's position is constant, so there is no reason to include the location object's attributes when computing hash codes; hashing only needs to be computed with respect to the X and Y attributes of the AGENT object, which are the values that vary between states in the task. The second line tells the hashing factory that we are going to manually define the set of attributes used for computing hash codes, and specifically to use all of the attributes of the AGENT class. We could instead have told it to use only the X attribute, or specified attributes for other classes as well by adding additional calls to setAttributesForClass. Regardless of which attributes are used for computing hash codes, it's important to note that when states are compared for equality, all attributes are checked. For instance, if two states with different positions for the location object were compared under our above definition of the hashing factory, they would produce identical hash codes but still be evaluated as different states. If you want to limit not only which attributes are used for computing hash codes but also which are used for checking state equality, use the DiscreteMaskHashingFactory class instead.
Note that GridWorldDomain.CLASSAGENT is a constant field of the GridWorldDomain class that holds the name of the AGENT class. The domain.getObjectClass method returns the object class with the specified name, and its attributeList field holds the list of attributes associated with that object class.
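For example, to hash on only the X attribute as mentioned above, you could pass a one-element attribute list instead. The sketch below assumes the domain exposes a getAttribute method and that GridWorldDomain defines an ATTX name constant; check the Javadoc for the exact names.

//hash only on the agent's X attribute
//(requires java.util.List, java.util.ArrayList, and burlap.oomdp.core.Attribute imports)
List<Attribute> xOnly = new ArrayList<Attribute>();
xOnly.add(domain.getAttribute(GridWorldDomain.ATTX)); //assumed getAttribute method and ATTX constant
hashingFactory.setAttributesForClass(GridWorldDomain.CLASSAGENT, xOnly);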
At this point you should have initialized all of the data members for the class and the final constructor will look something like the below.
public BasicBehavior(){

    //create the domain
    gwdg = new GridWorldDomain(11, 11);
    gwdg.setMapToFourRooms();
    domain = gwdg.generateDomain();

    //create the state parser
    sp = new GridWorldStateParser(domain);

    //define the task
    rf = new UniformCostRF();
    tf = new SinglePFTF(domain.getPropFunction(GridWorldDomain.PFATLOCATION));
    goalCondition = new TFGoalCondition(tf);

    //set up the initial state of the task
    initialState = GridWorldDomain.getOneAgentOneLocationState(domain);
    GridWorldDomain.setAgent(initialState, 0, 0);
    GridWorldDomain.setLocation(initialState, 0, 10, 10);

    //set up the state hashing system
    hashingFactory = new DiscreteStateHashFactory();
    hashingFactory.setAttributesForClass(GridWorldDomain.CLASSAGENT,
            domain.getObjectClass(GridWorldDomain.CLASSAGENT).attributeList);

}
Before we get to actually running planning and learning algorithms, we're going to want a way to visualize the results that they generate. While there are a few different ways to do this, for now we will define an offline visualizer that will allow us to run planning or learning completely, and then visualize the results after it's finished. Offline visualization has the advantage of not bogging down the runtime of planning/learning algorithms with time allocated for visualization. To create our offline visualizer, we will need to define a state visualizer and pass it to an EpisodeSequenceVisualizer. To do so, first add the following imports.
import burlap.oomdp.visualizer.Visualizer;
import burlap.behavior.singleagent.EpisodeSequenceVisualizer;

Then create the below method.
public void visualize(String outputPath){
    Visualizer v = GridWorldVisualizer.getVisualizer(gwdg.getMap());
    EpisodeSequenceVisualizer evis = new EpisodeSequenceVisualizer(v, domain, sp, outputPath);
}
Note that the outputPath parameter specifies the directory where our planning/learning results were stored (we'll get to this when we actually apply a planning/learning algorithm).
In order to visualize states, the domain will need a Visualizer defined for it, because it is impossible to tell how a domain should be visualized from its attributes alone! In this case, however, a class for generating a GridWorld Visualizer, called GridWorldVisualizer, already exists as part of the domains package. Its getVisualizer method requires the 2D int array specifying the layout of the map (since different grid worlds can use different layouts); this map is retrieved from our GridWorldDomain generator object with its getMap() method. The final line constructs and loads the EpisodeSequenceVisualizer. To visualize episode results with this class, you need to pass it a visualizer, the domain, the state parser (which will be used to extract the states out of stored files), and the path in which all the episodes you wish to visualize are located. Later in this tutorial, we will examine how to use the GUI of the EpisodeSequenceVisualizer.
Before moving on to the next part of the tutorial, let's also hook up our class constructor and visualize method in the main method.
public static void main(String[] args) {

    BasicBehavior example = new BasicBehavior();
    String outputPath = "output/"; //directory to record results

    //we will call planning and learning algorithms here

    //run the visualizer
    example.visualize(outputPath);

}
Note that you can set the output path to whatever you want. If it doesn't already exist, the code that will follow will automatically create it for you.