Next, we will define the model of our Grid World: how transitions and rewards are generated from actions. Let's start by defining a map of our grid world that matches the Grid World image we used earlier in the tutorial (an 11x11 world split into four "rooms"). We will define this map using a 2D int array, with the first dimension representing x and the second dimension representing y. Add the following to your ExampleGridWorld domain generator class:
//ordered so first dimension is x
protected int [][] map = new int[][]{
        {0,0,0,0,0,1,0,0,0,0,0},
        {0,0,0,0,0,0,0,0,0,0,0},
        {0,0,0,0,0,1,0,0,0,0,0},
        {0,0,0,0,0,1,0,0,0,0,0},
        {0,0,0,0,0,1,0,0,0,0,0},
        {1,0,1,1,1,1,1,1,0,1,1},
        {0,0,0,0,1,0,0,0,0,0,0},
        {0,0,0,0,1,0,0,0,0,0,0},
        {0,0,0,0,0,0,0,0,0,0,0},
        {0,0,0,0,1,0,0,0,0,0,0},
        {0,0,0,0,1,0,0,0,0,0,0},
};
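Since the first dimension is x, cells are read as map[x][y]. For example, map[3][5] is 1, meaning there is a wall segment at position (3, 5), while map[3][4] is 0, meaning (3, 4) is an open cell. The snippet below is purely illustrative (it is not part of the class you are building) and assumes it is placed inside a method of ExampleGridWorld where the map field is accessible:

int wallCell = map[3][5]; //1: part of a dividing wall
int openCell = map[3][4]; //0: an open cell the agent may occupy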
To define our Grid World Model, we're going to use a FactoredModel implementation that is provided with BURLAP. A FactoredModel is a model that divides its duties into three components: a SampleStateModel, which defines the state transitions; a RewardFunction, which defines the rewards for given state transitions; and a TerminalFunction, which defines which states are terminal states of the MDP. Most domains in BURLAP use a FactoredModel, because it is often the case that clients will want to change the task of a domain (defined by the reward function and terminal states), but not the "physics" of the domain (how state transitions occur).
Let's start with the most complex part: the state model. Just as the complete model has both a SampleModel (which merely samples transitions) and a FullModel (which can also enumerate the transition probability distribution), the state model comes in SampleStateModel and FullStateModel variants. We will implement the FullStateModel. Below, we've defined an inner class of ExampleGridWorld for our FullStateModel, with the required methods left unimplemented; we will walk through them next.
protected class GridWorldStateModel implements FullStateModel{

    @Override
    public List<StateTransitionProb> stateTransitions(State s, Action a) {
        return null;
    }

    @Override
    public State sample(State s, Action a) {
        return null;
    }
}
We're going to define our domain so that our four actions (north, south, east, west) are stochastic: with 0.8 probability the agent will move in the intended direction, and with 0.2 probability it will randomly move in one of the other three directions. To encode this stochasticity, let's define a matrix of direction transition probabilities, with the first dimension indexed by the selected action and the second dimension indexed by the actual direction the agent will move; the values specify the probability of moving in that direction given the selected action. We will also implement a constructor that fills out this matrix, which will have 0.8 along the diagonal and 0.2/3 on the off-diagonal elements. Note that each row will then sum to 1, making it a proper probability distribution.
protected double [][] transitionProbs;

public GridWorldStateModel() {
    this.transitionProbs = new double[4][4];
    for(int i = 0; i < 4; i++){
        for(int j = 0; j < 4; j++){
            double p = i != j ? 0.2/3 : 0.8;
            transitionProbs[i][j] = p;
        }
    }
}
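For reference, the matrix this constructor produces is shown below. This block is purely illustrative (you do not need to add it); rows are indexed by the selected action and columns by the actual movement direction, using the 0: north, 1: south, 2: east, 3: west ordering we adopt in the next methods.

//Illustrative only: the matrix built by the constructor above.
//Each row sums to 0.8 + 3 * (0.2/3) = 1.0, a proper probability distribution.
double [][] expected = new double[][]{
        {0.8,   0.2/3, 0.2/3, 0.2/3}, //north selected
        {0.2/3, 0.8,   0.2/3, 0.2/3}, //south selected
        {0.2/3, 0.2/3, 0.8,   0.2/3}, //east selected
        {0.2/3, 0.2/3, 0.2/3, 0.8}    //west selected
};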
Our actions in this domain will be represented with String names (we could alternatively make Actions that are defined by int values, but for simplicity and descriptive reasons, we will use String names). Therefore, we will first want to define a method that converts an action name into an int index for the direction transition matrix we defined:
protected int actionDir(Action a){
    int adir = -1;
    if(a.actionName().equals(ACTION_NORTH)){
        adir = 0;
    }
    else if(a.actionName().equals(ACTION_SOUTH)){
        adir = 1;
    }
    else if(a.actionName().equals(ACTION_EAST)){
        adir = 2;
    }
    else if(a.actionName().equals(ACTION_WEST)){
        adir = 3;
    }
    return adir;
}
When we either sample a state transition or enumerate all possible outcomes, we will want to query the outcome of the agent moving in some direction. Let's add a method for doing that now.
protected int [] moveResult(int curX, int curY, int direction){

    //first get change in x and y from direction using 0: north; 1: south; 2: east; 3: west
    int xdelta = 0;
    int ydelta = 0;
    if(direction == 0){
        ydelta = 1;
    }
    else if(direction == 1){
        ydelta = -1;
    }
    else if(direction == 2){
        xdelta = 1;
    }
    else{
        xdelta = -1;
    }

    int nx = curX + xdelta;
    int ny = curY + ydelta;

    int width = ExampleGridWorld.this.map.length;
    int height = ExampleGridWorld.this.map[0].length;

    //make sure new position is valid (not a wall or off bounds)
    if(nx < 0 || nx >= width || ny < 0 || ny >= height ||
            ExampleGridWorld.this.map[nx][ny] == 1){
        nx = curX;
        ny = curY;
    }

    return new int[]{nx,ny};
}
This method will return a 2 element int array, where the first component is the new x position of the agent and the second component is the new y position. Primarily, the method just increments/decrements the x and y values depending on the action direction. However, it also checks the map of our world, and if the agent would have moved into a wall or off the grid, then the agent's position will not change.
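As a concrete example, consider the agent at (0, 4) attempting to move north: the cell above it, map[0][5], is a wall, so the position is unchanged. The hypothetical call below is illustrative only and assumes it is made from within GridWorldStateModel:

int [] pos = this.moveResult(0, 4, 0); //0 = north; map[0][5] is a wall
//pos[0] == 0 and pos[1] == 4: the agent stays where it was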
Now let's implement the sample method.
@Override
public State sample(State s, Action a) {

    s = s.copy();
    EXGridState gs = (EXGridState)s;
    int curX = gs.x;
    int curY = gs.y;

    int adir = actionDir(a);

    //sample direction with random roll
    double r = Math.random();
    double sumProb = 0.;
    int dir = 0;
    for(int i = 0; i < 4; i++){
        sumProb += this.transitionProbs[adir][i];
        if(r < sumProb){ //found direction
            dir = i;
            break;
        }
    }

    //get resulting position
    int [] newPos = this.moveResult(curX, curY, dir);

    //set the new position
    gs.x = newPos[0];
    gs.y = newPos[1];

    //return the state we just modified
    return gs;
}
The first thing we do is make a copy of the input state, which will be modified and returned. Then we get the agent's x and y position by casting the State to our EXGridState class. We also get the index of our action, and then sample a resulting direction from the direction transition matrix. Finally, we get the resulting new position from our moveResult method and update the copied state to that new position.
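To make the sampling loop concrete, the cumulative sums partition the unit interval so that the roll r selects each direction with the intended probability. The breakdown below, for a north action, is purely illustrative (values are approximate):

//Illustrative only: how the roll r maps to a direction for a north action,
//using the probabilities from the constructor above.
//r in [0.0,    0.8)    -> direction 0 (north, the selected direction)
//r in [0.8,    0.8667) -> direction 1 (south)
//r in [0.8667, 0.9333) -> direction 2 (east)
//r in [0.9333, 1.0)    -> direction 3 (west)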
Next we will implement the stateTransitions method.
@Override
public List<StateTransitionProb> stateTransitions(State s, Action a) {

    //get agent current position
    EXGridState gs = (EXGridState)s;
    int curX = gs.x;
    int curY = gs.y;

    int adir = actionDir(a);

    List<StateTransitionProb> tps = new ArrayList<StateTransitionProb>(4);
    StateTransitionProb noChange = null;
    for(int i = 0; i < 4; i++){

        int [] newPos = this.moveResult(curX, curY, i);
        if(newPos[0] != curX || newPos[1] != curY){
            //new possible outcome
            EXGridState ns = gs.copy();
            ns.x = newPos[0];
            ns.y = newPos[1];

            //create transition probability object and add to our list of outcomes
            tps.add(new StateTransitionProb(ns, this.transitionProbs[adir][i]));
        }
        else{
            //this direction didn't lead anywhere new
            //if there are existing possible directions
            //that wouldn't lead anywhere, aggregate with them
            if(noChange != null){
                noChange.p += this.transitionProbs[adir][i];
            }
            else{
                //otherwise create this new state and transition
                noChange = new StateTransitionProb(s.copy(), this.transitionProbs[adir][i]);
                tps.add(noChange);
            }
        }
    }

    return tps;
}
In many ways, this method is a lot like our sample method, except that instead of randomly sampling a direction, we iterate over every possible outcome direction and consider movement in that direction. For each possible direction, we make a copy of the input state, change its position based on our moveResult method, and wrap it in a StateTransitionProb: a pair consisting of the outcome state we created and the probability of that outcome (taken from the corresponding entry in our transition matrix). Each StateTransitionProb is added to a list that the method returns.
There is one extra bit of bookkeeping we perform in this method. If the agent tries to move into a wall, its position does not change, and if walls exist on multiple sides of the agent, then multiple directions would result in the agent not moving. However, we don't want a separate StateTransitionProb element for each occurrence of the agent not changing position. So if the agent's position doesn't change for some direction, we simply add the probability mass of attempting to move in that direction to any existing StateTransitionProb element we have already created for the agent not changing position.
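As a sanity check, the probabilities in the returned list should always sum to 1, even when several blocked directions are collapsed into one no-change outcome. The sketch below is illustrative only: it assumes it is run from code with access to the protected inner class (for example, a test in the same package), uses the EXGridState class defined earlier in this tutorial, and uses BURLAP's SimpleAction to construct an unparameterized Action from a String name. It queries the transitions for a north action from the corner (0, 0), where both south and west would leave the grid:

ExampleGridWorld gw = new ExampleGridWorld();
GridWorldStateModel sm = gw.new GridWorldStateModel();

//from (0, 0), south and west both lead off the grid, so their probability
//mass is aggregated into a single "no change" outcome
List<StateTransitionProb> tps = sm.stateTransitions(
        new EXGridState(0, 0), new SimpleAction(ExampleGridWorld.ACTION_NORTH));

double total = 0.;
for(StateTransitionProb tp : tps){
    total += tp.p;
}
System.out.println(tps.size() + " outcomes; total probability = " + total);
//expected: 3 outcomes; total probability = 1.0 (up to floating point rounding)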
We've now completed the state transition model. Next, let's implement a RewardFunction and TerminalFunction. We'll start with the TerminalFunction, which will mark a single location in our grid world as a terminal (goal) state; we'll let that location be a constructor parameter.
public static class ExampleTF implements TerminalFunction {

    int goalX;
    int goalY;

    public ExampleTF(int goalX, int goalY){
        this.goalX = goalX;
        this.goalY = goalY;
    }

    @Override
    public boolean isTerminal(State s) {

        //get location of agent in next state
        int ax = (Integer)s.get(VAR_X);
        int ay = (Integer)s.get(VAR_Y);

        //are they at goal location?
        if(ax == this.goalX && ay == this.goalY){
            return true;
        }

        return false;
    }
}
Our reward function will work similarly, but return a reward of -1 for all transitions, except the transition to a goal location, which will return +100.
public static class ExampleRF implements RewardFunction {

    int goalX;
    int goalY;

    public ExampleRF(int goalX, int goalY){
        this.goalX = goalX;
        this.goalY = goalY;
    }

    @Override
    public double reward(State s, Action a, State sprime) {

        //get location of agent in the next state (sprime)
        int ax = (Integer)sprime.get(VAR_X);
        int ay = (Integer)sprime.get(VAR_Y);

        //are they at goal location?
        if(ax == this.goalX && ay == this.goalY){
            return 100.;
        }

        return -1;
    }
}
Note that the reward function operates on the sprime method argument, not the s argument. The sprime argument specifies the state to which the agent transitioned, whereas s specifies the state the agent left. Since we want the goal reward to occur on transitions into the goal location, we evaluate sprime.
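To see the difference, consider a goal at (10, 10). The sketch below is illustrative only; EXGridState is the state class defined earlier in this tutorial, and since our reward function ignores the action argument, null is passed purely for the example:

ExampleRF rf = new ExampleRF(10, 10);
State goal = new EXGridState(10, 10);
State adjacent = new EXGridState(9, 10);

double intoGoal = rf.reward(adjacent, null, goal);   //100.0: sprime is the goal location
double outOfGoal = rf.reward(goal, null, adjacent);  //-1.0: sprime is not the goal location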
We're just about ready to finish up our domain. Before we do, let's add two data members and a method to ExampleGridWorld that allow a client to set the goal location.
Add the data members
protected int goalx = 10;
protected int goaly = 10;
And add the method
public void setGoalLocation(int goalx, int goaly){
    this.goalx = goalx;
    this.goaly = goaly;
}
Now let's finish by implementing our generateDomain method!
@Override
public SADomain generateDomain() {

    SADomain domain = new SADomain();

    domain.addActionTypes(
            new UniversalActionType(ACTION_NORTH),
            new UniversalActionType(ACTION_SOUTH),
            new UniversalActionType(ACTION_EAST),
            new UniversalActionType(ACTION_WEST));

    GridWorldStateModel smodel = new GridWorldStateModel();
    RewardFunction rf = new ExampleRF(this.goalx, this.goaly);
    TerminalFunction tf = new ExampleTF(this.goalx, this.goaly);

    domain.setModel(new FactoredModel(smodel, rf, tf));

    return domain;
}
The generateDomain method starts by making a new SADomain instance. Then we add an ActionType for each of our actions. Our grid world's north, south, east, and west actions are unparameterized actions that can be applied anywhere in the world, so we can use BURLAP's provided UniversalActionType implementation, which simply requires a name for each action.
We then create an instance of our state model, along with a reward function and terminal function built from our implemented classes, with the goal location set to the ExampleGridWorld instance's goalx and goaly values. These elements are used to define a FactoredModel, which is set on our domain. Then we return the created domain!
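With everything in place, a minimal usage sketch looks like the following. It is illustrative only: EXGridState is the state class defined earlier in this tutorial, and it assumes BURLAP's SimulatedEnvironment, which wraps a domain's model and an initial state into an Environment that agents can interact with:

ExampleGridWorld gen = new ExampleGridWorld();
gen.setGoalLocation(10, 10);
SADomain domain = gen.generateDomain();

//start the agent in the bottom-left corner of the grid
State initialState = new EXGridState(0, 0);

//wrap the domain and initial state in an environment for agents to interact with
SimulatedEnvironment env = new SimulatedEnvironment(domain, initialState);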
We've now created all the elements we need for our grid world MDP, and it's ready to be used with BURLAP algorithms. However, it is often very useful to be able to visualize a domain. In the next section, we will show you how to create a visualizer for our grid world domain.