Tutorial: Basic Planning and Learning


Planning with Value Iteration

A common planning algorithm for stochastic domains is Value Iteration (VI). An advantage of VI is that it computes a policy for the entire portion of the state space that is reachable from the initial state passed to the planFromState method. To set up a VI planner, define the following method.

public void valueIterationExample(String outputPath){

	//discount factor of 0.99, a maximum value function change of 0.001 as the
	//convergence threshold, and at most 100 iterations over the state space
	Planner planner = new ValueIteration(domain, 0.99, hashingFactory, 0.001, 100);
	Policy p = planner.planFromState(initialState);

	//record a rollout of the resulting policy and write it to disk
	PolicyUtils.rollout(p, initialState, domain.getModel()).write(outputPath + "vi");

}
				

Instantiating Value Iteration works a lot like our deterministic search planning algorithms. However, because Value Iteration solves MDPs, it requires us to specify a discount factor; we've chosen 0.99. It also needs stopping criteria, because Value Iteration, as the name implies, is an iterative algorithm, and we need to define when enough iterations have been performed for a good solution. We've chosen for VI to terminate when either the changes in the value function are no larger than 0.001, or 100 iterations over the state space have been performed.

VI computes the optimal value function for the problem, which specifies the expected future discounted reward for taking each action in each state (the Q-function). Therefore, it returns a GreedyQPolicy. This policy looks at the Q-values the planner computes and returns the action with the maximum Q-value (breaking ties randomly). This policy can be used with any planning or learning algorithm that provides Q-values by implementing the QProvider interface.
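If you ever want to build such a policy yourself, rather than taking the one planFromState returns, a minimal sketch looks like the following. It assumes the planner variable from the method above and that GreedyQPolicy has a constructor accepting a QProvider; the cast is only there to make the relationship explicit.

//a minimal sketch: any QProvider, such as the ValueIteration planner above,
//can back a GreedyQPolicy directly
QProvider qSource = (QProvider)planner;
Policy greedyPolicy = new GreedyQPolicy(qSource);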

Try setting the main method to call our newly defined VI example method now.
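A minimal sketch of that call is shown below. It assumes the example class (here called BasicBehavior), output path, and visualize helper set up in the earlier parts of this tutorial; adjust the names to match your own code.

public static void main(String[] args) {

	BasicBehavior example = new BasicBehavior(); //assumed class name from the earlier parts
	String outputPath = "output/"; //directory to which episodes are written

	example.valueIterationExample(outputPath);
	example.visualize(outputPath); //assumed episode visualizer from the earlier parts

}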

Learning with Q-Learning

All of the previous examples used planning algorithms to solve our task. In this section, we will diverge from that and use a learning algorithm, Q-learning, to solve the task. Ultimately, learning algorithms are used in much the same way as planning algorithms, except that you run multiple episodes of learning in which the agent interacts with an Environment instance to solve the task (or one very long episode if it is a continuing task rather than an episodic one). The method you should define to use Q-learning is shown below.

public void QLearningExample(String outputPath){
		
	//discount factor of 0.99, initial Q-values of 0., and a learning rate of 1.
	LearningAgent agent = new QLearning(domain, 0.99, hashingFactory, 0., 1.);

	//run learning for 50 episodes
	for(int i = 0; i < 50; i++){
		Episode e = agent.runLearningEpisode(env);

		e.write(outputPath + "ql_" + i);
		System.out.println(i + ": " + e.maxTimeStep());

		//reset environment for next learning episode
		env.resetEnvironment();
	}
	
}				
				

Let's first look at the constructor. Rather than a planning instance, we're creating a LearningAgent instance, which provides methods for learning with an environment. QLearning is an instance of the LearningAgent interface and takes parameters for the domain, a discount factor, a HashableStateFactory, an initial value for the Q-values, and a learning rate (for a deterministic domain, 1.0 is a good choice). Note that this constructor will, by default, set Q-learning to use a 0.1 epsilon greedy learning policy. Other constructors allow you to specify the learning policy, and there is also a setter for it if you'd like to use a different one. Other Q-learning parameters can also be set, but we will not detail them here.
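For example, if you wanted to explore less than the default, a hedged sketch might look like the following; it assumes the setter is named setLearningPolicy and that BURLAP's EpsilonGreedy policy takes a QProvider and an epsilon value, and 0.05 is just an illustrative choice.

QLearning agent = new QLearning(domain, 0.99, hashingFactory, 0., 1.);

//swap the default 0.1 epsilon greedy learning policy for a 0.05 epsilon greedy one
agent.setLearningPolicy(new EpsilonGreedy(agent, 0.05));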

With the QLearning instance created, we next run 50 learning episodes, so we set up a for loop. To run a learning episode, we call the runLearningEpisode method on the LearningAgent instance and pass it the Environment in which learning will be performed. The method returns an Episode object, just as the policy rollouts did, so that a record of the interactions can be examined. As before, we write the returned episode to disk for viewing later.

Finally, at the end of the loop, we call the resetEnvironment method on the Environment. This method is the typical way to signal that an Environment needs to reset to an initial state from its current state, which may be a terminal state. When the method returns, the environment is expected to be in a non-terminal state from which an agent can act again.

After that, you can call this method from your main method and run the agent's behavior for each of the 50 episodes of learning! You should find that the agent got better as learning progressed. By the end, the agent's behavior will still be slightly random, since it's following an epsilon greedy policy that always takes some random actions. However, since QLearning implements the QProvider interface, you could always wrap a GreedyQPolicy around it, as with VI, to remove the random action selection.
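A hedged sketch of that idea, reusing the variables from the learning method above and the same GreedyQPolicy and PolicyUtils.rollout calls as the VI example, could be appended after the learning loop; it assumes the greedy policy actually reaches a terminal state.

//after learning, evaluate the learned Q-function greedily;
//QLearning provides Q-values, so it can back a GreedyQPolicy
Policy greedyPolicy = new GreedyQPolicy((QLearning)agent);
PolicyUtils.rollout(greedyPolicy, initialState, domain.getModel()).write(outputPath + "ql_greedy");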

Learning with Sarsa(λ)

A learning algorithm similar to Q-learning is Sarsa(λ). The first difference between the two algorithms is that Sarsa(λ) updates Q-values with respect to the Q-value of the next action taken, rather than the maximum Q-value of the next state (see Wikipedia for more information). The second, and larger, difference is that at every time step, Sarsa(λ) also updates the Q-values of state-action pairs experienced previously in the episode, by an amount that depends on λ and how long ago those experiences occurred. Define the below method to solve our task with Sarsa(λ).

public void SarsaLearningExample(String outputPath){
		
	//discount factor of 0.99, initial Q-values of 0., a learning rate of 0.5, and lambda of 0.3
	LearningAgent agent = new SarsaLam(domain, 0.99, hashingFactory, 0., 0.5, 0.3);

	//run learning for 50 episodes
	for(int i = 0; i < 50; i++){
		Episode e = agent.runLearningEpisode(env);

		e.write(outputPath + "sarsa_" + i);
		System.out.println(i + ": " + e.maxTimeStep());

		//reset environment for next learning episode
		env.resetEnvironment();
	}
	
}				
				

You will notice that this method looks nearly identical to the Q-learning example, except that this time a SarsaLam instance is constructed. Additionally, we lowered the learning rate to 0.5 (you should typically use a lower learning rate with a higher value of λ). The last parameter of the constructor is the λ value, which we set to 0.3. A value of λ=1.0 effectively makes the algorithm an online Monte Carlo method, in which the effects of all future interactions are fully considered when updating each Q-value of an episode. It is not always a good idea to use a large λ value.
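To make the role of λ more concrete, here is a purely illustrative tabular sketch of a single Sarsa(λ) backup with replacing traces. It is not BURLAP's implementation, all names in it are hypothetical, and it assumes java.util.Map and java.util.HashMap are imported.

//illustrative only: one tabular Sarsa(lambda) backup with replacing traces
double gamma = 0.99, lambda = 0.3, learningRate = 0.5;
Map<String, Double> q = new HashMap<>();      //Q-values keyed by a "state:action" string
Map<String, Double> traces = new HashMap<>(); //eligibility traces for the current episode

//the agent took action a in state s, observed reward r, then chose aPrime in sPrime
String visited = "s:a", next = "sPrime:aPrime";
double r = -1.;
double tdError = r + gamma * q.getOrDefault(next, 0.) - q.getOrDefault(visited, 0.);

//the pair just visited gets a full trace; every traced pair is then nudged toward
//the TD error in proportion to its trace, and all traces decay by gamma*lambda
traces.put(visited, 1.);
for(Map.Entry<String, Double> e : traces.entrySet()){
	String key = e.getKey();
	q.put(key, q.getOrDefault(key, 0.) + learningRate * tdError * e.getValue());
	e.setValue(e.getValue() * gamma * lambda);
}

In this sketch, a pair visited k steps ago receives a fraction (gamma*lambda)^k of the current TD error, which is why larger λ values push the updates toward Monte Carlo-style backups.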

Otherwise, the rest is the same; you can call this method from the main method and give it a shot!