Tutorial: Basic Planning and Learning


Planning with Value Iteration

A common stochastic planner is Value Iteration (VI). An advantage of VI is that it computes a policy for the entire portion of the state space reachable from the initial state passed to the planFromState method. To set up a VI planner, define the following method.

public void ValueIterationExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	
	//creating the planning object; discount=0.99; value change threshold=0.001; max updates per state=100
	OOMDPPlanner planner = new ValueIteration(domain, rf, tf, 0.99, hashingFactory, 0.001, 100);
	planner.planFromState(initialState);
	
	//create a Q-greedy policy from the planner
	Policy p = new GreedyQPolicy((QComputablePlanner)planner);
	
	//record the plan results to a file
	p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
	
}
				

VI is a planning method defined for the classic MDP formalism, so unlike the previous deterministic planners, its constructor takes as input the reward function and termination function rather than a goal condition. VI also takes a discount factor as a parameter, which specifies how much future rewards are valued relative to immediate rewards. In this case, a fairly large value of 0.99 is used (which means the agent values rewards received later almost as much as immediate rewards).

The last two parameters of the constructor specify stopping conditions for planning. The second-to-last parameter specifies that planning will stop when the maximum change in the value function of any state is less than the given threshold (0.001 in this case). The last parameter specifies the maximum number of value function updates each state may receive before planning is stopped (100 in this case), regardless of whether the value function change threshold was crossed.
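
For reference, here is the same constructor call with each argument annotated; the comments are descriptive labels only, not the actual parameter names used in the ValueIteration source.

OOMDPPlanner planner = new ValueIteration(
		domain,          //the domain to plan over
		rf,              //reward function
		tf,              //termination function
		0.99,            //discount factor
		hashingFactory,  //state hashing factory used to index states
		0.001,           //stop when the largest value function change is below this threshold
		100);            //or stop after this many value updates of each state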

Since VI is a stochastic planning algorithm, rather than a deterministic one like the previous algorithms we used, we cannot capture its planning results in a SDPlannerPolicy Policy class. Instead, a policy can be derived from the value function the planner estimates for each state by using the GreedyQPolicy class, which can be defined for any planner that adheres to the QComputablePlanner interface, as the VI algorithm does.

Once the solution is captured in a Policy object, the results can be recorded and visualized in the same way as before. Try setting the main method to call our newly defined VI example method now.
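
For example, the call might look something like the sketch below. The BasicBehavior class, the example and outputPath variables, and the visualize helper are assumed from the earlier parts of this tutorial; adjust the names to match your own code.

public static void main(String [] args){

	BasicBehavior example = new BasicBehavior();
	String outputPath = "output/"; //directory in which to record results

	//call the new VI example instead of the earlier planners
	example.ValueIterationExample(outputPath);

	//launch the episode visualizer on the recorded results
	example.visualize(outputPath);

}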

Learning with Q-Learning

All of the previous examples used planning algorithms to solve our task. In this section, we will diverge from that and use a learning algorithm, Q-learning, to solve the task. Ultimately, learning algorithms are utilized in much the same way as planning algorithms, except that you will run multiple episodes of learning to solve the task (or one very long episode if it is a continuing task rather than an episodic one). The method you should define to utilize Q-learning is shown below.

public void QLearningExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	//creating the learning algorithm object; discount= 0.99; initialQ=0.0; learning rate=0.9
	LearningAgent agent = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);
	
	//run learning for 100 episodes
	for(int i = 0; i < 100; i++){
		EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
		ea.writeToFile(String.format("%se%03d", outputPath, i), sp); 
		System.out.println(i + ": " + ea.numTimeSteps());
	}
	
}				
				

Let's first look at the constructor. Rather than a planner instance, we're creating a LearningAgent instance, which provides some methods for learning. QLearning implements the LearningAgent interface and, like the Value Iteration planner, takes as parameters the reward function, termination function, and discount factor rather than a goal condition. The last two parameters of the constructor are the initial Q-value to use for previously untaken state-action pairs (which we set to 0.0) and the learning rate, which we set fairly high since this is a simple deterministic task. Note that this constructor will by default set Q-learning to use a 0.1 epsilon-greedy learning policy. There are other constructors that allow you to specify which learning policy to use, and there is also a mutator for setting it if you'd like a different policy; a sketch of the latter follows. Alternatively, you could pass Q-learning an object that specifies how the Q-value for each state-action pair should be initialized, rather than initializing them all to the same value as we did here. For this tutorial, however, we will not bother with that.
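
For instance, if you wanted a smaller exploration rate, something along these lines should work. Treat this as a sketch: the setLearningPolicy mutator and the EpsilonGreedy policy class are part of BURLAP, but check their documentation for the exact signatures.

//sketch: constructing the agent as a QLearning reference so we can set a different learning policy
QLearning qAgent = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);

//use a 0.05 epsilon-greedy learning policy instead of the default 0.1
qAgent.setLearningPolicy(new EpsilonGreedy(qAgent, 0.05));

LearningAgent agent = qAgent;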

With the QLearning instance created, we next run learning episodes rather than telling it to plan a solution. To make sure a decent solution is learned, we let learning run for 100 episodes, so we set up a loop to do so. To run a learning episode, we call the method runLearningEpisodeFrom on the LearningAgent instance and pass it the initial state from which the learning episode should begin. Since this method performs learning for an entire episode, it returns an EpisodeAnalysis object that stores the results of the episode. Once we have this episode, we simply save it to a file like we did the planning results in previous examples (taking care to give each episode a different name so that we can view them all). We also added a line to print the number of steps taken in each learning episode to the terminal so that we can follow the agent's overall progress as it learns.

After that, you can call this method from your main method and run it! This time, in the EpisodeSequenceVisualizer launcher you will find 100 episode files in the list, one for each episode of learning that occurred. You should find that behavior in the earlier episodes is quite bad, but as the agent experiences the world, its performance gets better. You may find that even after a large amount of learning the behavior is still slightly random; this randomness is a result of learning with the epsilon-greedy policy, which always takes a random action with some probability. Alternatively, you could capture the final learning results in a GreedyQPolicy like we did with VI and record those results instead of the results produced under the learning policy; a sketch of that follows.
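
Here is a minimal sketch of that alternative, placed after the learning loop in QLearningExample. It assumes QLearning also adheres to the QComputablePlanner interface required by GreedyQPolicy, and the output file name qLearnedResult is just an illustration.

//after the learning loop: evaluate the final greedy policy and record it
Policy p = new GreedyQPolicy((QComputablePlanner)agent);
p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "qLearnedResult", sp);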

Note

While QLearning adheres to the LearningAgent interface, it is also an instance of the OOMDPPlanner, which means we could have called the planFromState method on it. For the QLearning algorithm, the planFromState method will run some number of learning episodes automatically for you. By default it will only run one (which is unlikely to yield very good results!), but you can adjust the number of episodes it runs, similar to how the VI planner decides when to stop planning. Specifically, you can set a maximum number of learning episodes to run with the setMaxEpisodesForPlanning method, or you can specify a Q-value change threshold with the setMaxQChangeForPlanningTerminaiton method, which terminates planning when the maximum Q-value change within an episode is less than the threshold. A sketch of this usage follows.
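
The sketch below shows what that might look like; the mutator names are taken from the note above, but treat the argument values and exact signatures as assumptions to verify against the QLearning documentation.

//sketch: using QLearning as a planner instead of running episodes manually
QLearning qAgent = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);
qAgent.setMaxEpisodesForPlanning(100);             //run at most 100 learning episodes
qAgent.setMaxQChangeForPlanningTerminaiton(0.001); //or stop once Q-values barely change within an episode
qAgent.planFromState(initialState);

//then capture the learned behavior with a greedy policy, as before
Policy p = new GreedyQPolicy((QComputablePlanner)qAgent);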

Learning with Sarsa(λ)

A learning algorithm similar to Q-learning is Sarsa(λ). The first difference between the two algorithms is that Sarsa(λ) updates Q-values with respect to the Q-value of the next action actually taken, rather than the maximum Q-value of the next state (see Wikipedia for more information). The second, and larger, difference is that at every time step, Sarsa(λ) will also update the Q-values of state-action pairs experienced previously in the episode, with the size of each update scaled by λ and by how long ago the experience occurred. Define the below method to solve our task with Sarsa(λ).

public void SarsaLearningExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	//discount= 0.99; initialQ=0.0; learning rate=0.5; lambda=1.0
	LearningAgent agent = new SarsaLam(domain, rf, tf, 0.99, hashingFactory, 0., 0.5, 1.0);
	
	//run learning for 100 episodes
	for(int i = 0; i < 100; i++){
		EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
		ea.writeToFile(String.format("%se%03d", outputPath, i), sp);
		System.out.println(i + ": " + ea.numTimeSteps());
	}
	
}				
				

You will notice that this method looks nearly identical to the Q-learning example, except that this time a SarsaLam instance is constructed. Additionally, we lowered the learning rate to 0.5 (typically you should use lower learning rates when you have a higher value of λ). The last parameter of the constructor is the λ value, which we set to 1.0. A value of λ=1.0 effectively makes the algorithm an online Monte Carlo method, in which the effects of all future interactions are fully considered when updating each Q-value of an episode.
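
If you want to experiment with that trade-off, you can vary the last two constructor arguments; the values below are illustrative choices, not recommendations from this tutorial.

//an intermediate lambda paired with a higher learning rate (illustrative values only)
LearningAgent agent = new SarsaLam(domain, rf, tf, 0.99, hashingFactory, 0., 0.9, 0.5);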

Otherwise, the rest is the same; you can call this method from the main method and give it a shot! You should find that learning is much faster than with Q-learning when this higher value of λ is used. Like QLearning, the SarsaLam instance also supports the planning method, and the conditions for planning termination can be set in the same way (SarsaLam is actually a subclass of QLearning).