public class BoundedRTDP extends DynamicProgramming implements Planner
An implementation of Bounded RTDP [1]. Bounded RTDP is similar to standard RTDP [2], with the main difference
being that both an upper bound and a lower bound value function are computed. Like RTDP, the upper bound is used to guide planning rollout
exploration, but when planning rollouts are complete, the lower bound value function is used to direct behavior. Using the lower
bound provides significantly better anytime planning performance that does not require convergence to get reasonable results.
Another difference in Bounded RTDP is that the value of each state in a rollout is updated in reverse after the rollout completes,
which has the effect of propagating goal state values back to the beginning (though this can be disabled in this implementation using
the setRunRolloutsInRevere(boolean) method).
Finally, after action selection, the next outcome state from which the rollout continues may be selected in a number of ways.
The way presented in the original paper is to select the next state randomly according to the transition dynamics, weighted
by the margin between the upper bound and lower bound of the possible next states, which promotes exploration toward states that are uncertain.
However, in practice, we found that the standard unweighted sampling approach of RTDP can work better and is more efficient; therefore,
the default in this implementation is to use the unweighted transition dynamics, but it can be changed to the approach presented in the paper using
the method setStateSelectionMode(StateSelectionMode). Another optional state selection mode is to always choose the next state
with the highest uncertainty, but this tends to be even slower because it is overly conservative, so it is not recommended in general.
See the BoundedRTDP.StateSelectionMode
documentation for more information.
1. McMahan, H. Brendan, Maxim Likhachev, and Geoffrey J. Gordon. "Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
2. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
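For orientation, here is a minimal usage sketch. The grid world setup, discount factor, bound initializations, and rollout budget are illustrative assumptions, not prescriptions; ConstantValueFunction and SimpleHashableStateFactory are standard BURLAP helpers.

    import burlap.behavior.policy.GreedyQPolicy;
    import burlap.behavior.singleagent.planning.stochastic.rtdp.BoundedRTDP;
    import burlap.behavior.valuefunction.ConstantValueFunction;
    import burlap.domain.singleagent.gridworld.GridWorldDomain;
    import burlap.domain.singleagent.gridworld.GridWorldTerminalFunction;
    import burlap.domain.singleagent.gridworld.state.GridAgent;
    import burlap.domain.singleagent.gridworld.state.GridWorldState;
    import burlap.mdp.core.state.State;
    import burlap.mdp.singleagent.SADomain;
    import burlap.statehashing.simple.SimpleHashableStateFactory;

    public class BoundedRTDPExample {
        public static void main(String[] args) {
            // Illustrative domain: BURLAP's 11x11 four-rooms grid world
            // with a terminal goal cell at (10, 10).
            GridWorldDomain gwd = new GridWorldDomain(11, 11);
            gwd.setMapToFourRooms();
            gwd.setTf(new GridWorldTerminalFunction(10, 10));
            SADomain domain = gwd.generateDomain();

            // Assumed reward scale in [0, 1]: initialize the lower bound
            // pessimistically (0) and the upper bound optimistically (1).
            BoundedRTDP planner = new BoundedRTDP(
                    domain,
                    0.99,                               // gamma
                    new SimpleHashableStateFactory(),
                    new ConstantValueFunction(0.),      // lowerVInit
                    new ConstantValueFunction(1.),      // upperVInit
                    0.01,                               // maxDiff
                    500);                               // maxRollouts

            State initialState = new GridWorldState(new GridAgent(0, 0));
            GreedyQPolicy policy = planner.planFromState(initialState);

            // After planning, value queries go through the lower bound by default.
            System.out.println("V(s0) = " + planner.value(initialState));
        }
    }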
Modifier and Type  Class and Description 

protected static class 
BoundedRTDP.StateSelectionAndExpectedGap
A tuple class for a hashed state and the expected value function margin/gap of the source transition.

static class 
BoundedRTDP.StateSelectionMode
The different ways that states can be selected for expansion.

QProvider.Helper
Modifier and Type  Field and Description 

protected boolean 
currentValueFunctionIsLower
Whether the current
DynamicProgramming valueFunction reference points to the lower bound value function or the upper bound value function. 
protected boolean 
defaultToLowerValueAfterPlanning
Sets what the
DynamicProgramming valueFunction reference points to (the lower bound or the upper bound) once a planning rollout is complete. 
protected java.util.Map<HashableState,java.lang.Double> 
lowerBoundV
The lower bound value function

protected ValueFunction 
lowerVInit
The lower bound value function initialization

protected int 
maxDepth
The maximum depth/length of a rollout before it is terminated and Bellman updates are performed.

protected double 
maxDiff
The max permitted difference between the lower bound and upper bound for planning termination.

protected int 
maxRollouts
The maximum number of rollouts to perform when planning is started, unless the value function margin becomes small enough sooner.

protected int 
numBellmanUpdates
Keeps track of the number of Bellman updates that have been performed across all planning.

protected int 
numSteps
Keeps track of the number of rollout steps that have been performed across all planning rollouts.

protected boolean 
runRolloutsInReverse
Whether each rollout should be run in reverse after completion.

protected BoundedRTDP.StateSelectionMode 
selectionMode
Which state selection mode is used.

protected java.util.Map<HashableState,java.lang.Double> 
upperBoundV
The upper bound value function

protected ValueFunction 
upperVInit
The upper bound value function initialization

operator, valueFunction, valueInitializer
actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description 

BoundedRTDP(SADomain domain,
double gamma,
HashableStateFactory hashingFactory,
ValueFunction lowerVInit,
ValueFunction upperVInit,
double maxDiff,
int maxRollouts)
Initializes.

Modifier and Type  Method and Description 

protected double 
getGap(HashableState sh)
Returns the lower bound and upper bound value function margin/gap for the given state

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextState(State s,
Action a)
Selects a next state for expansion when action a is applied in state s.

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextStateByMaxMargin(State s,
Action a)
Selects a next state for expansion when action a is applied in state s according to the next possible state that has the largest lower and upper bound margin.

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextStateBySampling(State s,
Action a)
Selects a next state for expansion when action a is applied in state s by randomly sampling from the transition dynamics weighted by the margin of the lower and
upper bound value functions.

int 
getNumberOfBellmanUpdates()
Returns the total number of Bellman updates across all planning

int 
getNumberOfSteps()
Returns the total number of planning steps that have been performed.

protected QValue 
maxQ(State s)
Returns the maximum Q-value entry for the given state, with ties broken randomly.

GreedyQPolicy 
planFromState(State initialState)
Plans from the input state and then returns a
GreedyQPolicy that greedily
selects the action with the highest Q-value and breaks ties uniformly randomly. 
double 
runRollout(State s)
Runs a planning rollout from the provided state.

void 
setDefaultValueFunctionAfterARollout(boolean useLowerBound)
Use this method to set which value function (the lower bound or the upper bound) to use after a planning rollout is complete.

void 
setMaxDifference(double maxDiff)
Sets the max permitted difference in value function margin to permit planning termination.

void 
setMaxNumberOfRollouts(int numRollouts)
Sets the maximum number of rollouts permitted before planning is forced to terminate.

void 
setMaxRolloutDepth(int maxDepth)
Sets the maximum rollout depth of any rollout.

void 
setOperator(DPOperator operator)
Sets the dynamic programming operator to use.

void 
setRunRolloutsInRevere(boolean runRolloutsInRevers)
Sets whether each rollout should be run in reverse after completion.

void 
setStateSelectionMode(BoundedRTDP.StateSelectionMode selectionMode)
Sets the state selection mode used when choosing next states to expand.

void 
setValueFunctionToLowerBound()
Sets the value function to use to be the lower bound.

void 
setValueFunctionToUpperBound()
Sets the value function to use to be the upper bound.

computeQ, DPPInit, getAllStates, getCopyOfValueFunction, getDefaultValue, getModel, getOperator, getValueFunctionInitialization, hasComputedValueFor, loadValueTable, performBellmanUpdateOn, performBellmanUpdateOn, performFixedPolicyBellmanUpdateOn, performFixedPolicyBellmanUpdateOn, qValue, qValues, resetSolver, setValueFunctionInitialization, value, value, writeValueTable
addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, resetSolver, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected java.util.Map<HashableState,java.lang.Double> lowerBoundV
The lower bound value function

protected java.util.Map<HashableState,java.lang.Double> upperBoundV
The upper bound value function

protected ValueFunction lowerVInit
The lower bound value function initialization

protected ValueFunction upperVInit
The upper bound value function initialization

protected int maxRollouts
The maximum number of rollouts to perform when planning is started, unless the value function margin becomes small enough sooner.

protected double maxDiff
The max permitted difference between the lower bound and upper bound for planning termination.

protected int maxDepth
The maximum depth/length of a rollout before it is terminated and Bellman updates are performed.
protected boolean currentValueFunctionIsLower
Whether the current DynamicProgramming valueFunction reference points to the lower bound value function or the upper bound value function. If true, then it points to the lower bound; if false, then to the upper bound.

protected boolean defaultToLowerValueAfterPlanning
Sets what the DynamicProgramming valueFunction reference points to (the lower bound or the upper bound) once a planning rollout is complete. If true, then it points to the lower bound; if false, then the upper bound. Pointing to the lower bound is the default and provides anytime planning performance.

protected BoundedRTDP.StateSelectionMode selectionMode
Which state selection mode is used. See the BoundedRTDP.StateSelectionMode documentation for more information on the modes. The default is MODELBASED.

protected int numBellmanUpdates
Keeps track of the number of Bellman updates that have been performed across all planning.

protected int numSteps
Keeps track of the number of rollout steps that have been performed across all planning rollouts.

protected boolean runRolloutsInReverse
Whether each rollout should be run in reverse after completion.
public BoundedRTDP(SADomain domain, double gamma, HashableStateFactory hashingFactory, ValueFunction lowerVInit, ValueFunction upperVInit, double maxDiff, int maxRollouts)
Parameters:
domain - the domain in which to plan
gamma - the discount factor
hashingFactory - the state hashing factory to use
lowerVInit - the value function lower bound initialization
upperVInit - the value function upper bound initialization
maxDiff - the max permitted difference in value function margin to permit planning termination. This value is also used to prematurely stop a rollout if the next state's margin is under this value.
maxRollouts - the maximum number of rollouts permitted before planning is forced to terminate. If set to -1, then there is no limit.
public void setOperator(DPOperator operator)
Sets the dynamic programming operator to use. The default is BellmanOperator (max).
Overrides:
setOperator in class DynamicProgramming
Parameters:
operator - the dynamic programming operator to use.

public void setMaxNumberOfRollouts(int numRollouts)
Parameters:
numRollouts - the maximum number of rollouts permitted before planning is forced to terminate. If set to -1, then there is no limit.

public void setMaxRolloutDepth(int maxDepth)
Parameters:
maxDepth - the maximum rollout depth of any rollout. If set to -1, then there is no limit on rollout depth.

public void setMaxDifference(double maxDiff)
Parameters:
maxDiff - the max permitted difference in value function margin to permit planning termination.

public void setStateSelectionMode(BoundedRTDP.StateSelectionMode selectionMode)
Sets the state selection mode used when choosing next states to expand. See the BoundedRTDP.StateSelectionMode documentation for more information on the modes.
Parameters:
selectionMode - the state selection mode to use.
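For example, to switch to the margin-weighted sampling from the original paper (a sketch; "planner" is assumed constructed as in the earlier example, and WEIGHTEDMARGIN is assumed to be the corresponding BoundedRTDP.StateSelectionMode constant):

    // Replace the default (MODELBASED) unweighted sampling with the
    // paper's margin-weighted next-state sampling.
    planner.setStateSelectionMode(BoundedRTDP.StateSelectionMode.WEIGHTEDMARGIN);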
public void setDefaultValueFunctionAfterARollout(boolean useLowerBound)
Use this method to set which value function (the lower bound or the upper bound) the DynamicProgramming.value(State), DynamicProgramming.qValues(State), and DynamicProgramming.qValue(State, Action) methods return results from after a planning rollout is complete. Using the lower bound results in anytime performance.
Parameters:
useLowerBound - if true, then the value function is set to use the lower bound after planning. If false, then the upper bound is used.

public void setRunRolloutsInRevere(boolean runRolloutsInRevers)
Parameters:
runRolloutsInRevers - if true, then rollouts will be run in reverse. If false, then they will not be run in reverse.

public GreedyQPolicy planFromState(State initialState)
Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
Specified by:
planFromState in interface Planner
Parameters:
initialState - the initial state of the planning problem
Returns:
a GreedyQPolicy.
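A sketch of consuming the returned policy by rolling it out in the planner's model (PolicyUtils.rollout and the 100-step bound are assumptions; any policy-evaluation route works):

    // Evaluate the greedy policy for up to 100 steps
    // ("planner" and "initialState" as in the earlier example).
    GreedyQPolicy policy = planner.planFromState(initialState);
    Episode episode = PolicyUtils.rollout(policy, initialState, planner.getModel(), 100);
    System.out.println("Steps taken: " + episode.numTimeSteps());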
public void setValueFunctionToUpperBound()
Sets the value function to use to be the upper bound.
public void setValueFunctionToLowerBound()
Sets the value function to use to be the lower bound.

public int getNumberOfBellmanUpdates()
Returns the total number of Bellman updates across all planning.

public int getNumberOfSteps()
Returns the total number of planning steps that have been performed.

public double runRollout(State s)
Runs a planning rollout from the provided state.
Parameters:
s - the initial state from which a planning rollout should be performed.
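runRollout can also be driven manually for anytime use, instead of calling planFromState (a sketch; "planner" and "initialState" as in the earlier example):

    // Spend a fixed budget of rollouts, then act greedily on whatever
    // bounds have been computed so far.
    for(int i = 0; i < 100; i++){
        planner.runRollout(initialState);
    }
    System.out.println("Bellman updates so far: " + planner.getNumberOfBellmanUpdates());
    GreedyQPolicy anytimePolicy = new GreedyQPolicy(planner);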
protected BoundedRTDP.StateSelectionAndExpectedGap getNextState(State s, Action a)
Selects a next state for expansion when action a is applied in state s.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected BoundedRTDP.StateSelectionAndExpectedGap getNextStateByMaxMargin(State s, Action a)
Selects a next state for expansion when action a is applied in state s, according to the next possible state that has the largest lower and upper bound margin.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected BoundedRTDP.StateSelectionAndExpectedGap getNextStateBySampling(State s, Action a)
Selects a next state for expansion when action a is applied in state s by randomly sampling from the transition dynamics, weighted by the margin of the lower and upper bound value functions.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected double getGap(HashableState sh)
Returns the lower bound and upper bound value function margin/gap for the given state.
Parameters:
sh - the state whose margin should be returned.
Returns:
the value function margin/gap for the given state
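Conceptually, the gap is just the difference between the two bound maps; a minimal sketch of the idea (the fallback to the bound initializers for unvisited states is an assumption, not the class's exact code):

    // Margin/gap for a hashed state sh: upper bound minus lower bound,
    // defaulting to the initializers when the state has not been touched.
    double upper = upperBoundV.containsKey(sh) ? upperBoundV.get(sh) : upperVInit.value(sh.s());
    double lower = lowerBoundV.containsKey(sh) ? lowerBoundV.get(sh) : lowerVInit.value(sh.s());
    double gap = upper - lower;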