public class BoundedRTDP extends DynamicProgramming implements Planner
An implementation of Bounded RTDP [1]. Bounded RTDP is similar to standard RTDP [2], with the main difference
being that both an upper bound and a lower bound value function are computed. Like RTDP, the upper bound is used to guide planning rollout
exploration, but when planning rollouts are complete, the lower bound value function is used to direct behavior. Using the lower
bound provides significantly better anytime planning performance that does not require convergence to get reasonable results.
Another difference in Bounded RTDP is that the value of each state in a rollout is updated in reverse after the rollout completes,
which has the effect of propagating goal state values back to the beginning (though this can be disabled in this implementation using
the setRunRolloutsInRevere(boolean) method).
Finally, after action selection, the next outcome state from which the rollout continues may be selected in a number of ways.
The way presented in the original paper is to select the next state randomly according to the transition dynamics, weighted
by the margin between the upper bound and lower bound of the possible next states, which promotes exploration toward states that are uncertain.
However, in practice, we found that the standard unweighted sampling approach of RTDP can work better and is more efficient; therefore,
the default in this implementation is to use the unweighted transition dynamics, but it can be changed to the approach presented in the paper using
the method setStateSelectionMode(StateSelectionMode). Another optional state selection mode is to always choose the next state
with the highest uncertainty, but this tends to be even slower because it is overly conservative, so it is not recommended in general.
See the BoundedRTDP.StateSelectionMode
documentation for more information.
1. McMahan, H. Brendan, Maxim Likhachev, and Geoffrey J. Gordon. "Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
2. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
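For orientation, here is a minimal usage sketch. The grid world setup, discount factor, bound initializations, and rollout budget are illustrative assumptions, not prescriptions; ConstantValueFunction and SimpleHashableStateFactory are standard BURLAP helpers.

    import burlap.behavior.policy.GreedyQPolicy;
    import burlap.behavior.singleagent.planning.stochastic.rtdp.BoundedRTDP;
    import burlap.behavior.valuefunction.ConstantValueFunction;
    import burlap.domain.singleagent.gridworld.GridWorldDomain;
    import burlap.domain.singleagent.gridworld.GridWorldTerminalFunction;
    import burlap.domain.singleagent.gridworld.state.GridAgent;
    import burlap.domain.singleagent.gridworld.state.GridWorldState;
    import burlap.mdp.core.state.State;
    import burlap.mdp.singleagent.SADomain;
    import burlap.statehashing.simple.SimpleHashableStateFactory;

    public class BoundedRTDPExample {
        public static void main(String[] args) {
            // Illustrative domain: BURLAP's 11x11 four-rooms grid world
            // with a terminal goal cell at (10, 10).
            GridWorldDomain gwd = new GridWorldDomain(11, 11);
            gwd.setMapToFourRooms();
            gwd.setTf(new GridWorldTerminalFunction(10, 10));
            SADomain domain = gwd.generateDomain();

            // Assumed reward scale in [0, 1]: initialize the lower bound
            // pessimistically (0) and the upper bound optimistically (1).
            BoundedRTDP planner = new BoundedRTDP(
                    domain,
                    0.99,                               // gamma
                    new SimpleHashableStateFactory(),
                    new ConstantValueFunction(0.),      // lowerVInit
                    new ConstantValueFunction(1.),      // upperVInit
                    0.01,                               // maxDiff
                    500);                               // maxRollouts

            State initialState = new GridWorldState(new GridAgent(0, 0));
            GreedyQPolicy policy = planner.planFromState(initialState);

            // After planning, value queries go through the lower bound by default.
            System.out.println("V(s0) = " + planner.value(initialState));
        }
    }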
Modifier and Type  Class and Description 

protected static class 
BoundedRTDP.StateSelectionAndExpectedGap
A tuple class for a hashed state and the expected value function margin/gap of the source transition.

static class 
BoundedRTDP.StateSelectionMode
The different ways that states can be selected for expansion.

QProvider.Helper
Modifier and Type  Field and Description 

protected boolean 
currentValueFunctionIsLower
Whether the current
DynamicProgramming valueFunction reference points to the lower bound value function or the upper bound value function. 
protected boolean 
defaultToLowerValueAfterPlanning
Sets what the
DynamicProgramming valueFunction reference points to (the lower bound or the upper bound) once a planning rollout is complete. 
protected java.util.Map<HashableState,java.lang.Double> 
lowerBoundV
The lower bound value function

protected ValueFunction 
lowerVInit
The lower bound value function initialization

protected int 
maxDepth
The maximum depth/length of a rollout before it is terminated and Bellman updates are performed.

protected double 
maxDiff
The max permitted difference between the lower bound and upper bound for planning termination.

protected int 
maxRollouts
The maximum number of rollouts to perform when planning is started, unless the value function margin becomes small enough sooner.

protected int 
numBellmanUpdates
Keeps track of the number of Bellman updates that have been performed across all planning.

protected int 
numSteps
Keeps track of the number of rollout steps that have been performed across all planning rollouts.

protected boolean 
runRolloutsInReverse
Whether each rollout should be run in reverse after completion.

protected BoundedRTDP.StateSelectionMode 
selectionMode
Which state selection mode is used.

protected java.util.Map<HashableState,java.lang.Double> 
upperBoundV
The upper bound value function

protected ValueFunction 
upperVInit
The upper bound value function initialization

operator, valueFunction, valueInitializer
actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description 

BoundedRTDP(SADomain domain,
double gamma,
HashableStateFactory hashingFactory,
ValueFunction lowerVInit,
ValueFunction upperVInit,
double maxDiff,
int maxRollouts)
Initializes.

Modifier and Type  Method and Description 

protected double 
getGap(HashableState sh)
Returns the lower bound and upper bound value function margin/gap for the given state

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextState(State s,
Action a)
Selects a next state for expansion when action a is applied in state s.

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextStateByMaxMargin(State s,
Action a)
Selects a next state for expansion when action a is applied in state s according to the next possible state that has the largest lower and upper bound margin.

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextStateBySampling(State s,
Action a)
Selects a next state for expansion when action a is applied in state s by randomly sampling from the transition dynamics weighted by the margin of the lower and
upper bound value functions.

int 
getNumberOfBellmanUpdates()
Returns the total number of Bellman updates across all planning

int 
getNumberOfSteps()
Returns the total number of planning steps that have been performed.

protected QValue 
maxQ(State s)
Returns the maximum Q-value entry for the given state, with ties broken randomly.

GreedyQPolicy 
planFromState(State initialState)
Plans from the input state and then returns a
GreedyQPolicy that greedily
selects the action with the highest Q-value and breaks ties uniformly randomly. 
double 
runRollout(State s)
Runs a planning rollout from the provided state.

void 
setDefaultValueFunctionAfterARollout(boolean useLowerBound)
Use this method to set which value function (the lower bound or the upper bound) to use after a planning rollout is complete.

void 
setMaxDifference(double maxDiff)
Sets the max permitted difference in value function margin to permit planning termination.

void 
setMaxNumberOfRollouts(int numRollouts)
Sets the maximum number of rollouts permitted before planning is forced to terminate.

void 
setMaxRolloutDepth(int maxDepth)
Sets the maximum rollout depth of any rollout.

void 
setOperator(DPOperator operator)
Sets the dynamic programming operator to use.

void 
setRunRolloutsInRevere(boolean runRolloutsInRevers)
Sets whether each rollout should be run in reverse after completion.

void 
setStateSelectionMode(BoundedRTDP.StateSelectionMode selectionMode)
Sets the state selection mode used when choosing next states to expand.

void 
setValueFunctionToLowerBound()
Sets the value function to use to be the lower bound.

void 
setValueFunctionToUpperBound()
Sets the value function to use to be the upper bound.

computeQ, DPPInit, getAllStates, getCopyOfValueFunction, getDefaultValue, getModel, getOperator, getValueFunctionInitialization, hasComputedValueFor, loadValueTable, performBellmanUpdateOn, performBellmanUpdateOn, performFixedPolicyBellmanUpdateOn, performFixedPolicyBellmanUpdateOn, qValue, qValues, resetSolver, setValueFunctionInitialization, value, value, writeValueTable
addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, resetSolver, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected java.util.Map<HashableState,java.lang.Double> lowerBoundV
The lower bound value function

protected java.util.Map<HashableState,java.lang.Double> upperBoundV
The upper bound value function

protected ValueFunction lowerVInit
The lower bound value function initialization

protected ValueFunction upperVInit
The upper bound value function initialization

protected int maxRollouts
The maximum number of rollouts to perform when planning is started, unless the value function margin becomes small enough sooner.

protected double maxDiff
The max permitted difference between the lower bound and upper bound for planning termination.

protected int maxDepth
The maximum depth/length of a rollout before it is terminated and Bellman updates are performed.
protected boolean currentValueFunctionIsLower
Whether the current DynamicProgramming valueFunction reference points to the lower bound value function or the upper bound value function. If true, then it points to the lower bound; if false, then to the upper bound.

protected boolean defaultToLowerValueAfterPlanning
Sets what the DynamicProgramming valueFunction reference points to (the lower bound or the upper bound) once a planning rollout is complete. If true, then it points to the lower bound; if false, then the upper bound. Pointing to the lower bound is the default and provides anytime planning performance.

protected BoundedRTDP.StateSelectionMode selectionMode
Which state selection mode is used. See the BoundedRTDP.StateSelectionMode documentation for more information on the modes. The default is MODELBASED.

protected int numBellmanUpdates
Keeps track of the number of Bellman updates that have been performed across all planning.

protected int numSteps
Keeps track of the number of rollout steps that have been performed across all planning rollouts.

protected boolean runRolloutsInReverse
Whether each rollout should be run in reverse after completion.
public BoundedRTDP(SADomain domain, double gamma, HashableStateFactory hashingFactory, ValueFunction lowerVInit, ValueFunction upperVInit, double maxDiff, int maxRollouts)
Parameters:
domain - the domain in which to plan
gamma - the discount factor
hashingFactory - the state hashing factory to use
lowerVInit - the value function lower bound initialization
upperVInit - the value function upper bound initialization
maxDiff - the max permitted difference in value function margin to permit planning termination. This value is also used to prematurely stop a rollout if the next state's margin is under this value.
maxRollouts - the maximum number of rollouts permitted before planning is forced to terminate. If set to -1, then there is no limit.
public void setOperator(DPOperator operator)
Sets the dynamic programming operator to use. The default is BellmanOperator (max).
Overrides:
setOperator in class DynamicProgramming
Parameters:
operator - the dynamic programming operator to use.

public void setMaxNumberOfRollouts(int numRollouts)
Parameters:
numRollouts - the maximum number of rollouts permitted before planning is forced to terminate. If set to -1, then there is no limit.

public void setMaxRolloutDepth(int maxDepth)
Parameters:
maxDepth - the maximum rollout depth of any rollout. If set to -1, then there is no limit on rollout depth.

public void setMaxDifference(double maxDiff)
Parameters:
maxDiff - the max permitted difference in value function margin to permit planning termination.

public void setStateSelectionMode(BoundedRTDP.StateSelectionMode selectionMode)
Sets the state selection mode used when choosing next states to expand. See the BoundedRTDP.StateSelectionMode documentation for more information on the modes.
Parameters:
selectionMode - the state selection mode to use.
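For example, to switch to the margin-weighted sampling from the original paper (a sketch; "planner" is assumed constructed as in the earlier example, and WEIGHTEDMARGIN is assumed to be the corresponding BoundedRTDP.StateSelectionMode constant):

    // Replace the default (MODELBASED) unweighted sampling with the
    // paper's margin-weighted next-state sampling.
    planner.setStateSelectionMode(BoundedRTDP.StateSelectionMode.WEIGHTEDMARGIN);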
public void setDefaultValueFunctionAfterARollout(boolean useLowerBound)
Use this method to set which value function (the lower bound or the upper bound) the DynamicProgramming.value(State), DynamicProgramming.qValues(State), and DynamicProgramming.qValue(State, Action) methods return results from after a planning rollout is complete. Using the lower bound results in anytime performance.
Parameters:
useLowerBound - if true, then the value function is set to use the lower bound after planning. If false, then the upper bound is used.

public void setRunRolloutsInRevere(boolean runRolloutsInRevers)
Parameters:
runRolloutsInRevers - if true, then rollouts will be run in reverse. If false, then they will not be run in reverse.

public GreedyQPolicy planFromState(State initialState)
Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
Specified by:
planFromState in interface Planner
Parameters:
initialState - the initial state of the planning problem
Returns:
a GreedyQPolicy.
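A sketch of consuming the returned policy by rolling it out in the planner's model (PolicyUtils.rollout and the 100-step bound are assumptions; any policy-evaluation route works):

    // Evaluate the greedy policy for up to 100 steps
    // ("planner" and "initialState" as in the earlier example).
    GreedyQPolicy policy = planner.planFromState(initialState);
    Episode episode = PolicyUtils.rollout(policy, initialState, planner.getModel(), 100);
    System.out.println("Steps taken: " + episode.numTimeSteps());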
public void setValueFunctionToUpperBound()
Sets the value function to use to be the upper bound.
public void setValueFunctionToLowerBound()
Sets the value function to use to be the lower bound.

public int getNumberOfBellmanUpdates()
Returns the total number of Bellman updates across all planning.

public int getNumberOfSteps()
Returns the total number of planning steps that have been performed.

public double runRollout(State s)
Runs a planning rollout from the provided state.
Parameters:
s - the initial state from which a planning rollout should be performed.
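runRollout can also be driven manually for anytime use, instead of calling planFromState (a sketch; "planner" and "initialState" as in the earlier example):

    // Spend a fixed budget of rollouts, then act greedily on whatever
    // bounds have been computed so far.
    for(int i = 0; i < 100; i++){
        planner.runRollout(initialState);
    }
    System.out.println("Bellman updates so far: " + planner.getNumberOfBellmanUpdates());
    GreedyQPolicy anytimePolicy = new GreedyQPolicy(planner);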
protected BoundedRTDP.StateSelectionAndExpectedGap getNextState(State s, Action a)
Selects a next state for expansion when action a is applied in state s.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected BoundedRTDP.StateSelectionAndExpectedGap getNextStateByMaxMargin(State s, Action a)
Selects a next state for expansion when action a is applied in state s, according to the next possible state that has the largest lower and upper bound margin.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected BoundedRTDP.StateSelectionAndExpectedGap getNextStateBySampling(State s, Action a)
Selects a next state for expansion when action a is applied in state s by randomly sampling from the transition dynamics, weighted by the margin of the lower and upper bound value functions.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected double getGap(HashableState sh)
Returns the lower bound and upper bound value function margin/gap for the given state.
Parameters:
sh - the state whose margin should be returned.
Returns:
the value function margin/gap for the given state
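Conceptually, the gap is just the difference between the two bound maps; a minimal sketch of the idea (the fallback to the bound initializers for unvisited states is an assumption, not the class's exact code):

    // Margin/gap for a hashed state sh: upper bound minus lower bound,
    // defaulting to the initializers when the state has not been touched.
    double upper = upperBoundV.containsKey(sh) ? upperBoundV.get(sh) : upperVInit.value(sh.s());
    double lower = lowerBoundV.containsKey(sh) ? lowerBoundV.get(sh) : lowerVInit.value(sh.s());
    double gap = upper - lower;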