public class BoundedRTDP extends DynamicProgramming implements Planner
An implementation of Bounded RTDP [1]. Bounded RTDP is similar to standard RTDP [2], with the main difference
being that both an upper bound and a lower bound value function are computed. Like RTDP, the upper bound is used to guide planning rollout
exploration, but when planning rollouts are complete, the lower bound value function is used to direct behavior. Using the lower
bound provides significantly better anytime planning performance that does not require convergence to get reasonable results.
Another difference in Bounded RTDP is that the value of each state in a rollout is updated in reverse after the rollout completes,
which has the effect of propagating goal state values back to the beginning (though this can be disabled in this implementation using
the setRunRolloutsInRevere(boolean) method).
Finally, after action selection, the next outcome state from which the rollout continues may be selected in a number of ways.
The way presented in the original paper is to select the next state randomly according to the transition dynamics, weighted
by the margin between the upper bound and lower bound of the possible next states, which promotes exploration toward states that are uncertain.
However, in practice, we found that the standard unweighted sampling approach of RTDP can work better and is more efficient; therefore,
the default in this implementation is to use the unweighted transition dynamics, but it can be changed to the approach presented in the paper using
the method setStateSelectionMode(StateSelectionMode). Another optional state selection mode is to always choose the next state
with the highest uncertainty, but this tends to be even slower because it is overly conservative, so it is not recommended in general.
See the BoundedRTDP.StateSelectionMode
documentation for more information.
1. McMahan, H. Brendan, Maxim Likhachev, and Geoffrey J. Gordon. "Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.
2. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
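For orientation, here is a minimal usage sketch. The grid world setup, discount factor, bound initializations, and rollout budget are illustrative assumptions, not prescriptions; ConstantValueFunction and SimpleHashableStateFactory are standard BURLAP helpers.

    import burlap.behavior.policy.GreedyQPolicy;
    import burlap.behavior.singleagent.planning.stochastic.rtdp.BoundedRTDP;
    import burlap.behavior.valuefunction.ConstantValueFunction;
    import burlap.domain.singleagent.gridworld.GridWorldDomain;
    import burlap.domain.singleagent.gridworld.GridWorldTerminalFunction;
    import burlap.domain.singleagent.gridworld.state.GridAgent;
    import burlap.domain.singleagent.gridworld.state.GridWorldState;
    import burlap.mdp.core.state.State;
    import burlap.mdp.singleagent.SADomain;
    import burlap.statehashing.simple.SimpleHashableStateFactory;

    public class BoundedRTDPExample {
        public static void main(String[] args) {
            // Illustrative domain: BURLAP's 11x11 four-rooms grid world
            // with a terminal goal cell at (10, 10).
            GridWorldDomain gwd = new GridWorldDomain(11, 11);
            gwd.setMapToFourRooms();
            gwd.setTf(new GridWorldTerminalFunction(10, 10));
            SADomain domain = gwd.generateDomain();

            // Assumed reward scale in [0, 1]: initialize the lower bound
            // pessimistically (0) and the upper bound optimistically (1).
            BoundedRTDP planner = new BoundedRTDP(
                    domain,
                    0.99,                               // gamma
                    new SimpleHashableStateFactory(),
                    new ConstantValueFunction(0.),      // lowerVInit
                    new ConstantValueFunction(1.),      // upperVInit
                    0.01,                               // maxDiff
                    500);                               // maxRollouts

            State initialState = new GridWorldState(new GridAgent(0, 0));
            GreedyQPolicy policy = planner.planFromState(initialState);

            // After planning, value queries go through the lower bound by default.
            System.out.println("V(s0) = " + planner.value(initialState));
        }
    }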
Modifier and Type  Class and Description 

protected static class 
BoundedRTDP.StateSelectionAndExpectedGap
A tuple class for a hashed state and the expected value function margin/gap of the source transition.

static class 
BoundedRTDP.StateSelectionMode
The different ways that states can be selected for expansion.

QProvider.Helper
Modifier and Type  Field and Description 

protected boolean 
currentValueFunctionIsLower
Whether the current
DynamicProgramming valueFunction reference points to the lower bound value function or the upper bound value function. 
protected boolean 
defaultToLowerValueAfterPlanning
Sets what the
DynamicProgramming valueFunction reference points to (the lower bound or the upper bound) once a planning rollout is complete. 
protected java.util.Map<HashableState,java.lang.Double> 
lowerBoundV
The lower bound value function

protected ValueFunction 
lowerVInit
The lower bound value function initialization

protected int 
maxDepth
The maximum depth/length of a rollout before it is terminated and Bellman updates are performed.

protected double 
maxDiff
The max permitted difference between the lower bound and upper bound for planning termination.

protected int 
maxRollouts
The maximum number of rollouts to perform when planning is started, unless the value function margin becomes small enough sooner.

protected int 
numBellmanUpdates
Keeps track of the number of Bellman updates that have been performed across all planning.

protected int 
numSteps
Keeps track of the number of rollout steps that have been performed across all planning rollouts.

protected boolean 
runRolloutsInReverse
Whether each rollout should be run in reverse after completion.

protected BoundedRTDP.StateSelectionMode 
selectionMode
Which state selection mode is used.

protected java.util.Map<HashableState,java.lang.Double> 
upperBoundV
The upper bound value function

protected ValueFunction 
upperVInit
The upper bound value function initialization

operator, valueFunction, valueInitializer
actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description 

BoundedRTDP(SADomain domain,
double gamma,
HashableStateFactory hashingFactory,
ValueFunction lowerVInit,
ValueFunction upperVInit,
double maxDiff,
int maxRollouts)
Initializes.

Modifier and Type  Method and Description 

protected double 
getGap(HashableState sh)
Returns the lower bound and upper bound value function margin/gap for the given state

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextState(State s,
Action a)
Selects a next state for expansion when action a is applied in state s.

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextStateByMaxMargin(State s,
Action a)
Selects a next state for expansion when action a is applied in state s according to the next possible state that has the largest lower and upper bound margin.

protected BoundedRTDP.StateSelectionAndExpectedGap 
getNextStateBySampling(State s,
Action a)
Selects a next state for expansion when action a is applied in state s by randomly sampling from the transition dynamics weighted by the margin of the lower and
upper bound value functions.

int 
getNumberOfBellmanUpdates()
Returns the total number of Bellman updates across all planning

int 
getNumberOfSteps()
Returns the total number of planning steps that have been performed.

protected QValue 
maxQ(State s)
Returns the maximum Q-value entry for the given state, with ties broken randomly.

GreedyQPolicy 
planFromState(State initialState)
Plans from the input state and then returns a
GreedyQPolicy that greedily
selects the action with the highest Q-value and breaks ties uniformly randomly. 
double 
runRollout(State s)
Runs a planning rollout from the provided state.

void 
setDefaultValueFunctionAfterARollout(boolean useLowerBound)
Use this method to set which value function (the lower bound or the upper bound) to use after a planning rollout is complete.

void 
setMaxDifference(double maxDiff)
Sets the max permitted difference in value function margin to permit planning termination.

void 
setMaxNumberOfRollouts(int numRollouts)
Sets the maximum number of rollouts permitted before planning is forced to terminate.

void 
setMaxRolloutDepth(int maxDepth)
Sets the maximum rollout depth of any rollout.

void 
setOperator(DPOperator operator)
Sets the dynamic programming operator to use.

void 
setRunRolloutsInRevere(boolean runRolloutsInRevers)
Sets whether each rollout should be run in reverse after completion.

void 
setStateSelectionMode(BoundedRTDP.StateSelectionMode selectionMode)
Sets the state selection mode used when choosing next states to expand.

void 
setValueFunctionToLowerBound()
Sets the value function to use to be the lower bound.

void 
setValueFunctionToUpperBound()
Sets the value function to use to be the upper bound.

computeQ, DPPInit, getAllStates, getCopyOfValueFunction, getDefaultValue, getModel, getOperator, getValueFunctionInitialization, hasComputedValueFor, loadValueTable, performBellmanUpdateOn, performBellmanUpdateOn, performFixedPolicyBellmanUpdateOn, performFixedPolicyBellmanUpdateOn, qValue, qValues, resetSolver, setValueFunctionInitialization, value, value, writeValueTable
addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, resetSolver, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected java.util.Map<HashableState,java.lang.Double> lowerBoundV
The lower bound value function

protected java.util.Map<HashableState,java.lang.Double> upperBoundV
The upper bound value function

protected ValueFunction lowerVInit
The lower bound value function initialization

protected ValueFunction upperVInit
The upper bound value function initialization

protected int maxRollouts
The maximum number of rollouts to perform when planning is started, unless the value function margin becomes small enough sooner.

protected double maxDiff
The max permitted difference between the lower bound and upper bound for planning termination.

protected int maxDepth
The maximum depth/length of a rollout before it is terminated and Bellman updates are performed.
protected boolean currentValueFunctionIsLower
Whether the current DynamicProgramming valueFunction reference points to the lower bound value function or the upper bound value function. If true, then it points to the lower bound; if false, then to the upper bound.

protected boolean defaultToLowerValueAfterPlanning
Sets what the DynamicProgramming valueFunction reference points to (the lower bound or the upper bound) once a planning rollout is complete. If true, then it points to the lower bound; if false, then the upper bound. Pointing to the lower bound is the default and provides anytime planning performance.

protected BoundedRTDP.StateSelectionMode selectionMode
Which state selection mode is used. See the BoundedRTDP.StateSelectionMode documentation for more information on the modes. The default is MODELBASED.

protected int numBellmanUpdates
Keeps track of the number of Bellman updates that have been performed across all planning.

protected int numSteps
Keeps track of the number of rollout steps that have been performed across all planning rollouts.

protected boolean runRolloutsInReverse
Whether each rollout should be run in reverse after completion.
public BoundedRTDP(SADomain domain, double gamma, HashableStateFactory hashingFactory, ValueFunction lowerVInit, ValueFunction upperVInit, double maxDiff, int maxRollouts)
Parameters:
domain - the domain in which to plan
gamma - the discount factor
hashingFactory - the state hashing factory to use
lowerVInit - the value function lower bound initialization
upperVInit - the value function upper bound initialization
maxDiff - the max permitted difference in value function margin to permit planning termination. This value is also used to prematurely stop a rollout if the next state's margin is under this value.
maxRollouts - the maximum number of rollouts permitted before planning is forced to terminate. If set to -1, then there is no limit.
public void setOperator(DPOperator operator)
Sets the dynamic programming operator to use. The default is BellmanOperator (max).
Overrides:
setOperator in class DynamicProgramming
Parameters:
operator - the dynamic programming operator to use.

public void setMaxNumberOfRollouts(int numRollouts)
Parameters:
numRollouts - the maximum number of rollouts permitted before planning is forced to terminate. If set to -1, then there is no limit.

public void setMaxRolloutDepth(int maxDepth)
Parameters:
maxDepth - the maximum rollout depth of any rollout. If set to -1, then there is no limit on rollout depth.

public void setMaxDifference(double maxDiff)
Parameters:
maxDiff - the max permitted difference in value function margin to permit planning termination.

public void setStateSelectionMode(BoundedRTDP.StateSelectionMode selectionMode)
Sets the state selection mode used when choosing next states to expand. See the BoundedRTDP.StateSelectionMode documentation for more information on the modes.
Parameters:
selectionMode - the state selection mode to use.
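For example, to switch to the margin-weighted sampling from the original paper (a sketch; "planner" is assumed constructed as in the earlier example, and WEIGHTEDMARGIN is assumed to be the corresponding BoundedRTDP.StateSelectionMode constant):

    // Replace the default (MODELBASED) unweighted sampling with the
    // paper's margin-weighted next-state sampling.
    planner.setStateSelectionMode(BoundedRTDP.StateSelectionMode.WEIGHTEDMARGIN);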
public void setDefaultValueFunctionAfterARollout(boolean useLowerBound)
Use this method to set which value function (the lower bound or the upper bound) the DynamicProgramming.value(State), DynamicProgramming.qValues(State), and DynamicProgramming.qValue(State, Action) methods return results from after a planning rollout is complete. Using the lower bound results in anytime performance.
Parameters:
useLowerBound - if true, then the value function is set to use the lower bound after planning. If false, then the upper bound is used.

public void setRunRolloutsInRevere(boolean runRolloutsInRevers)
Parameters:
runRolloutsInRevers - if true, then rollouts will be run in reverse. If false, then they will not be run in reverse.

public GreedyQPolicy planFromState(State initialState)
Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
Specified by:
planFromState in interface Planner
Parameters:
initialState - the initial state of the planning problem
Returns:
a GreedyQPolicy.
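A sketch of consuming the returned policy by rolling it out in the planner's model (PolicyUtils.rollout and the 100-step bound are assumptions; any policy-evaluation route works):

    // Evaluate the greedy policy for up to 100 steps
    // ("planner" and "initialState" as in the earlier example).
    GreedyQPolicy policy = planner.planFromState(initialState);
    Episode episode = PolicyUtils.rollout(policy, initialState, planner.getModel(), 100);
    System.out.println("Steps taken: " + episode.numTimeSteps());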
public void setValueFunctionToUpperBound()
Sets the value function to use to be the upper bound.
public void setValueFunctionToLowerBound()
Sets the value function to use to be the lower bound.

public int getNumberOfBellmanUpdates()
Returns the total number of Bellman updates across all planning.

public int getNumberOfSteps()
Returns the total number of planning steps that have been performed.

public double runRollout(State s)
Runs a planning rollout from the provided state.
Parameters:
s - the initial state from which a planning rollout should be performed.
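runRollout can also be driven manually for anytime use, instead of calling planFromState (a sketch; "planner" and "initialState" as in the earlier example):

    // Spend a fixed budget of rollouts, then act greedily on whatever
    // bounds have been computed so far.
    for(int i = 0; i < 100; i++){
        planner.runRollout(initialState);
    }
    System.out.println("Bellman updates so far: " + planner.getNumberOfBellmanUpdates());
    GreedyQPolicy anytimePolicy = new GreedyQPolicy(planner);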
protected BoundedRTDP.StateSelectionAndExpectedGap getNextState(State s, Action a)
Selects a next state for expansion when action a is applied in state s.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected BoundedRTDP.StateSelectionAndExpectedGap getNextStateByMaxMargin(State s, Action a)
Selects a next state for expansion when action a is applied in state s, according to the next possible state that has the largest lower and upper bound margin.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected BoundedRTDP.StateSelectionAndExpectedGap getNextStateBySampling(State s, Action a)
Selects a next state for expansion when action a is applied in state s by randomly sampling from the transition dynamics, weighted by the margin of the lower and upper bound value functions.
Parameters:
s - the source state of the transition
a - the action applied in the source state
Returns:
a BoundedRTDP.StateSelectionAndExpectedGap object holding the next state to be expanded and the expected margin size of this transition.

protected double getGap(HashableState sh)
Returns the lower bound and upper bound value function margin/gap for the given state.
Parameters:
sh - the state whose margin should be returned.
Returns:
the value function margin/gap for the given state
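Conceptually, the gap is just the difference between the two bound maps; a minimal sketch of the idea (the fallback to the bound initializers for unvisited states is an assumption, not the class's exact code):

    // Margin/gap for a hashed state sh: upper bound minus lower bound,
    // defaulting to the initializers when the state has not been touched.
    double upper = upperBoundV.containsKey(sh) ? upperBoundV.get(sh) : upperVInit.value(sh.s());
    double lower = lowerBoundV.containsKey(sh) ? lowerBoundV.get(sh) : lowerVInit.value(sh.s());
    double gap = upper - lower;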