public abstract class Option extends Action implements FullActionModel
Action
class, they may be trivially
added to any planning or learning algorithm. Some planning and learning algorithms must
handle options specially; for instance, Q-learning needs to treat the return from options
specially. However, the current planning and learning algorithms all handle options in the
appropriate special ways, so Options may be used confidently with existing algorithms.
In order for correct value function returns from option executions to be determined,
options need to keep track of the cumulative reward and number of steps they've taken
since they began execution. This abstract class has data structures and code in place to automatically
handle that information so that any subclass of this Option class should "just work." When
an option is added to an MDPSolver
object
through the MDPSolver.addNonDomainReferencedAction(Action)
method, it will automatically tell the Option which reward function and discount factor it should be using
to keep track of the cumulative reward.
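The bookkeeping described above can be illustrated with a minimal, self-contained sketch. This is not BURLAP code; the class and method names below are hypothetical, chosen only to mirror fields such as cumulativeDiscount, lastCumulativeReward, and lastNumSteps documented later on this page:

```java
// Illustrative sketch (not BURLAP code) of the cumulative reward bookkeeping:
// each step's reward is scaled by the discount accumulated so far, and the
// running discount is then multiplied by the discount factor.
public class CumulativeRewardTracker {
    private final double discountFactor;
    private double cumulativeDiscount = 1.0; // discount applied to the next step's reward
    private double lastCumulativeReward = 0.0;
    private int lastNumSteps = 0;

    public CumulativeRewardTracker(double discountFactor) {
        this.discountFactor = discountFactor;
    }

    /** Record one step of option execution with the observed reward r. */
    public void recordStep(double r) {
        this.lastCumulativeReward += this.cumulativeDiscount * r;
        this.cumulativeDiscount *= this.discountFactor;
        this.lastNumSteps++;
    }

    public double getLastCumulativeReward() { return lastCumulativeReward; }
    public int getLastNumSteps() { return lastNumSteps; }

    public static void main(String[] args) {
        CumulativeRewardTracker tracker = new CumulativeRewardTracker(0.9);
        // three steps with reward -1 each: -1 - 0.9 - 0.81, i.e. about -2.71
        for (int i = 0; i < 3; i++) {
            tracker.recordStep(-1.0);
        }
        System.out.println(tracker.getLastCumulativeReward()); // approximately -2.71
        System.out.println(tracker.getLastNumSteps());
    }
}
```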
Note that value function planning algorithms that use the Bellman update (such as value iteration)
require the option to return not only the possible terminal states, but the expected number of
steps to those terminal states and the expected cumulative reward. By default, this
abstract Option class will compute those transition dynamics through a branching
exploration of the possible outcomes at each step of execution and save the results
so that they do not need to be computed again. If an option is stochastic or if
the underlying domain is stochastic, there may be an infinite number of possible outcomes.
As a result, the transition dynamics computation will stop expanding branches at a given
horizon once their probability of occurring falls below some small threshold (by default set to
0.001). This threshold may be modified. However, if these transition dynamics can be specified
a priori, it is recommended that the getTransitions(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method be overridden
and the dynamics specified by hand rather than requiring this class to enumerate the results. Finally,
note that the getTransitions(State, burlap.oomdp.singleagent.GroundedAction)
returns TransitionProbability
elements, where each TransitionProbability
holds the probability of transitioning to a state discounted
by the expected length of time. That is, the probability value in each TransitionProbability
is
\sum_k \gamma^k * p(s, s', k)
where p(s, s', k) is the
probability that the option will terminate in s' after being initiated in state s and taking k steps, gamma is the discount
factor and s' is the state associated with the probability value in the TransitionProbability
object.
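As a concrete worked example of this weighting (with hypothetical numbers, not taken from BURLAP): suppose an option initiated in s terminates in s' after 2 steps with probability 0.5 and after 3 steps with probability 0.5, and gamma = 0.9. The value stored for s' would be 0.9^2 * 0.5 + 0.9^3 * 0.5 = 0.405 + 0.3645 = 0.7695. A self-contained sketch of the computation:

```java
// Computes \sum_k \gamma^k * p(s, s', k) from parallel arrays of step counts
// and termination probabilities (illustrative numbers, not BURLAP code).
public class DiscountedTransitionWeight {
    public static double discountedWeight(double gamma, int[] steps, double[] probs) {
        double w = 0.0;
        for (int i = 0; i < steps.length; i++) {
            w += Math.pow(gamma, steps[i]) * probs[i];
        }
        return w;
    }

    public static void main(String[] args) {
        // terminate after 2 steps w.p. 0.5 and after 3 steps w.p. 0.5, gamma = 0.9
        double w = discountedWeight(0.9, new int[]{2, 3}, new double[]{0.5, 0.5});
        System.out.println(w); // approximately 0.7695
    }
}
```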
1. Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial Intelligence 112.1 (1999): 181-211.
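The execution protocol that the methods documented below support (initiateInState, then repeated calls to oneStep for as long as continueFromState returns true) can be sketched with a toy, self-contained analogue. GoRightOption and its integer "state" are hypothetical and not part of BURLAP:

```java
// A toy analogue of the option execution loop: the option selects one primitive
// action at a time ("move right") and the caller repeats until the option's
// deterministic termination condition (reaching the goal position) fires.
public class GoRightOption {
    private final int goal; // terminate when position reaches this value
    private int lastNumSteps;

    public GoRightOption(int goal) { this.goal = goal; }

    /** Reset per-execution bookkeeping, as initiateInState does in BURLAP. */
    public void initiateInState(int position) {
        this.lastNumSteps = 0;
    }

    /** Deterministic termination: continue only while short of the goal. */
    public boolean continueFromState(int position) {
        return position < this.goal;
    }

    /** One step of execution: apply the "move right" primitive. */
    public int oneStep(int position) {
        this.lastNumSteps++;
        return position + 1;
    }

    public int getLastNumSteps() { return lastNumSteps; }

    public static void main(String[] args) {
        GoRightOption option = new GoRightOption(3);
        int s = 0;
        option.initiateInState(s);
        while (option.continueFromState(s)) {
            s = option.oneStep(s);
        }
        System.out.println(s);                        // 3
        System.out.println(option.getLastNumSteps()); // 3
    }
}
```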
FullActionModel.FullActionModelHelper
Modifier and Type  Field and Description 

protected java.util.Map<HashableState,java.util.List<TransitionProbability>> 
cachedExpectations
The cached transition probabilities from each initiation state

protected java.util.Map<HashableState,java.lang.Double> 
cachedExpectedRewards
The cached expected reward from each initiation state

protected double 
cumulativeDiscount
How much to discount the reward in the next option step

protected double 
discountFactor
discount factor of the MDP in which this option will be applied

protected double 
expectationSearchCutoffProb
The minimum probability of a possible terminal state being reached for it to be included in the computed transition dynamics

protected HashableStateFactory 
expectationStateHashingFactory
State hash factory used to cache the transition probabilities so that they only need to be computed once for each state

protected TerminalFunction 
externalTerminalFunction
the terminal function of the MDP in which this option is to be executed.

protected boolean 
keepTrackOfReward
boolean indicating whether the cumulative reward during execution should be recorded

protected double 
lastCumulativeReward
the cumulative reward received during the last execution of this option

protected int 
lastNumSteps
How many steps were taken in the option's last execution

protected EpisodeAnalysis 
lastOptionExecutionResults
Stores the last execution results of an option from the initiation state to the state in which it terminated

protected java.util.Random 
rand
Random object for following stochastic option policies

protected RewardFunction 
rf
reward function for keeping track of the cumulative reward during an execution

protected boolean 
shouldAnnotateExecution
Boolean indicating whether the last option execution recording annotates the selected actions with this option's name

protected boolean 
shouldRecordResults
Boolean indicating whether the last option execution result should be saved

protected StateMapping 
stateMapping
An option state mapping to use to map from a source MDP state representation to a representation that this option will use
for action selection.

protected DirectOptionTerminateMapper 
terminateMapper
An optional mapping from initiation states to terminal states so that the execution of an option does not need to be simulated.

actionObservers, domain, name
Constructor and Description 

Option()
Initializes an option without a name and parameters.

Option(java.lang.String name,
Domain domain)
Initializes.

Modifier and Type  Method and Description 

protected void 
accumulateDiscountedProb(java.util.Map<HashableState,java.lang.Double> possibleTerminations,
State s,
double p)
Adds to the expected discounted probability of reaching a state, given a value p, where p = \gamma^k * p(s, s', k),
s' is a possible terminal state, and k is a number of steps not yet added to the sum over all possible step counts
to s'.

boolean 
continueFromState(State s,
GroundedAction groundedAction)
This method will use this option's termination probability, roll the dice, and
return whether the option should continue or terminate.

abstract java.util.List<Policy.ActionProb> 
getActionDistributionForState(State s,
GroundedAction groundedAction)
Returns the option's policy distribution for a given state.

protected java.util.List<Policy.ActionProb> 
getDeterministicPolicy(State s,
GroundedAction groundedAction)
This method creates a deterministic action selection probability distribution where the deterministic action
to be selected with probability 1 is the one returned by the method
oneStepActionSelection(State, burlap.oomdp.singleagent.GroundedAction) . 
double 
getExpectedRewards(State s,
GroundedAction groundedAction)
Returns the expected reward to be received from initiating this option from state s.

double 
getLastCumulativeReward()
Returns the cumulative discounted reward received in last execution of this option.

EpisodeAnalysis 
getLastExecutionResults()
Returns the events from this option's last execution

int 
getLastNumSteps()
Returns the number of steps taken in the last execution of this option.

java.util.List<TransitionProbability> 
getTransitions(State st,
GroundedAction groundedAction)
Returns the transition probabilities for applying this action in the given state with the given set of parameters.

void 
initiateInState(State s,
GroundedAction groundedAction)
Tells the option that it is being initiated in the given state with the given parameters.

abstract void 
initiateInStateHelper(State s,
GroundedAction groundedAction)
This method is always called when an option is initiated and begins execution.

boolean 
isAnnotatingExecutionResults()
Returns whether this option is annotating recorded action executions with this option's name.

abstract boolean 
isMarkov()
Returns whether this option is Markov or not; that is, whether action selection and termination only depends on the current state.

boolean 
isPrimitive()
Returns whether this action is a primitive action of the domain or not.

boolean 
isRecordingExecutionResults()
Returns whether this option is recording its executions

protected void 
iterateExpectationScan(burlap.behavior.singleagent.options.Option.ExpectationSearchNode src,
double stackedDiscount,
java.util.Map<HashableState,java.lang.Double> possibleTerminations,
double[] expectedReturn)
This method will recursively determine all possible paths that could occur from execution of the option as well
as the expected return.

void 
keepTrackOfRewardWith(RewardFunction rf,
double discount)
Tells this option to keep track the cumulative reward from its execution using the given reward function and the given discount factor.

protected State 
map(State s)
Returns the state that is mapped from the input state.

EnvironmentOutcome 
oneStep(Environment env,
GroundedAction groundedAction)
Performs one step of execution of the option in the provided
Environment . 
State 
oneStep(State s,
GroundedAction groundedAction)
Performs one step of execution of the option.

abstract GroundedAction 
oneStepActionSelection(State s,
GroundedAction groundedAction)
This method causes the option to select a single step in the given state, when the option was initiated with the provided parameters.

protected State 
performActionHelper(State st,
GroundedAction groundedAction)
This method determines what happens when an action is applied in the given state with the given parameters.

EnvironmentOutcome 
performInEnvironment(Environment env,
GroundedAction groundedActions)
Executes this action with the specified parameters in the provided environment and returns the
EnvironmentOutcome result. 
abstract double 
probabilityOfTermination(State s,
GroundedAction groundedAction)
Returns the probability that this option (executed with the given parameters) will terminate in the given state

void 
setExernalTermination(TerminalFunction tf)
Sets the external MDP's terminal function, which will cause this option to terminate if it enters one of those terminal states.

void 
setExpectationCalculationProbabilityCutoff(double cutoff)
Sets the minimum probability of reaching a terminal state for it to be included in the option's computed transition dynamics distribution.

void 
setExpectationHashingFactory(HashableStateFactory hashingFactory)
Sets the option to use the provided hashing factory for caching transition probability results.

void 
setStateMapping(StateMapping m)
Sets this option to use a state mapping that maps from the source MDP states to another state representation that will be used by this option for making
action selections.

void 
setTerminateMapper(DirectOptionTerminateMapper tm)
Sets this option to determine its execution results using a direct terminal state mapping rather than actually executing each action selected
by the option step by step.

void 
toggleShouldAnnotateResults(boolean toggle)
Toggle whether the last recorded option execution will annotate the actions taken with this option's name

void 
toggleShouldRecordResults(boolean toggle)
Change whether the option's last execution will be recorded or not.

abstract boolean 
usesDeterministicPolicy()
Returns whether this option's policy is deterministic or stochastic

abstract boolean 
usesDeterministicTermination()
Returns whether this option's termination conditions are deterministic or stochastic

addActionObserver, applicableInState, clearAllActionsObservers, deterministicTransition, equals, getAllApplicableGroundedActions, getAllApplicableGroundedActionsFromActionList, getAssociatedGroundedAction, getDomain, getGroundedAction, getName, hashCode, isParameterized, performAction
protected java.util.Random rand
protected EpisodeAnalysis lastOptionExecutionResults
protected boolean shouldRecordResults
protected boolean shouldAnnotateExecution
protected RewardFunction rf
protected boolean keepTrackOfReward
protected double discountFactor
protected double lastCumulativeReward
protected double cumulativeDiscount
protected int lastNumSteps
protected TerminalFunction externalTerminalFunction
protected HashableStateFactory expectationStateHashingFactory
protected java.util.Map<HashableState,java.util.List<TransitionProbability>> cachedExpectations
protected java.util.Map<HashableState,java.lang.Double> cachedExpectedRewards
protected double expectationSearchCutoffProb
protected StateMapping stateMapping
protected DirectOptionTerminateMapper terminateMapper
DirectOptionTerminateMapper
class documentation for more
information.
public Option()
public Option(java.lang.String name, Domain domain)
name - the name of the option (should be unique from other options and actions a planning/learning algorithm can use)
domain - a domain with which this option is associated; note that this option will *not* be added to the domain's list of actions like a normal action.
public abstract boolean isMarkov()
public abstract boolean usesDeterministicTermination()
public abstract boolean usesDeterministicPolicy()
public abstract double probabilityOfTermination(State s, GroundedAction groundedAction)
s - the state to test for termination
groundedAction - the parameters in which this option was initiated
public abstract void initiateInStateHelper(State s, GroundedAction groundedAction)
performActionHelper(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
For Markov options, this method probably does not need to do anything, but for non-Markov options, like macro actions, it may need
to initialize some structures for determining termination and action selection.
s - the state in which the option was initiated
groundedAction - the parameters in which this option will be initiated
public abstract GroundedAction oneStepActionSelection(State s, GroundedAction groundedAction)
performActionHelper(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method until it is determined that the option terminates.
s - the state in which an action should be selected.
groundedAction - the parameters in which this option was initiated
public abstract java.util.List<Policy.ActionProb> getActionDistributionForState(State s, GroundedAction groundedAction)
s - the state for which this option's policy distribution should be returned
groundedAction - the parameters in which this option was initiated
public void setExpectationHashingFactory(HashableStateFactory hashingFactory)
hashingFactory - the state hashing factory to use.
public void setExpectationCalculationProbabilityCutoff(double cutoff)
cutoff - the minimum probability of reaching a terminal state for it to be included in the option's computed transition dynamics distribution.
public void toggleShouldRecordResults(boolean toggle)
toggle - true if the last option execution should be saved; false otherwise.
public void toggleShouldAnnotateResults(boolean toggle)
toggle - true if the last recorded option execution will annotate the actions taken with this option's name; false otherwise
public boolean isRecordingExecutionResults()
public boolean isAnnotatingExecutionResults()
public EpisodeAnalysis getLastExecutionResults()
public void setStateMapping(StateMapping m)
m - the state mapping to use.
public void setTerminateMapper(DirectOptionTerminateMapper tm)
DirectOptionTerminateMapper
class documentation for more information.
tm - the direct state to terminal state mapping to use.
public void setExernalTermination(TerminalFunction tf)
tf - the external MDP's terminal function
protected State map(State s)
s - the input state from which a mapped state is to be returned.
public void keepTrackOfRewardWith(RewardFunction rf, double discount)
rf - the reward function to use
discount - the discount factor to use
public double getLastCumulativeReward()
public int getLastNumSteps()
public boolean isPrimitive()
Action
isPrimitive
in class Action
public void initiateInState(State s, GroundedAction groundedAction)
initiateInStateHelper(State, burlap.oomdp.singleagent.GroundedAction)
method will be called before exiting.
s - the state in which the option is being initiated.
groundedAction - the parameters in which this option was initiated
protected State performActionHelper(State st, GroundedAction groundedAction)
Action
Action.performAction(burlap.oomdp.core.states.State, GroundedAction)
first copies the input state to pass
to this helper method. The resulting state (which may be s) should then be returned.
performActionHelper
in class Action
st - the state to perform the action on
groundedAction - the GroundedAction
specifying the parameters to use
public EnvironmentOutcome performInEnvironment(Environment env, GroundedAction groundedActions)
Action
EnvironmentOutcome
result.
performInEnvironment
in class Action
env - the environment in which the action should be performed.
groundedActions - the GroundedAction
specifying the parameters to use
Returns the EnvironmentOutcome
specifying the result of the action execution in the environment.
public State oneStep(State s, GroundedAction groundedAction)
initiateInState(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method was called previously for the state in which this option was initiated.
s - the state in which a single step of the option is to be taken.
groundedAction - the parameters in which this option was initiated
public EnvironmentOutcome oneStep(Environment env, GroundedAction groundedAction)
Environment
.
This method assumes that the initiateInState(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method
was called previously for the state in which this option was initiated.
env - the Environment
in which this option is to be applied
groundedAction - the parameters in which this option was initiated
Returns the EnvironmentOutcome
of the one step of interaction.
public boolean continueFromState(State s, GroundedAction groundedAction)
s - the state to check against
groundedAction - the parameters in which this option was initiated
public double getExpectedRewards(State s, GroundedAction groundedAction)
s - the state in which the option is initiated
groundedAction - the parameters in which this option was initiated
public java.util.List<TransitionProbability> getTransitions(State st, GroundedAction groundedAction)
FullActionModel
TransitionProbability
objects. The list
is only required to contain transitions with nonzero probability.
getTransitions
in interface FullActionModel
st - the state from which the transition probabilities when applying this action will be returned.
groundedAction - the GroundedAction
specifying the parameters to use
protected void iterateExpectationScan(burlap.behavior.singleagent.options.Option.ExpectationSearchNode src, double stackedDiscount, java.util.Map<HashableState,java.lang.Double> possibleTerminations, double[] expectedReturn)
expectationSearchCutoffProb
src - the source node from which to expand possible paths
stackedDiscount - the discount amount up to this point
possibleTerminations - a map of possible termination states and their probability
expectedReturn - the expected discounted cumulative reward up to node src (an array of length 1, used as a mutable double)
protected void accumulateDiscountedProb(java.util.Map<HashableState,java.lang.Double> possibleTerminations, State s, double p)
possibleTerminations - the map from all possible termination states to the expected discounted probability of reaching them
s - a possible termination state
p - the discounted probability of reaching s for some specific number of steps not already summed into the respective possibleTerminations map.
protected java.util.List<Policy.ActionProb> getDeterministicPolicy(State s, GroundedAction groundedAction)
oneStepActionSelection(State, burlap.oomdp.singleagent.GroundedAction)
.
This method is helpful for quickly defining the action selection distribution for deterministic option policies.
s - the state for which the action selection distribution should be returned.
groundedAction - the parameters in which this option was initiated
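The branching expectation scan performed by iterateExpectationScan and accumulateDiscountedProb above can be sketched with a self-contained toy. This is not BURLAP code: the chain dynamics and the 0.5 termination probability are hypothetical, while the 0.001 cutoff matches the documented default. Each branch is expanded step by step, the discounted probability of each terminal state is accumulated, and branches whose probability falls below the cutoff are pruned:

```java
import java.util.HashMap;
import java.util.Map;

public class ExpectationScanSketch {
    static final double GAMMA = 0.9;
    static final double CUTOFF = 0.001; // matches the documented default cutoff

    // Toy dynamics (hypothetical): from state i the option terminates in i with
    // probability 0.5; otherwise it continues to state i + 1 after one step.
    public static void scan(int state, double prob, double discount,
                            Map<Integer, Double> terminations) {
        if (prob < CUTOFF) {
            return; // prune branches too unlikely to matter
        }
        double termProb = 0.5;
        // accumulate gamma^k * p(s, s', k) for this terminal state and step count
        terminations.merge(state, discount * prob * termProb, Double::sum);
        // recurse on the continuation branch with a deeper discount
        scan(state + 1, prob * (1.0 - termProb), discount * GAMMA, terminations);
    }

    public static void main(String[] args) {
        Map<Integer, Double> terminations = new HashMap<>();
        scan(0, 1.0, 1.0, terminations);
        // The chain has infinitely many possible outcomes, but pruning at 0.001
        // keeps only the first 10 terminal states (0.5^10 < 0.001).
        System.out.println(terminations.size()); // 10
        System.out.println(terminations.get(0)); // 0.5
    }
}
```

Note the pruning is the same trade-off described in the class documentation: a stochastic option can have an infinite outcome tree, so low-probability branches are dropped rather than enumerated.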