public abstract class Option extends Action implements FullActionModel
Action
class, they may be trivially added to any planning or learning algorithm. Some planning and learning
algorithms must handle options specially; for instance, Q-learning needs to treat the return from
an option's execution specially. However, the current planning and learning algorithms all handle options in the
appropriately special ways, so Options may be used confidently with existing algorithms.
To determine correct value function returns from option executions,
options need to keep track of the cumulative reward and the number of steps they've taken
since they began execution. This abstract class has data structures and code in place to automatically
handle that information, so any subclass of this Option class should "just work." When
an option is added to an MDPSolver
object
through the MDPSolver.addNonDomainReferencedAction(Action)
method, the solver will automatically tell the Option which reward function and discount factor it should use
to keep track of the cumulative reward.
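As a concrete illustration of this bookkeeping, the sketch below (hypothetical class and method names, not part of the BURLAP API) accumulates the discounted cumulative reward r_0 + \gamma r_1 + \gamma^2 r_2 + ... and the step count across an option's execution:

```java
// Minimal sketch (hypothetical, not the BURLAP API) of how an option can
// track its discounted cumulative reward and step count during execution.
class RewardTracker {
    private final double discountFactor;
    private double cumulativeDiscount = 1.0;   // gamma^k applied to the next step's reward
    private double lastCumulativeReward = 0.0;
    private int lastNumSteps = 0;

    RewardTracker(double discountFactor) {
        this.discountFactor = discountFactor;
    }

    // Called once per primitive step the option executes.
    void recordStep(double reward) {
        lastCumulativeReward += cumulativeDiscount * reward; // r_0 + gamma*r_1 + ...
        cumulativeDiscount *= discountFactor;                // advance gamma^k
        lastNumSteps++;
    }

    double getLastCumulativeReward() {
        return lastCumulativeReward;
    }

    int getLastNumSteps() {
        return lastNumSteps;
    }
}
```

The running `cumulativeDiscount` is the analogue of this class's `cumulativeDiscount` field: it avoids recomputing \gamma^k from scratch at every step.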
Note that value function planning algorithms that use the Bellman update (such as value iteration)
require the option to return not only the possible terminal states, but also the expected number of
steps to those terminal states and the expected cumulative reward. By default, this
abstract Option class computes those transition dynamics through a branching
exploration of the possible outcomes at each step of execution and caches the results
so that they do not need to be computed again. If the option is stochastic or
the underlying domain is stochastic, there may be an infinite number of possible outcomes.
As a result, the transition dynamics computation stops searching for states at
horizons reached with less than some small probability (by default set to
0.001). This threshold may be modified. However, if the transition dynamics can be specified
a priori, it is recommended that the getTransitions(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method be overridden
and specified by hand rather than requiring this class to enumerate the results. Finally,
note that the getTransitions(State, burlap.oomdp.singleagent.GroundedAction)
method returns TransitionProbability
elements, where each TransitionProbability
holds the probability of transitioning to a state discounted
by the expected length of time taken to reach it. That is, the probability value in each TransitionProbability
is the sum over all possible numbers of steps k of \gamma^k p(s, s', k), where s' is the state held by the
TransitionProbability
object.
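The discounted probability \sum_k \gamma^k p(s, s', k) can be accumulated one step count at a time. Below is a minimal sketch of that bookkeeping (a hypothetical class, not the BURLAP implementation; states are keyed by String for simplicity, where BURLAP would use HashableState):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (hypothetical, not BURLAP code) of accumulating the discounted
// termination probability sum_k gamma^k * p(s, s', k) for each terminal state.
class DiscountedTerminationProbs {
    private final Map<String, Double> possibleTerminations = new HashMap<>();

    // p is assumed to already be gamma^k * p(s, s', k) for one specific step
    // count k; calling this for every k accumulates the full sum for sPrime.
    void accumulateDiscountedProb(String sPrime, double p) {
        possibleTerminations.merge(sPrime, p, Double::sum);
    }

    double discountedProb(String sPrime) {
        return possibleTerminations.getOrDefault(sPrime, 0.0);
    }
}
```

For example, with \gamma = 0.9, reaching a goal state in 1 step with probability 0.5 and in 2 steps with probability 0.25 yields a discounted probability of 0.9 * 0.5 + 0.81 * 0.25 = 0.6525.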
1. Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial Intelligence 112.1 (1999): 181-211.
| Modifier and Type | Field and Description |
|---|---|
| protected java.util.Map<HashableState,java.util.List<TransitionProbability>> | cachedExpectations: the cached transition probabilities from each initiation state |
| protected java.util.Map<HashableState,java.lang.Double> | cachedExpectedRewards: the cached expected reward from each initiation state |
| protected double | cumulativeDiscount: how much to discount the reward in the next option step |
| protected double | discountFactor: the discount factor of the MDP in which this option will be applied |
| protected double | expectationSearchCutoffProb: the minimum probability of reaching a possible terminal state for it to be included in the computed transition dynamics |
| protected HashableStateFactory | expectationStateHashingFactory: the state hashing factory used to cache the transition probabilities so that they only need to be computed once for each state |
| protected TerminalFunction | externalTerminalFunction: the terminal function of the MDP in which this option is to be executed |
| protected boolean | keepTrackOfReward: whether the cumulative reward during execution should be recorded |
| protected double | lastCumulativeReward: the cumulative reward received during the last execution of this option |
| protected int | lastNumSteps: how many steps were taken in the option's last execution |
| protected EpisodeAnalysis | lastOptionExecutionResults: stores the last execution results of the option, from the initiation state to the state in which it terminated |
| protected java.util.Random | rand: random object for following stochastic option policies |
| protected RewardFunction | rf: the reward function used to keep track of the cumulative reward during an execution |
| protected boolean | shouldAnnotateExecution: whether the last option execution recording annotates the selected actions with this option's name |
| protected boolean | shouldRecordResults: whether the last option execution result should be saved |
| protected StateMapping | stateMapping: an optional state mapping from the source MDP state representation to the representation this option uses for action selection |
| protected DirectOptionTerminateMapper | terminateMapper: an optional mapping from initiation states to terminal states so that the execution of an option does not need to be simulated |
Fields inherited from class Action: actionObservers, domain, name
| Constructor and Description |
|---|
| Option(): initializes an option without a name or parameters |
| Option(java.lang.String name, Domain domain): initializes an option with the given name for the given domain |
| Modifier and Type | Method and Description |
|---|---|
| protected void | accumulateDiscountedProb(java.util.Map<HashableState,java.lang.Double> possibleTerminations, State s, double p): adds to the expected discounted probability of reaching a state given a value p = \gamma^k * p(s, s', k), where s' is a possible terminal state and k is a number of steps not yet added to the sum over all possible step counts to s' |
| boolean | continueFromState(State s, GroundedAction groundedAction): uses this option's termination probability, rolls the dice, and returns whether the option should continue or terminate |
| abstract java.util.List<Policy.ActionProb> | getActionDistributionForState(State s, GroundedAction groundedAction): returns the option's policy distribution for a given state |
| protected java.util.List<Policy.ActionProb> | getDeterministicPolicy(State s, GroundedAction groundedAction): creates a deterministic action selection probability distribution in which the action selected with probability 1 is the one returned by oneStepActionSelection(State, burlap.oomdp.singleagent.GroundedAction) |
| double | getExpectedRewards(State s, GroundedAction groundedAction): returns the expected reward to be received from initiating this option from state s |
| double | getLastCumulativeReward(): returns the cumulative discounted reward received in the last execution of this option |
| EpisodeAnalysis | getLastExecutionResults(): returns the events from this option's last execution |
| int | getLastNumSteps(): returns the number of steps taken in the last execution of this option |
| java.util.List<TransitionProbability> | getTransitions(State st, GroundedAction groundedAction): returns the transition probabilities for applying this action in the given state with the given set of parameters |
| void | initiateInState(State s, GroundedAction groundedAction): tells the option that it is being initiated in the given state with the given parameters |
| abstract void | initiateInStateHelper(State s, GroundedAction groundedAction): always called when an option is initiated and begins execution |
| boolean | isAnnotatingExecutionResults(): returns whether this option annotates recorded action executions with this option's name |
| abstract boolean | isMarkov(): returns whether this option is Markov; that is, whether action selection and termination depend only on the current state |
| boolean | isPrimitive(): returns whether this action is a primitive action of the domain or not |
| boolean | isRecordingExecutionResults(): returns whether this option is recording its executions |
| protected void | iterateExpectationScan(burlap.behavior.singleagent.options.Option.ExpectationSearchNode src, double stackedDiscount, java.util.Map<HashableState,java.lang.Double> possibleTerminations, double[] expectedReturn): recursively determines all possible paths that could occur from execution of the option, as well as the expected return |
| void | keepTrackOfRewardWith(RewardFunction rf, double discount): tells this option to keep track of the cumulative reward from its execution using the given reward function and discount factor |
| protected State | map(State s): returns the state that is mapped from the input state |
| EnvironmentOutcome | oneStep(Environment env, GroundedAction groundedAction): performs one step of execution of the option in the provided Environment |
| State | oneStep(State s, GroundedAction groundedAction): performs one step of execution of the option |
| abstract GroundedAction | oneStepActionSelection(State s, GroundedAction groundedAction): causes the option to select a single step in the given state, when the option was initiated with the provided parameters |
| protected State | performActionHelper(State st, GroundedAction groundedAction): determines what happens when an action is applied in the given state with the given parameters |
| EnvironmentOutcome | performInEnvironment(Environment env, GroundedAction groundedActions): executes this action with the specified parameters in the provided environment and returns the EnvironmentOutcome result |
| abstract double | probabilityOfTermination(State s, GroundedAction groundedAction): returns the probability that this option (executed with the given parameters) will terminate in the given state |
| void | setExernalTermination(TerminalFunction tf): sets the external MDP's terminal function that will cause this option to terminate if it enters one of those terminal states |
| void | setExpectationCalculationProbabilityCutoff(double cutoff): sets the minimum probability of reaching a terminal state for it to be included in the option's computed transition dynamics distribution |
| void | setExpectationHashingFactory(HashableStateFactory hashingFactory): sets the option to use the provided hashing factory for caching transition probability results |
| void | setStateMapping(StateMapping m): sets this option to use a state mapping from the source MDP states to another state representation that this option will use for making action selections |
| void | setTerminateMapper(DirectOptionTerminateMapper tm): sets this option to determine its execution results using a direct terminal state mapping rather than actually executing each action selected by the option step by step |
| void | toggleShouldAnnotateResults(boolean toggle): toggles whether the last recorded option execution will annotate the actions taken with this option's name |
| void | toggleShouldRecordResults(boolean toggle): changes whether the option's last execution will be recorded |
| abstract boolean | usesDeterministicPolicy(): returns whether this option's policy is deterministic or stochastic |
| abstract boolean | usesDeterministicTermination(): returns whether this option's termination conditions are deterministic or stochastic |
Methods inherited from class Action: addActionObserver, applicableInState, clearAllActionsObservers, deterministicTransition, equals, getAllApplicableGroundedActions, getAllApplicableGroundedActionsFromActionList, getAssociatedGroundedAction, getDomain, getGroundedAction, getName, hashCode, isParameterized, performAction
protected java.util.Random rand
protected EpisodeAnalysis lastOptionExecutionResults
protected boolean shouldRecordResults
protected boolean shouldAnnotateExecution
protected RewardFunction rf
protected boolean keepTrackOfReward
protected double discountFactor
protected double lastCumulativeReward
protected double cumulativeDiscount
protected int lastNumSteps
protected TerminalFunction externalTerminalFunction
protected HashableStateFactory expectationStateHashingFactory
protected java.util.Map<HashableState,java.util.List<TransitionProbability>> cachedExpectations
protected java.util.Map<HashableState,java.lang.Double> cachedExpectedRewards
protected double expectationSearchCutoffProb
protected StateMapping stateMapping
protected DirectOptionTerminateMapper terminateMapper
See the DirectOptionTerminateMapper class documentation for more information.

public Option()
public Option(java.lang.String name, Domain domain)
name - the name of the option (should be unique from other options and actions a planning/learning algorithm can use)
domain - a domain with which this option is associated; note that this option will *not* be added to the domain's list of actions like a normal action

public abstract boolean isMarkov()
public abstract boolean usesDeterministicTermination()
public abstract boolean usesDeterministicPolicy()
public abstract double probabilityOfTermination(State s, GroundedAction groundedAction)
s - the state to test for termination
groundedAction - the parameters in which this option was initiated

public abstract void initiateInStateHelper(State s, GroundedAction groundedAction)
performActionHelper(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
For Markov options, this method probably does not need to do anything, but for non-Markov options, like macro actions, it may need
to initialize some structures for determining termination and action selection.
s - the state in which the option was initiated
groundedAction - the parameters in which this option will be initiated

public abstract GroundedAction oneStepActionSelection(State s, GroundedAction groundedAction)
performActionHelper(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method until it is determined that the option terminates.
s - the state in which an action should be selected
groundedAction - the parameters in which this option was initiated

public abstract java.util.List<Policy.ActionProb> getActionDistributionForState(State s, GroundedAction groundedAction)
s - the state for which this option's policy distribution should be returned
groundedAction - the parameters in which this option was initiated

public void setExpectationHashingFactory(HashableStateFactory hashingFactory)
hashingFactory - the state hashing factory to use

public void setExpectationCalculationProbabilityCutoff(double cutoff)
cutoff - the minimum probability of reaching a terminal state for it to be included in the option's computed transition dynamics distribution

public void toggleShouldRecordResults(boolean toggle)
toggle - true if the last option execution should be saved; false otherwise

public void toggleShouldAnnotateResults(boolean toggle)
toggle - true if the last recorded option execution will annotate the actions taken with this option's name; false otherwise

public boolean isRecordingExecutionResults()
public boolean isAnnotatingExecutionResults()
public EpisodeAnalysis getLastExecutionResults()
public void setStateMapping(StateMapping m)
m - the state mapping to use

public void setTerminateMapper(DirectOptionTerminateMapper tm)
See the DirectOptionTerminateMapper class documentation for more information.
tm - the direct state to terminal state mapping to use

public void setExernalTermination(TerminalFunction tf)
tf - the external MDP's terminal function

protected State map(State s)
s - the input state from which a mapped state is to be returned

public void keepTrackOfRewardWith(RewardFunction rf, double discount)
rf - the reward function to use
discount - the discount factor to use

public double getLastCumulativeReward()
public int getLastNumSteps()
public boolean isPrimitive()
Description copied from class: Action
Overrides: isPrimitive in class Action
public void initiateInState(State s, GroundedAction groundedAction)
The initiateInStateHelper(State, burlap.oomdp.singleagent.GroundedAction) method will be called before exiting.
s - the state in which the option is being initiated
groundedAction - the parameters in which this option was initiated

protected State performActionHelper(State st, GroundedAction groundedAction)
Description copied from class: Action
Action.performAction(burlap.oomdp.core.states.State, GroundedAction) first copies the input state to pass
to this helper method. The resulting state (which may be s) should then be returned.
Overrides: performActionHelper in class Action
st - the state to perform the action on
groundedAction - the GroundedAction specifying the parameters to use

public EnvironmentOutcome performInEnvironment(Environment env, GroundedAction groundedActions)
Description copied from class: Action
Executes this action with the specified parameters in the provided environment and returns the EnvironmentOutcome result.
Overrides: performInEnvironment in class Action
env - the environment in which the action should be performed
groundedActions - the GroundedAction specifying the parameters to use
Returns: the EnvironmentOutcome specifying the result of the action execution in the environment

public State oneStep(State s, GroundedAction groundedAction)
This method assumes that the initiateInState(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method was called previously for the state in which this option was initiated.
s - the state in which a single step of the option is to be taken
groundedAction - the parameters in which this option was initiated

public EnvironmentOutcome oneStep(Environment env, GroundedAction groundedAction)
Performs one step of execution of the option in the provided Environment.
This method assumes that the initiateInState(burlap.oomdp.core.states.State, burlap.oomdp.singleagent.GroundedAction)
method was called previously for the state in which this option was initiated.
env - the Environment in which this option is to be applied
groundedAction - the parameters in which this option was initiated
Returns: the EnvironmentOutcome of the one step of interaction

public boolean continueFromState(State s, GroundedAction groundedAction)
s - the state to check against
groundedAction - the parameters in which this option was initiated

public double getExpectedRewards(State s, GroundedAction groundedAction)
s - the state in which the option is initiated
groundedAction - the parameters in which this option was initiated

public java.util.List<TransitionProbability> getTransitions(State st, GroundedAction groundedAction)
Description copied from interface: FullActionModel
Returns the transition probabilities for applying this action in the given state as a list of TransitionProbability objects. The list is only required to contain transitions with non-zero probability.
Specified by: getTransitions in interface FullActionModel
st - the state from which the transition probabilities when applying this action will be returned
groundedAction - the GroundedAction specifying the parameters to use

protected void iterateExpectationScan(burlap.behavior.singleagent.options.Option.ExpectationSearchNode src, double stackedDiscount, java.util.Map<HashableState,java.lang.Double> possibleTerminations, double[] expectedReturn)
Path expansion stops when the probability of a path falls below expectationSearchCutoffProb.
src - the source node from which to expand possible paths
stackedDiscount - the discount amount accumulated up to this point
possibleTerminations - a map from possible termination states to their probability
expectedReturn - the expected discounted cumulative reward up to node src (an array of length 1 used as a mutable double)

protected void accumulateDiscountedProb(java.util.Map<HashableState,java.lang.Double> possibleTerminations, State s, double p)
possibleTerminations - the map from all possible termination states to the expected discounted probability of reaching them
s - a possible termination state
p - the discounted probability of reaching s for some specific number of steps not already summed into the respective possibleTerminations map

protected java.util.List<Policy.ActionProb> getDeterministicPolicy(State s, GroundedAction groundedAction)
This method creates a deterministic action selection probability distribution where the action to be selected
with probability 1 is the one returned by the method oneStepActionSelection(State, burlap.oomdp.singleagent.GroundedAction).
This method is helpful for quickly defining the action selection distribution for deterministic option policies.
s - the state for which the action selection distribution should be returned
groundedAction - the parameters in which this option was initiated
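To illustrate how a branching outcome expansion with a probability cutoff (the role played by iterateExpectationScan together with expectationSearchCutoffProb) can compute an option's expected discounted return, here is a self-contained sketch. It uses a toy option that yields reward 1 each step and terminates with a fixed probability after the step; all names are illustrative, not BURLAP code:

```java
// Illustrative sketch (not the BURLAP implementation) of a branching
// expectation scan with a probability cutoff. A toy option emits reward 1
// each step and terminates with probability pTerm after the step; branches
// whose path probability falls below the cutoff are pruned, mirroring the
// role of expectationSearchCutoffProb.
class ExpectationScanSketch {
    static double expectedReturn(double pTerm, double gamma, double cutoff) {
        return expand(1.0, 1.0, pTerm, gamma, cutoff);
    }

    private static double expand(double pathProb, double discount,
                                 double pTerm, double gamma, double cutoff) {
        if (pathProb < cutoff) {
            return 0.0; // prune branches whose probability is below the cutoff
        }
        // reward of 1 received this step, weighted by path probability and discount
        double contribution = pathProb * discount;
        // with probability (1 - pTerm) the option continues for another step
        return contribution + expand(pathProb * (1.0 - pTerm), discount * gamma,
                                     pTerm, gamma, cutoff);
    }
}
```

With pTerm = 0.5 and gamma = 0.5, the exact expected return is 1 / (1 - gamma * (1 - pTerm)) = 4/3, and the pruned expansion converges to that value as the cutoff shrinks.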