public class GradientDescentSarsaLam extends OOMDPPlanner implements QComputablePlanner, LearningAgent

Gradient-descent SARSA(λ) implementation that approximates Q-values through the ValueFunctionApproximation interface provided to it. The implementation can either be used for learning or planning, the latter of which is performed by running many learning episodes in succession. The number of episodes used for planning can be determined by a threshold maximum number of episodes, or by a maximum change in the VFA weight threshold.
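For orientation, the sketch below shows the learning use case. It assumes that domain, rf, tf, vfa, and initialState have already been constructed for your own problem; the numeric parameters are illustrative values, not recommendations, and imports are omitted.

```java
// Sketch: using GradientDescentSarsaLam as a learning agent.
// domain, rf, tf, vfa, and initialState are assumed to be built elsewhere.
GradientDescentSarsaLam agent = new GradientDescentSarsaLam(
        domain, rf, tf,
        0.99,   // gamma: discount factor
        vfa,    // value function approximation used to estimate Q-values
        0.02,   // learning rate
        0.5);   // lambda: eligibility trace strength

agent.setNumEpisodesToStore(100); // keep the most recent 100 learning episodes

for (int i = 0; i < 100; i++) {
    agent.runLearningEpisodeFrom(initialState);
    System.out.println("Episode " + i + ": " + agent.getLastNumSteps() + " steps");
}
```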
Modifier and Type | Class and Description
---|---
static class | GradientDescentSarsaLam.EligibilityTraceVector - An object for keeping track of the eligibility traces within an episode for each VFA weight

Nested classes/interfaces inherited from the implemented interfaces: QComputablePlanner.QComputablePlannerHelper, LearningAgent.LearningAgentBookKeeping
Modifier and Type | Field and Description
---|---
protected java.util.LinkedList<EpisodeAnalysis> | episodeHistory - The saved previous learning episodes
protected int | eStepCounter - A counter for the number of steps taken so far in the current episode
protected double | lambda - The strength of eligibility traces (0 for one step, 1 for full propagation)
protected Policy | learningPolicy - The learning policy to use
protected LearningRate | learningRate - The learning rate function to use
protected int | maxEpisodeSize - The maximum number of steps that will be taken in an episode before the agent terminates a learning episode
protected double | maxWeightChangeForPlanningTermination - The maximum allowable change in the VFA weights during an episode before the planning method terminates
protected double | maxWeightChangeInLastEpisode - The maximum VFA weight change that occurred in the last learning episode
protected double | minEligibityForUpdate - The minimum eligibility value of a trace that will cause it to be updated
protected int | numEpisodesForPlanning - The maximum number of episodes to use for planning
protected int | numEpisodesToStore - The number of the most recent learning episodes to store
protected boolean | shouldAnnotateOptions - Whether decomposed options should have their primitive actions annotated with the option's name in the returned EpisodeAnalysis objects
protected boolean | shouldDecomposeOptions - Whether options should be decomposed into primitive actions in the returned EpisodeAnalysis objects
protected int | totalNumberOfSteps - The total number of learning steps performed by this agent
protected boolean | useFeatureWiseLearningRate - Whether learning rate polls should be based on the VFA state features or the OO-MDP state
protected boolean | useReplacingTraces - Whether to use replacing eligibility traces rather than accumulating traces
protected ValueFunctionApproximation | vfa - The object that performs value function approximation
Fields inherited from class OOMDPPlanner: actions, containsParameterizedActions, debugCode, domain, gamma, hashingFactory, mapToStateIndex, rf, tf
Constructor and Description
---
GradientDescentSarsaLam(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, double lambda) - Initializes SARSA(λ) with a 0.1 epsilon-greedy policy and places no limit on the number of steps the agent can take in an episode.
GradientDescentSarsaLam(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, int maxEpisodeSize, double lambda) - Initializes SARSA(λ) with a 0.1 epsilon-greedy policy.
GradientDescentSarsaLam(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda) - Initializes SARSA(λ). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the planner to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.
Modifier and Type | Method and Description
---|---
protected void | GDSLInit(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda) - Initializes SARSA(λ). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the planner to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.
protected ActionApproximationResult | getActionApproximation(State s, GroundedAction ga) - Returns the VFA Q-value approximation for the given state and action.
protected java.util.List<ActionApproximationResult> | getAllActionApproximations(State s) - Gets all Q-value VFA results for each action for a given state.
java.util.List<EpisodeAnalysis> | getAllStoredLearningEpisodes() - Returns all saved EpisodeAnalysis objects of which the agent has kept track.
EpisodeAnalysis | getLastLearningEpisode() - Returns the last learning episode of the agent.
int | getLastNumSteps() - Returns the number of steps taken in the last episode.
QValue | getQ(State s, AbstractGroundedAction a) - Returns the QValue for the given state-action pair.
protected QValue | getQFromFeaturesFor(java.util.List<ActionApproximationResult> results, State s, GroundedAction ga) - Creates a Q-value object in which the Q-value is determined from VFA.
java.util.List<QValue> | getQs(State s) - Returns a List of QValue objects for every permissible action for the given input state.
void | planFromState(State initialState) - This method will cause the planner to begin planning from the specified initial state.
void | resetPlannerResults() - Resets all planner results so that planning can be started fresh with a call to OOMDPPlanner.planFromState(State) as if no planning had ever been performed before.
EpisodeAnalysis | runLearningEpisodeFrom(State initialState) - Causes the agent to perform a learning episode starting in the given initial state.
EpisodeAnalysis | runLearningEpisodeFrom(State initialState, int maxSteps) - Causes the agent to perform a learning episode starting in the given initial state.
void | setLearningPolicy(Policy p) - Sets which policy this agent should use for learning.
void | setLearningRate(LearningRate lr) - Sets the learning rate function to use.
void | setMaximumEpisodesForPlanning(int n) - Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.
void | setMaxVFAWeightChangeForPlanningTerminaiton(double m) - Sets a maximum change in the VFA weight threshold that will cause planFromState(State) to stop planning when it is achieved.
void | setNumEpisodesToStore(int numEps) - Tells the agent how many EpisodeAnalysis objects representing learning episodes to internally store.
void | setUseFeatureWiseLearningRate(boolean useFeatureWiseLearningRate) - Sets whether learning rate polls should be based on the VFA state feature ids or the OO-MDP state.
void | setUseReplaceTraces(boolean toggle) - Sets whether to use replacing eligibility traces rather than accumulating traces.
void | toggleShouldAnnotateOptionDecomposition(boolean toggle) - Sets whether options that are decomposed into primitives will have the option that produced them annotated.
void | toggleShouldDecomposeOption(boolean toggle) - Sets whether the primitive actions taken during an option will be included as steps in produced EpisodeAnalysis objects.
Methods inherited from class OOMDPPlanner: addNonDomainReferencedAction, getActions, getAllGroundedActions, getDebugCode, getDomain, getGamma, getHashingFactory, getRf, getRF, getTf, getTF, plannerInit, setActions, setDebugCode, setDomain, setGamma, setRf, setTf, stateHash, toggleDebugPrinting, translateAction
protected ValueFunctionApproximation vfa
protected LearningRate learningRate
protected Policy learningPolicy
protected double lambda
protected int maxEpisodeSize
protected int eStepCounter
protected int numEpisodesForPlanning
protected double maxWeightChangeForPlanningTermination
protected double maxWeightChangeInLastEpisode
protected boolean useFeatureWiseLearningRate
protected double minEligibityForUpdate
protected java.util.LinkedList<EpisodeAnalysis> episodeHistory
protected int numEpisodesToStore
protected boolean useReplacingTraces
protected boolean shouldDecomposeOptions
protected boolean shouldAnnotateOptions
protected int totalNumberOfSteps
public GradientDescentSarsaLam(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, double lambda)

Initializes SARSA(λ) with a 0.1 epsilon-greedy policy and places no limit on the number of steps the agent can take in an episode. By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the planner to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
rf - the reward function
tf - the terminal function
gamma - the discount factor
vfa - the value function approximation method to use for estimating Q-values
learningRate - the learning rate
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)

public GradientDescentSarsaLam(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, int maxEpisodeSize, double lambda)

Initializes SARSA(λ) with a 0.1 epsilon-greedy policy. By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the planner to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
rf - the reward function
tf - the terminal function
gamma - the discount factor
vfa - the value function approximation method to use for estimating Q-values
learningRate - the learning rate
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)

public GradientDescentSarsaLam(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda)

Initializes SARSA(λ). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the planner to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
rf - the reward function
tf - the terminal function
gamma - the discount factor
vfa - the value function approximation method to use for estimating Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)

protected void GDSLInit(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, ValueFunctionApproximation vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda)

Initializes SARSA(λ). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the planner to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
rf - the reward function
tf - the terminal function
gamma - the discount factor
vfa - the value function approximation method to use for estimating Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public void setLearningRate(LearningRate lr)

Sets the learning rate function to use.

Parameters:
lr - the learning rate function to use

public void setUseFeatureWiseLearningRate(boolean useFeatureWiseLearningRate)

Sets whether learning rate polls should be based on the VFA state feature ids or the OO-MDP state.

Parameters:
useFeatureWiseLearningRate - if true, learning rate polls are based on VFA state feature ids; if false, they are based on the OO-MDP state object

public void setLearningPolicy(Policy p)

Sets which policy this agent should use for learning.

Parameters:
p - the policy to use for learning

public void setMaximumEpisodesForPlanning(int n)

Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.

Parameters:
n - the maximum number of episodes that will be performed when the planFromState(State) method is called

public void setMaxVFAWeightChangeForPlanningTerminaiton(double m)

Sets a maximum change in the VFA weight threshold that will cause planFromState(State) to stop planning when it is achieved.

Parameters:
m - the maximum allowable change in the VFA weights before planning stops
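A sketch of the planning use case with these termination thresholds follows; as in the earlier sketch, domain, rf, tf, vfa, and initialState are assumed to exist and the threshold values are illustrative.

```java
// Sketch: using GradientDescentSarsaLam as a planner. Planning runs learning
// episodes until the episode budget is exhausted or the VFA weight change in
// an episode falls below the given threshold.
GradientDescentSarsaLam planner = new GradientDescentSarsaLam(
        domain, rf, tf, 0.99, vfa, 0.02, 0.5);

planner.setMaximumEpisodesForPlanning(1000);                 // episode budget
planner.setMaxVFAWeightChangeForPlanningTerminaiton(0.001);  // weight-change threshold

planner.planFromState(initialState);
```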
public int getLastNumSteps()

Returns the number of steps taken in the last episode.

public void setUseReplaceTraces(boolean toggle)

Sets whether to use replacing eligibility traces rather than accumulating traces.

Parameters:
toggle - whether to use replacing eligibility traces rather than accumulating traces

public void toggleShouldDecomposeOption(boolean toggle)

Sets whether the primitive actions taken during an option will be included as steps in produced EpisodeAnalysis objects.

Parameters:
toggle - whether to decompose options into the primitive actions taken by them or not

public void toggleShouldAnnotateOptionDecomposition(boolean toggle)

Sets whether options that are decomposed into primitives will have the option that produced them annotated.

Parameters:
toggle - whether to annotate the primitive actions of options with the calling option's name
public EpisodeAnalysis runLearningEpisodeFrom(State initialState)

Causes the agent to perform a learning episode starting in the given initial state.

Specified by:
runLearningEpisodeFrom in interface LearningAgent

Parameters:
initialState - the initial state in which the agent will start the episode

Returns:
the resulting EpisodeAnalysis object

public EpisodeAnalysis runLearningEpisodeFrom(State initialState, int maxSteps)

Causes the agent to perform a learning episode starting in the given initial state.

Specified by:
runLearningEpisodeFrom in interface LearningAgent

Parameters:
initialState - the initial state in which the agent will start the episode
maxSteps - the maximum number of steps in the episode

Returns:
the resulting EpisodeAnalysis object

public EpisodeAnalysis getLastLearningEpisode()

Returns the last learning episode of the agent.

Specified by:
getLastLearningEpisode in interface LearningAgent

public void setNumEpisodesToStore(int numEps)

Tells the agent how many EpisodeAnalysis objects representing learning episodes to internally store. For instance, if the number is set to 5, then the agent should remember the last 5 learning episodes. Note that this number has nothing to do with how learning is performed; it is purely for performance gathering.

Specified by:
setNumEpisodesToStore in interface LearningAgent

Parameters:
numEps - the number of learning episodes to remember

public java.util.List<EpisodeAnalysis> getAllStoredLearningEpisodes()

Returns all saved EpisodeAnalysis objects of which the agent has kept track.

Specified by:
getAllStoredLearningEpisodes in interface LearningAgent

Returns:
all saved EpisodeAnalysis objects of which the agent has kept track
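These bookkeeping methods can be combined roughly as in the sketch below, which reuses the agent and initialState assumed in the earlier sketches; the episode counts and step cap are illustrative.

```java
// Sketch: capping episode length and inspecting stored episodes.
agent.setNumEpisodesToStore(5);                      // remember the last 5 episodes
for (int i = 0; i < 20; i++) {
    agent.runLearningEpisodeFrom(initialState, 500); // at most 500 steps per episode
}
EpisodeAnalysis last = agent.getLastLearningEpisode();
java.util.List<EpisodeAnalysis> saved = agent.getAllStoredLearningEpisodes(); // at most 5 entries
```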
public java.util.List<QValue> getQs(State s)

Returns a List of QValue objects for every permissible action for the given input state.

Specified by:
getQs in interface QComputablePlanner

Parameters:
s - the state for which Q-values are to be returned

Returns:
a List of QValue objects for every permissible action for the given input state

public QValue getQ(State s, AbstractGroundedAction a)

Returns the QValue for the given state-action pair.

Specified by:
getQ in interface QComputablePlanner

Parameters:
s - the input state
a - the input action

Returns:
the QValue for the given state-action pair
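Because the class implements QComputablePlanner, the learned value function can be inspected directly. The sketch below assumes the agent from the earlier sketches, a state s of interest, and that QValue exposes its action and value as public fields a and q.

```java
// Sketch: inspecting the learned Q-values for a state of interest.
java.util.List<QValue> qs = agent.getQs(s);
for (QValue qv : qs) {
    System.out.println(qv.a.toString() + " -> " + qv.q);
}
```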
protected QValue getQFromFeaturesFor(java.util.List<ActionApproximationResult> results, State s, GroundedAction ga)

Creates a Q-value object in which the Q-value is determined from VFA.

Parameters:
results - the VFA prediction results for each action
s - the state of the Q-value
ga - the action taken

protected java.util.List<ActionApproximationResult> getAllActionApproximations(State s)

Gets all Q-value VFA results for each action for a given state.

Parameters:
s - the state for which the Q-value VFA results should be returned

protected ActionApproximationResult getActionApproximation(State s, GroundedAction ga)

Returns the VFA Q-value approximation for the given state and action.

Parameters:
s - the state for which the VFA result should be returned
ga - the action for which the VFA result should be returned
public void planFromState(State initialState)

This method will cause the planner to begin planning from the specified initial state.

Specified by:
planFromState in class OOMDPPlanner

Parameters:
initialState - the initial state of the planning problem

public void resetPlannerResults()

Use this method to reset all planner results so that planning can be started fresh with a call to OOMDPPlanner.planFromState(State) as if no planning had ever been performed before. Specifically, data produced from calls to OOMDPPlanner.planFromState(State) will be cleared, but all other planner settings should remain the same. This is useful if the reward function or transition dynamics have changed, thereby requiring new results to be computed. If there were other objects this planner was provided that may have changed and need to be reset, you will need to reset them yourself. For instance, if you told a planner to follow a policy that had a temperature parameter decrease with time, you will need to reset the policy's temperature yourself.

Specified by:
resetPlannerResults in class OOMDPPlanner