public class GradientDescentSarsaLam extends MDPSolver implements QProvider, LearningAgent, Planner
A gradient descent SARSA(\lambda) implementation [1] that estimates Q-values with the DifferentiableStateActionValue implementation provided, and that supports temporally extended options [2]. The implementation can either be used for learning or planning, the latter of which is performed by running many learning episodes in succession in a SimulatedEnvironment.
If you are going to use this algorithm for planning, call the initializeForPlanning(int) method before calling planFromState(State). The number of episodes used for planning can be determined by a threshold maximum number of episodes, or by a maximum change in the VFA weight threshold.
By default, this agent will use an epsilon-greedy policy with epsilon=0.1. You can change the learning policy to anything with the setLearningPolicy(burlap.behavior.policy.Policy) method.
If you want to use a custom learning rate decay schedule rather than a constant learning rate, use the setLearningRate(burlap.behavior.learningrate.LearningRate) method.
1. Rummery, Gavin A., and Mahesan Niranjan. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering, 1994.
2. Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial Intelligence 112.1 (1999): 181-211.
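A minimal learning sketch is shown below. It assumes BURLAP 3-style types, that a suitable SADomain (`domain`), initial State (`s0`), and DifferentiableStateActionValue function approximator (`vfa`) have already been constructed for your problem, and it omits import statements; treat it as illustrative rather than canonical usage.

```java
// Hedged sketch: `domain`, `s0`, and `vfa` are placeholders constructed elsewhere.
GradientDescentSarsaLam agent =
        new GradientDescentSarsaLam(domain, 0.99, vfa, 0.02, 0.5);

// Learning interacts with an Environment; a SimulatedEnvironment over the domain
// and an initial state is the usual choice (constructor assumed from BURLAP 3).
SimulatedEnvironment env = new SimulatedEnvironment(domain, s0);

for (int i = 0; i < 100; i++) {
    Episode e = agent.runLearningEpisode(env); // one SARSA(lambda) learning episode
    env.resetEnvironment();                    // reset to s0 before the next episode
}
```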
Modifier and Type | Class and Description
---|---
static class | GradientDescentSarsaLam.EligibilityTraceVector: An object for keeping track of the eligibility traces within an episode for each VFA weight

Nested classes/interfaces inherited from interface QProvider: QProvider.Helper
Modifier and Type | Field and Description
---|---
protected int | eStepCounter: A counter for the number of steps in an episode that have been taken thus far
protected double | lambda: The strength of eligibility traces (0 for one step, 1 for full propagation)
protected Policy | learningPolicy: The learning policy to use
protected LearningRate | learningRate: The learning rate function to use
protected int | maxEpisodeSize: The maximum number of steps that will be taken in an episode before the agent terminates a learning episode
protected double | maxWeightChangeForPlanningTermination: The maximum allowable change in the VFA weights during an episode before the planning method terminates
protected double | maxWeightChangeInLastEpisode: The maximum VFA weight change that occurred in the last learning episode
protected double | minEligibityForUpdate: The minimum eligibility value of a trace that will cause it to be updated
protected int | numEpisodesForPlanning: The maximum number of episodes to use for planning
protected boolean | shouldDecomposeOptions: Whether options should be decomposed into actions in the returned Episode objects
protected int | totalNumberOfSteps: The total number of learning steps performed by this agent
protected boolean | useFeatureWiseLearningRate: Whether learning rate polls should be based on the VFA state features or the OO-MDP state
protected boolean | useReplacingTraces: Whether to use accumulating or replacing eligibility traces
protected DifferentiableStateActionValue | vfa: The object that performs value function approximation
Fields inherited from class MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description |
---|
GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, double lambda): Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy and places no limit on the number of steps the agent can take in an episode. |
GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, int maxEpisodeSize, double lambda): Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy. |
GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda): Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm. |
Modifier and Type | Method and Description
---|---
protected void | GDSLInit(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda): Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.
int | getLastNumSteps(): Returns the number of steps taken in the last episode.
void | initializeForPlanning(int numEpisodesForPlanning): Sets the RewardFunction, TerminalFunction, and the number of simulated episodes to use for planning when the planFromState(State) method is called.
GreedyQPolicy | planFromState(State initialState): Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
double | qValue(State s, Action a): Returns the QValue for the given state-action pair.
java.util.List<QValue> | qValues(State s): Returns a List of QValue objects for every permissible action for the given input state.
void | resetSolver(): Resets all solver results so that the solver can be restarted fresh, as if it had never solved the MDP.
Episode | runLearningEpisode(Environment env)
Episode | runLearningEpisode(Environment env, int maxSteps)
void | setLearningPolicy(Policy p): Sets which policy this agent should use for learning.
void | setLearningRate(LearningRate lr): Sets the learning rate function to use.
void | setMaximumEpisodesForPlanning(int n): Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.
void | setMaxVFAWeightChangeForPlanningTerminaiton(double m): Sets a maximum change in the VFA weight threshold that will cause planFromState(State) to stop planning when it is reached.
void | setUseFeatureWiseLearningRate(boolean useFeatureWiseLearningRate): Sets whether learning rate polls should be based on the VFA state feature ids or on the OO-MDP state.
void | setUseReplaceTraces(boolean toggle): Sets whether to use replacing eligibility traces rather than accumulating traces.
void | toggleShouldDecomposeOption(boolean toggle): Sets whether the primitive actions taken during an option will be included as steps in produced EpisodeAnalysis objects.
double | value(State s): Returns the value function evaluation of the given state.
Methods inherited from class MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface MDPSolverInterface: addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected DifferentiableStateActionValue vfa
protected LearningRate learningRate
protected Policy learningPolicy
protected double lambda
protected int maxEpisodeSize
protected int eStepCounter
protected int numEpisodesForPlanning
protected double maxWeightChangeForPlanningTermination
protected double maxWeightChangeInLastEpisode
protected boolean useFeatureWiseLearningRate
protected double minEligibityForUpdate
protected boolean useReplacingTraces
protected boolean shouldDecomposeOptions
Whether options should be decomposed into actions in the returned Episode objects.
protected int totalNumberOfSteps
public GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, double lambda)

Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy and places no limit on the number of steps the agent can take in an episode. By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, int maxEpisodeSize, double lambda)

Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy. By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda)

Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
protected void GDSLInit(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda)

Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public void initializeForPlanning(int numEpisodesForPlanning)

Sets the RewardFunction, TerminalFunction, and the number of simulated episodes to use for planning when the planFromState(State) method is called. If the RewardFunction and TerminalFunction are not set, the planFromState(State) method will throw a runtime exception.

Parameters:
numEpisodesForPlanning - the number of simulated episodes to run for planning
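For the planning use case described in the class comment, a rough sketch (again assuming `domain`, `vfa`, and an initial State `initialState` are built elsewhere, with imports omitted) might look like:

```java
GradientDescentSarsaLam planner =
        new GradientDescentSarsaLam(domain, 0.99, vfa, 0.02, 0.5);

// Run up to 500 simulated learning episodes per call to planFromState.
planner.initializeForPlanning(500);

// Optionally stop earlier once VFA weights barely change within an episode.
planner.setMaxVFAWeightChangeForPlanningTerminaiton(1e-6);

// Returns a policy that acts greedily with respect to the learned Q-values.
GreedyQPolicy policy = planner.planFromState(initialState);
```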
public void setLearningRate(LearningRate lr)

Sets the learning rate function to use.

Parameters:
lr - the learning rate function to use
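For example, a decaying schedule could be plugged in roughly as follows; ExponentialDecayLR is assumed to be the exponential-decay implementation in burlap.behavior.learningrate (verify the class and constructor in your BURLAP version), and `agent` is the GradientDescentSarsaLam instance from the earlier sketch:

```java
// Assumed: starts the learning rate at 0.1 and multiplies it by 0.999 per poll.
agent.setLearningRate(new ExponentialDecayLR(0.1, 0.999));
```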
public void setUseFeatureWiseLearningRate(boolean useFeatureWiseLearningRate)

Sets whether learning rate polls should be based on the VFA state feature ids or on the OO-MDP state.

Parameters:
useFeatureWiseLearningRate - if true, then learning rate polls are based on VFA state feature ids; if false, then they are based on the OO-MDP state object
public void setLearningPolicy(Policy p)

Sets which policy this agent should use for learning.

Parameters:
p - the policy to use for learning
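For instance, to lower the exploration rate from the default of 0.1, an epsilon-greedy policy could be supplied; EpsilonGreedy is assumed to be BURLAP's epsilon-greedy Policy taking a QProvider and an epsilon value (verify the constructor in your version), and GradientDescentSarsaLam itself implements QProvider:

```java
// Assumed constructor: EpsilonGreedy(QProvider provider, double epsilon).
agent.setLearningPolicy(new EpsilonGreedy(agent, 0.05));
```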
public void setMaximumEpisodesForPlanning(int n)

Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.

Parameters:
n - the maximum number of episodes that will be performed when the planFromState(State) method is called
public void setMaxVFAWeightChangeForPlanningTerminaiton(double m)

Sets a maximum change in the VFA weight threshold that will cause planFromState(State) to stop planning when it is reached.

Parameters:
m - the maximum allowable change in the VFA weights before planning stops

public int getLastNumSteps()

Returns the number of steps taken in the last episode.
public void setUseReplaceTraces(boolean toggle)

Sets whether to use replacing eligibility traces rather than accumulating traces.

Parameters:
toggle - true to use replacing traces, false to use accumulating traces
public void toggleShouldDecomposeOption(boolean toggle)

Sets whether the primitive actions taken during an option will be included as steps in produced EpisodeAnalysis objects.

Parameters:
toggle - whether to decompose options into the primitive actions taken by them or not
public Episode runLearningEpisode(Environment env)

Specified by: runLearningEpisode in interface LearningAgent
public Episode runLearningEpisode(Environment env, int maxSteps)

Specified by: runLearningEpisode in interface LearningAgent
public java.util.List<QValue> qValues(State s)

Description copied from interface: QProvider
Returns a List of QValue objects for every permissible action for the given input state.
public double qValue(State s, Action a)

Description copied from interface: QFunction
Returns the QValue for the given state-action pair.
public double value(State s)

Description copied from interface: ValueFunction
Returns the value function evaluation of the given state.

Specified by: value in interface ValueFunction

Parameters:
s - the state to evaluate
public GreedyQPolicy planFromState(State initialState)

Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.

Specified by: planFromState in interface Planner

Parameters:
initialState - the initial state of the planning problem

Returns:
a GreedyQPolicy.
public void resetSolver()

Description copied from interface: MDPSolverInterface
This method resets all solver results so that a solver can be restarted fresh, as if it had never solved the MDP.

Specified by: resetSolver in interface MDPSolverInterface
Specified by: resetSolver in class MDPSolver