public class GradientDescentSarsaLam extends MDPSolver implements QProvider, LearningAgent, Planner
A gradient descent SARSA(\lambda) implementation [1] that estimates Q-values with the DifferentiableStateActionValue implementation provided, and that supports temporally extended options [2]. The implementation can either be used for learning or planning, the latter of which is performed by running many learning episodes in succession in a SimulatedEnvironment.
If you are going to use this algorithm for planning, call the initializeForPlanning(int) method before calling planFromState(State). The number of episodes used for planning can be determined by a threshold maximum number of episodes, or by a maximum change in the VFA weight threshold.
By default, this agent will use an epsilon-greedy policy with epsilon=0.1. You can change the learning policy to anything with the setLearningPolicy(burlap.behavior.policy.Policy) method.
If you want to use a custom learning rate decay schedule rather than a constant learning rate, use the setLearningRate(burlap.behavior.learningrate.LearningRate) method.
1. Rummery, Gavin A., and Mahesan Niranjan. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering, 1994.
2. Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial Intelligence 112.1 (1999): 181-211.
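A minimal learning sketch is shown below. It assumes BURLAP 3-style types, that a suitable SADomain (`domain`), initial State (`s0`), and DifferentiableStateActionValue function approximator (`vfa`) have already been constructed for your problem, and it omits import statements; treat it as illustrative rather than canonical usage.

```java
// Hedged sketch: `domain`, `s0`, and `vfa` are placeholders constructed elsewhere.
GradientDescentSarsaLam agent =
        new GradientDescentSarsaLam(domain, 0.99, vfa, 0.02, 0.5);

// Learning interacts with an Environment; a SimulatedEnvironment over the domain
// and an initial state is the usual choice (constructor assumed from BURLAP 3).
SimulatedEnvironment env = new SimulatedEnvironment(domain, s0);

for (int i = 0; i < 100; i++) {
    Episode e = agent.runLearningEpisode(env); // one SARSA(lambda) learning episode
    env.resetEnvironment();                    // reset to s0 before the next episode
}
```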
Modifier and Type | Class and Description
---|---
static class | GradientDescentSarsaLam.EligibilityTraceVector: An object for keeping track of the eligibility traces within an episode for each VFA weight

Nested classes/interfaces inherited from interface QProvider: QProvider.Helper
Modifier and Type | Field and Description
---|---
protected int | eStepCounter: A counter for the number of steps in an episode that have been taken thus far
protected double | lambda: The strength of eligibility traces (0 for one step, 1 for full propagation)
protected Policy | learningPolicy: The learning policy to use
protected LearningRate | learningRate: The learning rate function to use
protected int | maxEpisodeSize: The maximum number of steps that will be taken in an episode before the agent terminates a learning episode
protected double | maxWeightChangeForPlanningTermination: The maximum allowable change in the VFA weights during an episode before the planning method terminates
protected double | maxWeightChangeInLastEpisode: The maximum VFA weight change that occurred in the last learning episode
protected double | minEligibityForUpdate: The minimum eligibility value of a trace that will cause it to be updated
protected int | numEpisodesForPlanning: The maximum number of episodes to use for planning
protected boolean | shouldDecomposeOptions: Whether options should be decomposed into actions in the returned Episode objects
protected int | totalNumberOfSteps: The total number of learning steps performed by this agent
protected boolean | useFeatureWiseLearningRate: Whether learning rate polls should be based on the VFA state features or the OO-MDP state
protected boolean | useReplacingTraces: Whether to use accumulating or replacing eligibility traces
protected DifferentiableStateActionValue | vfa: The object that performs value function approximation
Fields inherited from class MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description |
---|
GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, double lambda): Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy and places no limit on the number of steps the agent can take in an episode. |
GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, int maxEpisodeSize, double lambda): Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy. |
GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda): Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm. |
Modifier and Type | Method and Description
---|---
protected void | GDSLInit(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda): Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.
int | getLastNumSteps(): Returns the number of steps taken in the last episode.
void | initializeForPlanning(int numEpisodesForPlanning): Sets the RewardFunction, TerminalFunction, and the number of simulated episodes to use for planning when the planFromState(State) method is called.
GreedyQPolicy | planFromState(State initialState): Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
double | qValue(State s, Action a): Returns the QValue for the given state-action pair.
java.util.List<QValue> | qValues(State s): Returns a List of QValue objects for every permissible action for the given input state.
void | resetSolver(): Resets all solver results so that the solver can be restarted fresh, as if it had never solved the MDP.
Episode | runLearningEpisode(Environment env)
Episode | runLearningEpisode(Environment env, int maxSteps)
void | setLearningPolicy(Policy p): Sets which policy this agent should use for learning.
void | setLearningRate(LearningRate lr): Sets the learning rate function to use.
void | setMaximumEpisodesForPlanning(int n): Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.
void | setMaxVFAWeightChangeForPlanningTerminaiton(double m): Sets a maximum change in the VFA weight threshold that will cause planFromState(State) to stop planning when it is reached.
void | setUseFeatureWiseLearningRate(boolean useFeatureWiseLearningRate): Sets whether learning rate polls should be based on the VFA state feature ids or on the OO-MDP state.
void | setUseReplaceTraces(boolean toggle): Sets whether to use replacing eligibility traces rather than accumulating traces.
void | toggleShouldDecomposeOption(boolean toggle): Sets whether the primitive actions taken during an option will be included as steps in produced EpisodeAnalysis objects.
double | value(State s): Returns the value function evaluation of the given state.
Methods inherited from class MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface MDPSolverInterface: addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected DifferentiableStateActionValue vfa
protected LearningRate learningRate
protected Policy learningPolicy
protected double lambda
protected int maxEpisodeSize
protected int eStepCounter
protected int numEpisodesForPlanning
protected double maxWeightChangeForPlanningTermination
protected double maxWeightChangeInLastEpisode
protected boolean useFeatureWiseLearningRate
protected double minEligibityForUpdate
protected boolean useReplacingTraces
protected boolean shouldDecomposeOptions
Whether options should be decomposed into actions in the returned Episode objects.
protected int totalNumberOfSteps
public GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, double lambda)

Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy and places no limit on the number of steps the agent can take in an episode. By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, int maxEpisodeSize, double lambda)

Initializes SARSA(\lambda) with a 0.1 epsilon greedy policy. By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public GradientDescentSarsaLam(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda)

Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
protected void GDSLInit(SADomain domain, double gamma, DifferentiableStateActionValue vfa, double learningRate, Policy learningPolicy, int maxEpisodeSize, double lambda)

Initializes SARSA(\lambda). By default the agent will only save the last learning episode, and a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
vfa - the value function approximation method to use to estimate Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in an episode before terminating
lambda - specifies the strength of eligibility traces (0 for one step, 1 for full propagation)
public void initializeForPlanning(int numEpisodesForPlanning)

Sets the RewardFunction, TerminalFunction, and the number of simulated episodes to use for planning when the planFromState(State) method is called. If the RewardFunction and TerminalFunction are not set, the planFromState(State) method will throw a runtime exception.

Parameters:
numEpisodesForPlanning - the number of simulated episodes to run for planning
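For the planning use case described in the class comment, a rough sketch (again assuming `domain`, `vfa`, and an initial State `initialState` are built elsewhere, with imports omitted) might look like:

```java
GradientDescentSarsaLam planner =
        new GradientDescentSarsaLam(domain, 0.99, vfa, 0.02, 0.5);

// Run up to 500 simulated learning episodes per call to planFromState.
planner.initializeForPlanning(500);

// Optionally stop earlier once VFA weights barely change within an episode.
planner.setMaxVFAWeightChangeForPlanningTerminaiton(1e-6);

// Returns a policy that acts greedily with respect to the learned Q-values.
GreedyQPolicy policy = planner.planFromState(initialState);
```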
public void setLearningRate(LearningRate lr)

Sets the learning rate function to use.

Parameters:
lr - the learning rate function to use
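For example, a decaying schedule could be plugged in roughly as follows; ExponentialDecayLR is assumed to be the exponential-decay implementation in burlap.behavior.learningrate (verify the class and constructor in your BURLAP version), and `agent` is the GradientDescentSarsaLam instance from the earlier sketch:

```java
// Assumed: starts the learning rate at 0.1 and multiplies it by 0.999 per poll.
agent.setLearningRate(new ExponentialDecayLR(0.1, 0.999));
```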
public void setUseFeatureWiseLearningRate(boolean useFeatureWiseLearningRate)

Sets whether learning rate polls should be based on the VFA state feature ids or on the OO-MDP state.

Parameters:
useFeatureWiseLearningRate - if true, then learning rate polls are based on VFA state feature ids; if false, then they are based on the OO-MDP state object
public void setLearningPolicy(Policy p)

Sets which policy this agent should use for learning.

Parameters:
p - the policy to use for learning
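For instance, to lower the exploration rate from the default of 0.1, an epsilon-greedy policy could be supplied; EpsilonGreedy is assumed to be BURLAP's epsilon-greedy Policy taking a QProvider and an epsilon value (verify the constructor in your version), and GradientDescentSarsaLam itself implements QProvider:

```java
// Assumed constructor: EpsilonGreedy(QProvider provider, double epsilon).
agent.setLearningPolicy(new EpsilonGreedy(agent, 0.05));
```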
public void setMaximumEpisodesForPlanning(int n)

Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.

Parameters:
n - the maximum number of episodes that will be performed when the planFromState(State) method is called
public void setMaxVFAWeightChangeForPlanningTerminaiton(double m)

Sets a maximum change in the VFA weight threshold that will cause planFromState(State) to stop planning when it is reached.

Parameters:
m - the maximum allowable change in the VFA weights before planning stops

public int getLastNumSteps()

Returns the number of steps taken in the last episode.
public void setUseReplaceTraces(boolean toggle)

Sets whether to use replacing eligibility traces rather than accumulating traces.

Parameters:
toggle - true to use replacing traces, false to use accumulating traces
public void toggleShouldDecomposeOption(boolean toggle)

Sets whether the primitive actions taken during an option will be included as steps in produced EpisodeAnalysis objects.

Parameters:
toggle - whether to decompose options into the primitive actions taken by them or not
public Episode runLearningEpisode(Environment env)

Specified by: runLearningEpisode in interface LearningAgent
public Episode runLearningEpisode(Environment env, int maxSteps)

Specified by: runLearningEpisode in interface LearningAgent
public java.util.List<QValue> qValues(State s)

Description copied from interface: QProvider
Returns a List of QValue objects for every permissible action for the given input state.
public double qValue(State s, Action a)

Description copied from interface: QFunction
Returns the QValue for the given state-action pair.
public double value(State s)

Description copied from interface: ValueFunction
Returns the value function evaluation of the given state.

Specified by: value in interface ValueFunction

Parameters:
s - the state to evaluate
public GreedyQPolicy planFromState(State initialState)

Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.

Specified by: planFromState in interface Planner

Parameters:
initialState - the initial state of the planning problem

Returns:
a GreedyQPolicy.
public void resetSolver()

Description copied from interface: MDPSolverInterface
This method resets all solver results so that a solver can be restarted fresh, as if it had never solved the MDP.

Specified by: resetSolver in interface MDPSolverInterface
Specified by: resetSolver in class MDPSolver