public class TDLambda extends MDPSolver implements Critic, ValueFunction
A TD(lambda) critic for use with ActorCritic algorithms [1].
1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
Modifier and Type | Class and Description |
---|---|
static class |
TDLambda.StateEligibilityTrace
A data structure for storing the elements of an eligibility trace.
|
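As a rough illustration of what an eligibility-trace element might hold, the sketch below pairs a state key with its current eligibility. The class name `TraceEntrySketch`, its fields, and the `decay` method are hypothetical; they are not the members of `TDLambda.StateEligibilityTrace`, which this page does not list.

```java
/** Hypothetical trace element: a state key plus its current eligibility. */
final class TraceEntrySketch {
    final String stateKey;   // stand-in for a hashed state (assumption)
    double eligibility;      // how much credit this state currently receives

    TraceEntrySketch(String stateKey, double eligibility) {
        this.stateKey = stateKey;
        this.eligibility = eligibility;
    }

    /** Decays the eligibility by gamma * lambda, as in textbook TD(lambda). */
    void decay(double gamma, double lambda) {
        eligibility *= gamma * lambda;
    }
}
```

Storing the eligibility next to the state key lets a single pass over the trace list both apply the update and decay each entry.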
Modifier and Type | Field and Description |
---|---|
protected double |
lambda
Indicates the strength of eligibility traces.
|
protected LearningRate |
learningRate |
protected int |
totalNumberOfSteps
The total number of learning steps performed by this agent.
|
protected java.util.LinkedList<TDLambda.StateEligibilityTrace> |
traces
The eligibility traces for the current episode.
|
protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> |
vIndex
The state value function.
|
protected ValueFunction |
vInitFunction
Defines how the value function is initialized for unvisited states.
|
Fields inherited from class MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
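Because `learningRate` is a function rather than a plain constant, and `totalNumberOfSteps` counts learning steps, one plausible pairing is a step-dependent schedule. The sketch below is an assumption for illustration only: `LearningRateSketch`, `inverseTimeDecay`, and `constant` are hypothetical names, and the real `LearningRate` interface is not reproduced here.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical step-indexed learning-rate schedules (not the LearningRate interface). */
final class LearningRateSketch {

    /** alpha(t) = alpha0 / (1 + decay * t), where t is the number of steps taken so far. */
    static IntToDoubleFunction inverseTimeDecay(double alpha0, double decay) {
        return step -> alpha0 / (1.0 + decay * step);
    }

    /** A schedule that ignores the step count and always returns the same rate. */
    static IntToDoubleFunction constant(double alpha) {
        return step -> alpha;
    }
}
```

A constant schedule corresponds to passing a plain `double learningRate`, while a decaying schedule shrinks updates as the step count grows.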
Constructor and Description |
---|
TDLambda(double gamma,
HashableStateFactory hashingFactory,
double learningRate,
double vinit,
double lambda)
Initializes the algorithm.
|
TDLambda(double gamma,
HashableStateFactory hashingFactory,
double learningRate,
ValueFunction vinit,
double lambda)
Initializes the algorithm.
|
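The `lambda` argument accepted by both constructors controls how far back credit is propagated. A minimal sketch of the textbook relationship follows, assuming a trace that is set to 1 when a state is visited and multiplied by gamma * lambda on every later step, with no revisits. `LambdaEffectSketch` and `eligibilityAfter` are hypothetical names, not part of this class.

```java
/** Hypothetical helper showing how lambda shapes eligibility over time. */
final class LambdaEffectSketch {

    /**
     * Eligibility of a state visited stepsAgo steps in the past, assuming the
     * trace started at 1 and decayed by gamma * lambda on each step since.
     */
    static double eligibilityAfter(int stepsAgo, double gamma, double lambda) {
        return Math.pow(gamma * lambda, stepsAgo);
    }
}
```

With `lambda = 0`, only the state just visited has nonzero eligibility, which matches a single-step backup. With `lambda = 1`, earlier states are discounted by gamma alone, which is the Monte Carlo-like case.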
Modifier and Type | Method and Description |
---|---|
CritiqueResult |
critiqueAndUpdate(EnvironmentOutcome eo)
This method's implementation provides the critique for some specific instance of the behavior.
|
void |
endEpisode()
This method is called whenever a learning episode terminates
|
protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue |
getV(HashableState sh)
Returns the TDLambda.VValue object (storing the value) for a given hashed state.
|
void |
initializeEpisode(State s)
This method is called whenever a new learning episode begins
|
void |
resetData()
Used to reset any data that was created/modified during learning so that learning can begin anew.
|
void |
resetSolver()
This method resets all solver results so that a solver can be restarted fresh, as if it had never solved the MDP.
|
void |
setLearningRate(LearningRate lr)
Sets the learning rate function to use.
|
double |
value(State s)
Returns the value function evaluation of the given state.
|
Methods inherited from class MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addActionType
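The episode lifecycle in the method summary (`initializeEpisode`, `critiqueAndUpdate`, `endEpisode`, `value`) can be pictured with a generic tabular TD(lambda) critic. The sketch below is a textbook-style illustration under stated assumptions, not the implementation behind this class: it uses string state keys, a constant learning rate, replacing traces, and treats terminal next states as having value 0. `TdLambdaCriticSketch` and all of its members are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tabular TD(lambda) critic with replacing traces. */
final class TdLambdaCriticSketch {
    private final double gamma, alpha, vInit, lambda;
    private final Map<String, Double> values = new HashMap<>();
    private final Map<String, Double> traces = new HashMap<>();

    TdLambdaCriticSketch(double gamma, double alpha, double vInit, double lambda) {
        this.gamma = gamma;
        this.alpha = alpha;
        this.vInit = vInit;
        this.lambda = lambda;
    }

    /** Traces belong to a single episode, so they are cleared at both ends. */
    void initializeEpisode() { traces.clear(); }
    void endEpisode() { traces.clear(); }

    /** Unvisited states fall back to the constant initial value. */
    double value(String s) { return values.getOrDefault(s, vInit); }

    /** Applies one TD(lambda) update and returns the TD error as the critique. */
    double critiqueAndUpdate(String s, double reward, String sPrime, boolean terminal) {
        double nextV = terminal ? 0.0 : value(sPrime);
        double delta = reward + gamma * nextV - value(s);
        traces.put(s, 1.0); // replacing trace for the state just left
        for (Map.Entry<String, Double> e : traces.entrySet()) {
            double el = e.getValue();
            values.put(e.getKey(), value(e.getKey()) + alpha * delta * el);
            e.setValue(el * gamma * lambda); // decay for the next step
        }
        return delta;
    }
}
```

In an actor-critic arrangement, the returned TD error is the signal an actor would typically use to adjust its action preferences.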
protected LearningRate learningRate
protected ValueFunction vInitFunction
protected double lambda
protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> vIndex
protected java.util.LinkedList<TDLambda.StateEligibilityTrace> traces
protected int totalNumberOfSteps
public TDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda)
Parameters:
gamma - the discount factor.
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a constant value function initialization value to use.
lambda - indicates the strength of eligibility traces. Use 1 for Monte Carlo-like traces and 0 for single-step backups.
public TDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunction vinit, double lambda)
Parameters:
gamma - the discount factor.
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a method of initializing the value function for previously unvisited states.
lambda - indicates the strength of eligibility traces. Use 1 for Monte Carlo-like traces and 0 for single-step backups.
public void initializeEpisode(State s)
Specified by: initializeEpisode in interface Critic
Parameters:
s - the initial state of the new learning episode.
public void endEpisode()
Specified by: endEpisode in interface Critic
public void setLearningRate(LearningRate lr)
Parameters:
lr - the learning rate function to use.
public CritiqueResult critiqueAndUpdate(EnvironmentOutcome eo)
Specified by: critiqueAndUpdate in interface Critic
Parameters:
eo - the EnvironmentOutcome specifying the event.
public double value(State s)
Specified by: value in interface ValueFunction
Parameters:
s - the state to evaluate.
public void resetSolver()
Specified by: resetSolver in interface MDPSolverInterface
Overrides: resetSolver in class MDPSolver
public void resetData()
Critic
protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh)
Returns the TDLambda.VValue object (storing the value) for a given hashed state.
Parameters:
sh - the hashed state for which the value should be returned.
Returns:
the TDLambda.VValue object (storing the value) for the given hashed state.
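A lookup like `getV`, combined with an initializer such as `vInitFunction`, suggests a value table that creates entries on demand. The sketch below shows that general pattern only. Whether this class stores the entry on first lookup is an assumption, and `LazyValueTableSketch`, its string keys, and its one-element `double[]` holder are hypothetical stand-ins for `HashableState` and `VValue`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical lazily initialized value table keyed by a state identifier. */
final class LazyValueTableSketch {
    private final Map<String, double[]> index = new HashMap<>();
    private final ToDoubleFunction<String> init; // plays the role of an initializer function

    LazyValueTableSketch(ToDoubleFunction<String> init) {
        this.init = init;
    }

    /** Returns the mutable holder for the state, creating it on first access. */
    double[] get(String stateKey) {
        return index.computeIfAbsent(stateKey, k -> new double[] { init.applyAsDouble(k) });
    }

    /** Number of states that have been looked up at least once. */
    int size() {
        return index.size();
    }
}
```

Returning a mutable holder rather than a primitive lets callers update the stored value in place without a second map lookup.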