public class TDLambda extends MDPSolver implements Critic, ValueFunction
An implementation of TD(lambda) that can be used as a critic for ActorCritic algorithms [1].
1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
| Modifier and Type | Class and Description |
|---|---|
| static class | TDLambda.StateEligibilityTrace: A data structure for storing the elements of an eligibility trace. |
| Modifier and Type | Field and Description |
|---|---|
| protected double | lambda: Indicates the strength of eligibility traces. |
| protected LearningRate | learningRate |
| protected int | totalNumberOfSteps: The total number of learning steps performed by this agent. |
| protected java.util.LinkedList<TDLambda.StateEligibilityTrace> | traces: The eligibility traces for the current episode. |
| protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> | vIndex: The state value function. |
| protected ValueFunction | vInitFunction: Defines how the value function is initialized for unvisited states. |
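
The lambda field controls how far credit reaches back along the current episode. As a rough illustration only, the sketch below assumes the common TD(lambda) convention that a trace is set to 1 when a state is visited and is then multiplied by gamma * lambda on every later step; this page does not spell out the exact decay rule. The class TraceDecaySketch and its eligibility method are hypothetical names, not part of BURLAP.

```java
/** Hypothetical helper, not part of BURLAP: shows how lambda scales past credit. */
final class TraceDecaySketch {

    /**
     * Eligibility of a state visited k steps ago, ASSUMING the trace is set
     * to 1 on the visit and multiplied by gamma * lambda on every later step.
     */
    static double eligibility(double gamma, double lambda, int k) {
        return Math.pow(gamma * lambda, k);
    }

    public static void main(String[] args) {
        double gamma = 0.9;
        // lambda = 0 gives single-step backups; lambda = 1 gives Monte Carlo-like traces.
        for (double lambda : new double[] {0.0, 0.5, 1.0}) {
            StringBuilder row = new StringBuilder("lambda=" + lambda + ":");
            for (int k = 0; k <= 3; k++) {
                row.append(String.format(" %.3f", eligibility(gamma, lambda, k)));
            }
            System.out.println(row);
        }
    }
}
```

With lambda set to 0, only the state just visited keeps a non-zero trace, which matches the single-step backup described for the lambda parameter; with lambda set to 1, the trace decays by the discount factor alone.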

Fields inherited from class MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel

| Constructor and Description |
|---|
| TDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda): Initializes the algorithm with a constant value function initialization value. |
| TDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunction vinit, double lambda): Initializes the algorithm with a ValueFunction that initializes the value of previously unvisited states. |
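
The two constructors differ only in how vinit is supplied: as a constant, or as a function evaluated for previously unvisited states. The sketch below shows one way such a pair of overloads can be related, with the constant form delegating to the function form. It uses hypothetical stand-in types (ValueInitSketch, String state keys, java.util.function.ToDoubleFunction) rather than BURLAP's classes, and the lazy store-on-first-lookup behavior is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical stand-in, not BURLAP code: constant vs. function-based initialization. */
final class ValueInitSketch {
    private final Map<String, Double> table = new HashMap<>();
    private final ToDoubleFunction<String> init;

    /** Constant initialization: every unvisited state starts at the same value. */
    ValueInitSketch(double constant) {
        this(state -> constant);
    }

    /** Function-based initialization: unvisited states start at init(state). */
    ValueInitSketch(ToDoubleFunction<String> init) {
        this.init = init;
    }

    /** Returns the stored value, creating it from the initializer on first lookup. */
    double value(String state) {
        return table.computeIfAbsent(state, init::applyAsDouble);
    }

    public static void main(String[] args) {
        ValueInitSketch constant = new ValueInitSketch(2.5);
        ValueInitSketch byLength = new ValueInitSketch(s -> s.length());
        System.out.println(constant.value("a"));
        System.out.println(byLength.value("abc"));
    }
}
```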
| Modifier and Type | Method and Description |
|---|---|
| CritiqueResult | critiqueAndUpdate(EnvironmentOutcome eo): This method's implementation provides the critique for some specific instance of the behavior. |
| void | endEpisode(): This method is called whenever a learning episode terminates. |
| protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue | getV(HashableState sh): Returns the TDLambda.VValue object (storing the value) for a given hashed state. |
| void | initializeEpisode(State s): This method is called whenever a new learning episode begins. |
| void | resetData(): Used to reset any data that was created/modified during learning so that learning can begin anew. |
| void | resetSolver(): This method resets all solver results so that a solver can be restarted fresh, as if it had never solved the MDP. |
| void | setLearningRate(LearningRate lr): Sets the learning rate function to use. |
| double | value(State s): Returns the value function evaluation of the given state. |

Methods inherited from class MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from an implemented interface: addActionType

protected LearningRate learningRate
protected ValueFunction vInitFunction
protected double lambda
protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> vIndex
protected java.util.LinkedList<TDLambda.StateEligibilityTrace> traces
protected int totalNumberOfSteps

public TDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda)
Initializes the algorithm.
Parameters:
gamma - the discount factor.
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a constant value function initialization value to use.
lambda - indicates the strength of eligibility traces. Use 1 for Monte Carlo-like traces and 0 for single-step backups.

public TDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunction vinit, double lambda)
Initializes the algorithm.
Parameters:
gamma - the discount factor.
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a method of initializing the value function for previously unvisited states.
lambda - indicates the strength of eligibility traces. Use 1 for Monte Carlo-like traces and 0 for single-step backups.

public void initializeEpisode(State s)
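Description copied from interface: Critic
This method is called whenever a new learning episode begins.
Specified by: initializeEpisode in interface Critic
Parameters:
s - the initial state of the new learning episode.

public void endEpisode()
Description copied from interface: Critic
This method is called whenever a learning episode terminates.
Specified by: endEpisode in interface Critic

public void setLearningRate(LearningRate lr)
Sets the learning rate function to use.
Parameters:
lr - the learning rate function to use.

public CritiqueResult critiqueAndUpdate(EnvironmentOutcome eo)
Description copied from interface: Critic
This method's implementation provides the critique for some specific instance of the behavior.
Specified by: critiqueAndUpdate in interface Critic
Parameters:
eo - the EnvironmentOutcome specifying the event.

public double value(State s)
Description copied from interface: ValueFunction
Returns the value function evaluation of the given state.
Specified by: value in interface ValueFunction
Parameters:
s - the state to evaluate.

public void resetSolver()
Description copied from interface: MDPSolverInterface
This method resets all solver results so that a solver can be restarted fresh, as if it had never solved the MDP.
Specified by: resetSolver in interface MDPSolverInterface
Overrides: resetSolver in class MDPSolver

public void resetData()
Description copied from interface: Critic
Used to reset any data that was created/modified during learning so that learning can begin anew.

protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh)
Returns the TDLambda.VValue object (storing the value) for a given hashed state.
Parameters:
sh - the hashed state for which the value should be returned.
Returns:
the TDLambda.VValue object (storing the value) for the given hashed state.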
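
To make the roles of initializeEpisode, critiqueAndUpdate, and value more concrete, here is a minimal stand-alone sketch of a tabular TD(lambda) critic. It is not BURLAP's implementation: the class TdLambdaSketch, the integer state ids, the constant learning rate alpha, the explicit terminal flag, and the choice of replacing traces are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tabular TD(lambda) critic; a sketch, not BURLAP's TDLambda. */
final class TdLambdaSketch {
    private final double gamma;   // discount factor
    private final double alpha;   // constant learning rate (assumption)
    private final double vInit;   // initial value for unvisited states
    private final double lambda;  // strength of eligibility traces
    private final Map<Integer, Double> v = new HashMap<>();
    private final Map<Integer, Double> traces = new HashMap<>();

    TdLambdaSketch(double gamma, double alpha, double vInit, double lambda) {
        this.gamma = gamma;
        this.alpha = alpha;
        this.vInit = vInit;
        this.lambda = lambda;
    }

    /** Current value estimate; unvisited states fall back to vInit. */
    double value(int s) {
        return v.getOrDefault(s, vInit);
    }

    /** Traces belong to a single episode, so they are cleared at its start. */
    void initializeEpisode() {
        traces.clear();
    }

    /** Applies one update for the transition s -> sPrime and returns the TD error. */
    double critiqueAndUpdate(int s, double reward, int sPrime, boolean terminal) {
        double nextValue = terminal ? 0.0 : value(sPrime);
        double delta = reward + gamma * nextValue - value(s);

        traces.put(s, 1.0); // replacing trace for the state just left (assumption)

        for (Map.Entry<Integer, Double> e : traces.entrySet()) {
            int state = e.getKey();
            double eligibility = e.getValue();
            // Move each traced state toward the target in proportion to its eligibility.
            v.put(state, value(state) + alpha * delta * eligibility);
            // Decay the trace so older states receive less credit next step.
            e.setValue(eligibility * gamma * lambda);
        }
        return delta;
    }

    public static void main(String[] args) {
        TdLambdaSketch critic = new TdLambdaSketch(0.9, 0.5, 0.0, 0.5);
        critic.initializeEpisode();
        critic.critiqueAndUpdate(0, 0.0, 1, false); // no reward yet
        critic.critiqueAndUpdate(1, 1.0, 2, true);  // reward on the terminal step
        System.out.println("V(0)=" + critic.value(0) + " V(1)=" + critic.value(1));
    }
}
```

In this sketch the reward received on the second step also raises the value of state 0, because state 0 still carries a decayed trace; with lambda set to 0 that trace would be zero and only state 1 would change.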