public class TimeIndexedTDLambda extends TDLambda
A time-indexed variant of the TDLambda critic for ActorCritic algorithms [1]: it behaves like TDLambda except that this class treats states at different depths as unique states. In general, the typical TDLambda method is recommended unless a special Actor object that exploits the time information is to be used as well.
1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
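For orientation, here is a minimal, hedged usage sketch. Only the TimeIndexedTDLambda constructor arguments come from this page; the surrounding classes (SimpleHashableStateFactory, BoltzmannActor, ActorCritic, Environment), their constructor signatures, and the import paths are assumptions about the broader BURLAP API and may differ between BURLAP versions.

```java
import burlap.behavior.singleagent.learning.actorcritic.ActorCritic;
import burlap.behavior.singleagent.learning.actorcritic.actor.BoltzmannActor;
import burlap.behavior.singleagent.learning.actorcritic.critics.TimeIndexedTDLambda;
import burlap.mdp.singleagent.SADomain;
import burlap.mdp.singleagent.environment.Environment;
import burlap.statehashing.HashableStateFactory;
import burlap.statehashing.simple.SimpleHashableStateFactory;

public class TimeIndexedCriticExample {

    /** Builds an actor-critic learner around the time-indexed critic and runs a few episodes. */
    public static void run(SADomain domain, Environment env) {
        HashableStateFactory hashingFactory = new SimpleHashableStateFactory();

        // Constructor arguments as documented on this page:
        // gamma, hashingFactory, learningRate, vinit (constant), lambda
        TimeIndexedTDLambda critic =
                new TimeIndexedTDLambda(0.99, hashingFactory, 0.5, 0.0, 0.8);

        // Boltzmann (softmax) actor; this constructor signature is an assumption.
        BoltzmannActor actor = new BoltzmannActor(domain, hashingFactory, 0.1);

        // Couple actor and critic; the ActorCritic constructor signature is also an assumption.
        ActorCritic agent = new ActorCritic(domain, 0.99, actor, critic);

        for (int i = 0; i < 100; i++) {
            agent.runLearningEpisode(env);
            env.resetEnvironment();
        }
    }
}
```

In this setup the time-indexed critic is a drop-in replacement for TDLambda; the distinction only matters when the Actor also exploits episode depth, as noted above.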
| Modifier and Type | Class and Description |
|---|---|
| static class | TimeIndexedTDLambda.StateTimeElibilityTrace: Extends the standard TDLambda.StateEligibilityTrace to include time/depth information. |
Nested classes/interfaces inherited from class TDLambda: TDLambda.StateEligibilityTrace

| Modifier and Type | Field and Description |
|---|---|
| protected int | curTime: The current time index / depth of the current episode. |
| protected int | maxEpisodeSize: The maximum number of steps possible in an episode. |
| protected java.util.List<java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue>> | vTIndex: The time/depth indexed value function. |
Fields inherited from class TDLambda: lambda, learningRate, totalNumberOfSteps, traces, vIndex, vInitFunction

Fields inherited from class MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel

| Constructor and Description |
|---|
| TimeIndexedTDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda): Initializes the algorithm. |
| TimeIndexedTDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunction vinit, double lambda, int maxEpisodeSize): Initializes the algorithm. |
| TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda, int maxEpisodeSize): Initializes the algorithm. |
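The second constructor above accepts a ValueFunction for vinit rather than a constant, which lets unvisited states be initialized from a heuristic (for example, optimistically). A minimal sketch, assuming BURLAP's ValueFunction is a single-method interface with double value(State s) and that the listed import paths match your BURLAP version:

```java
import burlap.behavior.singleagent.learning.actorcritic.critics.TimeIndexedTDLambda;
import burlap.behavior.valuefunction.ValueFunction;
import burlap.mdp.core.state.State;
import burlap.statehashing.simple.SimpleHashableStateFactory;

public class OptimisticInitExample {

    /** Builds a time-indexed critic whose unvisited states start at an optimistic value. */
    public static TimeIndexedTDLambda optimisticCritic(int maxEpisodeSize) {
        // Assumes ValueFunction exposes a single double value(State s) method.
        ValueFunction optimisticInit = new ValueFunction() {
            @Override
            public double value(State s) {
                return 10.0; // every previously unvisited state starts at value 10
            }
        };

        // gamma, hashingFactory, learningRate, vinit (ValueFunction), lambda, maxEpisodeSize
        return new TimeIndexedTDLambda(0.99, new SimpleHashableStateFactory(),
                0.5, optimisticInit, 0.8, maxEpisodeSize);
    }
}
```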
| Modifier and Type | Method and Description |
|---|---|
| CritiqueResult | critiqueAndUpdate(EnvironmentOutcome eo): This method's implementation provides the critique for some specific instance of the behavior. |
| void | endEpisode(): This method is called whenever a learning episode terminates. |
| int | getCurTime(): Returns the current time/depth of the current episode. |
| protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue | getV(HashableState sh, int t): Returns the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth. |
| void | initializeEpisode(State s): This method is called whenever a new learning episode begins. |
| void | resetData(): Used to reset any data that was created/modified during learning so that learning can begin anew. |
| void | setCurTime(int t): Sets the time/depth of the current episode. |
Methods inherited from class TDLambda: getV, resetSolver, setLearningRate, value

Methods inherited from class MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface MDPSolverInterface: addActionType

protected java.util.List<java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue>> vTIndex
The time/depth indexed value function.

protected int curTime
The current time index / depth of the current episode.

protected int maxEpisodeSize
The maximum number of steps possible in an episode.
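The type of vTIndex is what makes this critic time-indexed: position t in the outer list holds the value table for states encountered at depth t of the episode. The following is only a toy illustration of that lookup pattern, using Double in place of the package-private TDLambda.VValue; it is not the class's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a time/depth-indexed value table; not TimeIndexedTDLambda's actual code. */
public class TimeIndexedTable<S> {

    // Position t in the outer list is the value table for depth t, mirroring the
    // List<Map<HashableState, TDLambda.VValue>> shape of vTIndex (with Double as the value type).
    private final List<Map<S, Double>> tables = new ArrayList<>();

    /** Returns the value of state s at depth t, lazily creating entries initialized to vinit. */
    public double value(S s, int t, double vinit) {
        while (tables.size() <= t) {
            tables.add(new HashMap<>()); // grow the list until it covers depth t
        }
        return tables.get(t).computeIfAbsent(s, k -> vinit);
    }
}
```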
public TimeIndexedTDLambda(double gamma,
HashableStateFactory hashingFactory,
double learningRate,
double vinit,
double lambda)
Parameters:
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a constant value function initialization value to use
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single step backups

public TimeIndexedTDLambda(RewardFunction rf,
TerminalFunction tf,
double gamma,
HashableStateFactory hashingFactory,
double learningRate,
double vinit,
double lambda,
int maxEpisodeSize)
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a constant value function initialization value to use
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single step backups
maxEpisodeSize - the maximum number of steps possible in an episode

public TimeIndexedTDLambda(double gamma,
HashableStateFactory hashingFactory,
double learningRate,
ValueFunction vinit,
double lambda,
int maxEpisodeSize)
Parameters:
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a method of initializing the value function for previously unvisited states
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single step backups
maxEpisodeSize - the maximum number of steps possible in an episode

public int getCurTime()
Returns the current time/depth of the current episode.
public void setCurTime(int t)
Parameters:
t - the time/depth of the current episode

public void initializeEpisode(State s)
Description copied from interface: Critic
This method is called whenever a new learning episode begins.
Specified by: initializeEpisode in interface Critic
Overrides: initializeEpisode in class TDLambda
Parameters:
s - the initial state of the new learning episode

public void endEpisode()
Description copied from interface: Critic
This method is called whenever a learning episode terminates.
Specified by: endEpisode in interface Critic
Overrides: endEpisode in class TDLambda

public CritiqueResult critiqueAndUpdate(EnvironmentOutcome eo)
Description copied from interface: Critic
This method's implementation provides the critique for some specific instance of the behavior.
Specified by: critiqueAndUpdate in interface Critic
Overrides: critiqueAndUpdate in class TDLambda
Parameters:
eo - the EnvironmentOutcome specifying the event

protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh,
int t)
Returns the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth.
Parameters:
sh - the hashed state for which the value should be returned
t - the time/depth at which the state is visited
Returns:
the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth
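An ActorCritic learner normally drives the lifecycle methods above, but their expected ordering is: initializeEpisode at the start of an episode, critiqueAndUpdate once per observed transition, and endEpisode when the episode terminates. Below is a hedged sketch of driving the critic manually over a pre-collected episode; the import paths for State, EnvironmentOutcome, and CritiqueResult are assumptions that may vary across BURLAP versions.

```java
import java.util.List;

import burlap.behavior.singleagent.learning.actorcritic.CritiqueResult;
import burlap.behavior.singleagent.learning.actorcritic.critics.TimeIndexedTDLambda;
import burlap.mdp.core.state.State;
import burlap.mdp.singleagent.environment.EnvironmentOutcome;

public class CriticLifecycleSketch {

    /** Feeds one already-collected episode of transitions through the time-indexed critic. */
    public static void critiqueEpisode(TimeIndexedTDLambda critic,
                                       State initialState,
                                       List<EnvironmentOutcome> transitions) {
        // Start a new episode; the critic tracks the time/depth index internally.
        critic.initializeEpisode(initialState);

        for (EnvironmentOutcome eo : transitions) {
            // Each call critiques one transition; the returned CritiqueResult is what
            // an Actor would consume to update its policy.
            CritiqueResult critique = critic.critiqueAndUpdate(eo);
        }

        // Signal that the episode has terminated.
        critic.endEpisode();
    }
}
```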