public class TimeIndexedTDLambda extends TDLambda
A time-indexed variant of TDLambda for use as a critic in ActorCritic algorithms [1]; unlike TDLambda, this class treats states at different depths as unique states. In general, the typical TDLambda method is recommended unless a special
Actor object that exploits the time information is also to be used.
1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
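The sketch below illustrates how an episode loop might drive this critic through the Critic lifecycle methods documented on this page (initializeEpisode, critiqueAndUpdate, endEpisode). It is a minimal sketch, not a prescribed usage pattern: the actor and env objects and their methods (selectAction, sampleTransition, updateFromCritique) are hypothetical placeholders, and the rf, tf, and critic references are assumed to be constructed elsewhere.

```java
// Minimal sketch of an episode loop around a TimeIndexedTDLambda critic.
// Only initializeEpisode, critiqueAndUpdate, and endEpisode are documented
// methods of this class; actor, env, and tf are assumed placeholders.
void runEpisode(TimeIndexedTDLambda critic, State initialState, int maxSteps) {
    State s = initialState;
    critic.initializeEpisode(s);            // called whenever a new learning episode begins

    for (int t = 0; t < maxSteps && !tf.isTerminal(s); t++) {
        GroundedAction ga = actor.selectAction(s);                   // hypothetical actor choice
        State sprime = env.sampleTransition(s, ga);                  // hypothetical environment step
        CritiqueResult c = critic.critiqueAndUpdate(s, ga, sprime);  // depth-indexed TD critique
        actor.updateFromCritique(c);                                 // hypothetical actor update
        s = sprime;
    }

    critic.endEpisode();                    // called whenever a learning episode terminates
}
```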
| Modifier and Type | Class and Description |
|---|---|
| static class | TimeIndexedTDLambda.StateTimeElibilityTrace: Extends the standard TDLambda.StateEligibilityTrace to include time/depth information. |
Nested classes/interfaces inherited from class TDLambda: TDLambda.StateEligibilityTrace

| Modifier and Type | Field and Description |
|---|---|
| protected int | curTime: The current time index / depth of the current episode. |
| protected int | maxEpisodeSize: The maximum number of steps possible in an episode. |
| protected java.util.List<java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue>> | vTIndex: The time/depth indexed value function. |
Fields inherited from class TDLambda: gamma, hashingFactory, lambda, learningRate, rf, tf, totalNumberOfSteps, traces, vIndex, vInitFunction

| Constructor and Description |
|---|
| TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda): Initializes the algorithm. |
| TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda, int maxEpisodeSize): Initializes the algorithm. |
| TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunctionInitialization vinit, double lambda, int maxEpisodeSize): Initializes the algorithm. |
| Modifier and Type | Method and Description |
|---|---|
| CritiqueResult | critiqueAndUpdate(State s, GroundedAction ga, State sprime): This method's implementation provides the critique for some specific instance of the behavior. |
| void | endEpisode(): This method is called whenever a learning episode terminates. |
| int | getCurTime(): Returns the current time/depth of the current episode. |
| protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue | getV(HashableState sh, int t): Returns the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth. |
| void | initializeEpisode(State s): This method is called whenever a new learning episode begins. |
| void | resetData(): Used to reset any data that was created/modified during learning so that learning can begin anew. |
| void | setCurTime(int t): Sets the time/depth of the current episode. |
Methods inherited from class TDLambda: addNonDomainReferencedAction, getV, setLearningRate, setRewardFunction, value

protected java.util.List<java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue>> vTIndex
The time/depth indexed value function.

protected int curTime
The current time index / depth of the current episode.

protected int maxEpisodeSize
The maximum number of steps possible in an episode.
public TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda)
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a constant value function initialization value to use
lambda - indicates the strength of eligibility traces; use 1 for Monte Carlo-like traces and 0 for single-step backups

public TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda, int maxEpisodeSize)
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a constant value function initialization value to use
lambda - indicates the strength of eligibility traces; use 1 for Monte Carlo-like traces and 0 for single-step backups
maxEpisodeSize - the maximum number of steps possible in an episode

public TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunctionInitialization vinit, double lambda, int maxEpisodeSize)
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a method of initializing the value function for previously unvisited states
lambda - indicates the strength of eligibility traces; use 1 for Monte Carlo-like traces and 0 for single-step backups
maxEpisodeSize - the maximum number of steps possible in an episode
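For reference, the following hedged sketch shows one way the constant-vinit constructor could be invoked. The buildCritic helper and the hyperparameter values are illustrative assumptions, not defaults of the class; rf, tf, and hashingFactory are taken as given.

```java
// Hypothetical helper that builds a time-indexed critic with illustrative
// hyperparameters; only the documented constructor call is part of this API.
static TimeIndexedTDLambda buildCritic(RewardFunction rf,
                                       TerminalFunction tf,
                                       HashableStateFactory hashingFactory) {
    double gamma = 0.99;         // discount factor
    double learningRate = 0.1;   // how quickly the estimated value function is adjusted
    double vinit = 0.0;          // constant initialization value for the value function
    double lambda = 0.9;         // eligibility trace strength (1 = Monte Carlo-like, 0 = single step)
    int maxEpisodeSize = 200;    // maximum number of steps possible in an episode

    return new TimeIndexedTDLambda(rf, tf, gamma, hashingFactory,
            learningRate, vinit, lambda, maxEpisodeSize);
}
```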
public int getCurTime()
Returns the current time/depth of the current episode.

public void setCurTime(int t)
Sets the time/depth of the current episode.
Parameters:
t - the time/depth of the current episode

public void initializeEpisode(State s)
Description copied from interface: Critic
This method is called whenever a new learning episode begins.
Specified by: initializeEpisode in interface Critic
Overrides: initializeEpisode in class TDLambda
Parameters:
s - the initial state of the new learning episode

public void endEpisode()
Description copied from interface: Critic
This method is called whenever a learning episode terminates.
Specified by: endEpisode in interface Critic
Overrides: endEpisode in class TDLambda

public CritiqueResult critiqueAndUpdate(State s, GroundedAction ga, State sprime)
Description copied from interface: Critic
This method's implementation provides the critique for some specific instance of the behavior.
Specified by: critiqueAndUpdate in interface Critic
Overrides: critiqueAndUpdate in class TDLambda
Parameters:
s - an input state
ga - an action taken in s
sprime - the state the agent transitioned to for taking action ga in state s

protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh, int t)
Returns the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth.
Parameters:
sh - the hashed state for which the value should be returned
t - the time/depth at which the state is visited
Returns: the TDLambda.VValue object (storing the value) for the given hashed state at the specified time/depth
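Because getV is protected, reading time-indexed values from outside the class requires a subclass. The sketch below is one hedged way to do that: hashingFactory is one of the inherited TDLambda fields listed above, while the field name used to pull the scalar value out of the returned TDLambda.VValue (here v) is an assumption and may differ.

```java
// Hedged sketch of a subclass that exposes the time-indexed value estimates.
public class InspectableTimeIndexedTDLambda extends TimeIndexedTDLambda {

    public InspectableTimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf,
            double gamma, HashableStateFactory hashingFactory, double learningRate,
            double vinit, double lambda, int maxEpisodeSize) {
        super(rf, tf, gamma, hashingFactory, learningRate, vinit, lambda, maxEpisodeSize);
    }

    /** Returns the learned value estimate for state s when it is visited at time/depth t. */
    public double valueAtDepth(State s, int t) {
        HashableState sh = this.hashingFactory.hashState(s); // hashingFactory is inherited from TDLambda
        return this.getV(sh, t).v;                           // "v" is an assumed field name on TDLambda.VValue
    }
}
```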