public class TimeIndexedTDLambda extends TDLambda

An extension of TDLambda, a Critic for ActorCritic algorithms [1], except that this class treats states at different depths as unique states. In general, the typical TDLambda method is recommended unless a special Actor object that exploits the time information is to be used as well.

1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.
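The core idea of the class is the time-indexed value function (the `vTIndex` field documented below): one value table per depth, so the same state reached at different depths gets independent value estimates. A minimal sketch of that data structure, using plain `String` state keys in place of BURLAP's `HashableState` (class and method names here are illustrative, not BURLAP API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimeIndexedValues {
    // One value table per time step/depth; states at different depths are distinct.
    private final List<Map<String, Double>> vTIndex = new ArrayList<>();
    private final double vinit;

    public TimeIndexedValues(double vinit) { this.vinit = vinit; }

    // Grow the list lazily and return the stored value, initializing unseen entries.
    public double getV(String state, int t) {
        while (vTIndex.size() <= t) {
            vTIndex.add(new HashMap<>());
        }
        return vTIndex.get(t).computeIfAbsent(state, s -> vinit);
    }

    public void setV(String state, int t, double value) {
        getV(state, t); // ensure the table for depth t exists
        vTIndex.get(t).put(state, value);
    }

    public static void main(String[] args) {
        TimeIndexedValues v = new TimeIndexedValues(0.0);
        v.setV("s0", 0, 1.5);
        // The same state at a different depth is treated as a unique state.
        System.out.println(v.getV("s0", 0)); // 1.5
        System.out.println(v.getV("s0", 3)); // 0.0 (fresh entry at depth 3)
    }
}
```

This is why finite-horizon problems benefit from the class: the value of a state can legitimately differ depending on how many steps remain in the episode.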
Modifier and Type | Class and Description |
---|---|
static class | TimeIndexedTDLambda.StateTimeElibilityTrace: Extends the standard TDLambda.StateEligibilityTrace to include time/depth information. |

Nested classes inherited from class TDLambda: TDLambda.StateEligibilityTrace
Modifier and Type | Field and Description |
---|---|
protected int | curTime: The current time index/depth of the current episode. |
protected int | maxEpisodeSize: The maximum number of steps possible in an episode. |
protected java.util.List<java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue>> | vTIndex: The time/depth indexed value function. |

Fields inherited from class TDLambda: lambda, learningRate, totalNumberOfSteps, traces, vIndex, vInitFunction

Fields inherited from class MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description |
---|
TimeIndexedTDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda): Initializes the algorithm. |
TimeIndexedTDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunction vinit, double lambda, int maxEpisodeSize): Initializes the algorithm. |
TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda, int maxEpisodeSize): Initializes the algorithm. |
Modifier and Type | Method and Description |
---|---|
CritiqueResult | critiqueAndUpdate(EnvironmentOutcome eo): This method's implementation provides the critique for some specific instance of the behavior. |
void | endEpisode(): This method is called whenever a learning episode terminates. |
int | getCurTime(): Returns the current time/depth of the current episode. |
protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue | getV(HashableState sh, int t): Returns the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth. |
void | initializeEpisode(State s): This method is called whenever a new learning episode begins. |
void | resetData(): Used to reset any data that was created/modified during learning so that learning can begin anew. |
void | setCurTime(int t): Sets the time/depth of the current episode. |

Methods inherited from class TDLambda: getV, resetSolver, setLearningRate, value

Methods inherited from class MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface MDPSolverInterface: addActionType
protected java.util.List<java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue>> vTIndex
protected int curTime
protected int maxEpisodeSize
public TimeIndexedTDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda)

gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a constant value function initialization value to use.
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single-step backups.

public TimeIndexedTDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda, int maxEpisodeSize)

rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a constant value function initialization value to use.
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single-step backups.
maxEpisodeSize - the maximum number of steps possible in an episode

public TimeIndexedTDLambda(double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunction vinit, double lambda, int maxEpisodeSize)

gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a method of initializing the value function for previously unvisited states.
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single-step backups.
maxEpisodeSize - the maximum number of steps possible in an episode

public int getCurTime()

Returns the current time/depth of the current episode.

public void setCurTime(int t)

t - the time/depth of the current episode.

public void initializeEpisode(State s)

This method is called whenever a new learning episode begins.
Specified by: initializeEpisode in interface Critic
Overrides: initializeEpisode in class TDLambda
s - the initial state of the new learning episode

public void endEpisode()

This method is called whenever a learning episode terminates.
Specified by: endEpisode in interface Critic
Overrides: endEpisode in class TDLambda

public CritiqueResult critiqueAndUpdate(EnvironmentOutcome eo)

This method's implementation provides the critique for some specific instance of the behavior.
Specified by: critiqueAndUpdate in interface Critic
Overrides: critiqueAndUpdate in class TDLambda
eo - the EnvironmentOutcome specifying the event

protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh, int t)

Returns the TDLambda.VValue object (storing the value) for a given hashed state at the specified time/depth.
sh - the hashed state for which the value should be returned.
t - the time/depth at which the state is visited
Returns: the TDLambda.VValue object (storing the value) for the given hashed state at the specified time/depth
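The critique step above follows the standard TD(lambda) pattern over time-indexed states: compute the TD error delta = r + gamma * V(s', t+1) - V(s, t), then move every traced state's value toward it in proportion to its eligibility, decaying the traces afterwards. A self-contained sketch of that update, with "state@depth" string keys and plain doubles standing in for BURLAP's types (all names here are illustrative, not BURLAP API):

```java
import java.util.HashMap;
import java.util.Map;

public class TdLambdaSketch {
    static final double GAMMA = 0.9, LAMBDA = 0.5, ALPHA = 0.1;

    // Values and eligibility traces keyed by "state@depth" so that the same
    // state at different depths is treated as unique.
    static Map<String, Double> v = new HashMap<>();
    static Map<String, Double> traces = new HashMap<>();

    static String key(String s, int t) { return s + "@" + t; }

    // One critique step: observe (s at depth t) -> (sPrime at depth t+1) with reward r.
    static double critiqueAndUpdate(String s, int t, String sPrime, double r) {
        double vS = v.getOrDefault(key(s, t), 0.0);
        double vSPrime = v.getOrDefault(key(sPrime, t + 1), 0.0);
        double delta = r + GAMMA * vSPrime - vS; // TD error

        // The current state becomes (more) eligible.
        traces.merge(key(s, t), 1.0, Double::sum);

        // Every traced state moves toward the TD error; traces then decay.
        for (Map.Entry<String, Double> e : traces.entrySet()) {
            v.merge(e.getKey(), ALPHA * delta * e.getValue(), Double::sum);
            e.setValue(e.getValue() * GAMMA * LAMBDA);
        }
        return delta;
    }

    public static void main(String[] args) {
        double d1 = critiqueAndUpdate("s0", 0, "s1", 1.0);
        System.out.println(d1);                  // first TD error: 1.0
        System.out.println(v.get(key("s0", 0))); // updated value: 0.1
    }
}
```

Note how `v.getOrDefault(key(sPrime, t + 1), ...)` mirrors the class's `getV(sh, t)`: the bootstrapped value is always looked up at the next depth, which is the only behavioral difference from depth-agnostic TDLambda.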