public class TDLambda extends java.lang.Object implements Critic, ValueFunction
An implementation of TD(λ) value function learning that can be used as the Critic in ActorCritic algorithms [1].
1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.

| Modifier and Type | Class and Description |
|---|---|
| static class | TDLambda.StateEligibilityTrace: A data structure for storing the elements of an eligibility trace. |
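For intuition only (not the actual BURLAP source), an eligibility-trace entry of the kind TDLambda.StateEligibilityTrace describes can be pictured as a hashed state paired with an eligibility weight that fades by gamma * lambda after every step. The field names and the HashableState import path below are assumptions:

```java
// Illustrative sketch of an eligibility-trace entry; the field names and the
// HashableState package are assumptions, not the real StateEligibilityTrace.
import burlap.oomdp.statehashing.HashableState; // assumed BURLAP 2.x-style package

class StateEligibilityTraceSketch {
    HashableState sh;    // the hashed state this trace refers to
    double eligibility;  // current eligibility weight for that state

    StateEligibilityTraceSketch(HashableState sh, double eligibility) {
        this.sh = sh;
        this.eligibility = eligibility;
    }

    // After each learning step the eligibility fades geometrically.
    void decay(double gamma, double lambda) {
        this.eligibility *= gamma * lambda;
    }
}
```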
| Modifier and Type | Field and Description |
|---|---|
| protected double | gamma: The discount factor. |
| protected HashableStateFactory | hashingFactory: The state hashing factory used for hashing states and performing state equality checks. |
| protected double | lambda: Indicates the strength of eligibility traces. |
| protected LearningRate | learningRate: The learning rate function that affects how quickly the estimated value function changes. |
| protected RewardFunction | rf: The reward function used for learning. |
| protected TerminalFunction | tf: The terminal state function used to indicate end states. |
| protected int | totalNumberOfSteps: The total number of learning steps performed by this agent. |
| protected java.util.LinkedList<TDLambda.StateEligibilityTrace> | traces: The eligibility traces for the current episode. |
| protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> | vIndex: The state value function. |
| protected ValueFunctionInitialization | vInitFunction: Defines how the value function is initialized for unvisited states. |
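The fields above are the ingredients of a tabular TD(λ) backup: a TD error is computed from the discount factor and the current value estimates, every state with a live trace is nudged toward that error in proportion to its eligibility, and each eligibility then decays by gamma * lambda. The sketch below shows that generic update (replacing-traces variant) with placeholder names (values, alpha, Trace, integer state indices); it is not the class's actual source:

```java
import java.util.LinkedList;

// Generic tabular TD(lambda) backup over a list of eligibility traces.
// Illustrative only: Trace, values, alpha, sIdx and sPrimeIdx are placeholders,
// not TDLambda's internals.
class TDLambdaUpdateSketch {

    static class Trace {
        int stateIndex;     // stands in for a hashed state key
        double eligibility; // current eligibility weight
        Trace(int stateIndex, double eligibility) {
            this.stateIndex = stateIndex;
            this.eligibility = eligibility;
        }
    }

    static void step(double[] values, LinkedList<Trace> traces,
                     int sIdx, int sPrimeIdx, double r,
                     double alpha, double gamma, double lambda) {

        // TD error for the observed transition s -> s' with reward r.
        double delta = r + gamma * values[sPrimeIdx] - values[sIdx];

        // Refresh (or add) the trace for the current state (replacing traces).
        boolean found = false;
        for (Trace t : traces) {
            if (t.stateIndex == sIdx) { t.eligibility = 1.; found = true; break; }
        }
        if (!found) {
            traces.add(new Trace(sIdx, 1.));
        }

        // Back up every traced state and decay its eligibility.
        for (Trace t : traces) {
            values[t.stateIndex] += alpha * delta * t.eligibility;
            t.eligibility *= gamma * lambda; // lambda = 1 ~ Monte Carlo, lambda = 0 ~ one-step TD
        }
    }
}
```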
| Constructor and Description |
|---|
| TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda): Initializes the algorithm. |
| TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunctionInitialization vinit, double lambda): Initializes the algorithm. |
| Modifier and Type | Method and Description |
|---|---|
| void | addNonDomainReferencedAction(Action a): This method allows the critic to critique actions that are not part of the domain definition. |
| CritiqueResult | critiqueAndUpdate(State s, GroundedAction ga, State sprime): This method's implementation provides the critique for some specific instance of the behavior. |
| void | endEpisode(): This method is called whenever a learning episode terminates. |
| protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue | getV(HashableState sh): Returns the TDLambda.VValue object (storing the value) for a given hashed state. |
| void | initializeEpisode(State s): This method is called whenever a new learning episode begins. |
| void | resetData(): Used to reset any data that was created/modified during learning so that learning can begin anew. |
| void | setLearningRate(LearningRate lr): Sets the learning rate function to use. |
| void | setRewardFunction(RewardFunction rf): Sets the reward function to use. |
| double | value(State s): Returns the value function evaluation of the given state. |
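As a rough sketch of how these methods fit together, the loop below drives the critic by hand through one episode using only the methods listed above, given a TDLambda instance built with one of the constructors. The import paths are assumed (a BURLAP 2.x-style layout), and nextAction, sampleTransition, and isTerminal are hypothetical stand-ins for whatever actor and environment the critic is paired with:

```java
// Illustrative only: the import paths are assumed, and the nextAction /
// sampleTransition / isTerminal helpers are hypothetical stand-ins for the
// actor and environment you actually use.
import burlap.behavior.singleagent.learning.actorcritic.CritiqueResult;
import burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda;
import burlap.oomdp.core.states.State;
import burlap.oomdp.singleagent.GroundedAction;

public class TDLambdaEpisodeSketch {

    public static void runEpisode(TDLambda td, State initialState) {
        State s = initialState;
        td.initializeEpisode(s);

        while (!isTerminal(s)) {
            GroundedAction ga = nextAction(s);       // hypothetical actor call
            State sprime = sampleTransition(s, ga);  // hypothetical environment call

            // TD critique for the transition; an actor would use this to update its policy.
            CritiqueResult critique = td.critiqueAndUpdate(s, ga, sprime);

            s = sprime;
        }

        td.endEpisode();

        // The critic also exposes its learned value estimate directly.
        System.out.println("V(s0) = " + td.value(initialState));
    }

    // Hypothetical helpers, not part of BURLAP:
    private static boolean isTerminal(State s) { return true; }
    private static GroundedAction nextAction(State s) { return null; }
    private static State sampleTransition(State s, GroundedAction ga) { return null; }
}
```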
protected RewardFunction rf
protected TerminalFunction tf
protected double gamma
protected HashableStateFactory hashingFactory
protected LearningRate learningRate
protected ValueFunctionInitialization vInitFunction
protected double lambda
protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> vIndex
protected java.util.LinkedList<TDLambda.StateEligibilityTrace> traces
protected int totalNumberOfSteps
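Reading vIndex and vInitFunction together with getV(HashableState) below suggests the usual lazy-initialization pattern for a tabular value function. The sketch below illustrates that generic pattern only; the key type and VValueSketch are placeholders, not TDLambda's actual internals:

```java
import java.util.HashMap;
import java.util.Map;

// Generic lazy-initialization pattern for a tabular value function keyed by a
// hashed state. Key and VValueSketch stand in for HashableState and the
// protected TDLambda.VValue type; this is not the actual implementation.
class TabularValueSketch<Key> {

    static class VValueSketch {
        double v;
        VValueSketch(double v) { this.v = v; }
    }

    private final Map<Key, VValueSketch> vIndex = new HashMap<>();
    private final double initialValue; // plays the role of the vinit setting

    TabularValueSketch(double initialValue) {
        this.initialValue = initialValue;
    }

    // Return the stored value object, creating it from the initialization
    // value the first time an unvisited state is queried.
    VValueSketch getV(Key sh) {
        VValueSketch stored = vIndex.get(sh);
        if (stored == null) {
            stored = new VValueSketch(initialValue);
            vIndex.put(sh, stored);
        }
        return stored;
    }
}
```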
public TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda)
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a constant value function initialization value to use.
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single-step backups.

public TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunctionInitialization vinit, double lambda)
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks.
learningRate - the learning rate that affects how quickly the estimated value function is adjusted.
vinit - a method of initializing the value function for previously unvisited states.
lambda - indicates the strength of eligibility traces. Use 1 for Monte-Carlo-like traces and 0 for single-step backups.
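A brief construction sketch covering both constructors. The numeric values are arbitrary, and MyHashableStateFactory, MyRewardFunction, MyTerminalFunction, and MyValueInit are hypothetical placeholders for whatever implementations your problem already provides; they are not BURLAP classes:

```java
// Hypothetical concrete implementations supplied by your own code:
HashableStateFactory hashingFactory = new MyHashableStateFactory();
RewardFunction rf = new MyRewardFunction();
TerminalFunction tf = new MyTerminalFunction();

// Constant value-function initialization (vinit passed as a double):
TDLambda critic = new TDLambda(rf, tf, 0.99, hashingFactory, 0.1, 0.0, 0.8);

// Or with a custom initialization scheme for previously unvisited states:
ValueFunctionInitialization vinit = new MyValueInit();
TDLambda critic2 = new TDLambda(rf, tf, 0.99, hashingFactory, 0.1, vinit, 0.8);
```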
public void addNonDomainReferencedAction(Action a)

Specified by: addNonDomainReferencedAction in interface Critic
Parameters:
a - an action not part of the domain definition that this critic should be able to critique.

public void setRewardFunction(RewardFunction rf)
Parameters:
rf - the reward function to use

public void initializeEpisode(State s)
Specified by: initializeEpisode in interface Critic
Parameters:
s - the initial state of the new learning episode

public void endEpisode()
Specified by: endEpisode in interface Critic

public void setLearningRate(LearningRate lr)
Parameters:
lr - the learning rate function to use.

public CritiqueResult critiqueAndUpdate(State s, GroundedAction ga, State sprime)
Specified by: critiqueAndUpdate in interface Critic
Parameters:
s - an input state
ga - an action taken in s
sprime - the state the agent transitioned to for taking action ga in state s

public double value(State s)
Specified by: value in interface ValueFunction
Parameters:
s - the state to evaluate.

public void resetData()
Specified by: resetData in interface Critic

protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh)
Returns the TDLambda.VValue object (storing the value) for a given hashed state.
Parameters:
sh - the hashed state for which the value should be returned.
Returns:
the TDLambda.VValue object (storing the value) for the given hashed state.