public class TDLambda extends java.lang.Object implements Critic, ValueFunction

A TD(λ) critic that learns a state value function from experience and provides critiques for ActorCritic algorithms [1].

1. Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. "Learning to act using real-time dynamic programming." Artificial Intelligence 72.1 (1995): 81-138.

Modifier and Type | Class and Description
---|---
static class | TDLambda.StateEligibilityTrace: A data structure for storing the elements of an eligibility trace.
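To make the role of eligibility traces concrete, the following is a self-contained sketch of the TD(λ) update this critic performs. It is a simplified illustration (integer state IDs, plain maps), not BURLAP's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative TD(lambda) critic: on each transition (s, r, sPrime) it
// computes the TD error and distributes it over all recently visited
// states in proportion to their eligibility traces.
class TdLambdaSketch {
    final double gamma;   // discount factor
    final double lambda;  // trace decay strength
    final double alpha;   // learning rate
    final Map<Integer, Double> v = new HashMap<>();       // state values
    final Map<Integer, Double> traces = new HashMap<>();  // eligibility traces

    TdLambdaSketch(double gamma, double lambda, double alpha) {
        this.gamma = gamma; this.lambda = lambda; this.alpha = alpha;
    }

    double value(int s) { return v.getOrDefault(s, 0.0); }

    // One learning step for the transition s --(reward r)--> sPrime.
    void critiqueAndUpdate(int s, double r, int sPrime, boolean terminal) {
        double target = terminal ? r : r + gamma * value(sPrime);
        double delta = target - value(s);      // TD error
        traces.merge(s, 1.0, Double::sum);     // accumulate the trace for s
        for (Map.Entry<Integer, Double> e : traces.entrySet()) {
            int state = e.getKey();
            double trace = e.getValue();
            v.put(state, value(state) + alpha * delta * trace); // trace-weighted update
            e.setValue(gamma * lambda * trace);                 // decay the trace
        }
    }

    // Traces are cleared between episodes, mirroring endEpisode().
    void endEpisode() { traces.clear(); }
}
```

In the real class, the traces list corresponds to the `traces` field of `TDLambda.StateEligibilityTrace` objects, and the value map to `vIndex`.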
Modifier and Type | Field and Description
---|---
protected double | gamma: The discount factor.
protected HashableStateFactory | hashingFactory: The state hashing factory used for hashing states and performing state equality checks.
protected double | lambda: Indicates the strength of eligibility traces.
protected LearningRate | learningRate: The learning rate function that affects how quickly the estimated value function changes.
protected RewardFunction | rf: The reward function used for learning.
protected TerminalFunction | tf: The state termination function used to indicate end states.
protected int | totalNumberOfSteps: The total number of learning steps performed by this agent.
protected java.util.LinkedList<TDLambda.StateEligibilityTrace> | traces: The eligibility traces for the current episode.
protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> | vIndex: The state value function.
protected ValueFunctionInitialization | vInitFunction: Defines how the value function is initialized for unvisited states.
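The vIndex and vInitFunction fields work together: when a state is looked up for the first time, its value entry is created from the initialization function and cached. A minimal sketch of that lazy-initialization pattern (String keys and a one-element double array are simplified stand-ins for HashableState and TDLambda.VValue):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Lazily initialized value table: unvisited states get their value from
// an initialization function the first time they are looked up.
class LazyValueTable {
    private final Map<String, double[]> vIndex = new HashMap<>(); // mutable value cell per state
    private final ToDoubleFunction<String> vInit;                 // stand-in for ValueFunctionInitialization

    LazyValueTable(ToDoubleFunction<String> vInit) { this.vInit = vInit; }

    // Mirrors getV(sh): create-and-cache the value cell on first access.
    double[] getV(String state) {
        return vIndex.computeIfAbsent(state, s -> new double[]{ vInit.applyAsDouble(s) });
    }
}
```

Because the same mutable cell is returned on every lookup, updates written through it persist, which is why `getV` can hand callers the value object directly.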
Constructor and Description
---
TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda): Initializes the algorithm.
TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunctionInitialization vinit, double lambda): Initializes the algorithm.
Modifier and Type | Method and Description
---|---
void | addNonDomainReferencedAction(Action a): Allows the critic to critique actions that are not a part of the domain definition.
CritiqueResult | critiqueAndUpdate(State s, GroundedAction ga, State sprime): Provides the critique for some specific instance of the behavior.
void | endEpisode(): Called whenever a learning episode terminates.
protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue | getV(HashableState sh): Returns the TDLambda.VValue object (storing the value) for a given hashed state.
void | initializeEpisode(State s): Called whenever a new learning episode begins.
void | resetData(): Resets any data that was created/modified during learning so that learning can begin anew.
void | setLearningRate(LearningRate lr): Sets the learning rate function to use.
void | setRewardFunction(RewardFunction rf): Sets the reward function to use.
double | value(State s): Returns the value function evaluation of the given state.
protected RewardFunction rf
protected TerminalFunction tf
protected double gamma
protected HashableStateFactory hashingFactory
protected LearningRate learningRate
protected ValueFunctionInitialization vInitFunction
protected double lambda
protected java.util.Map<HashableState,burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue> vIndex
protected java.util.LinkedList<TDLambda.StateEligibilityTrace> traces
protected int totalNumberOfSteps
public TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, double vinit, double lambda)
Initializes the algorithm.
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a constant value function initialization value to use
lambda - indicates the strength of eligibility traces; use 1 for Monte Carlo-like traces and 0 for single-step backups

public TDLambda(RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, double learningRate, ValueFunctionInitialization vinit, double lambda)
Initializes the algorithm.
Parameters:
rf - the reward function
tf - the terminal state function
gamma - the discount factor
hashingFactory - the state hashing factory to use for hashing states and performing equality checks
learningRate - the learning rate that affects how quickly the estimated value function is adjusted
vinit - a method of initializing the value function for previously unvisited states
lambda - indicates the strength of eligibility traces; use 1 for Monte Carlo-like traces and 0 for single-step backups

public void addNonDomainReferencedAction(Action a)
Allows the critic to critique actions that are not a part of the domain definition.
Specified by: addNonDomainReferencedAction in interface Critic
Parameters:
a - an action not a part of the domain definition that this critic should be able to critique

public void setRewardFunction(RewardFunction rf)
Sets the reward function to use.
Parameters:
rf - the reward function

public void initializeEpisode(State s)
Called whenever a new learning episode begins.
Specified by: initializeEpisode in interface Critic
Parameters:
s - the initial state of the new learning episode

public void endEpisode()
Called whenever a learning episode terminates.
Specified by: endEpisode in interface Critic

public void setLearningRate(LearningRate lr)
Sets the learning rate function to use.
Parameters:
lr - the learning rate function to use

public CritiqueResult critiqueAndUpdate(State s, GroundedAction ga, State sprime)
Provides the critique for some specific instance of the behavior.
Specified by: critiqueAndUpdate in interface Critic
Parameters:
s - an input state
ga - an action taken in s
sprime - the state the agent transitioned to for taking action ga in state s

public double value(State s)
Returns the value function evaluation of the given state.
Specified by: value in interface ValueFunction
Parameters:
s - the state to evaluate

public void resetData()
Resets any data that was created/modified during learning so that learning can begin anew.
Specified by: resetData in interface Critic

protected burlap.behavior.singleagent.learning.actorcritic.critics.TDLambda.VValue getV(HashableState sh)
Returns the TDLambda.VValue object (storing the value) for a given hashed state.
Parameters:
sh - the hashed state for which the value should be returned
Returns: the TDLambda.VValue object (storing the value) for the given hashed state
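The lambda parameter controls how far each TD error propagates back through an episode: with lambda = 0 only the most recently visited state is updated (single-step backups), while with lambda = 1 the error flows back to every state in the episode (Monte Carlo-like traces). The following self-contained sketch (simplified integer states, not BURLAP code) runs the same three-step episode under both settings:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Demonstrates how lambda controls credit assignment in TD(lambda).
// A three-step episode 0 -> 1 -> 2 -> terminal yields reward 1 only at the
// end; we run the standard trace-based update and inspect which states changed.
class LambdaEffect {
    static Map<Integer, Double> run(double lambda) {
        double gamma = 1.0, alpha = 0.5;
        Map<Integer, Double> v = new LinkedHashMap<>();
        Map<Integer, Double> traces = new LinkedHashMap<>();
        int[][] steps = { {0, 1}, {1, 2}, {2, -1} };  // (state, nextState); -1 = terminal
        for (int[] step : steps) {
            int s = step[0], sPrime = step[1];
            double r = (sPrime == -1) ? 1.0 : 0.0;    // reward only on the final step
            double target = (sPrime == -1) ? r : r + gamma * v.getOrDefault(sPrime, 0.0);
            double delta = target - v.getOrDefault(s, 0.0);  // TD error
            traces.merge(s, 1.0, Double::sum);               // accumulate the trace for s
            for (Map.Entry<Integer, Double> e : traces.entrySet()) {
                int st = e.getKey();
                v.put(st, v.getOrDefault(st, 0.0) + alpha * delta * e.getValue());
                e.setValue(gamma * lambda * e.getValue());   // traces decay by gamma * lambda
            }
        }
        return v;
    }
}
```

With lambda = 0 only state 2 (the state immediately before the reward) gains value; with lambda = 1 the final TD error is credited to states 0 and 1 as well.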