public class QLearning extends MDPSolver implements QProvider, LearningAgent, Planner
Tabular Q-learning algorithm [1]. This algorithm will work correctly with options [2]. The implementation can be used for either learning or planning; the latter is performed by running many learning episodes in succession in a SimulatedEnvironment.
If you are going to use this algorithm for planning, call the initializeForPlanning(int) method before calling planFromState(State).
The number of episodes used for planning can be determined either by a maximum number of episodes or by a threshold on the maximum change in the Q-function.
By default, this agent will use an epsilon-greedy policy with epsilon=0.1. You can change the learning policy to anything with the setLearningPolicy(burlap.behavior.policy.Policy) method.
If you want to use a custom learning rate decay schedule rather than a constant learning rate, use the setLearningRateFunction(burlap.behavior.learningrate.LearningRate) method.
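For concreteness, here is a minimal learning sketch. The domain, initial state, and hyperparameter values (gamma = 0.99, qInit = 0.0, learning rate 0.5, 100 episodes) are illustrative assumptions, not requirements of this class:

```java
import burlap.behavior.singleagent.Episode;
import burlap.behavior.singleagent.learning.tdmethods.QLearning;
import burlap.mdp.core.state.State;
import burlap.mdp.singleagent.SADomain;
import burlap.mdp.singleagent.environment.SimulatedEnvironment;
import burlap.statehashing.simple.SimpleHashableStateFactory;

public class QLearningLearningSketch {

    // domain and initialState are assumed to come from your own problem definition
    public static QLearning learn(SADomain domain, State initialState) {
        QLearning agent = new QLearning(domain, 0.99,
                new SimpleHashableStateFactory(), 0.0, 0.5);

        // Learning is driven by interacting with an Environment;
        // a SimulatedEnvironment samples transitions from the domain's model
        SimulatedEnvironment env = new SimulatedEnvironment(domain, initialState);
        for (int i = 0; i < 100; i++) {
            Episode e = agent.runLearningEpisode(env);
            env.resetEnvironment(); // start the next episode from the initial state
        }
        return agent;
    }
}
```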
1. Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8.3-4 (1992): 279-292.
2. Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial intelligence 112.1 (1999): 181-211.
Nested classes/interfaces inherited from interface burlap.behavior.valuefunction.QProvider: QProvider.Helper
| Modifier and Type | Field and Description |
|---|---|
| protected int | eStepCounter: A counter for the number of steps taken so far in the current episode. |
| protected Policy | learningPolicy: The learning policy to use. |
| protected LearningRate | learningRate: The learning rate function used. |
| protected int | maxEpisodeSize: The maximum number of steps that will be taken in an episode before the agent terminates a learning episode. |
| protected double | maxQChangeForPlanningTermination: The maximum allowable change in the Q-function during an episode before the planning method terminates. |
| protected double | maxQChangeInLastEpisode: The maximum Q-value change that occurred in the last learning episode. |
| protected int | numEpisodesForPlanning: The maximum number of episodes to use for planning. |
| protected java.util.Map<HashableState,QLearningStateNode> | qFunction: The tabular mapping from states to Q-values. |
| protected QFunction | qInitFunction: The object that defines how Q-values are initialized. |
| protected boolean | shouldDecomposeOptions: Whether options should be decomposed into primitive actions in the returned Episode objects. |
| protected int | totalNumberOfSteps: The total number of learning steps performed by this agent. |
Fields inherited from class burlap.behavior.singleagent.MDPSolver: actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
| Constructor and Description |
|---|
| QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate): Initializes Q-learning with a 0.1 epsilon-greedy policy, the same Q-value initialization everywhere, and no limit on the number of steps the agent can take in an episode. |
| QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate, int maxEpisodeSize): Initializes Q-learning with a 0.1 epsilon-greedy policy and the same Q-value initialization everywhere. |
| QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate, Policy learningPolicy, int maxEpisodeSize): Initializes Q-learning with the same Q-value initialization everywhere. |
| QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, QFunction qInit, double learningRate, Policy learningPolicy, int maxEpisodeSize): Initializes the algorithm. |
| Modifier and Type | Method and Description |
|---|---|
| int | getLastNumSteps(): Returns the number of steps taken in the last episode. |
| protected double | getMaxQ(HashableState s): Returns the maximum Q-value in the hashed state. |
| protected QValue | getQ(HashableState s, Action a): Returns the Q-value for a given hashed state and action. |
| protected java.util.List<QValue> | getQs(HashableState s): Returns the possible Q-values for a given hashed state. |
| protected QLearningStateNode | getStateNode(HashableState s): Returns the QLearningStateNode object stored for the given hashed state. |
| void | initializeForPlanning(int numEpisodesForPlanning): Sets the RewardFunction, TerminalFunction, and the number of simulated episodes to use for planning when the planFromState(State) method is called. |
| void | loadQTable(java.lang.String path): Loads the Q-function table located on disk at the specified path. |
| GreedyQPolicy | planFromState(State initialState): Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly. |
| protected void | QLInit(SADomain domain, double gamma, HashableStateFactory hashingFactory, QFunction qInitFunction, double learningRate, Policy learningPolicy, int maxEpisodeSize): Initializes the algorithm. |
| double | qValue(State s, Action a): Returns the QValue for the given state-action pair. |
| java.util.List<QValue> | qValues(State s): Returns a List of QValue objects for every permissible action for the given input state. |
| void | resetSolver(): Resets all solver results so that the solver can be restarted fresh, as if it had never solved the MDP. |
| Episode | runLearningEpisode(Environment env) |
| Episode | runLearningEpisode(Environment env, int maxSteps) |
| void | setLearningPolicy(Policy p): Sets which policy this agent should use for learning. |
| void | setLearningRateFunction(LearningRate lr): Sets the learning rate function to use. |
| void | setMaximumEpisodesForPlanning(int n): Sets the maximum number of episodes that will be performed when the planFromState(State) method is called. |
| void | setMaxQChangeForPlanningTerminaiton(double m): Sets a threshold on the maximum change in the Q-function that will cause planFromState(State) to stop planning when it is reached. |
| void | setQInitFunction(QFunction qInit): Sets how to initialize Q-values for previously unexperienced state-action pairs. |
| void | toggleShouldDecomposeOption(boolean toggle): Sets whether the primitive actions taken during an option will be included as steps in returned Episode objects. |
| double | value(State s): Returns the value function evaluation of the given state. |
| void | writeQTable(java.lang.String path): Writes the Q-function table stored in this object to the specified file path. |
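The planning pathway above (initializeForPlanning(int) followed by planFromState(State)) can be sketched as follows; the episode count, Q-change threshold, and hyperparameters are illustrative assumptions, and the domain and initial state are assumed to come from your own problem definition:

```java
import burlap.behavior.policy.GreedyQPolicy;
import burlap.behavior.singleagent.learning.tdmethods.QLearning;
import burlap.mdp.core.state.State;
import burlap.mdp.singleagent.SADomain;
import burlap.statehashing.simple.SimpleHashableStateFactory;

public class QLearningPlanningSketch {

    public static GreedyQPolicy plan(SADomain domain, State initialState) {
        QLearning agent = new QLearning(domain, 0.99,
                new SimpleHashableStateFactory(), 0.0, 0.5);

        // Must be called before planFromState; caps planning at 1000 simulated episodes
        agent.initializeForPlanning(1000);

        // Optionally also stop early once the Q-function changes by less than 0.01
        // in an episode (the method name's spelling comes from the API itself)
        agent.setMaxQChangeForPlanningTerminaiton(0.01);

        // Runs learning episodes in succession, then returns a greedy policy
        return agent.planFromState(initialState);
    }
}
```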
Methods inherited from class burlap.behavior.singleagent.MDPSolver: addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface MDPSolverInterface: addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected java.util.Map<HashableState,QLearningStateNode> qFunction
protected QFunction qInitFunction
protected LearningRate learningRate
protected Policy learningPolicy
protected int maxEpisodeSize
protected int eStepCounter
protected int numEpisodesForPlanning
protected double maxQChangeForPlanningTermination
protected double maxQChangeInLastEpisode
protected boolean shouldDecomposeOptions
Whether options should be decomposed into primitive actions in the returned Episode objects.

protected int totalNumberOfSteps
public QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate)
Initializes Q-learning with a 0.1 epsilon-greedy policy, the same Q-value initialization everywhere, and no limit on the number of steps the agent can take in an episode. By default, a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
hashingFactory - the state hashing factory to use for Q-lookups
qInit - the initial Q-value to use everywhere
learningRate - the learning rate

public QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate, int maxEpisodeSize)
Initializes Q-learning with a 0.1 epsilon-greedy policy and the same Q-value initialization everywhere. By default, a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
hashingFactory - the state hashing factory to use for Q-lookups
qInit - the initial Q-value to use everywhere
learningRate - the learning rate
maxEpisodeSize - the maximum number of steps the agent will take in a learning episode before it stops trying

public QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, double qInit, double learningRate, Policy learningPolicy, int maxEpisodeSize)
Initializes Q-learning with the same Q-value initialization everywhere. By default, a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
hashingFactory - the state hashing factory to use for Q-lookups
qInit - the initial Q-value to use everywhere
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in a learning episode before it stops trying

public QLearning(SADomain domain, double gamma, HashableStateFactory hashingFactory, QFunction qInit, double learningRate, Policy learningPolicy, int maxEpisodeSize)
Initializes the algorithm. By default, a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
hashingFactory - the state hashing factory to use for Q-lookups
qInit - a QFunction object that can be used to initialize the Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in a learning episode before it stops trying

protected void QLInit(SADomain domain, double gamma, HashableStateFactory hashingFactory, QFunction qInitFunction, double learningRate, Policy learningPolicy, int maxEpisodeSize)
Initializes the algorithm. By default, a call to the planFromState(State) method will cause the valueFunction to use only one episode for planning; this should probably be changed to a much larger value if you plan on using this algorithm as a planning algorithm.

Parameters:
domain - the domain in which to learn
gamma - the discount factor
hashingFactory - the state hashing factory to use for Q-lookups
qInitFunction - a QFunction object that can be used to initialize the Q-values
learningRate - the learning rate
learningPolicy - the learning policy to follow during a learning episode
maxEpisodeSize - the maximum number of steps the agent will take in a learning episode before it stops trying

public void initializeForPlanning(int numEpisodesForPlanning)
Sets the RewardFunction, TerminalFunction, and the number of simulated episodes to use for planning when the planFromState(State) method is called.

Parameters:
numEpisodesForPlanning - the number of simulated episodes to run for planning

public void setLearningRateFunction(LearningRate lr)
Parameters:
lr - the learning rate function to use
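For example, a sketch of swapping in a decaying schedule, assuming agent is a QLearning instance from earlier; ExponentialDecayLR is BURLAP's exponential-decay learning rate, and the constants here are illustrative:

```java
import burlap.behavior.learningrate.ExponentialDecayLR;

// Start at 0.5 and multiply the rate by 0.99 after each update (illustrative values)
agent.setLearningRateFunction(new ExponentialDecayLR(0.5, 0.99));
```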
public void setQInitFunction(QFunction qInit)

Sets how to initialize Q-values for previously unexperienced state-action pairs.

Parameters:
qInit - a QFunction object that can be used to initialize the Q-values

public void setLearningPolicy(Policy p)
Sets which policy this agent should use for learning.

Parameters:
p - the policy to use for learning
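For instance, to explore more aggressively than the default epsilon = 0.1 policy (0.3 is an illustrative choice, and agent is assumed from earlier):

```java
import burlap.behavior.policy.EpsilonGreedy;

// EpsilonGreedy takes the QProvider whose Q-values it should act on
agent.setLearningPolicy(new EpsilonGreedy(agent, 0.3));
```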
public void setMaximumEpisodesForPlanning(int n)

Sets the maximum number of episodes that will be performed when the planFromState(State) method is called.

Parameters:
n - the maximum number of episodes that will be performed when the planFromState(State) method is called

public void setMaxQChangeForPlanningTerminaiton(double m)
Sets a threshold on the maximum change in the Q-function that will cause planFromState(State) to stop planning when it is reached.

Parameters:
m - the maximum allowable change in the Q-function before planning stops

public int getLastNumSteps()

Returns the number of steps taken in the last episode.
public void toggleShouldDecomposeOption(boolean toggle)
Parameters:
toggle - whether to decompose options into the primitive actions taken by them or not

public java.util.List<QValue> qValues(State s)

Description copied from interface: QProvider
Returns a List of QValue objects for every permissible action for the given input state.
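A small sketch of inspecting learned estimates with qValues(State) and value(State); the agent and state parameters are assumed placeholders:

```java
import burlap.behavior.singleagent.learning.tdmethods.QLearning;
import burlap.behavior.valuefunction.QValue;
import burlap.mdp.core.state.State;

public class QInspectionSketch {

    public static void printQs(QLearning agent, State s) {
        // One QValue (public fields s, a, q) per applicable action in s
        for (QValue q : agent.qValues(s)) {
            System.out.println(q.a + " -> " + q.q);
        }
        System.out.println("V(s) = " + agent.value(s)); // value(s) is the max Q-value
    }
}
```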
public double qValue(State s, Action a)

Description copied from interface: QFunction
Returns the QValue for the given state-action pair.

protected java.util.List<QValue> getQs(HashableState s)

Returns the possible Q-values for a given hashed state.
Parameters:
s - the hashed state for which to get the Q-values

protected QValue getQ(HashableState s, Action a)

Returns the Q-value for a given hashed state and action.
Parameters:
s - the hashed state
a - the action

public double value(State s)
Description copied from interface: ValueFunction
Returns the value function evaluation of the given state.

Specified by: value in interface ValueFunction

Parameters:
s - the state to evaluate

protected QLearningStateNode getStateNode(HashableState s)
Returns the QLearningStateNode object stored for the given hashed state. If no QLearningStateNode object is stored, then one is created and its Q-values are initialized using this object's QFunction data member.

Parameters:
s - the hashed state for which to get the QLearningStateNode object

Returns:
the QLearningStateNode object stored for the given hashed state

protected double getMaxQ(HashableState s)

Returns the maximum Q-value in the hashed state.
Parameters:
s - the state for which to get the maximum Q-value

public GreedyQPolicy planFromState(State initialState)
Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.

Specified by: planFromState in interface Planner

Parameters:
initialState - the initial state of the planning problem

Returns:
a GreedyQPolicy

public Episode runLearningEpisode(Environment env)
Specified by: runLearningEpisode in interface LearningAgent
public Episode runLearningEpisode(Environment env, int maxSteps)
Specified by: runLearningEpisode in interface LearningAgent
public void resetSolver()
Description copied from interface: MDPSolverInterface
This method resets all solver results so that the solver can be restarted fresh, as if it had never solved the MDP.

Specified by: resetSolver in interface MDPSolverInterface
Overrides: resetSolver in class MDPSolver
public void writeQTable(java.lang.String path)
Parameters:
path - the path to which to write the value function

public void loadQTable(java.lang.String path)
Loads the Q-function table located on disk at the specified path. The table is stored as a Map from HashableState to QLearningStateNode.

Parameters:
path - the path to the saved value function table
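A short persistence sketch combining writeQTable and loadQTable; the file name and hyperparameters are illustrative assumptions, and the restored agent should be built over the same domain and hashing factory used during learning:

```java
import burlap.behavior.singleagent.learning.tdmethods.QLearning;
import burlap.mdp.singleagent.SADomain;
import burlap.statehashing.simple.SimpleHashableStateFactory;

public class QTablePersistenceSketch {

    public static void saveAndRestore(QLearning agent, SADomain domain) {
        // Persist the learned table (a map from HashableState to QLearningStateNode)
        agent.writeQTable("qtable.yaml"); // hypothetical path

        // Restore into a fresh agent over the same domain and hashing factory
        QLearning restored = new QLearning(domain, 0.99,
                new SimpleHashableStateFactory(), 0.0, 0.5);
        restored.loadQTable("qtable.yaml");
    }
}
```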