public class LSPI extends MDPSolver implements QProvider, LearningAgent, Planner
Rather than using the planFromState(State) or runLearningEpisode(burlap.mdp.singleagent.environment.Environment)
methods, you should instead use a SARSCollector
object to gather a set of example state-action-reward-state tuples that are then used for policy iteration. You can
set the dataset to use with the setDataset(SARSData)
method and then run LSPI on it with the runPolicyIteration(int, double)
method. LSPI requires
initializing a matrix to an identity matrix multiplied by some large positive constant (see the reference for more information).
By default this constant is 100, but you can change it with the setIdentityScalar(double)
method.
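For example, a minimal sketch of this recommended workflow is shown below. The variable names (domain, features, stateGenerator), the UniformRandomSARSCollector construction, and the collectNInstances call are illustrative assumptions about the surrounding BURLAP API; only the LSPI calls themselves are taken from this class.

```java
// Sketch: assumes `domain` (SADomain), `features` (DenseStateActionFeatures), and
// `stateGenerator` (a StateGenerator for sampling initial states) already exist.
SARSCollector collector = new SARSCollector.UniformRandomSARSCollector(domain);
SARSData dataset = collector.collectNInstances(stateGenerator, domain.getModel(),
        5000, 20, new SARSData()); // 5000 SARS tuples, episodes capped at 20 steps

LSPI lspi = new LSPI(domain, 0.99, features);
lspi.setDataset(dataset);

// At most 30 policy iterations, terminating early once the largest
// weight change falls below 1e-6.
GreedyQPolicy policy = lspi.runPolicyIteration(30, 1e-6);
```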
If you do use the planFromState(State)
method, you should first initialize the parameters for it using the
initializeForPlanning(int, SARSCollector)
or
initializeForPlanning(int)
method.
If you do not set a SARSCollector
to use for planning,
a SARSCollector.UniformRandomSARSCollector
will be created automatically. After collecting data, it will call
the runPolicyIteration(int, double)
method using a maximum of 30 policy iterations. You can change the SARSCollector
this method uses, the number of samples it acquires, the maximum weight change for PI termination,
and the maximum number of policy iterations by using the setPlanningCollector(SARSCollector)
, setNumSamplesForPlanning(int)
, setMaxChange(double)
, and
setMaxNumPlanningIterations(int)
methods, respectively.
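A hypothetical planning-style usage, assuming domain, features, and initialState already exist and that the domain's model supplies the reward and terminal functions:

```java
LSPI lspi = new LSPI(domain, 0.99, features);
lspi.initializeForPlanning(5000);       // gather 5000 SARS samples with an automatically created collector
lspi.setMaxNumPlanningIterations(30);   // optional: cap the number of policy iterations
lspi.setMaxChange(1e-6);                // optional: weight-change threshold for PI termination
GreedyQPolicy policy = lspi.planFromState(initialState);
```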
If you use the runLearningEpisode(burlap.mdp.singleagent.environment.Environment)
method (or the runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int)
method),
it will work by following a learning policy for the episode and adding its observations to its dataset for its
policy iteration. After enough new data has been acquired, policy iteration will be rerun.
You can adjust the learning policy, the maximum number of allowed learning steps in an
episode, and the minimum number of new observations before LSPI is rerun using the setLearningPolicy(Policy)
, setMaxLearningSteps(int)
, and setMinNewStepsForLearningPI(int)
methods, respectively. The LSPI termination parameters are set using the same methods used to configure the planFromState(State)
behavior discussed above.
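A sketch of the learning-agent style of use follows; env is assumed to be an Environment for the problem, and the EpsilonGreedy constructor shown is an assumption about that policy class's API.

```java
LSPI lspi = new LSPI(domain, 0.99, features);
lspi.setLearningPolicy(new EpsilonGreedy(lspi, 0.1)); // mirrors the 0.1 epsilon-greedy default
lspi.setMaxLearningSteps(500);          // cap each learning episode at 500 steps
lspi.setMinNewStepsForLearningPI(100);  // rerun LSPI once 100 new observations have been collected

for (int i = 0; i < 50; i++) {
    Episode e = lspi.runLearningEpisode(env);
}
```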
This data gathering and replanning behavior from learning episodes is not expected to be an especially good choice.
Therefore, if you want better online data acquisition, you should consider subclassing this class
and overriding the updateDatasetWithLearningEpisode(Episode)
and shouldRereunPolicyIteration(Episode)
methods, or
the runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int)
method
itself.
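For instance, a hypothetical subclass might look like the following; rerunning policy iteration after every episode is only an illustrative choice.

```java
public class EveryEpisodeLSPI extends LSPI {

    public EveryEpisodeLSPI(SADomain domain, double gamma, DenseStateActionFeatures saFeatures) {
        super(domain, gamma, saFeatures);
    }

    @Override
    protected void updateDatasetWithLearningEpisode(Episode ea) {
        // A custom data-management strategy would go here; the superclass
        // implementation adds the episode's transitions to this object's dataset.
        super.updateDatasetWithLearningEpisode(ea);
    }

    @Override
    protected boolean shouldRereunPolicyIteration(Episode ea) {
        // Illustrative choice: rerun policy iteration after every episode rather
        // than waiting for minNewStepsForLearningPI new observations.
        return true;
    }
}
```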
Note that LSPI is not well defined for domains with terminal states. Therefore, you need to make sure your reward function returns a value for terminal transitions that offsets the effect of the state not being treated as terminal. For example, for goal states it should return a value large enough to offset any costs incurred from continuing; for failure states, it should return a negative reward large enough in magnitude to offset any gains incurred from continuing.
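For example, a goal-directed task with a -1 step cost might use a reward function along these lines (a sketch; tf is an assumed TerminalFunction that marks goal states, and the bonus of 1000 is an arbitrary illustrative value):

```java
RewardFunction rf = (s, a, sprime) -> {
    if (tf.isTerminal(sprime)) {
        // Large positive value that more than offsets the step costs the agent
        // would otherwise keep accruing if this state were not terminal.
        return 1000.;
    }
    return -1.;
};
```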
1. Lagoudakis, Michail G., and Ronald Parr. "Least-squares policy iteration." The Journal of Machine Learning Research 4 (2003): 1107-1149.
Modifier and Type | Class and Description
---|---
protected class | LSPI.SSFeatures: Pair of the state-action features and the next state-action features.
QProvider.Helper
Modifier and Type | Field and Description
---|---
protected SARSData | dataset: The SARS dataset on which LSPI is performed.
protected java.util.LinkedList<Episode> | episodeHistory: The saved previous learning episodes.
protected double | identityScalar: The initial LSPI identity matrix scalar; default is 100.
protected org.ejml.simple.SimpleMatrix | lastWeights: The last weight values set from LSTDQ.
protected Policy | learningPolicy: The learning policy followed in runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method calls.
protected double | maxChange: The maximum change in weights permitted to terminate LSPI.
protected int | maxLearningSteps: The maximum number of learning steps in an episode when the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method is called.
protected int | maxNumPlanningIterations: The maximum number of policy iterations permitted when LSPI is run from the planFromState(State) or runLearningEpisode(burlap.mdp.singleagent.environment.Environment) methods.
protected int | minNewStepsForLearningPI: The minimum number of new observations received from learning episodes before LSPI will be run again.
protected int | numEpisodesToStore: The number of the most recent learning episodes to store.
protected int | numSamplesForPlanning: The number of samples that are acquired for this object's dataset when the planFromState(State) method is called.
protected int | numStepsSinceLastLearningPI: Number of new observations received from learning episodes since LSPI was run.
protected SARSCollector | planningCollector: The data collector used by the planFromState(State) method.
protected DenseStateActionFeatures | saFeatures: The state feature database on which the linear VFA is performed.
protected DenseStateActionLinearVFA | vfa: The object that performs value function approximation given the weights that are estimated.
actionTypes, debugCode, domain, gamma, hashingFactory, model, usingOptionModel
Constructor and Description
---
LSPI(SADomain domain, double gamma, DenseStateActionFeatures saFeatures): Initializes.
LSPI(SADomain domain, double gamma, DenseStateActionFeatures saFeatures, SARSData dataset): Initializes.
Modifier and Type | Method and Description
---|---
java.util.List<Episode> | getAllStoredLearningEpisodes()
SARSData | getDataset(): Returns the dataset this object uses for LSPI.
double | getIdentityScalar(): Returns the initial LSPI identity matrix scalar used.
Episode | getLastLearningEpisode()
Policy | getLearningPolicy(): The learning policy followed by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) and runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
double | getMaxChange(): The maximum change in weights required to terminate policy iteration when called from the planFromState(State), runLearningEpisode(burlap.mdp.singleagent.environment.Environment), or runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
int | getMaxLearningSteps(): The maximum number of learning steps permitted by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method.
int | getMaxNumPlanningIterations(): The maximum number of policy iterations that will be used by the planFromState(State) method.
int | getMinNewStepsForLearningPI(): The minimum number of new learning observations before policy iteration is run again.
int | getNumSamplesForPlanning(): Gets the number of SARS samples that will be gathered by the planFromState(State) method.
SARSCollector | getPlanningCollector(): Gets the SARSCollector used by the planFromState(State) method for collecting data.
DenseStateActionFeatures | getSaFeatures(): Returns the state-action features used.
void | initializeForPlanning(int numSamplesForPlanning): Sets the number of SARSData.SARS samples to use for planning when the planFromState(State) method is called.
void | initializeForPlanning(int numSamplesForPlanning, SARSCollector planningCollector): Sets the number of SARSData.SARS samples, and the SARSCollector to use to collect samples, for planning when the planFromState(State) method is called.
org.ejml.simple.SimpleMatrix | LSTDQ(): Runs LSTDQ on this object's current SARSData dataset.
protected org.ejml.simple.SimpleMatrix | phiConstructor(double[] features, int nf): Constructs the state-action feature vector as a SimpleMatrix.
GreedyQPolicy | planFromState(State initialState): Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
double | qValue(State s, Action a): Returns the QValue for the given state-action pair.
java.util.List<QValue> | qValues(State s): Returns a List of QValue objects for every permissible action for the given input state.
void | resetSolver(): This method resets all solver results so that a solver can be restarted fresh as if it had never solved the MDP.
Episode | runLearningEpisode(Environment env)
Episode | runLearningEpisode(Environment env, int maxSteps)
GreedyQPolicy | runPolicyIteration(int numIterations, double maxChange): Runs LSPI for either numIterations or until the change in the weight matrix is no greater than maxChange.
void | setDataset(SARSData dataset): Sets the SARS dataset this object will use for LSPI.
void | setIdentityScalar(double identityScalar): Sets the initial LSPI identity matrix scalar used.
void | setLearningPolicy(Policy learningPolicy): Sets the learning policy followed by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) and runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
void | setMaxChange(double maxChange): Sets the maximum change in weights required to terminate policy iteration when called from the planFromState(State), runLearningEpisode(burlap.mdp.singleagent.environment.Environment), or runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
void | setMaxLearningSteps(int maxLearningSteps): Sets the maximum number of learning steps permitted by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method.
void | setMaxNumPlanningIterations(int maxNumPlanningIterations): Sets the maximum number of policy iterations that will be used by the planFromState(State) method.
void | setMinNewStepsForLearningPI(int minNewStepsForLearningPI): Sets the minimum number of new learning observations before policy iteration is run again.
void | setNumEpisodesToStore(int numEps)
void | setNumSamplesForPlanning(int numSamplesForPlanning): Sets the number of SARS samples that will be gathered by the planFromState(State) method.
void | setPlanningCollector(SARSCollector planningCollector): Sets the SARSCollector used by the planFromState(State) method for collecting data.
void | setSaFeatures(DenseStateActionFeatures saFeatures): Sets the state-action features to use.
protected boolean | shouldRereunPolicyIteration(Episode ea): Returns whether LSPI should be rerun given the latest learning episode results.
protected void | updateDatasetWithLearningEpisode(Episode ea): Updates this object's SARSData to include the results of a learning episode.
double | value(State s): Returns the value function evaluation of the given state.
addActionType, applicableActions, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, stateHash, toggleDebugPrinting
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
addActionType, getActionTypes, getDebugCode, getDomain, getGamma, getHashingFactory, getModel, setActionTypes, setDebugCode, setDomain, setGamma, setHashingFactory, setModel, solverInit, toggleDebugPrinting
protected DenseStateActionLinearVFA vfa
protected SARSData dataset
protected DenseStateActionFeatures saFeatures
protected double identityScalar
protected org.ejml.simple.SimpleMatrix lastWeights
protected int numSamplesForPlanning
The number of samples that are acquired for this object's dataset when the planFromState(State) method is called.
protected double maxChange
protected SARSCollector planningCollector
The data collector used by the planFromState(State) method.
protected int maxNumPlanningIterations
The maximum number of policy iterations permitted when LSPI is run from the planFromState(State) or runLearningEpisode(burlap.mdp.singleagent.environment.Environment) methods.
protected Policy learningPolicy
The learning policy followed in runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method calls. Default is 0.1 epsilon greedy.
protected int maxLearningSteps
The maximum number of learning steps in an episode when the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method is called. Default is INT_MAX.
protected int numStepsSinceLastLearningPI
protected int minNewStepsForLearningPI
protected java.util.LinkedList<Episode> episodeHistory
protected int numEpisodesToStore
public LSPI(SADomain domain, double gamma, DenseStateActionFeatures saFeatures)
Initializes.
domain - the problem domain
gamma - the discount factor
saFeatures - the state-action features to use

public LSPI(SADomain domain, double gamma, DenseStateActionFeatures saFeatures, SARSData dataset)
Initializes.
domain - the problem domain
gamma - the discount factor
saFeatures - the state-action features
dataset - the dataset of transitions to use

public void initializeForPlanning(int numSamplesForPlanning)
Sets the number of SARSData.SARS samples to use for planning when the planFromState(State) method is called. If the RewardFunction and TerminalFunction are not set, the planFromState(State) method will throw a runtime exception.
numSamplesForPlanning - the number of SARS samples to collect for planning.

public void initializeForPlanning(int numSamplesForPlanning, SARSCollector planningCollector)
Sets the number of SARSData.SARS samples, and the SARSCollector to use to collect samples, for planning when the planFromState(State) method is called. If the RewardFunction and TerminalFunction are not set, the planFromState(State) method will throw a runtime exception.
numSamplesForPlanning - the number of SARS samples to collect for planning.
planningCollector - the dataset collector to use for planning

public void setDataset(SARSData dataset)
Sets the SARS dataset this object will use for LSPI.
dataset - the SARS dataset

public SARSData getDataset()
Returns the dataset this object uses for LSPI.

public DenseStateActionFeatures getSaFeatures()
Returns the state-action features used.

public void setSaFeatures(DenseStateActionFeatures saFeatures)
Sets the state-action features to use.
saFeatures - the state-action features to use

public double getIdentityScalar()
Returns the initial LSPI identity matrix scalar used.

public void setIdentityScalar(double identityScalar)
Sets the initial LSPI identity matrix scalar used.
identityScalar - the initial LSPI identity matrix scalar used.

public int getNumSamplesForPlanning()
Gets the number of SARS samples that will be gathered by the planFromState(State) method.
Returns: the number of SARS samples that will be gathered by the planFromState(State) method.

public void setNumSamplesForPlanning(int numSamplesForPlanning)
Sets the number of SARS samples that will be gathered by the planFromState(State) method.
numSamplesForPlanning - the number of SARS samples that will be gathered by the planFromState(State) method.

public SARSCollector getPlanningCollector()
Gets the SARSCollector used by the planFromState(State) method for collecting data.
Returns: the SARSCollector used by the planFromState(State) method for collecting data.

public void setPlanningCollector(SARSCollector planningCollector)
Sets the SARSCollector used by the planFromState(State) method for collecting data.
planningCollector - the SARSCollector used by the planFromState(State) method for collecting data.

public int getMaxNumPlanningIterations()
The maximum number of policy iterations that will be used by the planFromState(State) method.
Returns: the maximum number of policy iterations that will be used by the planFromState(State) method.

public void setMaxNumPlanningIterations(int maxNumPlanningIterations)
Sets the maximum number of policy iterations that will be used by the planFromState(State) method.
maxNumPlanningIterations - the maximum number of policy iterations that will be used by the planFromState(State) method.

public Policy getLearningPolicy()
The learning policy followed by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) and runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
Returns: the learning policy followed by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) and runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.

public void setLearningPolicy(Policy learningPolicy)
Sets the learning policy followed by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) and runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
learningPolicy - the learning policy followed by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) and runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.

public int getMaxLearningSteps()
The maximum number of learning steps permitted by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method.
Returns: the maximum number of learning steps permitted by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method.

public void setMaxLearningSteps(int maxLearningSteps)
Sets the maximum number of learning steps permitted by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method.
maxLearningSteps - the maximum number of learning steps permitted by the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) method.

public int getMinNewStepsForLearningPI()
The minimum number of new learning observations before policy iteration is run again.

public void setMinNewStepsForLearningPI(int minNewStepsForLearningPI)
Sets the minimum number of new learning observations before policy iteration is run again.
minNewStepsForLearningPI - the minimum number of new learning observations before policy iteration is run again.

public double getMaxChange()
The maximum change in weights required to terminate policy iteration when called from the planFromState(State), runLearningEpisode(burlap.mdp.singleagent.environment.Environment), or runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
Returns: the maximum change in weights required to terminate policy iteration when called from the planFromState(State), runLearningEpisode(burlap.mdp.singleagent.environment.Environment), or runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.

public void setMaxChange(double maxChange)
Sets the maximum change in weights required to terminate policy iteration when called from the planFromState(State), runLearningEpisode(burlap.mdp.singleagent.environment.Environment), or runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.
maxChange - the maximum change in weights required to terminate policy iteration when called from the runLearningEpisode(burlap.mdp.singleagent.environment.Environment) or runLearningEpisode(burlap.mdp.singleagent.environment.Environment, int) methods.

public org.ejml.simple.SimpleMatrix LSTDQ()
Runs LSTDQ on this object's current SARSData dataset.
Returns: the estimated weights as a SimpleMatrix object.

public GreedyQPolicy runPolicyIteration(int numIterations, double maxChange)
Runs LSPI for either numIterations or until the change in the weight matrix is no greater than maxChange.
numIterations - the maximum number of policy iterations.
maxChange - when the weight change is smaller than this value, LSPI terminates.
Returns: a GreedyQPolicy using this object as the QProvider source.

protected org.ejml.simple.SimpleMatrix phiConstructor(double[] features, int nf)
Constructs the state-action feature vector as a SimpleMatrix.
features - the state-action features
nf - the total number of state-action features.
Returns: the state-action feature vector as a SimpleMatrix.

public java.util.List<QValue> qValues(State s)
Description copied from interface: QProvider
Returns a List of QValue objects for every permissible action for the given input state.

public double qValue(State s, Action a)
Description copied from interface: QFunction
Returns the QValue for the given state-action pair.

public double value(State s)
Description copied from interface: ValueFunction
Returns the value function evaluation of the given state.
Specified by: value in interface ValueFunction
s - the state to evaluate.

public GreedyQPolicy planFromState(State initialState)
Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
Specified by: planFromState in interface Planner
initialState - the initial state of the planning problem
Returns: a GreedyQPolicy.

public void resetSolver()
Description copied from interface: MDPSolverInterface
This method resets all solver results so that a solver can be restarted fresh as if it had never solved the MDP.
Specified by: resetSolver in interface MDPSolverInterface
Specified by: resetSolver in class MDPSolver

public Episode runLearningEpisode(Environment env)
Specified by: runLearningEpisode in interface LearningAgent

public Episode runLearningEpisode(Environment env, int maxSteps)
Specified by: runLearningEpisode in interface LearningAgent

protected void updateDatasetWithLearningEpisode(Episode ea)
Updates this object's SARSData to include the results of a learning episode.
ea - the learning episode as an Episode object.

protected boolean shouldRereunPolicyIteration(Episode ea)
Returns whether LSPI should be rerun given the latest learning episode results and the numStepsSinceLastLearningPI threshold.
ea - the most recent learning episode

public Episode getLastLearningEpisode()

public void setNumEpisodesToStore(int numEps)

public java.util.List<Episode> getAllStoredLearningEpisodes()