UCT

java.lang.Object
- burlap.behavior.singleagent.MDPSolver
- - burlap.behavior.singleagent.planning.stochastic.montecarlo.uct.UCT

All Implemented Interfaces:

MDPSolverInterface, Planner, QFunction, ValueFunction
```
public class UCT
extends MDPSolver
implements Planner, QFunction
```
An implementation of UCT [1]. This class can be augmented with a goal state specification (using a StateConditionTest) that will cause the planning algorithm to terminate early once it has found a path to the goal. This may be useful if randomly finding the goal state is rare.

The class also implements the QFunction interface. However, it only returns Q-values for a state that is the root node of the tree. If the queried state is not the root node, the planner automatically resets its planning results, replans with that state as the root node, and then returns the result. This allows the client to use a GreedyQPolicy with this planner that replans with each step in the world, so that the Q-values for every state are computed for the same horizon. Replanning fresh after each step in the world is the standard UCT approach. If you instead want a policy that walks through the tree generated from some source state (so that each step computes a Q-value for a shorter horizon than the step before), you can use the UCTTreeWalkPolicy. The UCTTreeWalkPolicy is more computationally efficient than replanning at each step, but its performance may degrade with each step, since each step has a shorter horizon from which to plan and may have fewer samples from which its Q-values were estimated.
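
The root-only Q-value behavior described above can be illustrated with a small toy model. This is a sketch, not BURLAP code: the class `RootOnlyQSketch`, its `planner` function, and the use of `String` for states and actions are all hypothetical stand-ins chosen so the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model (not BURLAP code): Q-values are served only for the current root.
// A query for any other state discards the old results and replans from it.
class RootOnlyQSketch {
    // Hypothetical stand-in for "run the rollouts from this state".
    private final Function<String, Map<String, Double>> planner;
    private String root;                                   // state the current results were planned from
    private Map<String, Double> rootQs = new HashMap<>();  // action name -> Q-value at the root
    int planCount = 0;                                     // how many times planning was run

    RootOnlyQSketch(Function<String, Map<String, Double>> planner) {
        this.planner = planner;
    }

    Map<String, Double> getQs(String state) {
        if (!state.equals(root)) {   // not the root: reset and replan from this state
            root = state;
            rootQs = planner.apply(state);
            planCount++;
        }
        return rootQs;
    }

    public static void main(String[] args) {
        RootOnlyQSketch q = new RootOnlyQSketch(state -> {
            Map<String, Double> qs = new HashMap<>();
            qs.put("left", 0.0);
            qs.put("right", 1.0);
            return qs;
        });
        q.getQs("s0");
        q.getQs("s0"); // same root, no replanning
        q.getQs("s1"); // different state, replans
        System.out.println("plans run: " + q.planCount); // prints "plans run: 2"
    }
}
```

A greedy policy that queries such an object once per step in the world therefore triggers one fresh planning pass per step, which is the replanning pattern the description refers to.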

1. Kocsis, Levente, and Csaba Szepesvari. "Bandit based monte-carlo planning." ECML (2006). 282-293.

Author:

James MacGlashan

Nested Class Summary
- Nested classes/interfaces inherited from interface burlap.behavior.valuefunction.QFunction
  QFunction.QFunctionHelper

Field Summary

Fields
Modifier and Type	Field and Description
`protected UCTActionNode.UCTActionConstructor`	`actionNodeConstructor`
`protected double`	`explorationBias`
`protected boolean`	`foundGoal`
`protected boolean`	`foundGoalOnRollout`
`protected StateConditionTest`	`goalCondition`
`protected int`	`maxHorizon`
`protected int`	`maxRollOutsFromRoot`
`protected int`	`numRollOutsFromRoot`
`protected int`	`numVisits`
`protected java.util.Random`	`rand`
`protected UCTStateNode`	`root`
`protected java.util.List<java.util.Map<HashableState,UCTStateNode>>`	`stateDepthIndex`
`protected UCTStateNode.UCTStateConstructor`	`stateNodeConstructor`
`protected java.util.Map<HashableState,java.util.List<UCTStateNode>>`	`statesToStateNodes`
`protected int`	`treeSize`
`protected java.util.Set<HashableState>`	`uniqueStatesInTree`

Fields inherited from class burlap.behavior.singleagent.MDPSolver
actions, debugCode, domain, gamma, hashingFactory, mapToStateIndex, rf, tf

Constructor Summary

Constructors
Constructor and Description
`UCT(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, int horizon, int nRollouts, int explorationBias)` Initializes UCT

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`addNodeToIndexTree(UCTStateNode snode)` Adds a `UCTStateNode` to the UCT tree
`protected UCTActionNode`	`bestReturnAction(UCTStateNode snode)` Returns the `UCTActionNode` with the highest average sample Q-value.
`protected double`	`computeUCTQ(UCTStateNode snode, UCTActionNode anode)` Returns the upper confidence Q-value for a given state node and action node.
`protected boolean`	`containsActionPreference(UCTStateNode snode)` Returns true if the sample returns for any actions are different
`protected double`	`explorationQBoost(int ns, int na)` Returns the extra value added to the average sample Q-value that is used to produce the upper confidence Q-value.
`QValue`	`getQ(State s, AbstractGroundedAction a)` Returns the `QValue` for the given state-action pair.
`java.util.List<QValue>`	`getQs(State s)` Returns a `List` of `QValue` objects for every permissible action for the given input state.
`UCTStateNode`	`getRoot()` Returns the root node of the UCT tree.
`protected void`	`initializeRollOut()`
`GreedyQPolicy`	`planFromState(State initialState)` Plans from the input state and then returns a `GreedyQPolicy` that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
`protected UCTStateNode`	`queryTreeIndex(HashableState sh, int d)` Returns the `UCTStateNode` for the given (hashed) state at the given depth.
`void`	`resetSolver()` This method resets all solver results so that a solver can be restarted fresh as if it had never solved the MDP.
`protected UCTActionNode`	`selectActionNode(UCTStateNode snode)` Selects which action to take.
`boolean`	`stopPlanning()` Returns true if rollouts and planning should cease.
`double`	`treeRollOut(UCTStateNode node, int depth, int childrenLeftToAdd)` Performs a rollout in the UCT tree from the given node, keeping track of how many new nodes can be added to the tree.
`protected void`	`UCTInit(Domain domain, RewardFunction rf, TerminalFunction tf, double gamma, HashableStateFactory hashingFactory, int horizon, int nRollouts, int explorationBias)`
`void`	`useGoalConditionStopCriteria(StateConditionTest gc)` Tells the planner to stop planning if a goal state is ever found.
`double`	`value(State s)` Returns the value function evaluation of the given state.

Methods inherited from class burlap.behavior.singleagent.MDPSolver
addNonDomainReferencedAction, getActions, getAllGroundedActions, getDebugCode, getDomain, getGamma, getHashingFactory, getRf, getRF, getTf, getTF, setActions, setDebugCode, setDomain, setGamma, setHashingFactory, setRf, setTf, solverInit, stateHash, toggleDebugPrinting, translateAction

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface burlap.behavior.singleagent.MDPSolverInterface
addNonDomainReferencedAction, getActions, getDebugCode, getDomain, getGamma, getHashingFactory, getRf, getRF, getTf, getTF, setActions, setDebugCode, setDomain, setGamma, setHashingFactory, setRf, setTf, solverInit, toggleDebugPrinting

- Field Detail
  - stateDepthIndex
```
protected java.util.List<java.util.Map<HashableState,UCTStateNode>> stateDepthIndex
```
  - statesToStateNodes
```
protected java.util.Map<HashableState,java.util.List<UCTStateNode>> statesToStateNodes
```
  - root
```
protected UCTStateNode root
```
  - maxHorizon
```
protected int maxHorizon
```
  - maxRollOutsFromRoot
```
protected int maxRollOutsFromRoot
```
  - numRollOutsFromRoot
```
protected int numRollOutsFromRoot
```
  - explorationBias
```
protected double explorationBias
```
  - stateNodeConstructor
```
protected UCTStateNode.UCTStateConstructor stateNodeConstructor
```
  - actionNodeConstructor
```
protected UCTActionNode.UCTActionConstructor actionNodeConstructor
```
  - goalCondition
```
protected StateConditionTest goalCondition
```
  - foundGoal
```
protected boolean foundGoal
```
  - foundGoalOnRollout
```
protected boolean foundGoalOnRollout
```
  - uniqueStatesInTree
```
protected java.util.Set<HashableState> uniqueStatesInTree
```
  - treeSize
```
protected int treeSize
```
  - numVisits
```
protected int numVisits
```
  - rand
```
protected java.util.Random rand
```
- Constructor Detail
  - UCT
```
public UCT(Domain domain,
   RewardFunction rf,
   TerminalFunction tf,
   double gamma,
   HashableStateFactory hashingFactory,
   int horizon,
   int nRollouts,
   int explorationBias)
```
    Initializes UCT
    
    Parameters:
    domain - the domain in which to plan
    rf - the reward function to use
    tf - the terminal function to use
    gamma - the discount factor
    hashingFactory - the state hashing factory
    horizon - the planning horizon
    nRollouts - the number of rollouts to perform
    explorationBias - the exploration bias constant (suggested >2)
- Method Detail
  - UCTInit
```
protected void UCTInit(Domain domain,
           RewardFunction rf,
           TerminalFunction tf,
           double gamma,
           HashableStateFactory hashingFactory,
           int horizon,
           int nRollouts,
           int explorationBias)
```
  - getRoot
```
public UCTStateNode getRoot()
```
    Returns the root node of the UCT tree.
    
    Returns:
    the root node of the UCT tree.
  - useGoalConditionStopCriteria
```
public void useGoalConditionStopCriteria(StateConditionTest gc)
```
    Tells the planner to stop planning if a goal state is ever found.
    
    Parameters:
    gc - a StateConditionTest object used to specify goal states (wherever it evaluates as true).
  - planFromState
```
public GreedyQPolicy planFromState(State initialState)
```
    Plans from the input state and then returns a GreedyQPolicy that greedily selects the action with the highest Q-value and breaks ties uniformly randomly.
    
    Specified by:
    
    planFromState in interface Planner
    
    Parameters:
    initialState - the initial state of the planning problem
    
    Returns:
    a GreedyQPolicy.
  - getQs
```
public java.util.List<QValue> getQs(State s)
```
    Description copied from interface: QFunction
    
    Returns a List of QValue objects for every permissible action for the given input state.
    
    Specified by:
    
    getQs in interface QFunction
    
    Parameters:
    s - the state for which Q-values are to be returned.
    
    Returns:
    a List of QValue objects for every permissible action for the given input state.
  - getQ
```
public QValue getQ(State s,
          AbstractGroundedAction a)
```
    Description copied from interface: QFunction
    
    Returns the QValue for the given state-action pair.
    
    Specified by:
    
    getQ in interface QFunction
    
    Parameters:
    s - the input state
    a - the input action
    
    Returns:
    the QValue for the given state-action pair.
  - value
```
public double value(State s)
```
    Description copied from interface: ValueFunction
    
    Returns the value function evaluation of the given state. If the value is not stored, then the default value specified by the ValueFunctionInitialization object of this class is returned.
    
    Specified by:
    
    value in interface ValueFunction
    
    Parameters:
    s - the state to evaluate.
    
    Returns:
    the value function evaluation of the given state.
  - resetSolver
```
public void resetSolver()
```
    Description copied from interface: MDPSolverInterface
    
    This method resets all solver results so that a solver can be restarted fresh as if it had never solved the MDP.
    
    Specified by:
    
    resetSolver in interface MDPSolverInterface
    
    Specified by:
    
    resetSolver in class MDPSolver
  - initializeRollOut
```
protected void initializeRollOut()
```
  - treeRollOut
```
public double treeRollOut(UCTStateNode node,
                 int depth,
                 int childrenLeftToAdd)
```
    Performs a rollout in the UCT tree from the given node, keeping track of how many new nodes can be added to the tree.
    
    Parameters:
    node - the node from which to rollout
    depth - the depth of the node
    childrenLeftToAdd - the number of new subsequent nodes that can be connected to the tree
    
    Returns:
    the sample return from rolling out from this node
  - stopPlanning
```
public boolean stopPlanning()
```
    Returns true if rollouts and planning should cease. Planning will stop if the planner is told to terminate upon finding a goal and one was found, or if the maximum number of rollouts has already been performed.
    
    Returns:
    true if rollouts and planning should cease; false otherwise.
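
    The stopping rule described above can be sketched as follows. The field names mirror the Field Summary, but the class `StopPlanningSketch`, the `useGoalCondition` flag, and the exact comparison are assumptions drawn from the description rather than from the library source.

```java
// Sketch of the stopping rule as described in the text (not BURLAP's source).
class StopPlanningSketch {
    boolean useGoalCondition;   // hypothetical flag: a goal condition was supplied
    boolean foundGoal;          // a goal state has been reached during planning
    int numRollOutsFromRoot;    // rollouts performed so far
    int maxRollOutsFromRoot;    // rollout budget

    boolean stopPlanning() {
        if (useGoalCondition && foundGoal) {
            return true;        // told to stop at a goal, and one was found
        }
        // Otherwise stop only once the rollout budget is spent.
        return numRollOutsFromRoot >= maxRollOutsFromRoot;
    }
}
```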
  - selectActionNode
```
protected UCTActionNode selectActionNode(UCTStateNode snode)
```
    Selects which action to take. Unexplored actions from the node are selected first. If all actions have been explored, then the action with the highest upper confidence Q-value is selected, with ties broken randomly.
    
    Parameters:
    snode - the UCT node from which to select an action.
    
    Returns:
    the UCTActionNode to be taken.
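
    A minimal sketch of this selection order is given below. `SelectActionSketch` and `ActionStats` are hypothetical stand-ins for the real node classes, the upper confidence Q-value is passed in as a function so the sketch does not commit to a formula, and taking the first unexplored action found is an assumption (the description does not say which unexplored action is chosen).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

class SelectActionSketch {

    // Hypothetical stand-in for an action node: a visit count and an average sample return.
    static class ActionStats {
        final String name;
        final int visits;
        final double averageReturn;

        ActionStats(String name, int visits, double averageReturn) {
            this.name = name;
            this.visits = visits;
            this.averageReturn = averageReturn;
        }
    }

    static ActionStats select(List<ActionStats> actions,
                              ToDoubleFunction<ActionStats> upperConfidenceQ,
                              Random rand) {
        if (actions.isEmpty()) {
            throw new IllegalArgumentException("no actions to select from");
        }
        // Unexplored actions come first (returning the first one found is an assumption).
        for (ActionStats action : actions) {
            if (action.visits == 0) {
                return action;
            }
        }
        // All explored: collect every action that attains the highest upper confidence Q-value.
        double best = Double.NEGATIVE_INFINITY;
        List<ActionStats> ties = new ArrayList<>();
        for (ActionStats action : actions) {
            double q = upperConfidenceQ.applyAsDouble(action);
            if (q > best) {
                best = q;
                ties.clear();
                ties.add(action);
            } else if (q == best) {
                ties.add(action);
            }
        }
        // Break ties uniformly at random.
        return ties.get(rand.nextInt(ties.size()));
    }
}
```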
  - computeUCTQ
```
protected double computeUCTQ(UCTStateNode snode,
                 UCTActionNode anode)
```
    Returns the upper confidence Q-value for a given state node and action node.
    
    Parameters:
    snode - the state node
    anode - the action node
    
    Returns:
    the upper confidence Q-value
  - explorationQBoost
```
protected double explorationQBoost(int ns,
                       int na)
```
    Returns the extra value added to the average sample Q-value that is used to produce the upper confidence Q-value.
    
    Parameters:
    ns - the number of times the state node has been visited
    na - the number of times the action node has been visited
    
    Returns:
    the extra value added to the average sample Q-value that is used to produce the upper confidence Q-value.
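
    As an illustration of how such a boost can be computed, the sketch below assumes the standard UCB1-style bonus, c * sqrt(ln(ns) / na), where c is the exploration bias. The page itself does not state the formula, so this form, the class name `ExplorationBoostSketch`, and the explicit `explorationBias` parameter are assumptions.

```java
class ExplorationBoostSketch {

    // Assumed UCB1-style bonus: grows slowly with state visits, shrinks with action visits.
    static double explorationQBoost(double explorationBias, int ns, int na) {
        return explorationBias * Math.sqrt(Math.log(ns) / na);
    }

    // Upper confidence Q-value: average sample return plus the exploration bonus.
    static double upperConfidenceQ(double averageReturn, double explorationBias, int ns, int na) {
        return averageReturn + explorationQBoost(explorationBias, ns, na);
    }
}
```

    Under this form, an action tried rarely relative to its state receives a larger bonus, so it is revisited until its estimate becomes reliable.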
  - queryTreeIndex
```
protected UCTStateNode queryTreeIndex(HashableState sh,
                          int d)
```
    Returns the UCTStateNode for the given (hashed) state at the given depth.
    
    Parameters:
    sh - the state whose node should be returned
    d - the depth of the state
    
    Returns:
    the corresponding UCTStateNode
  - addNodeToIndexTree
```
protected void addNodeToIndexTree(UCTStateNode snode)
```
    Adds a UCTStateNode to the UCT tree
    
    Parameters:
    snode - the UCTStateNode to add
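
    The depth-indexed lookup behind queryTreeIndex and addNodeToIndexTree can be sketched with the same shape as the stateDepthIndex field, a list of per-depth maps. In this sketch `String` stands in for HashableState, `Node` for UCTStateNode, and returning null on a miss is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DepthIndexSketch {

    // Hypothetical stand-in for a state node: the state it represents and its depth in the tree.
    static class Node {
        final String state;
        final int depth;

        Node(String state, int depth) {
            this.state = state;
            this.depth = depth;
        }
    }

    // One map per depth: element d maps each state seen at depth d to its node.
    private final List<Map<String, Node>> stateDepthIndex = new ArrayList<>();

    void addNodeToIndexTree(Node node) {
        // Grow the list so that an entry exists for the node's depth.
        while (stateDepthIndex.size() <= node.depth) {
            stateDepthIndex.add(new HashMap<>());
        }
        stateDepthIndex.get(node.depth).put(node.state, node);
    }

    Node queryTreeIndex(String state, int depth) {
        if (depth < 0 || depth >= stateDepthIndex.size()) {
            return null; // nothing indexed at this depth (assumed miss behavior)
        }
        return stateDepthIndex.get(depth).get(state);
    }
}
```

    Indexing by depth keeps the same state at two different depths as two separate nodes, which matters because their remaining horizons differ.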
  - bestReturnAction
```
protected UCTActionNode bestReturnAction(UCTStateNode snode)
```
    Returns the UCTActionNode with the highest average sample Q-value. Ties are broken by returning the first UCTActionNode with the highest value.
    
    Parameters:
    snode - the UCTStateNode to query
    
    Returns:
    the UCTActionNode with the highest average sample Q-value
  - containsActionPreference
```
protected boolean containsActionPreference(UCTStateNode snode)
```
    Returns true if the sample returns for any actions are different
    
    Parameters:
    snode - the node to check for an action preference
    
    Returns:
    true if the sample returns for any actions are different; false otherwise or if there is only one action to take.
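
    The two queries above, bestReturnAction and containsActionPreference, can be sketched over a plain array of average sample returns, one entry per action. The class `ReturnQueriesSketch` and the array representation are hypothetical simplifications, and returning -1 for an empty array is an assumption.

```java
class ReturnQueriesSketch {

    // Index of the highest average return; strict '>' keeps the first of any tied maxima.
    static int bestReturnIndex(double[] averageReturns) {
        int best = -1;
        for (int i = 0; i < averageReturns.length; i++) {
            if (best == -1 || averageReturns[i] > averageReturns[best]) {
                best = i;
            }
        }
        return best; // -1 when there are no actions
    }

    // True if any two actions have different average returns; false for zero or one action.
    static boolean containsActionPreference(double[] averageReturns) {
        for (int i = 1; i < averageReturns.length; i++) {
            if (averageReturns[i] != averageReturns[0]) {
                return true;
            }
        }
        return false;
    }
}
```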
