This policy is for use with UCT. Note that UCT can only guarantee the policy for the initial state of planning.
However, the policy for states that lie on the greedy path from the initial state is likely acceptable as well,
since those states were used to determine which action to take in the initial state.
This class defines the policy for states that lie along the greedy path of the UCT
tree. Any state not visited along the greedy path in the UCT tree is excluded from the policy, and querying
this policy for such a state will result in an error.
A more robust policy would cause the valueFunction to be called at each queried state to build a new tree.
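The error behavior described above can be sketched as follows. This is an illustrative, self-contained toy, not the actual planner API: the class name GreedyPathPolicy, the String-typed states and actions, and the put method are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a policy defined only for states recorded along the
// greedy path; querying any other state raises an error, mirroring the
// behavior described above. States and actions are simplified to Strings.
class GreedyPathPolicy {
    private final Map<String, String> pathPolicy = new HashMap<>();

    // Record the greedy action chosen at a state on the greedy path.
    void put(String state, String greedyAction) {
        pathPolicy.put(state, greedyAction);
    }

    // Return the stored greedy action; fail for states off the greedy path.
    String action(String state) {
        String a = pathPolicy.get(state);
        if (a == null) {
            throw new RuntimeException("Policy undefined for state: " + state);
        }
        return a;
    }
}
```

A caller that wanders off the greedy path therefore fails fast rather than receiving an arbitrary action.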
This method returns an action sampled by the policy for the given state. If the defined policy is
stochastic, then multiple calls to this method for the same state may return different actions. The sampling
is performed with respect to the action distribution returned by getActionDistributionForState.
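Sampling with respect to such a distribution is typically done by roulette-wheel selection over the probability mass. The sketch below assumes a simplified (action, probability) pair class in place of ActionProb; the class and method names are illustrative, not the library's.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of sampling an action from an action probability
// distribution, as described above.
class ActionSampler {
    static class ActionProbability {
        final String action;
        final double probability;
        ActionProbability(String action, double probability) {
            this.action = action;
            this.probability = probability;
        }
    }

    // Roulette-wheel sampling: draw r in [0,1) and walk the cumulative mass
    // until it exceeds r.
    static String sample(List<ActionProbability> dist, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.;
        for (ActionProbability ap : dist) {
            cumulative += ap.probability;
            if (r < cumulative) {
                return ap.action;
            }
        }
        // Guard against floating-point round-off leaving the sum just below 1.
        return dist.get(dist.size() - 1).action;
    }
}
```

For a deterministic policy the distribution has a single entry with probability 1, so every draw returns the same action; for a stochastic policy, repeated draws vary as the description above notes.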
This method returns the action probability distribution defined by the policy. The action distribution is represented
as a list of ActionProb objects, each of which specifies a grounded action and the probability of that grounded action
being taken. The returned list does not need to include actions with probability 0.
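The representation described above can be sketched with a simplified pair class standing in for ActionProb. Everything here (the class names, the distributionFor helper) is an assumption made for illustration; it only shows the shape of the returned list, including the degenerate deterministic case.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: an (action, probability) pair standing in for
// ActionProb, and the distribution a deterministic policy would return --
// a single entry carrying all the probability mass. Actions with
// probability 0 are simply omitted, as the description above permits.
class DeterministicDistribution {
    static class ActionProbability {
        final String action;
        final double probability;
        ActionProbability(String action, double probability) {
            this.action = action;
            this.probability = probability;
        }
    }

    // A deterministic policy's distribution: all mass on the greedy action.
    static List<ActionProbability> distributionFor(String greedyAction) {
        return Collections.singletonList(new ActionProbability(greedyAction, 1.0));
    }
}
```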