public class BoltzmannPolicyGradient
extends java.lang.Object
| Constructor and Description |
|---|
| BoltzmannPolicyGradient() |
| Modifier and Type | Method and Description |
|---|---|
| static double[] | computeBoltzmannPolicyGradient(State s, GroundedAction a, QGradientPlanner planner, double beta)<br>Computes the gradient of a Boltzmann policy using the given differentiable planner. |
| static double[] | computePolicyGradient(DifferentiableRF rf, double beta, double[] qs, double maxBetaScaled, double logSum, double[][] gqs, int aInd)<br>Computes the gradient of a Boltzmann policy using values derived from a differentiable Boltzmann backup planner. |
| static double | logSum(double[] qs, double maxBetaScaled, double beta)<br>Computes the log of the sum of the exponentiated Q-values (scaled by beta). |
| static double | maxBetaScaled(double[] qs, double beta)<br>Given an array of Q-values, returns the maximum Q-value multiplied by the parameter beta. |
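
The descriptions of maxBetaScaled and logSum suggest the common max-shift (log-sum-exp) technique for evaluating log Σ exp(beta · Q) without overflow. The following is a minimal sketch under that assumption, not the library's source: the class name `LogSumExpSketch` is hypothetical, the method bodies are illustrative, and beta is assumed to be non-negative so that beta times the largest Q-value is also the largest scaled value.

```java
// Hypothetical illustration class; not part of the documented library.
final class LogSumExpSketch {

    // Returns beta * max(qs). Assumes qs is non-empty and beta >= 0.
    public static double maxBetaScaled(double[] qs, double beta) {
        double max = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            max = Math.max(max, q);
        }
        return beta * max;
    }

    // Returns log(sum_i exp(beta * qs[i])).
    // Subtracting maxBetaScaled inside exp keeps every exponent <= 0,
    // so the sum cannot overflow; the shift is added back afterwards.
    public static double logSum(double[] qs, double maxBetaScaled, double beta) {
        double expSum = 0.0;
        for (double q : qs) {
            expSum += Math.exp(beta * q - maxBetaScaled);
        }
        return maxBetaScaled + Math.log(expSum);
    }

    public static void main(String[] args) {
        // A naive Math.exp(2.0 * 1001.0) would overflow to infinity.
        double[] qs = {1000.0, 1001.0, 999.0};
        double beta = 2.0;
        double shift = maxBetaScaled(qs, beta);
        System.out.println(logSum(qs, shift, beta));
    }
}
```

Computing the shift once and passing it in, as the documented signatures do, lets a caller reuse the same maximum for both the log sum and any later probability computation.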
public static double[] computeBoltzmannPolicyGradient(State s, GroundedAction a, QGradientPlanner planner, double beta)

s - the input state of the policy gradient
a - the action whose policy probability gradient is being queried
planner - the differentiable QGradientPlanner
beta - the Boltzmann beta parameter. This parameter is the inverse of the Boltzmann temperature. As beta becomes larger, the policy becomes more deterministic. Should lie in [0, +infinity).

public static double[] computePolicyGradient(DifferentiableRF rf, double beta, double[] qs, double maxBetaScaled, double logSum, double[][] gqs, int aInd)

rf - the planner's DifferentiableRF
beta - the Boltzmann beta parameter. This parameter is the inverse of the Boltzmann temperature. As beta becomes larger, the policy becomes more deterministic. Should lie in [0, +infinity).
qs - an array holding the Q-value for each action
maxBetaScaled - the maximum Q-value after being scaled by the parameter beta
logSum - the log of the sum of the exponentiated Q-values
gqs - a matrix holding the Q-value gradient for each action. The matrix is indexed first by action, then by parameter dimension.
aInd - the index of the query action for which the policy's gradient is being computed

public static double maxBetaScaled(double[] qs, double beta)

qs - an array of Q-values
beta - the scaling beta parameter

public static double logSum(double[] qs, double maxBetaScaled, double beta)

qs - the Q-values
maxBetaScaled - the maximum Q-value scaled by the parameter beta
beta - the scaling value
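
For orientation, a Boltzmann (softmax) policy assigns π(a) = exp(beta · Q(a)) / Σ_i exp(beta · Q(i)), and differentiating that expression gives ∇π(a) = beta · π(a) · (∇Q(a) − Σ_i π(i) ∇Q(i)). The sketch below evaluates this textbook formula directly from arrays shaped like the qs and gqs parameters above. It is an illustration under stated assumptions, not the library's implementation: the class `BoltzmannGradientSketch` and its methods are hypothetical, beta is assumed non-negative, and the real computePolicyGradient takes precomputed maxBetaScaled and logSum values rather than recomputing them.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the documented library.
final class BoltzmannGradientSketch {

    // Softmax probabilities pi(i) proportional to exp(beta * qs[i]).
    // Shifting by the maximum Q-value avoids overflow (assumes beta >= 0).
    public static double[] policy(double[] qs, double beta) {
        double max = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            max = Math.max(max, q);
        }
        double[] p = new double[qs.length];
        double sum = 0.0;
        for (int i = 0; i < qs.length; i++) {
            p[i] = Math.exp(beta * (qs[i] - max));
            sum += p[i];
        }
        for (int i = 0; i < qs.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    // Gradient of pi(aInd) with respect to the parameters:
    // beta * pi(aInd) * (gqs[aInd] - sum_i pi(i) * gqs[i]).
    // gqs is indexed first by action, then by parameter dimension.
    public static double[] policyGradient(double[] qs, double[][] gqs,
                                          double beta, int aInd) {
        double[] p = policy(qs, beta);
        int d = gqs[aInd].length;
        double[] grad = new double[d];
        for (int j = 0; j < d; j++) {
            double expected = 0.0; // policy-weighted mean of Q-gradients
            for (int i = 0; i < qs.length; i++) {
                expected += p[i] * gqs[i][j];
            }
            grad[j] = beta * p[aInd] * (gqs[aInd][j] - expected);
        }
        return grad;
    }

    public static void main(String[] args) {
        // Two actions whose Q-values are themselves the two parameters,
        // so the Q-gradient matrix is the identity.
        double[] qs = {1.0, 2.0};
        double[][] gqs = {{1.0, 0.0}, {0.0, 1.0}};
        System.out.println(Arrays.toString(policyGradient(qs, gqs, 1.0, 0)));
    }
}
```

Because the action probabilities always sum to one, the gradients returned for all actions sum to zero in every parameter dimension, which makes a convenient sanity check on any implementation.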