public class BoltzmannPolicyGradient
extends java.lang.Object
Modifier and Type | Method and Description |
---|---|
protected static java.util.Set<java.lang.Integer> | combinedNonZeroPDParameters(FunctionGradient... gradients) |
static FunctionGradient | computeBoltzmannPolicyGradient(State s, Action a, DifferentiableQFunction planner, double beta) Computes the gradient of a Boltzmann policy using the given differentiable value function. |
static FunctionGradient | computePolicyGradient(double beta, double[] qs, double maxBetaScaled, double logSum, FunctionGradient[] gqs, int aInd) Computes the gradient of a Boltzmann policy using values derived from a differentiable Boltzmann backup value function. |
static double | logSum(double[] qs, double maxBetaScaled, double beta) Computes the log sum of the exponentiated Q-values (scaled by beta). |
static double | maxBetaScaled(double[] qs, double beta) Given an array of Q-values, returns the maximum Q-value multiplied by the parameter beta. |
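The summary for maxBetaScaled and logSum suggests the usual max-shift way of computing a log sum of exponentials without overflow. The following is a minimal standalone sketch of that technique, not the library's implementation: the class name LogSumSketch is hypothetical, the method bodies are assumptions inferred only from the descriptions above, and beta is assumed to be non-negative.

```java
// Hypothetical helper class; not part of the documented API.
final class LogSumSketch {

    // Returns the largest Q-value multiplied by beta.
    // Assumes qs is non-empty and beta >= 0, so that beta * max(qs)
    // is also the largest beta-scaled value.
    static double maxBetaScaled(double[] qs, double beta) {
        double max = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            max = Math.max(max, q);
        }
        return beta * max;
    }

    // Returns log(sum_i exp(beta * qs[i])).
    // Each exponent is shifted down by maxBetaScaled before exponentiating,
    // so the largest term is exp(0) = 1 and the sum cannot overflow.
    static double logSum(double[] qs, double maxBetaScaled, double beta) {
        double expSum = 0.0;
        for (double q : qs) {
            expSum += Math.exp(beta * q - maxBetaScaled);
        }
        return maxBetaScaled + Math.log(expSum);
    }
}
```

Under these assumptions, the Boltzmann probability of action i can be recovered as exp(beta * qs[i] - logSum), which is presumably why both values are passed on to computePolicyGradient.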
public static FunctionGradient computeBoltzmannPolicyGradient(State s, Action a, DifferentiableQFunction planner, double beta)
s
- the input state of the policy gradient
a
- the action whose policy probability gradient is being queried
planner
- the differentiable value function, given as a DifferentiableQFunction
beta
- the Boltzmann beta parameter. This parameter is the inverse of the Boltzmann temperature. As beta becomes larger, the policy becomes more deterministic. Should lie in [0, +infinity).
public static FunctionGradient computePolicyGradient(double beta, double[] qs, double maxBetaScaled, double logSum, FunctionGradient[] gqs, int aInd)
beta
- the Boltzmann beta parameter. This parameter is the inverse of the Boltzmann temperature. As beta becomes larger, the policy becomes more deterministic. Should lie in [0, +infinity).
qs
- an array holding the Q-value for each action.
maxBetaScaled
- the maximum Q-value after being scaled by the parameter beta
logSum
- the log sum of the exponentiated Q-values
gqs
- an array holding the Q-value gradient for each action. The array is indexed by action; each entry holds that action's gradient with respect to the parameters
aInd
- the index of the query action for which the policy's gradient is being computed
public static double maxBetaScaled(double[] qs, double beta)
qs
- an array of Q-values
beta
- the scaling beta parameter.
public static double logSum(double[] qs, double maxBetaScaled, double beta)
qs
- the Q-values
maxBetaScaled
- the maximum Q-value scaled by the parameter beta
beta
- the scaling value.
protected static java.util.Set<java.lang.Integer> combinedNonZeroPDParameters(FunctionGradient... gradients)
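To make the inputs of computePolicyGradient concrete, here is a hedged sketch of the standard softmax gradient they suggest. With pi(a) = exp(beta * Q_a - logSum), the derivative with respect to a parameter p is beta * pi(a) * (dQ_a/dp - sum_b pi(b) * dQ_b/dp). The class name BoltzmannGradientSketch is hypothetical, and the sketch uses dense double arrays in place of FunctionGradient objects, which is an assumption made to keep the example self-contained; the library's actual computation may differ.

```java
// Hypothetical helper class; dense arrays stand in for FunctionGradient.
final class BoltzmannGradientSketch {

    // qs[b]     : Q-value of action b
    // gqs[b][p] : partial derivative of Q(b) with respect to parameter p
    // aInd      : index of the action whose policy gradient is wanted
    // Assumes beta >= 0 and that every row of gqs has the same length.
    static double[] policyGradient(double beta, double[] qs, double[][] gqs, int aInd) {
        // Max-shifted log sum of exp(beta * q) to avoid overflow.
        double maxBetaScaled = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            maxBetaScaled = Math.max(maxBetaScaled, beta * q);
        }
        double expSum = 0.0;
        for (double q : qs) {
            expSum += Math.exp(beta * q - maxBetaScaled);
        }
        double logSum = maxBetaScaled + Math.log(expSum);

        // Boltzmann probability of each action.
        double[] probs = new double[qs.length];
        for (int b = 0; b < qs.length; b++) {
            probs[b] = Math.exp(beta * qs[b] - logSum);
        }

        // d pi(a) / d p = beta * pi(a) * (dQ_a/dp - sum_b pi(b) * dQ_b/dp)
        int numParams = gqs[aInd].length;
        double[] gradient = new double[numParams];
        for (int p = 0; p < numParams; p++) {
            double expectedGradient = 0.0;
            for (int b = 0; b < qs.length; b++) {
                expectedGradient += probs[b] * gqs[b][p];
            }
            gradient[p] = beta * probs[aInd] * (gqs[aInd][p] - expectedGradient);
        }
        return gradient;
    }
}
```

Because the action probabilities sum to one, the gradients returned for all actions should sum to zero for each parameter, which is a convenient sanity check on any implementation of this formula.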