BoltzmannPolicyGradient

java.lang.Object
- burlap.behavior.singleagent.learnfromdemo.mlirl.support.BoltzmannPolicyGradient

```
public class BoltzmannPolicyGradient
extends java.lang.Object
```
This class provides methods to compute the gradient of a Boltzmann policy. Several logarithmic tricks are used to avoid the overflow that a direct computation of the exponentials could cause. The methods require that the input come from a differentiable value function and a differentiable reward function, which means that the value function should implement a Boltzmann value backup instead of a Bellman value backup.

Author:

James MacGlashan.

Method Summary

Modifier and Type	Method and Description
`protected static java.util.Set<java.lang.Integer>`	`combinedNonZeroPDParameters(FunctionGradient... gradients)`
`static FunctionGradient`	`computeBoltzmannPolicyGradient(State s, Action a, DifferentiableQFunction planner, double beta)` Computes the gradient of a Boltzmann policy using the given differentiable value function.
`static FunctionGradient`	`computePolicyGradient(double beta, double[] qs, double maxBetaScaled, double logSum, FunctionGradient[] gqs, int aInd)` Computes the gradient of a Boltzmann policy using values derived from a differentiable Boltzmann backup value function.
`static double`	`logSum(double[] qs, double maxBetaScaled, double beta)` Computes the log sum of exponentiated Q-values (scaled by beta).
`static double`	`maxBetaScaled(double[] qs, double beta)` Given an array of Q-values, returns the maximum Q-value multiplied by the parameter beta.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Method Detail
  - computeBoltzmannPolicyGradient
```
public static FunctionGradient computeBoltzmannPolicyGradient(State s,
                                                              Action a,
                                                              DifferentiableQFunction planner,
                                                              double beta)
```
    Computes the gradient of a Boltzmann policy using the given differentiable value function.
    
    Parameters:
    
    s - the input state of the policy gradient
    
    a - the action whose policy probability gradient is being queried
    
    planner - the differentiable value function, given as a DifferentiableQFunction
    
    beta - the Boltzmann beta parameter, which is the inverse of the Boltzmann temperature. As beta becomes larger, the policy becomes more deterministic. Should lie in [0, +infinity).
    
    Returns:
    
    the gradient of the policy.
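
    The effect of beta described above can be illustrated with a minimal, standalone sketch. It does not use the BURLAP API: `BetaEffectSketch` and `boltzmannProbs` are hypothetical names, and the Q-values are made up. It only shows how the Boltzmann action probabilities move from uniform toward the greedy action as beta grows.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of BURLAP.
class BetaEffectSketch {

    /** Boltzmann (softmax) action probabilities for Q-values qs at inverse temperature beta. */
    static double[] boltzmannProbs(double[] qs, double beta) {
        // Subtract the largest beta-scaled Q-value before exponentiating to avoid overflow.
        double max = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            max = Math.max(max, beta * q);
        }
        double sum = 0.0;
        double[] probs = new double[qs.length];
        for (int i = 0; i < qs.length; i++) {
            probs[i] = Math.exp(beta * qs[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    public static void main(String[] args) {
        double[] qs = {1.0, 2.0}; // made-up Q-values for two actions
        for (double beta : new double[] {0.0, 1.0, 10.0}) {
            System.out.println("beta=" + beta + " -> " + Arrays.toString(boltzmannProbs(qs, beta)));
        }
    }
}
```

    With beta = 0 both actions are equally likely; at beta = 10 almost all probability mass sits on the action with the larger Q-value.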
  - computePolicyGradient
```
public static FunctionGradient computePolicyGradient(double beta,
                                                     double[] qs,
                                                     double maxBetaScaled,
                                                     double logSum,
                                                     FunctionGradient[] gqs,
                                                     int aInd)
```
    Computes the gradient of a Boltzmann policy using values derived from a differentiable Boltzmann backup value function.
    
    Parameters:
    
    beta - the Boltzmann beta parameter, which is the inverse of the Boltzmann temperature. As beta becomes larger, the policy becomes more deterministic. Should lie in [0, +infinity).
    
    qs - an array holding the Q-value for each action.
    
    maxBetaScaled - the maximum Q-value after being scaled by the parameter beta
    
    logSum - the log of the sum of the exponentiated, beta-scaled Q-values
    
    gqs - an array holding the Q-value gradient for each action. The array is indexed by action; each entry holds the gradient of that action's Q-value with respect to the parameters
    
    aInd - the index of the query action for which the policy's gradient is being computed
    
    Returns:
    
    the gradient of the policy.
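
    As a rough sketch of the quantity involved, the gradient of a Boltzmann (softmax) policy pi(a) with respect to a parameter theta is beta * pi(a) * (dQ_a/dtheta - sum_b pi(b) * dQ_b/dtheta). The standalone example below evaluates that expression. It is not the BURLAP implementation: `PolicyGradientSketch` is a hypothetical name, a dense `double[][]` stands in for the `FunctionGradient[]` argument, and the log sum is recomputed internally rather than passed in.

```java
// Hypothetical illustration; not part of BURLAP.
class PolicyGradientSketch {

    /**
     * Gradient of pi(aInd) = exp(beta*Q_aInd) / sum_b exp(beta*Q_b) with respect to the parameters.
     * gqs[b][k] is dQ_b/dtheta_k (a dense stand-in for FunctionGradient).
     */
    static double[] policyGradient(double beta, double[] qs, double[][] gqs, int aInd) {
        // Log of the normalizer, computed with the usual max shift to avoid overflow.
        double maxBetaScaled = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            maxBetaScaled = Math.max(maxBetaScaled, beta * q);
        }
        double expSum = 0.0;
        for (double q : qs) {
            expSum += Math.exp(beta * q - maxBetaScaled);
        }
        double logSum = maxBetaScaled + Math.log(expSum);

        int nParams = gqs[aInd].length;
        double[] expected = new double[nParams]; // sum_b pi(b) * dQ_b/dtheta_k
        double piA = 0.0;
        for (int b = 0; b < qs.length; b++) {
            double piB = Math.exp(beta * qs[b] - logSum); // pi(b), computed in log space
            if (b == aInd) {
                piA = piB;
            }
            for (int k = 0; k < nParams; k++) {
                expected[k] += piB * gqs[b][k];
            }
        }

        double[] grad = new double[nParams];
        for (int k = 0; k < nParams; k++) {
            grad[k] = beta * piA * (gqs[aInd][k] - expected[k]);
        }
        return grad;
    }

    public static void main(String[] args) {
        double[] qs = {0.0, 0.0};           // made-up Q-values for two actions
        double[][] gqs = {{1.0}, {0.0}};    // made-up Q-value gradients, one parameter
        System.out.println(policyGradient(1.0, qs, gqs, 0)[0]);
        System.out.println(policyGradient(1.0, qs, gqs, 1)[0]);
    }
}
```

    Because the action probabilities sum to one, the per-action policy gradients for a given parameter sum to zero, which is a convenient sanity check.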
  - maxBetaScaled
```
public static double maxBetaScaled(double[] qs,
                                   double beta)
```
    Given an array of Q-values, returns the maximum Q-value multiplied by the parameter beta.
    
    Parameters:
    
    qs - an array of Q-values
    
    beta - the scaling beta parameter.
    
    Returns:
    
    the maximum Q-value multiplied by the parameter beta
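
    A minimal sketch of what this description amounts to, written as a standalone class (`MaxBetaScaledSketch` is a hypothetical name, and the body is an assumption based on the description, not the BURLAP source):

```java
// Hypothetical illustration; not part of BURLAP.
class MaxBetaScaledSketch {

    /** Returns beta times the largest entry of qs. */
    static double maxBetaScaled(double[] qs, double beta) {
        double max = Double.NEGATIVE_INFINITY;
        for (double q : qs) {
            if (q > max) {
                max = q;
            }
        }
        return beta * max;
    }

    public static void main(String[] args) {
        double[] qs = {1.0, 3.0, 2.0}; // made-up Q-values
        System.out.println(maxBetaScaled(qs, 0.5)); // prints 1.5
    }
}
```

    For non-negative beta this value equals the largest beta-scaled Q-value, which is what makes it usable as the shift term in a log-sum-exp computation.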
  - logSum
```
public static double logSum(double[] qs,
                            double maxBetaScaled,
                            double beta)
```
    Computes the log sum of exponentiated Q-values (scaled by beta).
    
    Parameters:
    
    qs - the Q-values
    
    maxBetaScaled - the maximum Q-value scaled by the parameter beta
    
    beta - the scaling value.
    
    Returns:
    
    the log sum of exponentiated Q-values (scaled by beta)
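
    The overflow issue this method guards against can be shown with a standalone sketch of the standard log-sum-exp shift: log(sum_i exp(beta*q_i)) = m + log(sum_i exp(beta*q_i - m)), where m is the maximum beta-scaled Q-value. `LogSumSketch` is a hypothetical name and the body is an assumption based on the parameter descriptions, not the BURLAP source.

```java
// Hypothetical illustration; not part of BURLAP.
class LogSumSketch {

    /** log(sum_i exp(beta * qs[i])), shifted by maxBetaScaled so no exponent exceeds zero. */
    static double logSum(double[] qs, double maxBetaScaled, double beta) {
        double expSum = 0.0;
        for (double q : qs) {
            expSum += Math.exp(beta * q - maxBetaScaled);
        }
        return maxBetaScaled + Math.log(expSum);
    }

    /** Direct computation, shown only to demonstrate the overflow. */
    static double naiveLogSum(double[] qs, double beta) {
        double expSum = 0.0;
        for (double q : qs) {
            expSum += Math.exp(beta * q);
        }
        return Math.log(expSum);
    }

    public static void main(String[] args) {
        double[] qs = {1000.0, 1000.0}; // made-up, deliberately large Q-values
        double beta = 1.0;
        double maxBetaScaled = beta * 1000.0;
        System.out.println("shifted: " + logSum(qs, maxBetaScaled, beta));
        System.out.println("naive:   " + naiveLogSum(qs, beta)); // overflows to Infinity
    }
}
```

    The shifted form returns a finite value close to 1000 + ln 2, while the direct form overflows because exp(1000) is not representable as a double.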
  - combinedNonZeroPDParameters
```
protected static java.util.Set<java.lang.Integer> combinedNonZeroPDParameters(FunctionGradient... gradients)
```
