Abstract
Onpolicy reinforcement learning (RL) has become a popular framework for solving sequential decision problems due to its computational efficiency and theoretical simplicity. Some onpolicy methods guarantee every policy update is constrained to a trust region relative to the prior policy to ensure training stability. These methods often require computationally intensive nonlinear optimization or require a particular form of action distribution. In this work, we show that applying KL penalization alone is nearly sufficient to enforce such trust regions. Then, we show that introducing afixupphase is sufficient to guarantee a trust region is enforced on every policy update while adding fewer than 5% additional gradient steps in practice. The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces, is easy to implement, and produces results competitive with other trust region methods.
Introduction
Onpolicy reinforcement learning (RL) methods seek to optimize a stochastic policy, where a neural network is used to parameterize a distribution $\pi(as)$ over actions conditioned on the current state. In this framework, most onpolicy RL methods seek to limit the scale of updates between successive policies during optimization. Some onpolicy RL methods operate by guaranteeing that each policy update remains within a “trust region” ( Schulman et al. 2015 ). These methods are used when training stability during a long period of training is essential. However, finding a policy update near the edge of the trust region often comes at significant computational cost. Another branch of onpolicy methods instead perform “proximal” policy updates, that limit the expected scale of policy updates, but can result in individual policy updates being of arbitrary magnitude ( Schulman et al. 2017 ). These methods are much more computationally efficient, but largescale training can require the use of multiple training runs or human intervention to recover from training instabilities. In this work we propose Fixup Policy Optimization (FixPO), which combines both a proximal primary phase with a precise fixup phase, that operate by sharing a single penalty coefficient $\beta$. By performing a more conservative proximal update before strictly enforcing a trust region, FixPO is able to approximately match the computational efficiency and rewards of proximal methods while providing the same stability guarantees as trust region methods.
An important result in the development of trustregion methods is a proof presented with the Trust Region Policy Optimization algorithm (TRPO) ( Schulman et al. 2015 ) that for a particular value of $C$, iteratively applying the following update provably results in monotonic improvement of the expected return of $\pi_i$:
$\pi_{i + 1} = \text{argmax}_{\pi} \left[L_{\pi_i}(\pi)  CD_{K\!L}^{max}(\pi_i, \pi)\right]$where $L_{\pi_i}$ is the importance sampled “policy gradient” loss, $D_{K\!L}^{max}(\pi_i, \pi)$ is the maximal value of the KullbackLeibler (KL) divergence between the action distributions of $\pi_i(sa)$ and of $\pi(sa)$, and $C$ is a function of the characteristics of the Markov Decision Process (MDP). In practice, TRPO uses constrained optimization to perform policy updates subject to a constraint on the average KL divergence instead of only penalizing the maximal value. Due to constrained optimization preventing the use of minibatching and increasing the computational cost of optimizing deep neural networks, Proximal Policy Optimization algorithms ( Schulman et al. 2017 ) are more frequently used in practice. These methods do not guarantee that any precise trust region constraint is enforced but approximately limit the scale of $D_{K\!L}(\pi_i, \pi)$.
The most wellstudied PPO algorithm, often referred to as PPOclip, or shortened to just PPO, operates by zeroing the loss contribution from likelihood ratios outside of the range $1 \pm \epsilon_{C\!L\!I\!P}$. Clipping in this way is very computationally efficient and ensures that for each state, at most one gradient step is taken which could increase the $D_{K\!L}(\pi_i, \pi)$ beyond the trust region. However, another class of PPO algorithms also introduced in ( Schulman et al. 2017 ) instead feature a policy update inspired by the theory above:
$\pi_{i + 1} = \text{argmax}_{\pi} \left[L_{\pi_i}(\pi)  \beta D_{K\!L}(\pi_i, \pi)\right]$In the above equation, $\beta$ is a hyperparameter that is typically tuned dynamically in response to the scale of recent policy updates as measured in $D_{K\!L}(\pi_i, \pi)$. Although this PPO variant is believed to perform worse than PPOclip, its simplicity and connection to the above theory have made it a subject of study in several later works, which have extended it in various ways. In this work, we demonstrate that by rapidly adapting $\beta$, it is possible to nearly enforce a trust region. Then, by performing a small number of additional gradient steps in a “fixup phase,” we can guarantee the trust region is precisely enforced for a wide range of policy classes.
This work provides the following contributions:

An RL algorithm, FixPO, that efficiently enforces a guaranteed trust region between every policy update using only KL penalization.

Experiments showing the performance of the proposed algorithm on a variety of benchmarks compared to other trust region methods.

Ablation experiments showing the effect of each component of the proposed algorithm and why those components are necessary.
Related Work
Trust Region Methods
The algorithm presented in this work follows in large part from the theory of trust region reinforcement learning methods, namely ( Schulman et al. 2015 ) and ( Schulman et al. 2017 ), combined with more recent insights from ( Andrychowicz et al. 2020 ). Work on FixPO was also guided by publications on PPO variants, such as ( Cobbe et al. 2020 ), from which the term “phase” was borrowed, and ( Hilton et al. 2021 ), which analyzes the effect of $\beta$ in relation to batch size. Works that analyze various aspects of PPO were also extremely useful, including ( Engstrom et al. 2020 ), which provides a detailed analysis of the relationship between PPO and TRPO, and ( Hsu et al. 2020 ), which examines several aspects of PPO in detail, such as the action distribution parameterization and effect of different KL penalties. More recently, ( Huang et al. 2022 ) provides an analysis of the effect of many proposed changes to PPO which was invaluable in this research. Besides ( Schulman et al. 2015 ), other methods for guaranteeing constrained updates have been proposed specifically for Gaussian policies ( Akrour et al. 2019 , Otto et al. 2021 ).
Lagrangian Methods
Although we are not aware of any comprehensive survey on the topic, loss functions structured similarly to Augmented Lagrangian methods ( Hestenes 1969 ) are frequently used in various Deep RL methods, including ( Song et al. 2019 , Andrychowicz et al. 2020 ). Our proposed $L_\beta$ is similar to the losses proposed in those works, with two additions we describe in Section 3.1. Lagrangian methods are also used in some offpolicy Deep RL work, such as for automatic entropy tuning in ( Haarnoja et al. 2018 ) and constraining offline Q estimates in ( Kumar et al. 2020 ). There are several applications of Lagrangian methods in Safe Deep RL works ( Chow et al. 2015 , Achiam et al. 2017 ), Imitation Learning and Inverse RL ( Peng et al. 2018 ), Differentiable Economics ( Dütting et al. 2017 , Ivanov et al. 2022 ), and MultiAgent RL ( Ivanov et al. 2023 ).
KL Regularized RL
Outsides of trust region methods, using the KL divergence to regularize RL has been a longstanding method ( Rawlik et al. 2012 ), and continues to be used in recent methods such as ( Kozuno et al. 2022 ), ( Vieillard et al. 2020 ), and ( Galashov et al. 2019 ). KL regularization is also a critical component of several recent offline RL methods, such as ( Wu et al. 2019 ), ( Nair et al. 2020 ), and ( Jaques et al. 2019 ).
Benchmarks and Libraries
The primary benchmarks used in this work were the Mujoco ( Todorov et al. 2012 ) benchmarks from OpenAI Gym ( Brockman et al. 2016 ), and the MetaWorld ( Yu et al. 2019 ) benchmarks.
In most of our experiments, we make use of code from Tianshou
( Weng et al. 2022 ), although we used stablebaselines3
( Raffin et al. 2021 ) in earlier experiments.
We also used samplefactory
( Petrenko et al. 2020 ) to run experiments on tasks from the DMLab30 ( Beattie et al. 2016 ) benchmark.
Method
Loss Functions
Our method begins with the wellknown loss function that results in policy updates that approximate Equation 1, also known as the KL regularized importance sampled policy gradient.
$L(s, a, \hat{A}) = \underbrace{\frac {\pi_{\theta}(as)} {\pi_{\theta'}(as)} \hat{A}}_{L_\pi} + \beta \overbrace{D_{K\!L}\left[\pi_{\theta}(as), \pi_{\theta'}(as)\right]}^{L_{K\!L}}$Where $\pi_{\theta}$ is the policy undergoing the update, $\pi_{\theta'}$ is the policy from the previous step, and $\hat{A}$ are advantages estimated using GAE ( Schulman et al. 2015 ). In order to define modified forms of this loss function, we also define the individual components, $L_\pi$ and $L_{K\!L}$, both of which will be used to optimize the policy parameters $\theta$.
We depart from ( Schulman et al. 2017 ) in how we acquire $\beta$. Instead of using a fixed value or dynamically tuning $\beta$ on an epochbyepoch manner, we instead tune $\beta$ as a Lagrange multiplier using a loss similar to those described in ( Song et al. 2019 , Andrychowicz et al. 2020 ). However, we make two modifications that differ from the losses described in those works. First, we enforce a target on $D^{max}_{K\!L}\left[\pi_{\theta}, \pi_{\theta'}\right]$, which is theoretically justified by Equation 1. Although typically dismissed as fragile to outliers, we find that the maximal KL value is less sensitive to hyperparameter choices, which we discuss in Section 4.1.0.1. Secondly, we add a term $C_\beta$, which mirrors the $C$ value in Equation [eq:argmaxpi]. This results in moving the optima of the Lagrangian optimization away from the constraint surface, which we discuss in more detail in Paragraph 3.2.0.1. This results in the following loss, which tunes $\beta$ to approximately enforce the trust region constraint.
$L_\beta = \beta\mathop{sg}{\left[\epsilon_{K\!L} C_{\beta}D_{K\!L}^{max}\left[\pi_{\theta}, \pi_{\theta'}\right]\right]}$Where $\mathop{sg}{}$ is the “stopgradient” operator, which we include as a reminder that this loss function should only be tuning $\beta$, and should not modify the policy parameters $\theta$. $\epsilon_{K\!L}$ is a hyperparameter that controls the size of the trust region, which performs a similar role as $\epsilon_{CLIP}$ in PPOclip. $C_\beta$ moves the target of the primary phase away from the edge of the trust region and compensates for bias introduced by computing $D_{K\!L}^{max}$ on minibatches. When $C_\beta = 1$, this loss function tunes $\beta$ such that $D_{K\!L}^{max} \approx \epsilon_{K\!L}$, by increasing $\beta$ when $C_{\beta}D_{K\!L}^{max} > \epsilon_{K\!L}$ and decreasing $\beta$ when $C_{\beta}D_{K\!L}^{max} < \epsilon_{K\!L}$. When $C_\beta > 1$, optimizing $L_\beta$ results in $D_{K\!L}^{max} \approx \epsilon_{K\!L}/ C_\beta < \epsilon_{K\!L}$. This difference between the expected convergence in the primary phase ($\epsilon_{K\!L}/ C_\beta$) and the exit condition of the fixup phase ($\epsilon_{K\!L}$) is effective at limiting the number of iterations of the fixup phase, as we show below.
In practice, $\pi_{\theta}$ and the value function may share some of the parameters $\theta$, so the loss function on $\theta$ includes a loss $L_{V\!F}$ on the value function. Typically, $L_{V\!F}$ will be the mean squared error of the predicted returns, although value clipping may also be used ( Andrychowicz et al. 2020 ), which most PPO implementations use by default. Combining the policy gradient loss with the value function loss and KL penalty results in a comprehensive loss on the neural network parameters that we use in Algorithm 1:
$L_{\theta} = L_{\pi} + L_{V\!F} + \beta L_{K\!L}$Fixup Phase
The most significant unique component of FixPO is the fixup phase, which runs after the primary phase. In the primary phase (lines 5  7 in Algorithm 1), we repeatedly optimize $\gamma_\theta L_\theta + \gamma_\beta L_\beta$. By choosing $\gamma_\beta$ such that $\gamma_\beta >> \gamma_\theta$ (and $L_\theta$ and $L_\beta$ have similar scales), minimizing the weighted combination of the losses results in approximate convergence to an optimum of $L_\beta \approx 0$. However, it is still possible that $D^{max}_{K\!L}\left[\pi_{\theta}, \pi_{\theta'}\right] > \epsilon_{K\!L}$, and thus that the trust region constraint is not satisfied. To guarantee that the trust region is enforced, the fixup phase iterates through all minibatches, and checks to see if the constraint is satisfied at every state. If the constraint is not satisfied at any state, then the fixup phase performs an update using $\gamma_\theta L_{K\!L}+ \gamma_\beta L_\beta$ and resumes checking all minibatches, as described in lines 8  15 in Algorithm 1.
Fixup Phase Termination
Because the fixup phase does not terminate until the trust region constraint is satisfied, it is evident that the trust region constraint is enforced between every policy update. Although we cannot guarantee the fixup phase terminates in general, there are strong reasons to expect it to terminate in practical cases. Because $L_{K\!L}= 0$ when $\theta = \theta'$, we know that a global optima of $L_{K\!L}= 0$ exists. Therefore, assuming the central result of ( Kawaguchi 2016 ) can be extended to this case, all local optima of $L_{K\!L}$ equal $0$ for these policy classes. Consequently, we can expect the fixup phase to terminate as long as it is able to optimize $L_{K\!L}$ to a local optimum. In theory, convergence to such a local optima may take an arbitrary number of gradient steps, and require an arbitrarily low learning rate. By applying an upper bound to $\beta$ and decreasing $\gamma_\theta$ when $L_{K\!L}$ reaches a plateau, $\theta$ such that $D_{K\!L}< \epsilon_{K\!L}$ can be guaranteed to eventually be found, although without any upper bound on the runtime. In practice, using $C_\beta > 1$ requires $L_{K\!L}$ to only be optimized to near a local optimum for the trust region constraint to be satisfied, and consequently for sufficiently large $C_\beta$ values the fixup phase only requires a very small number of gradient steps to terminate, as shown in Figure 1.
Subroutines
Algorithm 1 makes use of two subroutines, rollout
and minibatch
.
rollout
runs full episodes of the policy $\pi_\theta$ in the MDP, collects the resulting tuples, and computes advantages $\hat{A}$ using a value function (also parameterized by $\theta$) and GAE ( Schulman et al. 2015 ).
minibatch
splits the collected data into minibatches on which we compute loss terms.
Except when noted, we use the implementation of these routines typically used in Tianshou ( Weng et al. 2022 ).
Using Momentum Optimizers
As is standard practice ( Andrychowicz et al. 2020 ), we use Adam ( Kingma et al. 2014 ) (instead of SGD) to optimize both $\theta$ and $\beta$. Therefore, in the initial gradient steps of the fixup phase, optimizing $L_{K\!L}$ also optimizes $L_{\theta}$, and optimizing $L_{\theta}$ in the next few iterations of the primary phase additionally optimizes $L_{K\!L}$. We have not found this to be a problem in practice using the default hyperparameters for Adam, as long as $C_\beta \ge 2$.
Experiments
Gym Mujoco Control Tasks
Our first experiments demonstrate that FixPO performs competitively to other trust region methods on the Mujoco control tasks from the OpenAI Gym ( Brockman et al. 2016 ), a finite horizon, continuous action space RL benchmark. We compare our implementation using the Tianshou RL framework ( Weng et al. 2022 ) to the PPOclip implementation in that framework, as well as to the KL projection layer described in ( Otto et al. 2021 ). The results in Figure 3 show that FixPO is generally able to match or exceed other trust region methods on these tasks, and exhibits consistent training stability.
HyperParameter Robustness
FixPO appears to be highly robust to choices of hyperparameter values. As we will show in Section 4.2, FixPO can perform moderately well with many of its components removed in isolation. We performed hyperparameter sweeps for all of the major hyperparameters, notably $C_\beta, \epsilon_{K\!L}$, $\gamma_{\beta}$, the minibatch size, and the batch size. Changing these parameters within a wide range of values had minimal effect on the algorithm’s wallclock time and rewards. In particular, performance was approximately equal while $0.1 \le \epsilon_{K\!L}\le 0.5$ and $2 \le C_\beta \le 10$ and $0.001 \le \gamma_{\beta} \le 0.1$, and the minibatch size was not more than 512. This allows FixPO to use larger batch and minibatch sizes than the baseline algorithms, allowing for faster wallclock times in the following experiments. Other experiments we have performed indicate that FixPO does not require the corrections to $\beta$ described in ( Hilton et al. 2021 ), which we speculate is due to the constraint on $D_{K\!L}\left[\pi_{\theta}, \pi_{\theta'}\right]$ more closely following the trust region theory. This includes the MetaWorld benchmark, where PPO typically requires batch sizes of at least 50,000 timesteps to stabilize training.
Higher Entropy Policies
On these environments, FixPO naturally learns a higher entropy policy than PPOclip, without using entropy regularization. This confirms a pattern described in ( Otto et al. 2021 ). Figure [fig:ppoclipstddevcomparison] shows the relative standard deviation of FixPO and PPOclip on Walker2d.
Gym Mujoco Ablation Experiments
We performed a series of ablation experiments using the Mujoco environments described above. Each ablation experiment removes a unique component of FixPO.
Remove Fixup Phase
This ablation (labeled No Fixup Phase
in Figure 5) removes the fixup phase entirely, relying on only tuning $\beta$ in the primary phase to enforce the trust region.
This results in an algorithm similar to those described in ( Andrychowicz et al. 2020 ).
Although this ablation is able to perform well in most runs, we observe poor performance in a portion of runs due to the trust region not being reliably enforced.
This matches the theoretical and experimental predictions made of KL regularized PPO in ( Hsu et al. 2020 ).
Although this ablation achieves higher reward on most tasks than FixPO, it does not guarantee that the trust region is enforced, which is the primary objective of FixPO.
Limit the Mean KL Value
This ablation (labeled Limit Mean KL
in Figure 5) tunes $\beta$ to limit the mean value of the KL divergence, instead of limiting the maximal value using the following loss function:
This is similar to losses described in ( Song et al. 2019 , Andrychowicz et al. 2020 ) but using $C_\beta$ to adjust the optima. In this ablation, we still run the fixup phase introduced in this work, but exit it once the mean KL divergence is less than the target (i.e. $\overline{D_{K\!L}(s)} \le \epsilon_{K\!L}$). We performed a hyper parameter sweep over $\epsilon_{K\!L}$ for each environment for the ablation, and found this ablation to perform similarly to our proposed $L_\beta$ given an optimal $\epsilon_{K\!L}$. Although this ablation is able to reach similar rewards as FixPO, we believe the increased sensitivity to the value of $\epsilon_{K\!L}$ makes it worse.
Lagrangian Optima on Constraint Boundary ($C_\beta = 1$)
In this ablation, we remove $C_\beta$ by setting it to $1$. This results in the optima of $L_\beta + L_\theta$ being a $\theta$ such that $D^{max}_{K\!L}\left[\pi_{\theta}, \pi_{\theta'}\right] \approx \epsilon_{K\!L}$. Due to the optima not being reached exactly, and bias introduced by minibatching the losses, it is often the case that $D^{max}_{K\!L}\left[\pi_{\theta}, \pi_{\theta'}\right]$ is significantly greater than $\epsilon_{K\!L}$, requiring over 100 gradient steps in the fixup phase to correct and significant wallclock time. A large number of gradient steps in the fixup phase also appears to result in poor reward, which we attribute to catastrophic forgetting in the value function network. See Figure 1 for the effect of this ablation on the number of gradient steps in the fixup phase.
Only Run Fixup Phase in Last Epoch
In this ablation, we move the fixup phase out of the epoch loop and run it only between policy improvement steps after all epochs have been completed. Equivalently, we run the fixup phase only on the last epoch of each policy improvement step. Due to only needing to enforce the trust region between each policy update step (and not each epoch), this ablation still enforces the trust region guarantee. This results in performing a smaller number of fixup phase gradient steps. However, we found that this ablation slightly decreased the rewards on most environments, including all shown here. Decreasing the overall number of gradient steps by $<5\%$ also did not measurably improve wall clock time. If decreasing fixup phase gradient steps is absolutely necessary, increasing $C_\beta$ is more effective than this ablation.
Use a Constant $\beta = 10$
In these experiments, we do not use $L_\beta$, and use a constant value of $\beta = 10$. The fixup phase is still performed. Equivalently, in these experiments $\gamma_\beta = 0$. This ablation performs very well on HalfCheetah, and moderately well on other environments. However, in some individual runs a large number of gradient steps are necessary in the fixup loop, and rewards sometimes decrease significantly. Notably, we were only able to perform this ablation because we observed that $L_\beta$ often tuned $\beta$ to a value between 5 and 20. We believe that the decrease in stability and the need to potentially tune $\beta$ make this ablation worse than the proposed method.
MetaWorld Experiments
In this section, we use the MetaWorld ( Yu et al. 2019 ), a continuous action space infinite horizon RL benchmark, to run a very large number of experiments comparing FixPO to other trust region methods. Due to these tasks containing randomized goals and starting states, PPOclip requires a very large batch size (50000), to solve these tasks, or suffers from high instability when trainig. In Figure 6, we use 450 experiment runs to demonstrate that FixPO is able to match the performance of other trust region methods, without requiring a change in hyper parameters. In Figure 7, we perform some simple Transfer RL experiments that show how FixPO is able to finetune without any special handling of the value function, such as described in ( Zentner et al. 2022 ).
DMLab30 Experiments
To demonstrate that FixPO is able to scale where constrained optimization (specifically ( Schulman et al. 2015 )) cannot, we implement FixPO in the popular samplefactory
RL framework.
This implementation of FixPO was based on the highly performant implementation of APPO in samplefactory
.
We chose the DMLab30 ( Beattie et al. 2016 ) environment because of its high dimensional visual observation space and partial observability, both properties that make training policies with constrained optimization challenging due to the large number of neural network parameters required.
We compared the reward of FixPO and APPO on three DMLab30 tasks: rooms collect good objects train, rooms select nonmatching object, and rooms exploit deferred effects train. To make the comparison fair, FixPO uses the same hyperparameters as APPO, except for hyperparameters specific to FixPO, which we set $\epsilon_{K\!L}= 1.0$ and $C_\beta = 2$. Figure 8 shows that FixPO is able to match the performance of APPO on those tasks.
Limitations
MultiTask RL
We experimented with running FixPO as a multitask RL method on the DMLab30 benchmark. However, we found that strictly applying the trust region constraint across all tasks simultaneously prevented progress from being made on multiple tasks at once. In the future, we would like to experiment with using one $\beta$ value per task, which may alleviate this limitation.
More Policy Architectures
One of the advantages of FixPO relative to prior trust region methods is the ability to combine minibatching with trustregion optimization of policies besides Gaussian policies (which works such as ( Otto et al. 2021 ) are limited to). Our DMLab30 experiments show these capabilities in a discrete action space, and we were also able to run our implementation using the Tianshou framework on the Arcade Learning Environment ( Bellemare et al. 2013 ). However, further experiments with different action distributions would be a useful direction for future work.
Conclusion
In this work we have shown how FixPO is able to combine the guarantees of trust region methods with the computational efficiency and rewards of proximal methods. FixPO enforces its trust region via KL penalization, which is flexible and well understood in the machine learning community. In future work, we would like to extend our work to a multitask setting.
Footnote 1: These authors contributed equally to this work.
Footnote 2: GSS holds concurrent appointments as a Professor at USC and as an Amazon Scholar. This paper describes work performed at USC and is not associated with Amazon.
Bibliography
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel,
Trust region policy optimization
,
2015, [URL] Backrefs: [1][2][3][4][5]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov,
Proximal policy optimization algorithms
,
2017, [URL] Backrefs: [1][2][3][4][5]
Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem,
What matters in onpolicy reinforcement learning? A largescale empirical study
,
2020, [URL] Backrefs: [1][2][3][4][5][6][7]
Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman,
Phasic policy gradient
,
2020, [URL] Backrefs: [1]
Jacob Hilton, Karl Cobbe, John Schulman,
Batch sizeinvariance for policy optimization
,
2021, [URL] Backrefs: [1][2]
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry,
Implementation matters in deep {RL}: A case study on {PPO} and {TRPO}
,
2020, [URL] Backrefs: [1]
Chloe ChingYun Hsu, Celestine MendlerDünner, Moritz Hardt,
Revisiting design choices in proximal policy optimization
,
2020, [URL] Backrefs: [1][2][3]
Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, Weixun Wang,
The 37 implementation details of proximal policy optimization
,
2022, (https://iclrblogtrack.github.io/2022/03/25/ppoimplementationdetails/) [URL] Backrefs: [1]
Riad Akrour, Joni Pajarinen, Jan Peters, Gerhard Neumann,
Projections for approximate policy iteration algorithms
,
2019, [URL] Backrefs: [1]
Fabian Otto, Philipp Becker, Ngo Anh Vien, Hanna Carolin Ziesche, Gerhard Neumann,
Differentiable trust region layers for deep reinforcement learning
,
2021, [URL] Backrefs: [1][2][3][4][5][6][7][8]
Magnus R. Hestenes,
Multiplier and gradient methods
,
1969,11, [URL] Backrefs: [1]
H. Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick,
VMPO: Onpolicy maximum a posteriori policy optimization for discrete and continuous control
,
2019, [URL] Backrefs: [1][2][3]
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, ,
Soft actorcritic algorithms and applications
,
2018, Backrefs: [1]
Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine,
Conservative qlearning for offline reinforcement learning
,
2020, [URL] Backrefs: [1]
Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, Marco Pavone,
Riskconstrained reinforcement learning with percentile risk criteria
,
2015, [URL] Backrefs: [1]
Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel,
Constrained policy optimization
,
2017, [URL] Backrefs: [1]
Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine,
Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow
,
2018, [URL] Backrefs: [1]
Paul Dütting, Zhe Feng, Harikrishna Narasimhan, David C. Parkes, Sai Srivatsa Ravindranath,
Optimal auctions through deep learning: Advances in differentiable economics
,
2017, [URL] Backrefs: [1]
Dmitry Ivanov, Iskander Safiulin, Igor Filippov, Ksenia Balabaeva,
Optimaler auctions through attention
,
2022, [URL] Backrefs: [1]
Dmitry Ivanov, Ilya Zisman, Kirill Chernyshev,
Mediated multiagent reinforcement learning
,
2023, [URL] Backrefs: [1]
Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar,
On stochastic optimal control and reinforcement learning by approximate inference
,
2012, Backrefs: [1]
Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre M’enard, Mohammad Gheshlaghi Azar, M. Vaĺko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvari,
KLentropyregularized RL with a generative model is minimax optimal
,
2022, Backrefs: [1]
Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist,
Leverage the average: An analysis of regularization in RL
,
2020, Backrefs: [1]
Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, Nicolas Manfred Otto Heess,
Information asymmetry in KLregularized RL
,
2019, Backrefs: [1]
Yifan Wu, G. Tucker, Ofir Nachum,
Behavior regularized offline reinforcement learning
,
2019, Backrefs: [1]
Ashvin Nair, Murtaza Dalal, Abhishek Gupta, Sergey Levine,
Accelerating online reinforcement learning with offline datasets
,
2020, Backrefs: [1]
Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Lapedriza, Noah J. Jones, Shixiang Shane Gu, Rosalind W. Picard,
Way offpolicy batch deep reinforcement learning of implicit human preferences in dialog
,
2019, Backrefs: [1]
Emanuel Todorov, Tom Erez, Yuval Tassa,
Mujoco: A physics engine for modelbased control
,
2012, Backrefs: [1]
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, Wojciech Zaremba,
OpenAI gym
,
2016, [URL] Backrefs: [1][2]
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, Sergey Levine,
Metaworld: A benchmark and evaluation for multitask and meta reinforcement learning
,
2019, [URL] Backrefs: [1][2]
Jiayi Weng, Huayu Chen, Dong Yan, Kaichao You, Alexis Duburcq, Minghao Zhang, Yi Su, Hang Su, Jun Zhu,
Tianshou: A highly modularized deep reinforcement learning library
,
2022, [URL] Backrefs: [1][2][3]
Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, Noah Dormann,
StableBaselines3: Reliable reinforcement learning implementations
,
2021, [URL] Backrefs: [1]
Aleksei Petrenko, Zhehui Huang, Tushar Kumar, Gaurav S. Sukhatme, Vladlen Koltun,
Sample factory: Egocentric 3D control from pixels at 100000 FPS with asynchronous reinforcement learning
,
2020, [URL] Backrefs: [1]
Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, Stig Petersen,
DeepMind lab
,
2016, [URL] Backrefs: [1][2]
John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, Pieter Abbeel,
Highdimensional continuous control using generalized advantage estimation
,
2015, Backrefs: [1][2]
Kenji Kawaguchi,
Deep learning without poor local minima
,
2016, [URL] Backrefs: [1]
Diederik P. Kingma, Jimmy Ba,
Adam: A method for stochastic optimization
,
2014, Backrefs: [1]
K. R. Zentner , Ujjwal Puri, Yulun Zhang, Ryan Julian, Gaurav S. Sukhatme,
Efficient multitask learning via iterated singletask transfer
,
2022, Backrefs: [1]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov,
Proximal policy optimization algorithms
,
2017, Backrefs: [1]
M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling,
The arcade learning environment: An evaluation platform for general agents
,
2013,6, [URL] Backrefs: [1]