## Retrospectives @NeurIPS 2019

# A Retrospective for "Deep Reinforcement Learning That Matters: A Closer Look at Policy Gradient Algorithms, Theory and Practice"

- Original Paper: Deep Reinforcement Learning That Matters: A Closer Look at Policy Gradient Algorithms, Theory and Practice
- Paper written by: Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
- Retrospective written by: Riashat Islam

## Reproducibility in Deep Reinforcement Learning

Deep RL, as a field, has had a lot of success, primarily through empirical breakthroughs (e.g., superhuman performance in Atari games, solving the game of Go), and many of these successes have come from heavily engineered algorithm design. Reproducing state-of-the-art deep RL methods (particularly policy gradient based algorithms) is often far from straightforward: intrinsic variance in the algorithms can make results difficult to interpret, and performance improvements may often be due not to algorithmic advances but to different experimental techniques and reporting procedures.

This paper gained a lot of interest in the deep RL and ML communities in general, questioning our methods, practices, and the experimental techniques we use to reproduce results. As a community, we realized that while a huge number of research papers are being published, not all papers and algorithms can be easily reproduced or re-implemented, since minor implementation details are often left out. In the deep RL community in particular, we realized that state-of-the-art algorithms are often difficult to implement, and that the intrinsic variance of the algorithms, combined with the external stochasticity of the environments, can make policy gradient based algorithms difficult to interpret. Our paper re-emphasized the difficulties of hyperparameter tuning and the susceptibility to different hyperparameters (including random seeds!), and we concluded that several factors affect the reproducibility of state-of-the-art RL methods.

Yes, the reproducibility of deep RL algorithms is still a major issue, and as a community we should develop tools and experiment pipelines to make our results more reproducible and our algorithms easier to implement. There are now numerous efforts to build proper engineering pipelines to address reproducibility, which is really encouraging to see! However, is this the only problem we should be concerned about?

### Let’s take a step back

What makes deep RL algorithms so sensitive to hyperparameters, and even to random seeds? What have we done that makes our algorithms so fragile? There are so many questions that can be raised…

But hold on, do we understand **why** these algorithms work in the first place? What makes them able to solve such complex tasks so efficiently? We question the reproducibility of deep RL, but where are the convergence guarantees that said deep RL algorithms would not be susceptible to hyperparameters and random seeds? Where is the underlying theory behind these algorithms? Have we not simply imported theory from deep learning and tried to make these algorithms work with neural networks? As an example, why do we use Adam as the optimizer in actor-critic algorithms like DDPG or TRPO? From an optimization perspective, we are trying to solve ever more complex non-convex objectives, which may have different properties in the RL case than in supervised learning. As algorithms become more complex, do we understand the complexities we are dealing with when solving these non-convex objectives?

## Retrospective

In this article, I want to highlight that while engineering tools are important for a field to progress practically, in aiming for them we are, as a community, drifting away from theoretical contributions and advancements; that is, the practical efforts are crowding out the need for better theoretically motivated algorithms. Perhaps, with stronger theoretical guarantees and efforts to give our algorithms theoretical understanding and justification, we may be able to solve the reproducibility issues as well.

In our paper, we highlighted the susceptibility of deep RL algorithms to a wide range of hyperparameters (including random seeds). We concluded that this can lead to reproducibility issues, since not all engineering tricks of the trade are reported. However, looking back, I now think many of these reproducibility issues arise because we are moving away from the theory that deep RL requires. Perhaps stronger theoretical guarantees and understanding will make deep RL algorithms more reproducible and less sensitive to hyperparameters.

### Example: Re-visiting Actor-Critic in Deep RL

Actor-critic algorithms are two time-scale algorithms in which choosing the learning rates (step sizes) of the actor and the critic plays an important role. In addition, we use n-step methods and lambda returns in the critic's evaluation. In the linear function approximation case, there were numerous studies (see Konda and Tsitsiklis, 2000) on the convergence of actor-critic-like two time-scale algorithms.

Perhaps, in deep RL with non-linear function approximators, we need to re-visit and understand two time-scale algorithms again. As a community, we need to re-visit the established theory of two time-scale actor-critic algorithms and understand their convergence properties with non-linear function approximators. From an optimization perspective, we may be solving an even harder problem, given that our environments are often long horizon, and the min-max objectives need careful interpretation when we minimize the Bellman error while learning policies that maximize cumulative returns.

Do we understand what role optimizers such as Adam or RMSProp play in deep RL actor-critic algorithms? In the tabular case, we used to tune the learning rates carefully, making sure the critic has a higher learning rate than the actor. In deep RL, however, we tend to use the default Adam learning rate for the actor and a higher learning rate for the critic. Do we understand how optimizers such as Adam interact with our deep RL algorithms in the non-linear approximation case?
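To make the two time-scale structure concrete, here is a minimal, self-contained sketch of a tabular actor-critic on a toy two-armed bandit. The critic's step size decays more slowly than the actor's, so the critic is the "fast" component; all names, payoffs, and schedules are purely illustrative, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.8])  # hypothetical arm payoffs

theta = np.zeros(2)  # actor: softmax preferences over the two arms
q = np.zeros(2)      # critic: running action-value estimates

for t in range(1, 5001):
    # Two time-scale step sizes: alpha_actor / alpha_critic -> 0 as t grows,
    # so the critic runs on the faster time-scale than the actor.
    alpha_critic = 1.0 / t**0.6
    alpha_actor = 1.0 / t**0.9

    # Softmax policy over arms.
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(2, p=pi)
    r = true_means[a] + 0.1 * rng.standard_normal()

    # Critic: fast stochastic-approximation step toward the sampled reward.
    q[a] += alpha_critic * (r - q[a])

    # Actor: slow policy-gradient step using the critic's current estimate,
    # with grad log pi(a) = e_a - pi for a softmax policy.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha_actor * q[a] * grad_log_pi
```

In the limit, the actor effectively sees a nearly converged critic, which is the separation the convergence analyses of two time-scale algorithms rely on; with a single shared Adam learning rate this separation is no longer explicit.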

The goal of this article is to highlight that we need a more careful understanding of our deep RL algorithms, backed up with proper theory and tools to analyze them. Yes, reproducibility is a major issue, but as the field progresses with engineering efforts to solve practical problems, we must not drift away from the beautiful theory and motivating frameworks behind our algorithms. We need a balance of both before blaming deep RL for being non-reproducible.

## Is Reproducibility in Deep RL the only issue?

Every year there are numerous papers advancing deep RL methods, with variants of algorithms built on top of one another, aiming for performance improvements. Despite the achievements of deep RL algorithms, we often conclude that they are hard to re-implement and their results not reproducible. And despite the large number of algorithms in deep RL, when we want to extend them to the real world, there are only a handful we can take. Why is that? A few primary reasons are that these algorithms are highly sample inefficient, sensitive to even the smallest hyperparameters, and we often lack a theoretical understanding of why they behave the way they do! More importantly, we lack theoretical guarantees for these algorithms, guarantees that are vital to ensure they will work when applied to the real world. The sample complexity of reinforcement learning algorithms has been a major problem [Kakade, 2003], and the inability to learn from small amounts of data often makes RL algorithms difficult to scale up in practice. A major ongoing line of work in RL is towards efficient learning. Perhaps, as a community, we should focus more on these aspects, e.g., the sample complexity and efficiency of RL algorithms, and on theoretical advances and understanding why deep RL algorithms work.

In this article, I want to highlight that we are in a race to produce the next best deep RL algorithm that achieves “better performance”, and in trying to do so, we are blaming the reproducibility issues of deep RL. I agree that these are issues we should definitely be concerned about. But for the community to progress, we need a balance of empirical and theoretical work. We need theoretical understanding of our algorithms before drawing conclusions, and this is just as vital as the reproducibility issues in deep RL. Nowadays there is far more emphasis on engineering practices and advances with RL algorithms, and we are deviating from the theoretical guarantees deep RL algorithms require. We need more effort towards the theoretical foundations of our algorithms: why such algorithms work the way they work!

For the rest of the article, I will focus on policy gradient based algorithms in particular. I want to highlight that there are several theoretical considerations to make when developing new policy gradient based algorithms, and a lot of these theoretical perspectives are often ignored.

## Theory and Practice

We highlight several theoretical aspects of policy gradient based methods that are often overlooked these days. As a community working on RL, we are moving more and more away from the theory and emphasizing the engineering aspects. Our current approach to research advances in deep RL leans more on heavy engineering work and less on theoretical advances and guarantees.

As an example, let’s re-visit the policy gradient approach:

The policy gradient theorem [Sutton, 1999] is stated for both the start-state and the average-reward formulations. The start-state formulation is a crucial step when we express the policy gradient as an expectation that we can estimate by sampling. The theorem was originally derived for both the episodic and the infinite horizon setting. In the past, we used to be a lot more concerned about the average reward case and the discounted formulation [Kakade, 2001; Bartlett, 2001].
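For reference, the start-state form of the theorem can be written with the discounted state-visitation weighting:

```latex
\nabla_\theta J(\theta)
  \;=\; \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a),
\qquad
d^{\pi}(s) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid s_0, \pi).
```

It is exactly this discounted weighting $d^{\pi}$ that gets quietly replaced in practice, as discussed below.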

These days in deep RL, however, we somehow only consider the episodic case. More so, we report undiscounted returns over a finite horizon when evaluating policies, even though the objective the theory is stated for is the discounted return over an infinite horizon; the objective we actually optimize in practice is the discounted return in a finite horizon, episodic setting. While the OpenAI Gym wrappers have made it a lot easier to play around with RL algorithms, implicitly we have all narrowed down to an episodic formulation for policy gradient methods with a fixed reset distribution. We no longer look at infinite horizon discounted settings, and most of our policy gradient algorithms are evaluated in the episodic setting. These wrappers, while useful in practice, often circumvent the underlying theory and the original motivating framework. When developing new algorithms, it is easy to overlook the difference this makes.
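The gap between the return we report and the objective the theory is stated for is easy to see numerically; a small sketch with a hypothetical constant-reward episode (all values illustrative):

```python
# One hypothetical episode of 100 steps with reward 1.0 per step.
rewards = [1.0] * 100
gamma = 0.99

# The objective the theory is typically stated for: the discounted return.
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

# What evaluation scripts commonly report: the undiscounted episodic return.
undiscounted = sum(rewards)
```

On long horizons the two quantities differ substantially, so an algorithm "improving" the reported undiscounted return is not necessarily improving the objective its derivation assumes.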

Recently, [Pardo and Tavakoli, 2017] raised the issue of time limits in RL, which arises from the nature of the tasks we deal with. Not explicitly handling time limits can have severe effects, invalidating experience replay and causing training instabilities. This notion of time limits comes from how our environments are defined these days, and we often overlook these details of environment definition, even though the distinction between finite and infinite horizon tasks goes back a long way!
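The fix amounts to distinguishing a true terminal state from a time-limit truncation when forming the bootstrap target; a minimal sketch (the function name and signature are my own, for illustration):

```python
def td_target(reward, v_next, gamma, terminal, timed_out):
    """One-step TD target that handles time-limit truncation.

    A true terminal state has no future value, so we do not bootstrap.
    A time-limit cut-off is not a real terminal: the environment would
    have continued, so we should still bootstrap from v_next.
    """
    if terminal and not timed_out:
        return reward
    return reward + gamma * v_next
```

Treating a timeout as a genuine terminal (i.e., dropping the bootstrap term when `timed_out` is true) silently biases the value estimates, which is the failure mode described above.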

Several works have also proposed the normalized counterpart with the discounted weighting of states, and [Thomas, 2014] argued that a discount factor term is missing from the commonly used derivation, making the gradient estimator biased. [Nota, 2019] recently showed that the policy gradient methods we develop do not actually follow the gradient of any objective, and reclassified them as semi-gradient methods. Furthermore, [Ilyas, 2018] studied the behaviour of policy gradient algorithms and found that it often deviates from the original motivating framework. A lot of these works suggest steps to solidify the foundations of policy based algorithms, and perhaps we need to re-visit the theory to make them more suitable for the current benchmark tasks.
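Concretely, the missing term [Thomas, 2014] points to is the $\gamma^t$ weighting of time steps. The gradient of the discounted objective is the first expression below, while the quantity most implementations compute is the second, which drops the $\gamma^t$:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right],
\qquad
\hat{g}
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right].
```

The second quantity is not an unbiased estimate of the first, which is what licenses calling these methods semi-gradient.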

As the research field progresses, the community adopts certain beliefs and traditions, and we keep adding engineering modules to develop better algorithms. It is easy to overlook the underlying theory, and we tend not to re-visit it often. For example, the need for exploration in policy gradient based methods, and whether it actually helps, is still an open problem. As a community, we are still unsure of the impact exploration truly has on policy-based methods. Noisy networks [Fortunato, 2017], which add noise to the parameter space, have been shown to be an effective tool for exploration in deep RL, often achieving state-of-the-art results. We frame such approaches as providing better exploration for deep RL. However, little is known about their effects from an optimization perspective. Why do noisy networks perform well at all? Is it simply that they add more stochasticity to SGD (and variants like Adam), which is why we see performance improvements? If so, how does this interact with the role of exploration?
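As a minimal illustration of what parameter-space noise does mechanically, here is a sketch in the spirit of noisy networks, not the actual NoisyNet architecture (the function and its names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_action(weights, state, sigma=0.1):
    # Sample a noisy copy of the policy weights, then act greedily with it.
    # With sigma = 0 this reduces to the deterministic greedy policy, which
    # makes explicit that the noise is injected into the *parameters*,
    # not into the action distribution.
    noisy_weights = weights + sigma * rng.standard_normal(weights.shape)
    return int(np.argmax(state @ noisy_weights))
```

The ambiguity raised above is visible even here: the same perturbation can be read as an exploration mechanism (correlated, state-dependent action noise) or as extra stochasticity in the optimization, and the sketch alone cannot distinguish the two.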

Does exploration in policy-based methods mean that we are implicitly smoothing out the optimization landscape? Or is there more to it? Such questions need more analysis and study, rather than conclusions drawn purely from empirical performance. We often use entropy regularization to induce more stochastic policies, assuming that stochastic policies will aid exploration and lead to better performance. However, [Ahmed, 2018] recently studied the impact of entropy regularization and found that it in fact smooths the optimization landscape, making it easier to search for global optima, rather than having an explicit effect on exploration. While policy-based methods fundamentally rely on gradient based approaches, e.g., SGD and its variants, as a community we rarely study the convergence of these algorithms from an optimization perspective. Only recently have a few works studied the convergence properties of policy based methods in RL [Bhandari, 2019; Agarwal, 2019].
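For concreteness, here is what the entropy-regularized policy gradient loss looks like for a softmax policy over discrete actions; a self-contained sketch with illustrative names (the coefficient `beta` is the usual entropy weight):

```python
import numpy as np

def entropy_regularized_loss(logits, action, advantage, beta=0.01):
    """Per-sample policy-gradient loss with an entropy bonus (illustrative)."""
    # Numerically stable softmax over the action logits.
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    log_probs = np.log(probs)

    # Entropy of the current policy; subtracting beta * entropy from the
    # loss rewards more stochastic policies. Per [Ahmed, 2018], this term
    # acts mainly by smoothing the optimization landscape rather than as
    # a principled exploration mechanism.
    entropy = -(probs * log_probs).sum()
    return -log_probs[action] * advantage - beta * entropy
```

The same regularizer thus admits two readings, "exploration bonus" and "landscape smoothing", and only the optimization-side analysis distinguishes them.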

## Summary

Our paper studied the reproducibility and engineering aspects of policy gradient based algorithms. We questioned the intricacies most of these algorithms come with, and criticized their susceptibility to hyperparameter tuning, their sample inefficiency, and their instability in performance. This led to widespread interest in reproducibility in RL in general.

However, perhaps in addition to questioning reproducibility, we should also take a step back to the original motivating frameworks and make theoretical advances in the field. A lot of recent deep RL algorithms rely heavily on engineering tools, moving away from theoretical guarantees and contributions. We should perhaps re-visit theory more often than we do. To make deep RL algorithms learn more efficiently and scale up to the real world, perhaps instead of only fixing reproducibility issues we should focus more on theoretical advances and guarantees, and develop mathematical tools to make these algorithms more reliable.

We tend to focus a lot on reproducibility, and of course it will be an issue the more we rely on engineering practices and complexity. Perhaps, if we are equally concerned about the lack of theoretical guarantees that most of these algorithms currently have, especially in the non-linear approximation setting, we can make major advances in the deep RL field. Importantly, we need a better understanding of our deep RL algorithms, either by developing tools to study their intricacies, or by delving more into the theory. Stronger theoretical guarantees may well rid us of the difficulties with reproducibility!