Abstract

In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods are vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
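As a rough illustration of the kind of significance metric the abstract calls for, the sketch below compares two sets of final returns collected over several random seeds, using Welch's t-statistic and a percentile bootstrap confidence interval for the difference in means. The function names (`welch_t`, `bootstrap_diff_ci`), the choice of these two metrics, and the return values are assumptions made for illustration; they are hypothetical and are not taken from the paper's experiments.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic and approximate degrees of freedom for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def bootstrap_diff_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    diffs = []
    for _ in range(n_resamples):
        resampled_a = rng.choices(a, k=len(a))  # sample with replacement
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical final returns from five seeds per method (made-up numbers).
baseline = [3100.0, 2950.0, 3300.0, 2800.0, 3050.0]
proposed = [3400.0, 2900.0, 3600.0, 3150.0, 3250.0]

t_stat, dof = welch_t(proposed, baseline)
ci_low, ci_high = bootstrap_diff_ci(proposed, baseline)
print(f"Welch t = {t_stat:.2f} (df ~ {dof:.1f})")
print(f"95% bootstrap CI for mean difference: [{ci_low:.1f}, {ci_high:.1f}]")
```

If the interval comfortably includes zero, the apparent improvement may be indistinguishable from seed-to-seed variance. With only a handful of seeds, both quantities are coarse, so they are better read as a sanity check than as a verdict.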

Keywords

Reinforcement learning, Computer science, Standardization, Benchmark (surveying), Artificial intelligence, Field (mathematics), Variance (accounting), Deep learning, Machine learning, Data science, Risk analysis (engineering), Mathematics

Publication Info

Year: 2018
Type: article
Volume: 32
Issue: 1
Citations: 1397
Access: Closed

Cite This

Peter Henderson, Riashat Islam, Philip Bachman et al. (2018). Deep Reinforcement Learning That Matters. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11694

Identifiers

DOI
10.1609/aaai.v32i1.11694