Abstract

Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
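The counterfactual baseline described in the abstract can be illustrated with a minimal sketch. It assumes the centralised critic has already returned, for one agent, a Q-value for each of that agent's possible actions with the other agents' actions held fixed. The function name `counterfactual_advantage` and its parameters are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence


def counterfactual_advantage(
    q_values: Sequence[float],
    policy_probs: Sequence[float],
    taken_action: int,
) -> float:
    """Advantage of one agent's chosen action against a counterfactual baseline.

    q_values[k] is an assumed critic estimate of the joint action-value when
    this agent plays action k and every other agent's action stays fixed.
    policy_probs[k] is this agent's probability of choosing action k.
    """
    if len(q_values) != len(policy_probs):
        raise ValueError("q_values and policy_probs must have the same length")
    # Baseline: marginalise out this agent's action under its own policy.
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    # Advantage: how much better the chosen action is than the agent's average.
    return q_values[taken_action] - baseline


# Illustrative numbers only (not results from the paper).
q = [1.0, 3.0, 2.0]
pi = [0.5, 0.25, 0.25]
advantages = [counterfactual_advantage(q, pi, k) for k in range(len(q))]
```

Because the baseline is the policy-weighted mean of the same Q-values, the advantages average to zero under the agent's own policy, so each agent's learning signal reflects only its own contribution.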

Keywords

Counterfactual thinking, Computer science, Reinforcement learning, Baseline (sea), Observability, Artificial intelligence, Testbed, Computer security, Distributed computing, Computer network, Political science, Law, Mathematics

Publication Info

Year: 2018
Type: article
Volume: 32
Issue: 1
Citations: 1491
Access: Closed

Citation Metrics

OpenAlex: 1491
Influential: 265
CrossRef: 1095

Cite This

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras et al. (2018). Counterfactual Multi-Agent Policy Gradients. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11794

Identifiers

DOI: 10.1609/aaai.v32i1.11794
arXiv: 1705.08926
