Multi-Agent Reinforcement Learning on Google Research Football Environment

Table of Contents

  • Introduction

  • Background Theory

  • Literature Review

  • Methodology

    • Environment Selection

    • System Setup and Technical Constraints

    • Understanding the Chosen Environment

    • Designing the Experiments

  • Results and Observations

  • Key Learnings

  • Future Work

  • Resources

  • Keywords

Introduction

This research project was the final year project for a Bachelor of Engineering in Electronic and Electrical Engineering at University College London. The aim was to explore whether reinforcement learning (RL) could be used to teach cooperative behaviour among multiple agents in a game environment.

The work was completed independently, with no collaborators or continuous supervision. It was also my first machine learning project, undertaken without any prior experience in simpler ML domains such as supervised or unsupervised learning. Starting directly with multi-agent reinforcement learning presented a particularly steep learning curve, making this project a demanding and immersive introduction to real-world RL experimentation and research methodologies.

Background Theory

Artificial Intelligence (AI) has been an integral part of video games since the 1950s. Initially, it consisted of rule-based systems (hard-coded algorithms that followed predefined logic). In contrast, Machine Learning (ML), a modern subfield of AI, enables algorithms to improve automatically by processing large amounts of data. This shift allows game agents and other applications to adapt over time instead of relying on static, manually crafted rules.

Machine Learning (ML) is broadly categorized into supervised, unsupervised, and reinforcement learning. Unlike supervised and unsupervised learning, which rely on labelled or structured data, Reinforcement Learning (RL) involves learning through interaction. An agent interacts with an environment, taking actions and receiving observations and rewards as feedback.

In RL, the goal is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time. This is achieved through trial and error, where the agent refines its policy based on the rewards it receives. The environment provides a state representation (e.g., pixel input or structured data), and based on this state, the agent selects an action according to its current policy.
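To make this interaction loop concrete, the sketch below runs one episode with a random policy using the classic (pre-0.26) Gym API, which GRF also follows. The environment name and the random action choice are purely illustrative.

```python
import gym

# Any Gym-style environment works here; CartPole is used purely for illustration.
env = gym.make("CartPole-v1")

obs = env.reset()            # initial state representation
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()           # stand-in for policy(obs)
    obs, reward, done, info = env.step(action)   # feedback from the environment
    episode_return += reward                     # cumulative reward the agent tries to maximize

print("Episode return:", episode_return)
```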

Several foundational concepts in RL include:

  • Agent: The learner or decision-maker.

  • Environment: The external system the agent interacts with.

  • State: A representation of the environment at a given time.

  • Action: A choice made by the agent.

  • Reward: Feedback signal indicating the value of an action.

  • Policy: A strategy used by the agent to decide actions.

Another foundational concept in RL is the balance between exploration (trying new actions to discover better rewards) and exploitation (choosing actions known to yield high rewards). This trade-off is crucial to ensure the agent doesn't converge prematurely to suboptimal strategies and continues to improve its policy.

Reinforcement Learning (RL) can involve a single agent or multiple agents. In traditional single-agent RL, one agent interacts with the environment to learn a policy that maximizes cumulative rewards. In contrast, Multi-Agent Reinforcement Learning (MARL) involves multiple agents operating in the same environment. These agents may cooperate or compete, leading to significantly more complex dynamics. Coordination between agents is particularly valuable in real-world applications such as traffic signal optimization or teams of autonomous vehicles where joint strategies are essential.

RL algorithms can be grouped into several major classes. One foundational approach is value-based methods, such as Q-learning and Deep Q-Networks (DQN). These methods estimate the expected cumulative reward (the Q-value) of each action in a given state and choose the action with the highest estimate. DQN approximates Q-values with a neural network, but value-based methods can become unstable to train and scale poorly to large or continuous action spaces.
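As a minimal illustration of the value-based idea, the sketch below shows the tabular Q-learning update that DQN generalizes by replacing the table with a neural network. The state/action sizes and hyperparameters are illustrative only.

```python
import numpy as np

n_states, n_actions = 16, 4            # illustrative sizes for a small discrete problem
Q = np.zeros((n_states, n_actions))    # table of estimated action values Q(s, a)
alpha, gamma = 0.1, 0.99               # learning rate and discount factor

def q_learning_update(s, a, r, s_next, done):
    """Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    """Pick the action with the highest estimated value in state s."""
    return int(np.argmax(Q[s]))
```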

In contrast, policy gradient methods optimize the policy directly, without requiring an explicit value estimation for each action. A popular algorithm in this class is Proximal Policy Optimization (PPO). PPO uses the actor-critic architecture:

  • Actor: decides what action to take given the state.

  • Critic: estimates the value of the state (expected future reward).

The goal of PPO is to stabilize training by limiting how much the policy changes per update. The critic (value function) is trained with a Mean Squared Error (MSE) loss between the predicted state value and the actual cumulative reward (return) observed during rollouts. For example, if the critic predicts a value of 0.5 for a given state but the observed return is 1.0, the squared error is (1.0 - 0.5)² = 0.25. The policy (actor) is trained with a clipped surrogate objective, which bounds the ratio of new to old action probabilities within a small range (typically [1 - ε, 1 + ε], where ε is a small constant such as 0.2). This clipping keeps learning stable and conservative by preventing the policy from changing too drastically in a single update.
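As a rough illustration of these two loss terms, the NumPy sketch below computes the value (MSE) loss and the clipped surrogate policy loss for a batch of transitions. In practice these are implemented with TensorFlow ops (usually on log-probabilities), so treat this as a sketch rather than the project's exact implementation.

```python
import numpy as np

def ppo_losses(new_probs, old_probs, advantages, predicted_values, returns, eps=0.2):
    """Sketch of PPO's two loss terms over a batch of transitions."""
    new_probs, old_probs = np.asarray(new_probs), np.asarray(old_probs)
    advantages = np.asarray(advantages)
    predicted_values, returns = np.asarray(predicted_values), np.asarray(returns)

    # Critic loss: mean squared error between predicted state values and observed returns.
    # E.g. a prediction of 0.5 against an observed return of 1.0 contributes (1.0 - 0.5)^2 = 0.25.
    value_loss = np.mean((returns - predicted_values) ** 2)

    # Actor loss: clipped surrogate objective on the ratio of new to old action probabilities.
    ratio = new_probs / old_probs
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -np.mean(np.minimum(unclipped, clipped))  # negated because we minimise the loss

    return policy_loss, value_loss
```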

The most comprehensive resources used to build this theoretical foundation were the Introduction to Reinforcement Learning course by David Silver, created by UCL and DeepMind, and the textbook 'Reinforcement Learning: An Introduction' by Sutton and Barto.

Literature Review

The literature review process began with a broad exploration of how reinforcement learning is applied within game environments. Several academic papers and research projects were examined to understand existing benchmarks and approaches. After surveying the landscape, two primary studies were identified as particularly relevant to this project.

The first is OpenAI’s "Hide and Seek" research, a landmark study in multi-agent reinforcement learning (MARL) that showcased the emergence of complex strategies through self-play and environment manipulation. The study not only emphasized MARL's potential for generating emergent behaviours but also highlighted future directions like improved exploration techniques and environment complexity as drivers of learning.

The second is the foundational work by Google Research’s Brain Team titled "Google Research Football: A Novel Reinforcement Learning Environment." The paper introduced the GRF platform and benchmarked multiple reinforcement learning algorithms (specifically PPO, IMPALA, and DQN) across tactical football scenarios. GRF was particularly compelling due to its strategic depth and support for team-based behaviours. The paper also identified future directions for research, including self-play, sparse rewards, model-based RL, and sample efficiency.

For this project, the focus was placed on multi-agent learning, drawing from the GRF paper’s insights into team coordination, while also exploring sample efficiency and sparse reward mechanisms highlighted as promising areas in both studies.

Methodology

Environment Selection

A major part of the early work involved identifying a suitable game environment for reinforcement learning experimentation. Several factors were taken into account during this process, including:

  • Open-source access and modifiability.

  • Built-in support for multi-agent configurations.

  • A balance between environmental complexity and practical feasibility (e.g., training time).

  • Compatibility with remote university server infrastructure and available GPU hardware.

The initial search explored widely used RL environments such as OpenAI Gym, Unity ML-Agents, and MuJoCo. However, many of these platforms either lacked support for multi-agent setups or required extensive environment customization. To narrow down the options, a structured review of benchmark environments was conducted using curated lists of environments suitable for multi-agent reinforcement learning.

This process led to the shortlisting of candidates with well-established use in academic research. Among these, Google Research Football (GRF) stood out due to its:

  • Built-in multi-agent capabilities (e.g., control of multiple teammates).

  • Strategic gameplay and team-based objectives suitable for coordination learning.

  • Support for curriculum learning and scenario scripting.

  • Endorsement in the academic community through benchmark results in the GRF paper.

GRF was ultimately selected because it offered the best trade-off between depth, customizability, research relevance, and practical feasibility within the timeframe of the project.

System Setup and Technical Constraints

Setting up GRF on the available infrastructure presented significant technical challenges, especially in the early stages of the project. While the environment was ultimately a good fit for the project’s goals, configuring it required navigating a complex web of dependencies (including specific versions of Python, TensorFlow, and other libraries), which made local setup time-consuming and error-prone. Although GRF was successfully run on a personal machine at times, consistent GPU-based training proved difficult to sustain locally due to limited resources and recurring compatibility issues.

As a result, the project was migrated to GPU-enabled remote university servers accessed via SSH. This enabled stable execution of long-running training jobs but introduced additional complexity. The setup process involved:

  • Troubleshooting TensorFlow GPU support and other compatibility issues.

  • Installing GRF from source and resolving dependency conflicts across multiple environments.

  • Designing training pipelines that could run autonomously with appropriate logging and monitoring tools for remote experimentation.

The limited documentation for GRF meant much of the setup involved reverse engineering provided examples and tracing source code to resolve issues. Despite consuming a substantial portion of the early timeline, these efforts laid the groundwork for stable and scalable experimentation in later phases of the project.

Understanding the Chosen Environment

Google Research Football (GRF) is a reinforcement learning environment developed by the Brain team at Google Research. It provides a physics-based simulation of football and supports a wide range of scenarios, from simple one-on-one drills to full 11-vs-11 matches. GRF is designed to be lightweight and customizable, with built-in support for curriculum learning, scripted scenarios, and multi-agent control.

The game environment provides multiple observation types. The most common are:

  • Pixel Representation: A stack of rendered image frames (typically RGB, 96x72 resolution), similar to raw video input.

  • Simple115v2: A structured, low-dimensional feature vector of length 115, encoding the positions and velocities of all players, ball location, score, game mode, and other state variables.

In this project, the Simple115v2 format was chosen due to its lower computational requirements and improved sample efficiency. The 115 features were provided to the neural network model as input, forming the basis for the agent's perception of the environment.
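For reference, creating a GRF environment that emits the Simple115v2 vector looks roughly like the snippet below. The argument names follow the public gfootball API, but the scenario name is an assumption used only for illustration and may differ from the exact configuration used in the project.

```python
import gfootball.env as football_env

# Single-agent scripted drill with the structured Simple115v2 observation.
env = football_env.create_environment(
    env_name="academy_empty_goal_close",   # illustrative scripted scenario name
    representation="simple115v2",          # 115-dimensional state vector
)

obs = env.reset()
print(obs.shape)   # expected: (115,) for a single controlled player
```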

To fully understand the internal data flow within GRF, I frequently delved into its open-source codebase. This involved navigating the source files to trace how observations were constructed and passed to agents, how rewards were computed, and how episode resets and transitions were handled. By examining specific Python modules and following function calls across the environment wrappers and core simulation logic, I was able to debug critical training behaviours and resolve inconsistencies. This hands-on reverse engineering approach was essential for gaining confidence in how GRF operated under the hood and for building a solid mental model of the learning pipeline.

Designing the Experiments

The experimentation process was iterative, beginning with foundational setups and gradually increasing in complexity. The project was structured to start with single-agent reinforcement learning to build the model architecture, validate the PPO implementation, and understand the underlying mechanisms before transitioning to more complex multi-agent scenarios.

The initial experiments focused on replicating results from the GRF paper to confirm correct implementation. These experiments involved training a single agent to score in an empty goal using PPO. During this phase, several technical and modelling components were tested: different input types (pixel-based vs. Simple115v2 state vector), PPO loss configurations (standard MSE loss for the value head vs. custom clipped objective for the policy head), and the use of Generalized Advantage Estimation (GAE) for advantage calculation.
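The advantage calculation followed the standard GAE recursion, sketched below. The gamma and lambda values shown are common defaults rather than the exact hyperparameters used in the project.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` is expected to hold one extra bootstrap entry for the state
    reached after the final step, so len(values) == len(rewards) + 1.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]   # regression targets for the critic (value head)
    return advantages, returns
```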

Once the PPO model was stable in the empty goal scenario, focus shifted to exploring the impact of reward functions. Two versions of the task were trained - one using sparse rewards (goal scoring only), and another using checkpoint-based intermediate rewards. This comparison laid the groundwork for understanding how reward design influences convergence and agent behavior.
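In GRF, the two reward settings map directly onto the environment's `rewards` argument, roughly as shown below (the scenario name is again assumed for illustration).

```python
import gfootball.env as football_env

# The two reward settings compared in this phase, expressed via GRF's `rewards` argument.
reward_configs = {
    "sparse": "scoring",                  # +1 only when a goal is scored
    "checkpoint": "scoring,checkpoints",  # extra small rewards for advancing the ball toward goal
}

envs = {
    name: football_env.create_environment(
        env_name="academy_empty_goal_close",  # illustrative scenario name
        representation="simple115v2",
        rewards=reward_string,
    )
    for name, reward_string in reward_configs.items()
}
```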

After validating the setup and reward mechanisms in single-agent settings, the project moved to a multi-agent setting: a 3v1 match scenario in which up to three teammates were controlled simultaneously, facing a single defender and a goalkeeper. This phase involved experimenting with different numbers of controlled players (one, two, or three) and reward types (sparse vs. checkpoint), resulting in six model configurations. All six were trained for 5 million steps in parallel on the university’s remote GPU servers, with the controlled players sharing policy parameters.
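A multi-agent GRF environment of this kind can be created roughly as follows. The scenario name, the number of controlled players, and the random stand-in for the shared policy are assumptions made for illustration.

```python
import numpy as np
import gfootball.env as football_env

# 3 vs 1 scenario with two left-team players controlled by the learning code.
env = football_env.create_environment(
    env_name="academy_3_vs_1_with_keeper",      # illustrative scenario name
    representation="simple115v2",
    rewards="scoring,checkpoints",
    number_of_left_players_agent_controls=2,    # number of teammates under agent control
)

def shared_policy(observation):
    """Stand-in for the shared PPO policy; returns a random action from GRF's 19-action set."""
    return int(np.random.randint(19))

obs = env.reset()                               # one 115-dim observation per controlled player
actions = [shared_policy(o) for o in obs]       # the same parameters act for every player
obs, rewards, done, info = env.step(actions)    # per-player rewards, shared episode termination
```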

As experiments progressed, overfitting became evident, particularly in models that repeatedly selected the same actions. To address this, two modifications were introduced:

  1. Action Probability Capping: A hard cap was applied to action probabilities to prevent agents from excessively favouring a single action. This was implemented manually by modifying the probability distribution in the training loop’s action-selection step.

  2. Epsilon-Greedy Exploration: Inspired by Q-learning, a random action was taken with a fixed probability (epsilon). A higher epsilon encouraged exploration by injecting stochasticity into a policy that had otherwise become effectively deterministic. A combined sketch of both mechanisms follows this list.
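Below is the combined sketch of both mechanisms. The epsilon and cap values are illustrative, and renormalising the distribution after capping is one reasonable implementation choice rather than necessarily the project's exact code.

```python
import numpy as np

def select_action(action_probs, epsilon=0.1, prob_cap=0.8):
    """Action selection with capped probabilities and epsilon-greedy exploration."""
    action_probs = np.asarray(action_probs, dtype=float)
    n_actions = len(action_probs)

    # 1) Action probability capping: no single action may keep more than `prob_cap`
    #    of the probability mass; the distribution is renormalised afterwards.
    capped = np.minimum(action_probs, prob_cap)
    capped /= capped.sum()

    # 2) Epsilon-greedy exploration: with probability epsilon, ignore the policy
    #    and take a uniformly random action.
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))

    return int(np.random.choice(n_actions, p=capped))
```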

These changes were first tested in a single-agent context, then extended to the multi-agent models, resulting in a final round of experiments covering all six previously mentioned configurations with added exploration mechanisms.

This flow - from replication and architecture validation to complex team-based scenarios - reflected a structured approach to experimentation, balancing exploration of both technical implementation and learning dynamics. For a detailed breakdown of all experiments and hyperparameters, refer to the final report.

Results and Observations

The results reflect the progression of experiments, showcasing the impact of reward design, action probabilities, and exploration strategies.

Empty Goal Scenario - Sparse Reward Only:

In early experiments, the agent successfully learned to score by running straight toward the goal. This behaviour emerged after roughly 64,000 training steps and became highly consistent, with the model eventually favouring the shoot action almost exclusively. Action probability distributions confirmed this overfitting, with some models assigning nearly 100% probability to a single action like “run right” or “shoot,” depending on the trained policy.

Empty Goal - Checkpoint Reward:

Introducing intermediate checkpoint rewards in an otherwise simple scenario proved detrimental. Agents overfit to intermediate signals (e.g., running near the corner), receiving high but misleading rewards that prevented them from learning the actual objective of scoring. Despite extended training (up to 1 million steps), these models failed to converge.

Multi-Agent 3v1 Setup:

When controlling multiple agents in a 3v1 setting, various combinations of agent count and reward types were explored. While agents began to exhibit early signs of coordination - such as passing between players - scoring was not consistently achieved. Notably, one model without the checkpoint reward learned to pass effectively through the defender, suggesting emergent cooperative behaviour even without explicit coordination programming.

Overall, while consistent goal-scoring was only achieved in the simplest setups, several models exhibited signs of emergent behaviour and strategic learning. These findings underscore the sensitivity of multi-agent reinforcement learning to reward shaping, training stability, and exploration mechanisms.

Key Learnings

This project provided substantial learning opportunities across theoretical foundations, practical implementation, and general research skills:

1. Theoretical Learning

  • Gained a deep understanding of Reinforcement Learning (RL) from scratch, including concepts like state, action, policy, and reward functions.

  • Studied advanced RL techniques, such as Proximal Policy Optimization (PPO), actor-critic models, and Q-Learning.

  • Learned and applied exploration strategies like epsilon-greedy methods and action probability capping to address overfitting.

  • Extended knowledge into Multi-Agent Reinforcement Learning (MARL), including how agents coordinate policies and share parameters.

  • Followed external materials independently (e.g., David Silver’s DeepMind course, Sutton & Barto textbook) to guide and reinforce theoretical understanding.

2. Practical & Infrastructure Learning

  • Learned how to configure and debug complex ML environments, resolving compatibility issues between Python, TensorFlow, and other dependencies.

  • Migrated from local machines to remote GPU servers, learning how to manage jobs, use SSH, and monitor long training runs.

  • Handled long-running, compute-intensive jobs by managing virtual environments and persistent logging remotely.

  • Gained experience reverse-engineering the environment's lightly-documented code and understanding its internal reward and observation logic.

3. General and Soft Skills

  • Learned how to read and adapt academic literature, identifying relevant research and applying ideas to practical experiments.

  • Developed better planning and prioritization, managing a solo research project with iterative weekly goals and continuous scope management.

  • Improved ability to analyse failed experiments, extract meaningful insights, and use them to guide further iterations.

  • Built confidence in learning new technologies independently, a skill applicable to any future research or engineering role.

Future Work

Building on the foundation laid in this project, several directions can be pursued to deepen technical rigor, enhance training quality, and explore research-grade experimentation:

1. Reward and Training Design

  • Designing improved reward functions: Explore alternative shaping strategies that better balance intermediate and terminal objectives, avoiding local optima.

  • Improving sample efficiency: Investigate methods like experience replay, prioritized sampling, or curriculum learning to accelerate convergence and reduce training time.

2. Algorithmic and Architectural Improvements

  • Benchmark other RL algorithms: Train and compare performance across algorithms such as A3C, SAC, DDPG, and IMPALA to evaluate their effectiveness in multi-agent setups.

  • Explore self-play and competitive training: Implement mechanisms for agents to train against previous versions of themselves or each other to encourage robustness and emergent strategy.

  • Use Ray RLlib or Stable Baselines3: Leverage high-level libraries to run large-scale, parallelized experiments with better abstraction and modularity.

3. Environment and Scenario Scaling

  • Full match setups (3v3, 5v5, or 11v11): Scale experiments beyond 3v1 to examine emergent coordination in more realistic scenarios, subject to available compute resources.

  • Train models to 50M steps: Push training further using remote servers to assess long-term learning and compare with benchmark results like those of IMPALA in the GRF paper.

  • Switching to pixel-based observations: Transition from structured input (Simple115v2) to raw visual input to test generalization across observation modalities.

4. Research and Deployment Readiness

  • Refining hyperparameters systematically: Apply Bayesian optimization or grid search to fine-tune learning rates, batch sizes, clipping ranges, and exploration parameters.

  • Building reproducible pipelines: Improve code modularity, experiment tracking, and logging to make the research easier to replicate and extend.

Resources

  1. This project’s GitHub repository

  2. Google Research Football’s GitHub repository

  3. Literature review paper: Emergent Tool Use From Multi-Agent Autocurricula

  4. Literature review paper: Google Research Football: A Novel Reinforcement Learning Environment

  5. RL Lectures from DeepMind & UCL: Introduction to Reinforcement Learning with David Silver

Keywords

Tools: TensorFlow, Keras, Google Research Football (GRF), Proximal Policy Optimization (PPO), Python, SSH, Git, GitHub, Remote GPU Servers

Tags: Reinforcement Learning (RL), Multi-Agent Reinforcement Learning (MARL), Policy Gradient Methods, Actor-Critic Models, Q-Learning, Deep Q-Networks (DQN), Exploration vs. Exploitation, Epsilon-Greedy Exploration, Undergraduate Research, Self-Learning, Google DeepMind, David Silver
