This project applies the algorithm from Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments to solve an episodic task. The implementation is based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG).
Deep Deterministic Policy Gradient (DDPG) is an off-policy actor-critic algorithm that uses the concept of target networks.
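As an illustration of the target-network idea, here is a minimal sketch of the soft update used in DDPG-style agents; the function name and the standalone form are assumptions for this README and do not necessarily match the project code.

```python
import torch

def soft_update(local_model: torch.nn.Module, target_model: torch.nn.Module, tau: float) -> None:
    """Blend the target network slowly towards the local network:
    theta_target = tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```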
To solve the Tennis environment, the agents have to work together: they have to collaborate to achieve the goal, which is to keep the ball in play. Each agent has its own local observations, its own policy and its own actions.
During training,
- the critic uses extra information: the states observed and the actions taken by all agents.
- each actor only has access to its own agent's observations and actions.
During execution, only the actors are present, and each agent can only use its own observations.
This approach of decentralized actors and a centralized critic is adopted from the paper and is shown in the figure below.
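The split between decentralized actors and a centralized critic can be sketched roughly as follows; the placeholder networks, tensor names and shapes are illustrative assumptions, not the exact code of this repository.

```python
import torch
import torch.nn as nn

# Illustrative assumptions: 2 agents, per-agent observation size OBS and action size ACT
num_agents, OBS, ACT = 2, 33, 4

# Placeholder models standing in for the real actor/critic networks
actors = [nn.Sequential(nn.Linear(OBS, ACT), nn.Tanh()) for _ in range(num_agents)]
critics = [nn.Linear(num_agents * (OBS + ACT), 1) for _ in range(num_agents)]

obs_all = [torch.randn(1, OBS) for _ in range(num_agents)]  # each agent's local observation

# Execution: every actor acts on its own observation only
actions = [actor(obs) for actor, obs in zip(actors, obs_all)]

# Training: each centralized critic is fed all observations and all actions
critic_input = torch.cat(obs_all + actions, dim=1)           # shape (1, num_agents * (OBS + ACT))
q_values = [critic(critic_input) for critic in critics]
```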
The model consists of two networks: the actor network and the critic network.

Actor network:
- Input layer: 33 input nodes (size of the state vector)
- Hidden layer 1: 128 nodes, ReLU activation
- Hidden layer 2: 64 nodes, ReLU activation
- Output layer: 4 output nodes (size of the action vector), tanh activation

Critic network:
- Input layer: 33+4 input nodes (size of the state and action vectors per agent)
- Hidden layer 1: 256 nodes, ReLU activation
- Hidden layer 2: 64 nodes, ReLU activation
- Output layer: 1 node, no activation (Q value). In DDPG, the critic evaluated at the actor's best-believed action approximates the maximum over the Q values of the next state.
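A minimal PyTorch sketch of networks with the layer sizes listed above; the class names and the exact composition of the critic input are assumptions for illustration and may differ from the project's model file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE, ACTION_SIZE = 33, 4  # sizes as listed above (assumed per agent)

class Actor(nn.Module):
    """Maps a state to an action bounded to [-1, 1] by tanh."""
    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE):
        super().__init__()
        self.fc1 = nn.Linear(state_size + action_size, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(torch.cat([state, action], dim=1)))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```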
- n_episodes: Maximum number of episodes. Default: 20000
- ou_noise: Initial scale of the Ornstein–Uhlenbeck process noise. Setting this to zero makes the agents unable to learn. Default: 2.0
- ou_noise_decay_rate: Rate at which to decay the noise after each epoch. Default: 0.998
- buffer_size: Size of the replay buffer in samples. Default: 1000000
- batch_size: Size of the batches to sample. Default: 512
- update_every: Number of epochs between agent updates. Default: 2
- tau: Rate at which the target networks are soft-updated. Default: 0.01
- lr_actor: Learning rate of the actor. Default: 0.001
- lr_critic: Learning rate of the critic. Default: 0.001
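The ou_noise and ou_noise_decay_rate parameters control exploration noise drawn from an Ornstein–Uhlenbeck process. A minimal sketch of such a process with a decaying scale is shown below; the theta and sigma values and the method names are assumptions, not necessarily those used in this project.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process producing temporally correlated exploration noise."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, scale=2.0, decay=0.998):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.scale, self.decay = scale, decay   # scale ~ ou_noise, decay ~ ou_noise_decay_rate
        self.reset()

    def reset(self):
        # Restart the process at the mean at the beginning of an episode
        self.state = self.mu.copy()

    def sample(self):
        # Drift towards the mean plus Gaussian diffusion
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.scale * self.state

    def decay_noise(self):
        # Called once per epoch to gradually reduce exploration
        self.scale *= self.decay
```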
Episode 100 Average: 0.002 Min:0.000 Max:0.100
...
Episode 600 Average: 0.104 Max:0.300
Episode 900 Average: 0.283 Max:2.600
Episode 926 Average: 0.503 Max:2.600
Environment solved after 926 episodes!
Here's a plot that shows the development of the scores and the moving average per episode.
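A short sketch of how such a plot can be produced from the per-episode scores; the scores array is a placeholder and the variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.rand(926)  # placeholder: per-episode score (max over both agents)
window = 100
moving_avg = [np.mean(scores[max(0, i - window + 1):i + 1]) for i in range(len(scores))]

plt.plot(scores, label="score per episode")
plt.plot(moving_avg, label=f"moving average ({window} episodes)")
plt.axhline(0.5, linestyle="--", color="gray", label="solved threshold")
plt.xlabel("Episode")
plt.ylabel("Score")
plt.legend()
plt.show()
```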
To improve the agents' performance:
- train longer
- try different learning rates, maybe decay the noise faster, and...
- of course: implement a scaled-up version of PPO and play StarCraft!