This project applies the algorithm from Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments to solve an episodic task. The implementation is based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG).
Deep Deterministic Policy Gradient (DDPG) is an off-policy actor-critic algorithm that uses the concept of target networks.
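As an illustration of the target-network idea, here is a minimal sketch of the soft update used in DDPG-style agents; the function name and the standalone form are assumptions for this README and do not necessarily match the project code.

```python
import torch

def soft_update(local_model: torch.nn.Module, target_model: torch.nn.Module, tau: float) -> None:
    """Blend the target network slowly towards the local network:
    theta_target = tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```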
To solve the Tennis environment, the agents have to work together: they have to collaborate to achieve the goal, which is to keep the ball in play. Each agent has its own local observations, its own policy and its own actions.
During training,
- the critic uses extra information: the states observed and the actions taken by all agents.
- each actor only has access to its own agent's observations and actions.
During execution, only the actors are present, and each agent can only use its own observations.
This approach of decentralized actors and a centralized critic is adopted from the paper and is shown in the figure below.
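The split between decentralized actors and a centralized critic can be sketched roughly as follows; the placeholder networks, tensor names and shapes are illustrative assumptions, not the exact code of this repository.

```python
import torch
import torch.nn as nn

# Illustrative assumptions: 2 agents, per-agent observation size OBS and action size ACT
num_agents, OBS, ACT = 2, 33, 4

# Placeholder models standing in for the real actor/critic networks
actors = [nn.Sequential(nn.Linear(OBS, ACT), nn.Tanh()) for _ in range(num_agents)]
critics = [nn.Linear(num_agents * (OBS + ACT), 1) for _ in range(num_agents)]

obs_all = [torch.randn(1, OBS) for _ in range(num_agents)]  # each agent's local observation

# Execution: every actor acts on its own observation only
actions = [actor(obs) for actor, obs in zip(actors, obs_all)]

# Training: each centralized critic is fed all observations and all actions
critic_input = torch.cat(obs_all + actions, dim=1)           # shape (1, num_agents * (OBS + ACT))
q_values = [critic(critic_input) for critic in critics]
```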
The model consists of two networks: the actor network and the critic network.

Actor network:
- Input layer: 33 input nodes (size of the state vector)
- Hidden layer 1: 128 nodes, ReLU activation
- Hidden layer 2: 64 nodes, ReLU activation
- Output layer: 4 output nodes (size of the action vector), tanh activation

Critic network:
- Input layer: 33+4 input nodes (size of the state and action vectors per agent)
- Hidden layer 1: 256 nodes, ReLU activation
- Hidden layer 2: 64 nodes, ReLU activation
- Output layer: 1 node, no activation (Q value). In DDPG, the critic evaluated at the actor's best-believed action approximates the maximum over the Q values of the next state.
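A minimal PyTorch sketch of networks with the layer sizes listed above; the class names and the exact composition of the critic input are assumptions for illustration and may differ from the project's model file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE, ACTION_SIZE = 33, 4  # sizes as listed above (assumed per agent)

class Actor(nn.Module):
    """Maps a state to an action bounded to [-1, 1] by tanh."""
    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE):
        super().__init__()
        self.fc1 = nn.Linear(state_size + action_size, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(torch.cat([state, action], dim=1)))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```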
- n_episodes: Maximum number of episodes. Default: 20000
- ou_noise: Initial scale of the Ornstein–Uhlenbeck process noise. Setting this to zero makes the agents unable to learn. Default: 2.0
- ou_noise_decay_rate: Rate at which to decay the noise after each epoch. Default: 0.998
- buffer_size: Size of the replay buffer in samples. Default: 1000000
- batch_size: Size of the batches to sample. Default: 512
- update_every: Number of epochs between agent updates. Default: 2
- tau: Rate at which the target networks are soft-updated. Default: 0.01
- lr_actor: Learning rate of the actor. Default: 0.001
- lr_critic: Learning rate of the critic. Default: 0.001
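The ou_noise and ou_noise_decay_rate parameters control exploration noise drawn from an Ornstein–Uhlenbeck process. A minimal sketch of such a process with a decaying scale is shown below; the theta and sigma values and the method names are assumptions, not necessarily those used in this project.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process producing temporally correlated exploration noise."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, scale=2.0, decay=0.998):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.scale, self.decay = scale, decay   # scale ~ ou_noise, decay ~ ou_noise_decay_rate
        self.reset()

    def reset(self):
        # Restart the process at the mean at the beginning of an episode
        self.state = self.mu.copy()

    def sample(self):
        # Drift towards the mean plus Gaussian diffusion
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.scale * self.state

    def decay_noise(self):
        # Called once per epoch to gradually reduce exploration
        self.scale *= self.decay
```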
Episode 100 Average: 0.002 Min:0.000 Max:0.100
...
Episode 600 Average: 0.104 Max:0.300
Episode 900 Average: 0.283 Max:2.600
Episode 926 Average: 0.503 Max:2.600
Environment solved after 926 episodes!
Here's a plot that shows the development of the scores and the moving average per episode.
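A short sketch of how such a plot can be produced from the per-episode scores; the scores array is a placeholder and the variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.rand(926)  # placeholder: per-episode score (max over both agents)
window = 100
moving_avg = [np.mean(scores[max(0, i - window + 1):i + 1]) for i in range(len(scores))]

plt.plot(scores, label="score per episode")
plt.plot(moving_avg, label=f"moving average ({window} episodes)")
plt.axhline(0.5, linestyle="--", color="gray", label="solved threshold")
plt.xlabel("Episode")
plt.ylabel("Score")
plt.legend()
plt.show()
```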
To improve the agents' performance:
- train longer
- try different learning rates, maybe decay the noise faster, and...
- of course: implement a scaled-up version of PPO and play StarCraft!