
[Question] Significant Performance Disparity Between Maskable PPO and PPO #283

gemelom opened this issue Mar 8, 2025 · 1 comment


gemelom commented Mar 8, 2025

❓ Question

I tried to run PPO and Maskable PPO on my custom environment with the same configuration, but I found that Maskable PPO (~5 fps) is much slower than PPO (~140 fps).

Here's my configuration:

  • environment setup
    env.action_space = MultiBinary(339)
  • reproduction code (a sketch of how the action masks could be exposed follows the snippet)
    from sb3_contrib import MaskablePPO
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    from wandb.integration.sb3 import WandbCallback

    config = {
        "env_name": "my_env_name",
        "n_envs": 16,
        "policy_type": "MlpPolicy",
        "total_timesteps": 100000,
    }

    # DummyVecEnv
    vec_env = make_vec_env(config["env_name"], n_envs=config["n_envs"])

    model = MaskablePPO("MlpPolicy", vec_env, n_steps=128, verbose=1)
    # model = PPO("MlpPolicy", vec_env, n_steps=128, verbose=1)

    model.learn(
        total_timesteps=config["total_timesteps"],
        callback=WandbCallback(
            gradient_save_freq=100,
            model_save_path=f"models/{experiment_name}",
            verbose=2,
        ),
        progress_bar=True,
    )
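
For reference, a minimal sketch of how a custom environment could expose the masks that MaskablePPO consumes, either via an action_masks() method on the env or via the ActionMasker wrapper. The valid_action_mask helper and the two-entries-per-binary-dimension mask shape are assumptions for illustration, not my actual environment code:

    import numpy as np
    from sb3_contrib.common.wrappers import ActionMasker

    def mask_fn(env) -> np.ndarray:
        # Hypothetical helper on the custom env; for MultiBinary(339) the mask is
        # assumed to hold two boolean entries per binary dimension, i.e. shape (2 * 339,).
        return env.valid_action_mask()

    # Wrap a single (non-vectorized) env instance so MaskablePPO can query the masks
    env = ActionMasker(env, mask_fn)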

I also tried to profile my code with py-spy, and I found that MaskablePPO spent a lot of extra time in these lines:

[py-spy profile screenshot of the MaskablePPO run]

while PPO spent much less time in train and most of its time in collect_rollouts, as expected.
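
For reference, a typical py-spy invocation for this kind of profiling (train.py is a placeholder for my training script):

    py-spy record -o profile.svg -- python train.py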

I wonder whether this extreme drop in training efficiency is expected given the large action space, or whether there is a bug in the implementation.



araffin commented Mar 10, 2025

Hello,

I also tried to profile my code with py-spy, and I found that MaskablePPO spent a lot of extra time in these lines

At least the slowdown is where it would be expected.
I'm a bit surprised by how much it slows things down, but the code was never optimized for speed, so there is probably room for improvement.

Related code:

    masks_tensor = th.as_tensor(masks)
    # Restructure shape to align with logits
    masks_tensor = masks_tensor.view(-1, sum(self.action_dims))
    # Then split columnwise for each discrete action
    split_masks = th.split(masks_tensor, list(self.action_dims), dim=1)  # type: ignore[assignment]
    for distribution, mask in zip(self.distributions, split_masks):
        distribution.apply_masking(mask)

and

    device = self.logits.device
    self.masks = th.as_tensor(masks, dtype=th.bool, device=device).reshape(self.logits.shape)
    HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)
    logits = th.where(self.masks, self._original_logits, HUGE_NEG)
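
For a MultiBinary(339) space the masking code above treats the action as 339 two-way categoricals, so apply_masking ends up running a Python-level loop of 339 small tensor operations on every call, which is presumably where the extra time goes. A rough standalone timing sketch of the loop-vs-batched cost (an illustration, not the actual sb3_contrib code):

    import time
    import torch as th

    n_dims, n_envs = 339, 16
    HUGE_NEG = th.tensor(-1e8)

    logits = th.randn(n_envs, 2 * n_dims)
    masks = th.rand(n_envs, 2 * n_dims) > 0.1

    # Per-dimension Python loop, similar in shape to iterating over self.distributions
    start = time.perf_counter()
    for _ in range(100):
        for l, m in zip(th.split(logits, [2] * n_dims, dim=1), th.split(masks, [2] * n_dims, dim=1)):
            th.where(m, l, HUGE_NEG)
    loop_time = time.perf_counter() - start

    # One batched call over the full (n_envs, 2 * n_dims) tensor
    start = time.perf_counter()
    for _ in range(100):
        th.where(masks, logits, HUGE_NEG)
    batched_time = time.perf_counter() - start

    print(f"loop: {loop_time:.3f}s  batched: {batched_time:.3f}s")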
