
[Question] Significant Performance Disparity Between Maskable PPO and PPO #283

gemelom opened this issue Mar 8, 2025 · 1 comment


gemelom commented Mar 8, 2025

❓ Question

I tried to run PPO and Maskable PPO on my custom environment with the same configuration, but I found that Maskable PPO (~5 fps) is much slower than PPO (~140 fps).

Here's my configuration:

  • environment setup
    env.action_space = MultiBinary(339)
  • reproduction code (a sketch of how the action masks could be exposed follows the snippet)
    from sb3_contrib import MaskablePPO
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    from wandb.integration.sb3 import WandbCallback

    config = {
        "env_name": "my_env_name",
        "n_envs": 16,
        "policy_type": "MlpPolicy",
        "total_timesteps": 100000,
    }

    # DummyVecEnv
    vec_env = make_vec_env(config["env_name"], n_envs=config["n_envs"])

    model = MaskablePPO("MlpPolicy", vec_env, n_steps=128, verbose=1)
    # model = PPO("MlpPolicy", vec_env, n_steps=128, verbose=1)

    model.learn(
        total_timesteps=config["total_timesteps"],
        callback=WandbCallback(
            gradient_save_freq=100,
            model_save_path=f"models/{experiment_name}",
            verbose=2,
        ),
        progress_bar=True,
    )
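
For reference, a minimal sketch of how a custom environment could expose the masks that MaskablePPO consumes, either via an action_masks() method on the env or via the ActionMasker wrapper. The valid_action_mask helper and the two-entries-per-binary-dimension mask shape are assumptions for illustration, not my actual environment code:

    import numpy as np
    from sb3_contrib.common.wrappers import ActionMasker

    def mask_fn(env) -> np.ndarray:
        # Hypothetical helper on the custom env; for MultiBinary(339) the mask is
        # assumed to hold two boolean entries per binary dimension, i.e. shape (2 * 339,).
        return env.valid_action_mask()

    # Wrap a single (non-vectorized) env instance so MaskablePPO can query the masks
    env = ActionMasker(env, mask_fn)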

I also tried to profile my code with py-spy, and I found that MaskablePPO spent a lot of extra time in these lines:

[py-spy profile screenshot of the MaskablePPO run]

while PPO spent much less time in train and most of its time in collect_rollouts, as expected.
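
For reference, a typical py-spy invocation for this kind of profiling (train.py is a placeholder for my training script):

    py-spy record -o profile.svg -- python train.py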

I wonder whether this extreme drop in training efficiency is expected given the large action space, or whether there is a bug in the implementation.



araffin commented Mar 10, 2025

Hello,

I also tried to profile my code with py-spy, and I found that MaskablePPO spent a lot of extra time in these lines

At least the slowdown is where it would be expected.
I'm a bit surprised by how much it slows things down, but the code was never optimized for speed, so there is probably room for improvement.

Related code:

    masks_tensor = th.as_tensor(masks)
    # Restructure shape to align with logits
    masks_tensor = masks_tensor.view(-1, sum(self.action_dims))
    # Then split columnwise for each discrete action
    split_masks = th.split(masks_tensor, list(self.action_dims), dim=1)  # type: ignore[assignment]
    for distribution, mask in zip(self.distributions, split_masks):
        distribution.apply_masking(mask)

and

    device = self.logits.device
    self.masks = th.as_tensor(masks, dtype=th.bool, device=device).reshape(self.logits.shape)
    HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)
    logits = th.where(self.masks, self._original_logits, HUGE_NEG)
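
For a MultiBinary(339) space the masking code above treats the action as 339 two-way categoricals, so apply_masking ends up running a Python-level loop of 339 small tensor operations on every call, which is presumably where the extra time goes. A rough standalone timing sketch of the loop-vs-batched cost (an illustration, not the actual sb3_contrib code):

    import time
    import torch as th

    n_dims, n_envs = 339, 16
    HUGE_NEG = th.tensor(-1e8)

    logits = th.randn(n_envs, 2 * n_dims)
    masks = th.rand(n_envs, 2 * n_dims) > 0.1

    # Per-dimension Python loop, similar in shape to iterating over self.distributions
    start = time.perf_counter()
    for _ in range(100):
        for l, m in zip(th.split(logits, [2] * n_dims, dim=1), th.split(masks, [2] * n_dims, dim=1)):
            th.where(m, l, HUGE_NEG)
    loop_time = time.perf_counter() - start

    # One batched call over the full (n_envs, 2 * n_dims) tensor
    start = time.perf_counter()
    for _ in range(100):
        th.where(masks, logits, HUGE_NEG)
    batched_time = time.perf_counter() - start

    print(f"loop: {loop_time:.3f}s  batched: {batched_time:.3f}s")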
