
[Question] Why env_change[batch_inds] is not considered during _get_samples(*) in RecurrentRolloutBuffer? #284


Open
Hhannzch opened this issue Mar 11, 2025 · 4 comments · May be fixed by #290
Labels
question (Further information is requested)

Comments

Hhannzch commented Mar 11, 2025

❓ Question

When sampling batches from a filled RecurrentRolloutBuffer, only episode_starts[batch_inds] is returned with the sequence data, and this "episode_starts" is what the LSTM policy uses to reset its hidden state during training.
However, I have a question about this behavior: since the seq_start_indices are determined by both episode_starts and env_change, why is only episode_starts returned?
To be more precise: why is line 240 in common.recurrent.buffers written as "episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds])" instead of something like "episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds] or env_change[batch_inds])"?

Thank you in advance for the explanation.


Hhannzch added the question label Mar 11, 2025
araffin (Member) commented Mar 13, 2025

Hello,
thanks for pointing that out.
It might be a bug.
I need to dig deeper; this code is overly complex even with all the comments 🙈

araffin (Member) commented Mar 31, 2025

To refresh my memory, I've created some graphics (I should probably add them to the SB3 docs later).

First, we collect n_steps * n_envs transitions and then flatten them to be able to sample from them:

[Image: flattening the collected rollout]
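
For intuition, here is a minimal numpy sketch of that flattening (shapes and variable names are illustrative, not the exact buffer code): swapping the axes before the reshape keeps each env's trajectory contiguous, and env_change flags where the flat view jumps from one env to the next.

```python
import numpy as np

n_steps, n_envs = 4, 2

# Data is stored as (n_steps, n_envs); swapping axes before the reshape
# keeps each env's trajectory contiguous in the flattened view.
data = np.arange(n_steps * n_envs).reshape(n_steps, n_envs)
flat = data.swapaxes(0, 1).reshape(n_steps * n_envs)
print(flat)  # [0 2 4 6 1 3 5 7] -> env 0 first, then env 1

# env_change flags the indices where the flat view crosses from one
# env's trajectory into the next one's.
env_change = np.zeros((n_steps, n_envs))
env_change[0, :] = 1.0
print(env_change.swapaxes(0, 1).reshape(-1))  # [1. 0. 0. 0. 1. 0. 0. 0.]
```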

Optionally, we randomly split the flattened sequences to add a bit of randomness when sampling over multiple epochs:

[Image: random split of the flattened sequences]
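
A sketch of the split trick, under the same illustrative assumptions: pick a random cut point and rotate the flat indices, so the relative order of transitions is preserved while the sampling offset changes from epoch to epoch.

```python
import numpy as np

total = 8  # n_steps * n_envs from the sketch above
indices = np.arange(total)

# Rotate the flat indices around a random cut point: the order of the
# transitions is kept, only the starting offset changes.
split_index = np.random.randint(total)
indices = np.concatenate((indices[split_index:], indices[:split_index]))
```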

We need to zero out the LSTM hidden state whenever a new sequence starts (i.e., on an episode start or an env-change event).
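
In code, that zeroing amounts to a masked multiply; here is a minimal torch sketch in the spirit of what the recurrent policy does (shapes and names are illustrative):

```python
import torch as th

n_layers, n_seq, hidden_dim = 1, 4, 8
hidden = th.ones(n_layers, n_seq, hidden_dim)
# 1.0 marks sequences whose first step begins a new (sub-)sequence
seq_start = th.tensor([1.0, 0.0, 1.0, 0.0])

# Zero the hidden state only where a new sequence starts; the other
# sequences keep their stored state.
hidden = (1.0 - seq_start).view(1, n_seq, 1) * hidden
```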

Then we take batch_size transitions and reshape them by sequence so they can be padded:

[Image: reshaping the sampled batch by sequences and padding]
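
A sketch of that reshaping, assuming a flat mini-batch and a boolean seq_start mask (the values are made up): cut the batch at every sequence start, then pad every piece to the length of the longest one.

```python
import numpy as np

batch = np.arange(8)  # 8 sampled transitions
# New sequences start at indices 0, 3 and 5 (episode start or env change);
# index 0 always opens a sequence.
seq_start = np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=bool)

starts = np.flatnonzero(seq_start)
pieces = np.split(batch, starts[1:])
max_len = max(len(p) for p in pieces)

# Pad every piece to max_len -> (n_seq, max_len), ready to be fed
# to the LSTM after a final transpose.
padded = np.stack([np.pad(p, (0, max_len - len(p))) for p in pieces])
print(padded)
# [[0 1 2]
#  [3 4 0]
#  [5 6 7]]
```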

Note: we use the flattened sequences to compute things like the advantage, but the last operation is reverted before the mini-batch is fed to the LSTM (which expects (max_length, n_seq, features_dim) as input).

araffin (Member) commented Mar 31, 2025

To be more precise: why is line 240 in common.recurrent.buffers written as "episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds])" instead of something like "episode_starts=self.pad_and_flatten(self.episode_starts[batch_inds] or env_change[batch_inds])"?

After looking at it, it should probably be:

seq_start = np.logical_or(episode_starts, env_change).flatten()

We should also double-check the effect of the random split on it...
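
To make the difference concrete, a small made-up example: with the current code, a sub-sequence created by an env change is not flagged as a start, so its hidden state would not be reset; the logical_or version flags it as well.

```python
import numpy as np

episode_starts = np.array([0.0, 0.0, 1.0, 0.0])
env_change = np.array([1.0, 0.0, 0.0, 0.0])

current = episode_starts  # [0. 0. 1. 0.] -> env change at index 0 ignored
proposed = np.logical_or(episode_starts, env_change).astype(np.float32)
print(proposed)           # [1. 0. 1. 0.]
```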

ar-too commented Apr 16, 2025

This looks like a feature, not a bug. We cut multiple sub-sequences, but only a few of them are actual episode starts. For those that are actual starts, we later reset the hidden state (on the first sequence step). For those that aren't, we use the stored hidden states instead of resetting them on the first sequence step.
