
Weight sharing not consistent with paper #67

Open
gshaikov opened this issue Oct 13, 2023 · 0 comments
Hi Phil,

I'd like to confirm the reasoning behind this design choice:

for i in range(depth):
    should_cache = i > 0 and weight_tie_layers
    cache_args = {'_cache': should_cache}
    self_attns = nn.ModuleList([])
    for block_ind in range(self_per_cross_attn):
        self_attns.append(nn.ModuleList([
            get_latent_attn(**cache_args, key = block_ind),
            get_latent_ff(**cache_args, key = block_ind)
        ]))
    self.layers.append(nn.ModuleList([
        get_cross_attn(**cache_args),
        get_cross_ff(**cache_args),
        self_attns
    ]))

In the paper, the authors say they tie the weights of all latent transformer blocks. In this implementation, however, the latent transformer in the first layer is not shared with the rest, because `should_cache` is `False` when `i == 0`.

(Screenshot of the relevant passage from the paper on weight sharing.)
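For context, the `_cache` flag suggests the `get_*` constructors are wrapped in a memoizing helper that returns a previously built module whenever `_cache` is truthy, which is how the weight tying is realized. Here is a minimal sketch of such a wrapper; the name `cache_fn` and its exact body are my assumption, with only the `_cache` and `key` keyword arguments taken from the snippet above:

from functools import wraps

def cache_fn(f):
    # Hypothetical sketch: memoize the nn.Module returned by f so that
    # later calls with _cache=True (and the same key) get back the exact
    # same module instance, i.e. those layers share weights.
    cache = dict()
    @wraps(f)
    def cached_fn(*args, _cache = True, key = None, **kwargs):
        if not _cache:
            # _cache=False builds a fresh module with its own parameters
            return f(*args, **kwargs)
        if key not in cache:
            cache[key] = f(*args, **kwargs)
        return cache[key]
    return cached_fn

Under a scheme like this, `should_cache = i > 0 and weight_tie_layers` means the first layer always receives freshly constructed modules whose parameters are never reused by later layers, which is what prompts the suggestion below.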

It should probably be

    for block_ind in range(self_per_cross_attn):
        self_attns.append(nn.ModuleList([
            get_latent_attn(_cache = True, key = block_ind),
            get_latent_ff(_cache = True, key = block_ind)
        ]))

What do you think?
