
Weight sharing not consistent with paper #67

Open
gshaikov opened this issue Oct 13, 2023 · 0 comments
Hi Phil,

I'd like to confirm the reasoning behind this design choice:

for i in range(depth):
    should_cache = i > 0 and weight_tie_layers
    cache_args = {'_cache': should_cache}
    self_attns = nn.ModuleList([])
    for block_ind in range(self_per_cross_attn):
        self_attns.append(nn.ModuleList([
            get_latent_attn(**cache_args, key = block_ind),
            get_latent_ff(**cache_args, key = block_ind)
        ]))
    self.layers.append(nn.ModuleList([
        get_cross_attn(**cache_args),
        get_cross_ff(**cache_args),
        self_attns
    ]))

In the paper, the authors say they tie the weights of all latent transformer blocks. In this implementation, however, the latent transformer in the first layer is not shared with the rest, because `should_cache` is `False` when `i == 0`.

(Screenshot of the relevant passage from the paper on weight sharing.)
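For context, the `_cache` flag suggests the `get_*` constructors are wrapped in a memoizing helper that returns a previously built module whenever `_cache` is truthy, which is how the weight tying is realized. Here is a minimal sketch of such a wrapper; the name `cache_fn` and its exact body are my assumption, with only the `_cache` and `key` keyword arguments taken from the snippet above:

from functools import wraps

def cache_fn(f):
    # Hypothetical sketch: memoize the nn.Module returned by f so that
    # later calls with _cache=True (and the same key) get back the exact
    # same module instance, i.e. those layers share weights.
    cache = dict()
    @wraps(f)
    def cached_fn(*args, _cache = True, key = None, **kwargs):
        if not _cache:
            # _cache=False builds a fresh module with its own parameters
            return f(*args, **kwargs)
        if key not in cache:
            cache[key] = f(*args, **kwargs)
        return cache[key]
    return cached_fn

Under a scheme like this, `should_cache = i > 0 and weight_tie_layers` means the first layer always receives freshly constructed modules whose parameters are never reused by later layers, which is what prompts the suggestion below.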

It should probably be

    for block_ind in range(self_per_cross_attn):
        self_attns.append(nn.ModuleList([
            get_latent_attn(_cache = True, key = block_ind),
            get_latent_ff(_cache = True, key = block_ind)
        ]))

What do you think?
