Several questions on reproducing the training steps #125
We use the same teachers across stages, and interpolate the output features so their spatial sizes match. The current algorithm resamples the student and teacher features to the minimum size of the two.
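To make that concrete, here's a rough sketch of that matching step (illustrative only, not the exact code in the repo), assuming the student and teacher features have already been reshaped into spatial maps of shape (B, C, H, W) and that bilinear interpolation is acceptable for features:

```python
import torch
import torch.nn.functional as F

def match_feature_sizes(student_feats: torch.Tensor, teacher_feats: torch.Tensor):
    """Resample both (B, C, H, W) feature maps to the smaller of the two spatial sizes."""
    h = min(student_feats.shape[-2], teacher_feats.shape[-2])
    w = min(student_feats.shape[-1], teacher_feats.shape[-1])

    def to_min(x: torch.Tensor) -> torch.Tensor:
        # Only resample when the grid differs from the target size.
        if x.shape[-2:] != (h, w):
            x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
        return x

    return to_min(student_feats), to_min(teacher_feats)
```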
One stage totally makes sense, but out of sheer laziness/time constraints, between stages we actually just start over with the initial 1e-3 learning rate and a new schedule. Totally possible that we're leaving something on the table in that regard.
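For reference, restarting the schedule at each stage would look roughly like this; the optimizer choice and the cosine shape are my assumptions for illustration, not confirmed details of the training recipe:

```python
import torch

def start_new_stage(model: torch.nn.Module, steps_in_stage: int, base_lr: float = 1e-3):
    # Rebuild the optimizer and schedule from scratch at the start of each stage,
    # restarting from the initial 1e-3 learning rate instead of continuing the old decay.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps_in_stage)
    return optimizer, scheduler
```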
We actually use bilinear resampling for images. That's even true for the images we feed SAM. It totally makes sense to hand higher-res images to the hi-res partition during training because, you're right, most of the images in DataComp-1B actually aren't very hi-res.
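As a hedged illustration of that resampling (the square target sizes and the antialias flag are assumptions on my part, not details quoted from the authors):

```python
import torch
import torch.nn.functional as F

def resize_for_teacher(images: torch.Tensor, teacher_input_size: int) -> torch.Tensor:
    """Bilinearly resample a batch of images (B, 3, H, W) to a square teacher input
    resolution, e.g. 1024 for SAM or 336 for OpenAI CLIP 336px."""
    return F.interpolate(
        images,
        size=(teacher_input_size, teacher_input_size),
        mode="bilinear",
        align_corners=False,
        antialias=True,  # antialiasing is an assumption here, not confirmed by the authors
    )
```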
Tables 2 and 3 do not include SAM. We re-ran the models in both of those settings. The key difference is that Table 3 used [OpenAI CLIP 336px, DINOv2] as the teachers, whereas Table 2 used [MetaCLIP, DINOv2] as the teachers. This came down to timing as, over the course of writing the paper, both MetaCLIP and DFN CLIP were released to the public. Admittedly, it makes the paper a bit harder to track, but the most important part is just that the ablations are internally consistent.
Your understanding is correct. If multiple teachers share a partition, then they receive the same images (although the sizes may differ, based on each teacher's input resolution). Teachers on different partitions run on different GPUs and receive different data. Teacher overhead changes because, in the multi-partition scheme, each teacher's effective batch size per step is reduced. Take this example:

Batch size: 1024; teachers: CLIP, DINOv2
- 1 partition: both teachers get all 1024 images
- 2 partitions: each teacher gets 512 of the images

For simplicity, if we assume that both teachers are equally expensive to run inference on, then in the 2-partition scheme we cut the teacher overhead in half. The student still sees all 1024 images, but the training signal does not come from all teachers for all images.
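The arithmetic of that example as a tiny illustrative snippet (the names and numbers come from the example above, not from the codebase):

```python
batch_size = 1024
teachers = ["CLIP", "DINOv2"]

def images_per_teacher(global_batch: int, num_partitions: int) -> int:
    # Each partition receives an equal, disjoint slice of the global batch, and every
    # teacher assigned to a partition only runs inference on that slice.
    return global_batch // num_partitions

print(images_per_teacher(batch_size, 1))  # 1024: both teachers see every image
print(images_per_teacher(batch_size, 2))  # 512: each teacher sees half the batch,
                                          # roughly halving total teacher overhead
```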
That would help a lot. Thanks again for your answer!
Hi,
I really appreciate your work. However, when I tried to reproduce some of the training results, I ran into several questions about the implementation details.
I’m an undergraduate and didn’t have much experience with model training before, so some of these questions might sound trivial.
Thanks so much for your answers!