Several questions on reproducing the training steps #125

Open
DerickJin3316 opened this issue Feb 13, 2025 · 2 comments

@DerickJin3316

Hi,
I really appreciate your work. However, when I tried to reproduce some of the training results, I ran into several questions about the implementation details.

  1. (Referring to the AM-RADIO paper here.) I didn't quite understand how the student is trained at multiple resolutions, since I assume CLIP works at a fixed resolution and the student must match the shape of the teacher's features. Did you use different CLIP models during the two training stages? (Concretely, CLIP@224px to match the 256px input at stage 1, and CLIP@378px to match the 432px input at stage 2.) Or do you interpolate the features, as DINOv2 does?
  2. Do the two stages belong to a single run? In other words, do you cover the 600k steps with a single-cycle cosine-annealing schedule? Would it make sense to run the first 300k steps, save and reload the checkpoint, and then begin a new run with a fresh optimizer/scheduler for the last 300k steps?
  3. According to the code, it seems like you bicubically interpolate the data to match the input resolution. Is it the same for SAM's hi-res inputs? Is there a need to use a selected subset of DataComp-1B (or other datasets) with relatively higher original resolution to avoid poor interpolation quality?
  4. Previous GitHub issues asked about details of the ablation study shown in Table 3. I'm focusing instead on the Table 2 results (the training-dataset ablation). Do these studies use the SAM teacher? Do they use the two-stage multi-resolution setting, like the best-model configuration? As I understand it, these are the settings that could differ between Table 2 row 3 and Table 3 row 4; since their metrics differ, I wonder where that divergence comes from.
  5. A question about the RADIO-amplified paper, Section 4.6 on partitioning. I'm not sure how partitioning impacts the training results. Is it correct to say that, in essence, "in a partition" means the teachers receive the same data within one step, while "in different partitions" means they receive different data and can have different batch sizes? And how does this impact teacher overhead?

I'm an undergraduate and don't have much experience with model training, so some of these questions might sound trivial.
Thanks so much in advance for your answers!

@mranzinger
Collaborator

(Referring to the AM-RADIO paper here.) I didn't quite understand how the student is trained at multiple resolutions, since I assume CLIP works at a fixed resolution and the student must match the shape of the teacher's features. Did you use different CLIP models during the two training stages? (Concretely, CLIP@224px to match the 256px input at stage 1, and CLIP@378px to match the 432px input at stage 2.) Or do you interpolate the features, as DINOv2 does?

We use the same teachers across stages and interpolate the output features so the shapes match. The current algorithm resamples the student and teacher features to the smaller spatial size of the two.
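
For illustration, here's a minimal sketch of that idea (names and tensor layout are placeholders, not the actual training code): both feature maps get resampled to the minimum spatial grid before the matching loss is computed.

```python
import torch
import torch.nn.functional as F

def match_features(student_feats: torch.Tensor, teacher_feats: torch.Tensor):
    """Resample both (B, C, H, W) feature maps to the smaller spatial grid of the two."""
    h = min(student_feats.shape[-2], teacher_feats.shape[-2])
    w = min(student_feats.shape[-1], teacher_feats.shape[-1])
    student_feats = F.interpolate(student_feats, size=(h, w), mode='bilinear', align_corners=False)
    teacher_feats = F.interpolate(teacher_feats, size=(h, w), mode='bilinear', align_corners=False)
    return student_feats, teacher_feats
```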

Do the two stages belong to a single run? In other words, do you cover the 600k steps with a single-cycle cosine-annealing schedule? Would it make sense to run the first 300k steps, save and reload the checkpoint, and then begin a new run with a fresh optimizer/scheduler for the last 300k steps?

A single continuous run totally makes sense, but out of sheer laziness/time constraints, between stages we actually just start over with the initial 1e-3 learning rate and a new schedule. It's totally possible that we're leaving something on the table in that regard.
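
Roughly, the per-stage setup looks like this sketch (function and variable names are placeholders): only the weights carry over between stages, while the optimizer and schedule are rebuilt from scratch.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

STEPS_PER_STAGE = 300_000  # illustrative; each stage gets its own fresh schedule

def run_stage(model, train_one_step, stage_idx):
    # New optimizer and a new single-cycle cosine schedule, starting again from lr=1e-3
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = CosineAnnealingLR(optimizer, T_max=STEPS_PER_STAGE)
    for _ in range(STEPS_PER_STAGE):
        train_one_step(model, optimizer)
        scheduler.step()
    torch.save(model.state_dict(), f'stage_{stage_idx}.pt')  # the next stage reloads only these weights
```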

According to the code, it seems like you bicubically interpolate the data to match the input resolution. Is it the same for SAM's hi-res inputs? Is there a need to use a selected subset of DataComp-1B (or other datasets) with relatively higher original resolution to avoid poor interpolation quality?

We actually use bilinear resampling for images; that's true even for the images we feed SAM. It totally makes sense to hand higher-res images to the hi-res partition during training, because you're right: most of the images in DataComp-1B aren't very hi-res.
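
As a rough illustration of the kind of resize this implies (the actual transform pipeline in the repo may differ; the resolutions here are only examples):

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Each partition resizes to its own target resolution, bilinearly.
lo_res_resize = transforms.Resize((432, 432), interpolation=InterpolationMode.BILINEAR)
hi_res_resize = transforms.Resize((1024, 1024), interpolation=InterpolationMode.BILINEAR)  # e.g. the SAM-style hi-res partition
```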

Previous GitHub issues asked about details of the ablation study shown in Table 3. I'm focusing instead on the Table 2 results (the training-dataset ablation). Do these studies use the SAM teacher? Do they use the two-stage multi-resolution setting, like the best-model configuration?

Tables 2 and 3 do not include SAM. We re-ran the models in both of those settings. The key difference is that Table 3 used [OpenAI CLIP 336px, DINOv2] as the teachers, whereas Table 2 used [MetaCLIP, DINOv2]. This came down to timing: over the course of writing the paper, both MetaCLIP and DFN CLIP were released to the public. Admittedly, it makes the paper a bit harder to follow, but the most important part is that the ablations are internally consistent.

A question about the RADIO-amplified paper, Section 4.6 on partitioning. I'm not sure how partitioning impacts the training results. Is it correct to say that, in essence, "in a partition" means the teachers receive the same data within one step, while "in different partitions" means they receive different data and can have different batch sizes? And how does this impact teacher overhead?

Your understanding is correct. If multiple teachers are sharing a partition, then they're receiving the same images (although the sizes may be different, based on the teacher input resolution). Teachers on different partitions are operating on different GPUs, and are receiving different data. Teacher overhead changes because in the multi-partition scheme, each teacher's effective batch size is reduced per step.

Take this example:

Batch Size: 1024, Teachers: CLIP, DINOv2

1 partition: Both teachers get 1024 images
2 partitions: CLIP gets 512 images, DINOv2 gets a different 512 images

For simplicity, if we assume that both teachers are equally expensive to run inference on, then in the 2-partition scheme, we cut the teacher overhead in half. The student still sees all 1024 images, but the training signal is not coming from all teachers for all images.
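
A toy sketch of that 2-partition case (names are illustrative, not the actual training code): the global batch is split, each teacher runs inference on only its own slice, and the student still forwards everything.

```python
import torch

def teacher_targets(batch: torch.Tensor, teachers: dict, num_partitions: int):
    """Split the global batch along dim 0; each teacher only runs on its own partition."""
    chunks = batch.chunk(num_partitions, dim=0)
    targets = {}
    for i, (name, teacher) in enumerate(teachers.items()):
        part = chunks[i % num_partitions]  # e.g. CLIP -> first 512 images, DINOv2 -> the other 512
        with torch.no_grad():
            targets[name] = teacher(part)  # per-teacher inference cost drops with the partition count
    return targets  # the student still sees the full batch; each image is supervised by one partition's teachers
```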

@DerickJin3316
Author

That helps a lot. Thanks again for your answers!
