Add Streaming Dataset Loader Support #55

Open · wants to merge 4 commits into main

Conversation

AdonaiVera
Contributor

Congrats again for this amazing repo! 🎉
This PR adds support for streaming when working with large-scale datasets together with data_cutoff_idx. With streaming enabled, only the required number of samples is loaded on the fly, significantly reducing disk usage. This addresses #54.
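
Roughly, the change works like this (a minimal sketch; the dataset and subset names below are only illustrative, and the actual loading code in this PR may look slightly different):

from datasets import load_dataset

# With streaming=True the dataset is not downloaded to disk up front;
# samples are fetched lazily, and .take() stops after data_cutoff_idx of them.
data_cutoff_idx = 2000  # same knob as in the training config

ds = load_dataset("HuggingFaceM4/the_cauldron", "vqav2",
                  split="train", streaming=True)
limited = ds.take(data_cutoff_idx)  # IterableDataset with only the first N samples

for sample in limited:
    pass  # hand each sample to the usual preprocessing / collator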

@lusxvr
Member

lusxvr commented May 19, 2025

Great point you are raising!

We just pushed a big update to the train.py file to enable DDP training, but I am happy to include your changes if you take care of the conflict!

@AdonaiVera
Contributor Author

Hi @lusxvr,
Sure! I believe the conflict was just in a single import; the DDP training changes shouldn't affect anything else.

Let me know what you think! 😊

@lusxvr
Member

lusxvr commented May 20, 2025

When I try this, I get a bunch of errors in the image processing:

Error processing image at index 54
Error processing image at index 862
Error processing image at index 305
Error processing image at index 836
Error processing image at index 661
Error processing image at index 416
Error processing image at index 683
Error processing image at index 887
Error processing image at index 53
Error processing image at index 707
...

@AdonaiVera
Contributor Author

Hi @lusxvr, thanks a lot for your feedback! 🙌 Quick question: which data_cutoff_idx are you currently using in your config file?

I ran a test using a small cutoff of just 1000 images, and everything trained without any issues on my side. Here’s a quick summary from my run:

Loading from backbone weights
Successfully loaded google/siglip-base-patch16-224 weights from safetensors. Model has 85,797,120 parameters.
Successfully loaded HuggingFaceTB/SmolLM2-135M weights from safetensors. Model has 134,515,008 parameters.
nanoVLM initialized with 222,081,600 parameters
Training summary: 975 samples, 121 batches/epoch, batch size 8
Validation summary: 25 samples, 3 batches/epoch, batch size 8
Step: 0, Loss: 4.6885, Tokens/s: 1664.98, Accuracy: 0.1820
Epoch 1/5, Train Loss: 2.6754 | Time: 45.28s | T/s: 1980.65
Epoch 2/5, Train Loss: 1.3940 | Time: 12.44s | T/s: 7206.65
Step: 250, Loss: 0.3996, Tokens/s: 7911.87
Epoch 3/5, Train Loss: 0.7891 | Time: 12.83s | T/s: 6992.59
Epoch 4/5, Train Loss: 0.4599 | Time: 12.42s | T/s: 7222.44
Step: 500, Loss: 1.1108, Tokens/s: 8470.26, Accuracy: 0.0313
Epoch 5/5, Train Loss: 0.2894 | Time: 44.71s | T/s: 2005.95
Average time per epoch: 25.54s
Average time per sample: 0.0262s
MMStar Accuracy: 0.0307
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                accuracy █▁
wandb:              batch_loss ▇▆▇█▇▇▇▆▅▇▆▆▅▃▅▃▅▄▃▂▂▃▃▁▂▂▂▁▂▂▁▁▁▂▂▁▂▁▁▂
wandb:          epoch_duration █▁▁▁█
wandb:              epoch_loss █▄▂▂▁
wandb: epoch_tokens_per_second ▁███▁
wandb:       tokens_per_second ▃▅▆▃▄▄▅▃▃▂▄▅▃▃▃▆▅▂▅▆▆▁▃▂▂▅▆▆▃▂▄█▃▃▄▆▅▂▆▃
wandb:                val_loss █▁▁
wandb: 
wandb: Run summary:
wandb:                accuracy 0.03133
wandb:          avg_epoch_time 25.53587
wandb:     avg_time_per_sample 0.02619
wandb:              batch_loss 0.06084
wandb:          epoch_duration 44.7099
wandb:              epoch_loss 0.28938
wandb: epoch_tokens_per_second 2005.95402
wandb:              mmstar_acc 0.03067
wandb:       tokens_per_second 6781.2405
wandb:                val_loss 1.76389
wandb: 
wandb: 🚀 View run nanoVLM_1xGPU_1000samples_bs8_ep5_lr0.0001-0.002_0520 at: https://wandb.ai/rock-gaussians/nanoVLM/runs/8j4esq08
wandb: ⭐️ View project at: https://wandb.ai/rock-gaussians/nanoVLM
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20250520_060010-8j4esq08/logs

Would love to compare settings and see if anything on your end might be affecting the training. Let me know!

@lusxvr
Member

lusxvr commented May 20, 2025

I used 2000. Some of the datasets are a bit different, so the error might only show up with later samples.

@AdonaiVera
Contributor Author

Hi @lusxvr, I've pulled the latest changes and ran two tests, one with data_cutoff_idx=2000 and another with 10000. Here are the results. However, I wasn't able to reproduce the Error processing image at index ... errors.

The only difference on my side is the batch size, which I set lower since I don’t have enough RAM to run with 256. All other configurations remain the same.

data_cutoff_idx: int = 2000

Loading from backbone weights
Successfully loaded google/siglip-base-patch16-224 weights from safetensors. Model has 85,797,120 parameters.
Successfully loaded HuggingFaceTB/SmolLM2-135M weights from safetensors. Model has 134,515,008 parameters.
nanoVLM initialized with 222,081,600 parameters
Training summary: 1950 samples, 243 batches/epoch, batch size 8
Validation summary: 50 samples, 6 batches/epoch, batch size 8
Step: 0, Loss: 3.2168, Tokens/s: 1681.70, Accuracy: 0.1793
Epoch 1/5, Train Loss: 2.4353 | Time: 57.38s | T/s: 3144.94
Step: 250, Loss: 1.1637, Tokens/s: 7712.55
Epoch 2/5, Train Loss: 1.2887 | Time: 25.00s | T/s: 7219.57
Step: 500, Loss: 1.5609, Tokens/s: 7813.96, Accuracy: 0.0433
Epoch 3/5, Train Loss: 0.7330 | Time: 56.47s | T/s: 3195.76
Step: 750, Loss: 0.3446, Tokens/s: 7383.20
Epoch 4/5, Train Loss: 0.4207 | Time: 25.08s | T/s: 7193.98
Step: 1000, Loss: 0.2099, Tokens/s: 7936.12, Accuracy: 0.0433
Epoch 5/5, Train Loss: 0.2504 | Time: 56.49s | T/s: 3194.55
Average time per epoch: 44.08s
Average time per sample: 0.0226s
MMStar Accuracy: 0.0393
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                accuracy █▁▁
wandb:              batch_loss ██▅▆▆▆▇▆▇█▅▂▆▆▅▄▅▁▄▃▂▁▃▂▂▂▁▂▁▃▃▂▁▂▁▂▁▁▁▁
wandb:          epoch_duration █▁█▁█
wandb:              epoch_loss █▄▃▂▁
wandb: epoch_tokens_per_second ▁█▁█▁
wandb:       tokens_per_second ▃▄▃▇▃▆▆▃▅▆▂▃▄▂▆▅▅▃▆▆▂▃▄▂▆▆▆▅▂▃▅▆█▅▂▅▆▁▄▄
wandb:                val_loss █▁▁▁▂
wandb: 
wandb: Run summary:
wandb:                accuracy 0.04333
wandb:          avg_epoch_time 44.08311
wandb:     avg_time_per_sample 0.02261
wandb:              batch_loss 0.4263
wandb:          epoch_duration 56.48868
wandb:              epoch_loss 0.25039
wandb: epoch_tokens_per_second 3194.55135
wandb:              mmstar_acc 0.03933
wandb:       tokens_per_second 8642.68277
wandb:                val_loss 1.89304
wandb: 
wandb: 🚀 View run nanoVLM_1xGPU_2000samples_bs8_ep5_lr0.0001-0.002_0520 at: https://wandb.ai/rock-gaussians/nanoVLM/runs/dx8y059q
wandb: ⭐️ View project at: https://wandb.ai/rock-gaussians/nanoVLM
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20250520_070256-dx8y059q/logs

data_cutoff_idx: int = 10000

Loading from backbone weights
Successfully loaded google/siglip-base-patch16-224 weights from safetensors. Model has 85,797,120 parameters.
Successfully loaded HuggingFaceTB/SmolLM2-135M weights from safetensors. Model has 134,515,008 parameters.
nanoVLM initialized with 222,081,600 parameters
Training summary: 9750 samples, 1218 batches/epoch, batch size 8
Validation summary: 250 samples, 31 batches/epoch, batch size 8
Step: 0, Loss: 5.1823, Tokens/s: 1640.25, Accuracy: 0.1793
Step: 250, Loss: 1.3486, Tokens/s: 7899.89
Step: 500, Loss: 1.7672, Tokens/s: 7334.33, Accuracy: 0.0867
Step: 750, Loss: 2.3561, Tokens/s: 6268.56
Step: 1000, Loss: 1.7056, Tokens/s: 7267.37, Accuracy: 0.1127
Epoch 1/5, Train Loss: 2.1899 | Time: 223.09s | T/s: 4047.72
Step: 1250, Loss: 2.6386, Tokens/s: 7340.32
Step: 1500, Loss: 1.3072, Tokens/s: 6689.96, Accuracy: 0.1173
Step: 1750, Loss: 1.6147, Tokens/s: 7705.98
Step: 2000, Loss: 0.8673, Tokens/s: 6562.28, Accuracy: 0.1660
Step: 2250, Loss: 1.5418, Tokens/s: 8323.13
Epoch 2/5, Train Loss: 1.2680 | Time: 190.17s | T/s: 4748.19
Step: 2500, Loss: 2.0146, Tokens/s: 7314.41, Accuracy: 0.1680
Step: 2750, Loss: 1.5919, Tokens/s: 4914.17
Step: 3000, Loss: 0.8382, Tokens/s: 6805.50, Accuracy: 0.1540
Step: 3250, Loss: 0.9466, Tokens/s: 7307.86
Step: 3500, Loss: 0.5415, Tokens/s: 7130.13, Accuracy: 0.1720
Epoch 3/5, Train Loss: 0.7887 | Time: 225.30s | T/s: 4007.87
Step: 3750, Loss: 0.5129, Tokens/s: 6826.28
Step: 4000, Loss: 0.4116, Tokens/s: 7043.20, Accuracy: 0.1553
Step: 4250, Loss: 0.5776, Tokens/s: 7280.92
Step: 4500, Loss: 0.1124, Tokens/s: 6994.71, Accuracy: 0.2080
Step: 4750, Loss: 0.4304, Tokens/s: 7865.47
Epoch 4/5, Train Loss: 0.4883 | Time: 198.32s | T/s: 4553.14
Step: 5000, Loss: 0.5250, Tokens/s: 6867.43, Accuracy: 0.1907
Step: 5250, Loss: 0.2437, Tokens/s: 8741.34
Step: 5500, Loss: 0.0434, Tokens/s: 8540.49, Accuracy: 0.1553
Step: 5750, Loss: 0.1400, Tokens/s: 7515.39
Step: 6000, Loss: 0.4042, Tokens/s: 6413.09, Accuracy: 0.2000
Epoch 5/5, Train Loss: 0.3195 | Time: 225.19s | T/s: 4009.90
Average time per epoch: 212.41s
Average time per sample: 0.0218s
MMStar Accuracy: 0.2033
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                accuracy ▆▁▂▃▆▆▅▆▅█▇▅█
wandb:              batch_loss █▃▆▄▇▅▄▅▃▅█▄▄▅▄▄▄▂▃▅▂▁▂▂▂▂▂▁▂▁▂▃▂▁▂▁▂▂▁▁
wandb:          epoch_duration █▁█▃█
wandb:              epoch_loss █▅▃▂▁
wandb: epoch_tokens_per_second ▁█▁▆▁
wandb:       tokens_per_second ▇▄▄▆▆▄▇▄▆▇▆▆▅▅▆▆█▅▅▄▅▁▇▅▅▄▂▃▁▄▅▃▅█▃▅▆▂▄▅
wandb:                val_loss █▂▁▁▁▁▁▁▁▁▁▂▁▂▂▁▂▂▂▂▂▃▃▃▃
wandb: 
wandb: Run summary:
wandb:                accuracy 0.2
wandb:          avg_epoch_time 212.41489
wandb:     avg_time_per_sample 0.02179
wandb:              batch_loss 0.47361
wandb:          epoch_duration 225.1893
wandb:              epoch_loss 0.3195
wandb: epoch_tokens_per_second 4009.90194
wandb:              mmstar_acc 0.20333
wandb:       tokens_per_second 7619.25362
wandb:                val_loss 2.5901
wandb: 
wandb: 🚀 View run nanoVLM_1xGPU_10000samples_bs8_ep5_lr0.0001-0.002_0520 at: https://wandb.ai/rock-gaussians/nanoVLM/runs/51uwi3gw
wandb: ⭐️ View project at: https://wandb.ai/rock-gaussians/nanoVLM

I also tested it directly on Google Colab, and it runs fine there as well. Here’s the code I used.
https://colab.research.google.com/drive/1Ru82mC2OAEwS8m4jQpEb6R-Y6_cBJgJk?usp=sharing

I haven’t added it to the Colab example in the repo, but if you think it would be helpful, I’d be happy to update it!

Let me know what you think.

@AdonaiVera
Contributor Author

Hey @lusxvr! Just checking in; let me know if there's anything else you'd like me to add or adjust in this PR. I think this feature could really help folks with limited RAM setups, so I'd love to help push it forward.

@lusxvr
Member

lusxvr commented May 22, 2025

Taking a look right now :)

@lusxvr
Member

lusxvr commented May 22, 2025

I don't know if it is just because I am running it for the first time and the cache has to be filled by the new download, but for the moment it seems very slow to me (and I don't know why).

After my previous comment (10 min ago), I pulled the PR and started a run with 15000 samples, and it is still at "Resolving Data Files".

@lusxvr
Member

lusxvr commented May 22, 2025

Hm, sorry to report that I still have the same error. It comes from data/datasets.py:

# Now process the image
if isinstance(image, Image.Image):
    if image.mode != 'RGB':
        image = image.convert('RGB')
    processed_image = self.image_processor(image)
else:
    print(f"Error processing image at index {idx}")
    # Create empty tensor with right dimensions as fallback
    processed_image = torch.zeros(
        3, cfg.VLMConfig.vit_img_size, cfg.VLMConfig.vit_img_size)

I don't know why yet, though.
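
One way to narrow it down might be to log what actually arrives in the non-PIL branch, e.g. (a hedged debugging sketch, not what the repo currently does, and assuming the same surrounding dataset class):

# Hypothetical debugging variant of the same branch: report what was actually
# received, and catch decode errors separately from schema mismatches.
if isinstance(image, Image.Image):
    try:
        if image.mode != 'RGB':
            image = image.convert('RGB')
        processed_image = self.image_processor(image)
    except Exception as e:  # e.g. truncated or corrupt image bytes
        print(f"Error processing image at index {idx}: {type(e).__name__}: {e}")
        processed_image = torch.zeros(
            3, cfg.VLMConfig.vit_img_size, cfg.VLMConfig.vit_img_size)
else:
    print(f"Error processing image at index {idx}: got {type(image)} instead of PIL.Image")
    processed_image = torch.zeros(
        3, cfg.VLMConfig.vit_img_size, cfg.VLMConfig.vit_img_size)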

...
Error processing image at index 189
Error processing image at index 447
Error processing image at index 874
Error processing image at index 458
Error processing image at index 320
Error processing image at index 464
Error processing image at index 450
Error processing image at index 878
Error processing image at index 809
Error processing image at index 182
Epoch 5/5, Train Loss: 2.2457 | Time: 1.90s | T/s: 37505.11
Average time per epoch: 6.17s
Average time per sample: 0.0062s
MMStar Accuracy: 0.0720
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                accuracy ▁
wandb:              batch_loss █▆▅▄▅▄▃▃▂▁▃▂▁▂▂
wandb:          epoch_duration █▁▁▁▁
wandb:              epoch_loss █▅▃▁▁
wandb: epoch_tokens_per_second ▁████
wandb:       tokens_per_second ▁▆█████████████
wandb:                val_loss ▁
wandb: 
wandb: Run summary:
wandb:                accuracy 0.13733
wandb:          avg_epoch_time 6.16707
wandb:     avg_time_per_sample 0.00617
wandb:              batch_loss 2.23033
wandb:          epoch_duration 1.89913
wandb:              epoch_loss 2.24574
wandb: epoch_tokens_per_second 37505.1131
wandb:              mmstar_acc 0.072
wandb:       tokens_per_second 62103.02443
wandb:                val_loss 3.49195
wandb: 
wandb: 🚀 View run nanoVLM_1xGPU_2000samples_bs256_ep5_lr0.0001-0.002_0522 at: https://wandb.ai/huggingface/nanoVLM/runs/a002piw5
wandb: ⭐️ View project at: https://wandb.ai/huggingface/nanoVLM
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20250522_135910-a002piw5/logs

@lusxvr
Member

lusxvr commented May 22, 2025

Unfortunately, it does not seem to work correctly for me. There is still the error with processing the image files, and in addition, even when I relaunch a training run with the same parameters and streaming support, it takes a substantial amount of time to load the data. It seems to me that since we are still taking a sample from every individual dataset until we reach the cutoff index, the streaming does not work correctly yet / does not fall back efficiently to the cache.

I believe that for low-resource scenarios it is viable to just select one or two small datasets individually from the whole cauldron; I tried this in Colab and it works quite well (see the notebook in the repo).
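
As a rough sketch (the subset names are only examples of smaller cauldron configs):

from datasets import load_dataset, concatenate_datasets

# Instead of streaming the whole cauldron, download just one or two small
# subsets and train on those; this keeps disk usage low without streaming.
small_subsets = ["ai2d", "tqa"]  # illustrative choices
parts = [load_dataset("HuggingFaceM4/the_cauldron", name, split="train")
         for name in small_subsets]
train_ds = concatenate_datasets(parts)
print(f"{len(train_ds)} samples across {small_subsets}")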

While I still really like the idea of enabling streaming for the datasets, I unfortunately cannot get it to run correctly on my machine at the moment, so I cannot merge it just yet :(
