Add Streaming Dataset Loader Support #55
Conversation
Great point you are raising! We just pushed a big update to the
Hi @lusxvr, let me know what you think! 😊
When I try this, I get a bunch of errors in the image processing:
Hi @lusxvr, thanks a lot for your feedback! 🙌 Quick question: what data_cutoff_idx are you currently using in your config file? I ran a test using a small cutoff of just 1000 images, and everything trained without any issues on my side. Here’s a quick summary from my run:
Would love to compare settings and see if anything on your end might be affecting the training. Let me know!
I used 2000. Some of the datasets are a bit different, so the error might only show up with later samples.
Hi @lusxvr, I’ve pulled the latest changes and ran two tests, one with data_cutoff_idx=2000 and another with 10000. Here are the results. However, I wasn’t able to reproduce the Error processing image at index .... The only difference on my side is the batch size, which I set lower since I don’t have enough RAM to run with 256. All other configurations remain the same.

data_cutoff_idx: int = 2000
data_cutoff_idx: int = 10000
I also tested it directly on Google Colab, and it runs fine there as well. Here’s the code I used. I haven’t added it to the Colab example in the repo, but if you think it would be helpful, I’d be happy to update it! Let me know what you think.
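(The notebook code itself didn’t survive the page scrape above. A minimal smoke test along these lines would cover the same ground; this is a hedged sketch, not the author’s actual notebook, and the subset name "ai2d" and the printed fields are assumptions about the Cauldron schema.)

```python
# Hedged sketch of a Colab-style smoke test (not the actual notebook
# code): stream a small slice of one Cauldron subset and check that
# samples decode without touching the full download.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceM4/the_cauldron", "ai2d",
    split="train", streaming=True,
)

# .take(n) lazily limits an IterableDataset to its first n samples.
for i, sample in enumerate(ds.take(5)):
    # Assumed schema: each row carries an "images" list and "texts" turns.
    print(i, len(sample["images"]), sample["texts"][0]["user"][:60])
```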
Hey @lusxvr! Just checking in: let me know if there's anything else you'd like me to add or adjust in this PR. I think this feature could really help folks with limited RAM setups, so I’d love to help push it forward.
Taking a look right now :)
I don't know if it's just because I am running it for the first time and it has to refill the cache after the new download, but for the moment it seems very slow to me (though I don't know why). After my comment (10 min ago), I pulled the PR and started a run with 15000 samples, and it is still at "Resolving Data Files".
Hm, sorry to report that I still have the same error. It comes from
I don't know why at the moment, though.
Unfortunately, it does not seem to work correctly for me. We still have the error with processing the image files, and in addition, even when I relaunch a training run with the same parameters and streaming support enabled, it takes a substantial amount of time to load the data. It seems to me that since we still take a sample from every individual dataset until we reach the cutoff index, the streaming does not work correctly yet / does not fall back efficiently to the cache.

I believe that for low-resource scenarios it is viable to just select one or two small datasets individually from the whole Cauldron; I tried this in Colab and it works quite well (see the notebook in the repo, and the sketch below). While I still really like the idea of enabling streaming for the datasets, I unfortunately cannot reproduce it correctly on my machine at the moment, so I cannot merge it just yet :(
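(For reference, the low-resource alternative suggested above could look like the following sketch. The subset names "ai2d" and "tqa" are illustrative picks, not prescribed by the repo.)

```python
# Sketch of the suggested fallback: download only one or two small
# Cauldron subsets instead of sampling across all of them, so the
# standard (non-streaming) cache path stays fast and predictable.
from datasets import load_dataset, concatenate_datasets

small_subsets = ["ai2d", "tqa"]  # illustrative subset names
parts = [
    load_dataset("HuggingFaceM4/the_cauldron", name, split="train")
    for name in small_subsets
]
train_ds = concatenate_datasets(parts)
print(train_ds)  # a regular map-style Dataset, cached on disk once
```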
Congrats again on this amazing repo! 🎉
This PR adds support for streaming datasets when using large-scale datasets with data_cutoff_idx. By enabling streaming, only the required samples are loaded on the fly, significantly reducing disk usage. This addresses #54.
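(A minimal sketch of the idea, not the PR’s exact code: with streaming=True, load_dataset returns an IterableDataset that yields samples on demand, so nothing beyond the cutoff is ever fetched. The dataset/subset names and the cutoff value below are assumptions for illustration.)

```python
# Illustrative sketch of streaming with a sample cutoff (names and
# values assumed, not taken from the PR).
from datasets import load_dataset

data_cutoff_idx = 1000  # assumed config value

ds = load_dataset(
    "HuggingFaceM4/the_cauldron", "ai2d",
    split="train", streaming=True,
)

# Only the first data_cutoff_idx samples are downloaded; disk usage
# stays minimal compared to materializing the full dataset.
for sample in ds.take(data_cutoff_idx):
    ...  # hand off to the usual preprocessing / training loop
```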