Custom dataloader registry support #2932
base: main
Conversation
…try' into ori-2907-custom-dataloader-registry
…module / registry big change
for more information, see https://pre-commit.ci
…un, we will later adjust this file
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2932      +/-   ##
==========================================
- Coverage   80.14%   80.10%   -0.05%
==========================================
  Files         193      194       +1
  Lines       17121    17604     +483
==========================================
+ Hits        13722    14101     +379
- Misses       3399     3503     +104
and fix the test for custom dataloaders
Hi Ori, Code: model = scvi.model.SCVI(adata=None, registry=datamodule.registry, datamodule=datamodule)
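For reference, a minimal sketch of that call pattern as proposed in this PR; `MyDataModule` and its construction are hypothetical placeholders, not scvi-tools classes:

```python
import scvi

# Hypothetical custom LightningDataModule exposing an scvi-style
# `registry` attribute, as this PR proposes (name and args are made up).
datamodule = MyDataModule(batch_size=1024)

# Instantiate SCVI without an AnnData object; the setup information
# comes from the datamodule's registry instead.
model = scvi.model.SCVI(
    adata=None,
    registry=datamodule.registry,
    datamodule=datamodule,
)
model.train()
```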
I fixed it. In the tutorial file scvi-tools/docs/user_guide/use_case/custom_dataloaders.md, replace the line: … with: …
@marianogabitto thanks, I fixed those things. As you also saw, my concern right now is the slowness of the process, in the data loading between batches. Let's focus on this. Feel free to fork the branch and add your commits.
Hi Ori, let me tell you how I did this comparison. On the one hand, I saved the TileDB experiment as an AnnData and ran scvi regularly. On the other hand, I grabbed the dataloader code from scvi (splitting, AnnDataLoader, AnnDataset) and created an scvi-external AnnData loader (just to be sure there was no difference between running the AnnData in regular scvi versus passing it as a new AnnData loader). Two questions: …
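For context, the baseline side of that comparison is just the standard scvi-tools path on the exported AnnData; a sketch, with the file name and batch key as illustrative assumptions:

```python
import anndata
import scvi

# Baseline: the TileDB experiment exported to .h5ad, run through regular
# scvi-tools (path and batch_key are illustrative).
adata = anndata.read_h5ad("tiledb_export.h5ad")
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()
```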
Just checking: is it still faster if you load the AnnData in disk-backed mode? That would indeed be surprising, whereas the other overhead could come from loading from disk. Any chance to use fast SSD storage for the data, as is usually recommended for from-disk loading? The data could also be scattered across SSDs, and TileDB might have suggestions on how to optimize this; imperfect randomness is not really an issue if they first randomize the order on disk (not a single experiment on one SSD).
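A sketch of the disk-backed variant being suggested, under the same illustrative path: `backed="r"` keeps `X` on disk so only the requested mini-batches are read.

```python
import anndata
import scvi

# Disk-backed mode: X stays on disk and is read per mini-batch instead of
# being loaded fully into memory (path and batch_key are illustrative).
adata = anndata.read_h5ad("tiledb_export.h5ad", backed="r")
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()
```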
Speeds are: … Yes, I am testing two things: I am talking with my support team about a partition with fast disk access, and touching base with TileDB about how to optimize this.
@marianogabitto I will try to do some testing on my end.
But you need on_before_batch_transfer to move the batch to PyTorch, no? Isn't the point of that callback to get the X, batch, and labels data and move them to tensors? What happens if they are already tensors?
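For concreteness, a minimal sketch of such a hook, using the generic Lightning `on_before_batch_transfer` signature rather than scvi-tools' actual implementation: converting array-like entries is cheap, and entries that are already tensors can simply pass through.

```python
import numpy as np
import torch
import lightning.pytorch as pl

class TensorizingModule(pl.LightningModule):
    # Lightning calls this hook on each batch right before moving it to
    # the device. Sketch only; scvi-tools' real hook does more than this.
    def on_before_batch_transfer(self, batch, dataloader_idx):
        # Convert array-like values (e.g. NumPy) to tensors, but pass
        # anything that is already a torch.Tensor through unchanged.
        return {
            key: val if torch.is_tensor(val) else torch.as_tensor(np.asarray(val))
            for key, val in batch.items()
        }
```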
One more... how about making a tutorial for the anndata_manager and the registry?
We will deal with everything, but let's take it step by step. on_before_batch_transfer is needed, of course, but let's see how we can improve it.
Your suggestion worked, and I do see much better performance in train time when using … See my updated tutorial on the other branch.
Ori, I got lost ... which branch should I look at for the tutorial? Which branch should I check to test your commits?
These are the tutorials: … The tests are in the current branch.
…try' into ori-2907-custom-dataloader-registry
Ori, this is not working for me. When I invoke in the notebook: …
Ori, all the examples that I list below were run with the ".on_before_batch_transfer()" code removed, the way I posted before.
These results led me to believe that we are not loading data into GPU memory fast enough.
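One way to check that hypothesis: time how long the loop waits on the dataloader versus how long host-to-device transfer takes. A rough sketch, assuming `dataloader` yields dicts of CPU tensors:

```python
import time
import torch

load_time = transfer_time = 0.0
t0 = time.perf_counter()
for batch in dataloader:  # `dataloader` is assumed to be defined elsewhere
    t1 = time.perf_counter()
    load_time += t1 - t0  # time spent waiting on the loader
    batch = {k: v.to("cuda", non_blocking=True) for k, v in batch.items()}
    torch.cuda.synchronize()  # make the async copy measurable
    t0 = time.perf_counter()
    transfer_time += t0 - t1  # time spent on host-to-device transfer
print(f"loading: {load_time:.2f}s  transfer: {transfer_time:.2f}s")
```

If load_time dominates, the bottleneck is the dataloader itself rather than the GPU transfer.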