Function parameter description #2

Open
HarryZhang1224 opened this issue Feb 21, 2025 · 3 comments

Comments

@HarryZhang1224

Thanks for the amazing tool! It would be great if you could update the function files with a description of each parameter. For example, it is not clear what the parameter mik_graph in sn.pp.prepare_data_batch means or how users should choose a value.

@ForwardYang98
Collaborator

Thanks for your interest in our work.
Unfortunately, I have been busy with my Ph.D. thesis lately, so I expect to update the function files a little later.
For your question:
The parameter 'mik_graph' determines the number of nearest neighbors identified for each sample in the multi-view mutual information maximization (MMIM) module. Subsequently, the MMIM module will boost the similarity of the multi-view joint representations of each sample and its nearest neighbors to guide the model to ultimately generate more useful and discriminative joint representations. A detailed description of this can be found in the “Methods” section of our article. In general, we don't need to change the default value of the parameter 'mik_graph'.
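
As a rough illustration of what "the number of nearest neighbors identified for each sample" means, here is a toy sketch in plain Python. It is not scNiche's implementation: the function name `k_nearest_neighbors`, the Euclidean distance, and the toy points are all assumptions made for this example.

```python
import math

def k_nearest_neighbors(points, k):
    """Return, for each point, the indices of its k nearest other points (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, excluding itself.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Toy 2-D "representations": two well-separated groups of three samples each.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nn = k_nearest_neighbors(toy, k=2)
```

With `k=2`, each sample's neighbors come from its own group. A larger `k` would start pulling in samples from the other group, which is the kind of trade-off a neighbor-count parameter controls.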

@HarryZhang1224
Author

Thank you! Are there any recommendations for setting the parameters for Xenium data, or for Xenium data across multiple samples (much higher cell numbers than the examples given in the paper, gene panel ~500)?

@ForwardYang98
Collaborator

ForwardYang98 commented Feb 23, 2025

Hi Zhang,
While we have not tested scNiche on Xenium data, our scalability analysis on the mouse whole brain MERFISH dataset (129 slices, about 3.7 million cells) shows that scNiche can effectively scale to large datasets containing multiple samples (see the "Scalability analysis of scNiche to large datasets" section of our article). The following are the parameter settings we used on this dataset for your reference:
k_cutoff = 30, epochs = 25, lr = 0.01, batch_num = 500.
The running time of scNiche with these settings was about 3 h.
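
For reference, the settings above can be collected in one place, and the implied batch size worked out from the quoted cell count. This is only a sketch: the dictionary name `merfish_settings` is made up, and which scNiche function each value is passed to is not stated in this thread.

```python
# Settings quoted above for the mouse whole-brain MERFISH dataset.
merfish_settings = {"k_cutoff": 30, "epochs": 25, "lr": 0.01, "batch_num": 500}

n_cells = 3_700_000  # "about 3.7 million cells", so this is approximate
cells_per_batch = n_cells / merfish_settings["batch_num"]
print(f"{cells_per_batch:.0f} cells per batch")  # prints "7400 cells per batch"
```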

Additionally, based on my personal experience, the following three points may be worth noting in practice:

  1. Batch number setting. Usually ~5k cells/batch balances accuracy and computational efficiency.
  2. Epoch number setting. For large datasets (e.g., over 1 million cells), starting with a smaller number of epochs is recommended. You can also evaluate the convergence of the model by visualizing the training information stored in adata.uns['loss'].
  3. Dimensionality reduction and batch effect removal. Because the number of genes measured in Xenium data usually far exceeds the number of cell types present, dimensionality reduction (scVI, scArches, or PCA) can help balance the dimensionality of features across different views, allowing for more accurate niche identification.
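
The first two points can be sketched as small helpers. Both are hypothetical and not part of scNiche: `suggest_batch_num` just applies the ~5k cells/batch rule of thumb, and `has_plateaued` assumes that adata.uns['loss'] can be read as a sequence of per-epoch loss values, which this thread does not confirm. The `window` and `tol` defaults are arbitrary.

```python
import math

def suggest_batch_num(n_cells, cells_per_batch=5000):
    """Number of batches so that each holds roughly `cells_per_batch` cells."""
    return max(1, math.ceil(n_cells / cells_per_batch))

def has_plateaued(losses, window=5, tol=0.01):
    """True if the mean loss over the last `window` epochs differs from the
    mean over the preceding `window` epochs by at most `tol` (relative)."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return abs(prev - last) <= tol * abs(prev)

# Example: a dataset of 1 million cells at ~5k cells per batch.
batch_num = suggest_batch_num(1_000_000)
```

If the stored losses are indeed per-epoch values, something like `has_plateaued(list(adata.uns['loss']))` after a short initial run could help decide whether more epochs are worth the time.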

Overall, it may take some time to find the optimal parameter configuration. If you have any results to share, I would be most interested in seeing them!
