Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieve spectra by USI #14

Closed
wants to merge 11 commits into from
Closed

Conversation

douweschulte
Copy link
Contributor

I wanted to handle USIs in the annotator so I needed support to get the spectrum by USI. Let me know if I do something totally off.

I want to add some tiny things (mostly error handling) but the main stuff of the PR is fixed.

@douweschulte
Copy link
Contributor Author

Sorry apparently the history has been messed up a bit in my branch, I will fix that stuff (along with the other edits) tomorrow.

@mobiusklein
Copy link
Owner

Thank you for writing this. I'll take a closer look later today.

Could you please compare your spectrum retrieval code with https://github.com/mobiusklein/mzdata/blob/main/src/io/proxi.rs to see if they're compatible, or we can remove my existing PROXI codec since it's only for parsing, not making network requests. I also tangled with serde_json and number parsing ambiguity.

All the same, please gate this functionality behind the proxi feature since that's the proper name for the API you're using.

@douweschulte
Copy link
Contributor Author

douweschulte commented Oct 28, 2024

Will do! Thanks for the pointers.

@douweschulte
Copy link
Contributor Author

douweschulte commented Oct 29, 2024

I merged the two files, mostly leaning upon your present serializations and deserializations as they seemed more complete than mine. I fixed the parsing errors that arose from most PROXI servers but stuck with one from PRIDE, where the CURIE string has text in the accession number (see below). I can either work on the CURIE handling to allow text in the accession numbers, or try to ignore this in parsing.

{
    "accession": "NCIT:C19067",
    "name": "project title",
    "value": "Sequencing the anti-MUC1 hybridoma antibody 139H2"
}

Additionally some of the PROXI server are more strict in handling the interpretation part (need the charge for the peptide) so I choose to strip the interpretation from the USI before sending it to any server. Do you think there is a better way of doing this? I could add the current USI again when the result is formed so that this behaviour is hidden for end users.

I want to add an async version as well to allow for different access patterns so that will be added still. Oh and I will fix the git history at some point.

@mobiusklein
Copy link
Owner

Additionally some of the PROXI server are more strict in handling the interpretation part (need the charge for the peptide) so I choose to strip the interpretation from the USI before sending it to any server. Do you think there is a better way of doing this? I could add the current USI again when the result is formed so that this behaviour is hidden for end users.

The PROXI docs are incomplete/outdated and leaves much as implementation details. For example , if I remember correctly, the MassIVE endpoint would include the USI in the response while the PRIDE wouldn't. Additionally, different implementations might indicate failure differently. If you've a will to do it, you could send the USI as-is, and if it is rejected try to re-send it again without the interpretation, but what might be best right now is to understand the failure modes rather than trying to hide them from the end-user, and then when those modes are well understood, try to add error handling.

I fixed the parsing errors that arose from most PROXI servers but stuck with one from PRIDE, where the CURIE string has text in the accession number (see below). I can either work on the CURIE handling to allow text in the accession numbers, or try to ignore this in parsing.

I was not planning to handle additional controlled vocabularies beyond the two used for mzML in mzdata, but I see two paths forward here:

  1. Change the accession field in CURIE and Param with an opaque token data structure that we may need to bend over backwards to make const-compatible, causing massive breaking changes everywhere.
  2. Write a run-time or compile-time look-up table/perfect hashing function to map NCIT accession codes that we can tie to a ControlledVocabulary method to encode/decode the accession. A PFH is guaranteed to exist as the keyspace is much, much smaller than the capacity of a u32, but a look up table is easier to invert to render it back to the text version. I know that look up tables can be generated at compile time and are definitely const-compatible.

I'll expand on the drawbacks later today.

@douweschulte
Copy link
Contributor Author

I removed the USI interpretation stripping and tortured some apis a bit to get more known error messages. I do agree with the idea of having as much control as library user as possible, it is very possible still to remove the interpretation field before running the get_spectrum command.

On the NCIT, I think I understand what you mean with the perfect hashing scheme, but I would not trust my own implementation at this point. Do you need this behaviour in more places, or do you think you will need it in the future? Otherwise I suggest to create a CURIEOrString type to contain any nonstandard strings, which would only make this file slightly more complex. Or possibly ignore any CURIES that cannot be parsed, which would remove some metadata, but this one is only used as a general title so might not be very much used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants