I have added a draft blacklist file, but I am not 100% convinced this is the right format or the right location. However, I wanted to ensure that we record the work I am doing in checking these sequences, and this seemed a good starting point!
The blacklist file includes sequences in our database (currently all in the `virus_secondary_aa/sequenceDB` database). You can extract that database to `.fasta` and then find these sequences using grep, which is the approach I recommend.
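The exact extraction and grep commands are not reproduced in this description. As a rough Python equivalent of the lookup step, here is a minimal sketch that assumes the database has already been exported to a FASTA file and that the sequence ID appears somewhere in each header line. The function names and the example headers are hypothetical and only for illustration.

```python
from typing import Iterable, Iterator, Tuple


def find_fasta_records(lines: Iterable[str], seq_id: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for every FASTA record whose header contains seq_id."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it matched.
            if header is not None and seq_id in header:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line.strip())
    # Emit the final record if it matched.
    if header is not None and seq_id in header:
        yield header, "".join(chunks)


def format_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Render records back to FASTA text, one sequence line per record."""
    return "".join(f"{header}\n{sequence}\n" for header, sequence in records)


# Tiny in-memory example; these headers and sequences are made up.
example = """>G0W2I5 hypothetical entry one
MKT
AAL
>P12345 unrelated entry
MSS
>G0W2I5 hypothetical entry two
MQQ
""".splitlines()

matches = list(find_fasta_records(example, "G0W2I5"))
print(format_fasta(matches), end="")

# For a real export, pass an open file handle instead of the example list,
# e.g. open("sequenceDB.fasta") where the file name is a placeholder.
```

Like a plain grep, this matches the ID as a substring of the header, so it returns every entry that shares the ID.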
Grepping for an ID such as `G0W2I5` will list all versions of that sequence in the file, in fasta format, and you can then paste them into the NCBI BLAST website and search them all in one go.

I have used YAML format for this blacklist file. I chose YAML because it is simple, well supported, and many of the other config files in `hecatomb` use YAML format.

The data is called `blacklist_ids`, and you can read it in Python or in R.
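As an illustration of reading the data in Python, here is a minimal sketch using PyYAML's `yaml.safe_load`. It assumes `blacklist_ids` is a mapping keyed by sequence ID, which is a guess about the file layout and not confirmed here. The header and reason values are placeholders, not real entries.

```python
import yaml  # PyYAML

# Assumed layout: a mapping from sequence ID to its attributes.
# The header and reason values below are placeholders for illustration.
example_yaml = """
blacklist_ids:
  G0W2I5:
    id: G0W2I5
    header: "G0W2I5 placeholder header text"
    reason: "placeholder reason"
"""

data = yaml.safe_load(example_yaml)
blacklist = data["blacklist_ids"]

for seq_id, entry in blacklist.items():
    print(seq_id, entry["reason"], sep="\t")

# For the real file, load from an open handle instead:
#     with open(path_to_blacklist) as fh:   # path_to_blacklist is a placeholder
#         data = yaml.safe_load(fh)
```

If the file turns out to store `blacklist_ids` as a list of entries, the loop would iterate over the entries directly instead of calling `.items()`.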
Entries in the `blacklist_ids` have the following attributes:

- The `id` field also contains the sequence ID. This is redundant, and maybe we should drop it.
- The `header` field provides the whole header from the `mmseqs` database for this entry. Note that some IDs map to more than one entry, and only one of them may be incorrect. Currently, we can't map from result ID to entry, but this information will likely help with cleaning up the data later.
- The `reason` field is a note about why this sequence has been blacklisted.

Feel free to nominate other fields that should be added, because I envision that this resource will continue to be used, and so I'd like to future-proof it.