Adding an ID blacklist file #111

linsalrob · 2024-06-06T07:30:04Z

I have added a draft blacklist file, but I am not 100% convinced this is the right format or the right location. However, I wanted to ensure that we record the work I am doing in checking these sequences, and this seemed a good starting point!

The blacklist file includes sequences in our database (currently all in the virus_secondary_aa/sequenceDB database). You can extract that database to .fasta:

mmseqs convert2fasta databases/aa/virus_secondary_aa/sequenceDB secondary_aa.faa

and then find these sequences using grep. I recommend;

grep --no-group-separator -A1 G0W2I5

This will list all versions of the sequence with ID G0W2I5 in the file, in fasta format, and then you can paste them into the NCBI blast website and search them all in one go.

I have used YAML format for this blacklist file. I chose YAML because it is simple, well supported, and many of the other config files in hecatomb use YAML format.

The data is called blacklist_ids, and you can read it in python so:

import yaml
with open("blacklist_ids.yaml", 'r') as file:
    data = yaml.safe_load(file)
for id in data['blacklist_ids']:
    print((id)

or in R like so:

library(yaml)
data <- yaml.load_file('blacklist_ids.yaml')
print(data)

Entries in the blacklist_ids have the following attributes:

each entry is identified by the sequence ID
there is an id field that also contains the sequence ID. This is redundant, and maybe we should drop it.
the header field provides the whole header from the mmseqs database for this entry. Note that some IDs map to more than one entry, and only one of them maybe incorrect. Currently, we can't map from result ID to entry, but this information will likely help cleaning up the data later.
The reason field is a note about why this sequence has been blacklisted.

Feel free to nominate other fields that should be added, because I envision that this resource will continue to be used, and so I'd like to future-proof it.

beardymcjohnface · 2024-06-10T00:27:44Z

Ideally in the future this is something we handle with new releases of the database, but this list will be a great sanity check for a high-throughput method for filtering the databases.

initially adding a blacklist file

6966a26

linsalrob added the enhancement New feature or request label Jun 6, 2024

linsalrob requested review from shandley and beardymcjohnface June 6, 2024 07:30

linsalrob self-assigned this Jun 6, 2024

beardymcjohnface marked this pull request as ready for review June 10, 2024 00:23

beardymcjohnface approved these changes Jun 10, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding an ID blacklist file #111

Adding an ID blacklist file #111

linsalrob commented Jun 6, 2024

beardymcjohnface commented Jun 10, 2024

Adding an ID blacklist file #111

Are you sure you want to change the base?

Adding an ID blacklist file #111

Conversation

linsalrob commented Jun 6, 2024

beardymcjohnface commented Jun 10, 2024