Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding an ID blacklist file #111

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

linsalrob
Copy link
Collaborator

I have added a draft blacklist file, but I am not 100% convinced this is the right format or the right location. However, I wanted to ensure that we record the work I am doing in checking these sequences, and this seemed a good starting point!

The blacklist file includes sequences in our database (currently all in the virus_secondary_aa/sequenceDB database). You can extract that database to .fasta:

mmseqs convert2fasta databases/aa/virus_secondary_aa/sequenceDB secondary_aa.faa

and then find these sequences using grep. I recommend;

grep --no-group-separator -A1 G0W2I5

This will list all versions of the sequence with ID G0W2I5 in the file, in fasta format, and then you can paste them into the NCBI blast website and search them all in one go.

I have used YAML format for this blacklist file. I chose YAML because it is simple, well supported, and many of the other config files in hecatomb use YAML format.

The data is called blacklist_ids, and you can read it in python so:

import yaml
with open("blacklist_ids.yaml", 'r') as file:
    data = yaml.safe_load(file)
for id in data['blacklist_ids']:
    print((id)

or in R like so:

library(yaml)
data <- yaml.load_file('blacklist_ids.yaml')
print(data)

Entries in the blacklist_ids have the following attributes:

  1. each entry is identified by the sequence ID
  2. there is an id field that also contains the sequence ID. This is redundant, and maybe we should drop it.
  3. the header field provides the whole header from the mmseqs database for this entry. Note that some IDs map to more than one entry, and only one of them maybe incorrect. Currently, we can't map from result ID to entry, but this information will likely help cleaning up the data later.
  4. The reason field is a note about why this sequence has been blacklisted.

Feel free to nominate other fields that should be added, because I envision that this resource will continue to be used, and so I'd like to future-proof it.

@linsalrob linsalrob added the enhancement New feature or request label Jun 6, 2024
@linsalrob linsalrob self-assigned this Jun 6, 2024
@beardymcjohnface beardymcjohnface marked this pull request as ready for review June 10, 2024 00:23
@beardymcjohnface
Copy link
Collaborator

Ideally in the future this is something we handle with new releases of the database, but this list will be a great sanity check for a high-throughput method for filtering the databases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants