Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix metaphlan profiles with duplicate entries by aggregating them #153

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from
2 changes: 1 addition & 1 deletion .editorconfig
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ max_line_length = 88
[*.{json,yml,yaml}]
indent_size = 2

[*.{md,rst}]
[*.{md,rst,txt}]
trim_trailing_whitespace = false

[Makefile]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -66,11 +66,13 @@ def transform(
)
.assign(
**{
# Convert full lineages into leaf node taxon identifiers.
StandardProfile.taxonomy_id: lambda df: df[
StandardProfile.taxonomy_id
]
.str.rsplit("|", n=1)
.str[-1],
# Multiply relative abundances by a large factor to yield integers.
StandardProfile.count: lambda df: df[StandardProfile.count]
* cls.LARGE_INTEGER,
},
Expand All @@ -83,31 +85,35 @@ def transform(
},
)
)
# MetaPhlAn sometimes assigns a taxon by name only, but no identifier exists.
# We have no choice but to count those entries as unclassified at the moment.
# We drop unclassified entries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of feel we should warn for this if this happens (even if it's included in teh higher classified taxon), I'm not really happy with dropping 'valid' contents from the output of the tool when we just say we are essentially changing teh format...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum in docs, but nice to have would be a warn on CLI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a warning emitted if you look a few lines below.

# Their relative abundance is already included in the next higher, classified
# taxon.
unclassified = result[StandardProfile.taxonomy_id] == ""
if (num := int(unclassified.sum())) > 0:
logger.warning("Dropped %d entries from the MetaPhlAn profile.", num)
result = result.loc[~unclassified, :].copy()

# Convert identifiers to integers.
result[StandardProfile.taxonomy_id] = pd.to_numeric(
result[StandardProfile.taxonomy_id],
errors="coerce",
).astype("Int64")
unclassified_mask = result[StandardProfile.taxonomy_id].isna() | (
result[StandardProfile.taxonomy_id] == -1
)
num = int(unclassified_mask.sum())
if num > 0:
).astype(int)
# We assign identifier zero instead of minus one to all unknown entries.
result.loc[
(result[StandardProfile.taxonomy_id] == -1),
StandardProfile.taxonomy_id,
] = 0
# We identify duplicates. Mostly duplicate entries will belong to the
# unclassified category, but sometimes we have other duplicates, too.
grouped_df = result.groupby(StandardProfile.taxonomy_id, sort=False)
entries = grouped_df.size()
for tax_id, num in entries[entries > 1].items():
logger.warning(
"Combining %d entries with unclassified taxa in the profile.",
"Combining %d entries for %s in the profile.",
num,
"unknown taxa" if tax_id == 0 else f"NCBI:txid{tax_id}",
)
return pd.concat(
[
result.loc[~unclassified_mask, :],
pd.DataFrame(
{
StandardProfile.taxonomy_id: [0],
StandardProfile.count: [
result.loc[unclassified_mask, StandardProfile.count].sum(),
],
},
dtype=int,
),
],
ignore_index=True,
)
# We aggregate the counts of all entries. We do so because of duplicate entries.
return grouped_df.sum().reset_index()
Loading