Analyze paired-chain FASTA data with custom sequence identifiers #1893

Cons727 · 2025-01-28T00:40:42Z

Cons727
Jan 28, 2025

Hi,

Thank you for creating this great tool!

I have a FASTA file from a completed study containing heavy and light chain sequences for each cell. The sequence identifiers (sequence header lines) follow the pattern ">numeric-cellid_contig_ig[hk]", where 'numeric-cellid' is a 19-digit number that uniquely identifies a cell. I'd like to use MiXCR to align and clonotype these sequences in a paired-chain fashion. Specifically, I need guidance on how to use sample tags to analyze this data effectively.

To accomplish this, I believe I need to use the --tag-pattern option to extract the cell ID from the sequence identifiers in the FASTA file.

Is this the correct approach for analyzing paired-chain data from a FASTA file with custom identifiers in the sequence header line?

Any guidance or examples for handling this type of data would be greatly appreciated. Thank you!
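
Before choosing a tag pattern, it can help to confirm that every header really follows the layout described above. The following is a minimal Python sketch for that check, not MiXCR tag-pattern syntax. The regular expression encodes the layout as stated in the post (19 digits, "_contig_", then "igh" or "igk"); the function names and the example records are hypothetical.

```python
import re

# Assumed header layout from the post: 19 digits, "_contig_", then igh or igk.
HEADER_RE = re.compile(r"^>(?P<cell>\d{19})_contig_(?P<chain>ig[hk])$")


def parse_header(line):
    """Return (cell_id, chain) for a matching header line, else None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return match.group("cell"), match.group("chain")


def unmatched_headers(fasta_lines):
    """Collect header lines that do not follow the assumed layout."""
    return [
        line.strip()
        for line in fasta_lines
        if line.startswith(">") and parse_header(line) is None
    ]


# Made-up records for illustration only.
example = [
    ">8828478742837462934_contig_igh",
    "CAGGTGCAGCTGGTG",
    ">8828478742837462934_contig_igk",
    "GACATCCAGATGACC",
    ">8828478742837462934-ID_contig_igh",
    "CAGGTGCAGCTGGTG",
]

print(parse_header(example[0]))
print(unmatched_headers(example))
```

Any header returned by `unmatched_headers` deviates from the stated layout, so a pattern written for that layout would probably need adjusting for it.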

mizraelson · 2025-01-28T02:49:40Z

mizraelson
Jan 28, 2025
Collaborator

Hi, do the sequences you have cover the full VDJRegion of the receptor?

Cons727 Jan 28, 2025
Author

Thank you for the quick response!
Yes, the sequences have the VDJ region but not the constant region.

Cons727 · 2025-01-30T22:24:54Z

Cons727
Jan 30, 2025
Author

Hi, I've been trying different patterns to analyze the data by cell. According to the documentation, I should be able to tag the samples directly from the sequencing read headers using regex. I also made a sample_table.tsv that looks like the table below and ran the following command.

Sample    TagPattern    CELL
Sample1                  0000186184388780027-ID
Sample1                  0000849899570359377-ID

mixcr analyze rna-seq \
--species hs \
--threads 32 \
--sample-table sample_table.tsv \
--tag-pattern "^<([^_]+).*" \
input.fasta \
output

However, it returned this error:

picocli.CommandLine$ExecutionException: Error while running command align java.lang.IllegalArgumentException: Unknown symbol ">"
	at com.milaboratory.mixcr.cli.Main.registerExceptionHandlers$lambda-12(SourceFile:395)
	at picocli.CommandLine.execute(CommandLine.java:2088)
	at com.milaboratory.mixcr.cli.CommandAnalyze$Cmd$PlanBuilder.executeSteps(SourceFile:463)
	at com.milaboratory.mixcr.cli.CommandAnalyze$Cmd.run0(SourceFile:368)
	at com.milaboratory.mixcr.cli.MiXCRCommand.run(SourceFile:37)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
	at com.milaboratory.mixcr.cli.Main.registerLogger$lambda-27(SourceFile:514)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at com.milaboratory.mixcr.cli.Main.main(SourceFile:101)
Caused by: java.lang.IllegalArgumentException: Unknown symbol ">"
	at com.milaboratory.o.bJ.dataFromChars(SourceFile:90)
	at com.milaboratory.o.bJ.<init>(SourceFile:31)
	at com.milaboratory.core.sequence.NucleotideSequence.<init>(SourceFile:135)
	at com.milaboratory.mixcr.cli.TagTransformationSteps$MapTags.toNucleotideMap(SourceFile:241)
	at com.milaboratory.mixcr.cli.TagTransformationSteps$MapTags.createTransformer(SourceFile:346)
	at com.milaboratory.mixcr.cli.CommandAlignPipeline.getTagsExtractor(SourceFile:169)
	at com.milaboratory.mixcr.cli.CommandAlign$Cmd.run1(SourceFile:1055)
	at com.milaboratory.mixcr.cli.MiXCRCommandWithOutputs.run0(SourceFile:69)
	at com.milaboratory.mixcr.cli.MiXCRCommand.run(SourceFile:37)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
	at com.milaboratory.mixcr.cli.Main.registerLogger$lambda-27(SourceFile:514)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	... 13 more

How can I use the regex to extract the cell ID from the FASTA sequence header line?

Thanks for helping me!

mizraelson Jan 30, 2025
Collaborator

Hi, sorry for the delay. I’ve created a custom preset for you. Unzip the attached YAML file and place it in the ~/.mixcr/presets/ folder, or simply keep it in the directory where you run MiXCR.

To run the preset use:

mixcr analyze local:fasta-single-cell-preset \
input.fasta \
result

There’s no need to specify additional parameters, as the pattern is already included.

By default, the species is set to Human, but you can change it using the --species parameter if needed.

fasta-single-cell-preset.yaml.zip

Answer selected by Cons727

mizraelson Jan 30, 2025
Collaborator

You can use a table like the one in your example if you have multiple samples, but keep in mind that the CELL barcode must exactly match what you have in the header.
Based on your description, I assumed the headers follow this format:
>8828478742837462934_contig_igh

However, if your actual header is: >8828478742837462934-ID_contig_igh you will need to slightly adjust the pattern in the YAML file accordingly.
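
Since the CELL values in the table must exactly match what appears in the headers, a quick literal comparison can surface a mismatch such as a stray "-ID" suffix. The sketch below is only an illustration in Python and does not reproduce how MiXCR itself matches tags. It assumes the cell ID is everything between ">" and the first underscore, and that the table is tab-separated with a CELL column as in the example earlier in the thread. The function names and input data are hypothetical.

```python
import csv
import io


def header_cell_ids(fasta_lines):
    """Take the text between '>' and the first underscore of each header."""
    return {
        line[1:].split("_", 1)[0].strip()
        for line in fasta_lines
        if line.startswith(">")
    }


def table_cell_ids(tsv_text, column="CELL"):
    """Read the CELL column of a tab-separated sample table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column].strip() for row in reader}


# Made-up inputs: the table carries an "-ID" suffix, the headers do not.
fasta_lines = [
    ">0000186184388780027_contig_igh",
    "CAGGTGCAGCTGGTG",
    ">0000849899570359377_contig_igk",
    "GACATCCAGATGACC",
]
sample_table = (
    "Sample\tTagPattern\tCELL\n"
    "Sample1\t\t0000186184388780027-ID\n"
    "Sample1\t\t0000849899570359377-ID\n"
)

in_headers = header_cell_ids(fasta_lines)
in_table = table_cell_ids(sample_table)
print("only in table:", sorted(in_table - in_headers))
print("only in headers:", sorted(in_headers - in_table))
```

If both printed lists are empty, the table and the headers agree under this assumption; otherwise either the table or the pattern needs to change so the two line up.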

Cons727 Jan 30, 2025
Author

Thanks a lot for your help! I'm going to run it right now!

Cons727 Feb 3, 2025
Author

Thanks again for sharing the preset!

Since my goal is to clonotype the cells using both heavy and light chains, I ran groupClones and exportCloneGroups, and exported both results with exportAirr after running analyze. I've checked the results from grouped.airr.tsv and it seems that the clonotyping was done independently by each chain. For example, the IGH of cell_A is grouped in clone.0 and the IGK from the same cell is grouped in clone.10. Am I running the correct functions?
Additionally, I was expecting to have a clonotype of around 10 cells, but the biggest clone has only 3 cells.

Thanks a lot for your support and guidance!

mizraelson Feb 4, 2025
Collaborator

Can you share the output you obtained?
“Clone ID” defines a single chain rather than a pair of chains. Therefore, IGH and IGK/L will always have different clone IDs, even if they originate from the same cell—they will share the same CELL tag instead. To obtain chain pairs, you need to sort the table by the CELL tag.
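
The pairing step described above, collecting rows that share a CELL tag, can be sketched in Python as follows. This is an illustration under assumptions: the column names `cell`, `chain` and `clone_id` are hypothetical placeholders, and the actual column names in a MiXCR or AIRR export may differ. The table contents are made up.

```python
import csv
import io
from collections import defaultdict


def chains_by_cell(tsv_text, cell_col="cell", chain_col="chain", clone_col="clone_id"):
    """Map each cell tag to a {chain: clone id} dictionary.

    If a cell has several rows for the same chain, the last row wins.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    cells = defaultdict(dict)
    for row in reader:
        cells[row[cell_col]][row[chain_col]] = row[clone_col]
    return dict(cells)


# Made-up table; real export column names may differ.
table = (
    "clone_id\tchain\tcell\n"
    "0\tIGH\tcell_A\n"
    "10\tIGK\tcell_A\n"
    "0\tIGH\tcell_B\n"
    "10\tIGK\tcell_B\n"
    "3\tIGH\tcell_C\n"
)

pairs = chains_by_cell(table)
for cell, chains in sorted(pairs.items()):
    print(cell, chains.get("IGH"), chains.get("IGK"))
```

In this made-up table, cell_A and cell_B share both the heavy chain clone and the light chain clone, while cell_C has only a heavy chain row, so its light chain entry is missing.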
