BamQC metrics are computed using a number of methods, including third-party software tools, and output in JSON format. The output also contains metadata, such as the instrument and lane names; and input parameters, such as the target file.
This document:
- Summarizes how metrics are computed
- Describes read filtering options
- Details some special cases of interest
Some metrics have been renamed in the move from the old Perl to new Python implementation.
- The 'DS' column indicates which metrics are affected by downsampling (if any).
- The 'WFM' column indicates which fields are in the subset output by
write_fast_metrics.py
.
Name | Former name | Source | DS | WFM |
---|---|---|---|---|
alignment reference | input parameter | N | Y | |
average read length | samtools stats: RL | N | Y | |
bases per target | bedtools | Y | N | |
barcode | metadata | N | N | |
bases mapped | aligned bases | samtools stats: SN | N | Y |
coverage per target | bedtools | Y | N | |
coverage histogram | bedtools | Y | N | |
deleted bases | samtools stats: ID | N | Y | |
donor | metadata | N | N | |
group id | metadata | N | N | |
hard clip bases | CIGAR | Y | N | |
insert max | input parameter | N | Y | |
insert size average | insert mean | samtools stats: SN | N | Y |
insert size histogram | insert histogram | samtools stats: IS | N | Y |
insert size standard deviation | insert stdev | samtools stats: SN | N | Y |
inserted bases | samtools stats: ID | N | Y | |
instrument | metadata | N | N | |
lane | metadata | N | N | |
library | metadata | N | N | |
library design | metadata | N | N | |
low-quality reads meta | metadata | N | N | |
mapped reads | samtools stats: SN | N | Y | |
mark duplicates | Picard MarkDuplicates | N | N | |
mismatched bases | mismatch bases | samtools stats: SN | N | Y |
non primary reads | samtools stats: SN | N | Y | |
non-primary reads meta | metadata | N | N | |
number of targets | bedtools | N | N | |
package version | package version file | N | Y | |
paired end | number of ends | samtools stats: SN | N | Y |
paired reads | samtools stats: SN | N | Y | |
pairsMappedAbnormallyFar | insert size histogram | Y | Y | |
pairsMappedToDifferentChr | samtools stats: SN | N | Y | |
properly paired reads | samtools stats: SN | N | Y | |
qual cut | input parameter | N | N | |
qual fail reads | samtools view | N | N | |
read 1 aligned by cycle | CIGAR | Y | N | |
read 1 average length | samtools stats: FRL | N | Y | |
read 1 deletion by cycle | CIGAR | Y | N | |
read 1 hard clip by cycle | CIGAR | Y | N | |
read 1 insertion by cycle | CIGAR | Y | N | |
read 1 length histogram | samtools stats: FRL | N | Y | |
read 1 mismatch by cycle | samtools stats: MPC | N | Y | |
read 1 quality by cycle | samtools stats: FFQ | N | Y | |
read 1 quality histogram | samtools stats: FFQ | N | Y | |
read 1 soft clip by cycle | CIGAR | Y | N | |
read 2 aligned by cycle | CIGAR | Y | N | |
read 2 average length | samtools stats: LRL | N | Y | |
read 2 deletion by cycle | CIGAR | Y | N | |
read 2 hard clip by cycle | CIGAR | Y | N | |
read 2 insertion by cycle | CIGAR | Y | N | |
read 2 length histogram | samtools stats: LRL | N | Y | |
read 2 mismatch by cycle | samtools stats: MPC | N | Y | |
read 2 quality by cycle | samtools stats: LFQ | N | Y | |
read 2 quality histogram | samtools stats: LFQ | N | Y | |
read 2 soft clip by cycle | CIGAR | Y | N | |
read ? aligned by cycle | CIGAR | Y | N | |
read ? average length | CIGAR | Y | N | |
read ? deletion by cycle | CIGAR | Y | N | |
read ? hard clip by cycle | CIGAR | Y | N | |
read ? insertion by cycle | CIGAR | Y | N | |
read ? length histogram | CIGAR | Y | N | |
read ? mismatch by cycle | samtools stats: MPC | N | Y | |
read ? quality by cycle | CIGAR | Y | N | |
read ? quality histogram | CIGAR | Y | N | |
read ? soft clip by cycle | CIGAR | Y | N | |
reads mapped and paired | mate unmapped reads | samtools stats: SN | N | Y |
reads on target | bedtools | Y | N | |
RPSP (DEPRECATED) | RNAME/POS fields | Y | N | |
readsMissingMDtags | CIGAR | Y | N | |
run name | metadata | N | N | |
sample | metadata | N | N | |
sample level | input parameter | N | N | |
sample total | samtools view -s | N | N | |
soft clip bases | CIGAR | Y | N | |
target file | input parameter | N | N | |
target sizes | bedtools | Y | N | |
tissue origin | metadata | N | N | |
tissue type | metadata | N | N | |
total bases on target | bedtools | Y | N | |
total input reads meta | metadata | N | N | |
total reads | samtools stats: SN | N | Y | |
total target size | target size | bedtools | N | N |
unmapped reads | samtools stats: SN | N | Y | |
unmapped reads meta | metadata | N | N | |
workflow version | input parameter | N | N |
samtools
andbedtools
are wrapped by the Python packagespysam
andpybedtools
, respectively.- Exception: For coverage metrics,
bedtools
is called in a subprocess to circumvent a bug inpybedtools
.) pysam
is also used for processing CIGAR strings.
pairsMappedAbnormallyFar
is a secondary metric, computed from the insert size histogram, which in turn is derived from samtools stats
. In this instance, 'abnormal' is defined by the insert max
parameter.
The metrics non-primary reads meta
, low-quality reads meta
, total input reads meta
, and unmapped reads meta
are supplied as metadata, to report filtering carried out by upstream workflow tasks. They respectively correspond to non primary reads
, qual fail reads
, total reads
, and unmapped reads
, which are generated by the bam-qc-metrics
package using samtools
.
The latter group of metrics are computed on the BAM file input to bam-qc-metrics
(before downsampling, if any). For example, we might filter out unmapped reads at a previous workflow step and report the total in unmapped reads meta
. The filtered BAM file input to bam-qc-metrics
has no unmapped reads, so the unmapped reads
metric will be zero. Similar considerations apply to the other metrics.
The read counts computed by bam-qc-metrics
will be redundant, and potentially confusing, when BAM QC is fully migrated to a workflow which does filtering upstream. Therefore, they will be removed in a future release.
The metadata fields are supplied as JSON input to the run_bam_qc.py
script.
Metadata for read counts consists of the following fields, discussed in the previous section:
[non-primary reads meta, low-quality reads meta, total input reads meta, unmapped reads meta]
Two further sets of metadata fields are in use:
['barcode', 'instrument', 'lane', 'library', 'run name', 'sample']
for lane-level BAM input['donor', 'group id', 'library design', 'tissue origin', 'tissue type']
for merged BAM input
So the full set of metadata consists of the read count fields, plus the appropriate extra fields for lane-level or merged BAM input.
Metadata fields not appropriate for the given BAM input will appear in JSON output with a null value.
There are two possible filter mechanisms in BamQC: downsampling and quality filtering.
If both are in effect, quality filtering is applied first.
Neither filter is supported by the write_fast_metrics.py
script.
On large BAM files, quality filtering is slow, as it writes a large temporary file containing the reads which pass filters. Omitting quality filtering may result in a substantial speed gain. Downsampling is considerably faster, as the output file written is small.
For example, time taken to process a NovaSeq BAM file (22 GB, 545M reads) was as follows:
- Quality filtering on, downsampling off: 10 hours
- Quality filtering on, downsampling on: 67 minutes
- Quality filtering off, downsampling off: 22 minutes
Downsampling, in which metrics are computed on a randomly selected subset of reads, may be applied for faster and more efficient data processing.
If downsampling is in effect, metrics derived from CIGAR strings and bedtools are evaluated on the downsampled reads. The estimated value of the metric on the full dataset must be scaled accordingly. For example, if 500 million reads have been downsampled to 1 million, and the reported 'hard clip bases' is 250, there will be approximately 125,000 hard clip bases in the entire BAM file. Metrics derived directly from samtools will always be calculated on the full dataset, without downsampling. The DS
column in the summary table records which metrics are affected by downsampling.
bam-qc-metrics
can either accept a separate downsampled file as input, or do its own downsampling on the main input BAM file.
A previously downsampled BAM file may be provided to bam-qc-metrics
using the --downsampled-bam
option to run_bam_qc.py
. The slow QC metrics will be run on the BAM file provided. This is helpful if downsampling has already been carried out for other tools such as Picard MarkDuplicates. It also allows downsampling to an exact number of reads, instead of the approximate method used by bam-qc-metrics
.
Note that there is no cross-checking as to whether the downsampled input is really a subset of the main input.
Alternatively, bam-qc-metrics
can carry out its own downsampling using samtools view -s
. This method is fast and preserves read pairs, but does not allow sampling an exact number of reads, as it uses a probability of sampling rather than a fixed sample size. This is a known issue with samtools. A workaround suggested by the samtools developers will be implemented in the OICR bamQC workflow.
The default is downsampling to approximately 1.1 million reads, if the input BAM contains more than this number; and no downsampling otherwise. The default level was chosen so the downsampled file should contain at least 1 million reads. Downsampling can be deactivated, or a custom rate specified, using command-line options. In JSON output, sample_level
is the target number of reads to downsample, and sample_total
is the actual number of reads returned by samtools.
A fixed random seed is used for downsampling, so results will be consistent on repeated runs of bam-qc-metrics
.
BamQC has an option to exclude reads with alignment quality below a given threshold. The filter is implemented using samtools view -q
.
If quality filtering is in effect, the unmapped reads
metric is computed only on reads which failed the filter -- because by definition, an unmapped read does not have an alignment quality. All other metrics are computed only on reads which passed the filter.
The mismatched bases
field may not be consistent with the mismatch-by-cycle fields for each read.
The former is derived from samtools stats SN
; the latter from samtools stats MPC
with a given alignment reference.
Note also that samtools stats MPC
reports a number of cycles 1 greater than the actual number in the BAM file. For example, a 101-base read will have an entry for cycle 102 with zero mismatches. This value is retained in JSON output for consistency with the samtools results.