Data Quality Assessment

Elaborate example FastQC reports or adopt from InsideDNA.com

Purpose of data quality control

* Assess the quality of DNA sequencing data.
* Filter out low-quality sequences to achieve better results in the downstream analysis ("garbage in, garbage out").

Because no sequencing technology is perfect, raw reads inevitably contain sequencing errors. Common problems that can affect downstream analysis:

* Low-confidence base calls (typically towards the ends of reads)
* Presence of adapter sequence in reads
* Overabundant sequence duplicates
* Library contamination

Read Quality Assessment

As you now know, FASTQ is a next-generation-sequencing-specific format that stores raw sequence reads together with the sequencer's estimate of how reliably each base was called (the read quality score). Because these scores are available from the start, the very first step of the analysis is quality control and filtering, which aims to remove low-quality reads. Several tools evaluate sequencing quality from the FASTQ file itself; a commonly used one is FastQC.
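To make the quality score concrete, the sketch below decodes a single quality character into its Phred score Q and the matching error probability, using the standard relation P = 10^(-Q/10). It assumes the Phred+33 (Sanger / Illumina 1.8+) ASCII offset, which older Illumina files may not use. The function names `phred_score` and `error_prob` are made up for this illustration.

```shell
#!/bin/bash
# Assumption: qualities are encoded as Phred+33 (Sanger / Illumina 1.8+).

phred_score() {
    # printf with a leading quote prints the numeric code of the character
    code=$(printf '%d' "'$1")
    echo $(( code - 33 ))
}

error_prob() {
    # Probability that the base call is wrong: P = 10^(-Q/10)
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_score 'I'   # prints 40
error_prob 20     # prints 0.0100
```

A score of 20 therefore corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.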

What is FastQC?

FastQC is an application that reads raw sequence data from high-throughput sequencers and runs a set of quality checks, producing a report that lets you quickly assess the overall quality of your run and spot potential problems or biases. The software can be run in one of two modes: as a stand-alone interactive (desktop) application for the immediate analysis of a small number of FASTQ files, or non-interactively from the Linux command line, which makes it suitable for integration into a larger analysis pipeline that systematically processes large numbers of files. In either case, the result is a set of files that includes an HTML report of the results.

Running FastQC on MGHPCC

First, load the module file for FastQC:

```
$ module load fastqc/0.10.1
```

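FastQC has many parameters, but in a simple case we only need to provide the list of FASTQ files on which we want to run quality control. The -t option sets how many files are processed in parallel, and FastQC's own -o option (path to an output directory) is optional. Note that in the command below the -o comes before fastqc and therefore belongs to bsub: it names the file that receives the job's standard output. In our example we should type the following:

```
$ bsub -q short -W 1:00 -n 8 -R "rusage[mem=2048]" -J "fastQC_job" -o fastQC_report fastqc -t 8 *.fq
```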

Or, we could write a shell script that would run the program for us. (Due to time constraints, you will not be running the script, but we will spend some time looking through these commands to understand them.)

1. Change directory to where the raw sequence data are located.
```
cd /home/username/rnaseq_workshop/data
```
2. Open a new text file called fastqc_job.sh using the nano editor.
```
nano fastqc_job.sh
```
3. Start writing into the shell script as below:
```
#!/bin/bash

# LSF directives: queue, wall-time limit, cores, memory, job name, log files
#BSUB -q short
#BSUB -W 1:00
#BSUB -n 8
#BSUB -R "rusage[mem=2048]"
#BSUB -J "fastQC_job"
#BSUB -o myfastqc.out
#BSUB -e myfastqc.err

# Load FastQC and run it on every FASTQ file in the current directory
module load fastqc/0.10.1
fastqc -t 8 *.fq
```
4. Submit your fastqc_job.sh
```
bsub < fastqc_job.sh
```


When the task is completed, the FastQC reports will be in the output location. Because fastqc was run without its -o option, that is the directory containing the input FASTQ files.
The run above will generate 12 self-contained directories:

```
GSM794483_C1_R1_1.fq_fastqc
GSM794483_C1_R1_2.fq_fastqc
GSM794484_C1_R2_1.fq_fastqc
GSM794484_C1_R2_2.fq_fastqc
GSM794485_C1_R3_1.fq_fastqc
GSM794485_C1_R3_2.fq_fastqc
GSM794486_C2_R1_1.fq_fastqc
GSM794486_C2_R1_2.fq_fastqc
GSM794487_C2_R2_1.fq_fastqc
GSM794487_C2_R2_2.fq_fastqc
GSM794488_C2_R3_1.fq_fastqc
GSM794488_C2_R4_1.fq_fastqc
```

Each of these folders contains an HTML-formatted report that can be loaded into a browser, as well as a file named "summary.txt".
The graphic reports (.html files) contain quality plots that provide important insight into read quality and can reveal library preparation problems or low sequence quality.

If we change into one of the directories and print the contents of the file "summary.txt", we can see which tests passed and which failed:
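When fastqc is run without its own `-o` option, the report folders end up next to the input FASTQ files. If you would rather collect them in a dedicated directory, something along these lines should work. The directory name `fastqc_reports` is a placeholder, and the guard around the call is only there so that the snippet does nothing on a machine without FastQC or without `.fq` files in the current directory.

```shell
#!/bin/bash
# Hypothetical output directory name; pick whatever suits your project.
outdir="fastqc_reports"

# FastQC expects the output directory to exist before it starts.
mkdir -p "$outdir"

# Collect the FASTQ files in the current directory as positional parameters.
set -- *.fq

if command -v fastqc >/dev/null 2>&1 && [ -e "$1" ]; then
    # -t: number of files processed in parallel, -o: where reports are written
    fastqc -t 8 -o "$outdir" "$@"
else
    echo "fastqc or input .fq files not found; nothing was run" >&2
fi
```

On the cluster, the `mkdir` and `fastqc` lines would go inside the job script rather than being run on the login node.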

```
$ cd GSM794483_C1_R1_1.fq_fastqc
$ cat summary.txt
PASS    Basic Statistics                GSM794483_C1_R1_1.fq
FAIL    Per base sequence quality       GSM794483_C1_R1_1.fq
FAIL    Per sequence quality scores     GSM794483_C1_R1_1.fq
PASS    Per base sequence content       GSM794483_C1_R1_1.fq
PASS    Per base GC content             GSM794483_C1_R1_1.fq
FAIL    Per sequence GC content         GSM794483_C1_R1_1.fq
PASS    Per base N content              GSM794483_C1_R1_1.fq
PASS    Sequence Length Distribution    GSM794483_C1_R1_1.fq
FAIL    Sequence Duplication Levels     GSM794483_C1_R1_1.fq
PASS    Overrepresented sequences       GSM794483_C1_R1_1.fq
PASS    Kmer Content                    GSM794483_C1_R1_1.fq
```
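With a dozen report folders, opening every summary.txt by hand gets tedious. Since summary.txt is a tab-separated file (status, test name, file name), a short awk call can pull out the failures across all folders at once. The sketch below builds a tiny made-up sample (`sample_fastqc`, `sample.fq`) so that it runs on its own; the helper names `list_failures` and `count_status` are illustrative, not part of FastQC.

```shell
#!/bin/bash
# Build a small stand-in for a FastQC output folder (hypothetical names).
workdir=$(mktemp -d)
mkdir -p "$workdir/sample_fastqc"
printf 'PASS\tBasic Statistics\tsample.fq\nFAIL\tPer base sequence quality\tsample.fq\nPASS\tPer base N content\tsample.fq\n' \
    > "$workdir/sample_fastqc/summary.txt"

list_failures() {
    # $1: directory that holds the *_fastqc folders
    awk -F'\t' '$1 == "FAIL" { print $3 ": " $2 }' "$1"/*_fastqc/summary.txt
}

count_status() {
    # $1: directory that holds the *_fastqc folders, $2: PASS, WARN or FAIL
    awk -F'\t' -v s="$2" '$1 == s { n++ } END { print n + 0 }' "$1"/*_fastqc/summary.txt
}

list_failures "$workdir"        # prints: sample.fq: Per base sequence quality
count_status "$workdir" PASS    # prints: 2
```

To use it on real output, pass the directory that contains the `*_fastqc` folders instead of `$workdir`.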


As it turns out, the raw sequence data that we will be using for our analysis today are simulated, so the Phred quality scores in the FASTQ files do not come from a sequencing machine. Therefore, instead of evaluating the FastQC reports generated from these data, we will evaluate some example reports provided on the [Babraham Bioinformatics FastQC Project Webpage](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Example FastQC reports:
* [Good Illumina Data](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html)
* [Bad Illumina Data](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)
* [Adapter dimer contaminated run](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html)
* [Small RNA with read-through adapter](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/small_rna_fastqc.html)
* [Reduced Representation BS-Seq](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/RRBS_fastqc.html)
* [PacBio](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/pacbio_srr075104_fastqc.html)
* [454](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/454_SRR073599_fastqc.html)

After potential problems have been identified and noted, the next step is to address them with a trimming tool such as Trimmomatic, which removes low-quality bases from the ends of reads and (potentially more importantly) removes any remaining adapter sequence.
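To illustrate what trimming low-quality bases from the read end means, here is a toy sketch in awk. It is not how Trimmomatic is implemented and is no substitute for it. It assumes Phred+33 quality encoding and FASTQ records of exactly four lines each, and the function name `trim_tail` is made up for this example.

```shell
#!/bin/bash
# Toy 3'-end quality trimming (assumes Phred+33 and 4-line FASTQ records).
trim_tail() {
    # $1: minimum Phred score to keep; FASTQ is read from standard input
    awk -v minq="$1" '
        # Lookup table from printable ASCII character to its numeric code
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            # Walk back from the read end while the base quality is too low
            n = length($0)
            while (n > 0 && ord[substr($0, n, 1)] - 33 < minq) n--
            print header
            print substr(seq, 1, n)
            print "+"
            print substr($0, 1, n)
        }'
}

# One read whose last three bases have quality "#" (Phred 2)
printf '@read1\nACGTACGT\n+\nIIIII###\n' | trim_tail 20
```

For the sample read, the three trailing low-quality bases are dropped and the sequence ACGTA is kept. A real tool also handles adapters, paired reads, and reads that become too short after trimming.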

References:
* [[This FastQC tutorial video|https://www.youtube.com/watch?v=bz93ReOv87Y]] demonstrates the use of FastQC to analyse some sequence data and goes through the results to explain what a good dataset looks like and what sort of problems you might encounter.
* https://www.biostars.org/p/191504/ -- A tutorial on the basics of the Phred score and an introduction to the quality metrics used by most quality control tools in bioinformatics, such as FastQC

---

| [[Previous Section|Getting Data for Analysis]] | [[This Section|Data Quality Assessment]] | [[Next Section|Adapter Removal, Trimming, and Filtering]] |
|:------------------------------------:|:--------------------------:|:--------------------------------------------:|
| [[Getting Data for Analysis|Getting Data for Analysis]]| [[Data Quality Assessment|Data Quality Assessment]]| [[Adapter Removal, Trimming, and Filtering|Adapter Removal, Trimming, and Filtering]]|
