RNA-seq Pipeline

Drag image to reposition

RNA-seq Pipeline

A simple guide/pipeline of RNA-seq data analysis with real data example using two different approach: (1) conventional genome alignment (reads mapping) with STAR/stringtie/Deseq2 and (2) pseudoalignment with kallisto/sleuth.

Table of Content

Table of Content

Sourcing Raw Data File

Download SRA toolkit

Find RNA-seq Data in NCBI SRA

Download SRA file

Convert SRA into FASTQ

Quality Control Analysis

Per base sequence content

Per sequence GC contentsa

Sequence duplication level

MultiQC

Quantifying Tanscripts Abundance with Kallisto

Installation

Preparation and Index Building

Quantification

Exploratory Data Analysis with Sleuth

Installation

Test

Sourcing Raw Data File

Download SRA toolkit

Sequence Read Archive (SRA) is a database that stores raw sequencing data and alignment information. SRA enhance reproducibility and facilitate new discoveries through data analysis. SRA is also a file format, the default file format SRA database. However, most RNA-seq software use .fasta-based file format and .sam file format. Therefore, we need a software that can manipulate .sra file. You can download SRA toolkit from NCBI SRA Toolkit . I use Ubuntu with sudo permission.

# Go to to your download folder
chmod a+x setup-apt.sh
sudo ./setup-apt.sh
source /etc/profile.d/sra-tools.sh

Bash

Documentation of all SRA toolkit software can be read at SRA Toolkit Documentation

Find RNA-seq Data in NCBI SRA

under construction

Download SRA file

Find SRA accession number from NCBI SRA. Type your query in the search bar. You have to find all the relevant SRA accession number from a certain experiment. For example, I search for (MCF-7) AND "Homo sapiens"[orgn:__txid9606]. This query means “search for MCF-7 from Homo sapiens origin. I look for complete experiment with only two group and two or three replicate (for simplicity of the project and “potato-ness” of my laptop). I found MCF-7 OVOL2 gene knockout experiment. 

This experiment have two group: OVOL2 gene knockout and wild-type (control group). Click on one of those results. Click “All Experiment” to see all the RNA-seq data from OVOL2 gene knockout experiment.

Indeed, this experiment only have four RNA-seq data. After that, create a SRA file list using your favorite text editor.

SRR17035672
SRR17035671
SRX13225558
SRX13225559

Bash

Save as OVOL2_sra_list.txt. Then, download using prefetch command from SRA toolkit.

prefetch OVOL2_sra_list.txt

Bash

Convert SRA into FASTQ

Enter your SRA directory, and then use fastq-dump command to convert.

cd SRR17035669
fastq-dump SRR17035669.sra

Bash

It is easier to have all required files in one directory, thus we create symbolic link. In Linux, symbolic link is usually used instead of copying/moving file.

ln -s working-dir/SRR17035669/SRR17035669.fastq working-dir
ln -s working-dir/SRR17035670/SRR17035670.fastq working-dir
ln -s working-dir/SRR17035671/SRR17035671.fastq working-dir
ln -s working-dir/SRR17035672/SRR17035672.fastq working-dir

Bash

Quality Control Analysis

Installing FastQC

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. With FastQC, you can:

Basic sequence stats.

Per base sequence quality

Per base sequence content

Sequence duplication level

Overrepresented sequence

Kmer content

etc

FastQC need Java. In Ubuntu, you can install with the following command.

sudo apt install default-jre
java -version

Bash

Download FastQC from the official website. Unzip, place it into $PATH, and run it.

unzip fastqc_v0.11.9.zip
cd FastQC
chmod a+x fastqc
sudo ln -s /path/to/fastqc /usr/local/bin #locate your fastqc directory

Bash

You can open GUI by typing fastqc or give it argument to run in cli (see below).

Installing MultiQC

MultiQC aggregate results from bioinformatics analyses across many samples into a single report (a HTML file). It's a general use tool, perfect for summarising the output from numerous bioinformatics tools. MultiQC will use FastQC output file and many other RNA-seq software output file as an input.

MultiQC need python and pip. Install pip before installing MultiQC

sudo apt update
sudo apt install python3-pip
pip3 --version #check version
pip install multiqc
multiqc #run multiqc

Bash

QC Analysis

To run multiple QC analysis with fastqc, just type multiple filename as argument.

fastqc SRR17035669.fastq SRR17035670.fastq SRR17035671.fastq SRR17035672.fastq

Bash

You can directly open the html output.

Summary

We can see that there is 3 bad results and one questionable results. Let’s dive in... we will compare the bad one with the reference “good results” from FastQC website.

Per base sequence content

A little bit chaos in the beginning, but not much different with the reference.

Per sequence GC contentsa

It should be perfectly follow the normal distribution.

Sequence duplication level

MultiQC

Use MultiQC to summarize all outpout from FastQC as beautiful as possible. Find fastqc output data (html & zip file) in current directory with MultiQC.

multiqc . #dot represent current directory

Bash

Screenshot

Quantifying Tanscripts Abundance with Kallisto

Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for complete alignment.

Installation

Download release for Linux (I use the latest v0.46.1), then move into $PATH.

#Go to your download directory
tar -zxvf kallisto_linux-v0.46.1.tar.gz
cd kallisto
chmod +x kallisto
sudo mv /usr/local/bin #you can specify another $PATH directory

Bash

Preparation and Index Building

To do quantification of transcripts abundance, kallisto need index file and RNA-seq experiment file. To build index file, we need transcriptome reference file specific to our organism. In this case, it’s Homo sapiens. We will use trimmed RNA-seq .fastq file.

Download transcriptome reference file from Ensembl:

curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

Bash

After that, we can build kallisto’s index file. Note that index file is the output of this operation. You can name it whatever you like, just make sure that it has .idx file format.

# Template for kallisto index
kallisto index -i index_file.idx reference_file.fa.gz

# Actual command (directory and filename may vary)
kallisto index -i Homo_sapiens.GRCh38.cdna.all.fa_index.idx ./Homo_sapiens.GRCh38.cdna.all.fa.gz

Bash

Quantification

For quantification of transcript abundance, we need an index file that we’ve built before and an .fastq  file that we’ve trimmed. Here is the argument used:

Argument -i specify the index file

Argument -o specify the output directory. We create a new directory of output for each fastq file, since the output of kallisto will have same file name in each run (thus, we will overwrite previous output file if we put each run in the same directory).

Since we only have single paired reads, we will use --single argument. If you use two-paired reads, such as each RNA-seq experiment have two files (usually end in xxx_1.fastq and xxx_2.fastq), we don’t specify anything. 

Argument -l specify the mean length of our sequence. This argument is necessary when we use --single. You can find mean sequence length in fastp  results. We use 296.

Argument -s  specify the standard deviation of our sequence length. This argument is necessary when we use --single.

Last, we specify the path of our fastq file.

mkdir kallisto_quant/SRR17035669 #optional, for tidiness.
kallisto quant -i Homo_sapiens.GRCh38.cdna.all.fa_index.idx -o ./kallisto_quant/SRR17035669 --single -l 296 -s 10 ./SRR17035669_p.fastq

Bash

After run, three output file will be created.

abundance.h5
abundance.tsv
run_info.json

Plain Text

abundance.h5 is ..., abundance.tsv is ..., run_info.json is ...,

Exploratory Data Analysis with Sleuth

Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with kallisto. sleuth provides tools for exploratory data analysis utilizing Shiny by RStudio, and implements statistical algorithms for differential analysis that leverage the boostrap estimates of kallisto.

Installation

To use sleuth and many software explained afterward, you should install R (at least v.4.0). In conjunction with R, it is advised to use RStudio as your IDE. You can download RStudio from its official website.

Since R is a fast-moving project, the latest stable version isn’t always available from Ubuntu’s repositories, so we’ll install from CRAN repository. This guide assume you have Ubuntu 20.04 LTS.

Add GPG-key, then add repository into ppa:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'

Bash

Install R:

sudo apt update
sudo apt install r-base

Bash

Open R by typing R in the terminal. We will install BioconductoR, a package manager that distributes software for bioinformatic and biology-related data analysis. In this guide, BioconductoR was used to install several R packages such as rhdf5, DESeq2, EdgeR, ggplot2, and many other!

if (!require("BiocManager", quietly = TRUE))
   install.packages("BiocManager")
BiocManager::install(version = "3.14")

sleuth is dependence to package devtools and rhdf5. However, to install devtools, we need to install some dependencies (especially true in Ubuntu, all of its version. See this StackOverFlow discussion). In the terminal,

sudo apt update
sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev

Bash

Back to the R console, we will install rhdf5 and sleuth.

BiocManager::install("rhdf5") #install rhdf5 from BioconductoR
install.packages("devtools") #install devtools
#install sleuth with devtools and script from mschilli87 (see note)
devtools::install_github("mschilli87/sleuth@drop-h5write-default-import")
library("sleuth") #load sleuth

Note: the ‘proper’ (as the dev intended) way of installing sleuth should be from BiocManager or devtools:

#two ways of installing sleuth according to the dev:
BiocManager::install("pachterlab/sleuth") #install from biocmanager
devtools::install_github("pachterlab/sleuth") #install from devtools

However, as 26/12/2021, there is currently an issue in rhdf5. In R v.4.1.2, after installing with ‘proper’ way, R console give us error warning:

Error: object ‘h5write.default’ is not exported by 'namespace:rhdf5'
Execution halted

This issue was addressed in StackOverFlow and GitHub. Thanks to mschilli87 for easy sleuth installation!

Test

123	101
345	910
567	789