A simple guide/pipeline of RNA-seq data analysis with real data example using two different approach: (1) conventional genome alignment (reads mapping) with STAR/stringtie/Deseq2 and (2) pseudoalignment with kallisto/sleuth.
Table of Content
Sourcing Raw Data File
Download SRA toolkit
Sequence Read Archive (SRA) is a database that stores raw sequencing data and alignment information. SRA enhance reproducibility and facilitate new discoveries through data analysis. SRA is also a file format, the default file format SRA database. However, most RNA-seq software use .fasta-based file format and .sam file format. Therefore, we need a software that can manipulate .sra file. You can download SRA toolkit from NCBI SRA Toolkit . I use Ubuntu with sudo permission.
# Go to to your download folder
chmod a+x setup-apt.sh
sudo ./setup-apt.sh
source /etc/profile.d/sra-tools.sh
Bash
Documentation of all SRA toolkit software can be read at SRA Toolkit Documentation
Find RNA-seq Data in NCBI SRA
under construction
Download SRA file
Find SRA accession number from NCBI SRA. Type your query in the search bar. You have to find all the relevant SRA accession number from a certain experiment. For example, I search for (MCF-7) AND "Homo sapiens"[orgn:__txid9606]. This query means āsearch for MCF-7 from Homo sapiens origin. I look for complete experiment with only two group and two or three replicate (for simplicity of the project and āpotato-nessā of my laptop). I found MCF-7 OVOL2 gene knockout experiment.
This experiment have two group: OVOL2 gene knockout and wild-type (control group). Click on one of those results. Click āAll Experimentā to see all the RNA-seq data from OVOL2 gene knockout experiment.
Indeed, this experiment only have four RNA-seq data. After that, create a SRA file list using your favorite text editor.
SRR17035672
SRR17035671
SRX13225558
SRX13225559
Bash
Save as OVOL2_sra_list.txt. Then, download using prefetch command from SRA toolkit.
prefetch OVOL2_sra_list.txt
Bash
Convert SRA into FASTQ
Enter your SRA directory, and then use fastq-dump command to convert.
cd SRR17035669
fastq-dump SRR17035669.sra
Bash
It is easier to have all required files in one directory, thus we create symbolic link. In Linux, symbolic link is usually used instead of copying/moving file.
ln -s working-dir/SRR17035669/SRR17035669.fastq working-dir
ln -s working-dir/SRR17035670/SRR17035670.fastq working-dir
ln -s working-dir/SRR17035671/SRR17035671.fastq working-dir
ln -s working-dir/SRR17035672/SRR17035672.fastq working-dir
Bash
Quality Control Analysis
Installing FastQC
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. With FastQC, you can:
Basic sequence stats.
Per base sequence quality
Per base sequence content
Sequence duplication level
Overrepresented sequence
Kmer content
etc
FastQC need Java. In Ubuntu, you can install with the following command.
sudo apt install default-jre
java -version
Bash
Download FastQC from the official website. Unzip, place it into $PATH, and run it.
unzip fastqc_v0.11.9.zip
cd FastQC
chmod a+x fastqc
sudo ln -s /path/to/fastqc /usr/local/bin #locate your fastqc directory
Bash
You can open GUI by typing fastqc or give it argument to run in cli (see below).
Installing MultiQC
MultiQC aggregate results from bioinformatics analyses across many samples into a single report (a HTML file). It's a general use tool, perfect for summarising the output from numerous bioinformatics tools. MultiQC will use FastQC output file and many other RNA-seq software output file as an input.
MultiQC need python and pip. Install pip before installing MultiQC
sudo apt update
sudo apt install python3-pip
pip3 --version #check version
pip install multiqc
multiqc #run multiqc
Bash
QC Analysis
To run multiple QC analysis with fastqc, just type multiple filename as argument.
fastqc SRR17035669.fastq SRR17035670.fastq SRR17035671.fastq SRR17035672.fastq
Bash
You can directly open the html output.
Summary
We can see that there is 3 bad results and one questionable results. Letās dive in... we will compare the bad one with the reference āgood resultsā from FastQC website.
Per base sequence content
A little bit chaos in the beginning, but not much different with the reference.
Per sequence GC contentsa
It should be perfectly follow the normal distribution.
Sequence duplication level
MultiQC
Use MultiQC to summarize all outpout from FastQC as beautiful as possible. Find fastqc output data (html & zip file) in current directory with MultiQC.
multiqc . #dot represent current directory
Bash
Screenshot
Quantifying Tanscripts Abundance with Kallisto
Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for complete alignment.
Installation
Download release for Linux (I use the latest v0.46.1), then move into $PATH.
#Go to your download directory
tar -zxvf kallisto_linux-v0.46.1.tar.gz
cd kallisto
chmod +x kallisto
sudo mv /usr/local/bin #you can specify another $PATH directory
Bash
Preparation and Index Building
To do quantification of transcripts abundance, kallisto need index file and RNA-seq experiment file. To build index file, we need transcriptome reference file specific to our organism. In this case, itās Homo sapiens. We will use trimmed RNA-seq .fastq file.
Download transcriptome reference file from Ensembl:
curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
Bash
After that, we can build kallistoās index file. Note that index file is the output of this operation. You can name it whatever you like, just make sure that it has .idx file format.
# Template for kallisto index
kallisto index -i index_file.idx reference_file.fa.gz
# Actual command (directory and filename may vary)
kallisto index -i Homo_sapiens.GRCh38.cdna.all.fa_index.idx ./Homo_sapiens.GRCh38.cdna.all.fa.gz
Bash
Quantification
For quantification of transcript abundance, we need an index file that weāve built before and an .fastq file that weāve trimmed. Here is the argument used:
Argument -i specify the index file
Argument -o specify the output directory. We create a new directory of output for each fastq file, since the output of kallisto will have same file name in each run (thus, we will overwrite previous output file if we put each run in the same directory).
Since we only have single paired reads, we will use --single argument. If you use two-paired reads, such as each RNA-seq experiment have two files (usually end in xxx_1.fastq and xxx_2.fastq), we donāt specify anything.
Argument -l specify the mean length of our sequence. This argument is necessary when we use --single. You can find mean sequence length in fastp results. We use 296.
Argument -s specify the standard deviation of our sequence length. This argument is necessary when we use --single.
Last, we specify the path of our fastq file.
mkdir kallisto_quant/SRR17035669 #optional, for tidiness.
kallisto quant -i Homo_sapiens.GRCh38.cdna.all.fa_index.idx -o ./kallisto_quant/SRR17035669 --single -l 296 -s 10 ./SRR17035669_p.fastq
Bash
After run, three output file will be created.
abundance.h5
abundance.tsv
run_info.json
Plain Text
abundance.h5 is ..., abundance.tsv is ..., run_info.json is ...,
Exploratory Data Analysis with Sleuth
Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with kallisto. sleuth provides tools for exploratory data analysis utilizing Shiny by RStudio, and implements statistical algorithms for differential analysis that leverage the boostrap estimates of kallisto.
Installation
To use sleuth and many software explained afterward, you should install R (at least v.4.0). In conjunction with R, it is advised to use RStudio as your IDE. You can download RStudio from its official website.
Since R is a fast-moving project, the latest stable version isnāt always available from Ubuntuās repositories, so weāll install from CRAN repository. This guide assume you have Ubuntu 20.04 LTS.
Add GPG-key, then add repository into ppa:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
Bash
Install R:
sudo apt update
sudo apt install r-base
Bash
Open R by typing R in the terminal. We will install BioconductoR, a package manager that distributes software for bioinformatic and biology-related data analysis. In this guide, BioconductoR was used to install several R packages such as rhdf5, DESeq2, EdgeR, ggplot2, and many other!
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")
R
sleuth is dependence to package devtools and rhdf5. However, to install devtools, we need to install some dependencies (especially true in Ubuntu, all of its version. See this StackOverFlow discussion). In the terminal,
sudo apt update
sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev
Bash
Back to the R console, we will install rhdf5 and sleuth.
BiocManager::install("rhdf5") #install rhdf5 from BioconductoR
install.packages("devtools") #install devtools
#install sleuth with devtools and script from mschilli87 (see note)
devtools::install_github("mschilli87/sleuth@drop-h5write-default-import")
library("sleuth") #load sleuth
R
Note: the āproperā (as the dev intended) way of installing sleuth should be from BiocManager or devtools:
#two ways of installing sleuth according to the dev:
BiocManager::install("pachterlab/sleuth") #install from biocmanager
devtools::install_github("pachterlab/sleuth") #install from devtools
R
However, as 26/12/2021, there is currently an issue in rhdf5. In R v.4.1.2, after installing with āproperā way, R console give us error warning:
Error: object āh5write.defaultā is not exported by 'namespace:rhdf5'
Execution halted
R
This issue was addressed in StackOverFlow and GitHub. Thanks to mschilli87 for easy sleuth installation!
Test