šŸ§¬
RNA-seq Pipeline
Search
Try Notion
Drag image to reposition
šŸ§¬šŸ§¬
RNA-seq Pipeline
A simple guide/pipeline of RNA-seq data analysis with real data example using two different approach: (1) conventional genome alignment (reads mapping) with STAR/stringtie/Deseq2 and (2) pseudoalignment with kallisto/sleuth.
Table of Content
Sourcing Raw Data File
Download SRA toolkit
Sequence Read Archive (SRA) is a database that stores raw sequencing data and alignment information. SRA enhance reproducibility and facilitate new discoveries through data analysis. SRA is also a file format, the default file format SRA database. However, most RNA-seq software use .fasta-based file format and .sam file format. Therefore, we need a software that can manipulate .sra file. You can download SRA toolkit from NCBI SRA Toolkit . I use Ubuntu with sudo permission.
# Go to to your download folder chmod a+x setup-apt.sh sudo ./setup-apt.sh source /etc/profile.d/sra-tools.sh
Bash
Documentation of all SRA toolkit software can be read at SRA Toolkit Documentation
Find RNA-seq Data in NCBI SRA
under construction
Download SRA file
Find SRA accession number from NCBI SRA. Type your query in the search bar. You have to find all the relevant SRA accession number from a certain experiment. For example, I search for (MCF-7) AND "Homo sapiens"[orgn:__txid9606]. This query means ā€œsearch for MCF-7 from Homo sapiens origin. I look for complete experiment with only two group and two or three replicate (for simplicity of the project and ā€œpotato-nessā€ of my laptop). I found MCF-7 OVOL2 gene knockout experiment.
This experiment have two group: OVOL2 gene knockout and wild-type (control group). Click on one of those results. Click ā€œAll Experimentā€ to see all the RNA-seq data from OVOL2 gene knockout experiment.
Indeed, this experiment only have four RNA-seq data. After that, create a SRA file list using your favorite text editor.
SRR17035672 SRR17035671 SRX13225558 SRX13225559
Bash
Save as OVOL2_sra_list.txt. Then, download using prefetch command from SRA toolkit.
prefetch OVOL2_sra_list.txt
Bash
Convert SRA into FASTQ
Enter your SRA directory, and then use fastq-dump command to convert.
cd SRR17035669 fastq-dump SRR17035669.sra
Bash
It is easier to have all required files in one directory, thus we create symbolic link. In Linux, symbolic link is usually used instead of copying/moving file.
ln -s working-dir/SRR17035669/SRR17035669.fastq working-dir ln -s working-dir/SRR17035670/SRR17035670.fastq working-dir ln -s working-dir/SRR17035671/SRR17035671.fastq working-dir ln -s working-dir/SRR17035672/SRR17035672.fastq working-dir
Bash
Quality Control Analysis
Installing FastQC
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. With FastQC, you can:
Basic sequence stats.
Per base sequence quality
Per base sequence content
Sequence duplication level
Overrepresented sequence
Kmer content
etc
FastQC need Java. In Ubuntu, you can install with the following command.
sudo apt install default-jre java -version
Bash
Download FastQC from the official website. Unzip, place it into $PATH, and run it.
unzip fastqc_v0.11.9.zip cd FastQC chmod a+x fastqc sudo ln -s /path/to/fastqc /usr/local/bin #locate your fastqc directory
Bash
You can open GUI by typing fastqc or give it argument to run in cli (see below).
Installing MultiQC
MultiQC aggregate results from bioinformatics analyses across many samples into a single report (a HTML file). It's a general use tool, perfect for summarising the output from numerous bioinformatics tools. MultiQC will use FastQC output file and many other RNA-seq software output file as an input.
MultiQC need python and pip. Install pip before installing MultiQC
sudo apt update sudo apt install python3-pip pip3 --version #check version pip install multiqc multiqc #run multiqc
Bash
QC Analysis
To run multiple QC analysis with fastqc, just type multiple filename as argument.
fastqc SRR17035669.fastq SRR17035670.fastq SRR17035671.fastq SRR17035672.fastq
Bash
You can directly open the html output.
Summary
We can see that there is 3 bad results and one questionable results. Letā€™s dive in... we will compare the bad one with the reference ā€œgood resultsā€ from FastQC website.
Per base sequence content
A little bit chaos in the beginning, but not much different with the reference.
Per sequence GC contentsa
It should be perfectly follow the normal distribution.
Sequence duplication level
MultiQC
Use MultiQC to summarize all outpout from FastQC as beautiful as possible. Find fastqc output data (html & zip file) in current directory with MultiQC.
multiqc . #dot represent current directory
Bash
Screenshot
Quantifying Tanscripts Abundance with Kallisto
Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for complete alignment.
Installation
Download release for Linux (I use the latest v0.46.1), then move into $PATH.
#Go to your download directory tar -zxvf kallisto_linux-v0.46.1.tar.gz cd kallisto chmod +x kallisto sudo mv /usr/local/bin #you can specify another $PATH directory
Bash
Preparation and Index Building
To do quantification of transcripts abundance, kallisto need index file and RNA-seq experiment file. To build index file, we need transcriptome reference file specific to our organism. In this case, itā€™s Homo sapiens. We will use trimmed RNA-seq .fastq file.
Download transcriptome reference file from Ensembl:
curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
Bash
After that, we can build kallistoā€™s index file. Note that index file is the output of this operation. You can name it whatever you like, just make sure that it has .idx file format.
# Template for kallisto index kallisto index -i index_file.idx reference_file.fa.gz # Actual command (directory and filename may vary) kallisto index -i Homo_sapiens.GRCh38.cdna.all.fa_index.idx ./Homo_sapiens.GRCh38.cdna.all.fa.gz
Bash
Quantification
For quantification of transcript abundance, we need an index file that weā€™ve built before and an .fastq file that weā€™ve trimmed. Here is the argument used:
Argument -i specify the index file
Argument -o specify the output directory. We create a new directory of output for each fastq file, since the output of kallisto will have same file name in each run (thus, we will overwrite previous output file if we put each run in the same directory).
Since we only have single paired reads, we will use --single argument. If you use two-paired reads, such as each RNA-seq experiment have two files (usually end in xxx_1.fastq and xxx_2.fastq), we donā€™t specify anything.
Argument -l specify the mean length of our sequence. This argument is necessary when we use --single. You can find mean sequence length in fastp results. We use 296.
Argument -s specify the standard deviation of our sequence length. This argument is necessary when we use --single.
Last, we specify the path of our fastq file.
mkdir kallisto_quant/SRR17035669 #optional, for tidiness. kallisto quant -i Homo_sapiens.GRCh38.cdna.all.fa_index.idx -o ./kallisto_quant/SRR17035669 --single -l 296 -s 10 ./SRR17035669_p.fastq
Bash
After run, three output file will be created.
abundance.h5 abundance.tsv run_info.json
Plain Text
abundance.h5 is ..., abundance.tsv is ..., run_info.json is ...,
Exploratory Data Analysis with Sleuth
Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with kallisto. sleuth provides tools for exploratory data analysis utilizing Shiny by RStudio, and implements statistical algorithms for differential analysis that leverage the boostrap estimates of kallisto.
Installation
To use sleuth and many software explained afterward, you should install R (at least v.4.0). In conjunction with R, it is advised to use RStudio as your IDE. You can download RStudio from its official website.
Since R is a fast-moving project, the latest stable version isnā€™t always available from Ubuntuā€™s repositories, so weā€™ll install from CRAN repository. This guide assume you have Ubuntu 20.04 LTS.
Add GPG-key, then add repository into ppa:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
Bash
Install R:
sudo apt update sudo apt install r-base
Bash
Open R by typing R in the terminal. We will install BioconductoR, a package manager that distributes software for bioinformatic and biology-related data analysis. In this guide, BioconductoR was used to install several R packages such as rhdf5, DESeq2, EdgeR, ggplot2, and many other!
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.14")
R
sleuth is dependence to package devtools and rhdf5. However, to install devtools, we need to install some dependencies (especially true in Ubuntu, all of its version. See this StackOverFlow discussion). In the terminal,
sudo apt update sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev
Bash
Back to the R console, we will install rhdf5 and sleuth.
BiocManager::install("rhdf5") #install rhdf5 from BioconductoR install.packages("devtools") #install devtools #install sleuth with devtools and script from mschilli87 (see note) devtools::install_github("mschilli87/sleuth@drop-h5write-default-import") library("sleuth") #load sleuth
R
Note: the ā€˜properā€™ (as the dev intended) way of installing sleuth should be from BiocManager or devtools:
#two ways of installing sleuth according to the dev: BiocManager::install("pachterlab/sleuth") #install from biocmanager devtools::install_github("pachterlab/sleuth") #install from devtools
R
However, as 26/12/2021, there is currently an issue in rhdf5. In R v.4.1.2, after installing with ā€˜properā€™ way, R console give us error warning:
Error: object ā€˜h5write.defaultā€™ is not exported by 'namespace:rhdf5' Execution halted
R
This issue was addressed in StackOverFlow and GitHub. Thanks to mschilli87 for easy sleuth installation!
Test
123
101
345
910
567
789