在RNA-seq中针对circRNA的鉴定与定量的综合性分析方法——CIRIquant

image.png

CIRIquant 与其他同类算法相比，其在鉴定circRNA的过程中降低了假阳性率。在circRNA的定量和差异表达分析方面，CIRIquant也考虑到RNase R处理不均等因素对最终结果的影响，通过qPCR验证其结果相较于其他方法拥有更高的准确度。同时，CIRIquant还提供对circRNA与线性转录本比例的分析，为circRNA生物发生和调控的相关研究提供基础。

鉴定原理

image.png

首先将reads使用hisat2与参考基因组进行比对，使用CIRI2或者其他软件对circRNA进行鉴定获得潜在的circRNA；
为了对circRNA的表达水平进行精准的定量并筛选BSJs的假阳性，我们将BSJ区域的两个全长序列连接构建一个伪circRNA参考序列；
然后将潜在的circRNA重新比对到伪序列上，确定BSJ reads 是否可以与BSJ区域完全线性对齐。
结合与基因组和伪circRNA序列比对结果，可以通过计算BSJ上的环装剪切junction reads 的比列来确定每个circRNA的junction 率；然后使用RNA-seq分析流程获得转录本水平表达量信息。

1. 安装

1.1 CIRIquant依赖的软件与python包：

Softwares:

bwa
hisat2
stringtie
samtools >= 1.9 (older version of samtools may use deprecated parameters in sort and index commands)

Python packages:

PyYAML
argparse
pysam
numpy
scipy
scikit-learn

1.2 使用源代码安装

从git-hub或者SourceForge处获得最新版本

# create and activate virtual env
pip install virtualenv #OR conda install virtualenv 
virtualenv -p /path/to/your/python2/executable venv
source ./venv/bin/activate

# Install CIRIquant and its requirement automatically
tar zxvf CIRIquant.tar.gz
cd CIRIquant
python setup.py install

# Manual installation of required pacakges is also supported
pip install -r requirements.txt #OR conda install --yes --file requirements.txt

1.3

直接使用pip安装
pip install CIRIquant

2 使用

2.1 circRNA的定量

参数使用：

Usage:
  CIRIquant [options] --config <config> -1 <m1> -2 <m2>

  <config>            Config file
  <m1>                Input mate1 reads (for paired-end data)
  <m2>                Input mate2 reads (for paired-end data)

Options (defaults in parentheses):

  -v                  Run in verbose mode
  -o, --out           Output directory (default: current directory)
  -e, --log           Specific log file (default: sample_prefix.log)
  -p, --prefix        Output sample prefix (default: input sample name)
  -t, --threads       Number of CPU threads to use (defualt: 4)
  -a, --anchor        Minimum anchor length for junction alignment (default: 5)
  -l, --library-type  Library type, 0: unstranded, 1: read1 match the sense strand, 2: read1 match the antisense strand (default: 0)

  --bed               User provided Back-Spliced Junction Site in BED format
  --circ              circRNA prediction results from other tools
  --tool              Tool name, required when --circ is specified ([CIRI2/CIRCexplorer2/DCC/KNIFE/MapSplice/UROBORUS/circRNA_finder/find_circ])

  --RNaseR            CIRIquant output file of RNase R data (required for RNase R correction)
  --bam               Specific hisat2 alignment bam file against reference genome
  --no-gene           Skip StringTie estimation of gene abundance

注意:

目前，–circ 和–tool 选项支持来自CIRI2/ CIRCexplorer2/ DCC/KNIFE/ MapSplice/ UROBORUS/ circRNA_finder/ find_cir 的结果
对于DCC和circRNA_finder等工具，需手动删除连接位置相同但链相反的重复环状rna。
基因表达值需要进行归一化，如果之后需要运行DE分析，不要使用---no-gene。

CIRIquant需要输入一个配置文件

// Example of config file
name: hg19
tools:
  bwa: path/bwa
  hisat2: path/hisat2
  stringtie: path/stringtie
  samtools: path/samtools

reference:
  fasta: path/hg19.fa
  gtf: path/gencode.v19.annotation.gtf
  bwa_index: path//hg19
  hisat_index: path/hg19

Key	Description
name	the name of config file
bwa	the path of bwa
hisat2	the path of hisat2
stringtie	the path of stringite
samtools	the path of samtools, samtools version below 1.3.1 is not supported
fasta	reference genome fasta, a fai index by samtools faidx is also needed under the same directory
gtf	annotation file of reference genome in GTF/GFF3 format
bwa_index	prefix of BWA index for reference genome
hisat_index	prefix of HISAT2 index for reference genome

对于线下提供的circRNA定量表需要有junction位点的bed文件，第4列的格式必须为“chrom:start|end”。例如：

chr1    10000   10099   chr1:10000|10099    .   +
chr1    31000   31200   chr1:31000|31200    .   -

Usage
推荐：使用CIRI2进行circRMA的预测

CIRIquant -t 4 \
          -1 ./test_1.fq.gz \
          -2 ./test_2.fq.gz \
          --config ./chr1.yml \
          -o ./test \
          -p test

使用提供的bed文件时：

CIRIquant -t 4 \
          -1 ./test_1.fq.gz \
          -2 ./test_2.fq.gz \
          --config ./chr1.yml \
          -o ./test \
          -p test \
          --bed your_circRNAs.bed

使用其它鉴定工具时：
例如使用find_circ鉴定的circRNA结果

CIRIquant -t 4 \
          -1 ./test_1.fq.gz \
          -2 ./test_2.fq.gz \
          --config ./chr1.yml \
          -o ./test \
          -p test \
          --circ find_circ_results.txt \
          --tool find_circ

输出结果

The main output of CIRIquant is a GTF file, that contains detailed information of BSJ and FSJ reads of circRNAs and annotation of circRNA back-spliced regions in the attribute columns
CIRIquant 输出的是一个GTF文件，其中包括circRNA的BSJ和FSJ reads 的详细信息，以及属性列中对circRNA反向拼接区域的注释。

每一列的信息

column	name	description
1	chrom	chromosome / contig name
2	source	CIRIquant
3	type	circRNA
4	start	5' back-spliced junction site
5	end	3' back-spliced junction site
6	score	CPM of circRNAs (#BSJ / #Mapped reads)
7	strand	strand information
8	.	.
9	attributes	attributes seperated by semicolon

attributes包含的key与value：

key	description
circ_id	name of circRNA
circ_type	circRNA types: exon / intron / intergenic
bsj	number of bsj reads
fsj	number of fsj reads
junc_ratio	circular to linear ratio: 2 * bsj / ( 2 * bsj + fsj)
rnaser_bsj	number of bsj reads in RNase R data (only when --RNaseR is specificed)
rnaser_fsj	number of fsj reads in RNase R data (only when --RNaseR is specificed)
gene_id	ensemble id of host gene
gene_name	HGNC symbol of host gene
gene_type	type of host gene in gtf file

2.2 RNase R效应矫正

当拥有RNase R处理和未经处理的样本时，CIRIquant 可以评估RNase R数据中处理前的circrna的表达水平。
为了去除RNase R处理效果，需要两个步骤:

对RNase R处理过的样品运行CIRIquant。
对未经处理的总RNA样本运行CIRIquant，--RNaseR使用之前Step1中的输出gtf文件。

然后，CIRIquant将输出RNaseR数据中鉴定到的circrna的表达水平，标题行将包含额外的RNaseR处理效率信息。

Usage:

# Step1. Run CIRIquant with RNase R treated data
CIRIquant --config ./hg19.yml \
          -1 ./RNaseR_treated_1.fq.gz \
          -2 ./RNaseR_treated_2.fq.gz \
          --no-gene \
          -o ./RNaseR_treated \
          -p RNaseR_treated \
          -t 6

# Step2. Run CIRIquant with untreated total RNA
CIRIquant --config ./hg19.yml \
          -1 ./TotalRNA_1.fq.gz \
          -2 ./TotalRNA_2.fq.gz \
          -o ./TotalRNA \
          -p TotalRNA \
          -t 6 \
          --RNaseR ./RNaseR_treated/RNaseR_treated.gtf

2.3 差异表达分析

针对无生物学重复

对于没有重复的样品，差异表达和差异剪接分析使用 CIRI_DE

Usage:
  CIRI_DE [options] -n <control> -c <case> -o <out>

  <control>         CIRIquant result of control sample
  <case>            CIRIquant result of treatment cases
  <out>             Output file

Options (defaults in parentheses):

  -p                p value threshold for DE and DS score calculation (default: 0.05)
  -t                numer of threads (default: 4)

Example usage:
  CIRI_DE -n control.gtf -c case.gtf -o CIRI_DE.tsv

输出结果

column	name	description
1	circRNA_ID	circRNA identifier
2	Case_BSJ	number of BSJ reads in case
3	Case_FSJ	number of FSJ reads in case
4	Case_Ratio	junction ratio in case
5	Ctrl_BSJ	number of BSJ reads in control
6	Ctrl_FSJ	number of FSJ reads in control
7	Ctrl_Ratio	junction ratio in control
8	DE_score	differential expression score
9	DS_score	differential splicing score

有生物学重复

对于生物学重复，推荐使用edgeR分析，使用prep_CIRIquant去生成circRNA表达水平/junction比率的矩阵，使用CIRI_DE_replicate用于DE分析。

Step1: 准备CIRIquant 的输出文件
需要提供一个文本文件，包含样本信息和CIRIquant输出的GTF文件的路径信息

CONTROL1 ./c1/c1.gtf C 1
CONTROL2 ./c2/c2.gtf C 2
CONTROL3 ./c3/c3.gtf C 3
CASE1 ./t1/t1.gtf T 1
CASE2 ./t2/t2.gtf T 2
CASE3 ./t3/t3.gtf T 3

默认情况下，前三列是必需的。对于paired 样本，还可以添加一个主题名称列。

column	description
1	sample name
2	path to CIRIquant output gtf
3	group ("C" for control, "T" for treatment)
4	subject (optional, only for paired samples)

注意:如果使用CIRI_DE进行差异表达分析，第3列group列必须是“C”或者“T”

然后，运行prep_CIRIquant汇总所以样本的circRNA的表达量

Usage:
  prep_CIRIquant [options]

  -i                the file of sample list
  --lib             where to output library information
  --circ            where to output circRNA annotation information
  --bsj             where to output the circRNA expression matrix
  --ratio           where to output the circRNA junction ratio matrix

Example:
  prep_CIRIquant -i sample.lst \
                 --lib library_info.csv \
                 --circ circRNA_info.csv \
                 --bsj circRNA_bsj.csv \
                 --ratio circRNA_ratio.csv

输出的结果可以使用DEseq2与edgeR进行分析。

Step2: 准备 StringTie 的输出
StringTie的输出应该位于output_dir/gene/prefix_out.gtf下。需要使用stringTie中的prepDE.py来生成用于标准化的基因计数矩阵。

例如，可以提供一个sample_gene.lst 包含样本 ID 和 StringTie输出路径的文本文件：

CONTROL1 ./c1/gene/c1_out.gtf
CONTROL2 ./c2/gene/c2_out.gtf
CONTROL3 ./c3/gene/c3_out.gtf
CASE1 ./t1/gene/t1_out.gtf
CASE2 ./t2/gene/t2_out.gtf
CASE3 ./t3/gene/t3_out.gtf

然后使用生成的gene_count_matrix.csv去运行prepDE.py -i sample_gene.lst。

Step3: 差异表达分析
使用CIRI_DE_replicate进行差异表达分析，需要安装edgeR包

Usage:
  CIRI_DE_replicate [options]

  --lib             library information by CIRIquant
  --bsj             circRNA expression matrix
  --gene            gene expression matrix
  --out             output differential expression result

Example:
  CIRI_DE_replicate --lib  library_info.csv \
            --bsj  circRNA_bsj.csv \
            --gene gene_count_matrix.csv \
            --out  circRNA_de.tsv

输出结果是未过滤的，可以设定阈值对结果进行过滤。

参考资料：

circRNA鉴定工具之CIRIquant