featureCounts needs no introduction, so this section focuses on prepDE.py, the counting script that ships with StringTie. Its usage is as follows:
Usage: prepDE.py [options]

Generates two CSV files containing the count matrices for genes and
transcripts, using the coverage values found in the output of `stringtie -e`

  -i INPUT, --input=INPUT, --in=INPUT
                        a folder containing all sample sub-directories, or a
                        text file with sample ID and path to its GTF file on
                        each line [default: ./]
  -g G                  where to output the gene count matrix [default:
                        gene_count_matrix.csv]
  -t T                  where to output the transcript count matrix [default:
                        transcript_count_matrix.csv]
  -l LENGTH, --length=LENGTH
                        the average read length [default: 75]
  -p PATTERN, --pattern=PATTERN
                        a regular expression that selects the sample
                        subdirectories
  -c, --cluster         whether to cluster genes that overlap with different
                        gene IDs, ignoring ones with geneID pattern (see
                        below)
  -s STRING, --string=STRING
                        if a different prefix is used for geneIDs assigned by
                        StringTie [default: MSTRG]
  -k KEY, --key=KEY     if clustering, what prefix to use for geneIDs assigned
                        by this script [default: prepG]
  -v                    enable verbose processing
  --legend=LEGEND       if clustering, where to output the legend file mapping
                        transcripts to assigned geneIDs [default: legend.csv]
In the source code, prepDE.py derives read counts from the coverage values in the GTF, which it extracts with this regular expression:
RE_COVERAGE=re.compile('cov "([\-\+\d\.]+)"')
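To make the coverage-to-count step concrete, here is a minimal Python 3 sketch, not the script's actual code. It assumes the conversion described in the StringTie manual (coverage × transcript length / read length) and rounds the result up, which is an assumption here. The helper names `estimate_read_count` and `transcript_length`, and the GTF records, are hypothetical and made up for illustration.

```python
import re
from math import ceil

# Same pattern as quoted above: captures the value of the `cov` attribute.
RE_COVERAGE = re.compile(r'cov "([\-\+\d\.]+)"')


def transcript_length(exon_lines):
    """Sum exon lengths from GTF exon lines (1-based, inclusive coordinates)."""
    total = 0
    for line in exon_lines:
        fields = line.split("\t")
        total += int(fields[4]) - int(fields[3]) + 1
    return total


def estimate_read_count(coverage, transcript_len, read_len=75):
    """Approximate reads on a transcript: coverage * length / read length.

    Rounding up is an assumption of this sketch.
    """
    return int(ceil(coverage * transcript_len / read_len))


# Hypothetical records in the style of `stringtie -e` output.
transcript_line = (
    'chr1\tStringTie\ttranscript\t1001\t2000\t1000\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; cov "15.0"; FPKM "3.2"; TPM "4.1";'
)
exon_lines = [
    'chr1\tStringTie\texon\t1001\t1300\t1000\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tStringTie\texon\t1701\t2000\t1000\t+\t.\tgene_id "G1"; transcript_id "T1";',
]

coverage = float(RE_COVERAGE.search(transcript_line).group(1))
length = transcript_length(exon_lines)  # two 300 bp exons
print(estimate_read_count(coverage, length, read_len=75))
```

Because the read length sits in the denominator, the value passed to -l scales every count in the matrix, which is why it is worth measuring rather than leaving at the default.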
Run it as follows:
$ python2 prepDE.py \
-i sample_list.txt \
-g gene_count_matrix.csv \
-t transcript_count_matrix.csv
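The input file is sample_list.txt, a tab-separated file with two columns: the first is the sample name and the second is the path to that sample's quantified GTF file. For example: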
sampleA A.stringtie.gtf
sampleB B.stringtie.gtf
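Alternatively, point -i at a folder that contains one sub-directory per sample, each holding that sample's GTF file: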
./sample1/sample1.gtf
./sample2/sample2.gtf
./sample3/sample3.gtf
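For a layout like ./sample1/sample1.gtf, the two-column list file can be generated rather than typed by hand. The following Python 3 sketch assumes each sample directory holds a GTF named after the directory; the function name `write_sample_list` is hypothetical, and the demo builds a throwaway directory tree so it can run anywhere.

```python
import os
import tempfile


def write_sample_list(root, out_path):
    """Write 'sample<TAB>gtf_path' lines for every <root>/<sample>/<sample>.gtf."""
    rows = []
    for sample in sorted(os.listdir(root)):
        gtf = os.path.join(root, sample, sample + ".gtf")
        if os.path.isfile(gtf):  # skip directories without a matching GTF
            rows.append((sample, gtf))
    with open(out_path, "w") as handle:
        for sample, gtf in rows:
            handle.write(f"{sample}\t{gtf}\n")
    return rows


# Demo on a temporary tree that mimics the layout shown above.
demo_root = tempfile.mkdtemp()
for name in ("sample1", "sample2", "sample3"):
    os.makedirs(os.path.join(demo_root, name))
    open(os.path.join(demo_root, name, name + ".gtf"), "w").close()

list_path = os.path.join(demo_root, "sample_list.txt")
for sample, gtf in write_sample_list(demo_root, list_path):
    print(sample, gtf, sep="\t")
```

The resulting file can then be passed to prepDE.py with -i.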
The script writes raw count matrices at both the gene level and the transcript level.
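A quick sanity check on a count matrix is to total the counts per sample. The Python 3 sketch below assumes the CSV has an ID column first, followed by one integer column per sample; the function name `column_totals` and the demo rows are hypothetical, so check the header of your own output file before relying on it.

```python
import csv
import io


def column_totals(handle):
    """Sum the counts in each sample column of a count-matrix CSV."""
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[1:]  # assumption: first column holds the gene/transcript ID
    totals = dict.fromkeys(samples, 0)
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            totals[sample] += int(value)
    return totals


# Hypothetical matrix standing in for gene_count_matrix.csv.
demo = io.StringIO("gene_id,sampleA,sampleB\nG1,120,90\nG2,30,45\n")
print(column_totals(demo))
```

For a real file, open it with `open("gene_count_matrix.csv", newline="")` and pass the handle to `column_totals`.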
Quantifying with StringTie has the advantage of speed, and its greatest convenience is that it yields results in three forms: raw counts, FPKM, and TPM.
To obtain the value for -l, compute the average read length from a FASTQ file. The command and its result are shown below:
awk '{if(NR%4==2) {count++; bases += length} } END{print bases/count}' <fastq_file>
# zcat WT-1h-1_1.fq.gz | awk '{if(NR%4==2) {count++; bases += length} } END{print bases/count}'
100
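The same calculation can be sketched in Python 3 for anyone who prefers it to awk. It mirrors the one-liner above by averaging the length of every second line in each four-line FASTQ record, and handles gzipped input without zcat. The function names are hypothetical and the demo reads are made up.

```python
import gzip
import io


def average_read_length(handle):
    """Mean length of the sequence lines (2nd of every 4 lines) in a FASTQ stream."""
    count = 0
    bases = 0
    for index, line in enumerate(handle):
        if index % 4 == 1:  # same lines as awk's NR % 4 == 2
            count += 1
            bases += len(line.rstrip("\r\n"))
    return bases / count if count else 0.0


def average_read_length_file(path):
    """Open a plain or gzipped FASTQ file and return its mean read length."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return average_read_length(handle)


# Two made-up reads of 8 and 4 bases.
demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
print(average_read_length(demo))  # 6.0
```

Like the awk version, this assumes each record occupies exactly four lines, that is, sequences are not wrapped across lines.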
References
[1]. StringTie: a transcript assembly and quantification tool
[2]. prepDE.py - Using StringTie with DESeq2 and edgeR