Tips | Computing Raw Counts with StringTie

featureCounts needs no introduction, so this post focuses on prepDE.py, the counting script that ships with StringTie. Its help text reads:

Usage: prepDE.py [options]

Generates two CSV files containing the count matrices for genes and
transcripts, using the coverage values found in the output of `stringtie -e`

  -i INPUT, --input=INPUT, --in=INPUT
                        a folder containing all sample sub-directories, or a
                        text file with sample ID and path to its GTF file on
                        each line [default: ./]
  -g G                  where to output the gene count matrix [default:
                        gene_count_matrix.csv
  -t T                  where to output the transcript count matrix [default:
                        transcript_count_matrix.csv]
  -l LENGTH, --length=LENGTH
                        the average read length [default: 75]
  -p PATTERN, --pattern=PATTERN
                        a regular expression that selects the sample
                        subdirectories
  -c, --cluster         whether to cluster genes that overlap with different
                        gene IDs, ignoring ones with geneID pattern (see
                        below)
  -s STRING, --string=STRING
                        if a different prefix is used for geneIDs assigned by
                        StringTie [default: MSTRG]
  -k KEY, --key=KEY     if clustering, what prefix to use for geneIDs assigned
                        by this script [default: prepG]
  -v                    enable verbose processing
  --legend=LEGEND       if clustering, where to output the legend file mapping
                        transcripts to assigned geneIDs [default: legend.csv]

In the source code, prepDE.py computes read counts from the coverage values in the GTF, which it extracts with this regular expression:

RE_COVERAGE=re.compile('cov "([\-\+\d\.]+)"')
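
To make the conversion concrete: the StringTie manual gives the formula reads_per_transcript = coverage * transcript_len / read_len. The sketch below applies that formula to GTF lines. It assumes that transcript length is the sum of the transcript's exon lengths and that the result is rounded up; the function name `estimate_counts` and the `transcript_id` regex are illustrative and not taken from prepDE.py.

```python
import math
import re

# Same pattern as in prepDE.py, written as a raw string.
RE_COVERAGE = re.compile(r'cov "([\-\+\d\.]+)"')
# Illustrative helper pattern (an assumption, not copied from prepDE.py).
RE_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def estimate_counts(gtf_lines, read_len=75):
    """Estimate per-transcript read counts from `stringtie -e` GTF lines."""
    coverage = {}  # transcript_id -> coverage from the transcript line
    length = {}    # transcript_id -> summed exon length
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        feature, start, end, attrs = fields[2], int(fields[3]), int(fields[4]), fields[8]
        tid_match = RE_TRANSCRIPT_ID.search(attrs)
        if tid_match is None:
            continue
        tid = tid_match.group(1)
        if feature == "transcript":
            cov_match = RE_COVERAGE.search(attrs)
            if cov_match:
                coverage[tid] = float(cov_match.group(1))
        elif feature == "exon":
            # GTF coordinates are 1-based and inclusive.
            length[tid] = length.get(tid, 0) + end - start + 1
    # Assumption: round up, so any nonzero coverage yields at least one read.
    return {
        tid: int(math.ceil(coverage[tid] * length[tid] / read_len))
        for tid in coverage
        if tid in length
    }
```

Because the count scales with 1 / read_len, passing the real average read length through -l matters: the same coverage gives fewer estimated reads when the reads are longer.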

Run it like this:

$ python2 prepDE.py \
-i sample_list.txt  \
-g gene_count_matrix.csv  \
-t transcript_count_matrix.csv

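The input file is sample_list.txt, a tab-separated file with two columns: the first is the sample name and the second is the path to that sample's quantification GTF file. For example: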

sampleA A.stringtie.gtf
sampleB B.stringtie.gtf
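
A sample list like this can be generated rather than typed by hand. The following is a minimal sketch that assumes a hypothetical naming convention in which each GTF file is called `<sample>.stringtie.gtf`; the function name `build_sample_list` and the suffix are illustrative only.

```python
import os


def build_sample_list(gtf_paths, suffix=".stringtie.gtf"):
    """Return the text of a two-column, tab-separated sample list."""
    lines = []
    for path in sorted(gtf_paths):
        name = os.path.basename(path)
        if name.endswith(suffix):
            sample = name[: -len(suffix)]
        else:
            # Fall back to dropping only the last extension.
            sample = os.path.splitext(name)[0]
        lines.append(f"{sample}\t{path}\n")
    return "".join(lines)
```

Writing the returned text to sample_list.txt gives a file that can be passed to -i.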

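Alternatively, point -i at a folder (the default is ./) that contains one sub-directory per sample, each holding that sample's GTF file: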

    ./sample1/sample1.gtf
    ./sample2/sample2.gtf
    ./sample3/sample3.gtf

The script writes raw counts at both the gene and the transcript level.
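
For a quick look at the result, the count matrices can be loaded with the standard library. This sketch assumes the layout of a header row followed by one row per feature, with the feature ID in the first column and one integer count column per sample; `read_count_matrix` is an illustrative name.

```python
import csv


def read_count_matrix(path):
    """Return (sample_names, {feature_id: [count per sample]})."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        samples = header[1:]  # first column holds the gene or transcript ID
        counts = {row[0]: [int(value) for value in row[1:]] for row in reader if row}
    return samples, counts
```

For example, `read_count_matrix("gene_count_matrix.csv")` would return the sample names and a dictionary keyed by gene ID, provided the file matches the assumed layout.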

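Quantifying with StringTie is fast, which is one advantage. Its greatest convenience is that a single workflow yields all three kinds of quantification: raw counts, FPKM, and TPM.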

Compute the average read length of the FASTQ file and pass it to -l (the default is 75). The command and an example run follow:

awk '{if(NR%4==2) {count++; bases += length} } END{print bases/count}' <fastq_file>

# zcat WT-1h-1_1.fq.gz | awk '{if(NR%4==2) {count++; bases += length} } END{print bases/count}'
100
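
The same calculation can be done in Python when awk is not at hand. This is a minimal sketch that assumes a well-formed FASTQ file with exactly four lines per record; the function names are illustrative.

```python
import gzip


def mean_read_length(lines):
    """Average length of the sequence lines in a 4-line-per-record FASTQ."""
    total_bases = 0
    n_reads = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # second line of each record is the sequence
            total_bases += len(line.rstrip("\r\n"))
            n_reads += 1
    if n_reads == 0:
        raise ValueError("no reads found")
    return total_bases / n_reads


def mean_read_length_gz(path):
    """Same calculation for a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return mean_read_length(handle)
```

Calling `mean_read_length_gz("WT-1h-1_1.fq.gz")` mirrors the zcat pipeline above and returns a float that can be rounded before being passed to -l.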

Last edited: 2021.06.20 17:21:22