Filtering the GTF for a cellranger reference database (filtering by gene_biotype)

First, let's look at what the different gene biotypes in the GFF mean:

https://m.ensembl.org/info/genome/genebuild/biotypes.html

# Biotypes

*   **Biotype:** A gene or transcript classification.
    *   **IG gene:** Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
        *   **IG C gene:** Constant chain immunoglobulin gene that undergoes somatic recombination before transcription
        *   **IG D gene:** Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription
        *   **IG J gene:** Joining chain immunoglobulin gene that undergoes somatic recombination before transcription
        *   **IG V gene:** Variable chain immunoglobulin gene that undergoes somatic recombination before transcription
    *   **Nonsense Mediated Decay:** A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction.
    *   **Processed transcript:** Gene/transcript that doesn't contain an open reading frame (ORF).
        *   **Long non-coding RNA (lncRNA):** A non-coding gene/transcript >200bp in length
            *   **3' overlapping ncRNA:** Transcripts where ditag and/or published experimental data strongly supports the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand.
            *   **Antisense:** Transcripts that overlap the genomic span (i.e. exon or introns) of a protein-coding locus on the opposite strand.
            *   **Macro lncRNA:** Unspliced lncRNAs that are several kb in size.
            *   **Non coding:** Transcripts which are known from the literature to not be protein coding.
            *   **Retained intron:** An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene.
            *   **Sense intronic:** A long non-coding transcript in introns of a coding gene that does not overlap any exons.
            *   **Sense overlapping:** A long non-coding transcript that contains a coding gene in its intron on the same strand.
            *   **lincRNA (long intergenic ncRNA):** Transcripts that are long intergenic non-coding RNA locus with a length >200bp. Requires lack of coding potential and may not be conserved between species.
        *   **ncRNA:** A non-coding gene.
            *   **miRNA:** A small RNA (~22bp) that silences the expression of target mRNA.
            *   **miscRNA:** Miscellaneous RNA. A non-coding RNA that cannot be classified.
            *   **piRNA:** An RNA that interacts with piwi proteins involved in genetic silencing.
            *   **rRNA:** The RNA component of a ribosome.
            *   **siRNA:** A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway.
            *   **snRNA:** Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs
            *   **snoRNA:** Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs.
            *   **tRNA:** A transfer RNA, which acts as an adaptor molecule for translation of mRNA.
            *   **vaultRNA:** Short non coding RNA genes that form part of the vault ribonucleoprotein complex.
    *   **Protein coding:** Gene/transcript that contains an open reading frame (ORF).
    *   **Pseudogene:** A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupts the ORF. Thought to have arisen through duplication followed by loss of function.
        *   **IG pseudogene:** Inactivated immunoglobulin gene.
        *   **Polymorphic pseudogene:** Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated.
        *   **Processed pseudogene:** Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
        *   **Transcribed pseudogene:** Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'.
        *   **Translated pseudogene:** Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed', 'Unprocessed'
        *   **Unitary pseudogene:** A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
        *   **Unprocessed pseudogene:** Pseudogene that can contain introns since produced by gene duplication.
    *   **Readthrough:** A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
    *   **Stop codon readthrough:** The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence
    *   **TEC (To be Experimentally Confirmed):** Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
    *   **TR gene:** T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
        *   **TR C gene:** Constant chain T cell receptor gene that undergoes somatic recombination before transcription
        *   **TR D gene:** Diversity chain T cell receptor gene that undergoes somatic recombination before transcription
        *   **TR J gene:** Joining chain T cell receptor gene that undergoes somatic recombination before transcription
        *   **TR V gene:** Variable chain T cell receptor gene that undergoes somatic recombination before transcription

Let's take the rat annotation as an example.

First, extract every biotype that appears in the GTF:

grep gene_biotype Add_MT_Flag.gtf | sed 's/^.*gene_biotype "\([^"]*\)".*$/\1/g' | sort | uniq

These are all the biotypes currently present. You can remove the lncRNA-related entries, or any others, depending on your actual experiment.

antisense
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
processed_pseudogene
processed_transcript
protein_coding
pseudogene
ribozyme
rRNA
scaRNA
sense_intronic
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unprocessed_pseudogene
unprocessed_pseudogene
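
Before deciding what to drop, it can help to see how many genes each biotype accounts for. The following is a minimal sketch, not part of the original workflow: the function name `count_gene_biotypes` is made up for illustration, and it assumes a tab-separated GTF whose gene records carry a `gene_biotype "..."` attribute.

```shell
# Hypothetical helper: count gene records per gene_biotype in a GTF file.
# Assumes tab-separated columns and a gene_biotype "..." attribute.
count_gene_biotypes() {
    local gtf=$1
    # Keep only gene records, pull out the biotype, then tally.
    awk -F'\t' '$3 == "gene"' "$gtf" \
        | sed -n 's/.*gene_biotype "\([^"]*\)".*/\1/p' \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2
}
```

Calling it as `count_gene_biotypes Add_MT_Flag.gtf` prints one line per biotype, with the largest groups first.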

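The biotypes to delete, lncRNA-related ones plus a few others, are listed below. Choose according to your own needs, referring to the explanations above.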

antisense
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
ribozyme
rRNA
scaRNA
sense_intronic
snoRNA
snRNA
sRNA
TEC
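
If the list above is kept in a text file with one biotype per line, the alternation pattern used for filtering can be generated instead of typed by hand. This is only a sketch under that assumption: the function name `biotype_pattern_from_file` and the file name `remove_biotypes.txt` are hypothetical and do not come from the original script.

```shell
# Hypothetical helper: build an extended-regex alternation such as
# "(antisense|lincRNA)" from a file with one biotype per line.
biotype_pattern_from_file() {
    local list=$1
    # Skip blank lines, then join the remaining lines with "|".
    printf '(%s)' "$(grep -v '^[[:space:]]*$' "$list" | paste -sd'|' -)"
}
```

For example, `BIOTYPE_PATTERN=$(biotype_pattern_from_file remove_biotypes.txt)` would yield a value in the same form as the hand-written pattern in the script.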

Save the following as the script filter_gff.sh:

#!/usr/bin/env bash
# Print the help message if the required arguments are missing
help()
{
cat <<HELP
---------------------------------------------------------------
     Author: Myshu
     Mail: myshu0601@qq.com
     Version: 1.0
     Date: 2022-3-10
     Description: filter for gtf file
---------------------------------------------------------------
USAGE: $0 gtf filter_gtf outdir
    or $0 -h # show this message
EXAMPLE:
    $0 change_filter.gtf change_filter_nolncRNA.gtf output/
HELP
exit 0
}
[ "$1" = "-h" ] && help
# All three positional arguments (gtf, filter_gtf, outdir) are required
[ "$#" -lt 3 ] && help

# Edit the biotype list here!
BIOTYPE_PATTERN=\
"(antisense|lincRNA|\
miRNA|misc_RNA|Mt_rRNA|Mt_tRNA|\
ribozyme|rRNA|scaRNA|\
sense_intronic|snoRNA|snRNA|\
sRNA|TEC)"

GENE_PATTERN="gene_biotype \"${BIOTYPE_PATTERN}\""

gtf_modified=$1
gtf_filtered=$2
outdir=$3
# Make sure the output directory exists before writing the allowlist
mkdir -p "$outdir"
cat "$gtf_modified" \
    | awk '$3 == "gene"' \
    | grep -Ev "$GENE_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort \
    | uniq \
    > "$outdir/gene_allowlist"

## Copy header lines beginning with "#"
grep -E "^#" "$gtf_modified" > "$gtf_filtered"
## Filter to the gene allowlist
grep -Ff "$outdir/gene_allowlist" "$gtf_modified" \
    >> "$gtf_filtered"

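The code above is adapted from the reference-building code on the cellranger website (one reference: https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build).
You end up with the filtered GTF. Filtering a GFF by biotype in similar situations can be done the same way.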
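
As a sanity check on the result, one can count how many gene records in the filtered file still carry a removed biotype; the expected answer is 0. The sketch below is an illustration only: the function name `count_remaining_biotypes` is made up, and it assumes the same pattern format as `BIOTYPE_PATTERN` in the script.

```shell
# Hypothetical helper: print how many gene records in a GTF still carry
# one of the biotypes matched by the given alternation pattern.
count_remaining_biotypes() {
    local gtf=$1 pattern=$2
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function usable under "set -e" while still printing the count.
    awk -F'\t' '$3 == "gene"' "$gtf" \
        | grep -Ec "gene_biotype \"${pattern}\"" || true
}
```

For instance, `count_remaining_biotypes change_filter_nolncRNA.gtf "(antisense|lincRNA|TEC)"` should print 0 if those biotypes were filtered out.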
