Filtering the GTF for a cellranger reference (by gene_biotype)

First, let's look at what the different gene biotypes in the GTF actually mean.

https://m.ensembl.org/info/genome/genebuild/biotypes.html

# Biotypes

*   **Biotype:** A gene or transcript classification.
    *   **IG gene:** Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
        *   **IG C gene:** Constant chain immunoglobulin gene that undergoes somatic recombination before transcription
        *   **IG D gene:** Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription
        *   **IG J gene:** Joining chain immunoglobulin gene that undergoes somatic recombination before transcription
        *   **IG V gene:** Variable chain immunoglobulin gene that undergoes somatic recombination before transcription
    *   **Nonsense Mediated Decay:** A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction.
    *   **Processed transcript:** Gene/transcript that doesn't contain an open reading frame (ORF).
        *   **Long non-coding RNA (lncRNA):** A non-coding gene/transcript >200bp in length
            *   **3' overlapping ncRNA:** Transcripts where ditag and/or published experimental data strongly supports the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand.
            *   **Antisense:** Transcripts that overlap the genomic span (i.e. exon or introns) of a protein-coding locus on the opposite strand.
            *   **Macro lncRNA:** Unspliced lncRNAs that are several kb in size.
            *   **Non coding:** Transcripts which are known from the literature to not be protein coding.
            *   **Retained intron:** An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene.
            *   **Sense intronic:** A long non-coding transcript in introns of a coding gene that does not overlap any exons.
            *   **Sense overlapping:** A long non-coding transcript that contains a coding gene in its intron on the same strand.
            *   **lincRNA (long intergenic ncRNA):** Transcripts that are long intergenic non-coding RNA locus with a length >200bp. Requires lack of coding potential and may not be conserved between species.
        *   **ncRNA:** A non-coding gene.
            *   **miRNA:** A small RNA (~22bp) that silences the expression of target mRNA.
            *   **miscRNA:** Miscellaneous RNA. A non-coding RNA that cannot be classified.
            *   **piRNA:** An RNA that interacts with piwi proteins involved in genetic silencing.
            *   **rRNA:** The RNA component of a ribosome.
            *   **siRNA:** A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway.
            *   **snRNA:** Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs
            *   **snoRNA:** Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs.
            *   **tRNA:** A transfer RNA, which acts as an adaptor molecule for translation of mRNA.
            *   **vaultRNA:** Short non coding RNA genes that form part of the vault ribonucleoprotein complex.
    *   **Protein coding:** Gene/transcript that contains an open reading frame (ORF).
    *   **Pseudogene:** A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupts the ORF. Thought to have arisen through duplication followed by loss of function.
        *   **IG pseudogene:** Inactivated immunoglobulin gene.
        *   **Polymorphic pseudogene:** Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated.
        *   **Processed pseudogene:** Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
        *   **Transcribed pseudogene:** Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'.
        *   **Translated pseudogene:** Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed', 'Unprocessed'
        *   **Unitary pseudogene:** A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
        *   **Unprocessed pseudogene:** Pseudogene that can contain introns since produced by gene duplication.
    *   **Readthrough:** A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
    *   **Stop codon readthrough:** The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence
    *   **TEC (To be Experimentally Confirmed):** Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
    *   **TR gene:** T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
        *   **TR C gene:** Constant chain T cell receptor gene that undergoes somatic recombination before transcription
        *   **TR D gene:** Diversity chain T cell receptor gene that undergoes somatic recombination before transcription
        *   **TR J gene:** Joining chain T cell receptor gene that undergoes somatic recombination before transcription
        *   **TR V gene:** Variable chain T cell receptor gene that undergoes somatic recombination before transcription

We will use the rat annotation as an example.

First, extract every gene_biotype that appears in the GTF:

grep gene_biotype Add_MT_Flag.gtf | sed 's/^.*gene_biotype "\([^"]*\)".*$/\1/g' | sort | uniq

These are all the biotypes currently present. You can remove the lncRNA-related ones, or any others, depending on your actual experiment.

antisense
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
processed_pseudogene
processed_transcript
protein_coding
pseudogene
ribozyme
rRNA
scaRNA
sense_intronic
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unprocessed_pseudogene
unprocessed_pseudogene
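
Before deciding what to drop, it can help to know how many genes fall under each biotype. The following is a minimal sketch of that idea, not part of the original workflow: the function name `count_biotypes` and the tiny demo GTF are made up for illustration, and it assumes every gene record carries a `gene_biotype` attribute in the Ensembl style shown above.

```shell
#!/usr/bin/env bash
# Hypothetical three-gene GTF, used only for this demonstration
demo_gtf=$(mktemp)
printf '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "%s"; gene_biotype "%s";\n' \
    G1 protein_coding G2 protein_coding G3 lincRNA > "$demo_gtf"

# Print "<count> <biotype>" for gene records, most frequent first
count_biotypes() {
    # Keep only gene records so that each gene is counted once
    awk -F'\t' '$3 == "gene"' "$1" \
        | sed -nE 's/.*gene_biotype "([^"]*)".*/\1/p' \
        | sort \
        | uniq -c \
        | sort -k1,1nr
}

count_biotypes "$demo_gtf"
```

On a real annotation you would call `count_biotypes Add_MT_Flag.gtf` instead of using the demo file.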

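The biotypes below are the ones to delete: those belonging to lncRNA, plus a few others. Choose according to your own needs, using the definitions above as a guide.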

antisense
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
ribozyme
rRNA
scaRNA
sense_intronic
snoRNA
snRNA
sRNA
TEC
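
If you keep the list of biotypes to remove in a text file, one per line, it can be joined into an alternation pattern for `grep -E` instead of being typed by hand. This is only a sketch under that assumption: the file `remove_list` is hypothetical and holds just three entries here to keep the example short. The resulting value has the same shape as the `BIOTYPE_PATTERN` variable in the script that follows.

```shell
#!/usr/bin/env bash
# Hypothetical file with one biotype to remove per line
remove_list=$(mktemp)
printf '%s\n' antisense lincRNA miRNA > "$remove_list"

# Join the lines with "|" and wrap them in parentheses for grep -E
BIOTYPE_PATTERN="($(paste -s -d '|' "$remove_list"))"

echo "$BIOTYPE_PATTERN"
```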

Put this into a script, filter_gff.sh:

#!/usr/bin/env bash
# Print the help message when the required arguments are missing
help()
{
cat <<HELP
---------------------------------------------------------------
     Author: Myshu
     Mail: myshu0601@qq.com
     Version: 1.0
     Date: 2022-3-10
     Description: filter for gtf file
---------------------------------------------------------------
USAGE: $0 gtf filter_gtf outdir
    or $0 -h # show this message
EXAMPLE:
    $0 change_filter.gtf change_filter_nolncRNA.gtf output/
HELP
exit 0
}
[ "$1" = "-h" ] && help
# All three positional arguments (gtf, filter_gtf, outdir) are required
[ $# -lt 3 ] && help

# Edit the biotypes to remove here!
BIOTYPE_PATTERN=\
"(antisense|lincRNA|\
miRNA|misc_RNA|Mt_rRNA|Mt_tRNA|\
ribozyme|rRNA|scaRNA|\
sense_intronic|snoRNA|snRNA|\
sRNA|TEC)"

GENE_PATTERN="gene_biotype \"${BIOTYPE_PATTERN}\""

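gtf_modified=$1
gtf_filtered=$2
outdir=$3
# Make sure the output directory exists before writing to it
mkdir -p "$outdir"
# Collect the gene_id of every gene whose biotype is NOT in the removal pattern
awk '$3 == "gene"' "$gtf_modified" \
    | grep -Ev "$GENE_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort -u \
    > "$outdir/gene_allowlist"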

## Copy header lines beginning with "#"
grep -E "^#" "$gtf_modified" > "$gtf_filtered"
## Filter to the gene allowlist
grep -Ff "$outdir/gene_allowlist" "$gtf_modified" \
    >> "$gtf_filtered"

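The code above is adapted from the reference-building code on the cellranger website (one reference: https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build).
This gives you the filtered GTF. Filtering any GTF/GFF by biotype can be done the same way.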
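
As a sanity check after filtering, you can confirm that no record with a removed biotype is left in the output. The sketch below is one possible way to do that and is not from the cellranger documentation: the function name `check_filtered` and the two-gene demo file are invented for illustration, and the pattern argument is assumed to have the same form as `BIOTYPE_PATTERN` in the script.

```shell
#!/usr/bin/env bash
# Succeed only if no record in the GTF carries a removed gene_biotype
check_filtered() {
    local gtf=$1 pattern=$2
    local leftover
    # grep -c prints the number of matching lines (0 when there are none)
    leftover=$(grep -Ec "gene_biotype \"${pattern}\"" "$gtf")
    echo "records with removed biotypes: $leftover"
    [ "$leftover" -eq 0 ]
}

# Hypothetical filtered GTF, used only for this demonstration
demo=$(mktemp)
printf '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "%s"; gene_biotype "%s";\n' \
    G1 protein_coding G2 pseudogene > "$demo"

check_filtered "$demo" "(lincRNA|miRNA)" && echo "filter looks clean"
```

With real data, the call would look like `check_filtered change_filter_nolncRNA.gtf "$BIOTYPE_PATTERN"`, using the output file name from the script's usage example.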
