First, let's look at what the different gene biotypes in a GFF/GTF file mean:
https://m.ensembl.org/info/genome/genebuild/biotypes.html
# Biotypes
* **Biotype:** A gene or transcript classification.
* **IG gene:** Immunoglobulin gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
* **IG C gene:** Constant chain immunoglobulin gene that undergoes somatic recombination before transcription
* **IG D gene:** Diversity chain immunoglobulin gene that undergoes somatic recombination before transcription
* **IG J gene:** Joining chain immunoglobulin gene that undergoes somatic recombination before transcription
* **IG V gene:** Variable chain immunoglobulin gene that undergoes somatic recombination before transcription
* **Nonsense Mediated Decay:** A transcript with a premature stop codon considered likely to be subjected to targeted degradation. Nonsense-Mediated Decay is predicted to be triggered where the in-frame termination codon is found more than 50bp upstream of the final splice junction.
* **Processed transcript:** Gene/transcript that doesn't contain an open reading frame (ORF).
* **Long non-coding RNA (lncRNA):** A non-coding gene/transcript >200bp in length
* **3' overlapping ncRNA:** Transcripts where ditag and/or published experimental data strongly supports the existence of long (>200bp) non-coding transcripts that overlap the 3'UTR of a protein-coding locus on the same strand.
* **Antisense:** Transcripts that overlap the genomic span (i.e. exon or introns) of a protein-coding locus on the opposite strand.
* **Macro lncRNA:** Unspliced lncRNAs that are several kb in size.
* **Non coding:** Transcripts which are known from the literature to not be protein coding.
* **Retained intron:** An alternatively spliced transcript believed to contain intronic sequence relative to other, coding, transcripts of the same gene.
* **Sense intronic:** A long non-coding transcript in introns of a coding gene that does not overlap any exons.
* **Sense overlapping:** A long non-coding transcript that contains a coding gene in its intron on the same strand.
* **lincRNA (long intergenic ncRNA):** Long intergenic non-coding RNA transcripts with a length >200bp. Requires lack of coding potential and may not be conserved between species.
* **ncRNA:** A non-coding gene.
* **miRNA:** A small RNA (~22bp) that silences the expression of target mRNA.
* **miscRNA:** Miscellaneous RNA. A non-coding RNA that cannot be classified.
* **piRNA:** An RNA that interacts with piwi proteins involved in genetic silencing.
* **rRNA:** The RNA component of a ribosome.
* **siRNA:** A small RNA (20-25bp) that silences the expression of target mRNA through the RNAi pathway.
* **snRNA:** Small RNA molecules that are found in the cell nucleus and are involved in the processing of pre messenger RNAs
* **snoRNA:** Small RNA molecules that are found in the cell nucleolus and are involved in the post-transcriptional modification of other RNAs.
* **tRNA:** A transfer RNA, which acts as an adaptor molecule for translation of mRNA.
* **vaultRNA:** Short non coding RNA genes that form part of the vault ribonucleoprotein complex.
* **Protein coding:** Gene/transcript that contains an open reading frame (ORF).
* **Pseudogene:** A gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) that disrupt the ORF. Thought to have arisen through duplication followed by loss of function.
* **IG pseudogene:** Inactivated immunoglobulin gene.
* **Polymorphic pseudogene:** Pseudogene owing to a SNP/indel but in other individuals/haplotypes/strains the gene is translated.
* **Processed pseudogene:** Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
* **Transcribed pseudogene:** Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression. These can be classified into 'Processed', 'Unprocessed' and 'Unitary'.
* **Translated pseudogene:** Pseudogenes that have mass spec data suggesting that they are also translated. These can be classified into 'Processed' and 'Unprocessed'.
* **Unitary pseudogene:** A species specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
* **Unprocessed pseudogene:** Pseudogene that can contain introns since produced by gene duplication.
* **Readthrough:** A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
* **Stop codon readthrough:** The coding sequence contains a stop codon that is translated (as supported by experimental evidence), and termination occurs instead at a canonical stop codon further downstream. It is currently unknown which codon is used to replace the translated stop codon, hence it is represented by 'X' in the protein sequence
* **TEC (To be Experimentally Confirmed):** Regions with EST clusters that have polyA features that could indicate the presence of protein coding genes. These require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies.
* **TR gene:** T cell receptor gene that undergoes somatic recombination, annotated in collaboration with IMGT http://www.imgt.org/.
* **TR C gene:** Constant chain T cell receptor gene that undergoes somatic recombination before transcription
* **TR D gene:** Diversity chain T cell receptor gene that undergoes somatic recombination before transcription
* **TR J gene:** Joining chain T cell receptor gene that undergoes somatic recombination before transcription
* **TR V gene:** Variable chain T cell receptor gene that undergoes somatic recombination before transcription
Take the rat annotation as an example.
First, extract every biotype that appears in the GTF:
grep gene_biotype Add_MT_Flag.gtf | sed 's/^.*gene_biotype "\([^"]*\)".*$/\1/g' | sort | uniq
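Before deciding what to drop, it can help to see how many genes fall into each biotype. The following is a minimal sketch that restricts the same extraction to `gene` records and counts them. The function name `count_biotypes` and the toy records are made up for illustration, and it assumes the attribute is written as `gene_biotype "..."` as in Ensembl-style GTFs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count gene records per gene_biotype in a GTF file.
count_biotypes() {
    awk -F'\t' '$3 == "gene"' "$1" \
        | sed -n 's/^.*gene_biotype "\([^"]*\)".*$/\1/p' \
        | sort \
        | uniq -c \
        | awk '{print $2 "\t" $1}'
}

# Toy input (made-up records) just to show the shape of the output.
demo_gtf=$(mktemp)
{
    printf '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
    printf '1\tensembl\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
    printf '1\tensembl\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_biotype "lincRNA";\n'
    printf '2\tensembl\tgene\t1\t50\t.\t+\t.\tgene_id "G3"; gene_biotype "protein_coding";\n'
} > "$demo_gtf"

count_biotypes "$demo_gtf"
rm -f "$demo_gtf"
```

On the toy input this prints one line per biotype with its gene count (`lincRNA` 1, `protein_coding` 2). The transcript record is ignored, so each gene is counted once.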
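These are all the biotypes currently present. The lncRNA-related ones need to be removed; other categories can be removed as well, depending on the actual experiment.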
antisense
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
processed_pseudogene
processed_transcript
protein_coding
pseudogene
ribozyme
rRNA
scaRNA
sense_intronic
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unprocessed_pseudogene
unprocessed_pseudogene
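Remove the lncRNA-related biotypes and some of the others. Choose according to your own needs, using the definitions above as a guide. The biotypes removed here are: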
antisense
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
ribozyme
rRNA
scaRNA
sense_intronic
snoRNA
snRNA
sRNA
TEC
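Rather than hand-editing a regular expression, the removal list above can be kept in a plain text file, one biotype per line, and joined into an alternation. This is only a sketch: the function name `build_biotype_pattern` and the list file are hypothetical, and it assumes the biotype names contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a one-biotype-per-line file into "(a|b|c)".
build_biotype_pattern() {
    # Drop blank lines, join the rest with "|", then wrap in parentheses.
    grep -v '^[[:space:]]*$' "$1" \
        | paste -s -d '|' - \
        | sed 's/^/(/; s/$/)/'
}

# Toy list (a subset of the biotypes above) to show the result.
demo_list=$(mktemp)
printf 'antisense\nlincRNA\nTEC\n' > "$demo_list"

pattern=$(build_biotype_pattern "$demo_list")
echo "gene_biotype \"${pattern}\""
rm -f "$demo_list"
```

The resulting string can stand in for a hand-written `BIOTYPE_PATTERN`. Because the surrounding `gene_biotype "..."` quotes anchor each alternative, `rRNA` does not accidentally match `Mt_rRNA`.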
Put the filtering into a script, filter_gff.sh:
#!/usr/bin/env bash
# Print the help message when the required arguments are missing
help()
{
cat <<HELP
---------------------------------------------------------------
Author: Myshu
Mail: myshu0601@qq.com
Version: 1.0
Date: 2022-3-10
Description: filter for gtf file
---------------------------------------------------------------
USAGE: $0 gtf filter_gtf outdir
or $0 -h # show this message
EXAMPLE:
$0 change_filter.gtf change_filter_nolncRNA.gtf output/
HELP
exit 0
}
[ "$1" = "-h" ] && help
[ "$#" -ne 3 ] && help
# Edit the list of biotypes to remove here
BIOTYPE_PATTERN=\
"(antisense|lincRNA|\
miRNA|misc_RNA|Mt_rRNA|Mt_tRNA|\
ribozyme|rRNA|scaRNA|\
sense_intronic|snoRNA|snRNA|\
sRNA|TEC)"
GENE_PATTERN="gene_biotype \"${BIOTYPE_PATTERN}\""
gtf_modified=$1
gtf_filtered=$2
outdir=$3
# Make sure the output directory exists before writing the allowlist
mkdir -p "$outdir"
cat "$gtf_modified" \
| awk '$3 == "gene"' \
| grep -Ev "$GENE_PATTERN" \
| sed -E 's/.*(gene_id "[^"]+").*/\1/' \
| sort \
| uniq \
> "$outdir/gene_allowlist"
## Copy header lines beginning with "#"
grep -E "^#" "$gtf_modified" > "$gtf_filtered"
## Filter to the gene allowlist
grep -Ff "$outdir/gene_allowlist" "$gtf_modified" \
>> "$gtf_filtered"
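The code above is adapted from the reference-building code on the Cell Ranger website (see https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build).
The result is the filtered GTF. Filtering a GFF/GTF by any other set of biotypes works the same way.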
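A quick sanity check on the output is to confirm that no gene record with a removed biotype is left. The sketch below is one way to do that. `check_filtered` is a hypothetical helper, not part of the Cell Ranger code, and the toy record is made up.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if no gene record in the filtered GTF
# still carries one of the removed biotypes.
check_filtered() {
    local gtf=$1 pattern=$2
    local leftover
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    leftover=$(awk -F'\t' '$3 == "gene"' "$gtf" \
        | grep -Ec "gene_biotype \"${pattern}\"" || true)
    echo "gene records with removed biotypes: $leftover"
    [ "$leftover" -eq 0 ]
}

# Toy filtered file (made-up record) containing only a protein_coding gene.
demo_filtered=$(mktemp)
printf '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n' > "$demo_filtered"

check_filtered "$demo_filtered" "(antisense|lincRNA|TEC)"
rm -f "$demo_filtered"
```

With a real file, the second argument would be the same alternation used as `BIOTYPE_PATTERN` in the script. A non-zero exit status means some unwanted gene records survived the filter.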