gget
的功能跟之前介绍的fastq-dl
不一样,fastq-dl
主要是下载测序的原始数据的,而gget
的功能就非常丰富了:
- 从ensembl下载参考基因组,
- 获取指定基因的核苷酸或者多肽序列,
- 获取指定蛋白质PDB数据库的数据,
- 查询指定基因全部同源基因,
- 查询指定基因的表达量,
- 做核酸或者多肽序列的blast/blat搜索,
- 甚至可以给定一系列的基因做enrichment。
gget可以从UniProt,NCBI,UCSC,Enrichr,ARCHS4,ensembl数据库查询和交互数据。
还有更多跟医学、癌症、单细胞相关的功能可以参考官方的使用手册:
https://pachterlab.github.io/gget/
在我把gget列进这个专题的时候也没想到这个工具的功能居然这么强大。
而且这个工具现在还在活跃的维护和更新中,我在写这篇帖子时(2024-12-25),3天前还刚刚做了更新修改的。
不用担心这个工具是一个发了文章就跑路不管的一次性工具~
资源链接
- GitHub地址:
https://github.com/pachterlab/gget
- 文章地址:
https://academic.oup.com/bioinformatics/article/39/1/btac836/6971843
- 官方使用手册:
https://pachterlab.github.io/gget/en/introduction.html
安装
conda或者pip都可以安装。
conda:
conda install gget
pip
python -m pip install gget
目前最新版是:0.29.0
安装好后可以用命令行调用,如果你使用的是Jupyter Lab / Google Colab也支持直接引入
import gget
使用
场景1:下载参考基因组文件
可以以'homo_sapiens'
的形式来指定要下载的物种,也支持一些shortcuts例如:
'human'
'mouse'
'human_grch37' (accesses the GRCh37 genome assembly)
拿yeast来做个测试,会返回一个很大的json格式的文件,如果你只想要下载地址,那就加一个-ftp
就可以了。加-ftp
之后得到的长这样
gget ref 'saccharomyces_cerevisiae' -ftp
17:30:14 - INFO - Fetching reference information for saccharomyces_cerevisiae from Ensembl release: 113.
http://ftp.ensembl.org/pub/release-113/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.113.gtf.gz
http://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/cdna/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz
http://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
http://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
http://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/ncrna/Saccharomyces_cerevisiae.R64-1-1.ncrna.fa.gz
http://ftp.ensembl.org/pub/release-113/fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
如果要下载则要提供一个-d
来下载
和-od
指定输出的文件夹
:
gget ref 'saccharomyces_cerevisiae' -ftp -d -od test
萌哥碎碎念:
这个功能对我而言还挺有用的,特别是很多时候冷启动一个项目,什么都没有,连要做的物种的参考基因组都没有,那就可以一行命令获得下载地址了。对我而言省去了很多的去数据库查询的步骤,好用
场景2:根据基因名称得到蛋白质序列
很多时候我需要特定的基因的DNA序列或者蛋白质序列用来做进化树,那么gget就可以很方便的提供这个功能了。
例如人类的ENSG00000130234基因,获取它的蛋白质序列:
gget seq --translate ENSG00000130234
18:17:27 - INFO - Requesting amino acid sequence of the canonical transcript ENST00000252519 of gene ENSG00000130234 from UniProt.
>ENST00000252519 uniprot_id: Q9BYF1 ensembl_id: ENST00000252519 gene_name: ACE2 organism: Homo sapiens sequence_length: 805
MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLAQMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNERLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHLHAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGLPNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGFHEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEMKREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRLGKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYADQSIKVRISLKSALGDKAYEWNDNEMYLFRSSVAYAMRQYFLKVKNQMILFGEEDVRVANLKPRISFNFFVTAPKNVSDIIPRTEVEKAIRMSRSRINDAFRLNDNSLEFLGIQPTLGPPNQPPVSIWLIVFGVVMGVIVVGIVILIFTGIRDRKKKNKARSGENPYASIDISKGENNPGFQNTDDVQTSF
换成yeast也是可以的,只要这个基因的名称是个ensembl的标准名称。这次要DNA序列不用翻译:
gget seq YLL032C
18:23:05 - INFO - Requesting nucleotide sequence of YLL032C from Ensembl.
>YLL032C chromosome:R64-1-1:XII:74270:76747:-1
ATGGATAACTTCAAAATTTACAGTACAGTTATCACAACTGCTTTTTTACAAGTTCCACACTTATACACGACAAATAGATTATGGAAGCCCATAGAAGCACCTTTTTTAGTTGAATTTCTTCAAAAAAGAATAAGCTCCAAAGAACTTAAAAACACAAAAGCAATCTGTCATATAGATCCCTCATGGGTCAATCTAAATGCTTCTTTCATTAGAGATGACATGATCTCAATAAAAGCTACCACAGATGATATGGACCTCGATGCTATTTGCAGGATTTCTTTGCCCTTACCTATGAACACAAATGATCTAACAGCAGAATTGGAAAAAATGAAACGTATACTATTGGACTTGAGTGAAAAATTCAATTTAGAACTAATCATCACCAAAGAACCAGCTTACTTCACACCAGAACAGACAGGCGAGAGCAAGGAATTATGTATTTACGTGCATGCATTAGGATTCCGATCTAATTTAATGGAATGCGAACCCCAGTTATTGGCATTTGTCGATTTAATAAAGAAGAATGGTATGACTTTGCCTCCACAACACTACATTATCGAACCTATGGAACTGAACTCATACTCTGTGCTTCCCTTATATATGGGTGTAGATATGGAAAACTTCAAACATATATCCAGAGCTTTTAAAACCAGCATTTACGCACCTTCCCTCATTACGCTGTCTAGGGATCTTAAGGCAAATCCACAAATCTTCTTCTCAGGCGCTGTCCATTCACTAAGTTTACTGGCTAGGAAAACTCTACGTGAGTCCATAAGCGTGAATTCAAAGAGTTTTTTTTATCGTCGATTGACTAATATTACCCCAGGAAAACTACTGTTTATCCGAAAATACTACCAACAAAAGGTAAACCAGTTAATTTTGAAATACCAATCGCTTATTAGAGTTACTAACGAATACATTGAATTTCAATCGATTTCCACGAACTTATTAGAAATGGTTATTAAGAACTTCACTATTCAAGTATTACATGAAATTGTAGAAGTACAGATTTCACTCAATGAAAACTGCGCAATGTCCCCGGAATTAATAATCGACAGCTTTTTCGGACACACTGGAAACCAAATTGTAGTAATAACTCCCAAAGAGGATTCCTTCAACCAATTAATAGTAGTTGGTAACCAATCTTCCACAGATGAGGCGTCCGATACTTCCATATTGCATTACTTATCTGATTTTATTATGGGCTCAAATCAAGTGATTAATCCAAACTTAAGACAGATAAAAGCCATTTTCGAGATACATCCTGATTTTGAAGACTTCATATCAGGCAAGAAAAATGGTAAATTAACACGTATAATGGAACTATCTGCTTGCTTAATTCAACTGGAAATGGAAGAGGAAGATGATAACCTTTATTTAAATTTGGTTAGTGACTCTTTCCCTGATTTCAAGGAATCATTTAAGAATGTTATAAATGAATTTCCTGCCGAAGAATCTTTCTTCATTCCTGAAGTTTGTCACAGACCGATTATAGGTACAGGTGGTTCCTTAATTCAAGCCACAATGAGAAAGCATAACGTCTTCATTCAGTTTTCCAATAGTTTCAATCTTCCACAAAATAAAATTTCTATGATCAGGTATGATAATGTGATAATTAGGTGCCCCAGGAAAAACAAAGCTAATATATGCCTAGCCAAGAATGATTTAAAACAAATTGTTCAAGAATATGATAGTTTACAATCGAAGACGCTCATTAGATTTTCTAGTGGGCAATATAGACATATTTTGCATGTTAACGGCCAAAAGAATATTATAGGACAAATCGAAAAGAACGAAAATGTTTACATAATGATACCGTTAAAAGAGCCCTTGGATGGAACATCTCAATTGAGTATACAAGGAAATGATGAGAATGCATCAAGAGCCGCTAACGAATTGGTTAATAGTGCGTTTGGTTATGAATACGAGTTCAAAATAGATCAAGAGATAGATCCCAATAAAGAATACGAATTTTATAACCTAATTGTTGTTCCATTTTTGCAAATTATGAACATAATTGTAACTTTTGAGAAGGACCTTATCACTTTTACTTTTGAAAAGGATACTAATGAGAATACTCTAACAAAAGCAATCGAATTACTATCCAATTATTTGGAAACACAAAAGACGAAAATAATATTTAAAAAAATAATTAAAAAATTTGTTCTAGGGTCTGCCTCCAGCAAGAGTAACACCAGTAATAGCAATACCAATGGCAATTTCAGATCAATGAATAATGCCAAAAGTCGTACGACCATCGATAATACCAGCCAATCAGGAGCCTCACCACAACGCCACAAAATGCCTGTTATAACGACGGTAGGAGGAGCCCAAGCCATCAAAGGATATATACCAAACACTTATTACAACGGATATGGGTATGGATACGGATATACATACGAGTACGATTATAATTATGCCAACTCTAATAAAGCTCAAACCAATAATAGGCATAAATATCAAAATGGCAGAAAATGA
场景3:获取给定基因的信息
尝试了一下gget的info和search的功能,info功能能获得给定基因的信息,例如
gget info YLL032C
{
"YLL032C": {
"ensembl_id": "YLL032C",
"uniprot_id": [
"Q07834",
"Q03208"
],
"pdb_id": null,
"ncbi_gene_id": null,
"species": "saccharomyces_cerevisiae",
"assembly_name": "R64-1-1",
"primary_gene_name": [
null,
null
],
"ensembl_gene_name": null,
"synonyms": [],
"parent_gene": null,
"protein_names": [
"KH domain-containing protein YLL032C",
"Uncharacterized protein YML119W"
],
"ensembl_description": "Protein of unknown function; may interact with ribosomes, based on co-purification experiments; green fluorescent protein (GFP)-fusion protein localizes to the cytoplasm; YLL032C is not an essential gene [Source:SGD;Acc:S000003955]",
"uniprot_description": "",
"ncbi_description": null,
"subcellular_localisation": "Cytoplasm",
"object_type": "Gene",
"biotype": "protein_coding",
"canonical_transcript": "YLL032C_mRNA.",
"seq_region_name": "XII",
"strand": -1,
"start": 74270,
"end": 76747,
"all_transcripts": [
{
"transcript_id": "YLL032C_mRNA",
"transcript_biotype": "protein_coding",
"transcript_name": null,
"transcript_strand": -1,
"transcript_start": 74270,
"transcript_end": 76747
}
],
"all_exons": [],
"all_translations": []
}
}
可以得到一系列的信息,结合jq就可以从这里面提取到自己需要的信息啦。
gget search的功能似乎有一点问题,即使使用官网的例子也会出现404的问题。希望作者意识到并且修改吧。
在python里使用gget
上面提到了,如果是Jupyter Lab / Google Colab的环境里,可以直接import进来
import gget
gget.ref("homo_sapiens")
gget.search(["ace2", "angiotensin converting enzyme 2"], "homo_sapiens")
gget.info(["ENSG00000130234", "ENST00000252519"])
gget.seq("ENSG00000130234", translate=True)
gget.blat("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget.blast("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget.muscle(["MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS", "MSSSSWLLLSLVEVTAAQSTIEQQAKTFLDKFHEAEDLFYQSLLAS"])
gget.diamond("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS", reference="MSSSSWLLLSLVEVTAAQSTIEQQAKTFLDKFHEAEDLFYQSLLAS")
gget.enrichr(["ACE2", "AGT", "AGTR1", "ACE", "AGTRAP", "AGTR2", "ACE3P"], database="ontology", plot=True)
gget.archs4("ACE2", which="tissue")
gget.pdb("1R42", save=True)
gget.setup("elm") # setup only needs to be run once
ortho_df, regex_df = gget.elm("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget.setup("cellxgene") # setup only needs to be run once
gget.cellxgene(gene = ["ACE2", "SLC5A1"], tissue = "lung", cell_type = "mucus secreting cell")
gget.setup("alphafold") # setup only needs to be run once
gget.alphafold("MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK")
在R里使用gget
有的同学可能不会python,更喜欢R语言,没关系!gget是可以在R里使用的!
system("pip install gget")
install.packages("reticulate")
library(reticulate)
gget <- import("gget")
gget$ref("homo_sapiens")
gget$search(list("ace2", "angiotensin converting enzyme 2"), "homo_sapiens")
gget$info(list("ENSG00000130234", "ENST00000252519"))
gget$seq("ENSG00000130234", translate=TRUE)
gget$blat("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget$blast("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget$muscle(list("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS", "MSSSSWLLLSLVEVTAAQSTIEQQAKTFLDKFHEAEDLFYQSLLAS"), out="out.afa")
gget$diamond("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS", reference="MSSSSWLLLSLVEVTAAQSTIEQQAKTFLDKFHEAEDLFYQSLLAS")
gget$enrichr(list("ACE2", "AGT", "AGTR1", "ACE", "AGTRAP", "AGTR2", "ACE3P"), database="ontology")
gget$archs4("ACE2", which="tissue")
gget$pdb("1R42", save=TRUE)
萌哥碎碎念
总的来讲,gget的功能似乎已经完全超过我一两年前mark这个工具的时候的功能了,有点叹为观止了。
因为我不是做人类啊单细胞啊肿瘤啊之类的,所以举的例子都是抛砖引玉,希望把这个软件介绍给大家,希望能给你的生信分析减少一点障碍增加一些便利~