Introduction to msigdbr (mouse gene sets included)

Overview

Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

  • in an R-friendly tidy/long format with one gene per row
  • for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
  • as gene symbols as well as NCBI Entrez and Ensembl IDs
  • that can be installed and loaded as a package without requiring additional external files

Please be aware that the homologs were computationally predicted gene by gene, so a full pathway may not be well conserved across species.

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

All gene sets in the database can be retrieved without specifying a collection/category.

all_gene_sets = msigdbr(species = "Mus musculus")
head(all_gene_sets)
#> # A tibble: 6 x 18
#>   gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_ge… human_gene… human_entr…
#>   <chr>  <chr>     <chr>   <chr>             <int> <chr>       <chr>             <int>
#> 1 C3     MIR:MIR_… AAACCA… Abcc4            239273 ENSMUSG000… ABCC4             10257
#> 2 C3     MIR:MIR_… AAACCA… Abraxas2         109359 ENSMUSG000… ABRAXAS2          23172
#> 3 C3     MIR:MIR_… AAACCA… Actn4             60595 ENSMUSG000… ACTN4                81
#> 4 C3     MIR:MIR_… AAACCA… Acvr1             11477 ENSMUSG000… ACVR1                90
#> 5 C3     MIR:MIR_… AAACCA… Adam9             11502 ENSMUSG000… ADAM9              8754
#> 6 C3     MIR:MIR_… AAACCA… Adamts5           23794 ENSMUSG000… ADAMTS5           11096
#> # … with 10 more variables: human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, gs_description <chr>,
#> #   taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>
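
To illustrate what the one-gene-per-row layout makes easy, here is a minimal sketch on a hand-made data frame. Only the column names gs_name and gene_symbol and the gene symbols come from the output above; the gene set names SET_A and SET_B are hypothetical, and the assignment of genes to sets is invented for illustration.

```r
# Toy data frame in the same long format (one gene per row).
# SET_A and SET_B are hypothetical gene set names.
toy_sets = data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol = c("Abcc4", "Actn4", "Acvr1", "Adam9", "Actn4"),
  stringsAsFactors = FALSE
)

# Number of genes in each gene set.
genes_per_set = table(toy_sets$gs_name)

# Names of all gene sets that contain a given gene.
sets_with_actn4 = unique(toy_sets$gs_name[toy_sets$gene_symbol == "Actn4"])
```

Because every row is a single gene-to-set membership, both questions reduce to ordinary counting and subsetting.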

There is a helper function to show the available species. Either scientific or common names are acceptable.

msigdbr_species()
#> # A tibble: 20 x 2
#>    species_name                   species_common_name                                     
#>    <chr>                          <chr>                                                   
#>  1 Anolis carolinensis            Carolina anole, green anole                             
#>  2 Bos taurus                     bovine, cattle, cow, dairy cow, domestic cattle, domest…
#>  3 Caenorhabditis elegans         roundworm                                               
#>  4 Canis lupus familiaris         dog, dogs                                               
#>  5 Danio rerio                    leopard danio, zebra danio, zebra fish, zebrafish       
#>  6 Drosophila melanogaster        fruit fly                                               
#>  7 Equus caballus                 domestic horse, equine, horse                           
#>  8 Felis catus                    cat, cats, domestic cat                                 
#>  9 Gallus gallus                  bantam, chicken, chickens, Gallus domesticus            
#> 10 Homo sapiens                   human                                                   
#> 11 Macaca mulatta                 rhesus macaque, rhesus macaques, Rhesus monkey, rhesus …
#> 12 Monodelphis domestica          gray short-tailed opossum                               
#> 13 Mus musculus                   house mouse, mouse                                      
#> 14 Ornithorhynchus anatinus       duck-billed platypus, duckbill platypus, platypus       
#> 15 Pan troglodytes                chimpanzee                                              
#> 16 Rattus norvegicus              brown rat, Norway rat, rat, rats                        
#> 17 Saccharomyces cerevisiae       baker's yeast, brewer's yeast, S. cerevisiae            
#> 18 Schizosaccharomyces pombe 972… <NA>                                                    
#> 19 Sus scrofa                     pig, pigs, swine, wild boar                             
#> 20 Xenopus tropicalis             tropical clawed frog, western clawed frog

You can retrieve data for a specific collection, such as the hallmark gene sets.

h_gene_sets = msigdbr(species = "mouse", category = "H")
head(h_gene_sets)
#> # A tibble: 6 x 18
#>   gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_ge… human_gene… human_entr…
#>   <chr>  <chr>     <chr>   <chr>             <int> <chr>       <chr>             <int>
#> 1 H      ""        HALLMA… Abca1             11303 ENSMUSG000… ABCA1                19
#> 2 H      ""        HALLMA… Abcb8             74610 ENSMUSG000… ABCB8             11194
#> 3 H      ""        HALLMA… Acaa2             52538 ENSMUSG000… ACAA2             10449
#> 4 H      ""        HALLMA… Acadl             11363 ENSMUSG000… ACADL                33
#> 5 H      ""        HALLMA… Acadm             11364 ENSMUSG000… ACADM                34
#> 6 H      ""        HALLMA… Acads             11409 ENSMUSG000… ACADS                35
#> # … with 10 more variables: human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, gs_description <chr>,
#> #   taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>

Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.

cgp_gene_sets = msigdbr(species = "mouse", category = "C2", subcategory = "CGP")
head(cgp_gene_sets)
#> # A tibble: 6 x 18
#>   gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_ge… human_gene… human_entr…
#>   <chr>  <chr>     <chr>   <chr>             <int> <chr>       <chr>             <int>
#> 1 C2     CGP       ABBUD_… Ahnak             66395 ENSMUSG000… AHNAK             79026
#> 2 C2     CGP       ABBUD_… Alcam             11658 ENSMUSG000… ALCAM               214
#> 3 C2     CGP       ABBUD_… Ankrd40           71452 ENSMUSG000… ANKRD40           91369
#> 4 C2     CGP       ABBUD_… Arid1a            93760 ENSMUSG000… ARID1A             8289
#> 5 C2     CGP       ABBUD_… Bckdhb            12040 ENSMUSG000… BCKDHB              594
#> 6 C2     CGP       ABBUD_… AU021092         239691 ENSMUSG000… C16orf89         146556
#> # … with 10 more variables: human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, gs_description <chr>,
#> #   taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>

There is a helper function to show the available collections.

msigdbr_collections()
#> # A tibble: 23 x 3
#>    gs_cat gs_subcat         num_genesets
#>    <chr>  <chr>                    <int>
#>  1 C1     ""                         278
#>  2 C2     "CGP"                     3368
#>  3 C2     "CP"                        29
#>  4 C2     "CP:BIOCARTA"              292
#>  5 C2     "CP:KEGG"                  186
#>  6 C2     "CP:PID"                   196
#>  7 C2     "CP:REACTOME"             1604
#>  8 C2     "CP:WIKIPATHWAYS"          615
#>  9 C3     "MIR:MIRDB"               2377
#> 10 C3     "MIR:MIR_Legacy"           221
#> 11 C3     "TFT:GTRD"                 523
#> 12 C3     "TFT:TFT_Legacy"           610
#> 13 C4     "CGN"                      427
#> 14 C4     "CM"                       431
#> 15 C5     "GO:BP"                   7481
#> 16 C5     "GO:CC"                    996
#> 17 C5     "GO:MF"                   1708
#> 18 C5     "HPO"                     4813
#> 19 C6     ""                         189
#> 20 C7     "IMMUNESIGDB"             4872
#> 21 C7     "VAX"                      347
#> 22 C8     ""                         671
#> 23 H      ""                          50

The msigdbr() function output is a data frame, so it can be manipulated with standard data frame tools such as dplyr.

library(dplyr)  # provides the %>% pipe used below

all_gene_sets %>%
  dplyr::filter(gs_cat == "H") %>%
  head()
#> # A tibble: 6 x 18
#>   gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_ge… human_gene… human_entr…
#>   <chr>  <chr>     <chr>   <chr>             <int> <chr>       <chr>             <int>
#> 1 H      ""        HALLMA… Abca1             11303 ENSMUSG000… ABCA1                19
#> 2 H      ""        HALLMA… Abcb8             74610 ENSMUSG000… ABCB8             11194
#> 3 H      ""        HALLMA… Acaa2             52538 ENSMUSG000… ACAA2             10449
#> 4 H      ""        HALLMA… Acadl             11363 ENSMUSG000… ACADL                33
#> 5 H      ""        HALLMA… Acadm             11364 ENSMUSG000… ACADM                34
#> 6 H      ""        HALLMA… Acads             11409 ENSMUSG000… ACADS                35
#> # … with 10 more variables: human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, gs_description <chr>,
#> #   taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>

Pathway enrichment analysis

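The msigdbr output can be used with various popular pathway analysis packages. In the snippets in this section, msigdbr_df stands for a data frame returned by msigdbr(), the gene vectors are your own input, and ... marks any further arguments the function needs.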

Use the gene sets data frame for clusterProfiler with genes as Entrez Gene IDs.

msigdbr_t2g = msigdbr_df %>% dplyr::distinct(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for clusterProfiler with genes as gene symbols.

msigdbr_t2g = msigdbr_df %>% dplyr::distinct(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
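
The TERM2GENE argument takes a two-column data frame with the gene set in the first column and the gene in the second. As a rough sketch of what the distinct() step does, the base R version below runs on a toy data frame. The gene set names are hypothetical, and the repeated row is added on purpose to show the de-duplication.

```r
# Toy input with one repeated gene-to-set row (hypothetical gene set names).
toy_df = data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B"),
  entrez_gene = c(11303L, 11303L, 74610L, 52538L),
  gene_symbol = c("Abca1", "Abca1", "Abcb8", "Acaa2"),
  stringsAsFactors = FALSE
)

# Keep only the two columns needed and drop repeated pairs.
toy_t2g = unique(toy_df[, c("gs_name", "entrez_gene")])
rownames(toy_t2g) = NULL
```

The result has three rows, one per unique gene set and gene pair, in the term-then-gene column order.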

Use the gene sets data frame for fgsea.

msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)

Use the gene sets data frame for GSVA.

msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)
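
Both snippets rely on split() to turn the long data frame into a named list with one character vector of genes per gene set. A minimal sketch on toy data, with hypothetical gene set names, shows the resulting shape:

```r
# Toy long-format data (hypothetical gene set names).
toy_df = data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Abcb8", "Acaa2", "Acadl", "Abca1"),
  stringsAsFactors = FALSE
)

# Named list: one element per gene set, each holding that set's genes.
toy_list = split(x = toy_df$gene_symbol, f = toy_df$gs_name)

# Gene set sizes, useful as a quick sanity check before running an analysis.
set_sizes = lengths(toy_list)
```

The list names are the gene set names, so the tools can report results per set.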

Potential questions or concerns

Which version of MSigDB was used?

This package was generated with MSigDB v7.4 (released April 2021). The MSigDB version is used as the base of the msigdbr package version. You can check the installed version with packageVersion("msigdbr").

Can I download the gene sets directly from MSigDB instead of using this package?

Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse experiments. If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.
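
For orientation, a GMT file is tab-delimited with one gene set per line: the set name, a description field, then the genes. The sketch below parses two made-up lines with base R instead of getGmt() and reshapes them into a long data frame. The set names, URLs, and set contents are hypothetical; with a real file you would obtain the lines with readLines().

```r
# Two hypothetical gene sets in GMT layout: name, description, then genes.
gmt_lines = c(
  "SET_A\thttp://example.org/SET_A\tABCA1\tABCB8",
  "SET_B\thttp://example.org/SET_B\tACAA2\tACADL\tACADM"
)

# Split each line on tabs; fields 1 and 2 are metadata, the rest are genes.
fields = strsplit(gmt_lines, "\t", fixed = TRUE)
gmt_list = lapply(fields, function(f) f[-(1:2)])
names(gmt_list) = vapply(fields, function(f) f[1], character(1))

# Reshape to a long layout with one gene per row.
gmt_df = data.frame(
  gs_name = rep(names(gmt_list), lengths(gmt_list)),
  gene_symbol = unlist(gmt_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
```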

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most genes, but not all.
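
A quick check on the six human and mouse symbol pairs shown in the CGP output earlier in this document illustrates the point. The helper to_mouse_case() is a hypothetical name for the naive capitalization rule; it is not part of msigdbr.

```r
# Human/mouse symbol pairs copied from the CGP output shown earlier.
pairs = data.frame(
  human = c("AHNAK", "ALCAM", "ANKRD40", "ARID1A", "BCKDHB", "C16orf89"),
  mouse = c("Ahnak", "Alcam", "Ankrd40", "Arid1a", "Bckdhb", "AU021092"),
  stringsAsFactors = FALSE
)

# Naive rule: first character upper case, the rest lower case.
to_mouse_case = function(x) {
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
}

pairs$guess = to_mouse_case(pairs$human)
pairs$correct = pairs$guess == pairs$mouse
```

Five of the six guesses match. The naive rule leaves C16orf89 unchanged, while the mouse symbol in the output is AU021092.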

Can I convert human genes to any organism myself instead of using this package?

Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren’t there already other similar tools?

There are a few other resources that provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse. MSigDF is based on the WEHI resource, but is converted to a more tidyverse-friendly data frame. These are updated at varying frequencies and may not use the latest version of MSigDB.

What if I have other questions?

You can submit feedback and report bugs on GitHub.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite use of the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), and also the source for the gene set.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
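
The selection rule can be sketched on toy data as follows. This is only an illustration of "keep the best-supported ortholog", not the package's actual code: the gene names and support counts are made up, and the num_sources column is a hypothetical stand-in for the number of supporting databases. How ties are resolved is not stated here, so the sketch simply keeps every tied candidate.

```r
# Hypothetical ortholog candidates with made-up support counts.
candidates = data.frame(
  human_gene = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE2"),
  ortholog = c("Gene1a", "Gene1b", "Gene2a", "Gene2b", "Gene2c"),
  num_sources = c(10, 3, 2, 7, 5),
  stringsAsFactors = FALSE
)

# Highest support count among the candidates of each human gene.
max_support = ave(candidates$num_sources, candidates$human_gene, FUN = max)

# Keep the candidate(s) that reach that maximum.
best = candidates[candidates$num_sources == max_support, ]
```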

For information on how to cite an R package such as msigdbr, you can execute citation("msigdbr").
