annotate: ID 转换终结者+数据包&NCBI查询

Annotation for microarrays

Gentleman R (2019). annotate: Annotation for microarrays. R package version 1.62.0.

1. 关于 annotate

作者是 R. Gentleman, 不知道是不是这位真の大佬👇 (似乎是的,大佬的姓氏一看就是大佬👀


包内函数主要分为四类:

  • 可从特定元数据库 (meta-data libraries) 提取数据的接口函数 (Interface function)

  • 支持查询 NLM 和 NCBI 网页

  • 支持查询染色体坐标数据并能和 geneplotter 包协作作图

  • 可将基因列表输出为 HTML 文件,并能超链接连接到不同的网页资源

简单 并不简单地分别试跑一下前三项的示例代码,再尝试一下 Boss Jimmy 博客提到过的 “ID转换终结者” ┗|`O′|┛ 、

2. 从芯片元数据库 (meta-data libraries) 提取数据

以 Affymetrix HGu95av2 芯片为例,比如想得到 GO 相关数据。

affyGO <- eapply(hgu95av2GO, getOntology)
## 查看各探针对应的 GO ID 数目
table(sapply(affyGO, length)) 
# 
#    0    1    2    3    4    5    6    7    8    9   10   11   12 
# 1901 1577 1866 1768 1440 1136  744  575  367  322  239  154  137 
#   13   14   15   16   17   18   19   20   21   22   23   24   25 
#   89   57   68   27   30   19   17   14   11   11    7    5    2 
#   26   27   28   29   30   32   37 
#   16    3    3    8    1    6    5 

查看各探针对应的 GO evidence codes 及数目:

affyEv <- eapply(hgu95av2GO, getEvidence)
table(unlist(affyEv, use.names=FALSE))
# 
#  EXP   HDA   HEP   HMP    IC   IDA   IEA   IEP   IGI   IMP   IPI 
#  457  6758    80   114  1302 59621 65824  1073  1759 19887 14319 
#  ISA   ISM   ISS   NAS    ND   RCA   TAS 
#  979   743 23946  7448   675   146 43804 

A GO annotation is a statement about the function of a particular gene. Each annotation includes an evidence code to indicate how the annotation to a particular term is supported.

Evidence codes fall into six general categories:

  • experimental evidence
  • phylogenetic evidence
  • computational evidence
  • author statements
  • curatorial statements
  • automatically generated annotations

删除特定的 GO evidence codes:

affyEv_drop <- eapply(hgu95av2GO, dropECode, c("IEA", "ND"))
table(unlist(sapply(affyEv_drop, getEvidence),
             use.names = FALSE))
# 
#   EXP   HDA   HEP   HMP    IC   IDA   IEP   IGI   IMP   IPI 
#   457  6758    80   114  1302 59621  1073  1759 19887 14319 
#   ISA   ISM   ISS   NAS   RCA   TAS 
#   979   743 23946  7448   146 43804 

3. 在线查询信息

需要和 BiobaseXML 这两个包配合使用。

3.1 四个直截了当的函数

  • genbank

  • pubmed

  • entrezGeneByID

  • entrezGeneQuery

genbankpubmed 可通过参数选择返回 XML 格式数据或打开浏览器,entrezGeneByIDentrezGeneQuery 则可返回 URL 链接。

library(annotate)
entrezGeneQuery(c("leukemia", "Homo sapiens"))
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=leukemia%20Homo sapiens"
## emmmmm这样是会报错的
entrezGeneQuery("leukemia", "Homo sapiens") 
# Error in entrezGeneQuery("leukemia", "Homo sapiens") : 
#   unused argument ("Homo sapiens")

当直接输入 UniGene ID 时, entrezGeneByIDentrezGeneQuery 可以说是等同的:

entrezGeneByID(c("100", "1002"))
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100" 
# [2] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=1002"
entrezGeneQuery(100)
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100"
entrezGeneByID(100)
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100"

3.2 获取 PubMed 信息

要用到 Biobase 包内的数据:sample.ExpressionSet .

data(sample.ExpressionSet) 
## 取其中11个基因作为示例
affys <- featureNames(sample.ExpressionSet)[490:500] 
affys
# [1] "31729_at" "31730_at" "31731_at" "31732_at" "31733_at" "31734_at"
# [7] "31735_at" "31736_at" "31737_at" "31738_at" "31739_at"

这时需要把 Affymetrix id (identifiers) 转换成 PubMed ID,以便接下来 pubmed 的愉快操作。

library(hgu95av2.db) 
ids <- getPMID(affys,"hgu95av2") 

得到了一个有11个元素的 list,接下来对它进行 unlist.

ids <- unlist(ids,use.names=FALSE) 
## 去重
ids <- unique(ids[!is.na(as.numeric(ids))]) 
length(ids) 
# [1] 731
ids[1:10]
# [1] "11438666" "12477932" "12878157" "15489334" "16710414" "17375202"
# [7] "17643375" "17884155" "18029348" "19240132"

pubmed 返回了 XMLDocument 对象,即用11个 Affymetrix id 得到了731个 PMID, 再取前10个获取其他信息。

x <- pubmed(ids[1:10]) 
class(x)
[1] "XMLDocument"         "XMLAbstractDocument"
a <- xmlRoot(x) 
numAbst <- length(xmlChildren(a)) 
numAbst 
# [1] 10

构建 PubMedAbst 对象,提取摘要文本:

arts <- vector("list", length=numAbst) 
absts <- rep(NA, numAbst) 
for (i in 1:numAbst) {
  arts[[i]] <- buildPubMedAbst(a[[i]])
  absts[i] <- abstText(arts[[i]])
}
arts[[3]] 
# An object of class 'pubMedAbst':
# Title: Redifferentiation of dedifferentiated
#      chondrocytes and chondrogenesis of human
#      bone marrow stromal cells via chondrosphere
#      formation with expression profiling by
#      large-scale cDNA analysis.
# PMID: 12878157
# Authors: H Imabayashi, T Mori, S Gojo, T Kiyono,
#      T Sugiyama, R Irie, T Isogai, J Hata, Y
#      Toyama, A Umezawa
# Journal: Exp Cell Res
# Date: Aug 2003
absts[3]
# [1] "Characterization of dedifferentiated chondrocytes (DECs) and mesenchymal stem cells capable of differentiating into chondrocytes is of biological and clinical interest. We isolated DECs and bone marrow stromal cells (BMSCs), H4-1 and H3-4, and demonstrated that the cells started to produce extracellular matrices, such as type II collagen and aggrecan, at an early stage of chondrosphere formation. Furthermore, cDNA sequencing of cDNA libraries constricted by the oligocapping method was performed to analyze difference in mRNA expression profiling between DECs and marrow stromal cells. Upon redifferentiation of DECs, cartilage-related extracellular matrix genes, such as those encoding leucine-rich small proteoglycans, cartilage oligomeric matrix protein, and chitinase 3-like 1 (cartilage glycoprotein-39), were highly expressed. Growth factors such as FGF7 and CTGF were detected at a high frequency in the growth stage of monolayer stromal cultures. By combining the expression profile and flow cytometry, we demonstrated that isolated stromal cells, defined by CD34(-), c-kit(-), and CD140alpha(- or low), have chondrogenic potential. The newly established human mesenchymal cells with expression profiling provide a powerful model for a study of chondrogenic differentiation and further understanding of cartilage regeneration in the means of redifferentiated DECs and BMSCs."

4. 染色体信息的构建及利用

利用函数 buildChromLocation() 构建 chromLocation 对象:

z <- buildChromLocation("hgu95av2") 
z
# Instance of a chromLocation class with the following fields:
#   Organism:  Homo sapiens 
#   Data source:  hgu95av2 
#   Number of chromosomes for this organism:  595 
#   Chromosomes of this organism and their lengths in base pairs:
#        1 : 248956422
#        2 : 242193529
#        3 : 198295559
#        4 : 190214555
#        5 : 181538259
#        6 : 170805979
#        7 : 159345973
#        X : 156040895
#        8 : 145138636
#        9 : 138394717
#        11 : 135086622
#        10 : 133797422
#        12 : 133275309
#        13 : 114364328
#        14 : 107043718
#        15 : 101991189
#        16 : 90338345
#        17 : 83257441
#        18 : 80373285
#        20 : 64444167
#        19 : 58617616
#        Y : 57227415
#        22 : 50818468
#        21 : 46709983
#        8_KZ208915v1_fix : 6367528
#        15_KI270905v1_alt : 5161414
#        15_KN538374v1_fix : 4998962
#        6_GL000256v2_alt : 4929269
#        6_GL000254v2_alt : 4827813
#        6_GL000251v2_alt : 4795265
#        6_GL000253v2_alt : 4677643
#        6_GL000250v2_alt : 4672374
#        6_GL000255v2_alt : 4606388
#        6_GL000252v2_alt : 4604811
#        17_KI270857v1_alt : 2877074
#        16_KI270853v1_alt : 2659700
# ...
## 后面太长不贴出来了

这个 S4 对象有6个 slot.

相关信息获取:

## 物种信息
organism(z)
# [1] "Homo sapiens"

## 数据来源
dataSource(z)
# [1] "hgu95av2"

## 查看染色体位点
head(names(chromLocs(z)), n = 10) 
#  [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18"
tail(names(chromLocs(z)), n = 10) 
#  [1] "19_KV575249v1_alt" "19_KV575251v1_alt"
#  [3] "19_KV575252v1_alt" "19_KV575253v1_alt"
#  [5] "19_KV575254v1_alt" "19_KV575255v1_alt"
#  [7] "19_KV575256v1_alt" "19_KV575257v1_alt"
#  [9] "19_KV575259v1_alt" "19_KV575260v1_alt"

## 定位探针所在染色体
get("32972_at", probesToChrom(z))
# [1] "X"

## 查看染色体名称及其长度
head(chromInfo(z))
#         1         2         3         4         5 
# 248956422 242193529 198295559 190214555 181538259 
#         6 
# 170805979 
tail(chromInfo(z))
# Un_KI270312v1 Un_KI270539v1 Un_KI270385v1 Un_KI270423v1 
#           998           993           990           981 
# Un_KI270392v1 Un_KI270394v1 
#           971           970 

## 查看探针所对应的 gene symbol
get("32972_at", geneSymbols(z))
# [1] "NOX1"

## 染色体数
nChrom(z)
# [1] 595

5. ID 转换终结者

5.1 getSymbol()

getSymbol() 为例的一系列函数:

getSYMBOL(x, data)
getEG(x, data)
getGO(x, data)
getPMID(x, data)
getGOdesc(x, which)
lookUp(x, data, what, load = FALSE)

probes <- featureNames(sample.ExpressionSet)[246:260]
getSYMBOL(probes, "hgu95av2.db")
#   31490_at 31491_s_at   31492_at 31493_s_at   31494_at 
#    "SCN5A"    "CASP8"    "EIF3K"         NA         NA 
#   31495_at 31496_g_at   31497_at 31498_f_at 31499_s_at 
#     "XCL2"         NA         NA         NA   "FCGR3B" 
getEG(probes, "hgu95av2.db")
#   31490_at 31491_s_at   31492_at 31493_s_at   31494_at 
#     "6331"      "841"    "27335"         NA         NA 
#   31495_at 31496_g_at   31497_at 31498_f_at 31499_s_at 
#     "6846"         NA         NA         NA     "2215" 
go <- getGO(probes, "hgu95av2.db")
getGOdesc(go[[1]][[1]][["GOID"]], "ANY")
# $`GO:0002027`
# GOID: GO:0002027
# Term: regulation of heart rate
# Ontology: BP
# Definition: Any process that modulates the frequency
#     or rate of heart contraction.
# Synonym: cardiac chronotropy
# Synonym: regulation of heart contraction rate
# Synonym: regulation of rate of heart contraction
getGOdesc(go[[1]][[1]][["GOID"]], "BP")
# $`GO:0002027`
# GOID: GO:0002027
# Term: regulation of heart rate
# Ontology: BP
# Definition: Any process that modulates the frequency
#     or rate of heart contraction.
# Synonym: cardiac chronotropy
# Synonym: regulation of heart contraction rate
# Synonym: regulation of rate of heart contraction

## 试错
getGOdesc(go[[1]][[1]][["GOID"]], "MF")
# NULL
lookUp(probes, "hgu95av2", "ENTREZID")
# $`31490_at`
# [1] "6331"
# 
# $`31491_s_at`
# [1] "841"
# 
# $`31492_at`
# [1] "27335"
# 
# $`31493_s_at`
# [1] NA
# 
# $`31494_at`
# [1] NA
# 
# $`31495_at`
# [1] "6846"
# 
# $`31496_g_at`
# [1] NA
# 
# $`31497_at`
# [1] NA
# 
# $`31498_f_at`
# [1] NA
# 
# $`31499_s_at`
# [1] "2215"

lookUp(go[[2]][[1]][["GOID"]], "GO", "ONTOLOGY")
# 'select()' returned 1:1 mapping between keys and columns
# $`GO:0006508`
# [1] "BP"

5.2 select()

先用 colums() 查看可以转换的项目:

columns(hgu95av2.db)
#  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"     
#  [4] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
#  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
# [10] "GENENAME"     "GO"           "GOALL"       
# [13] "IPI"          "MAP"          "OMIM"        
# [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
# [19] "PFAM"         "PMID"         "PROBEID"     
# [22] "PROSITE"      "REFSEQ"       "SYMBOL"      
# [25] "UCSCKG"       "UNIGENE"      "UNIPROT"     
out <- select(hgu95av2.db, probes,  c("SYMBOL","ENTREZID", "GENENAME"))

References


最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,293评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,604评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,958评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,729评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,719评论 5 366
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,630评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,000评论 3 397
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,665评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,909评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,646评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,726评论 1 330
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,400评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,986评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,959评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,197评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 44,996评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,481评论 2 342

推荐阅读更多精彩内容