annotate: the ID-conversion terminator, plus metadata packages and NCBI queries

Annotation for microarrays

Gentleman R (2019). annotate: Annotation for microarrays. R package version 1.62.0.

1. About `annotate`

The author is R. Gentleman. I wasn't sure whether this is the real big name himself (apparently it is; even the surname looks like a big name's 👀).

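The functions in the package fall into four main groups:

Interface functions that extract data from specific meta-data libraries
Functions for querying the NLM and NCBI web pages
Functions for querying chromosomal location data, which work together with the geneplotter package for plotting
Functions that write a gene list to an HTML file with hyperlinks to various web resources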

Below I run the example code for the first three groups in turn, then try out the "ID-conversion terminator" that Boss Jimmy mentioned on his blog ┗|｀O′|┛

2. Extracting data from chip meta-data libraries

Take the Affymetrix HGu95av2 chip as an example, and suppose we want its GO data.

library(annotate)
library(hgu95av2.db)
affyGO <- eapply(hgu95av2GO, getOntology)
## Number of GO IDs per probe
table(sapply(affyGO, length))
# 
#    0    1    2    3    4    5    6    7    8    9   10   11   12 
# 1901 1577 1866 1768 1440 1136  744  575  367  322  239  154  137 
#   13   14   15   16   17   18   19   20   21   22   23   24   25 
#   89   57   68   27   30   19   17   14   11   11    7    5    2 
#   26   27   28   29   30   32   37 
#   16    3    3    8    1    6    5

The GO evidence codes attached to the probes, and how often each occurs:

affyEv <- eapply(hgu95av2GO, getEvidence)
table(unlist(affyEv, use.names=FALSE))
# 
#  EXP   HDA   HEP   HMP    IC   IDA   IEA   IEP   IGI   IMP   IPI 
#  457  6758    80   114  1302 59621 65824  1073  1759 19887 14319 
#  ISA   ISM   ISS   NAS    ND   RCA   TAS 
#  979   743 23946  7448   675   146 43804

A GO annotation is a statement about the function of a particular gene. Each annotation includes an evidence code to indicate how the annotation to a particular term is supported.

Evidence codes fall into six general categories:

experimental evidence

phylogenetic evidence

computational evidence

author statements

curatorial statements

automatically generated annotations
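
To make these categories concrete, here is a small sketch that groups the evidence codes from the table above by category. The code-to-category assignment is my reading of the GO evidence code guide, so treat it as an assumption; the object names (`ev_counts`, `ev_category`, `by_category`) are made up for this example, and the counts are copied from the output above. No phylogenetic codes appear in that table, so that category is absent here.

```r
# Evidence-code counts copied from the table(unlist(affyEv)) output above
ev_counts <- c(EXP = 457, HDA = 6758, HEP = 80, HMP = 114, IC = 1302,
               IDA = 59621, IEA = 65824, IEP = 1073, IGI = 1759,
               IMP = 19887, IPI = 14319, ISA = 979, ISM = 743,
               ISS = 23946, NAS = 7448, ND = 675, RCA = 146, TAS = 43804)

# Code -> category lookup, only for the codes seen above
# (assignment assumed from the GO evidence code guide)
ev_category <- c(
  EXP = "experimental", IDA = "experimental", IPI = "experimental",
  IMP = "experimental", IGI = "experimental", IEP = "experimental",
  HDA = "experimental", HMP = "experimental", HEP = "experimental",
  ISS = "computational", ISA = "computational", ISM = "computational",
  RCA = "computational",
  TAS = "author statement", NAS = "author statement",
  IC = "curatorial statement", ND = "curatorial statement",
  IEA = "automatically generated"
)

# Sum the annotation counts within each category
by_category <- tapply(ev_counts, ev_category[names(ev_counts)], sum)
print(sort(by_category, decreasing = TRUE))
```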

Dropping specific GO evidence codes:

affyEv_drop <- eapply(hgu95av2GO, dropECode, c("IEA", "ND"))
table(unlist(sapply(affyEv_drop, getEvidence),
             use.names = FALSE))
# 
#   EXP   HDA   HEP   HMP    IC   IDA   IEP   IGI   IMP   IPI 
#   457  6758    80   114  1302 59621  1073  1759 19887 14319 
#   ISA   ISM   ISS   NAS   RCA   TAS 
#   979   743 23946  7448   146 43804
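
The idea behind dropping evidence codes can be sketched in base R on a toy list. This is not how `dropECode` is implemented; `drop_codes` is a hypothetical helper, and the three GO entries are illustrative stand-ins for one probe's annotations (each with a GO ID, an evidence code and an ontology).

```r
# Toy stand-in for one probe's GO annotations (illustrative values only)
probe_go <- list(
  "GO:0002027" = list(GOID = "GO:0002027", Evidence = "IMP", Ontology = "BP"),
  "GO:0005515" = list(GOID = "GO:0005515", Evidence = "IPI", Ontology = "MF"),
  "GO:0016020" = list(GOID = "GO:0016020", Evidence = "IEA", Ontology = "CC")
)

# Hypothetical helper: keep only entries whose evidence code is not in `drop`
drop_codes <- function(go_list, drop = c("IEA", "ND")) {
  Filter(function(entry) !(entry$Evidence %in% drop), go_list)
}

kept <- drop_codes(probe_go)
print(names(kept))
```

Only the entry with the IEA code is removed; the other two survive.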

3. Querying information online

This needs the Biobase and XML packages alongside annotate.

3.1 Four straightforward functions

genbank
pubmed
entrezGeneByID
entrezGeneQuery

genbank and pubmed take an argument that chooses between returning the data as XML and opening a browser, while entrezGeneByID and entrezGeneQuery return URLs.

library(annotate)
entrezGeneQuery(c("leukemia", "Homo sapiens"))
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=leukemia%20Homo sapiens"

## Hmm, this one raises an error: the search terms must go in as a single character vector
entrezGeneQuery("leukemia", "Homo sapiens") 
# Error in entrezGeneQuery("leukemia", "Homo sapiens") : 
#   unused argument ("Homo sapiens")
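
Judging from the URL returned above, the search terms are simply joined with `%20` and appended to a fixed prefix. Here is a minimal base-R sketch of that idea. `build_gene_query` is a hypothetical helper, not part of annotate, and the base URL is copied from the output above with the double slash collapsed. Unlike the real output, the sketch also percent-encodes the space inside "Homo sapiens".

```r
# Hypothetical helper that mimics the shape of the URL shown above
build_gene_query <- function(terms) {
  base <- "https://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term="
  # Percent-encode each term, then join the terms with an encoded space
  encoded <- vapply(terms, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste0(base, paste(encoded, collapse = "%20"))
}

url <- build_gene_query(c("leukemia", "Homo sapiens"))
print(url)
```

Taking one character vector rather than separate arguments mirrors the calling convention that worked for `entrezGeneQuery` above.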

When given an Entrez Gene ID directly, entrezGeneByID and entrezGeneQuery are effectively equivalent:

entrezGeneByID(c("100", "1002"))
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100" 
# [2] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=1002"
entrezGeneQuery(100)
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100"
entrezGeneByID(100)
# [1] "https://www.ncbi.nlm.nih.gov//sites/entrez?db=gene&cmd=search&term=100"

3.2 Retrieving PubMed information

This uses a dataset from the Biobase package: sample.ExpressionSet.

data(sample.ExpressionSet) 
## Take 11 of the genes as an example
affys <- featureNames(sample.ExpressionSet)[490:500] 
affys
# [1] "31729_at" "31730_at" "31731_at" "31732_at" "31733_at" "31734_at"
# [7] "31735_at" "31736_at" "31737_at" "31738_at" "31739_at"

Next, the Affymetrix identifiers have to be mapped to PubMed IDs so that pubmed can work on them.

library(hgu95av2.db) 
ids <- getPMID(affys,"hgu95av2")

This gives a list with 11 elements; unlist it next.

ids <- unlist(ids,use.names=FALSE) 
## Drop NAs and remove duplicates
ids <- unique(ids[!is.na(as.numeric(ids))]) 
length(ids) 
# [1] 731
ids[1:10]
# [1] "11438666" "12477932" "12878157" "15489334" "16710414" "17375202"
# [7] "17643375" "17884155" "18029348" "19240132"
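
The flatten, filter and de-duplicate step can be tried on a toy list without any annotation package. The list below is a made-up stand-in for what `getPMID()` returns (one element per probe, `NA` where there is no mapping); the probe names are invented and the PMIDs are borrowed from the output above. Testing with `is.na()` directly, instead of going through `as.numeric()`, is my simplification.

```r
# Toy stand-in for the list returned by getPMID() (grouping is made up)
pmids_by_probe <- list(
  probe_a = c("11438666", "12477932"),
  probe_b = NA,
  probe_c = c("12477932", "12878157")
)

# How many PMIDs each probe maps to (an NA entry counts as zero)
n_per_probe <- vapply(pmids_by_probe, function(p) sum(!is.na(p)), integer(1))
print(n_per_probe)

# Flatten, drop the NAs, and keep each PMID once
flat <- unlist(pmids_by_probe, use.names = FALSE)
unique_ids <- unique(flat[!is.na(flat)])
print(unique_ids)
```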

So the 11 Affymetrix IDs map to 731 PMIDs. Pass the first 10 to pubmed, which returns an XMLDocument object, to fetch more information.

library(XML)  # provides xmlRoot() and xmlChildren()
x <- pubmed(ids[1:10]) 
class(x)
# [1] "XMLDocument"         "XMLAbstractDocument"
a <- xmlRoot(x) 
numAbst <- length(xmlChildren(a)) 
numAbst 
# [1] 10

Build pubMedAbst objects and extract the abstract text:

arts <- vector("list", length=numAbst) 
absts <- rep(NA, numAbst) 
for (i in seq_len(numAbst)) {
  arts[[i]] <- buildPubMedAbst(a[[i]])
  absts[i] <- abstText(arts[[i]])
}

arts[[3]] 
# An object of class 'pubMedAbst':
# Title: Redifferentiation of dedifferentiated
#      chondrocytes and chondrogenesis of human
#      bone marrow stromal cells via chondrosphere
#      formation with expression profiling by
#      large-scale cDNA analysis.
# PMID: 12878157
# Authors: H Imabayashi, T Mori, S Gojo, T Kiyono,
#      T Sugiyama, R Irie, T Isogai, J Hata, Y
#      Toyama, A Umezawa
# Journal: Exp Cell Res
# Date: Aug 2003

absts[3]
# [1] "Characterization of dedifferentiated chondrocytes (DECs) and mesenchymal stem cells capable of differentiating into chondrocytes is of biological and clinical interest. We isolated DECs and bone marrow stromal cells (BMSCs), H4-1 and H3-4, and demonstrated that the cells started to produce extracellular matrices, such as type II collagen and aggrecan, at an early stage of chondrosphere formation. Furthermore, cDNA sequencing of cDNA libraries constricted by the oligocapping method was performed to analyze difference in mRNA expression profiling between DECs and marrow stromal cells. Upon redifferentiation of DECs, cartilage-related extracellular matrix genes, such as those encoding leucine-rich small proteoglycans, cartilage oligomeric matrix protein, and chitinase 3-like 1 (cartilage glycoprotein-39), were highly expressed. Growth factors such as FGF7 and CTGF were detected at a high frequency in the growth stage of monolayer stromal cultures. By combining the expression profile and flow cytometry, we demonstrated that isolated stromal cells, defined by CD34(-), c-kit(-), and CD140alpha(- or low), have chondrogenic potential. The newly established human mesenchymal cells with expression profiling provide a powerful model for a study of chondrogenic differentiation and further understanding of cartilage regeneration in the means of redifferentiated DECs and BMSCs."

4. Building and using chromosomal information

Use buildChromLocation() to build a chromLocation object:

z <- buildChromLocation("hgu95av2") 
z
# Instance of a chromLocation class with the following fields:
#   Organism:  Homo sapiens 
#   Data source:  hgu95av2 
#   Number of chromosomes for this organism:  595 
#   Chromosomes of this organism and their lengths in base pairs:
#        1 : 248956422
#        2 : 242193529
#        3 : 198295559
#        4 : 190214555
#        5 : 181538259
#        6 : 170805979
#        7 : 159345973
#        X : 156040895
#        8 : 145138636
#        9 : 138394717
#        11 : 135086622
#        10 : 133797422
#        12 : 133275309
#        13 : 114364328
#        14 : 107043718
#        15 : 101991189
#        16 : 90338345
#        17 : 83257441
#        18 : 80373285
#        20 : 64444167
#        19 : 58617616
#        Y : 57227415
#        22 : 50818468
#        21 : 46709983
#        8_KZ208915v1_fix : 6367528
#        15_KI270905v1_alt : 5161414
#        15_KN538374v1_fix : 4998962
#        6_GL000256v2_alt : 4929269
#        6_GL000254v2_alt : 4827813
#        6_GL000251v2_alt : 4795265
#        6_GL000253v2_alt : 4677643
#        6_GL000250v2_alt : 4672374
#        6_GL000255v2_alt : 4606388
#        6_GL000252v2_alt : 4604811
#        17_KI270857v1_alt : 2877074
#        16_KI270853v1_alt : 2659700
# ...
## The rest is too long to paste here

This S4 object has 6 slots.
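
A quick way to look inside the object is sketched below, assuming annotate and hgu95av2.db are installed (the guard skips everything otherwise). `slotNames()` comes from the methods package; `organism()` and `dataSource()` are, as far as I know, accessors that annotate provides for chromLocation objects, matching the first two fields printed above.

```r
# Runs only when the two packages are available
if (requireNamespace("annotate", quietly = TRUE) &&
    requireNamespace("hgu95av2.db", quietly = TRUE)) {
  library(annotate)
  library(hgu95av2.db)
  z <- buildChromLocation("hgu95av2")
  print(slotNames(z))   # names of the slots
  print(organism(z))    # the "Organism" field shown above
  print(dataSource(z))  # the "Data source" field shown above
}
```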

5. The ID-conversion terminator

5.1 `getSYMBOL()`

A family of functions, with getSYMBOL() as the representative:

getSYMBOL(x, data)
getEG(x, data)
getGO(x, data)
getPMID(x, data)
getGOdesc(x, which)
lookUp(x, data, what, load = FALSE)

probes <- featureNames(sample.ExpressionSet)[246:260]

getSYMBOL(probes, "hgu95av2.db")
#   31490_at 31491_s_at   31492_at 31493_s_at   31494_at 
#    "SCN5A"    "CASP8"    "EIF3K"         NA         NA 
#   31495_at 31496_g_at   31497_at 31498_f_at 31499_s_at 
#     "XCL2"         NA         NA         NA   "FCGR3B"

getEG(probes, "hgu95av2.db")
#   31490_at 31491_s_at   31492_at 31493_s_at   31494_at 
#     "6331"      "841"    "27335"         NA         NA 
#   31495_at 31496_g_at   31497_at 31498_f_at 31499_s_at 
#     "6846"         NA         NA         NA     "2215"

go <- getGO(probes, "hgu95av2.db")

getGOdesc(go[[1]][[1]][["GOID"]], "ANY")
# $`GO:0002027`
# GOID: GO:0002027
# Term: regulation of heart rate
# Ontology: BP
# Definition: Any process that modulates the frequency
#     or rate of heart contraction.
# Synonym: cardiac chronotropy
# Synonym: regulation of heart contraction rate
# Synonym: regulation of rate of heart contraction
getGOdesc(go[[1]][[1]][["GOID"]], "BP")
# $`GO:0002027`
# GOID: GO:0002027
# Term: regulation of heart rate
# Ontology: BP
# Definition: Any process that modulates the frequency
#     or rate of heart contraction.
# Synonym: cardiac chronotropy
# Synonym: regulation of heart contraction rate
# Synonym: regulation of rate of heart contraction

## Trial and error: GO:0002027 is a BP term, so asking for MF returns NULL
getGOdesc(go[[1]][[1]][["GOID"]], "MF")
# NULL

lookUp(probes, "hgu95av2", "ENTREZID")
# $`31490_at`
# [1] "6331"
# 
# $`31491_s_at`
# [1] "841"
# 
# $`31492_at`
# [1] "27335"
# 
# $`31493_s_at`
# [1] NA
# 
# $`31494_at`
# [1] NA
# 
# $`31495_at`
# [1] "6846"
# 
# $`31496_g_at`
# [1] NA
# 
# $`31497_at`
# [1] NA
# 
# $`31498_f_at`
# [1] NA
# 
# $`31499_s_at`
# [1] "2215"

lookUp(go[[2]][[1]][["GOID"]], "GO", "ONTOLOGY")
# 'select()' returned 1:1 mapping between keys and columns
# $`GO:0006508`
# [1] "BP"

5.2 `select()`

First use columns() to see which fields can be converted to:

columns(hgu95av2.db)
#  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"     
#  [4] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
#  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
# [10] "GENENAME"     "GO"           "GOALL"       
# [13] "IPI"          "MAP"          "OMIM"        
# [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
# [19] "PFAM"         "PMID"         "PROBEID"     
# [22] "PROSITE"      "REFSEQ"       "SYMBOL"      
# [25] "UCSCKG"       "UNIGENE"      "UNIPROT"

out <- select(hgu95av2.db, keys = probes,
              columns = c("SYMBOL", "ENTREZID", "GENENAME"),
              keytype = "PROBEID")
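
`select()` returns a data frame and may give several rows per probe when a mapping is one-to-many. If a single value per probe is enough, `mapIds()` from AnnotationDbi is a possible alternative. The sketch below assumes hgu95av2.db and Biobase are installed (the guard skips everything otherwise) and rebuilds `probes` so that it can run on its own.

```r
# Runs only when the two packages are available
if (requireNamespace("hgu95av2.db", quietly = TRUE) &&
    requireNamespace("Biobase", quietly = TRUE)) {
  library(hgu95av2.db)  # also attaches AnnotationDbi (select(), mapIds())
  library(Biobase)
  data(sample.ExpressionSet)
  probes <- featureNames(sample.ExpressionSet)[246:260]

  out <- select(hgu95av2.db, keys = probes,
                columns = c("SYMBOL", "ENTREZID", "GENENAME"),
                keytype = "PROBEID")
  print(head(out))

  # One symbol per probe; multiVals = "first" keeps only the first match
  sym <- mapIds(hgu95av2.db, keys = probes, column = "SYMBOL",
                keytype = "PROBEID", multiVals = "first")
  print(sym)
}
```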

References

Annotation Overview https://bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/annotate.pdf
Basic GO Usage https://bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/GOusage.pdf
Guide to GO evidence codes http://geneontology.org/docs/guide-go-evidence-codes/
HOWTO: Use the online query tools https://bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/query.pdf
HowTo: use chromosomal information https://bioconductor.org/packages/release/bioc/vignettes/annotate/inst/doc/chromLoc.pdf
Question: Converting Affymetrix Probes To Gene Ids, Diwan https://www.biostars.org/p/76097/


Last edited: 2019.06.22 22:13:53
