HIBAG是用于使用SNP数据估算HLA类型的软件包,该软件包配套根据现有数据库估算的参数,因此只需要输入自己的芯片数据,无需访问大型数据库,即可得到HLA imputation 信息。
安装HIBAG
source("http://bioconductor.org/biocLite.R")
biocLite("HIBAG")
下载参数
https://hibag.s3.amazonaws.com/hlares_index.html
利用现有参数对HLA进行估算
Pre-fit HIBAG Models for HLA Imputation
以欧洲人群为例,估算HLA-A
library(HIBAG)
# load the published parameter estimates from European ancestry载入已经下载好的欧洲人群数据
mlst <- get(load("HumanOmniExpressExome-European-HLA4-hg19.RData"))
# Import your PLINK BED file假设名字为KG
geno <- hlaBED2Geno("KG.bed", "KG.fam", "KG.bim")
#Run the HIBAG prediction algorithm for HLA–A估算A的基因型
# load HLA-A pre-fit classifier to memory
model <- hlaModelFromObj(mlst$A)
# run prediction
hla_a <- hlaPredict(model, geno)
# run with 8 cores
hla_a <- hlaPredict(model, geno, cl=8)
summary(hla_a)
Gene: A
Range: [29910247bp, 29913661bp] on hg19
of samples: 930
of unique HLA alleles: 38
of unique HLA genotypes: 218
Posterior probability:
[0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
1 (0.1%) 36 (3.9%) 114 (12.3%) 779 (83.8%)
head(hla_a$value)
sample.id allele1 allele2 prob matching
1 HG00096 01:01 29:02 0.9999180 0.0012829705
2 HG00097 03:01 24:02 0.9491243 0.0004386126
3 HG00099 01:01 68:01 0.9983373 0.0008045432
"prob"被用于confidence scores,阈值最好0.5;
“matching”是描述SNP谱如何与在训练集中观察到的SNP单倍型匹配的量度或比例,即,在由训练单倍型组成的随机交配群体中SNP谱的可能性。 匹配比例与置信度分数没有直接关系,但是“ matching”的值非常低表明它在训练集中的代表性不足。
# write the imputation results to a text file
write.table(hla_a$value, file="HLA_A.result.txt", sep="\t",
quote=FALSE, row.names=FALSE)
实际操作中出现两个问题
1.prob偏小最少情况下只有70%大于0.5;
2.matching偏低
相关链接
https://github.com/zhengxwen/HIBAG
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3772955/
https://link.springer.com/protocol/10.1007%2F978-1-4939-8546-3_11