分析流程-多基因风险分数 PRS( Polygenic risk score)

sudo apt-get install zlib1g zlib1g.dev libblas3 libgfortran5 liblapack3 libquadmath0 plink1.9 unzip

sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common

sudo apt install r-base

1.获取或者生成基础数据(base data)

Polygenic Risk Score (PRS) 分析第一步就是获得基础数据(即GWAS统计分析结果),应该包含了与性状相关的所有等位基因信息及对应效应贡献.
CHR: The chromosome in which the SNP resides---位于第几条染色体
BP: Chromosomal co-ordinate of the SNP---位于染色体物理位置,可能出现的形式:POSITION
SNP: SNP ID, usually in the form of rs-ID---SNP编号,通常是rsID
A1: The effect allele of the SNP---效应等位基因,可能出现的形式:REF
A2: The non-effect allele of the SNP---非效应等位基因,可能出现的形式:ALT
N: Number of samples used to obtain the effect size estimate---用于评估效量值的群体数量
SE: The standard error (SE) of the effect size esimate---所评估效应量值的标准误差
P: The P-value of association between the SNP genotypes and the base phenotype---所评估表型和基因型的相关性p值,可能出现的形式:p_value
OR: The effect size estimate of the SNP, if the outcome is binary/case-control. If the outcome is continuous or treated as continuous then this will usually be BETA---所评基因型的效应量,Odds Ratio或 Effect Size
INFO: The imputation information score---插补得分,可能出现的形式:INFO.plink
MAF: The minor allele frequency (MAF) of the SNP ---所评基因型的效应量,可能出现的形式:ALT_FREQ、FRQ

我是从网上下载了别人的GWAS结果,是个TXT,目的是要将数据从以下排序换成第二行所示:

Chromosome  Position    RSID    REF ALT ALT_FREQ    ALT_FREQ_1KGASN RSQ INFO    HWE_P   Pvalue  Qvalue  N   NullLogLike AltLogLike  SNPWeight   SNPWeightSE OddsRatio   WaldStat    NullLogDelta    NullGeneticVar  NullResidualVar NullBias
CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF 
1.1 数据排序

首先把TXT转换成CSV,再用PYTHON提取排序后再把CSV转换成TXT

TXT转CSV

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 11:25:52 2022

@author: Bohan
"""

import pandas as pd
df = pd.read_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.txt",delimiter="\t", low_memory=False)
df.to_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.csv", encoding='utf-8', index=False)

CSV提取排序

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 10:49:59 2022

@author: Bohan
"""

import csv

with open("MD_GWAS_SNPresults-original.csv") as f, open("originalarranged.csv","w",newline='') as tmp:
    r = csv.reader(f)
    wr = csv.writer(tmp)
    wr.writerows([a,b,c,d,e,m,q,k,r,i,f] for [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] in r)
#[a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] 是原来文件的列顺序
#[a,b,c,d,e,m,q,k,r,i,f]是按照原顺序提取重排

CSV转TXT

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 12:07:58 2022

@author: Bohan
"""

import csv
 
 
a=open('MD_GWAS_SNPresults-original.arranged.csv','r')
reader = csv.reader(a)
 
with open('MD_GWAS_SNPresults-original.arranged','w') as f:
    
    for i in reader:
        for x in i:
            f.write(x)
            f.write('\t')
        f.write('\n')
a.close()

搞完base数据看起来这样

CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF 
10  90127   rs185642176 C   T   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905519504189  0.046   
10  90164   rs141504207 C   G   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905515996565  0.046   
10  94026   rs10904032  G   A   10640   0.004807929547222263    0.9907936601472477  1.000113527128114   0.946752744368  0.145   

1.2 base数据质检

解压读取MD_GWAS_SNPresults-original.2015.arranged.txt.gz
输出文件头 (NR==1)
输出MAF大于0.01的数据行 (第11列是MAF)
输出INFO大于0.8 的数据行(第10列是INFO)
压缩结果到MD2015.gz

gunzip -c MD_GWAS_SNPresults-original.2015.arranged.txt.gz |\
awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' |\
gzip  > MD2015.gz

根据第3列的数据去重获得MD2015.nodup.gz(No-duplicate)

gunzip -c MD2015.gz |\
awk '{seen[$3]++; if(seen[$3]==1){ print}}' |\
gzip - > MD2015.nodup.gz

根据第四第五行数值去除Ambiguous SNPs

gunzip -c MD2015.nodup.gz |\
awk '!( ($4=="A" && $5=="T") || \
        ($4=="T" && $5=="A") || \
        ($4=="G" && $5=="C") || \
        ($4=="C" && $5=="G")) {print}' |\
    gzip > MD2015.QC.gz

2.目标数据(target data)的质检QC

本案例用的是illumina ASA v1 SNP芯片检出的原始文件(*.idat文件)
直接去下载illumina的genomestudio软件
https://files.softwaredownloads.illumina.com/5831d9df-95cb-4427-a7c0-499fe871e1d5/genomestudio-software-v2-0-5-0-installer.zip

2.1 创建分析项目

Workflow

1st Step

要用的还有Infinium Asian Screening Array v1.0 Manifest File (BPM Format - GRCh37)

Infinium Asian Screening Array v1.0 Cluster File

2.2 评估参照数据

要用的文件
创建完project和统计后,开始control评估

评估参考数据
内建有参考marker
参照数据可以用来判断实验体系是否正常运行

2.3 评估样品和SNP

image.png

具体理论可以参考这个GSA/ASA芯片质控 - 简书 (jianshu.com)
样品评估
Call Rate(检出率) 样本检出率: 是指对于某种样本而言,通过测序并成功判刑的snp与所有检出的snp的比值,通常标准在90%或以上。
LogR Dev用于评估是否有样品污染
SNP评估
Call Frequency(检出频率)用于SNP检出的样品覆盖程度
GenTrain Scoreillumina自己的算法
image.png

image.png

image.png

image.png

image.png

image.png

image.png

样品评估
Call Rate太低的不要
image.png

SNP评估
image.png

image.png

image.png

image.png

image.png

image.png
image.png

基本流程就是创建样品-添加manifest信息,参考基因组按照GWAS对应版本,利用自带插件plink-input-report-plugin-v2-1-4到处.ped和.map文件

plink1.9 --file new --out new --make-bed
plink1.9 --bfile new --maf 0.01 --hwe 1e-6 --geno 0.02 --mind 0.02 --write-snplist --make-just-fam --out new.QC
plink1.9 --bfile new --keep new.QC.fam --extract new.QC.snplist --indep-pairwise 200 50 0.25 --out new.QC
plink1.9 --bfile new --extract new.QC.prune.in --keep new.QC.fam --het --out new.QC
sudo R
install.packages("data.table")
library(data.table)
dat <- fread("new.QC.het")
valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
fwrite(valid[,c("FID","IID")], "new.valid.sample", sep="\t") 
install.packages("magrittr")
install.packages("R.utils")
library(magrittr)
bim <- fread("new.bim") %>%
     setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
    .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
MD <- fread("MD2015.QC.gz") %>%
.[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
qc <- fread("new.QC.snplist", header=F)
info <- merge(bim, MD, by=c("SNP", "CHR", "BP")) %>%
    .[SNP %in% qc[,V1]]

complement <- function(x){
    switch (x,
        "A" = "T",
        "C" = "G",
        "T" = "A",
        "G" = "C",
        return(NA)
    )

info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
com.snps <- info[sapply(B.A1, complement) == A1 &
                    sapply(B.A2, complement) == A2, SNP]

bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
        list(sapply(B.A1, complement),
            sapply(B.A2, complement))]

recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]

bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
        list(B.A2, B.A1)]

com.recode <- info[sapply(B.A1, complement) == A2 &
                    sapply(B.A2, complement) == A1, SNP]

bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
        list(sapply(B.A2, complement),
            sapply(B.A1, complement))]
fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")

mismatch <- bim[!(SNP %in% info.match |
                    SNP %in% com.snps |
                    SNP %in% recode.snps |
                    SNP %in% com.recode), SNP]
write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
q()


plink1.9 --bfile new --extract new.QC.prune.in --keep new.valid.sample --check-sex --out new.QC

valid <- fread("new.valid.sample")
dat <- fread("new.QC.sexcheck")[FID%in%valid$FID]
fwrite(dat[STATUS=="OK",c("FID","IID")], "new.QC.valid", sep="\t") 
q() # exit R

plink1.9 --bfile new  --extract new.QC.prune.in --keep new.QC.valid --rel-cutoff 0.125 --out new.QC

plink1.9 --bfile new --make-bed --keep new.QC.rel.id --out new.QC --extract new.QC.snplist


-----------------------------------------------------------------------------

library(data.table)
dat <- fread("MD2015.QC.gz")
fwrite(dat[,BETA:=log(OR)], "new.QC.Transformed", sep="\t")
q() # exit R


plink1.9  --bfile new.QC  --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump new.QC.Transformed --clump-snp-field SNP --clump-field P --out new

awk 'NR!=1{print $3}' new.clumped >  new.valid.snp

awk '{print $3,$8}' new.QC.Transformed > new.pvalue

echo "0.001 0 0.001" > range_list 
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list


plink1.9 --bfile new.QC --score new.QC.Transformed 3 4 12 header  --q-score-range range_list new.pvalue --extract new.valid.snp --out new


# First, we need to perform prunning
plink1.9 --bfile new.QC --indep-pairwise 200 50 0.25 --out new
# Then we calculate the first 6 PCs
plink1.9 --bfile new.QC --extract new.prune.in --pca 6 --out new

library(data.table)
library(magrittr)
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
phenotype <- fread("new.phenotype")
pcs <- fread("new.eigenvec", header=F) %>%
    setnames(., colnames(.), c("FID", "IID", paste0("PC",1:6)) )
covariate <- fread("new.cov")
pheno <- merge(phenotype, covariate) %>%
        merge(., pcs)
~~~~~
null.r2 <- summary(lm(Trait1~., data=pheno[,-c("FID", "IID")]))$r.squared
prs.result <- NULL
for(i in p.threshold){
    pheno.prs <- paste0("new.", i, ".profile") %>%
        fread(.) %>%
        .[,c("FID", "IID", "SCORE")] %>%
        merge(., pheno, by=c("FID", "IID"))

    model <- lm(Trait1~., data=pheno.prs[,-c("FID","IID")]) %>%
            summary
    model.r2 <- model$r.squared
    prs.r2 <- model.r2-null.r2
    prs.coef <- model$coeff["SCORE",]
    prs.result %<>% rbind(.,
        data.frame(Threshold=i, R2=prs.r2, 
                    P=as.numeric(prs.coef[4]), 
                    BETA=as.numeric(prs.coef[1]),
                    SE=as.numeric(prs.coef[2])))
}
print(prs.result[which.max(prs.result$R2),])
q() # exit R

p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)

Read in the phenotype file

phenotype <- read.table("new.phenotype", header=T)

Read in the PCs

pcs <- read.table("new.eigenvec", header=F)

The default output from plink does not include a header

To make things simple, we will add the appropriate headers

(1:6 because there are 6 PCs)

colnames(pcs) <- c("FID", "IID", paste0("PC",1:6))

Read in the covariates (here, it is sex)

covariate <- read.table("new.cov", header=T)

Now merge the files

pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))

We can then calculate the null model (model with PRS) using a linear regression

(as height is quantitative)

null.model <- glm(Trait1~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])

And the R2 of the null model is

null.r2 <- summary(null.model)r.squared prs.result <- NULL for(i in p.threshold){ # Go through each p-value threshold prs <- read.table(paste0("new.",i,".profile"), header=T) # Merge the prs with the phenotype matrix # We only want the FID, IID and PRS from the PRS file, therefore we only select the # relevant columns pheno.prs <- merge(pheno, prs[,c("FID","IID", "SCORE")], by=c("FID", "IID")) # Now perform a linear regression on Height with PRS and the covariates # ignoring the FID and IID from our model model <- glm(Trait1~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")]) # model R2 is obtained as model.r2 <- summary(model)r.squared
# R2 of PRS is simply calculated as the model R2 minus the null R2
prs.r2 <- model.r2-null.r2
# We can also obtain the coeffcient and p-value of association of PRS as follow
prs.coef <- summary(model)$coeff["SCORE",]
prs.beta <- as.numeric(prs.coef[1])
prs.se <- as.numeric(prs.coef[2])
prs.p <- as.numeric(prs.coef[4])
# We can then store the results
prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta,SE=prs.se))
}

Best result is:

prs.result[which.max(prs.result$R2),]
q() # exit R

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 216,744评论 6 502
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,505评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 163,105评论 0 353
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,242评论 1 292
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,269评论 6 389
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,215评论 1 299
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,096评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,939评论 0 274
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,354评论 1 311
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,573评论 2 333
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,745评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,448评论 5 344
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,048评论 3 327
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,683评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,838评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,776评论 2 369
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,652评论 2 354

推荐阅读更多精彩内容