一、孟德尔随机化(Mendelian Randomization
, MR
)的原理
- 随机对照实验(RCT)一般被认为是明确因果关系的金标准。其要求研究对象随机分组,为减少偏倚,常采用单盲、双盲甚至三盲设计,然而,这却需要大量的人力物力,且因为伦理问题,对某些因素的研究几乎不可能。
- 单盲:实验对象不知道自己属于哪一组。
- 双盲:实验对象和研究者均不知道组别分配。
三盲:实验对象、研究者及数据分析人员都不知道分组情况。
-
孟德尔第二定律认为,遗传物质独立分离,自由组合。即父母的基因在传递给子代时是随机分配的。因此,特定基因变异的分布是随机的,不会受到环境因素或行为习惯的影响。这种随机分布类似于随机对照试验中的随机分组,使其成为一种“自然实验”。
虽然在个体层面,一个人的遗传变异来自父母即他们的分配不完全随机,因为,如果其父母没有携带某种特定的突变,则该个体也不会携带这种突变。而如果扩大到群体水平,遗传变异在人群中的分布可以被视为随机分布。
二、工具变量(Instrument Variant
, IV
)的选择
- 独立性假设,即IV不通过其他混杂因素与结局(
outcome
)相关 - 关联性假设,即IV要与暴露因素(
exposure
)强相关 - 排他性假设,即IV不能直接与结局(
outcome
)存在相关性。
因果推断:通过分析工具变量与结果变量之间的关联来推断暴露因素对结果变量的因果关系。常用的方法包括双阶段最小二乘法(2SLS)或加权中值法等。
三、什么情况下可以选择做孟德尔随机化 ?
- 不确定traits之间的因果关系,比如,到底是抑郁导致肺癌还是肺癌导致抑郁?
- 研究的样本量足够大;
- 暴露和结局存在假设的因果关系。
四、那么,什么情况不能做孟德尔随机化呢?
!!! 仅代表个人意见
- 潜在的基因-环境交互效应显著:trait的遗传力过小,即
GWAS
显著结果(p < 5e-8)可能代表不了这个trait时。
五、孟德尔随机化的应用
这里介绍两种孟德尔随机化应用的工具(杨剑老师lab),GSMR
和SMR
1、GSMR(Generalised Summary-data-based Mendelian Randomisation)
性状水平的MR
,确定两个大性状之间的因果关系,如确定吸烟和肺癌之间的因果关系。
与其他的MR
方法利用一个IV
进行分析不同,GSMR
利用与exposure
显著相关的多个IV
进行分析,大大提升了MR
分析的power
和准确性。
If there are multiple independent (or nearly independent) SNPs associated with x and the effect of x on y is causal, then all the x-associated SNPs will have an effect on y through x. (Zhu et al. Nat Commun)
GSMR
已整合到GCTA
软件中,操作十分便利
gcta --bfile ${REFGeno} \
--gsmr-file exposure.txt outcome.txt \
--gsmr-direction 2 \
--gwas-thresh 5e-08 \
--clump-r2 0.05 \
--gsmr2-beta \
--gsmr-snp-min 10 \
--heidi-thresh 0.01 \
--effect-plot \
--out meta_gsmr \
--thread-num 10
# --bfile plink格式的基因型文件,可用千人基因组参考文件
# --gsmr-file 需要两个list.txt文件,前面为exposure,后面为outcome,格式为BMI BMI.gwas.ma
# --gsmr-direction 2 选项有 0 1 2,分别为正向,反向及双向GSMR
# --gsmr-snp-min 10 只有工具变量达到规定数量才进行GSMR
# --heidi-thresh 0.01 进行herdi test,去除LD SNP
BMI.gwas.ma格式如下,header名称无所谓
SNP A1 A2 freq b se p n
rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850
rs1002 C G 0.03606 0.0034 0.0115 0.7659 129799
rs1003 A C 0.5128 0.045 0.038 0.2319 129830
......
最后,筛选有显著因果关系的trait
#! /public/home/shilulu/anaconda3/envs/R4.2.0/bin/Rscript
library(data.table)
library(dplyr)
meta_gsmr <- fread("meta_gsmr.gsmr")
meta_gsmr[meta_gsmr == "nan"] <- NA
# p < 0.05 / row number / 2 for 2 dirction gsmr
# remove row contained "nan" (all)
final <- meta_gsmr %>%
filter(p < 0.05 / (2*nrow(meta_gsmr))) %>%
filter(complete.cases(.))
fwrite(final, file="meta_gsmr_noNA_anno.gsmr", sep="\t", na="nan", quote=FALSE)
2、SMR(Summary-data-based Mendelian Randomisation)
分子水平的MR
,确定与性状有因果关系的分子标记,如(eQTL
, mQTL
, pQTL
, apaQTL
, vQTL
等)
SMR
通过将分子QTL作为exposure
,复杂性状作为outcome
,来研究分子QTL与复杂性状之间的关联,使用HEIDI
进行结果检验(即从causal variant中去除Linkage
的,但无法辨别Pleiotropy
的情况)。
一、run SMR between QTL and traits
- running smr, QTL with gwas trait
smr --bfile ${bfile} \
--beqtl-summary ${eQTL} \
--gwas-summary ${gwas} \
--diff-freq-prop 0.5 \
--maf 0.01 \
--cis-wind 2000 \
--out whole_blood \
--thread-num 4 > eqtlsmr.log 2>&1
- filter result,i.e. exclude the probe that
p_smr > 0.05 / n test
和p_heidi < 0.01
library(dplyr)
args <- commandArgs(TRUE)
infile <- args[1]
outname <- args[2]
smr = read.table(infile, sep="\t", header=TRUE)
# retain p_SMR <= 0.05 / ntest and p_HEIDI >= 0.01
filter_out = filter(smr, p_SMR <= 0.05/nrow(smr) & p_HEIDI >= 0.01)
write.table(filter_out, file=paste0(outname, "_", sprintf("%.2e", 0.05/nrow(smr)), "_pass_smr_0.01_pass_heidi.txt"),
sep="\t", row.names=FALSE, quote=FALSE)
- generate plot file and plot the probe you interest
smr --bfile ${bfile} \
--gwas-summary ${gwas} \
--beqtl-summary ${eQTL} \
--out plot \
--plot \
--probe ENSG00000117601 \
--probe-wind 2000 Kb \
--gene-list ${Genelist} \
--thread-num 10
# --probe ENSG00000117601 通过p_smr 和 p_heidi的QTL probe
# --gene-list smr官网有提供
plot_SMR_image.r
SMR | Yang Lab
source("~/plot_smr/plot_SMR_image.r")
# Read the data file in R:
SMRData = ReadSMRData("ENSG00000117601.txt")
# Plot the SMR results in a genomic region centred around a probe:
pdf('ENSG00000117601.pdf',width = 10,height = 8)
SMRLocusPlot(data=SMRData, smr_thresh=7.98e-6, heidi_thresh=0.05, plotWindow=200, max_anno_probe=16)
dev.off()
# smr_thresh: genome-wide significance level for the SMR test.
# heidi_thresh: threshold for the HEIDI test. The default value is 0.05.
# cis_wind: size of a window centred around the probe to select cis-eQTLs for plot. The default value is 2000Kb.
# max_anno_probe: maximum number of probe names to be displayed on the figure. The default value is 16.
自用SMR
脚本,包括过滤,分python
和R
版本
run_SMR_andfilter.py
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
'''
@File : run_SMR_andfilter.py
@Time : 2024/10/23 20:21:31
@Author : Lulu Shi
@Mails : crazzy_rabbit@163.com
@link : https://github.com/Crazzy-Rabbit
'''
import os
import argparse
import pandas as pd
def run_smr_qtl(gwas, qtl, outname):
SMR = "/public/home/lijunpeng/smr-1.3.1-linux-x86_64/smr-1.3.1"
bfile = "/public/home/lijunpeng/smr-1.3.1-linux-x86_64/g1000_eur/g1000_eur"
os.system(f'{SMR} --bfile {bfile} \
--beqtl-summary {qtl} \
--gwas-summary {gwas} \
--diff-freq-prop 0.5 \
--maf 0.01 \
--cis-wind 2000 \
--out {outname} \
--thread-num 10 > {outname}.log')
def run_filter(smr, outname):
smr_file = pa.read_csv(smr, sep="\t")
smr_tsh = "{:.2e}".format(0.05 / len(smr_file))
flt_file = smr_file[(smr_file['p_SMR'] <= 0.05 / len(smr_file)) & (smr_file['p_HEIDI'] >= 0.01)]
flt_file.to_csv(f"{outname}_{smr_tsh}smr_0.01heidi.txt", sep="\t", index=False, quoting=False)
def main():
parser = argparse.ArgumentParser(description="Run SMR and filter results.")
parser.add_argument("--func", type=str, default="both", help="Function to run: 'run_smr_qtl', 'run_filter', or 'both'")
parser.add_argument("--gwas", type=str, help="GWAS summary data to run smr")
parser.add_argument("--qtl", type=str, help="QTL file to run smr")
parser.add_argument("--smr", type=str, help="SMR file for filtering")
parser.add_argument("--out", type=str, help="Output prefix")
args = parser.parse_args()
gwas = args.gwas
qtl = args.qtl
out = args.out
func = args.func
smr = args.smr
if func == "run_smr_qtl":
if not gwas or not smr:
raise ValueError("When func is 'run_smr_qtl', --gwas and --qtl must be provided.")
run_smr_qtl(gwas, qtl, out)
elif func == "run_filter":
if not smr:
raise ValueError("When func is 'run_filter', --smr must be provided.")
run_filter(smr, out)
else:
if not gwas or not smr:
raise ValueError("When func is 'both' or not provided, --gwas and --qtl must be provided.")
run_smr_qtl(gwas, qtl, out)
run_filter(f"{out}.smr", out)
if __name__ == '__main__':
main()
run_SMRinR_filter.r
用法:run SMR and filter
Rscript run_SMRinR_filter.r --func run_smr_qtl --gwas ${gwas} --qtl ${mQTL} --out ${outdir}/${outprx}
run filter
Rscript run_SMRinR_filter.r --func run_filter --smr ${smr} --out ${outprx}
#***********************************************#
# File : run_SMRinR_filter.r #
# Time : 2024/10/23 20:50:51 #
# Author : Lulu Shi #
# Mails : crazzy_rabbit@163.com #
# link : https://github.com/Crazzy-Rabbit #
#***********************************************#
library(dplyr)
library(data.table)
# r function run smr qtl to trait
run_smr_qtl <- function(gwas, qtl, outname){
SMR="/public/home/lijunpeng/smr-1.3.1-linux-x86_64/smr-1.3.1"
bfile="/public/home/lijunpeng/smr-1.3.1-linux-x86_64/g1000_eur/g1000_eur"
system(paste(SMR, "--bfile", bfile,
"--beqtl-summary", qtl,
"--gwas-summary", gwas,
"--diff-freq-prop 0.5",
"--maf 0.01",
"--cis-wind 2000",
"--out", outname,
"--thread-num 10"), intern=TRUE) -> mylog
writeLines(mylog, paste0(outname, ".log"))
}
# p_SMR & p_HEIDI filter
run_filter <- function(smr, outname){
smr_file = fread(smr)
smr_tsh = sprintf("%.2e", 0.05/nrow(smr_file))
flt_file = smr_file %>% filter(p_SMR <= 0.05/n() & p_HEIDI >= 0.01)
write.table(flt_file, file=paste0(outname, "_", smr_tsh, "smr_0.01heidi.txt"), sep="\t", row.names=FALSE, quote=FALSE)
}
# args <- commandArgs(TRUE)
# gwas <- args[which(args == "--gwas") + 1]
# qtl <- args[which(args == "--qtl") + 1]
# out <- args[which(args == "--out") + 1]
# func <- ifelse("--func" %in% args, args[which(args == "--func") + 1], "all")
library(optparse)
option_list <- list(
make_option(c("--func", type="character"), default="both", help="Function to run: 'run_smr_qtl', 'run_filter', or 'both'"),
make_option(c("--gwas"), type="character", help="GWAS summary data to run smr"),
make_option(c("--qtl"), type="character", help="qtl file to run smr"),
make_option(c("--out"), type="character", help="out prefix"),
make_option(c("--smr"), type="character", help="SMR file for filtering")
)
opt_parser <- OptionParser(option_list=option_list)
opts <- parse_args(opt_parser)
gwas <- opts$gwas
qtl <- opts$qtl
out <- opts$out
func <- opts$func
smr <- opts$smr
if (func == "run_smr_qtl"){
if (is.null(gwas) || is.null(qtl)) {
stop("When func is 'run_smr_qtl', --gwas and --eqtl must be provided.")
}
run_smr_qtl(gwas, qtl, out)
} else if (func == "run_filter"){
if (is.null(smr)) {
stop("When func is 'run_filter', --smr must be provided")
}
run_filter(smr, out)
} else{
if (is.null(gwas) || is.null(qtl)) {
stop("When func is not provided, --gwas and --qtl must be provided.")
}
run_smr_qtl(gwas, qtl, out)
run_filter(paste0(out, ".smr"), out)
}
二、run multi-omics SMR
即三步SMR分析,前两步和上述1中的脚本一样,只是换了QTL文件 \
1 SMR mQTL and trait
2 SMR eQTL and trait
3 SMR between mQTL and eQTL
First, we map the methylome to the transcriptome in cis-regions by testing the associations of DNAm with their neighbouring genes (within 2 Mb of each DNAm probe) using the top associated mQTL as the instrumental variable (Methods).
Next, we prioritize the trait-associated genes by testing the associations of transcripts with the phenotype using the top associated eQTL.
Last, we prioritize the trait-associated DNAm sites by testing associations of DNAm sites with the phenotype using the top associated mQTL.
If the association signals are significant in all three steps, then we predict with strong confidence that the identified DNAm sites and target genes are functionally relevant to the trait through the genetic regulation of gene expression at the DNAm sites. The SMR & HEIDI method assumes consistent LD between two samples. This assumption is generally satisfied by using data from samples of the same ancestry. (Wu et al. Nat Commun)
SMR between mQTL and eQTL 稍微有点不同
smr --bfile ${bfile} \
--beqtl-summary ${mQTL} \
--beqtl-summary ${bqtl} \
--extract-exposure-probe mqtl_probeID.txt \
--extract-outcome-probe eqtl_probeID.txt \
--diff-freq-prop 0.5 \
--thread-num 10 \
--out ${outdir}/m2eqtl > ${outdir}/m2eqtl.log 2>&1
当然,经过后来的优化,这种三步式SMR
已被集成为了软件OPERA (Wu et al. Cell Genom)
Generate multi omics plot file using OPERA
$opera --bfile $bfile \
--gwas-summary $gwas \
--beqtl-summary $eQTL \
--besd-flist mqtl_list \
--plot \
--probe ENSG00000072778 \
--probe-wind 2000 \
--gene-list $Genelist \
--out my_plot
plot the probe you interest
source("~/plot_smr/plot_OmicsSMR_xQTL.r")
SMRData = ReadomicSMRData("my_plot.ENSG00000072778.txt")
pdf("test_opera.pdf", height=12, width=10)
omicSMRLocusPlot(data=SMRData, esmr_thresh=3.19e-06, msmr_thresh=5.37e-07, m2esmr_thresh=5.75e-04,
window=300, anno_methyl=TRUE, highlight=NULL, annoSig_only=TRUE,
eprobeNEARBY="ENSG00000072778", mprobeNEARBY=c("cg03613822", "cg12805420", "cg19466160"),
epi_plot=TRUE, funcAnnoFile="~/plot_smr/funcAnno.RData")
dev.off()
# epi_plot=TRUE if you want plot epigenome
# anno_methyl=TRUE if you want the mQTL can annote in plot
where funcAnno.RData and plot_OmicsSMR_xQTL.r has been provided