AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model

title.png

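About the author: Žiga Avsec, Ph.D.
His move from physics to computational genomics marks an important step in bringing artificial intelligence and genetic research together. He moved from Slovenia to Munich, where he studied DNA under the supervision of Julien Gagneur and contributed to tools such as Kipoi and BPNet that have advanced our understanding of genomics.
At Google DeepMind, Žiga's work on Enformer and AlphaMissense is breaking new ground in identifying genetic variants and in the fight against genetic disease. His story offers a glimpse of the future of healthcare, in which AI-driven genomic discoveries reshape personalized medicine and disease treatment.
For a more detailed profile, see: https://blog.superbio.ai/superbio-scientist-spotlight-%C5%BEiga-avsec-ph-d-2225dacc2b9b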

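1. Background

With the rise of large language models, the transformer has given us a better tool for deciphering the genomic code. Earlier approaches based on k-mers and short sequences are gradually being replaced, and long-sequence deep learning models can learn more genomic information from longer stretches of DNA: modeling enhancers that interact with promoters across long distances; judging whether a single-base mutation disrupts a key regulatory site; and tracing the effect of one variant across all relevant layers to reconstruct the full causal chain of pathogenesis.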

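2. Abstract

Goal
Develop a deep learning model that predicts functional genomics measurements (for example gene expression and chromatin accessibility) from DNA sequence, in order to decode the gene regulatory code.
Existing problem
Current models face a key trade-off: they either handle long input sequences at low prediction resolution, or predict at high resolution but only from short sequence segments. This limits both the number of functional modalities (data types) they can predict and their predictive performance.
Proposed solution: AlphaGenome
AlphaGenome addresses this length-resolution trade-off. It takes DNA input of up to 1 megabase (1 Mb), roughly 1/3000 of the human genome, which covers a much broader regulatory context (such as distal enhancers and topologically associating domain boundaries). From this long input it predicts thousands of functional genomics tracks at up to single-base-pair resolution.
The predictions cover a highly diverse set of functional modalities, including:
1. Gene expression levels
2. Transcription start sites
3. Chromatin accessibility (e.g. ATAC-seq)
4. Histone modifications (e.g. H3K27ac, H3K4me3)
5. Transcription factor binding sites
6. 3D chromatin conformation (chromatin contact maps, e.g. Hi-C)
7. Splice site usage
8. Splice junction coordinates and their strength
Model training and performance evaluation:
Training data: the model is trained on human and mouse genomic data.
Evaluation: the main test is how well the model predicts the effects of genetic variants (such as SNPs), a key check on whether it has actually learned the sequence-to-function relationship.
Results: across 26 independent comparisons against the strongest available external models (such as Enformer and Basenji2), AlphaGenome matched or exceeded them on 24.
Key applications and value:
Multimodal variant effect scoring: AlphaGenome's core strength is that it predicts the effect of a variant (such as a pathogenic SNP) on all of the thousands of functional tracks above at once.
Revealing disease mechanisms: for clinically relevant variants near the TAL1 oncogene, AlphaGenome accurately recapitulates the full mechanism by which the variant acts across several functional layers (for example altering a transcription factor binding site, changing chromatin state, and in turn affecting gene expression). This gives an integrated view of the genetic basis of complex disease.
Availability:
Tool release: to encourage wider use, the authors provide tools that let users run AlphaGenome for genome track prediction and variant effect scoring.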

3. Model construction

3.1 Data preparation

Genome data

Input sequences were extracted from the hg38 (human) and mm10 (mouse) reference genomes. For sequence intervals that extended beyond chromosomal boundaries, padding with ‘N’ characters was used to ensure consistent input length.
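The padding rule above can be sketched in a few lines. This is only an illustration: `extract_padded` and `one_hot` are hypothetical helper names, and encoding 'N' as an all-zero row is an assumption, since the text does not say how padded positions are encoded.

```python
import numpy as np

def extract_padded(chrom_seq: str, start: int, end: int) -> str:
    """Return the interval [start, end) of chrom_seq, padding with 'N'
    wherever the interval runs past either end of the chromosome."""
    if start > end:
        raise ValueError("start must not exceed end")
    n = len(chrom_seq)
    left_pad = max(0, min(end, 0) - start)    # bases requested before position 0
    right_pad = max(0, end - max(start, n))   # bases requested past the last base
    core = chrom_seq[min(max(start, 0), n):min(max(end, 0), n)]
    return "N" * left_pad + core + "N" * right_pad

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode A/C/G/T as shape (len(seq), 4).
    Assumption: 'N' (or any other symbol) becomes an all-zero row."""
    codes = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for j, base in enumerate(b"ACGT"):
        out[codes == base, j] = 1.0
    return out

toy_chrom = "ACGTACGTAC"
window = extract_padded(toy_chrom, start=-3, end=5)
print(window)                       # NNNACGTA
print(one_hot(window).sum(axis=1))  # padded positions sum to 0, real bases to 1
```

Every window has the same length (`end - start`) whether or not it overlaps a chromosome end, which is what a fixed-size model input needs.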

Tracks details

Category	Data type (source)	Human	Mouse
All tracks		5930	1128
Gene expression	RNA-seq (ENCODE and GTEx)	667	173
Gene expression	CAGE (FANTOM5)	546	188
Gene expression	PRO-cap (ENCODE)	12	0
Detailed splicing patterns	Splice sites (ENCODE and GTEx, realigned using STAR)	4	4
Detailed splicing patterns	Splice site usage (computed by formula)	734	180
Detailed splicing patterns	Splice junctions (splicemap package)	734	180
Chromatin state	DNase (ENCODE)	305	67
Chromatin state	ATAC-seq (ENCODE)	167	18
Chromatin state	Histone modifications (ENCODE)	1116	183
Chromatin state	TF binding (ENCODE)	1617	127
Chromatin contact maps	Hi-C / micro-C (4D Nucleome)	28	8
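As a quick arithmetic check, the per-assay counts in the table can be summed to confirm that they add up to the stated totals of 5930 human and 1128 mouse tracks. The dictionary below simply restates the table; the variable names are illustrative.

```python
# (human, mouse) track counts per data type, copied from the table above
tracks = {
    "RNA-seq": (667, 173),
    "CAGE": (546, 188),
    "PRO-cap": (12, 0),
    "Splice sites": (4, 4),
    "Splice site usage": (734, 180),
    "Splice junctions": (734, 180),
    "DNase": (305, 67),
    "ATAC-seq": (167, 18),
    "Histone modifications": (1116, 183),
    "TF binding": (1617, 127),
    "Contact maps": (28, 8),
}

human_total = sum(h for h, _ in tracks.values())
mouse_total = sum(m for _, m in tracks.values())
print(len(tracks), human_total, mouse_total)  # 11 5930 1128
```

The 11 rows also match the 11 data types that the output heads are described as covering.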

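3.2 Model construction

model1.jpg

3.2.1 Model architecture (Fig. 1a)

Core design: U-Net-style hierarchical processing

①. Input processing:

Sequence input: a 1 Mb DNA sequence (about 1,000,000 bp)
Organism identifier: distinguishes the human and mouse genomes
Parallel computation: the 1 Mb sequence is split into 131 kb chunks that are processed across multiple compute devices (GPU/TPU)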
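The chunking arithmetic can be made concrete with a small sketch. It assumes that a "131 kb" chunk is exactly 2^17 = 131,072 bp and that there are eight chunks, which would make the "1 Mb" input 2^20 = 1,048,576 bp; both numbers are assumptions inferred from the sizes quoted above, and `split_into_chunks` is a hypothetical helper.

```python
CHUNK_LEN = 2 ** 17  # 131,072 bp: assumed exact size of a "131 kb" chunk

def split_into_chunks(seq: str, chunk_len: int = CHUNK_LEN) -> list[str]:
    """Split a sequence into equal, non-overlapping chunks, one per device."""
    if len(seq) % chunk_len != 0:
        raise ValueError("sequence length must be a multiple of chunk_len")
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]

seq = "A" * (8 * CHUNK_LEN)      # placeholder sequence of the assumed input length
chunks = split_into_chunks(seq)
print(len(seq), len(chunks))     # 1048576 8
```

Splitting alone would leave each device blind to the other chunks, which is why the attention stage has to communicate across devices to see the whole input.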

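②. Three-stage processing pipeline:

Stage	Function	Key techniques
Encoder	Downsamples and compresses the sequence; extracts local features (such as transcription factor binding sites)	Convolutional layers (capture motif features) + pooling (downsampling)
Transformer	Models long-range dependencies: distal enhancer-promoter interactions, chromatin domain structure	Attention with cross-device communication (covers the full 1 Mb context)
Decoder	Upsamples the sequence back to high-resolution output	Convolutional blocks with upsampling + skip connections (preserve detail)

③. Task-specific output heads:

Multi-task adaptation: heads attached to the end of the decoder produce predictions for 11 experimental data types
Per-modality resolution: the output resolution is set independently for each data type (for example single base or 128 bp bins)
Prediction scale: outputs 5,930 human tracks or 1,128 mouse tracks simultaneously

Technical significance: the U-Net structure resolves the tension between long input and high resolution. The encoder extracts abstract features, the Transformer models global interactions, and the decoder restores positional detail.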
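To make the encoder/decoder shape flow concrete, here is a toy U-Net skeleton in NumPy. It is a sketch under stated assumptions, not the published architecture: it uses seven stages of factor-2 max pooling (so 1 bp becomes 128 bp), leaves out all convolutions and the Transformer, keeps the channel count fixed, and merges skip connections by simple addition. All function names are hypothetical.

```python
import numpy as np

def max_pool2(x: np.ndarray) -> np.ndarray:
    """Halve the sequence axis by taking the max over adjacent pairs. x: (length, channels)."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1]).max(axis=1)

def upsample2(x: np.ndarray) -> np.ndarray:
    """Double the sequence axis by repeating every position twice."""
    return np.repeat(x, 2, axis=0)

def toy_unet(x: np.ndarray, n_stages: int = 7):
    """Downsample n_stages times, then upsample back, adding the stored skips."""
    skips = []
    for _ in range(n_stages):
        skips.append(x)            # keep the higher-resolution activations
        x = max_pool2(x)
    bottleneck_len = x.shape[0]    # a real model would run its Transformer here
    for skip in reversed(skips):
        x = upsample2(x) + skip    # skip connection restores fine detail
    return x, bottleneck_len

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 4))     # toy 1,024 bp input with 4 channels
y, bottleneck_len = toy_unet(x)
print(y.shape, bottleneck_len)     # (1024, 4) 8
```

With the same factor of 128, a 2^20 bp input would reach the bottleneck at 8,192 positions, which is what makes attention over the whole window tractable.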

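3.2.2 Training strategy (Fig. 1b-c)

Stage ①: teacher model training (Fig. 1b)

Data preparation:
Sampled regions: 1 Mb intervals drawn from the cross-validation folds of the human and mouse genomes
Data augmentation: random shifts (so the model does not depend on the exact position of regulatory elements) and reverse complementing (for strand-orientation invariance)
Training objective:
Directly predict experimentally measured functional genomic signal (such as ChIP-seq signal and RNA expression)
Two kinds of teacher model are produced:
Fold-specific: a specialist model trained on a single fold's training data
All-folds: a general model trained on all data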
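The two augmentations can be sketched as follows. This is a minimal illustration with hypothetical names (`sample_augmented`, `max_shift`) and an arbitrary shift range; it treats the targets as unstranded per-base tracks, so strand-specific tracks, which would also need their strand channels swapped, are out of scope.

```python
import numpy as np

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def sample_augmented(genome, targets, start, length, rng, max_shift=16):
    """Cut a window near `start` with a random shift, then reverse-complement
    it with probability 0.5. targets: (len(genome), n_tracks), aligned to genome."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    s = min(max(start + shift, 0), len(genome) - length)  # stay inside the genome
    seq, tgt = genome[s:s + length], targets[s:s + length]
    if rng.random() < 0.5:
        # Flip the targets along the position axis so they stay aligned.
        seq, tgt = reverse_complement(seq), tgt[::-1]
    return seq, tgt

rng = np.random.default_rng(0)
genome = "".join(rng.choice(list("ACGT"), size=500))
targets = rng.random((500, 2))
seq, tgt = sample_augmented(genome, targets, start=200, length=64, rng=rng)
print(len(seq), tgt.shape)  # 64 (64, 2)
```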

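Stage ②: student model distillation (Fig. 1c)

Knowledge distillation procedure:
Frozen teacher: the parameters of the all-folds teacher model are fixed
Student input: the original sequence with random mutations introduced (mimicking natural variation)
Learning target: the student reproduces the teacher's predictions for the perturbed sequence
Key advantages:
Specialization for variant prediction: the student concentrates on the mapping from sequence changes to functional changes
Lightweight model: a single, efficient inference model (avoiding the cost of ensembling several teachers)

Biological significance: the teacher-student framework distills the ability to predict function into the ability to predict variant effects, which should improve accuracy in clinical applications.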
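The distillation loop can be illustrated with a toy example. Everything here is an assumption made for the sketch: the teacher and student are tiny linear maps rather than deep networks, the loss is mean squared error, and the 4% mutation rate is arbitrary. The point is only the control flow: mutate the input, query the frozen teacher, and move the student toward the teacher's output.

```python
import numpy as np

def mutate(onehot: np.ndarray, rng, rate: float = 0.04) -> np.ndarray:
    """Substitute a random fraction of positions with a different base."""
    x = onehot.copy()
    for p in np.flatnonzero(rng.random(x.shape[0]) < rate):
        old = int(x[p].argmax())
        new = (old + int(rng.integers(1, 4))) % 4   # always differs from old
        x[p] = 0.0
        x[p, new] = 1.0
    return x

rng = np.random.default_rng(0)
seq = np.eye(4, dtype=np.float32)[rng.integers(0, 4, size=512)]  # (512, 4) one-hot
w_teacher = rng.normal(size=(4, 3))   # frozen toy teacher with 3 output tracks
w_student = np.zeros((4, 3))          # student starts untrained

def distill_loss(w: np.ndarray, x: np.ndarray) -> float:
    """Mean squared error between student and teacher predictions on x."""
    return float(np.mean((x @ w - x @ w_teacher) ** 2))

initial_loss = distill_loss(w_student, seq)
for _ in range(200):
    x = mutate(seq, rng)                               # perturbed input
    err = x @ w_student - x @ w_teacher                # teacher is never updated
    w_student -= (2.0 / err.size) * (x.T @ err)        # gradient step on the MSE
final_loss = distill_loss(w_student, seq)
print(final_loss < initial_loss)
```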

3.2.3 Performance evaluation (Fig. 1d-e)

①. Genome track prediction performance (Fig. 1d)

Metric:
Relative improvement (%) $= \frac{s_{\text{AlphaGenome}} - s_{\text{best baseline}}}{s_{\text{random classifier}}}$ (classification tasks require normalization)
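Key results:

Modality	Representative task	Improvement	Technical significance
Transcriptional regulation	RNA expression prediction	Significant	Captures long-range enhancer interactions
Chromatin conformation	Hi-C contact map prediction	Largest	Models 3D structure at the 1 Mb scale
Epigenetics	H3K27ac histone modification prediction	Moderate	Identifies active regulatory regions
RNA processing	Polyadenylation (PA) site identification	Significant	Precisely locates post-transcriptional regulatory sites

Note: improvements on 128 bp resolution tasks are generally smaller than on single-base tasks, because baseline models already perform well at that resolution.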

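②. Variant effect prediction performance (Fig. 1e)

Evaluation settings:
Functional variants: predicting the effect of non-coding SNPs on molecular phenotypes
Causal inference: assessing the causality and effect direction of quantitative trait loci (ds/caQTL)
Main advances:
Matches or exceeds baselines on 24 of 26 tasks, spanning chromatin accessibility (ATAC), transcription factor binding (ChIP), gene expression (eQTL) and others
Causal direction: accuracy in judging whether a variant changes a molecular phenotype improves by 15-25%

Supporting case: the mechanism of clinical variants near the TAL1 oncogene (coordinated multimodal predictions trace the chain: non-coding variant → new MYB binding motif → gain of active chromatin marks → increased TAL1 expression)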

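3.2.4 Summary of technical advances

Dimension	Innovation	Core problem addressed
Architecture	U-Net + cross-device Transformer	Combining 1 Mb input with single-base resolution
Training strategy	Two-stage teacher-student distillation	Optimizing specifically for variant effect prediction
Multimodal output	11 data types / thousands of tracks predicted in parallel	Systematic dissection of how variants cause disease
Engineering	Parallel computation over 131 kb chunks	Fitting megabase inputs within per-device memory limits
Evaluation	26 rigorous tests (including reproduction of a clinical variant mechanism)	Demonstrating generality across basic research and clinical applications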

3.3 AlphaGenome model architecture

model2.jpg

Extended Data Figure 1 | AlphaGenome model architecture. (a) Overview schematic illustrating the flow of activations through the model. The architecture follows a U-Net-like structure with an Encoder, a central Transformer Tower, and a Decoder processing a 1Mb DNA input sequence. The Encoder uses convolutional blocks and max pooling to progressively downsample the sequence resolution (from 1 bp to 128 bp) while increasing feature channels. The Transformer Tower operates at 128 bp resolution, iteratively refining sequence representations and generating pairwise (2D) representations. The Decoder uses convolutional blocks and upsampling, incorporating skip connections (dashed lines) from corresponding Encoder stages, to restore sequence resolution up to 1 bp. An Output Embedder performs final processing before feeding representations to task-specific output heads. (b) Internal structure of key component blocks used repeatedly within the architecture overview shown in (a). Diagrams detail the layers within the convolutional blocks (Conv block, Upres block), the Transformer blocks, and the blocks responsible for generating and updating pairwise representations (Pair update block, Sequence to pair block). Tensor shapes are shown excluding the batch dimension. Abbreviations: r = log-resolution, c = channels.

4. Results

Here I go through the two results I find most interesting.

4.1 AlphaGenome enables state-of-the-art enhancer-gene linking

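AlphaGenome needs no task-specific training for enhancer-promoter (E-P) linking (that is, it is used zero-shot). Its Transformer module uses self-attention to pick up long-range regulatory dependencies in the sequence. For example:

Enhancer-specific transcription factor binding motifs (such as MYB and CTCF) are captured by the local convolutional layers;
The Transformer relates these local signals to distal promoters, forming hypotheses about functional links.
Zero-shot performance rivals supervised models
When the enhancer is more than 10 kb from the TSS, AlphaGenome clearly outperforms Borzoi (relative auPRC gain of 17-25%);
Compared with the ENCODE-rE2G-extended model, which is trained specifically for E-P linking, the gap is under 1% auPRC.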

restlt1.jpg

Figure 4 | AlphaGenome predicts the effect of variants on gene expression. (j) Enhancer-gene linking performance (ENCODE-rE2G CRISPRi dataset [17]). Zero-shot evaluation: Performance (auPRC) comparison stratified by enhancer-TSS distance for AlphaGenome (distilled) vs Borzoi vs TSS distance baseline. Supervised evaluation: AlphaGenome input gradient score integrated into ENCODE-rE2G-extended vs ENCODE-rE2G models.
Extended Data Figure 7 | AlphaGenome improves enhancer-gene linking using input gradients and shows enhanced sensitivity to distal enhancers. (b) Impact of incorporating AlphaGenome’s input gradient score as a feature in the ENCODE-rE2G extended logistic regression model, evaluated on the ENCODE-rE2G benchmark. ENCODE-rE2G is a logistic regression model trained to predict enhancer-gene interactions from features [2]. Precision-recall curves are shown, colored by the feature sets used for training the regression model (auPRC values indicated in the legend). Feature sets are:
• rE2G extended with AlphaGenome features: All ENCODE-rE2G extended model features plus a single AlphaGenome input x gradient score.
• AlphaGenome features only: The AlphaGenome input x gradient score alone.
• TSS distance with AlphaGenome features: AlphaGenome input x gradient score plus the distance to TSS feature.
• rE2G extended: All features from the ENCODE-rE2G extended model [2].
• TSS distance: Distance to TSS feature from [2].
• ABC features only: Subset of 'rE2G extended', with only features related to the Activity-By-Contact (ABC) model [2].
(c) Precision-recall curves for the ENCODE-rE2G benchmark, similar to panel (b), evaluating the ENCODE-rE2G extended regression model with different feature sets. Area under the precision-recall curve (auPRC) values for the different feature sets are indicated in the legend. In this configuration, ‘AlphaGenome features’ consist of a more comprehensive set of K562 cell line-specific variant effect scores. These include Allele-Specific Activity Scores (AAS) and variant effect scores calculated as the difference between alternate (ALT) and reference (REF) allele predictions (ALT-REF Diff scores). These scores were derived from AlphaGenome for the following genomic assays:
• RNA-seq of the target gene
• ChIP-TF EP300
• ChIP-Histone H3K27ac
• CAGE
• PRO-cap
• H1-ESC contact maps
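The "input x gradient" score mentioned in the captions can be illustrated on a toy model. This is a sketch under loud assumptions: the model is a single tanh unit whose gradient is written out by hand, not AlphaGenome, and summing absolute attributions over the enhancer interval is one plausible way to get a per-enhancer score, not necessarily the aggregation the authors use. All names are hypothetical.

```python
import numpy as np

def toy_expression(x: np.ndarray, w: np.ndarray) -> float:
    """Toy stand-in for a model: scalar 'expression' from one-hot x of shape (L, 4)."""
    return float(np.tanh(np.sum(w * x)))

def input_x_gradient(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Per-position attribution: input multiplied by d(output)/d(input)."""
    y = toy_expression(x, w)
    grad = (1.0 - y ** 2) * w          # derivative of tanh(sum(w * x)) w.r.t. x
    return (x * grad).sum(axis=1)      # one value per sequence position

def region_score(attr: np.ndarray, start: int, end: int) -> float:
    """Assumed aggregation: total absolute attribution inside [start, end)."""
    return float(np.abs(attr[start:end]).sum())

rng = np.random.default_rng(0)
L = 200
x = np.eye(4)[rng.integers(0, 4, size=L)]
w = np.zeros((L, 4))
w[50:60] = rng.normal(scale=0.1, size=(10, 4))   # toy "enhancer" the output depends on

attr = input_x_gradient(x, w)
enhancer = region_score(attr, 50, 60)
background = region_score(attr, 120, 130)
print(enhancer > background)
```

In a linking setting, the scalar output would be the predicted expression of the candidate target gene, and each candidate enhancer interval would receive its own score, which can then be used on its own or as one feature in a supervised model.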

4.2 AlphaGenome improves on predicting variant effects on chromatin accessibility and transcription factor binding

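Two key problems are addressed:

QTL effect prediction:
Deciding whether a non-coding variant (such as a SNP) affects chromatin accessibility (caQTL), DNase sensitivity (dsQTL) or transcription factor binding (bQTL)
Quantifying the size of the variant's effect on these molecular phenotypes
MPRA activity prediction:
Predicting the regulatory activity of short DNA sequences (reporter gene expression level)
Working out how local sequence variants regulate gene expression through chromatin state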
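A variant effect score for an accessibility or binding track is typically some comparison of the ALT and REF predictions near the variant. The sketch below is a generic version of that idea and an assumption throughout: the window half-width of 250 bp, the +1 pseudocount, and the log2 ratio are illustrative choices, not the exact center-mask scorer defined in the paper's Methods.

```python
import numpy as np

def center_window_score(ref_track, alt_track, variant_pos, half_width=250):
    """log2 ratio of summed ALT vs REF signal in a window centered on the variant.
    Tracks are assumed to be non-negative per-base predictions."""
    lo = max(0, variant_pos - half_width)
    hi = min(len(ref_track), variant_pos + half_width + 1)
    ref_sum = float(np.sum(ref_track[lo:hi]))
    alt_sum = float(np.sum(alt_track[lo:hi]))
    return float(np.log2((1.0 + alt_sum) / (1.0 + ref_sum)))

# Toy predictions: the ALT allele opens chromatin around position 1000.
ref = np.ones(2001)
alt = ref.copy()
alt[950:1050] += 2.0

score = center_window_score(ref, alt, variant_pos=1000)
print(score > 0)                                    # ALT gains signal
print(center_window_score(ref, ref, 1000) == 0.0)   # identical tracks score 0
```

The sign of the score gives the predicted direction of the effect and its magnitude the predicted effect size, which are the two quantities compared against observed QTL effects.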

result2.png

Figure 5 | AlphaGenome accurately predicts variant effects on chromatin accessibility and SPI1 transcription factor binding. (a) Schematic of the center-mask variant scoring strategy. This approach, detailed in Methods, is used for accessibility (DNase-seq, ATAC-seq) and ChIP-seq predictions. (b) Performance comparison on QTL causality prediction. Average Precision (AP) for AlphaGenome, Borzoi, and ChromBPNet across QTL types (caQTL, dsQTL, bQTL) and ancestries. (c) Performance comparison on QTL effect size prediction. Pearson r is shown for AlphaGenome, Borzoi, and ChromBPNet across QTL types (caQTL, dsQTL, bQTL) and ancestries. (d) AlphaGenome’s predicted versus observed effect sizes for causal caQTLs (African ancestry). Scatterplot displays predictions using the DNase track for the GM12878 cell line. Signed Pearson r = 0.74; unsigned Pearson r = 0.45. Signed Pearson r correlation uses raw values; unsigned Pearson r uses absolute values. Red and blue circles highlight variants detailed in (e, f). (e) Example AlphaGenome predictions for selected caQTLs. Shown are ALT-REF differences in predicted DNase track (GM12878) around the variants highlighted in (d). (f) ISM-derived sequence logos for REF and ALT alleles of example caQTLs from (e). The examples suggest variant disruption or modulation of TF binding motifs. Putative binding factors and JASPAR [39] matrix IDs (MA0105.1, MA0105.3) are indicated on the right. (g) AlphaGenome’s predicted versus observed effect sizes for causal SPI1 bQTLs. Scatterplot displays predictions using the SPI1 ChIP-seq track for the GM12878 cell line. Signed Pearson r = 0.55; unsigned Pearson r = 0.12. Red and blue circles highlight variants detailed in (h, i). (h) Example AlphaGenome predictions for selected SPI1 bQTLs. Shown are ALT-REF differences in predicted SPI1 ChIP-TF track (GM12878) around the variants highlighted in (g). (i) ISM-derived sequence logos for REF and ALT alleles of example SPI1 bQTLs from (h). Examples indicate potential motif impacts such as creation or disruption of SPI1 or related motifs. Putative binding factors and JASPAR matrix IDs (MA0081.2, MA0080.5) are indicated on the right. (j) CAGI5 MPRA challenge performance (average across loci). Top: Average zero-shot Pearson r performance, using cell type-matched raw DNase model outputs. Middle: Average Pearson r from LASSO regression using cell type-matched or cell type-agnostic DNase outputs. Bottom: LASSO regression Pearson r performance using features from multiple modalities and the full set of cell types (DNase + RNA + ChIP-Histone output types for AlphaGenome and Borzoi; DNase + CAGE output types for Enformer).

result2supp.png

Supplementary Figure 9 | Additional accessibility variant analysis. Extended evaluation of variant effect prediction on chromatin accessibility across diverse contexts. AP = average precision (auPRC). Signed Pearson R correlation uses raw values; unsigned Pearson R uses absolute values first. (a) Precision-Recall curves comparing AlphaGenome, Borzoi, and ChromBPNet performance on caQTL causality prediction in European ancestry. (b) Scatterplot comparing AlphaGenome’s predicted versus observed effect sizes (Coefficient) for causal caQTL variants in European ancestry. (c) Precision-Recall curves comparing AlphaGenome, Borzoi, and ChromBPNet performance on dsQTL causality prediction in Yoruba ancestry. (d) Scatterplot comparing AlphaGenome’s predicted versus observed effect sizes (Coefficient) for causal dsQTL variants in Yoruba ancestry. (e) Precision-Recall curves comparing model performance for caQTL causality prediction (African ancestry). (f) Effect size prediction for microglia causal caQTL variants. Scatterplot compares observed effects versus AlphaGenome’s predicted DNase effects in a closely-related available cell type (suppressor macrophage). (g) Effect size prediction for cardiac smooth muscle cell (SMC) causal caQTL variants. Scatterplot compares observed effects versus AlphaGenome’s predicted ATAC effects in a closely-related available cell type (left cardiac atrium ATAC). (h) Precision-Recall curves comparing model performance for SPI1 bQTL causality prediction.
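The captions distinguish signed from unsigned Pearson r: the signed value correlates raw predicted and observed effects, while the unsigned value takes absolute values first. A minimal sketch of the two follows; the function names and the toy numbers are illustrative only and are not data from the paper.

```python
import numpy as np

def pearson(a, b) -> float:
    """Pearson correlation coefficient between two 1D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def signed_and_unsigned_r(pred, obs):
    """Signed r uses raw values; unsigned r uses absolute values first."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return pearson(pred, obs), pearson(np.abs(pred), np.abs(obs))

# Toy effect sizes for five hypothetical variants.
obs = np.array([-1.2, -0.4, 0.3, 0.9, 1.5])
pred = np.array([-0.8, -0.5, 0.1, 0.2, 1.1])

signed_r, unsigned_r = signed_and_unsigned_r(pred, obs)
print(round(signed_r, 2), round(unsigned_r, 2))
```

Signed r rewards getting the direction of effect right, so it can be high even when magnitudes are poorly ranked; unsigned r isolates how well the model orders effect magnitudes, which is why the two values reported in the captions can differ substantially.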

More details on AlphaGenome are available from Google DeepMind: https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/
AlphaGenome software on GitHub:
https://github.com/google-deepmind/alphagenome
