Structure-based machine-guided mapping of amyloid sequence space reveals uncharted sequence clusters with higher solubilities
The amyloid conformation can be adopted by a variety of sequences, but the precise boundaries of amyloid sequence space are still unclear. The currently charted amyloid sequence space is strongly biased towards hydrophobic, β-sheet-prone sequences that form the core of globular proteins and towards Q/N/Y-rich yeast prions. Here, we took advantage of the increasing amount of high-resolution structural information on amyloid cores currently available in the Protein Data Bank to implement a machine learning approach, named Cordax (https://cordax.switchlab.org), that explores amyloid sequence space beyond its current boundaries. Clustering by t-distributed stochastic neighbour embedding (t-SNE) shows how our approach resulted in an expansion away from hydrophobic amyloid sequences towards clusters of lower aliphatic content and higher charge, or regions of helical and disordered propensities. These clusters uncouple amyloid propensity from solubility, representing sequence flavours compatible with surface-exposed patches in globular proteins, functional amyloids or sequences associated with liquid-liquid phase transitions.
Introduction
The amyloid cross-β state is a polypeptide conformation that is adopted by 36 proteins or peptides associated with human protein deposition pathologies1. It also constitutes the structural core of a growing number of functional amyloids in both bacteria and eukaryotes2,3. Beyond these bona fide functional and pathological amyloids, it has been demonstrated that many, if not most, proteins can adopt an amyloid-like conformation upon unfolding/misfolding4. This has led to the notion that, just like the α-helix or β-sheet, the amyloid state is a generic polypeptide backbone conformation, but also that amino acids have different propensities to adopt the amyloid conformation5.
Initially, it was observed that amyloid-like aggregation correlates with hydrophobicity, β-strand propensity, and (lack of) net charge6. This triggered the development of aggregation prediction algorithms that essentially evaluate the above biophysical propensities7,8. Others extended this to scaling residue propensities between protein folding and aggregation9,10. These algorithms confirmed the ubiquity of amyloid-like propensity in natural protein sequences, and particularly in globular proteins, as it was estimated that 15–20% of residues in a typical globular domain are within aggregation-prone regions (APRs)11,12. These APRs are sequence segments of six to seven amino acids in length on average and are mostly buried within the protein structure, where they constitute the hydrophobic core stabilising tertiary protein structure13–15. On the other hand, the increasing identification of both yeast prions and functional amyloids clearly indicated that amyloid sequence space is not monolithic and that more polar, less aliphatic sequences represent important alternative populations of amyloid sequence space3. The limited sensitivity of the above-cited algorithms in identifying these other subpopulations confirmed that the sequence versatility of the amyloid conformation has been underestimated. Indeed, the more recently recognised role of amyloid-like sequences in proteins mediating liquid–liquid phase transitions again demonstrates the ubiquity of the amyloid state in biological function and further erodes the image of the amyloid state as a predominantly disease- and/or toxicity-associated protein conformation16–18. Rather, this suggests that, like globular protein folding, amyloid assembly is a matter of kinetic and thermodynamic control that can be evolutionarily tuned by sequence variation and selection.
Efforts to develop aggregation predictors that can identify a broader spectrum of amyloid sequences have increased over the years19. Such approaches focused on identifying position-specific patterns by reference to accumulated experimental data of APRs20–22, or by using energy functions of cross-beta pairings23.
Recently developed meta-predictors produce consensus outputs by combining previous methods, in an attempt to boost performance24,25. Indirect structure-based methods were initially developed by considering secondary structure propensities26,27.
Complementary studies extended this notion by suggesting that disease-related amyloids form β-strand-loop-β-strand motifs28.
However, the principle of using structural information to accurately predict aggregation-prone segments in protein sequences stems from the detailed work of Eisenberg and co-workers. The 3D-profiling method utilised the crystal structure of the fibril-forming segment NNQQNY (PDB ID: 1YJO), derived from the Sup35 prion protein, to thread sequences and evaluate their fit using the Rosetta energy function29. In this work, we build on this principle to develop Cordax, an exhaustively trained regression model that leverages a substantial library of curated template structures combined with machine learning. Cordax not only detects APRs in proteins, but also predicts the structural topology, orientation and overall architecture of the resulting putative fibril core. To validate the accuracy of our predictions, we designed a screen of 96 newly predicted APRs and experimentally determined their aggregation properties. Using this approach, we identified less hydrophobic, polar and charged aggregation-prone sequences that increasingly uncouple solubility and amyloid propensity, closely resembling characteristics of phase-separation inducers. Clustering by t-distributed stochastic neighbour embedding reveals the heterogeneous substructure of amyloid sequence space, which consists of distinct clusters corresponding to sequences compatible with globular structure, functional scaffolding amyloids, N/Q/Y-rich prions, helical peptides and intrinsically disordered sequences. Together, the structural exploration performed here demonstrates that the field has now gathered sufficient structural and sequence information to start classifying amyloids according to different structural and functional niches. Just as for globular proteins in the 1980s, this will allow both general and context-dependent structural rule learning to be fine-tuned, making it possible to manipulate and design amyloid structure and function.
Results
Overall approach of Cordax. We wanted to design a novel structure-based amyloid core sequence prediction method that (a) leverages all the structural information that is currently available, and (b) employs a machine-learning element for optimal prediction performance. To this end, we first built a curated template library of amyloid core structures as described in the paragraph below. In the vein of previous prediction methods29, we fixed on the hexapeptide as the unit of prediction. In order to determine the amyloid propensity of a query hexapeptide, we start by modelling its side chains on all the available amyloid template structures using the FoldX force field30, which yields a model and an associated free energy estimate (ΔG, kcal/mol) for each template. These free energies are then fed into a logistic regression model, which is a simple statistical method relating a binary outcome to continuous variables. The prediction output of Cordax is twofold. First, the logistic regression predicts whether or not the segment is an amyloid core sequence. Second, for the sequences deemed amyloid core, the most likely amyloid core model is provided. For longer query sequences, a sliding window approach is adopted. The technical details of the pipeline can be found in the “Methods” section.
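The flow described above (sliding hexapeptide windows, one ΔG per template, logistic regression, and a most likely template) can be illustrated with a minimal Python sketch. It is not the Cordax implementation: the FoldX energies are assumed to be computed elsewhere and passed in as a dictionary, the weights and bias are hypothetical placeholders rather than trained parameters, choosing the template with the lowest ΔG as the "most likely model" is an assumption, and the function names are invented for illustration.

```python
import math
from typing import Dict, List, Tuple

WINDOW = 6  # the hexapeptide is the unit of prediction in the text


def hexapeptide_windows(sequence: str) -> List[Tuple[int, str]]:
    """Return (start index, hexapeptide) for every sliding window of a query."""
    return [(i, sequence[i:i + WINDOW]) for i in range(len(sequence) - WINDOW + 1)]


def amyloid_probability(energies: Dict[str, float],
                        weights: Dict[str, float],
                        bias: float) -> float:
    """Logistic regression over per-template free energies (kcal/mol).

    `energies` maps a template name to the modelled free energy of the query
    on that template; `weights` and `bias` are hypothetical parameters.
    """
    z = bias + sum(weights[name] * dg for name, dg in energies.items())
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def best_template(energies: Dict[str, float]) -> str:
    """Assumption: the most likely core model is the lowest-energy template."""
    return min(energies, key=energies.get)
```

For a full-length query, each window from `hexapeptide_windows` would be scored in turn, so that every residue is covered by up to six overlapping predictions.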
Refinement of fibril structures for machine learning. We isolated 78 short segment fibril core high-resolution structures from the Protein Data Bank (Supplementary Data 1). Templates were grouped into seven distinct topological classes out of eight theoretically possible based on their overall structural properties, as previously proposed by Sawaya et al.31. Briefly, topologies are defined by whether β-sheets have parallel versus antiparallel orientation, by the orientation of the strand faces that form the steric zipper (face-to-face versus face-to-back), and finally the orientation of both sheets towards each other and whether that results in identical or different fibril edges. This complexity was addressed by generating an ensemble of amyloid cores per structure using crystal contact information derived from the solved structures. Every template comprises two facing β-sheets, each composed of five successive β-strands. Since parallel architectures can share more than one homotypic packing interface, those structures were split into separate individual entries (Fig. 1).
To ensure uniformity, we expanded the number of structural variants by breaking down longer segments into hexapeptide constituents, thus yielding a library of 179 peptide fragment structures (Fig. 1 and Supplementary Data 1).
The amyloid interaction interfaces were analysed in detail following energy refinement with the FoldX force field30. During this step we identified and rejected 33 imperfect β-packing interfaces formed by β-strands that contribute fewer than three interacting residues, thus reducing the ensemble to 146 structures (Supplementary Data 1). Detailed analysis of the contributions of various energy components showed that these excluded β-packing interfaces have poor shape complementarity and low overall stability, stemming from a combination of weak electrostatic contributions, diminished van der Waals interactions and exposure of hydrophobic residues to the solvent (Fig. 2a).
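The "eight theoretically possible" classes follow from the three binary criteria listed above, and the library size can be tracked with simple arithmetic. The short Python sketch below enumerates the combinations and checks the counts quoted in the text. The tuple labels are descriptive strings chosen here for illustration; they are not mapped to the class numbering of Sawaya et al.

```python
from itertools import product

# The three binary criteria named in the text.
STRAND_ORIENTATION = ("parallel", "antiparallel")
SHEET_PACKING = ("face-to-face", "face-to-back")
FIBRIL_EDGES = ("identical", "different")


def enumerate_topologies():
    """All combinations of the three binary criteria (2 x 2 x 2)."""
    return list(product(STRAND_ORIENTATION, SHEET_PACKING, FIBRIL_EDGES))


# Library bookkeeping with the numbers quoted in the text.
FRAGMENTS_AFTER_SPLITTING = 179   # hexapeptide fragment structures
REJECTED_INTERFACES = 33          # imperfect beta-packing interfaces
REMAINING = FRAGMENTS_AFTER_SPLITTING - REJECTED_INTERFACES

if __name__ == "__main__":
    for topology in enumerate_topologies():
        print(topology)
    print("structures after interface filtering:", REMAINING)
```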
Previous work has highlighted that distinct topological layouts can potentially introduce a stronger tolerance for the integration of protein sequence segments and, as a result, can generate several potential type-I errors (false positives)29. To address this issue, we implemented a two-step cross-threading exploration of putative structurally promiscuous traps. In more detail, we extracted a non-redundant set of hexapeptide sequences from the structural library (73 sequences), which was subsequently cross-modelled in an all-against-all iterative process. Using an empirical cut-off threshold (=5), a total of three structural fragments was initially identified and removed. Eliminating these structures led to the identification and subsequent elimination of three additional promiscuous templates, resulting in the final Cordax library, composed of 140 zipper structures (Fig. 2b and c).
Benchmarking aggregation propensity detection with Cordax.
As an initial test of the prediction accuracy of the regression model, we performed leave-one-out cross-validation on the training dataset32, and performance metrics were determined on a per-peptide basis. Due to the extensive size of the dataset, comparison to other software was performed only with methods supporting multiple sequence input and a non-binary scoring function, since performances were compared using receiver operating characteristic (ROC) analysis33. The ROC curves generated highlight that Cordax outperforms seven state-of-the-art methods, which we applied using the optimised options defined by the developers7,9,21–24,34. In detail, Cordax performs well above random, as depicted by the highest total area under the curve (AUC) value of 0.87 (Fig. 3a). Distribution analysis of the scoring values indicates that the method achieves optimal separation, resulting in minimal scoring overlap between positive and negative amyloid-forming sequences (Fig. 3b). As previously reported, TANGO showed high specificity due to the overrepresentation of unscored values, which is also evident for WALTZ as well as MetAmyl, which incorporates the latter method in its meta-prediction. The cost of high specificity is also reflected in the calculated F1 values, as PASTA and TANGO report low recall values. On the other hand, AGGRESCAN and GAP produce significant overpredictions, as depicted by their reported false-positive rates (FPR values of 0.54 and 0.76, respectively) (Fig. 3c). The optimal score threshold of our method was determined from the ROC curve analysis as the score at which predictions show the highest sensitivity-to-specificity ratio.
According to this, Cordax achieves a well-balanced prediction by reporting with high specificity (86%) more than 7 out of 10 aggregation prone segments (72%), which is reflected by the highest calculated MCC, AUC and F1 values compared to other available software (Fig. 3c).
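For readers less familiar with the metrics quoted above, the following Python sketch shows how sensitivity, specificity, F1 and the Matthews correlation coefficient (MCC) follow from a confusion matrix, and how an AUC can be computed from raw scores by pairwise ranking. The example counts in the `__main__` block are hypothetical: they assume 100 positive and 100 negative peptides, chosen only so that sensitivity and specificity match the reported 72% and 86%. The real class sizes are not given in this section, so the resulting F1 and MCC are not the paper's values.

```python
import math
from typing import Dict, Sequence


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),          # recall, true-positive rate
        "specificity": tn / (tn + fp),          # 1 - false-positive rate
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical counts: 100 positives and 100 negatives.
    print(confusion_metrics(tp=72, fp=14, tn=86, fn=28))
```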
To further benchmark the method, we tested it against full-length protein sequences. For this we used a standardised set of 34 annotated amyloidogenic proteins that was previously used to validate several aggregation predictors25, following a filtering step for potential overlaps with the training data set. Despite its wide use, this collection suffers from insufficient experimental characterisation of certain large entries (i.e. gelsolin, kerato-epithelin, lactoferrin, amphoterin and others), which has been shown to introduce type-I errors (false positives). This error propensity derives from non-amyloid annotations that primarily correspond to regions of undetermined aggregation propensity, a notion that is highlighted by recent studies, such as in the case of calcitonin35, cystatin-C36 and transthyretin37. In contrast, other proteins have been linked to the formation of β-helical structures and, as a consequence, contain elongated fragments characterised as amyloidogenic, yet unverified in their entirety, which can introduce type-II errors (false negatives) when applying predictors of local aggregation propensity38–41. The aforementioned shortcomings are reflected by the low MCC values that are reported for all aggregation predictors (Supplementary Table 1) and by the fact that some predicted segments were originally considered neutral, but were later shown to be aggregation hotspots (Supplementary Fig. 1)35–41.
Designed APR nucleators validate the accuracy of Cordax predictions. In the interest of improving the current description of the familiar amyloidogenic protein dataset, we selected and synthesised a subset of 96 peptides corresponding to strong aggregation prone regions identified in these proteins by Cordax.
Apart from prediction strength, the peptide screen was also selectively constructed to ensure broad sequence variability and a wide distribution over the proteins of the dataset, with a preference for longer entries with inadequate previous characterisation.
Peptide sequences were cross-checked and filtered to exclude sequences overlapping with previously identified amyloid regions and WALTZ-DB (Supplementary Data 2). The remaining selection of 96 peptides was synthesised using standard solid phase synthesis, and their amyloid-forming properties were initially examined using Thioflavin-T (Th-T) or pFTAA binding, following rotating incubation for 5 days at room temperature. The binding assays are complementary, as Th-T and pFTAA are oppositely charged molecules, which increases the amyloid identification rate by overcoming cases of dye-specific failure to bind to amyloid surfaces due to charge repulsion. Under these conditions, 66 peptides bound the specific dyes (Fig. 4a and b) and formed fibrils with typical amyloid morphologies and properties, which were verified using transmission electron microscopy (Fig. 4c) and, for selected cases, Congo red staining (Fig. 4d). As these dyes are known to yield false negatives, in particular for short peptides, all dye-negative peptides were further investigated using electron microscopy. During this scan, we recovered 19 additional sequences that were capable of forming sparse amyloid-like fibrils with shorter lengths (Supplementary Fig. 2). Taking the latter into account, Cordax identified a total of 85 novel nucleation segments with high accuracy (89%), thus providing a substantially improved description of the protein set to be used for the efficient testing and development of future predictors (Supplementary Fig. 1).
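The reported validation rate can be reproduced from the counts given in the text. The short Python check below uses only those numbers (66 dye-positive peptides, 19 further peptides confirmed by electron microscopy, 96 peptides tested); the variable names are chosen here for illustration.

```python
DYE_POSITIVE = 66        # peptides binding Th-T or pFTAA
EM_ONLY_POSITIVE = 19    # dye-negative peptides that form fibrils under EM
TESTED = 96              # peptides synthesised and tested

validated = DYE_POSITIVE + EM_ONLY_POSITIVE
validation_rate = validated / TESTED

if __name__ == "__main__":
    # 85 of 96 peptides, which rounds to the 89% quoted in the text.
    print(f"{validated}/{TESTED} = {validation_rate:.1%}")
```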
Cordax detects highly soluble surface-exposed conformational switches. The expanded amyloidogenic annotation of the protein dataset was supplemented with structural analysis of the newly identified aggregation prone regions. Out of 96 peptides designed and experimentally tested, 85 peptides were found to display evident amyloid-forming features, with more than half (55.3%) being predicted specifically by Cordax, contrary to shared predictions with sequence-based tools of high specificity (44.7%) (Supplementary Data 2). Pinpointing the location of the identified nucleators in parental protein folds (Fig. 5a) revealed that APRs picked up both by Cordax and traditional sequence-based methods are usually found buried within the core of soluble proteins. Contrary to what has been previously reported14,15, however, our regression model also discovered additional nucleating sequences that primarily appear to reside on the surface of protein molecules (Fig. 5b–h) and as a result, are characterised by high solvent exposure (Fig. 5i and j). Partition coefficients clearly indicate that these exposed peptide segments identified by Cordax are primarily water-soluble sequences, whereas APRs that are predicted by the majority of sequence-based predictors are largely insoluble (Fig. 5k). Sequence distribution analysis signifies that this increased exposure and solubility is complemented by an expected decrease in sequence hydrophobicity (Fig. 5l). More specifically, APRs identified solely by Cordax are relatively enriched in charged or polar side chains (Fig. 5l) and are frequently parts of α-helical or unstructured segments (Fig. 5m). This implies that these regions are in fact conformational switches that may, under fitting misfolding conditions, transiently move towards the formation of β-aggregates. 
The fact that these sequences are not dictated by typical sequence propensities, such as hydrophobicity or β-structure tendency, explains why sequence-based predictors overlook them.
Cordax infiltrates uncharted areas of amyloid sequence space.
To further explore the capabilities of our method, we composed a map of the known amyloid-forming sequence space using t-distributed stochastic neighbour embedding (t-SNE) for dimensionality reduction (Fig. 6a). As input, we used a 20-dimensional parameterisation vector describing all newly identified amyloidogenic peptides, merged with the known amyloid-forming hexapeptide sequences in WALTZ-DB, in terms of their basic physicochemical properties and amino acid composition, as well as prediction outputs derived from Cordax and other high-specificity predictors. t-SNE mapping pinpointed clear areas of sequence space where Cordax correctly identifies amyloid propensity (purple colour in Fig. 6a), which primarily extend towards regions that remain unpredicted (shown in black) and are set apart from a large base of sequences identified by multiple methods, including Cordax (cyan colour). Clustering analysis (Fig. 6b) performed using physicochemical properties (Figs. 6c–e), secondary structure propensities (Fig. 6f) and side chain size distributions (Fig. 6g, h) shows that this common base of by-now easy-to-predict APRs is characterised by high hydrophobicity, strong β-sheet propensity and a high relative content of aliphatic side chains (cluster 1 in Fig. 6b), still echoing the initial discovery of APRs by these features6. Cordax explores regions adjacent to this with a higher content of shorter side chains (clusters 2 and 5).
Notably, amyloid nucleators of this composition are an invaluable resource for amyloid nanomaterial designs with elastin-like properties, are enriched in functional amyloids and have also been linked to ancestral amyloid scaffolds in early life42–45. A similar trend in amino acid composition has also been reported for proteins that form condensates through phase transition, such as TDP-43 and FUS (refs. 16,18). Low complexity regions (LCRs) that are enriched in short side chains, such as Gly or Ala, have been shown to drive phase separation, often as an intermediate event towards fibrillation, particularly in polar LCRs with lower aliphatic content and strong disorder or α-helical propensities, such as the sequences discovered in cluster 5 (refs. 17,46). Further to this, Cordax provides a significant advance by traversing areas with a higher content of negatively or positively charged residues (clusters 3, 4, 6 and 7, respectively). Charged residues often act as gatekeepers that directly disrupt aggregation or modulate it by flanking APRs within protein sequences47. Based on this premise, most sequence-based predictors negatively correlate net charge with protein aggregation and have increased failure rates when identifying such amyloid-forming stretches. On the other hand, sequences with a high content of aromatic side chains are relatively easy to identify (clusters 9a and 9b), in keeping with several lines of evidence supporting their role in amyloid fibril formation48. Cordax also pushes forward into less well-charted areas of amyloid sequence space, e.g. exploring clusters with high α-helical content (cluster 10) and, overall, a low content of aliphatic amino acids (clusters 5, 6, 7, 8 and 9b). These regions also reveal the scope for improving the method: in particular, the region with high disorder propensity (cluster 11) still contains many false negatives, although Cordax picks up a minority of its sequences.
Interestingly, a closer look at the partition coefficients of the known amyloid sequence space reveals that, although Cordax takes a significant step in the right direction, these APRs remain very hard to identify, as they are characterised by even higher solubility values (Fig. 6i). Similar charting of the amyloid sequence space is achieved by using uniform manifold approximation and projection (UMAP) for dimensionality reduction (Supplementary Fig. 3a and b), while principal component analysis (PCA) highlights that Cordax slowly infiltrates the sequence space of higher solubilities (Supplementary Fig. 3c and d).
Overall, dimensionality reduction transformation highlights that structural compatibility can overcome typical sequence propensities as a pivotal driver of aggregation nucleating sequences and suggests that under the proper conditions, the boundaries currently considered compatible to protein amyloid-like assembly are potentially far wider than previously expected.
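To make the idea of a per-peptide parameterisation vector concrete, the Python sketch below computes a handful of sequence descriptors of the kind mentioned above (hydrophobicity, net charge, aliphatic and aromatic content). It is an illustration only: the four features, their names and the use of the Kyte-Doolittle hydropathy scale are assumptions made here, not the 20 dimensions used for the published map, which also include predictor outputs. Vectors like these would then be passed to a dimensionality reduction method such as t-SNE.

```python
from typing import Dict

# Kyte-Doolittle hydropathy index per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
ALIPHATIC = set("AILV")
AROMATIC = set("FWY")
# Simplified formal charges; histidine is treated as neutral here.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def describe_peptide(sequence: str) -> Dict[str, float]:
    """Illustrative physicochemical descriptors for one peptide."""
    n = len(sequence)
    return {
        "mean_hydropathy": sum(KYTE_DOOLITTLE[aa] for aa in sequence) / n,
        "net_charge": float(sum(CHARGE.get(aa, 0) for aa in sequence)),
        "aliphatic_fraction": sum(aa in ALIPHATIC for aa in sequence) / n,
        "aromatic_fraction": sum(aa in AROMATIC for aa in sequence) / n,
    }


if __name__ == "__main__":
    # NNQQNY is the Sup35 segment cited in the introduction.
    print(describe_peptide("NNQQNY"))
```

A polar segment such as NNQQNY has no aliphatic residues and a strongly negative mean hydropathy, which is the kind of profile that purely hydrophobicity-driven predictors tend to score poorly.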
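Fig. 5 Cordax identifies surface-exposed aggregation nucleating stretches spanning residues that are often considered unconventional for amyloid fibril formation. a Schematic representation of Cordax-predicted topological models of APRs mapped on the cognate native fold of the amyloidogenic Ure2p crystal structure. b–h Surface representations of the folded structures of b Ure2p, c RepA, d acylphosphatase-2, e Sup35, f prolactin, g lactoferrin and h kerato-epithelin indicate that aggregation nucleators uniquely identified by Cordax (shown in red) are primarily exposed on the protein surface, whereas jointly predicted segments (shown in blue) are mainly buried within the hydrophobic core of the native fold. Cordax-specific predicted APRs yield lower burial values, calculated using FoldX, for i side chain and j main chain groups, indicating that they are considerably more exposed than commonly identified nucleators. k Partition coefficients indicate that Cordax-specific APRs are significantly more soluble than typically predicted hydrophobic sequences, which are consequently insoluble in water. Solubility regions (vi very insoluble, i insoluble, n neutral, s soluble, vs very soluble) are indicated with coloured backgrounds72. Significant differences were calculated using one-way ANOVA with multiple comparisons.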
Cordax predicts the structural layout and topology of fibril cores. Due to restricted availability of experimentally determined structures not included in the Cordax library, we first analysed the information derived from cross-threading analysis in order to test the performance of the tool in predicting the structural architecture of aggregation prone stretches. Among 73 unique sequences corresponding to the structural library, Cordax was able to accurately assign the correct architecture to 63%, whereas 81% was identified with proper β-strand orientation (parallel/ antiparallel) (Fig. 7a, Supplementary Data 3 and 4). In comparison, FibPredictor49 correct topology allocation was limited to 9.5% of the sequences and assigned β-strand directionality amounted to 32.9%, while introducing an evident preference towards antiparallel architectures (Fig. 7a). Similarly, the 3Dprofile method is restricted to linking all potential queries with a class 1 topology, hence was incapable of predicting alternative architectures (Fig. 7a). Structural alignment indicated that even in cases of mismatching selected templates, modelled architectures strongly superimpose to the solved structures (Fig. 7b), suggesting that Cordax identifies the correct topology with high accuracy. A closer look reveals that sequence specificity may be a modulating, yet not determining factor for this selection process. Steric perturbations can be introduced due to restrictions deriving from closely interdigitating side chains within the packed interfaces, therefore, key residue positions can be bound to the overall stability of certain structural topologies and decrease the acceptable sequence space that can accommodate energetically favourable interactions. This is highlighted by the sequence similarity observed between topological matches (Fig. 7c, Supplementary Data 4). On the other hand, topologically different model selections could also be a consequential outcome of amyloid polymorphism. 
The observed sequence redundancy of the Cordax library illustrates that APRs can form amyloid fibrils with distinct morphological layouts50–52, a notion that is also supported by the common morphological variability of aggregates formed at the level of full-length amyloid-forming proteins53,54. The modulating role of sequence dependency was also evident for the 96peptide screen. A ranked analysis of the output models indicated that templates with higher alignment scores were not crucial for the topology selection process, although could often correspond to the favourable architectures (Fig. 7d), thus highlighting that the structural predictions of Cordax are relatively unbiased in terms of the sequence space composing the structural templates.
The accuracy of the tool was also cross-referenced against experimentally determined structures of fibril cores not included in the structural library. We utilised the recently solved structures of parallel fibril-forming segments derived from the major curli protein CsgA55, as well as an antiparallel polymorphic APR variant segment derived from the amyloid-β peptide56. Compared to other structural predictors, only Cordax consistently predicted the correct architecture for every steric zipper as the closest representation of the experimentally determined reference structures (Fig. 7e and f). This performance is expected to improve as the fragment library expands, so we aim to update it at regular intervals, provided there is a noticeable increase in solved structures in the future.
Discussion
The number of amyloid structures in the protein databank has been steadily increasing over the last two decades. It has now achieved a number (>80) that was reached for globular proteins at the beginning of the 1980s and that then triggered the first developments of template-based modelling methods including homology-based and threading (or fold recognition) in an attempt to estimate the versatility of individual folds and discover novel folds in a more directed manner. Similarly, we here developed Cordax, an exhaustively trained regression model that leverages a substantial library of curated amyloid template structures combined with machine learning. Cordax uses a logistic regression approach to translate structural compatibility and interaction energies into sequence aggregation propensity and is therefore unconstrained by defined sequence tendencies, such as hydrophobicity or secondary structure preference that direct most sequence-based predictors. As a result, we discovered unconventional amyloid-like sequences, including sequences with low aliphatic content, high net charge or sequences with low intrinsic structural propensities. Clustering amyloid sequences by t-SNE two-dimensional reduction revealed the substructure of amyloid sequence space. Apart from a large cluster corresponding to sequences found in the hydrophobic core of globular proteins, we also found clusters corresponding to surface-exposed amyloid sequences in globular proteins, small aliphatic functional amyloids, N/Q/Y prions, strongly helical and intrinsically disordered sequences which could be compatible with liquid–liquid phase responsive sequences. Our analysis highlights the discovery of highly soluble, yet amyloid-forming, sequences and suggests that the largest portion of the remaining uncharted amyloid sequence space is hidden in this corner (Fig. 6a and i). Indeed, most archetypal hydrophobic APR sequences have low intrinsic solubility.
As a result, low solubility and aggregation propensity are properties that are often wrongly used interchangeably. It is important to differentiate between the initial solubility and the aggregation propensity of a peptide, as soluble monomeric sequences can often self-assemble, at later time points, into insoluble amyloid fibrils. The APRs newly discovered by Cordax are often highly soluble in their monomeric form, even more so than the already known polar APRs from the yeast prions, as they contain many charged and polar residues, yet surprisingly can still assemble into amyloids. Overall, our approach demonstrates that the increasing structural information on amyloids now allows for more fine-grained structural rule learning of the amyloid state.
Recent developments in microcrystal electron diffraction have enabled structural determination from nanocrystals that are not typically suited for traditional X-ray diffraction and have provided significant insights into the polymorphic architectures of amyloid fibrils57. Along the same lines, the emergence of cryo-EM has been pivotal in determining features of amyloid fibril polymorphs58, complementing earlier efforts using solid-state NMR spectroscopy53,59. Notably, these structures represent snapshots of the kinetic cores of aggregation or end-state morphologies of amyloid fibrils and therefore provide limited information on the underlying aggregation pathways and toxicity-related effects of amyloids. On the other hand, the growing number of high-resolution cryo-EM structures has highlighted the in vivo structural diversity of amyloid fibrils60, whereas steric zippers have recently been used for the development of targeted therapeutics61–63. However, determining the structural layout of amyloid fibrils remains challenging. Cordax aims to provide a cost-effective, complementary computational alternative that can be operated without the specialised expertise required by these intricate experimental techniques.
Apart from its function as an aggregation predictor, the tool is uniquely poised to provide detailed complementary structural information on the putative amyloid fibril architecture of identified APRs. Users can utilise the method to structurally characterise identified APRs by classifying their overall specific topological preferences, including β-strand directionality and key residue positions that are integral parts of the amyloid core. The latter information is imperative for efforts focused on understanding the underlying mechanisms that dictate amyloid-related diseases or the formation of functional amyloids, but can also have an immense impact on the design of applied nanobiomaterials64, targeted amyloid inducers65 or counteragents, following the increased interest in the development of structure-based inhibitors of aggregation61–63.
Methods
Regression model training.
In previous work we synthesised and explored the aggregation potential of 940 peptide sequences derived from both functional and pathological amyloid-forming proteins, which were supplemented with additional data on 462 hexapeptides derived from other published sources to develop WALTZ-DB 2.0 (ref. 32), the largest public comprehensive repository of experimentally defined amyloidogenic peptides. In total, 1402 hexapeptide sequences from WALTZ-DB were modelled on the 140 backbone structures of the Cordax library, leading to the generation of 196,280 models. The thermodynamic stability of each model (ΔG, kcal mol−1) was calculated using FoldX and fed into a logistic regression model (Fig. 2c). This model was used to distil the aggregation propensity from the free energy values. To this end, from the calculated ΔG values, we isolated 50 representative energies using a recursive feature elimination algorithm (using the RFE module of the scikit-learn Python package33 and selecting for the set of templates that maximised the AUC). As a result, each sequence is described by a 50-dimensional vector. Next, the data were transformed to be constrained to a scoring range between 0 and 1, using a Min/Max scaling algorithm. The regression model was trained with an L2 penalty and regularisation strength (C) equal to 1. Both the scaling of the estimated ΔG values and the machine-learning model were developed using the scikit-learn Python package66.
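The training steps described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the energy matrix and labels are synthetic stand-ins for the FoldX ΔG values and WALTZ-DB annotations, the number of retained templates is fixed at 50 directly instead of being chosen by an AUC search, and the RFE estimator and `step` size are arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the energy matrix: one row per hexapeptide,
# one column per template backbone (the paper uses 1402 x 140).
n_peptides, n_templates = 300, 140
labels = rng.integers(0, 2, size=n_peptides)      # 1 = amyloid-forming (made up)
energies = rng.normal(0.0, 2.0, size=(n_peptides, n_templates))
energies[:, :20] -= 1.5 * labels[:, None]         # make 20 templates informative

# Step 1: recursive feature elimination down to 50 template energies.
# The estimator and step size are assumptions made for this sketch.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=50, step=5)
selected = selector.fit_transform(energies, labels)

# Steps 2-3: Min/Max scaling to [0, 1], then logistic regression with C = 1.
# L2 is scikit-learn's default penalty, so it is not passed explicitly.
model = make_pipeline(MinMaxScaler(), LogisticRegression(C=1.0, max_iter=1000))
model.fit(selected, labels)

# Predicted probability of the positive class, used here as the propensity score.
scores = model.predict_proba(selected)[:, 1]
```

With real data, `energies` would hold one ΔG per peptide-template pair, so each peptide is described by a 50-dimensional vector once `selector` has been applied.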
Model pipeline.
Cordax receives a protein sequence in FASTA format as input, which is fragmented into hexapeptides using a sliding window process. Sequences are then threaded against the fragment library utilising FoldX and the derived free energies are translated into scoring values for every peptide window. An energetically fitted model is selected as the closest representative of the overall topology of the amyloid fibril core for each predicted window and is provided as output in standard PDB format to the users (Fig. 2c). An amyloidogenic profile is generated by scoring every single residue of the input sequence with the maximum calculated score of the corresponding windows, followed by a binary prediction for every segment. Finally, calculated energies are stored automatically in a growing local database and can be retrieved, thus creating a ‘lazy’ interface that bypasses unnecessary computation for recurring sequence segments or future runs.
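The windowing, per-residue profiling and caching steps can be illustrated with a short sketch. The scoring function below is a toy placeholder (hypothetical, and unrelated to the FoldX threading and regression scoring that Cordax performs), the in-memory `lru_cache` stands in for the local database of stored energies, and the 0.5 threshold is an arbitrary assumption.

```python
from functools import lru_cache

WINDOW = 6  # hexapeptide window length


def hexapeptide_windows(sequence):
    """Return every overlapping hexapeptide of the sequence, in order."""
    return [sequence[i:i + WINDOW] for i in range(len(sequence) - WINDOW + 1)]


@lru_cache(maxsize=None)
def score_window(peptide):
    """Toy placeholder score in [0, 1]; NOT the Cordax scoring function.

    The cache means a recurring hexapeptide is scored only once, mimicking
    the 'lazy' reuse of previously calculated energies.
    """
    return sum(residue in "VILFYW" for residue in peptide) / WINDOW


def residue_profile(sequence):
    """Score each residue with the maximum score of the windows covering it."""
    profile = [0.0] * len(sequence)
    for start, peptide in enumerate(hexapeptide_windows(sequence)):
        score = score_window(peptide)
        for position in range(start, start + WINDOW):
            profile[position] = max(profile[position], score)
    return profile


def binary_calls(profile, threshold=0.5):
    """Binary prediction per residue; the threshold is an assumed value."""
    return [score >= threshold for score in profile]
```

For example, `residue_profile("VVVVVVGG")` scores three windows and assigns each residue the highest score among the windows that contain it.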
Datasets.
Performance assessment of Cordax was carried out using two individual data sets for peptide and protein aggregation propensity detection. Further validation of the method was performed against an independent subset screen of 96 hexapeptide sequences.
For peptide aggregation propensity, we used a dataset of 1402 non-redundant hexapeptides contained in the WALTZ-DB 2.0 repository32. This database is the largest currently available resource of experimentally characterised amyloidogenic peptides. It contains annotated peptide entries that are distributed in shorter subsets and extracted from literature22,23,67–69, in addition to peptides with experimentally determined amyloid-forming properties. As a result, it has been widely used as a validation set for several aggregation predicting tools21,23,67,70,71.
Collected in 2013, reg33 is a standard dataset for estimating the performance of aggregation propensity prediction in protein sequences25. It contains regional annotation of aggregating segments identified for 34 well-known amyloidogenic proteins. The annotation is assigned on a residue basis, thus containing 1260 residues in defined APRs and 6472 residues located in non-aggregating segments.
Last, we compiled a set consisting of 96 hexapeptide segments derived from potentially mis-annotated non-amyloidogenic regions of the reg33 dataset that were predicted as aggregation-prone segments after applying Cordax. Peptide segments were filtered for potential overlaps to the WALTZ-DB 2.0 set (Supplementary Data 2).
Comparative analysis.
Binary classification was utilised to determine the performance of calculated aggregation propensities per hexapeptide fragment or per residue. By comparison to experimental validation, predictions are classified into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Performance is evaluated using the following metrics:
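The metric definitions themselves are not reproduced in this section. As a minimal sketch, assuming the standard confusion-matrix measures commonly reported for aggregation predictors (sensitivity, specificity, precision, accuracy and the Matthews correlation coefficient), they can be computed from the four counts as below. The choice of metrics and the function name are assumptions, not taken from the source.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Illustrative counts only; these are not results from the paper.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

The same function applies whether the counts are tallied per hexapeptide or per residue.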
Peptide synthesis. Peptides derived from the Cordax validation set were synthesised using an Intavis Multipep RSi solid phase peptide synthesis robot. Peptide purity (>90%) was evaluated using RP-HPLC purification protocols and peptides were stored as ether precipitates (−20 °C). Peptide stocks were initially treated with 1,1,1,3,3,3-hexafluoro-isopropanol (HFIP) (Merck), then dissolved in traces of dimethyl sulfoxide (DMSO) (Merck) (<5%), filtered through 0.2 μm filters and finally diluted in milli-Q water to reach a final concentration of 200 μM, or up to 1 mM for dye-negative peptides. Dithiothreitol (DTT) (1 mM) was included in solutions of peptides containing cysteine or methionine residues. All peptides were incubated at room temperature for a period of 5 days on a rotating wheel.
Thioflavin-T and pFTAA-binding assays. Amyloid aggregation was monitored using fluorescence spectroscopy binding assays. Th-T (Sigma) or pFTAA (Ebba Biotech AB) was added in half-area black 96-well microplates (Corning, USA) at a final concentration of 25 and 0.5 μM, respectively. Fluorescence intensity was measured in replicates (n = 6) using a PolarStar Optima and a FluoStar Omega plate reader (BMG Labtech, Germany), equipped with an excitation filter at 440 nm and emission filters at 490 and 510 nm, respectively.
Transmission electron microscopy. Peptide solutions were incubated for 5 days at room temperature in order to form mature amyloid-like fibrils. Suspensions (5 μL) of each peptide solution were added on 400-mesh carbon-coated copper grids (Agar Scientific Ltd., England), following a glow-discharging step of 30 s to improve sample adsorption. Grids were washed with milli-Q water and negatively stained using uranyl acetate (2% w/v in milli-Q water). Grids were examined with a JEM-1400 120 kV transmission electron microscope (JEOL, Japan), operated at 80 keV.
Congo red staining. Droplets (10 μL) of peptide solutions containing mature amyloid fibrils were cast on glass slides and permitted to dry slowly in ambient conditions in order to form thin films. The films were stained with a Congo red (Sigma) solution (0.1% w/v) prepared in milli-Q water for 20 min. De-staining was performed with gradient ethanol solutions (70–90%).
Determination of peptide propensities. Surface exposure and secondary structure analysis was performed using the FoldX energy force field on the available crystal structures for acylphosphatase-2 (PDB ID:1APS), amphoterin (PDB ID:1CKT and 1HME), apolipoprotein-C2 (PDB ID:1I5J), α-synuclein (PDB ID:1XQ8), β2-microglobulin (PDB ID:1A1M), casein (PDB ID:6FS5), gelsolin (PDB ID:3FFN), Het-S (PDB ID:2WVN), kerato-epithelin (PDB ID:5NV6), lactoferrin (PDB ID:1CB6), prolactin (PDB ID:1RW5), major prion protein (PDB ID:1E1G), repA (PDB ID:1HKQ), serum amyloid alpha (PDB ID:4IP8), Sup35 (PDB ID:4CRN) and Ure2p (PDB ID:1HQO). Partition coefficients were calculated using PlogP, which specialises in peptides with blocked termini72. Structural alignment and visualisation were performed with the aid of YASARA73. Sequence similarities were calculated using the BLOSUM62 matrix currently available under the Biostrings R library. Correlation plots were generated using the ggpairs() function available under the GGally R library and ROC curves were calculated using ROCR.
Dimensionality reduction analysis. A defined amyloid-forming sequence space was constructed by merging the experimentally determined amyloid sequences of the 96-peptide screen, identified by Cordax, with the amyloid sequence content extracted from WALTZ-DB. Prior to t-SNE analysis, scoring outputs using Cordax, PASTA23, TANGO7 and WALTZ21 were calculated for each peptide entry. Peptide description was complemented with a 20-dimensional vector using the available R package Peptides. All data points were reduced and embedded in 2D-space using the Rtsne package, with perplexity (p = 45), iteration steps (n = 5000) and learning rate (default) defined based on the initial guidelines proposed by van der Maaten and Hinton74. UMAP reduction was performed using the R umap package, and three-dimensional PCA analysis was conducted using the pca3d R package and visualised with scatter3D.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.