MCMC-PurBayes

Related Knowledge

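Heterogeneity

  • Tumor heterogeneity is one of the hallmarks of malignant tumors. As a tumor grows through repeated rounds of division and proliferation, its daughter cells acquire molecular or genetic changes, which leads to differences in growth rate, invasiveness, drug sensitivity, and prognosis. Put simply, a single tumor can contain cells of many different genotypes or subtypes. As a result, the same type of tumor can show different treatment responses and prognoses in different individuals, and even the tumor cells within one individual can differ in their properties.
  • More narrowly, tumor heterogeneity means that the somatic mutations carried by different tumor cells or subpopulations within the tumor tissue are not completely identical.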

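Tumor purity

  • In a tumor sample, cancer cells are always mixed with an unknown proportion of normal cells. The proportion of cancer cells in the tumor sample is called the tumor purity.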

SNV

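  • An SNV (single-nucleotide variant) is a site in the genome where a single base has changed. SNVs are widely distributed across the genome.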

Abstract

  • PurBayes estimates tumor purity and detects intratumor heterogeneity from next-generation sequencing (NGS) data of paired tumor-normal tissue samples, using finite mixture modeling methods.

Introduction

  • With advances in high-throughput next-generation sequencing (NGS) technologies, sequencing of tumor-normal tissue pairs is becoming commonplace in cancer studies. Often, the sampled tumor tissue is contaminated with stromal cells, resulting in a mixture of tumor and normal sequence data in the tumor sample. There has been recent interest in accurate estimation of tumor purity levels in tumor data analysis, including methods specific to NGS data such as PurityEst.

  • However, a subset of the observed somatic mutations may be subclonal because of intratumor heterogeneity. Unlike clonal mutations, which are observed tumor-wide, subclonal mutations are observed at cellularities lower than the tumor purity level and therefore bias purity estimates made under an assumption of tumor tissue homogeneity. By modeling this heterogeneity, it may also be possible to make inferences about tumor evolution and founder events. To date there are no methods that aim to both quantify tumor purity and detect intratumor heterogeneity using NGS data.

  • In this article, we present a Bayesian mixture modeling approach, PurBayes, for estimating tumor purity and subclonality using NGS data. It yields posterior distributions of tumor cellularities from which credible intervals (CI) can be derived. To illustrate its implementation, we conduct a simulation study under a variety of conditions and discuss the performance of PurBayes on synthetic data.

Methods

  • Consider a set of S observed heterozygous loci arising from somatically acquired single-nucleotide variants (SNVs) in a given tumor sequencing sample. Each SNV i can be represented by its normal and mutant allele read counts, Xi and Yi. The total number of sample reads Ni = Xi + Yi can in turn be decomposed into tumor and normal tissue read counts Nti and Nwi, such that Ni = Nwi + Nti. Because it cannot be directly determined from which cell type each individual read was derived, Nti and Nwi are latent variables. If we assume Nti to be binomially distributed, such that Nti ~ Bin(Ni, λ) where λ is the tumor sample purity, and Yi | Nti ~ Bin(Nti, 0.50), then Yi follows a binomial–binomial hierarchical mixture model with marginal distribution Yi ~ Bin(Ni, λ/2).
  • Consider a tumor that exhibits intratumor heterogeneity. If we assume subclonal mutations cluster into an a priori finite number of J-1 subclonal populations, Y can be modeled under a Bayesian finite mixture model. Let Kj denote the probability that a mutation belongs to variant population j with cellularity λj, for j = 1, ..., J, such that Σ Kj = 1, λ1 < ... < λJ, and λJ ≡ λ (the largest cellularity is the tumor purity), with uniform priors on λj. To obtain a data-driven value for J, PurBayes fits models iteratively: it starts by assuming tumor homogeneity and then increases the subclonal population count by one until an optimal model fit is achieved under a penalized expected deviance (PED) criterion.
  • Mapping bias can result in non-reference alleles at heterozygous loci being mapped at rates below 0.50, which would affect tumor purity estimation. PurBayes can accommodate this bias by estimating it from additional reference and alternate allele counts in heterozygous variant calls from normal tissue.
  • PurBayes is implemented in the statistical programming language R and uses the MCMC software JAGS. The only inputs required for PurBayes are the tumor tissue read counts (N and Y) for a set of high-confidence SNVs, which can easily be derived from the output file formats of most variant calling software for NGS data.
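The model described in this section can be sketched in a few lines of Python. This is only an illustration of the likelihood, not the PurBayes implementation (which is written in R and fitted with JAGS). The function names `binom_logpmf`, `simulate_snv`, and `mixture_loglik` are hypothetical, and the purity of 0.6 and depth of 100 are arbitrary example values. The sketch draws Y through the two-stage model, checks that the simple estimate 2·ΣY/ΣN recovers the purity as the marginal Bin(N, λ/2) implies, and evaluates the finite mixture log-likelihood Σj Kj·Bin(Yi | Ni, λj/2).

```python
import math
import random

def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def simulate_snv(n_reads, purity, rng):
    """Two-stage draw: Nt ~ Bin(N, purity), then Y | Nt ~ Bin(Nt, 0.5)."""
    n_tumor = sum(rng.random() < purity for _ in range(n_reads))
    return sum(rng.random() < 0.5 for _ in range(n_tumor))

def mixture_loglik(y, n, weights, cellularities):
    """Log-likelihood of the counts under sum_j K_j * Bin(N_i, lambda_j / 2).

    Assumes every weight K_j is strictly positive.
    """
    total = 0.0
    for yi, ni in zip(y, n):
        terms = [math.log(k) + binom_logpmf(yi, ni, lam / 2.0)
                 for k, lam in zip(weights, cellularities)]
        m = max(terms)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

rng = random.Random(1)
purity = 0.6                 # example value, not taken from the study
n = [100] * 200              # 200 SNVs at a fixed depth of 100 reads
y = [simulate_snv(ni, purity, rng) for ni in n]

# Under the marginal Y ~ Bin(N, purity / 2), E[Y / N] = purity / 2.
naive_purity = 2.0 * sum(y) / sum(n)

# With J = 1 the mixture reduces to the homogeneous model.
homogeneous_ll = mixture_loglik(y, n, [1.0], [purity])
```

A fit with J = 2 can be compared against `homogeneous_ll` by calling `mixture_loglik` with two weights and two cellularities. PurBayes makes that comparison with the PED criterion on MCMC fits rather than on a raw likelihood.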

Simulation

  • To illustrate the performance of PurBayes under a variety of conditions, we conducted simulation studies based on real sequencing data from the 1000 Genomes Project (details in Supplementary Materials). We first simulated read count data for homogeneous tumors ranging in purity from 20–80%, with S = 100 and average sequencing depth of 50x and 100x. We ran 100 replications of each unique set of conditions and examined the PurBayes posterior median estimates. We ran similar simulations for heterogeneous tumor data with J = 2 at 100x for various values of Kj and λj to determine how well PurBayes can detect intratumor heterogeneity and estimate tumor purity. For each application, we also simulated read count data from 100 additional germline variant calls to account for mapping bias. For purposes of comparison, we also applied the PurityEst algorithm to each simulation replicate.
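A simplified version of this simulation design can be sketched as follows. Several things here are assumptions rather than details from the study: read depth is drawn from a Poisson distribution instead of from 1000 Genomes data, mapping bias is ignored, only 20 replications are run to keep the example fast, and the estimator is the simple moment estimate 2·ΣY/ΣN rather than a PurBayes or PurityEst fit. The function names are hypothetical.

```python
import math
import random

def poisson(mean, rng):
    """Knuth's Poisson sampler; adequate for the modest depths used here."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(n_snvs, depth, weights, cellularities, rng):
    """Return (N, Y) lists; each SNV joins population j with probability K_j."""
    n_list, y_list = [], []
    for _ in range(n_snvs):
        lam = rng.choices(cellularities, weights=weights)[0]
        n = max(1, poisson(depth, rng))
        # Marginal model from the Methods section: Y ~ Bin(N, lam / 2).
        y = sum(rng.random() < lam / 2.0 for _ in range(n))
        n_list.append(n)
        y_list.append(y)
    return n_list, y_list

def naive_purity(n_list, y_list):
    """Moment estimate that assumes a homogeneous tumor."""
    return 2.0 * sum(y_list) / sum(n_list)

rng = random.Random(2024)
replications = 20  # the study used 100; fewer keeps this sketch fast

# Homogeneous tumors: purity 20-80%, S = 100, depth 50x and 100x.
homogeneous_mae = {}
for depth in (50, 100):
    for purity in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
        errors = []
        for _ in range(replications):
            n_list, y_list = simulate_counts(100, depth, [1.0], [purity], rng)
            errors.append(abs(naive_purity(n_list, y_list) - purity))
        homogeneous_mae[(depth, purity)] = sum(errors) / len(errors)

# Heterogeneous tumor, J = 2 at 100x: 40% of SNVs are subclonal at
# cellularity 0.3 and 60% are clonal at purity 0.7 (illustrative values).
weights, cellularities = [0.4, 0.6], [0.3, 0.7]
n_list, y_list = simulate_counts(100, 100, weights, cellularities, rng)
heterogeneous_estimate = naive_purity(n_list, y_list)

# The pooled estimate targets sum_j K_j * lambda_j (about 0.54), not 0.7.
expected_pooled = sum(k * lam for k, lam in zip(weights, cellularities))
```

In the heterogeneous case the homogeneous estimate falls near `expected_pooled` and well below the true purity of 0.7. That downward pull is the bias a model with subclonal populations is meant to remove.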

  • For each application of PurBayes, the first 50,000 iterations of the optimal MCMC model fit were discarded as burn-in before posterior sampling of 10,000 iterations. Mean per-sample execution time was 2 min on a workstation equipped with an Intel Core i5 3.10 GHz processor and 4 GB of random access memory.
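The burn-in and posterior sampling scheme can be illustrated with a hand-written sampler. PurBayes delegates sampling to JAGS, so the random-walk Metropolis algorithm below is a stand-in, not the method JAGS uses. It covers only the homogeneous model with a Uniform(0, 1) prior on λ, and it uses much shorter chains (2,000 burn-in, 5,000 kept) than the 50,000 and 10,000 reported above. The data, step size, and function names are assumptions chosen for the example.

```python
import math
import random

def log_posterior(lam, n_list, y_list):
    """Unnormalized log posterior of purity under a Uniform(0, 1) prior."""
    if not 0.0 < lam < 1.0:
        return float("-inf")
    p = lam / 2.0  # marginal success probability, Y ~ Bin(N, lam / 2)
    y_total = sum(y_list)
    n_total = sum(n_list)
    return y_total * math.log(p) + (n_total - y_total) * math.log(1.0 - p)

def metropolis(n_list, y_list, burn_in, n_samples, step, rng, start=0.5):
    """Random-walk Metropolis; returns only the draws kept after burn-in."""
    current = start
    current_lp = log_posterior(current, n_list, y_list)
    kept = []
    for iteration in range(burn_in + n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, n_list, y_list)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if iteration >= burn_in:
            kept.append(current)
    return kept

# Toy data: 50 SNVs, each with 30 mutant reads out of 100 (so Y/N = 0.3).
n_list = [100] * 50
y_list = [30] * 50

draws = metropolis(n_list, y_list, burn_in=2000, n_samples=5000,
                   step=0.02, rng=random.Random(7))
ordered = sorted(draws)
posterior_median = ordered[len(ordered) // 2]
ci_low = ordered[int(0.025 * len(ordered))]
ci_high = ordered[int(0.975 * len(ordered))]
```

The posterior median and the 2.5% and 97.5% quantiles of the kept draws play the role of the point estimate and 95% credible interval discussed in the text.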

Results and Discussion

  • For the homogeneous tumor simulations, PurBayes correctly identified tumor homogeneity in all replications. Distributions of the posterior median estimates of tumor purity for each value of λ and each method are displayed in Figure 1. Estimates from PurBayes and PurityEst were nearly identical, with a Pearson correlation of 0.9997. Both methods were accurate, tending toward overestimation at lower values of λ. When we applied PurBayes to heterogeneous data, the ability to detect heterogeneity was highly dependent on the disparity between cellularities. The proportion of clonal variants also affected detection, with larger values of K1 leading to higher mean absolute error (MAE) of the posterior median purity estimates. Although PurityEst performed comparably under certain conditions, the ability of PurBayes to detect heterogeneity generally resulted in greater estimate accuracy.
  • Our simulation results highlight the potential bias of tumor purity estimates in the presence of unaccounted intratumor heterogeneity. By simultaneously estimating tumor purity and subclonality, PurBayes may also provide additional advantages, such as facilitating inference regarding tumor composition and evolution as well as isolation of potential founder events. Because PurBayes is a Bayesian approach, measures of uncertainty are derived directly from the posterior distribution of λJ in the form of CIs.
  • One possible issue in the application of PurBayes arises when it estimates J to be larger than the true value because of outlier observations, which leads to a positively biased tumor purity estimate. This can be especially problematic in the presence of copy number variation (CNV) and structural rearrangements. Regions of CNV have a multiplicative impact on the number of mapped reads, and SNVs within such regions do not truly reflect heterozygosity at a proportion of 0.50, so such SNVs would strongly influence the estimation of J. We therefore anticipate that PurityEst, because of its robust estimation procedures, will perform better when CNVs are present and unaccounted for in purity estimation. It is thus highly recommended that regions indicated to be CNVs by parallel analyses be filtered from the estimation procedure.
  • We foresee a variety of extensions to the concepts in PurBayes. For example, the mixture model could alternatively be formulated to characterize tumor cellularity as a continuous distribution using semi-parametric approaches. Integration of CNV and ploidy information will also make PurBayes a more effective estimator.
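The recommendation to filter CNV regions before estimation can be sketched as below. The allele-fraction formula is standard read-mixture reasoning and is not stated in the source: it assumes reads are sampled in proportion to copy number, with diploid normal cells. The data layout (tuples of chromosome, position, N, Y), the inclusive region coordinates, and the function names are all hypothetical.

```python
def expected_vaf(purity, tumor_copies, mutant_copies, normal_copies=2):
    """Expected mutant allele fraction at a somatic SNV.

    Assumes reads are drawn in proportion to copy number from a mix of
    tumor cells (fraction `purity`) and normal cells.
    """
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1.0 - purity) * normal_copies
    return purity * mutant_copies / (tumor_alleles + normal_alleles)

def drop_cnv_snvs(snvs, cnv_regions):
    """Keep only SNVs that fall outside every CNV region (inclusive ends)."""
    kept = []
    for chrom, pos, n, y in snvs:
        in_cnv = any(c == chrom and start <= pos <= end
                     for c, start, end in cnv_regions)
        if not in_cnv:
            kept.append((chrom, pos, n, y))
    return kept

# Copy-neutral heterozygous SNV: the fraction is purity / 2, as in the model.
neutral_vaf = expected_vaf(0.6, tumor_copies=2, mutant_copies=1)

# Same purity, but the mutant allele is duplicated (3 copies, 2 mutant).
gained_vaf = expected_vaf(0.6, tumor_copies=3, mutant_copies=2)
implied_purity = 2.0 * gained_vaf  # what the lambda / 2 model would infer

snvs = [("chr1", 1000, 100, 30), ("chr1", 5000, 150, 70), ("chr2", 1000, 90, 27)]
cnv_regions = [("chr1", 4000, 6000)]  # e.g. from a parallel CNV analysis
filtered = drop_cnv_snvs(snvs, cnv_regions)
```

With a true purity of 0.6, the duplicated allele yields an implied purity above 0.9 under the λ/2 model. An SNV like this can look like an extra high-cellularity population, which is why such loci are removed before N and Y are passed to the estimator.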