Related Knowledge
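Heterogeneity
- Tumor heterogeneity is one of the hallmarks of malignant tumors. As a tumor grows and its cells divide repeatedly, the daughter cells acquire molecular or genetic changes, which leads to differences in growth rate, invasiveness, drug sensitivity, prognosis, and other properties. Put simply, a single tumor can contain cells of many different genotypes or subtypes. As a result, the same type of tumor can show different treatment responses and prognoses in different patients, and even the tumor cells within one patient can differ in their characteristics.
- More specifically, tumor heterogeneity means that the somatic mutations carried by different tumor cells or subpopulations within the tumor tissue are not all the same.
Tumor purity
- A tumor sample always contains cancer cells mixed with an unknown proportion of normal cells. The proportion of cancer cells in the tumor sample is called the tumor purity.
SNV
- An SNV (single-nucleotide variant) is a site in the genome where a single base has changed. SNVs are widely distributed across the genome.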
Abstract
PurBayes is a method that estimates tumor purity and detects intratumor heterogeneity from next-generation sequencing (NGS) data of paired tumor-normal tissue samples, using finite mixture modeling.
Introduction
With advances in high-throughput next-generation sequencing (NGS) technologies, sequencing of tumor-normal tissue pairs is becoming commonplace in cancer studies. Often, the sampled tumor tissue is contaminated with stromal cells, resulting in a mixture of tumor and normal sequence data in the tumor sample. There has been recent interest in accurately estimating tumor purity levels in tumor data analysis, including methods specific to NGS data such as PurityEst.
However, a subset of the observed somatic mutations may be subclonal because of intratumor heterogeneity. Unlike clonal mutations, which are observed tumor-wide, subclonal mutations will be observed at cellularities below the tumor purity level, and they will therefore bias purity estimates made under an assumption of tumor tissue homogeneity. By modeling this heterogeneity, it may also be possible to make inferences about tumor evolution and founder events. To date there are no methods that aim to both quantify tumor purity and detect intratumor heterogeneity using NGS data.
In this article, we present a Bayesian mixture modeling approach, PurBayes, for estimating tumor purity and subclonality from NGS data. It yields posterior distributions of tumor cellularities from which credible intervals (CI) can be derived. To illustrate its implementation, we conduct a simulation study under a variety of conditions and discuss the performance of PurBayes on synthetic data.
Methods
- Consider a set of S heterozygous loci observed in a given tumor sequencing sample because of somatically acquired single-nucleotide variants (SNVs). Each SNV i can be represented by its normal and mutant allele read counts X_i and Y_i. The total number of sample reads N_i = X_i + Y_i can in turn be decomposed into tumor and normal tissue read counts N_{t,i} and N_{w,i}, such that N_i = N_{w,i} + N_{t,i}. Because it cannot be directly determined which cell type each individual read came from, N_{t,i} and N_{w,i} are latent variables. If we assume N_{t,i} to be binomially distributed, such that N_{t,i} ~ Bin(N_i, λ) where λ is the tumor sample purity, and Y_i | N_{t,i} ~ Bin(N_{t,i}, 0.50), then Y_i follows a binomial-binomial hierarchical mixture model with marginal distribution Y_i ~ Bin(N_i, λ/2).
- Consider a tumor that exhibits intratumor heterogeneity. If we assume subclonal mutations cluster into an a priori finite number of J-1 subclonal populations, Y can be modeled under a Bayesian finite mixture model. Let K_j denote the probability that a mutation corresponds to variant population j with cellularity λ_j, for j = 1, ..., J, such that Σ_j K_j = 1, λ_1 < ... < λ_J, and λ_J ≡ λ, with uniform priors on λ_j. To obtain a data-driven value for J, PurBayes generates model fits iteratively: it initially assumes tumor homogeneity and then increases the subclonal population count by one until an optimal model fit is achieved under a penalized expected deviance (PED) criterion.
- Mapping bias can result in non-reference alleles at heterozygous loci being mapped at rates < 0.50, which would affect tumor purity estimation. PurBayes can accommodate this bias by estimating it from additional reference and alternate allele counts in heterozygous normal tissue variant calls.
- PurBayes is implemented in the statistical programming language R and uses the MCMC software JAGS. The only inputs required by PurBayes are the tumor tissue read counts (N and Y) for a set of high-confidence SNVs, which can easily be derived from the output file formats of most variant calling software for NGS data.
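The binomial-binomial hierarchy described in the first bullet can be checked with a small simulation. The sketch below is not part of PurBayes (which is written in R with JAGS). It is a minimal Python illustration, and the function names, depths, and the moment estimator 2·ΣY/ΣN are assumptions introduced here. It draws N_t from Bin(N, λ), then Y from Bin(N_t, 0.5), and confirms that the pooled mutant read fraction is close to λ/2, as the marginal Y ~ Bin(N, λ/2) implies.

```python
import random

def simulate_mutant_reads(depth, purity, rng):
    """Draw Y for one SNV: N_t ~ Bin(depth, purity), then Y | N_t ~ Bin(N_t, 0.5)."""
    tumor_reads = sum(rng.random() < purity for _ in range(depth))
    return sum(rng.random() < 0.5 for _ in range(tumor_reads))

def naive_purity(total_reads, mutant_reads):
    """Moment estimate implied by the marginal Y ~ Bin(N, purity / 2).

    This is an illustrative estimator, not the PurBayes posterior.
    """
    return 2.0 * sum(mutant_reads) / sum(total_reads)

rng = random.Random(1)
true_purity = 0.6
N = [100] * 2000  # hypothetical read depths, one entry per SNV
Y = [simulate_mutant_reads(n, true_purity, rng) for n in N]
estimate = naive_purity(N, Y)
print(round(estimate, 3))  # should land close to true_purity
```

The estimator ignores mapping bias and heterogeneity, so it is useful only as a sanity check on the model structure.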
Simulation
To illustrate the performance of PurBayes under a variety of conditions, we conducted simulation studies based on real sequencing data from the 1000 Genomes Project (details in Supplementary Materials). We first simulated read count data for homogeneous tumors ranging in purity from 20% to 80%, with S = 100 and average sequencing depths of 50x and 100x. We ran 100 replications of each unique set of conditions and examined the PurBayes posterior median estimates. We ran similar simulations for heterogeneous tumor data with J = 2 at 100x for various values of K_j and λ_j to determine how well PurBayes can detect intratumor heterogeneity and estimate tumor purity. For each application, we also simulated read count data from 100 additional germline variant calls to account for mapping bias. For comparison, we also applied the PurityEst algorithm to each simulation replicate.
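The homogeneous-tumor part of this design can be sketched in a few lines. The following is a simplified stand-in rather than the authors' procedure: read depths are drawn uniformly around the mean instead of being derived from 1000 Genomes data, mapping bias is ignored, and the posterior is evaluated on a grid under a uniform prior instead of being sampled with JAGS. All function names are hypothetical.

```python
import math
import random

def posterior_median(total_reads, mutant_reads, grid_size=999):
    """Posterior median of purity on a grid over (0, 1), uniform prior,
    likelihood from the pooled marginal Y ~ Bin(N, purity / 2)."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    log_post = [
        mutant_reads * math.log(g / 2.0)
        + (total_reads - mutant_reads) * math.log(1.0 - g / 2.0)
        for g in grid
    ]
    top = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in log_post]
    half = sum(weights) / 2.0
    running = 0.0
    for g, w in zip(grid, weights):
        running += w
        if running >= half:
            return g
    return grid[-1]

def simulate_sample(purity, n_snvs, mean_depth, rng):
    """Hypothetical generator: depths uniform around mean_depth, Y from the marginal."""
    N = [rng.randint(mean_depth // 2, 3 * mean_depth // 2) for _ in range(n_snvs)]
    Y = [sum(rng.random() < purity / 2.0 for _ in range(n)) for n in N]
    return N, Y

rng = random.Random(7)
medians = []
for _ in range(20):  # the source uses 100 replications per condition
    N, Y = simulate_sample(purity=0.4, n_snvs=100, mean_depth=50, rng=rng)
    medians.append(posterior_median(sum(N), sum(Y)))
mean_median = sum(medians) / len(medians)
print(round(mean_median, 3))
```

Because all SNVs share one λ in the homogeneous case, the counts can be pooled into two sufficient statistics, which keeps the grid evaluation cheap.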
For each application of PurBayes, the first 50,000 iterations of the optimal MCMC model fit were discarded as burn-in before posterior sampling of 10,000 iterations. Mean per-sample execution time was 2 min on a workstation equipped with an Intel Core i5 3.10 GHz processor and 4 GB of random access memory.
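To make the burn-in step concrete, here is a minimal random-walk Metropolis sampler for the homogeneous model. It is a swapped-in illustration, not what JAGS does internally, and the read counts, starting value, and proposal step size are assumptions chosen for the example. It discards the first 50,000 draws and keeps the next 10,000, matching the iteration counts quoted above, then reads a posterior median and a 95% credible interval off the kept draws.

```python
import math
import random
import statistics

def log_post(purity, total_reads, mutant_reads):
    """Log-posterior (up to a constant) under a Uniform(0, 1) prior
    and the pooled marginal Y ~ Bin(N, purity / 2)."""
    if not 0.0 < purity < 1.0:
        return -math.inf
    p = purity / 2.0
    return mutant_reads * math.log(p) + (total_reads - mutant_reads) * math.log(1.0 - p)

def metropolis(total_reads, mutant_reads, burn_in, keep, step, rng):
    """Random-walk Metropolis; the first burn_in draws are discarded."""
    current = 0.5
    current_lp = log_post(current, total_reads, mutant_reads)
    kept = []
    for iteration in range(burn_in + keep):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_post(proposal, total_reads, mutant_reads)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        if iteration >= burn_in:
            kept.append(current)
    return kept

rng = random.Random(11)
# Hypothetical pooled counts: 1,500 mutant reads out of 5,000, i.e. purity near 0.6.
draws = metropolis(5000, 1500, burn_in=50_000, keep=10_000, step=0.02, rng=rng)
ordered = sorted(draws)
median = statistics.median(ordered)
lower, upper = ordered[249], ordered[9749]  # approximate 95% credible interval
print(round(median, 3), round(lower, 3), round(upper, 3))
```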
Results and Discussion
- For the homogeneous tumor simulations, PurBayes correctly identified tumor homogeneity in all replications. Distributions of the posterior median estimates of tumor purity for each value of λ and each method are displayed in Figure 1. Estimates from PurBayes and PurityEst were nearly identical, with a Pearson correlation of 0.9997. Both methods were accurate, tending toward overestimation at lower values of λ. When we applied PurBayes to heterogeneous data, the ability to detect heterogeneity was highly dependent on the disparity between cellularities. The proportion of clonal variants also affected detection, with larger values of K_1 leading to higher mean absolute error (MAE) of the posterior median purity estimates. Although PurityEst performed comparably under certain conditions, the ability of PurBayes to detect heterogeneity generally resulted in greater estimate accuracy.
- Our simulation results highlight the potential bias of tumor purity estimates in the presence of unaccounted intratumor heterogeneity. By simultaneously estimating tumor purity and subclonality, PurBayes may also provide additional advantages, such as facilitating inference about tumor composition and evolution and helping to isolate potential founder events. Because PurBayes is a Bayesian approach, measures of uncertainty are derived directly from the posterior distribution of λ_J in the form of CIs.
- One possible issue in applying PurBayes is that it may estimate J to be larger than the true value because of outlier observations, which leads to a positively biased tumor purity estimate. This can be especially problematic in the presence of copy number variation (CNV) and structural rearrangements. Because regions of CNV have a multiplicative effect on the number of mapped reads, SNVs within such regions will not truly reflect heterozygosity at a proportion of 0.50, and such SNVs would strongly influence estimation of J. We therefore anticipate that PurityEst, because of its robust estimation procedures, will perform better when CNVs are present and unaccounted for in purity estimation. It is thus highly recommended that regions identified as CNVs by parallel analyses be filtered out before estimation.
- We foresee a variety of extensions to the concepts in PurBayes. For example, the mixture model could alternatively be formulated to characterize tumor cellularity as a continuous distribution using semi-parametric approaches. Integrating CNV and ploidy information will also make PurBayes a more effective estimator.
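The bias from unaccounted heterogeneity mentioned in this discussion can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not a reproduction of the paper's simulations: it uses J = 2 with hypothetical cellularities 0.4 (subclonal) and 0.8 (clonal, equal to the tumor purity), equal mixing proportions, and a fixed depth of 100x. A purity estimate that assumes homogeneity then lands near the proportion-weighted average of the cellularities, well below the true purity.

```python
import random

def simulate_counts(cellularities, proportions, n_snvs, depth, rng):
    """Assign each SNV to population j with probability proportions[j],
    then draw Y ~ Bin(depth, cellularity_j / 2)."""
    N, Y = [], []
    for _ in range(n_snvs):
        lam = rng.choices(cellularities, weights=proportions)[0]
        N.append(depth)
        Y.append(sum(rng.random() < lam / 2.0 for _ in range(depth)))
    return N, Y

rng = random.Random(3)
cellularities = [0.4, 0.8]  # subclonal, clonal; tumor purity is the largest value
proportions = [0.5, 0.5]    # hypothetical K_1, K_2
N, Y = simulate_counts(cellularities, proportions, n_snvs=2000, depth=100, rng=rng)

# Estimate that wrongly assumes a single population (all SNVs clonal).
homogeneous_estimate = 2.0 * sum(Y) / sum(N)
# What that estimate converges to under the mixture: sum_j K_j * lambda_j.
expected_under_mixture = sum(k * lam for k, lam in zip(proportions, cellularities))
print(round(homogeneous_estimate, 3), round(expected_under_mixture, 3))
```

In this toy setting the homogeneous estimate underestimates purity, which is why separating the clonal cluster from the subclonal ones matters before reading purity off λ_J.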