Expanding the computational toolbox for mining cancer genomes

Corresponding author: Li Ding
Director of Computational Biology, Oncology
Washington University School of Medicine, St. Louis, MO

Sample procurement, sequencing and analysis roadmap.

1. Sequencing strategies

Shift from WES to WGS: WGS data are therefore considered to be the unbiased 'gold standard'.

1.1 Traditional sequencing analyses
In practice, detecting all germline and somatic aberrations is a formidable challenge, owing to limitations in current analysis algorithms and to the quantity and quality of sequence data.
1.2 Subclonal analyses
Cancer progression has long been known to be a fundamentally clonal process, and sequence coverage is now becoming sufficiently deep to permit detection of the low-prevalence events that are routinely associated with tumour subclones. Multisite and/or multistage sequencing and tumour sectioning experiments have begun to identify the founding clones and subclones that contribute to cancer progression.
1.3 Single-cell sequencing
Pioneering work on assessing CNAs in multiple tumour subpopulations was followed by single-cell sequencing using whole-genome amplification (WGA) of DNA extracted from nuclei that were sorted by flow cytometry.
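Several challenges remain, such as amplification bias in degenerate oligonucleotide-primed (DOP) WGA and in multiple displacement amplification (MDA) techniques. In DOP WGA, the primers carry a 6-bp random sequence at the 3' end that anneals at random to genomic DNA, thereby amplifying the whole genome. MDA uses random primers and isothermal amplification to obtain large, high-fidelity DNA fragments, but its main drawbacks are uneven genome coverage, amplification bias, chimeric sequences and nonspecific amplification. These biases produce non-uniform coverage and therefore make it difficult to identify somatic changes, including SNVs, CNAs and structural aberrations. Detection sensitivity is most affected by allelic dropout, which is caused by preferential amplification of one of the two alleles; reported allelic dropout rates are 8-40%. Large CNAs can still be detected at low genome coverage (for example, 5-6%), whereas uneven coverage makes the analysis of smaller CNAs and structural variants extremely difficult.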

2. Dissecting genomic changes in cancer

The following table lists computational tools for annotating and interpreting mutations in cancer genomes.

Program	Function	Synopsis	Refs
*SNV and indel detection*
Bassovac	SNV and indel detection	Bayesian approach with tumour or normal impurity and clonality	–
GATK	SNV and indel detection	Analysis framework using MapReduce	23
JointSNVMix	SNV detection	Binomial/multinomial probability with pre-filtering	31
MuTect	SNV and indel detection	Bayesian probability with pre- and post-filtering	28
Pindel	Indel detection	Pattern growth learning method	38
SNVMix	SNV detection	Binomial mixture model	30
SomaticSniper	SNV and indel detection	Bayesian probability with posterior filtering	27
Strelka	SNV and indel detection	Bayesian probability with posterior filtering	29
VarScan	SNV and indel detection	Fisher exact test, filtering and FDR correction	24,25
*Copy-number aberration, structural variant and gene fusion detection*
BreakDancer	Structural variant and indel detection	Kolmogorov–Smirnov test on discordant reads	54
BreakFusion	Gene fusion detection	Alignment-based pipeline for transcriptomic data	68
BreakTrans	Gene fusion mapping	Integration of fusion discovery and breakpoint tools	73
ChimeraScan	Chimeric transcription detection	Discordant read pairs with posterior filtering	67
CREST	Structural variant detection	Heuristics and binomial test on soft-clipped reads	55
deFuse	Gene fusion detection	Dynamic programming split and discordant reads	65
DELLY	Structural variant detection	Integrated method of discordant and split reads	40
GASV-Pro	Structural variant detection	Plane sweep for segment intersection	57
Genome STRiP	Structural variant detection	Depth and split or discordant reads on populations	59
Hydra	Structural variant detection	Discordant reads with assembly validation	139
LUMPY	Structural variant detection	Integrated method of discordant and split reads	167
TIGRA	Structural variant detection	de Bruijn graph-based assembly	42
*Level I annotation and interpretation*
ABSOLUTE	Purity, ploidy and clonality prediction	Optimization of logarithmic scores	148
ANNOVAR	Functional prediction	Annotation-based prediction	74
ASCAT	Purity, ploidy and clonality prediction	Goodness-of-fit ranking of candidate solutions	168
TUSON Explorer	Gene classification	Oncogene or tumour suppressor discovery using mutational signatures	100
CHASM	Functional prediction	Random forest classifier	84,85
MutationAssessor	Functional prediction	Conservation-based prediction (entropy score)	83
PolyPhen2	Functional prediction	Probability model based on structure and alignment	81,169
SciClone	Tumour clonality prediction	Bayesian mixture model	–
SIFT	Functional prediction	Conservation-based prediction	82
SNPeff	Functional prediction	Annotation and coding effect prediction	75
THetA	Purity, ploidy and clonality prediction	Maximum likelihood of mixture composition	151
VEP	Functional prediction	Annotation-based prediction	170
*Level II annotation and interpretation*
Dendrix	Mutation analysis	De novo discovery of mutually exclusive mutations	128
HotNet	Network analysis	Diffusion model for significant networks	119
MEMo	Network analysis	Network modules with mutual exclusivity	122
MuSiC	Mutation analysis	Framework for significance analysis of mutations	92
Multi-Dendrix	Mutation analysis	De novo discovery of multiple sets of exclusive mutations	129
MutSigCV	Mutation analysis	Gene significance with variable background mutation rate	93
NBS	Network analysis	Clustering using non-negative matrix factorization	121
Oncodrive-CIS and OncodriveCLUST	Mutation analysis	Z-statistics for copy numbers of driver genes	171,172
PARADIGM	Gene expression analysis	Network analysis of gene expression	126
PathScan	Pathway analysis	Probability model for mutation-enriched pathways	109
TieDIE	Network analysis	Network diffusion model linking mutations to gene expression	125

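As a rule of thumb, candidate events called by multiple independent algorithms are less likely to be false positives than events called by any single algorithm. Multicaller strategies are therefore becoming more common, although they also affect the sensitivity of the results. However, the number of possible tool combinations is very large, which makes an exhaustive comparison hard to carry out.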
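The multicaller idea can be sketched as a simple voting scheme. The snippet below is a minimal illustration, not any published pipeline: the caller labels, the variant tuples and the `min_callers` threshold are all hypothetical, and real pipelines would first normalize variant representations across callers.

```python
from collections import Counter

def consensus_calls(calls_by_caller, min_callers=2):
    """Return the variants reported by at least `min_callers` callers."""
    support = Counter()
    for variants in calls_by_caller.values():
        # Use a set so each caller votes at most once per variant.
        support.update(set(variants))
    return {variant for variant, n in support.items() if n >= min_callers}

# Toy call sets keyed by hypothetical caller labels.
# Each variant is a (chrom, pos, ref, alt) tuple.
calls = {
    "caller_a": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    "caller_b": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")],
    "caller_c": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
}

strict = consensus_calls(calls, min_callers=3)   # intersection of all callers
majority = consensus_calls(calls, min_callers=2)
union = consensus_calls(calls, min_callers=1)    # most sensitive, least specific
```

Raising `min_callers` trades sensitivity for specificity, which is the trade-off noted above.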

2.1 SNV detection
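SNV detection algorithms: GATK, VarScan, SAMtools, SomaticSniper, MuTect, Strelka, JointSNVMix and SNVMix. The first three can handle both germline and somatic variants, whereas the others call somatic mutations using tumour and matched normal genomic sequences.
Although the variant allele fraction (VAF) of a heterozygous variant is expected to be 50% in germline samples, this figure does not hold for somatic mutations in tumours, mainly because of normal-tissue contamination and/or tumour heterogeneity. Algorithm development currently focuses on handling somatic mutations across a wide range of VAFs. One example is Bassovac, which accounts for impurity in both directions (normal cells in the tumour sample and tumour cells in the normal sample) and for tumour subclonal structure (that is, heterogeneity) when calling variants.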
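To see why somatic VAFs depart from 50%, a back-of-the-envelope model helps. The sketch below assumes a simple mixture of tumour and normal cells with a known purity, cancer cell fraction (CCF) and local copy number. The function name and parameters are illustrative assumptions, not part of Bassovac or any other tool named here.

```python
def expected_vaf(purity, ccf=1.0, mutant_copies=1, tumour_cn=2, normal_cn=2):
    """Expected VAF of a somatic variant in a tumour/normal cell mixture.

    purity        -- fraction of cells in the sample that are tumour cells
    ccf           -- fraction of tumour cells that carry the variant
    mutant_copies -- mutated copies per variant-carrying tumour cell
    tumour_cn     -- total copy number at the locus in tumour cells
    normal_cn     -- total copy number at the locus in normal cells
    """
    if not 0.0 <= purity <= 1.0 or not 0.0 <= ccf <= 1.0:
        raise ValueError("purity and ccf must lie in [0, 1]")
    mutant_alleles = purity * ccf * mutant_copies
    total_alleles = purity * tumour_cn + (1.0 - purity) * normal_cn
    if total_alleles == 0:
        raise ValueError("no alleles at this locus")
    return mutant_alleles / total_alleles

clonal_pure = expected_vaf(purity=1.0)               # heterozygous, pure tumour
clonal_impure = expected_vaf(purity=0.6)             # 40% normal contamination
subclonal_impure = expected_vaf(purity=0.6, ccf=0.5)
```

Under these assumptions, a clonal heterozygous variant in a 60%-pure diploid tumour is expected at a VAF of 0.30, and a subclonal variant present in half of the tumour cells at 0.15.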
2.2 Indel detection
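Indel detection remains challenging, mainly because indels occur at lower frequencies than SNVs and because reads that contain them are difficult to map.
Most tools by default allow two mismatches and no gaps in 'seeded' regions (that is, in the first 28 bp of a read), so reads containing indels often fail to align correctly. Paired-end mapping helps to discover large indels that are flanked by the two read ends; gapped alignment, split-read analysis and de novo assembly are currently the common approaches to indel detection. VarScan (ref. 25) and the GATK Unified Genotyper are based on heuristics for indel calling that use raw statistics such as coverage, number of indel-supporting reads, read mapping qualities and mismatch counts.
Many existing tools perform well on short indels (< 5-8 bp) but lack a high true-positive rate. In addition, they generally fail to detect medium-sized indels, including some known 'druggable' and/or prognostic events. Finally, detection in low-complexity regions (such as homopolymers) is especially challenging. SAMtools and Dindel can call short indels. Pindel and DELLY use a pattern-growth approach borrowed from protein data analysis to detect indel breakpoints, and Pindel achieves high accuracy. Burrows-Wheeler Aligner (BWA)-MEM (ref. 41) allows better discovery of long indels and SVs, and local de novo assembly or multiple alignment can reduce the number of false-positive indels.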
2.3 CNA and structural variant detection
Accurate inference of copy number from sequence data requires normalization procedures that consider certain biases inherent to short-read sequencing methods (such as GC content and library biases). Approaches have been implemented for both GC-based coverage normalization and mapping bias.
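One common form of GC-based coverage normalization can be sketched as follows. This is a minimal illustration under stated assumptions: read depth is summarized per genomic window, each window has a precomputed GC fraction, and the median depth of windows with similar GC content is used as the correction factor. The function name and `bin_width` are hypothetical, and no specific tool's procedure is implied.

```python
from collections import defaultdict
from statistics import median

def gc_normalize(depths, gc_fractions, bin_width=0.05):
    """Rescale window depths so that every GC bin has the same median depth."""
    if len(depths) != len(gc_fractions):
        raise ValueError("depths and gc_fractions must have equal length")
    if not depths:
        return []
    # Group window depths by GC bin.
    by_bin = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        by_bin[int(gc / bin_width)].append(depth)
    overall = median(depths)
    bin_median = {b: median(values) for b, values in by_bin.items()}
    normalized = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_median[int(gc / bin_width)]
        # Windows in a bin with zero median depth carry no usable signal.
        normalized.append(depth * overall / m if m > 0 else 0.0)
    return normalized

# Toy data: high-GC windows are over-represented purely because of GC bias.
depths = [20, 20, 40, 40]
gc = [0.31, 0.33, 0.61, 0.63]
corrected = gc_normalize(depths, gc)
```

In this toy case the twofold depth difference is driven entirely by GC content, so all four corrected values coincide. A real copy-number gain would remain visible as a deviation from the median of its GC bin.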
Identifying recurrent CNAs: Genomic Identification of Significant Targets in Cancer (GISTIC) and correlation matrix diagonal segmentation (CMDS) have been developed for the identification of recurrent CNAs.
Detecting multiple types of structural variation (deletions, tandem or inverted duplications, inversions, insertions and translocations): BreakDancer, CREST (clipping reveals structure), VariationHunter, geometric analysis of structural variants (GASV)-Pro and Genome STRucture In Populations (Genome STRiP).
2.4 Gene fusion detection
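Discovering gene fusions from RNA-seq data: TopHat-fusion, deFuse, MapSplice, ChimeraScan and BreakFusion.
Gene fusions can arise from simple translocations that involve only two distant loci or from complex rearrangements that involve multiple distant loci. Comrad and nFuse both align raw WGS and RNA-seq reads and validate fusions and genomic breakpoints at the same time.
Comrad and nFuse can account for ambiguous read alignments and thereby minimize errors caused by misalignment.
The authors recently developed BreakTrans, which jointly analyses WGS and RNA-seq data to test hypotheses generated by other tools (such as TopHat-fusion, MapSplice, BreakDancer and CREST) and to further characterize the mechanistic components of gene fusions.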

3. Driver mutations and pathways

3.1 Annotations and functional predictions
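Genes and transcripts: RefSeq, Ensembl and GENCODE
Regulatory elements: ENCODE, TransFac and RegulomeDB
Non-coding RNAs: NONCODE, BodyMap and miRBase
Protein annotation: Pfam and InterPro
Integrated annotation: ANNOVAR and SNPeff annotate transcript variants; SKIPPY predicts cryptic splicing effects; VEP, FunSeq and SNPnexus extend support to the annotation of non-coding elements and regulatory features; VAAST (Variant Annotation, Analysis and Search Tool) and GEMINI (Genome MINIng) allow comprehensive analysis and integration of coding variants, non-coding variants, regulatory elements and phenotypes
Deleteriousness: PolyPhen, SIFT, MutationAssessor and Condel
Protein post-translational modifications: ActiveDriver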
3.2 Significantly mutated genes
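One way to detect driver mutations is to distinguish them from the background mutation rate (BMR). The BMR is difficult to measure because many factors affect it, including differences in gene length, expression level and replication timing, variation among samples and errors in upstream analyses. The BMR varies not only among patients with the same cancer type but also among cancer types, where it may be related to environmental factors and viral signatures. Finally, incorrect or biased annotation of mutations can lead to false positives. Insufficient sequence coverage of genes exacerbates these problems. MuSiC and MutSig are designed to address these issues.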
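The basic logic of testing a gene against the BMR can be sketched with a binomial tail probability. The sketch assumes a single uniform BMR per base per sample, which is exactly the simplification that the factors listed above undermine; it is not how MuSiC or MutSig are implemented. The function name and the example numbers are hypothetical.

```python
from math import comb

def binomial_tail(n_sites, n_mutations, bmr):
    """P(X >= n_mutations) for X ~ Binomial(n_sites, bmr).

    n_sites     -- covered bases in the gene multiplied by the number of samples
    n_mutations -- observed somatic mutations in the gene across the cohort
    bmr         -- assumed uniform background mutation rate per base per sample
    """
    if not 0.0 <= bmr <= 1.0:
        raise ValueError("bmr must lie in [0, 1]")
    if n_mutations <= 0:
        return 1.0
    # Sum the probabilities of observing fewer mutations than were seen.
    below = sum(
        comb(n_sites, k) * bmr**k * (1.0 - bmr) ** (n_sites - k)
        for k in range(n_mutations)
    )
    return max(0.0, 1.0 - below)

# Hypothetical gene: 1,500 covered bases in 200 samples, 8 mutations observed,
# assumed BMR of 2 mutations per megabase.
p_value = binomial_tail(n_sites=1500 * 200, n_mutations=8, bmr=2e-6)
```

A small tail probability suggests that the gene is mutated more often than the assumed background predicts. In practice the result is only as good as the BMR estimate, and p-values need correction for the number of genes tested.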
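Another way to distinguish driver mutations from passenger mutations is to check whether mutations cluster at specific residues in the protein sequence. The '20/20 rule' proposes that a gene should be classified as an oncogene if at least 20% of its mutations are missense mutations (or identical in-frame indels) located at a specific residue. Conversely, if at least 20% of the mutations are inactivating (that is, nonsense, frameshift, splice-site or stop-codon read-through mutations), the gene can be classified as a tumour suppressor. This approach is now complemented by algorithms that use more rigorous statistical scores to evaluate patterns of mutational signal, as well as the clustering of mutations in the protein sequence or in the three-dimensional protein structure.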
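The 20/20 rule as summarized in these notes can be sketched in a few lines. The consequence labels, the `min_recurrence` guard and the handling of genes that pass both thresholds are assumptions made for illustration; they are not taken from the original rule.

```python
from collections import Counter

# Hypothetical consequence labels; map them from your annotator's output.
ACTIVATING = {"missense", "inframe_indel"}
INACTIVATING = {"nonsense", "frameshift", "splice_site", "stop_readthrough"}

def twenty_twenty(mutations, threshold=0.20, min_recurrence=2):
    """Classify a gene from a list of (consequence, residue) tuples."""
    total = len(mutations)
    if total == 0:
        return {"oncogene_score": 0.0, "tsg_score": 0.0, "label": "unclassified"}
    # Count activating-type mutations per (consequence, residue) site.
    sites = Counter((c, r) for c, r in mutations if c in ACTIVATING)
    top = max(sites.values(), default=0)
    # Assumption: a site is a hotspot only if hit at least min_recurrence times.
    oncogene_score = top / total if top >= min_recurrence else 0.0
    tsg_score = sum(1 for c, _ in mutations if c in INACTIVATING) / total
    is_oncogene = oncogene_score >= threshold
    is_tsg = tsg_score >= threshold
    if is_oncogene and is_tsg:
        label = "ambiguous"
    elif is_oncogene:
        label = "oncogene"
    elif is_tsg:
        label = "tumour_suppressor"
    else:
        label = "unclassified"
    return {"oncogene_score": oncogene_score, "tsg_score": tsg_score, "label": label}

# Toy genes: one with a hotspot at residue 600, one with many truncating events.
hotspot_gene = [("missense", 600)] * 4 + [("missense", r) for r in range(1, 7)]
truncated_gene = (
    [("nonsense", 10)] * 3
    + [("frameshift", 20)] * 2
    + [("missense", r) for r in range(100, 105)]
)
hotspot_result = twenty_twenty(hotspot_gene)
truncated_result = twenty_twenty(truncated_gene)
```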
3.3 Pathway and network analyses
Pathway and network analyses: (1) analysing known pathways, which are represented as gene sets; (2) analysing interaction networks to implicitly build pathways de novo.
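Approach 1: A straightforward way to evaluate combinations of mutated genes is to examine the overlap between the list of mutated genes and predefined gene sets of known biological function, drawn from resources such as KEGG, GO and MSigDB. For example, suppose we have a list of mutated genes (M) and want to know whether it contains genes that regulate the cell cycle. Using the KEGG database, we find a list of more than 20 cell-cycle genes (L). Two statistical tests can be used to check whether M and L overlap significantly. First, if M is ranked (for example, by one of the mutation significance scores described above), gene set enrichment analysis (GSEA) can determine whether the genes in L lie near the top of the ranked list M. Second, if M is unranked, the hypergeometric test can be used to evaluate the overlap between M and L.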
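The hypergeometric test for the unranked case can be written with the standard library alone. The sketch below computes the upper-tail probability of seeing at least the observed overlap between M and L by chance. The universe size and the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """P(X >= overlap) when `list_size` genes are drawn without replacement
    from `universe_size` genes, of which `set_size` belong to the gene set."""
    upper = min(set_size, list_size)
    if not 0 <= overlap <= upper:
        raise ValueError("overlap is outside the feasible range")
    if list_size > universe_size or set_size > universe_size:
        raise ValueError("set_size and list_size cannot exceed universe_size")
    favourable = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe_size, list_size)

# Hypothetical numbers: 20,000 genes in total, 25 cell-cycle genes (L),
# 200 mutated genes (M), 5 of which are cell-cycle genes.
p_value = hypergeom_enrichment_p(20000, 25, 200, 5)
```

When many gene sets are tested, the resulting p-values need multiple-testing correction, and overlapping gene sets make the tests non-independent.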
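Approach 2: The analysis above has several limitations. (1) Human gene annotations and pathway databases remain incomplete, and there is extensive crosstalk between pathways, which implies that decisions regarding the genes that form the boundary of a pathway are arbitrary to some extent. (2) The crosstalk is represented in gene-set and pathway databases by the presence of multiple overlapping gene sets, which complicates the interpretation of reported enrichments. (3) Finally, signalling and regulatory pathways have a rich topology of activating and inhibitory interactions, and this information is not represented in the list of genes or proteins that are members of the pathway, so enrichment analysis cannot capture activation and inhibition.
To overcome these limitations, a second approach to analysing combinations of mutations uses biological interaction networks in place of gene sets to identify combinations of mutations that merit further evaluation. However, most biological networks have an uneven topology that is characterized by the presence of hubs (highly connected nodes). HotNet is a method for finding subnetworks of a large interaction network that harbour more mutations than would be expected by chance. HotNet has been used to identify subnetworks in several cancer types analysed in TCGA, for example, mutations involving the Notch signalling pathway in ovarian cancer. Other tools include network-based stratification (NBS), MEMo and Tied Diffusion Through Interacting Events (TieDIE).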
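Approach 3: A third way to analyse combinations of mutations is to identify sets of mutually exclusive mutations, which can point to combinations of driver mutations. MEMo applies this concept to genes with known interactions; alternatively, mutually exclusive gene sets can be discovered de novo, without restricting the gene sets in advance (Dendrix, Multi-Dendrix and RME).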
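Mutual exclusivity can be scored with a simple coverage-minus-overlap weight. The sketch below rewards gene sets that cover many samples and penalizes samples that carry mutations in more than one gene of the set. It is in the spirit of de novo exclusivity methods, but the exact form should be treated as an assumption rather than a statement of how any named tool is implemented. Gene and sample names are toy values.

```python
def exclusivity_weight(mutated_samples, gene_set):
    """Score a gene set as coverage minus coverage overlap.

    mutated_samples -- dict mapping gene -> set of samples mutated in that gene
    gene_set        -- iterable of genes to score together
    """
    covered = set()
    total_hits = 0
    for gene in gene_set:
        samples = mutated_samples.get(gene, set())
        covered |= samples
        total_hits += len(samples)
    # overlap = total_hits - len(covered); weight = len(covered) - overlap
    return 2 * len(covered) - total_hits

# Toy cohort: GENE_A and GENE_B never co-occur; GENE_A and GENE_C always do.
cohort = {
    "GENE_A": {"s1", "s2"},
    "GENE_B": {"s3", "s4"},
    "GENE_C": {"s1", "s2"},
}
exclusive_score = exclusivity_weight(cohort, ["GENE_A", "GENE_B"])
co_occurring_score = exclusivity_weight(cohort, ["GENE_A", "GENE_C"])
```

A perfectly exclusive set scores as many points as the samples it covers, whereas co-occurring genes cancel each other out. Searching over all gene sets for high-scoring candidates is the computationally hard part and is not shown here.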

4. Genome integrity and clonal architectures

4.1 Kataegis, chromothripsis and chromoplexy
One of the most striking findings in TCGA is the existence of genomes with extreme numbers and types of mutations.
Kataegis is the occurrence of an unusually large number of SNVs clustered in a single locus; it was first reported in breast tumours and subsequently in other cancer types.
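Clusters of this kind are often screened for by looking at the distances between neighbouring mutations. The sketch below flags runs of closely spaced mutations on a single chromosome. The thresholds `max_gap` and `min_mutations` are illustrative assumptions rather than values taken from these notes, and the positions are toy data.

```python
def find_mutation_clusters(positions, max_gap=1000, min_mutations=6):
    """Return (start, end, count) for runs of mutations on one chromosome in
    which every neighbouring pair lies at most `max_gap` bp apart."""
    pos = sorted(positions)
    clusters = []
    run_start = 0
    for i in range(1, len(pos) + 1):
        # Close the current run at the end of the list or at a large gap.
        if i == len(pos) or pos[i] - pos[i - 1] > max_gap:
            if i - run_start >= min_mutations:
                clusters.append((pos[run_start], pos[i - 1], i - run_start))
            run_start = i
    return clusters

# Toy positions: one dense run of seven mutations between two isolated ones.
positions = [100, 50_000, 50_200, 50_500, 50_900, 51_300, 51_800, 52_100, 900_000]
clusters = find_mutation_clusters(positions)
```

Real analyses would run this per chromosome and per sample, and would usually also check the substitution types within each cluster.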
Chromothripsis is a phenomenon in which one or more loci undergo a catastrophic event of simultaneous breakage and aberrant repair at multiple breakpoints in a single cell division. It was originally reported in ~2-3% of all cancers but was shown to be particularly common in bone cancers (~25%), and it was later found that it may be associated with TP53 mutations. Chromoplexy is a similar event that was discovered in prostate cancer.
4.2 Defining clonal architecture in heterogeneous tumours
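All of the genomic alterations discussed above play a part in clonal evolution.
ABSOLUTE adds a best-fit CNA model and a karyotype likelihood model.
PyClone uses hierarchical Bayesian clustering to identify clones.
SciClone uses a Bayesian mixture model to examine multiple samples from a patient, either over time (using primary and relapse tumour samples) or across space (using multiple biopsy samples).
The Tumor Heterogeneity Analysis (THetA) algorithm accounts for the presence of CNAs, which confound the analysis of VAFs.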

5. Conclusion: basic and clinical applications

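In the short time since cancer genomics entered the biomedical field, it has made many fundamental contributions:
First, cancer-associated genes and pathways have been identified;
Second, germline susceptibility has been established;
Third, technologies and algorithms have been continually improved;
Fourth, large data sets have been organized and documented;
Finally, knowledge has been catalogued in new databases.
Future challenges:
The 'data spectrum' and associated analysis tools are not yet complete (for example, proteomic data);
The second factor is the reality of cost;
The next chapter of cancer research will undoubtedly push further towards clinical application and engage large pharmaceutical companies more closely in the development of new therapeutics.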
