Beginner's Handbook of Next Generation Sequencing
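The first half of these notes is summarized from: https://genohub.com/next-generation-sequencing-handbook/
Some background worth reading first:
Sequencing adapters and what they do
PCR duplicates in sequencing - I
PCR duplicates in sequencing - II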
I. Designing a Sequencing Run
Designing Your Next Generation Sequencing Run
For read length, 50 bp is generally enough for mapping to a reference.
In many cases, biological replicates offer more value than a large number of reads for a single sample.
Estimate of Coverage Requirements by Application Type
Application Type | Coverage |
---|---|
DNA-Seq (Re-Sequencing) | 30 - 80X |
DNA-Seq (De novo assembly) | 100X |
SNP Analysis / Rearrangement Detection | 10 - 30X |
Exome | 100 - 200X |
ChIP-Seq | 10 - 40X |
For more examples see the Sequencing Coverage Guide.
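Note that exome sequencing calls for noticeably higher depth than the other NGS applications listed here; this comes up again in the WES vs. WGS comparison below.
In general, higher coverage gives higher confidence in each sequenced base, but in practice the coverage and the number of reads you need also depend on the following factors: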
- Read length
- Genome size
- Application
- Established guidelines in the literature
- Gene expression level
- Genome complexity, repetitive regions
- Error rate of sequencing instrument or methodology
- Assembly algorithm
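The relationship between read count, read length, genome size and mean coverage can be sketched with the Lander-Waterman estimate C = N × L / G. This is a minimal sketch: the function names are made up for illustration, the human genome size of about 3.1 Gb is an assumed round figure, and the estimate ignores duplicates, unmappable reads and uneven coverage.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth C = N * L / G (Lander-Waterman estimate)."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    human_genome = 3_100_000_000  # assumption: approximate human genome size in bp
    for read_length in (50, 150):
        n = reads_needed(30, read_length, human_genome)
        print(f"30X with {read_length} bp reads needs about {n:,} reads")
```

For paired-end data each pair contributes two reads, so the number of fragments is half the number returned by `reads_needed`.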
Replication, Randomization and Multiplexing
The two main sources of variation that contribute to confounding factors (factors that interfere with the true biological signal and need to be accounted for) are:
- library effects that occur due to reverse transcription and amplification, and
- unit effects (sequencing lanes [Illumina and SOLiD], chips [Ion], plates [Roche 454]) such as poor base calling or bad sequencing cycles.
We recommend randomizing your samples by making sure each sequencing unit contains samples from both control and experimental groups. This can be done by barcoding or indexing your samples to allow for multiplexing. In other words, run samples from different groups together in the same run on the same instrument wherever possible; multiplexing is a good way to do this, since samples sharing a lane can still be told apart.
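One way to make sure every sequencing unit contains samples from each group is to shuffle within each group and then deal the samples out across lanes in turn. The sketch below is only an illustration: `assign_lanes`, the sample names and the fixed seed are all hypothetical, and it assumes each group has at least as many samples as there are lanes.

```python
import random


def assign_lanes(samples_by_group, n_lanes, seed=0):
    """Randomly spread each group's samples across lanes, round-robin.

    samples_by_group: dict mapping group name -> list of sample names.
    Returns a dict mapping lane number (1-based) -> list of (sample, group).
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    lanes = {lane: [] for lane in range(1, n_lanes + 1)}
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomize order within the group
        for i, sample in enumerate(shuffled):
            lanes[1 + i % n_lanes].append((sample, group))
    return lanes


if __name__ == "__main__":
    design = {
        "control": ["C1", "C2", "C3", "C4"],
        "treated": ["T1", "T2", "T3", "T4"],
    }
    for lane, members in assign_lanes(design, n_lanes=2, seed=1).items():
        print(lane, members)
```

Each sample still needs its own barcode or index so that reads from a shared lane can be separated afterwards.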
Poor Quality Sequencing Run
In general, the following reads need to be filtered out:
- Un-mappable reads
- PCR duplicates
- Low quality reads
- Adapter dimer or sequencing adapter reads
- Non-unique mapped reads or poor sequencing diversity
- Reads mapping to uninformative sequence (e.g. rRNA)
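The filter categories above can be pictured as a chain of checks applied to each read. The sketch below works on a made-up `Read` record rather than on real BAM data; every field name and both thresholds are hypothetical, and using a low mapping quality as a stand-in for "non-unique" is an assumption that depends on the aligner.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    # All fields are hypothetical, for illustration only.
    name: str
    mapped: bool
    mapq: int                 # mapping quality reported by the aligner
    mean_base_quality: float  # mean Phred score of the read
    is_duplicate: bool        # flagged as a PCR duplicate
    has_adapter: bool         # adapter dimer or adapter read-through
    feature: str              # annotation of the mapped locus, e.g. "rRNA"


def filter_reason(read: Read, min_mapq: int = 10,
                  min_base_quality: float = 20.0) -> Optional[str]:
    """Return why a read should be dropped, or None if it should be kept."""
    if not read.mapped:
        return "unmapped"
    if read.is_duplicate:
        return "PCR duplicate"
    if read.mean_base_quality < min_base_quality:
        return "low quality"
    if read.has_adapter:
        return "adapter"
    if read.mapq < min_mapq:  # assumption: low MAPQ marks ambiguous placement
        return "non-unique mapping"
    if read.feature == "rRNA":
        return "uninformative sequence"
    return None


def keep_reads(reads: List[Read]) -> List[Read]:
    """Keep only the reads that pass every check."""
    return [r for r in reads if filter_reason(r) is None]
```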
Optimizing flow cell loading and cluster densities
The amount of DNA loaded needs to be right.
If you load too little DNA, you’re likely to ‘under-cluster’ the flow cell. Under-clustering usually maintains data quality, but results in lower data output. If you load too much DNA, clusters will be too close together (over-clustering), resulting in poor image resolution and analysis problems.
Choosing between WGS and exome-sequencing
About WES:
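A major strength of WES is that it uses pre-designed oligo probes, so SNP calling at the targeted sites tends to be better than with WGS; this addresses cases where WGS coverage at specific sites is insufficient.
However, because WES probes are designed in advance, the quality of the probes and the soundness of their design have to be considered. WES is also less flexible than WGS in application, and capture probes are not guaranteed to exist for every species.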
Advantages of Whole Genome Sequencing
- SNVs, indels, SVs and CNVs can be detected in both coding and non-coding regions, whereas WES misses many important regulatory sites such as promoters and enhancers.
- WGS sequence coverage is more reliable: bias in the hybridization efficiency of WES probes can leave some regions with poor capture efficiency and very low coverage.
- WGS coverage is more uniform; good WES capture probes are hard to design for low-complexity genomic regions.
- PCR amplification bias is more pronounced in WES: PCR amplification isn’t required during WGS library preparation, reducing the potential for GC bias. WES frequently requires PCR amplification, as the bulk input amount needed for capture is generally ~1 ug of DNA.
- Read length can be chosen to suit the application with WGS, whereas WES probe length is fixed: sequencing read length isn’t a limitation with WGS. Most target probes for exome-seq are designed to be less than 120 nt long, making it meaningless to sequence using a greater read length.
- A lower average read depth is required to achieve the same breadth of coverage as WES.
- No capture-efficiency bias: WGS doesn’t suffer from reference bias. WES capture probes tend to preferentially enrich reference alleles at heterozygous sites, producing false negative SNV calls.
- WGS is applicable to a wider range of species.
Advantages of Whole Exome Sequencing
- The biggest advantage is lower sequencing cost and less data to store.
- Reduced costs make it feasible to increase the number of samples to be sequenced, enabling large population-based comparisons.
Many disease-associated variants require a sequencing depth of 100-120x, which makes WES the more practical choice.
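A rough back-of-the-envelope comparison shows why high depth is more affordable with WES. The sketch below estimates the raw bases to sequence as target size × depth / on-target fraction. The exome target size (40 Mb), the on-target fraction (0.7) and the genome size (3.1 Gb) are illustrative assumptions, not values from the handbook; real kits and runs differ.

```python
def bases_required(target_size_bp: int, mean_depth: float,
                   on_target_fraction: float = 1.0) -> float:
    """Raw bases to sequence so that the target reaches mean_depth on average.

    on_target_fraction: share of sequenced bases that land on the target
    (1.0 for WGS, lower for capture-based methods).
    """
    return target_size_bp * mean_depth / on_target_fraction


if __name__ == "__main__":
    # Assumed values for illustration only.
    wes = bases_required(40_000_000, 100, on_target_fraction=0.7)
    wgs = bases_required(3_100_000_000, 30)
    print(f"WES at 100x: {wes / 1e9:.1f} Gb")
    print(f"WGS at 30x:  {wgs / 1e9:.1f} Gb")
```

Under these assumptions a 100x exome needs far fewer sequenced bases than a 30x genome, which is where the savings in cost and storage come from.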
How deep does sequencing need to be for de novo mutation calling? That depends on how uniform the read coverage is and on the goal of the study:
It’s also important to remember that depth isn’t everything. The better your uniformity of reads and breadth of coverage, the higher the likelihood you’ll actually find de novo mutations and call them. And that’s the main goal: if you can’t call SNPs or INDELs with high sensitivity and accuracy, then even the highest-depth sequencing runs are worthless.
To conclude, whole genome sequencing typically offers better uniformity and balanced allele ratio calls. While greater exome-seq depth can match this, sufficient mapped depth or variant detection in specific regions may never reach the quality of WGS due to probe design failures or protocol shortcomings. These are important considerations when examining tissues like primary tumors where copy number changes and heterogeneity are confounding factors.
Targeted gene panels vs. whole exome sequencing
Both rely on hybridization-based technology, so the choice really comes down to what the study needs.
Advantages of targeting all exons – whole exome sequencing (WES)
If you just want to cast a wide net and have not yet settled on which genes to test for variants, use WES.
- Better for discovery based applications where you’re not sure what genes you should be targeting.
- Exome panels are commercially available, they don’t need to be customized or designed.
- Exome sequencing services are fairly standard, costs range between $550-800 for 100-150x mean on target coverage.
Advantages of targeted gene panels (amplicon-seq or targeted hybridization methods)
Targeted gene panels are ideal for analyzing specific mutations or genes that have suspected associations with disease. Gene panels are generally used for more customized studies.
- Focusing on individual genes or gene regions allows you to sequence at a much higher depth than exome-seq, e.g. 2,000-10,000x as opposed to 200x which is typical with exome-seq.
- High depth sequencing enables the identification of rare variants
- Can be customized for different sample types, e.g. FFPE, cf/ctDNA, degraded samples.
- Lower input amounts can be used with targeted gene panels (1 ng vs. 100 ng with whole exome sequencing).
- Gene panels can be customized to only include genomic regions of interest. Why sequence everything when you don’t need that extra information?
- Panels can be easily designed for non-human species. Designing a non-human exome is much more laborious.
- Gene panel workflows are a lot simpler and time to results is often as little as 1-2 days.
- You can process thousands of samples on a single sequencing run. Targeted gene panels can be run at a higher throughput and are often more cost-effective than whole exome sequencing.
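How many panel samples fit on one run follows from the panel size and the depth you want. The sketch below divides the usable on-target output of a run by the bases needed per sample. The run output, panel size and on-target fraction in the example are hypothetical numbers chosen for illustration, and the function name is made up.

```python
import math


def samples_per_run(run_output_bp: int, panel_size_bp: int, mean_depth: int,
                    on_target_fraction: float = 1.0) -> int:
    """Number of samples that reach mean_depth on a panel in a single run."""
    usable_bases = run_output_bp * on_target_fraction
    bases_per_sample = panel_size_bp * mean_depth
    return math.floor(usable_bases / bases_per_sample)


if __name__ == "__main__":
    # Hypothetical run: 15 Gb output, 50 kb panel, 80% of bases on target.
    for depth in (200, 2_000, 10_000):
        n = samples_per_run(15_000_000_000, 50_000, depth, on_target_fraction=0.8)
        print(f"{depth}x: {n} samples per run")
```

The count falls in proportion to the requested depth, so small panels sequenced at moderate depth are what make very high sample throughput possible.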
II. Library Preparation
For detailed protocols, see the original handbook linked above for the application you are interested in.
Typically 100 to 1000 nanograms of DNA are required for whole genome or whole exome sequencing. Targeted panels or amplicon based sequencing can use as little as 1 to 10 ng of input material. Other applications will have specific input requirements. See our guide for recommendations on shipping DNA samples.
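The TruSeq DNA library prep method has lower requirements for DNA quality, is inexpensive, gives high genome coverage and is highly automatable, which makes it suitable for routine genomic libraries. The Nextera method is simple and fast and needs little starting material, which makes it suitable for samples available only in limited amounts.
These two library prep methods are also compared in a forum discussion: Illumina library prep kits. Nextera vs. TruSeq
III. Sequencing Instrument
Reference: The Biostar Handbook, 2nd edition. Note that these figures change as instruments are improved, MinION being one example.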
Illumina MiniSeq, MiSeq, NextSeq, HiSeq
Illumina is the current undisputed leader in the high-throughput sequencing market. Illumina
currently offers sequencers that cover the full range of data output. More details are
available on the Illumina sequencers page.
• Up to 300 million reads (HiSeq 2500)
• Up to 1500 GB per run (GB = 1 billion bases)
IonTorrent PGM, Proton
The IonTorrent platform targets more specialized clinical applications.
• Up to 400bp long reads
• Up to 12 GB per run
PacBio Sequel
This company is the leader in long read sequencing. More details on the PacBio sequencers
page.
• Up to 12,000 bp long reads
• Up to 4 GB per run
MinION
A portable, miniaturized device that is not yet quite as robust and reliable as the other
options. More details on the MinION sequencers page.
Phased sequencing
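Genome phasing identifies alleles on both maternal and paternal chromosomes, offering haplotype information. Phased sequencing is important in genetic disorders where there are disruptions to alleles in cis and trans positions on a chromosome. It’s ideal in studies where variant linkage and allele expression are important.
More on phasing: https://www.jianshu.com/p/5a8ebac310e4
The existence of LD blocks means that these linkage relationships can be solved for with a suitable mathematical model. In large-scale genome projects (such as HapMap, the 1000 Genomes Project, the Haplotype Reference Consortium and the various national genome projects), phasing algorithms built on models such as the hidden Markov model (HMM) use sequencing or array data to infer the most likely haplotypes of each individual and so complete the phasing.
Every base on a single read, a read pair or a clone must come from the same chromosome (that is, the same haplotype). Each such sequenced fragment is therefore a local piece of one haplotype, and the problem becomes how to join these small local pieces into a whole and reconstruct complete haplotypes. This is physical phasing.
In short, joining small fragments into large ones depends on using the heterozygous SNPs on the fragments as distinguishing markers. By comparing the bases that fragments carry at each heterozygous site, together with how the fragments overlap, most fragments can be split into two classes. A series of steps (linking, building and solving a bipartite graph, and reassembly) then extends the small fragments into larger ones and yields the haplotypes.
Physical phasing usually requires each fragment to contain several heterozygous SNPs. Because heterozygous SNPs in the human genome are typically about 1.5 kbp apart, which is fairly far, the sequenced fragments themselves must be long. This calls for methods such as third-generation sequencing, so the cost is relatively high.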
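The idea of sorting fragments into two classes by their alleles at heterozygous SNPs can be shown with a toy example. The sketch below is a simple greedy procedure, not the bipartite-graph method mentioned above: it seeds haplotype 1 with the first fragment, then repeatedly takes any fragment that shares a SNP with what has been phased so far and assigns it by majority agreement. It assumes biallelic sites coded 0/1 and largely error-free fragments; the function name and the data are made up.

```python
def phase_fragments(fragments):
    """Greedy toy phasing.

    fragments: list of dicts mapping heterozygous SNP position -> allele (0 or 1).
    Returns (hap1, assignment): hap1 maps position -> allele on haplotype 1
    (haplotype 2 is its complement); assignment[i] is 1, 2, or None when
    fragment i shares no SNP with the phased block.
    """
    hap1 = {}
    assignment = [None] * len(fragments)
    pending = list(range(len(fragments)))
    if pending:
        first = pending.pop(0)  # seed haplotype 1 with the first fragment
        hap1.update(fragments[first])
        assignment[first] = 1
    progress = True
    while pending and progress:
        progress = False
        for idx in list(pending):
            frag = fragments[idx]
            shared = [pos for pos in frag if pos in hap1]
            if not shared:
                continue  # cannot be linked yet; retry on the next pass
            agree = sum(frag[pos] == hap1[pos] for pos in shared)
            on_hap1 = 2 * agree >= len(shared)
            assignment[idx] = 1 if on_hap1 else 2
            for pos, allele in frag.items():
                # Extend haplotype 1; a haplotype 2 fragment adds the complement.
                hap1.setdefault(pos, allele if on_hap1 else 1 - allele)
            pending.remove(idx)
            progress = True
    return hap1, assignment


if __name__ == "__main__":
    frags = [
        {100: 0, 1600: 1},
        {1600: 0, 3100: 0},
        {3100: 1, 4600: 0},
        {100: 1, 1600: 0},
    ]
    print(phase_fragments(frags))
```

Fragments that never overlap the growing block stay unassigned, which mirrors why fragments must be long enough to span neighbouring heterozygous SNPs.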
ref
Haplotype phasing: existing methods and new developments:https://idp.xmu.edu.cn/idp/profile/SAML2/POST/SSO?execution=e1s1
Genetic linkage analysis in the age of whole-genome sequencing: https://www.nature.com/nrg/journal/v16/n5/full/nrg3908.html
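The ChIP-Seq "blacklist"
The genome contains certain repetitive sequences, for example at centromeres, telomeres and satellite repeats, whose copies are identical base for base. When second-generation sequencing data are aligned (say there are two repeat regions, A and B, with exactly the same sequence), the alignment algorithm alone cannot tell whether a read belongs to A or to B. Tools resolve this differently: some pick one location at random, others count the read at both. This computational ambiguity is why these regions generally show inflated sequencing depth.
Back to ChIP-Seq data: we usually compare sequencing depth along the genome between the IP and input samples, and tools such as macs2 turn that difference, with some further computation, into a set of peaks, that is, the regions around binding sites. Since peaks are determined from sequencing depth, the artificially high depth in repeat regions is bound to affect the result. These regions are therefore placed on a "blacklist".
Why consider the blacklist?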
It's still considered best-practice to remove these regions.
For genomes like GRCh38, the blacklisted regions are largely comprised of things like major satellite repeats, which are primarily located in hard-masked telomeric and pericentromeric regions.
Reference: https://deeptools.readthedocs.io/en/develop/content/feature/blacklist.html
Including these regions can lead not only to false-positive peaks, but can also throw off between-sample normalization.
So the main reason to remove blacklisted regions is to reduce false-positive peaks.
How to remove blacklisted regions
Devon Ryan notes at https://bioinformatics.stackexchange.com/questions/458/when-to-account-for-the-blacklisted-genomic-regions-in-chip-seq-data-analyses/459#459?newreg=dca76bad61c443d7b4f0b1abd1487878:
- Simply removing these regions is straightforward.
- They are usually removed before peak calling.
- Removing them improves the peak-calling result only slightly.
- deeptools lets you specify blacklisted regions.
- The newer the genome build (such as GRCh38 and GRCm38), the smaller the blacklisted regions, so in many cases they no longer need to be considered.
Devon also notes at https://www.biostars.org/p/238222/ that genome assemblies keep improving:
Only 10% of the problematic regions in hg18 were still problematic in hg19. Somewhere Heng Li has a presentation showing further improvements in GRCh38. This is the same for the mm8 -> mm9 -> mm10 progression of releases.
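Also, there is no need to insist on removing the affected reads up front, which is time-consuming; the QC tools we use already take these regions into account. Dropping the blacklisted intervals from the BED file is enough and is easier to do.
# The simplest approach:
bedtools intersect -v -a your_regions.bed -b blacklist.bed
# -v: Only report those entries in A that have no overlap in B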
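To make clear what the `-v` filter does, here is a minimal Python sketch of the same idea: keep only the intervals that overlap no blacklisted interval. It is an illustration, not a replacement for bedtools. It assumes BED-style coordinates (0-based start, end not included), compares every region with every blacklist entry, and uses made-up function names and data.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_blacklisted(regions, blacklist):
    """Keep regions with no overlap in blacklist (like `intersect -v`)."""
    return [r for r in regions if not any(overlaps(r, b) for b in blacklist)]


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    blacklist = [("chr1", 150, 300)]
    for chrom, start, end in drop_blacklisted(peaks, blacklist):
        print(chrom, start, end, sep="\t")
```

For genome-scale files the all-against-all comparison is slow; sorting both lists or simply using bedtools is the practical choice.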
deeptools can also be given blacklisted regions:
https://deeptools.readthedocs.io/en/develop/content/feature/blacklist.html
deepTools supports the --blackListFileName option throughout.