Hello, everyone. Welcome to the OBS seminar today. We have Dr. Zilin Li joining us. He is an assistant professor in the Department of Biostatistics and Health Data Science at Indiana University School of Medicine. Before that, he was a research scientist, research associate, and postdoctoral research fellow in Professor Xihong Lin's lab in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, and he received his PhD from Tsinghua University in 2016. My name is Raj, and I will be moderating the talk today. If you have any questions, you can paste them in the chat. With that, Dr. Li, the floor is yours.

Thanks for the introduction, Raj. Good morning, everyone. Thanks for attending my presentation. My name is Zilin Li, and I am an assistant professor in the Department of Biostatistics and Health Data Science at Indiana University School of Medicine. Today I am going to present our project, STAARpipeline, a framework for detecting noncoding rare-variant associations in large-scale whole-genome sequencing studies. This work was published in a Nature-family journal.
First, I will start with the outline of my presentation. My presentation includes four parts. In the first part, I will introduce the background of large-scale sequencing studies, especially rare-variant association analysis. Then I will introduce our proposed STAARpipeline, an all-in-one rare-variant analysis tool for large-scale sequencing studies. In the third part, I will introduce an example of a real data application: we apply STAARpipeline to analyze the latest TOPMed freeze data. Finally, I will conclude my presentation.
Let me start with the background. Each of our human genomes contains three billion base pairs, basically the letters A, T, G, and C. An important goal of human genetic research is to detect the genetic basis of human diseases and traits. Many of you might have heard about GWAS, genome-wide association studies. GWAS have been widely used to detect the genetic basis of human diseases and traits over the past 15 years.
GWAS is an array-based technology, and one limitation of GWAS is that it only focuses on common variants. Here a common variant is defined as a genetic variant whose minor allele frequency is larger than 5% or 1% in the study population. Let's look at an example. Assume that we have seven individuals in our study, and we look at the variant in the gray circle. We can see that among the seven individuals there are three T alleles and four C alleles, so this variant is a common variant and can be covered by a GWAS array. Although GWAS have been successful in detecting thousands of common variants associated with human diseases and traits, common variants can explain only a small fraction of the heritability. As we know, this phenomenon is called the missing heritability in genetics. Recent studies indicate that the missing heritability may be accounted for by rare variants, and researchers have also found that the majority of variants in the human genome are rare variants.
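To make the allele-counting arithmetic concrete, here is a minimal Python sketch using the toy seven-individual example with one observed allele per individual; the helper function is illustrative, not part of any package.

```python
import numpy as np

def minor_allele_frequency(alleles):
    """Compute the minor allele frequency (MAF) from a list of observed alleles."""
    _, counts = np.unique(np.asarray(alleles), return_counts=True)
    if counts.size == 1:          # monomorphic site: no minor allele
        return 0.0
    return counts.min() / counts.sum()

# Toy example from the talk: seven individuals, three T and four C at the gray-circle site.
print(minor_allele_frequency(["T", "T", "T", "C", "C", "C", "C"]))   # 3/7 ~ 0.43 -> common

# An orange-circle site: only one individual carries a different allele.
print(minor_allele_frequency(["A", "G", "G", "G", "G", "G", "G"]))   # 1/7 ~ 0.14

# In real studies a variant is called rare when MAF < 1% (or < 5%); in a toy sample of
# seven the smallest nonzero frequency is 1/7, so the threshold only matters at scale.
```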
Current biobank-scale data have huge sample sizes. For example, UK Biobank has about 500 thousand individuals. However, the common variants covered by these large GWAS still account for less than 10% of the variants in the human genome, and the remaining 90% of the variants in the human genome are rare variants. Recent studies also indicate that rare variants tend to have larger effect sizes and are more likely to cause disease. Researchers have also found that the proteins encoded by rare variants are more likely to be drug targets.
In order to study the effects of rare variants, a rapidly increasing number of whole-genome sequencing studies are being conducted. Examples include the Trans-Omics for Precision Medicine (TOPMed) program funded by the National Heart, Lung, and Blood Institute, the Genome Sequencing Program funded by the National Human Genome Research Institute, and the UK Biobank whole-genome sequencing data. By doing whole-genome sequencing, we are able to sequence every position of the human genome, that is, each position of the three billion base pairs, which makes the data structure of whole-genome sequencing data very different from that of GWAS. I will illustrate the difference using the toy example. Looking back at the toy example, whole-genome sequencing data not only include the information on the common variant in the gray circle, but additionally include the information on many newly detected rare variants, shown here in the orange circles. Compared to the common variant, for each rare variant in the orange circles only one individual carries a different allele from the others. This again indicates that the data structure of whole-genome sequencing data is very different from GWAS data.
Then, what does whole-genome sequencing data look like? I will use the TOPMed consortium freeze 8 data as an illustration. In total, there are around 140 thousand individuals in the TOPMed freeze 8 data, and among these individuals we found around 700 million variants in total. So whole-genome sequencing data have a huge number of variants and a very large sample size. We can also see that, among these roughly 700 million variants, only 1.8% are common variants with minor allele frequency larger than 1%, and the remaining 98.2% are rare variants. We further found that about 61% of these variants are singletons or doubletons; for each singleton or doubleton, only one or two individuals carry a different allele from the others. These observations indicate that whole-genome sequencing data are very, very sparse. So we can see that analyzing whole-genome sequencing data is a large-n, huge-p analysis problem that is also related to sparse matrix problems.
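To make the sparsity point concrete, here is a minimal sketch using a small made-up genotype matrix (not TOPMed data) showing how rare-variant genotypes can be stored sparsely and how singletons and doubletons can be counted from minor allele counts.

```python
import numpy as np
from scipy import sparse

# Toy genotype matrix: rows = individuals, columns = variants,
# entries = minor allele counts (0, 1 or 2). Real WGS data are mostly zeros.
dense = np.array([
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [2, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
])

G = sparse.csr_matrix(dense)                       # store only the nonzero entries
minor_allele_counts = np.asarray(G.sum(axis=0)).ravel()

n_singletons = int(np.sum(minor_allele_counts == 1))
n_doubletons = int(np.sum(minor_allele_counts == 2))
sparsity = 1.0 - G.nnz / np.prod(G.shape)

print(f"singletons={n_singletons}, doubletons={n_doubletons}, sparsity={sparsity:.2f}")
```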
Now the question is, how can we analyze this huge dataset? The common analysis strategy for GWAS is to perform an individual test of each variant and then apply multiple-testing adjustment to detect the significant ones. However, this common strategy cannot be directly applied to rare variants because of lack of power. In order to solve this problem, researchers proposed variant-set tests instead of testing each variant individually. A variant-set test evaluates the cumulative effects of multiple variants in a variant set. If there are multiple variants in a variant set associated with the trait we are interested in, then the variant-set test can increase the power. However, many questions remain when we apply variant-set analysis to rare-variant association analysis in whole-genome sequencing studies. The first question is the definition of the variant set, which is related to the selection of variants in the set: how can we define the analysis units to make the analysis powerful for whole-genome sequencing data? The second question is the choice of the variant-set test: many different variant-set tests have been proposed in the literature, and we need to choose the ones that are suitable for whole-genome sequencing data. The third question is about leveraging biological information: many bioinformatics databases now provide information on the functionality of variants, and we can further leverage this information to increase the power of the analysis in whole-genome sequencing studies. Last but not least is computational scalability: given the huge number of analyses, we need computational methods that are scalable to this huge dataset. My research mainly focuses on developing statistical methods to perform powerful and scalable association analysis by addressing these issues.
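As a simple illustration of the variant-set idea, here is a minimal sketch of a burden-style test for a quantitative trait with no covariates: the rare-variant genotypes are collapsed into one weighted burden score, which is then tested for association with the phenotype. This is a textbook-style toy example under simulated null data, not the specific tests implemented in STAARpipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n, p = 1000, 20                               # individuals, rare variants in the set
G = rng.binomial(2, 0.005, size=(n, p))       # toy minor allele counts (0/1/2), mostly zeros
y = rng.normal(size=n)                        # quantitative trait simulated under the null

# Burden test: up-weight rarer variants, collapse them into one score, test the score.
maf = G.mean(axis=0) / 2
weights = stats.beta.pdf(maf, 1, 25)          # commonly used Beta(MAF; 1, 25) weighting
burden = G @ weights

result = stats.linregress(burden, y)          # simple regression of trait on burden score
print(f"burden-test p-value: {result.pvalue:.3g}")
```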
Now let us first look at our proposed whole-genome sequencing association analysis pipeline, STAARpipeline. STAARpipeline is an all-in-one rare-variant analysis tool: the users only need to input the phenotype and genotype information, and STAARpipeline will automatically handle all the remaining steps, including the association analysis, the summarization of results, and visualization. Specifically, in the first step, STAARpipeline annotates the variants in the dataset. It uses the FAVOR database and annotation tools to functionally annotate any genetic dataset and then generates the aGDS file. The aGDS file is a data format that stores both the genotype and the annotation information in a single file. STAARpipeline also provides tools to generate the GRMs, which are used to account for population structure and relatedness in the subsequent association analysis. In the next step, STAARpipeline helps researchers perform the association analysis. For common variants, just as in GWAS, STAARpipeline provides single-variant analysis; for rare variants, it provides variant-set analysis. Here we consider two different strategies to group the variants. The first one is the gene-centric analysis, where we focus on analyzing the variants in or near genes and group variants based on their biological functionality. The second one is the genetic-region (non-gene-centric) analysis, where we mainly focus on analyzing the associations in the noncoding genome, especially in the intergenic regions; in the genetic-region analysis we simply group the variants based on their positions. After we define each variant set, STAARpipeline further uses the STAAR method to incorporate multiple functional annotations to increase the power. In the final step, STAARpipeline provides tools for analytical follow-up, including the summarization and visualization of the analysis results. STAARpipeline is also able to perform conditional analysis to detect novel associations: the users provide a list of known variants that they want to adjust for. Next, I will introduce how STAARpipeline addresses the four challenges I mentioned before to perform powerful and scalable rare-variant association analysis for sequencing studies.
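As an illustration of the aGDS idea of keeping genotypes and annotations together, here is a minimal in-memory sketch; the column names and values are made up, and the real aGDS format is a GDS file, not a Python dictionary.

```python
import numpy as np
import pandas as pd

# A minimal in-memory analogue of the aGDS idea (illustrative only): keep the genotype
# matrix and the per-variant functional annotations together, in the same variant order.
genotype = np.array([[0, 1, 0],
                     [0, 0, 2],
                     [1, 0, 0]])                  # rows = samples, columns = variants

annotation = pd.DataFrame({
    "variant_id": ["chr1:100:A:T", "chr1:250:G:C", "chr1:400:C:T"],
    "gene":       ["GENE1", "GENE1", "GENE1"],
    "category":   ["missense", "synonymous", "pLoF"],
    "cadd_phred": [22.1, 1.3, 35.0],              # one example in silico annotation score
})

agds_like = {"genotype": genotype, "annotation": annotation}

# The two pieces travel together, so an analysis unit can be built by filtering the
# annotation table and slicing the matching genotype columns.
keep = annotation["category"] != "synonymous"
print(agds_like["genotype"][:, keep.to_numpy()].shape)   # (3 samples, 2 variants)
```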
First, let us look back at the challenges in rare-variant association analysis, starting with the first challenge: the selection of variants in the variant set.

STAARpipeline provides two different strategies to define the analysis units of rare variants. The first one is the gene-centric analysis and the second one is the genetic-region (non-gene-centric) analysis. For the gene-centric analysis, we focus on analyzing the variants in or near genes and group variants based on their functional categories. In STAARpipeline, we provide five different coding categories and eight different noncoding categories. For the coding categories, we group the rare variants as loss-of-function variants; loss-of-function and disruptive missense variants together; missense variants; disruptive missense variants only; and synonymous variants. For the noncoding masks, we provide analysis units of the rare variants in enhancers or promoters overlapping with CAGE sites or DHS sites, we also provide masks of the variants in UTR regions, upstream regions, and downstream regions, and we additionally provide an analysis mask of the rare variants in ncRNA genes. And then, in the…
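A minimal sketch of how such functional-category masks could be expressed as selection rules over a per-variant annotation table; the mask names follow the talk, but the table columns and filter logic are illustrative, not the STAARpipeline implementation.

```python
import pandas as pd

# Per-variant annotation table (illustrative columns and values).
ann = pd.DataFrame({
    "variant_id": ["v1", "v2", "v3", "v4", "v5"],
    "category":   ["pLoF", "disruptive_missense", "missense", "synonymous", "ncRNA"],
    "region":     ["coding", "coding", "coding", "coding", "ncRNA_gene"],
})

# Coding-side gene-centric masks described in the talk, expressed as filters.
coding_masks = {
    "plof":                     lambda a: a["category"] == "pLoF",
    "plof_disruptive_missense": lambda a: a["category"].isin(["pLoF", "disruptive_missense"]),
    "missense":                 lambda a: a["category"].isin(["missense", "disruptive_missense"]),
    "disruptive_missense":      lambda a: a["category"] == "disruptive_missense",
    "synonymous":               lambda a: a["category"] == "synonymous",
}

for name, rule in coding_masks.items():
    variants = ann.loc[rule(ann), "variant_id"].tolist()
    print(f"{name}: {variants}")
```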
ps:https://www.xihaoli.org/
I am an assistant professor in the Department of Biostatistics and the Department of Genetics at the University of North Carolina at Chapel Hill (UNC). I received my PhD in Biostatistics from Harvard University, where I was fortunate to be advised by Dr. Xihong Lin and had the opportunity to work with Dr. Iuliana Ionita-Laza. I stayed at Harvard for postdoctoral training with Dr. Xihong Lin, where I also worked closely with Dr. Pradeep Natarajan, Dr. Gina Peloso, Dr. Laura Raffield, Dr. Zilin Li, and Dr. Bing Yu as a member of the NHLBI TOPMed program.
My research interests lie in developing novel statistical methods and computational tools that (1) enable scalable and integrative analysis of large-scale whole-genome/whole-exome sequencing (WGS/WES) data and multi-omics data (as part of the NHLBI TOPMed program), (2) enable meta-analysis of large-scale sequencing data from ancestrally diverse consortia and biobanks, and (3) use functional annotation data to prioritize putative causal genetic variants, in order to better understand the relationships among genome variation, genome function, and phenotypes (as part of the NHGRI IGVF consortium). I am also involved in methodological projects developing statistical methods for rare-disease clinical trials and real-world evidence studies.
Previously, I received an MS in Biostatistics from Harvard University, a BS in Mathematics from the School of Mathematical Sciences, and a double degree in Economics from the National School of Development at Peking University. ORCiD: 0000-0001-8151-0106. Email: xihaoli@unc.edu
ps:https://doi.org/10.1038/s41588-020-0676-4
IF: 30.8 Q1 B1
Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies
Abstract: Large-scale whole-genome sequencing studies have enabled the analysis of rare variants associated with complex phenotypes. Commonly used rare-variant association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful rare-variant association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce "annotation principal components", multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify rare variants associated with four lipid traits in 12,316 discovery samples and 17,822 replication samples from the Trans-Omics for Precision Medicine program. We discovered and replicated new rare-variant associations, including disruptive missense rare variants of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.
Introduction: An increasing number of WGS/WES studies are being conducted to investigate the genetic basis of human diseases and traits, including the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program and the National Human Genome Research Institute's Genome Sequencing Program. These studies make it possible to assess associations between complex traits and coding and noncoding rare variants (MAF < 1%) across the whole genome. However, single-variant analysis often has low power for identifying rare-variant associations. To improve power, "variant-set tests" have been proposed to jointly test the effects of a given set of multiple rare variants. These methods include the burden test, the sequence kernel association test (SKAT), and various combinations of them. Meanwhile, several functional annotations (such as conservation scores and predicted enhancer states) have been successfully used to prioritize plausible causal common variants in fine-mapping studies, partition heritability in GWAS, and predict genetic risk. Therefore, effectively integrating variant functional annotations may improve the power of rare-variant analysis in WGS association studies.
Variant functional annotations come in two forms: (1) qualitative functional categories that group variants into genomic elements, such as Variant Effect Predictor categories; and (2) quantitative functional scores available for variants across the whole genome, including protein function scores, evolutionary conservation scores, epigenetic measurements, and integrative functional scores. Different annotation scores capture different aspects of variant function. Given the diversity of available annotations, researchers are aggregating these lines of evidence about genome function. Using multiple complementary functional annotation scores simultaneously in variant-set tests could improve the power of rare-variant association studies, for example by better selecting and weighting plausible causal rare variants.
To improve the power of variant-set tests in rare-variant association studies of WGS, we propose the variant-set test for association using annotation information (STAAR), a general framework that dynamically incorporates both qualitative functional categories and quantitative complementary annotation scores using a unified omnibus multidimensional weighting scheme. For the latter, to effectively capture the multifaceted biological impact of variants, we introduce "annotation principal components", multidimensional summaries of annotation scores that can be leveraged in the STAAR framework.
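A minimal sketch of the annotation principal components idea: standardize a matrix of per-variant in silico annotation scores and take principal components as multidimensional summaries. The scores here are random toy data, not real annotations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows = variants, columns = in silico annotation scores (toy data with made-up meaning,
# e.g. conservation, epigenetic, and protein-function scores).
scores = rng.normal(size=(500, 6))

# Standardize each annotation score, then take principal components via SVD.
Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
aPC = U[:, :2] * S[:2]                  # first two annotation principal components per variant

explained = (S ** 2) / np.sum(S ** 2)
print("variance explained by aPC1, aPC2:", explained[:2].round(3))
```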
Recent methods have incorporated functional annotations in genetic association studies. However, these methods do not scale to the analysis of large WGS studies while accounting for relatedness and population structure. Large-scale WGS and WES studies, such as TOPMed and the National Human Genome Research Institute (NHGRI) Genome Sequencing Program (GSP), include a substantial proportion of related and ancestrally diverse samples. STAAR accounts for relatedness and population structure for both quantitative and dichotomous traits, as well as longitudinal follow-up designs, using the generalized linear mixed model (GLMM) framework. Using a sparse genetic relatedness matrix (GRM), STAAR is computationally scalable for very large WGS studies and biobanks with hundreds of thousands of samples.
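A minimal sketch of the sparse-GRM idea mentioned above: compute a genotype-based relatedness matrix, then zero out small off-diagonal entries before storing it sparsely. The threshold and data are illustrative, not the procedure used by STAAR.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)

n, m = 200, 1000                                   # samples, common variants (toy data)
p = rng.uniform(0.05, 0.5, size=m)                 # allele frequencies
G = rng.binomial(2, p, size=(n, m)).astype(float)  # genotype dosages 0/1/2

# Standard GRM: standardize genotypes by allele frequency, then take the cross-product.
Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
grm = Z @ Z.T / m

# Sparsify: zero out small off-diagonal entries (distantly related pairs), keep the diagonal.
threshold = 0.05
mask = np.abs(grm) >= threshold
np.fill_diagonal(mask, True)
sparse_grm = sparse.csr_matrix(np.where(mask, grm, 0.0))

print(f"nonzero fraction of the sparse GRM: {sparse_grm.nnz / (n * n):.3f}")
```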
In this study, we conducted extensive simulation studies to demonstrate that, compared with conventional variant-set tests, STAAR achieves higher power while maintaining accurate type I error rates for both quantitative and dichotomous phenotypes. We then applied STAAR to perform WGS gene-centric and sliding-window genetic-region analyses of 12,316 discovery samples and 17,822 replication samples with four quantitative lipid traits: low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), triglycerides (TG), and total cholesterol (TC). We show that STAAR outperforms existing methods and identifies new and replicated associations, including associations of disruptive missense rare variants in NPC1L1 and an intergenic region near APOC1P1 with LDL-C.
Results: Overview of the method. STAAR is a general framework for rare-variant association analysis of large-scale WGS studies that uses both qualitative functional categories and multiple in silico variant annotation scores for a variant set, while accounting for population structure and relatedness by fitting linear and logistic mixed models for quantitative and dichotomous traits using fast and scalable algorithms. For each variant set, the STAAR framework has two main components: (1) using annotation principal components to capture and prioritize multidimensional variant biological functions; and (2) testing the association between each variant set and the phenotype by incorporating these annotation principal components, along with other integrative functional scores and MAFs, into the STAAR test statistics using an omnibus weighting scheme.
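The omnibus weighting scheme can be illustrated with the Cauchy combination (ACAT) rule for aggregating p-values obtained under different annotation weightings; the sketch below combines a few hypothetical p-values and is not the full STAAR statistic.

```python
import numpy as np

def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination (ACAT) rule:
    T = sum_i w_i * tan((0.5 - p_i) * pi), then map T back to a p-value."""
    p = np.asarray(pvalues, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(t) / np.pi

# Hypothetical p-values for the same variant set under different annotation weightings.
p_values = [0.04, 0.20, 0.003, 0.35]
print(f"omnibus p-value: {cauchy_combination(p_values):.4g}")
```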