November 2019 bioRxiv Bioinformatics Highlights

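On November 6, an unusual preprint appeared on bioRxiv: Richard Sever and colleagues from the team that founded bioRxiv posted a manuscript titled "bioRxiv: the preprint server for biology", which reviews the server's first five years [1]. According to the paper, bioRxiv has posted more than 64,000 preprints, receives more than 2,000 new submissions each month, and draws more than 4 million views per week. More importantly, these numbers are still climbing.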

Fig.4. The growth of bioRxiv. A. Monthly submissions to bioRxiv. New articles are in blue; revised articles are in red.

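Thriving preprints are also moving from the grassroots into the mainstream. On November 19, Nature covered a manuscript that had just been posted to bioRxiv in its news section, under the headline "Every butterfly in the United States and Canada now has a genome sequence". In the preprint, researchers from the Grishin lab at UT Southwestern Medical Center report genome sequencing results for all 845 butterfly species in the United States and Canada, which they use to study butterfly genome evolution in detail, in particular species-level phylogeny and speciation rates.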

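Nature also recently published another much-discussed evolutionary biology paper, in which researchers in Australia used population genomics to trace the specific place from which modern humans set out on their "Out of Africa" journey to Botswana in southern Africa [2]. The paper met with dissent as soon as it was published, and some of those objections have now been put in writing: last month, Carina Schlebusch and colleagues at Uppsala University in Sweden posted a preprint on preprints.org that pushes back strongly against the paper and argues that its conclusions do not hold up. Will this preprint pass peer review and appear before long in Nature as a short article?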

Want to know more about these papers? Then read on for our November roundup of bioRxiv bioinformatics highlights.

1. Genome sequencing of all 845 North American butterfly species reveals general patterns of animal evolution

Genomics of a complete butterfly continent (CC BY-NC-ND 4.0)

Never before have we had the luxury of choosing a continent, picking a large phylogenetic group of animals, and obtaining genomic data for its every species. Here, we sequence all 845 species of butterflies recorded from North America north of Mexico. Our comprehensive approach reveals the pattern of diversification and adaptation occurring in this phylogenetic lineage as it has spread over the continent, which cannot be seen on a sample of selected species. We observe bursts of diversification that generated taxonomic ranks: subfamily, tribe, subtribe, genus, and species. The older burst around 70 Mya resulted in the butterfly subfamilies, with the major evolutionary inventions being unique phenotypic traits shaped by high positive selection and gene duplications. The recent burst around 5 Mya is caused by explosive radiation in diverse butterfly groups associated with diversification in transcription and mRNA regulation, morphogenesis, and mate selection. Rapid radiation correlates with more frequent introgression of speciation-promoting and beneficial genes among radiating species. Radiation and extinction patterns over the last 100 million years suggest the following general model of animal evolution. A population spreads over the land, adapts to various conditions through mutations, and diversifies into several species. Occasional hybridization between these species results in accumulation of beneficial alleles in one, which eventually survives, while others become extinct. Not only butterflies, but also the hominids may have followed this path.

2. The role of oncogenes on extrachromosomal DNA (ecDNA) in aggressive tumors

Frequent extrachromosomal oncogene amplification drives aggressive tumors (CC-BY-ND 4.0)

Extrachromosomal DNA (ecDNA) amplification promotes high oncogene copy number, intratumoral genetic heterogeneity, and accelerated tumor evolution, but its frequency and clinical impact are not well understood. Here we show, using computational analysis of whole-genome sequencing data from 1,979 cancer patients, that ecDNA amplification occurs in at least 26% of human cancers, of a wide variety of histological types, but not in whole blood or normal tissue. We demonstrate a highly significant enrichment for oncogenes on amplified ecDNA and that the most common recurrent oncogene amplifications arise on ecDNA. EcDNA amplifications resulted in higher levels of oncogene transcription compared to copy number matched linear DNA, coupled with enhanced chromatin accessibility. Patients whose tumors have ecDNA-based oncogene amplification showed increase of cell proliferation signature activity, greater likelihood of lymph node spread at initial diagnosis, and significantly shorter survival, even when controlled for tissue type, than do patients whose cancers are not driven by ecDNA-based oncogene amplification. The results presented here demonstrate that ecDNA-based oncogene amplification plays a central role in driving the poor outcome for patients with some of the most aggressive forms of cancers.

3. Curious about homology-directed repair outcomes after CRISPR knock-in? Don't miss this paper from Manuel Leonetti's group at the Chan Zuckerberg Biohub

Deep profiling reveals substantial heterogeneity of integration outcomes in CRISPR knock-in experiments (CC-BY-NC-ND 4.0)

CRISPR/Cas technologies have transformed our ability to add functionality to the genome by knock-in of payload via homology-directed repair (HDR). However, a systematic and quantitative profiling of the knock-in integration landscape is still lacking. Here, we present a framework based on long-read sequencing and an integrated computational pipeline (knock-knock) to analyze knock-in repair outcomes across a wide range of experimental parameters. Our data uncover complex repair profiles, with perfect HDR often accounting for a minority of payload integration events, and reveal markedly distinct mis-integration patterns between cell-types or forms of HDR templates used. Our analysis demonstrates that the two sides of a given double-strand break can be repaired by separate pathways and identifies a major role for sequence micro-homology in driving donor mis-integration. Altogether, our comprehensive framework paves the way for investigating repair mechanisms, monitoring accuracy, and optimizing the precision of genome engineering.

4. BlobToolKit: a visualization tool for assessing genome assembly quality

BlobToolKit – Interactive quality assessment of genome assemblies (CC-BY 4.0)

We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view. We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.

5. Researchers in India: a minimal pipeline for reference-based transcriptome analysis

A Simplest Bioinformatics Pipeline for Whole Transcriptome Sequencing: Overview of the Processing and Steps from Raw Data to Downstream Analysis (CC-BY-NC 4.0)

Recent advances in next generation sequencing (NGS) technologies have heralded the genomic research. From the good-old inferring differentially expressed genes (DEG) using microarray to the current adage NGS-based whole transcriptome or RNA-Seq pipelines, there have been advances and improvements. With several bioinformatics pipelines for analysing RNA-Seq on rise, inferring the candidate DEGs prove to be a cumbersome approach as one may have to reach consensus among all the pipelines. To Check this, we have benchmarked the well known cufflinks-cuffdiff pipeline on a set of datasets and outline it in the form of a protocol where researchers interested in performing whole transcriptome shotgun sequencing and it’s downstream analysis can better disseminate the analysis using their datasets.

6. RepeatModeler, the transposon and repeat annotation tool, gets an upgrade

RepeatModeler2: automated genomic discovery of transposable element families (CC-BY 4.0)

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools).

Note: this paper was also featured in issue 37 of the 生信菜鸟团 weekly literature recommendations.

7. Coinfinder, a well-received pangenome analysis tool in development since 2015, finally gets a paper

Coinfinder: Detecting Significant Associations and Dissociations in Pangenomes (CC-BY-NC-ND 4.0)

Coinfinder identifies genes that co-occur (associate) or avoid (dissociate) with each other across the accessory genomes of a pangenome of interest. Genes that associate or dissociate more often than expected by chance, suggests that those genes have a connection (attraction or repulsion) that is interesting to explore. Identification of these groups of genes will further the field’s understanding of the importance of accessory genes. Coinfinder is a freely available, open-source software which can identify gene patterns locally on a personal computer in a matter of hours.
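To make "more often than expected by chance" concrete, here is a minimal Python sketch of a co-occurrence test on a gene presence/absence matrix. The abstract does not say which statistic Coinfinder uses, so the one-sided hypergeometric test below is an assumption chosen for illustration, and the function name `cooccurrence_pvalue` is hypothetical. The sketch also ignores phylogenetic relatedness among genomes.

```python
from math import comb


def cooccurrence_pvalue(presence_a, presence_b):
    """One-sided hypergeometric p-value that genes A and B co-occur
    in at least as many genomes as observed (illustrative assumption,
    not necessarily the statistic used by Coinfinder)."""
    if len(presence_a) != len(presence_b):
        raise ValueError("presence vectors must cover the same genomes")
    n = len(presence_a)                      # number of genomes
    a = sum(1 for x in presence_a if x)      # genomes carrying gene A
    b = sum(1 for y in presence_b if y)      # genomes carrying gene B
    both = sum(1 for x, y in zip(presence_a, presence_b) if x and y)
    total = comb(n, b)                       # ways to place gene B in n genomes
    # Sum the upper tail: overlaps of size `both` or larger.
    tail = sum(comb(a, k) * comb(n - a, b - k)
               for k in range(both, min(a, b) + 1))
    return tail / total


# Two accessory genes that always appear together in 8 genomes.
gene_a = [1, 1, 1, 1, 0, 0, 0, 0]
gene_b = [1, 1, 1, 1, 0, 0, 0, 0]
p_together = cooccurrence_pvalue(gene_a, gene_b)
```

A small p-value here would flag an association; swapping the tail direction (overlaps of size `both` or smaller) would flag a dissociation. Testing every gene pair in a pangenome also needs a multiple-testing correction, which this sketch leaves out.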

8. SLAST, a pure Smith-Waterman local alignment tool: a challenger to BLAST, or just passing through?

SLAST: Simple Local Alignment Search Tool (CC-BY-NC-ND 4.0)

We present a local alignment search tool not based on the usual strategy of seed and grow often employed for these tools. Instead, we just find regions in the database sequences having a high density of seed matches and then we perform a Smith-Waterman local alignment of the query sequence into these regions. This approach has some advantages for some use cases.
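The two steps described in the abstract can be sketched in Python as follows. This is not SLAST's code: the seed length, window size, scoring values, and both function names are assumptions made for illustration. The first function picks the target window with the most k-mer seed hits, and the second runs a plain Smith-Waterman alignment with a linear gap penalty inside that window.

```python
def densest_seed_window(query, target, k=4, window=20):
    """Return (start, end) of the target window with the most k-mer seed hits."""
    seeds = {query[i:i + k] for i in range(len(query) - k + 1)}
    hit_positions = [p for p in range(len(target) - k + 1)
                     if target[p:p + k] in seeds]
    best_start, best_hits = 0, -1
    for start in range(max(1, len(target) - window + 1)):
        # Count seeds that lie entirely inside the current window.
        hits = sum(1 for p in hit_positions if start <= p <= start + window - k)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return best_start, min(len(target), best_start + window)


def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_query, aligned_target) for a local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # match or mismatch
                              score[i - 1][j] + gap,      # gap in target
                              score[i][j - 1] + gap)      # gap in query
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


query = "ACGTACGGTCA"
database_seq = "T" * 30 + query + "T" * 30
start, end = densest_seed_window(query, database_seq)
best_score, aligned_query, aligned_target = smith_waterman(query, database_seq[start:end])
```

Restricting the dynamic programming to the selected window is what keeps the quadratic Smith-Waterman step affordable. How SLAST actually defines seed density and window boundaries is not stated in the abstract.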

9. Sorbonne University (Sorbonne Universités), France: PhyChro, a phylogenetic tree reconstruction tool based on gene order

Phylogenetic reconstruction based on synteny block and gene adjacencies (CC BY-NC-ND 4.0)

Gene order can be used as an informative character to reconstruct phylogenetic relationships between species independently from the local information present in gene/protein sequences. PhyChro is a reconstruction method based on chromosomal rearrangements, applicable to a wide range of eukaryotic genomes with different gene contents and levels of synteny conservation. For each synteny breakpoint issued from pairwise genome comparisons, the algorithm defines two disjoint sets of genomes, named partial splits, respectively supporting the two block adjacencies defining the breakpoint. Considering all partial splits issued from all pairwise comparisons, a distance between two genomes is computed from the number of partial splits separating them. Tree reconstruction is achieved through a bottom-up approach by iteratively grouping sister genomes minimizing genome distances. PhyChro estimates branch lengths based on the number of synteny breakpoints and provides confidence scores for the branches. PhyChro performance is evaluated on two datasets of 13 vertebrates and 21 yeast genomes by using up to 130 000 and 179 000 breakpoints respectively, a scale of genomic markers that has been out of reach until now. PhyChro reconstructs very accurate tree topologies even at known problematic branching positions. Its robustness has been benchmarked for different synteny block reconstruction methods. On simulated data PhyChro reconstructs phylogenies perfectly in almost all cases, and shows the highest accuracy compared to other existing tools. PhyChro is very fast, reconstructing the vertebrate and yeast phylogenies in less than 15 min. Availability: PhyChro will be freely available under the BSD license after publication.
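The distance described in the abstract, the number of partial splits separating two genomes, can be sketched in a few lines of Python. The data layout (each partial split as a pair of disjoint sets of genome names), the function name `split_distances`, and the toy splits are all assumptions for illustration. PhyChro's real distance may weight or normalise the counts in ways the abstract does not describe.

```python
from itertools import combinations


def split_distances(partial_splits, genomes):
    """Count, for every genome pair, the partial splits that separate them.

    Each partial split is a (left, right) pair of disjoint sets of genome
    names; a split separates two genomes when they sit on opposite sides.
    """
    distances = {}
    for a, b in combinations(sorted(genomes), 2):
        distances[(a, b)] = sum(
            1 for left, right in partial_splits
            if (a in left and b in right) or (a in right and b in left)
        )
    return distances


# Hypothetical partial splits derived from three synteny breakpoints.
toy_splits = [
    ({"human", "mouse"}, {"chicken"}),
    ({"human"}, {"mouse", "chicken"}),
    ({"human", "mouse"}, {"chicken", "zebrafish"}),
]
toy_genomes = {"human", "mouse", "chicken", "zebrafish"}
toy_distances = split_distances(toy_splits, toy_genomes)
```

A bottom-up tree builder would then repeatedly merge the pair of genomes (or groups) with the smallest distance, which is the grouping step the abstract mentions.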

10. The Söding lab at the Max Planck Institute for Biophysical Chemistry develops a new gene annotation tool for eukaryotic metagenomes

MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics (CC-BY 4.0)

Results: MetaEuk is a toolkit for high-throughput, reference-based discovery and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk’s power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in eight days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups.
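For readers unfamiliar with 6-frame translation, the Python sketch below translates a contig in all three reading frames on both strands using the standard genetic code. It illustrates the general technique only: the function names are hypothetical, and how MetaEuk extracts fragments and combines matches into multi-exon proteins is not shown.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
# Candidate coding fragments are the stretches between stop codons ('*').
fragments = [f for f in frames["+1"].split("*") if f]
```

Splitting each frame on stop codons yields the open stretches that could overlap an exon, and these are the kind of fragments that can then be searched against a reference protein database.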

11. [preprints.org] Humans originated in Botswana? That joke goes a bit too far

Human Origins in Southern African Palaeo-wetlands? Strong Claims from Weak Evidence

Chan and colleagues in their paper titled “Human origins in a southern African palaeo-wetland and first migrations” (https://www.nature.com/articles/s41586-019-1714-1) report 198 novel whole mitochondrial DNA (mtDNA) sequences and infer that ‘anatomically modern humans’ originated in the Makgadikgadi–Okavango palaeo-wetland of southern Africa around 200 thousand years ago. This claim relies on weakly informative data. In addition to flawed logic and questionable assumptions, the authors surprisingly disregard recent evidence and debate on human origins in Africa. As a result, the emphatic and high profile conclusions of the paper are unjustified.

Figure: Location of Botswana in Africa

References

1. Sever, R. et al. bioRxiv: the preprint server for biology. bioRxiv, 2019.

2. Chan, E. K. F. et al. Human origins in a southern African palaeo-wetland and first migrations. Nature, 2019.
