bioRxiv Bioinformatics Highlights, December 2019

Original article by montreal, 生信人

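For bioRxiv, the past year was a bumper one. A tweet from John Inglis, one of bioRxiv's founders, sums it up well: not only a leap in the number of preprints, but also more and easier manuscript transfers, wider acceptance by journals, and the newly launched transparent review.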

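As the year draws to a close, many academic journals have published their annual roundups. Among the top ten scientific advances of 2019 selected by Science [1] is a study described as "an important step on the contested question of the origin of eukaryotes": Imachi et al., of the Japan Agency for Marine-Earth Science and Technology (JAMSTEC), successfully cultured and sequenced an archaeon thought to be closely related to eukaryotes. Remarkably, the paper came from bioRxiv, without formal peer review! Strictly speaking, though, it is not accurate to say that preprints are entirely unreviewed. Every reader can be regarded as a reviewer. Moreover, preprints on hot topics and in frontier fields attract attention online as soon as they are posted, and their authors receive feedback not only through bioRxiv's comment section but also through emails from colleagues and other channels. Preprints can also lend support to one another. The preprint mentioned at the start of this paragraph gathered six Google Scholar citations within four months, which gives a sense of its impact. This issue's roundup includes one of those citing papers: the prominent researcher Ettema of Uppsala University, Sweden, presents his lab's latest archaeal metagenomic sequencing results, which echo those of Imachi et al.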

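In past issues of this roundup, many entries named the research team and its affiliation where brevity allowed. Attentive readers may have noticed some universities or institutions they had never heard of, such as Sorbonne Université in France, which appeared in the previous issue. In fact, that university is a product of the excellence-university initiative pursued since President Macron took office [2]. French universities have long histories and high standards, but they are relatively small, and many were named with sequential numbers. In recent years the authorities have consolidated the French university system on a large scale: the well-known Paris 6 (Pierre and Marie Curie University) and Paris 4 merged to form Sorbonne Université; the three numbered Marseille universities (I, II and III) merged into Aix-Marseille Université; and the University of Nice and other institutions merged into Université Côte d'Azur. This issue features new work from the latter two. In this editor's view, the old numbered names were hard to remember, but the new names are rather odd too. Perhaps, with the major university rankings favoring large, comprehensive institutions, the French have been tempted as well.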

1. Barbry lab, Université Côte d'Azur, France: single-cell sequencing of the human airways

A single-cell atlas of the human healthy airways (CC-BY 4.0)

Results: The resulting atlas is composed of a high percentage of epithelial cells (89.1%), but also immune (6.2%) and stromal (4.7%) cells with peculiar cellular proportions in different sites of the airways. It reveals differential gene expression between identical cell types (suprabasal, secretory, and multiciliated cells) from the nose (MUC4, PI3, SIX3) and tracheobronchial (SCGB1A1, TFF3) airways. By contrast, cell-type specific gene expression was stable across all tracheobronchial samples. Our atlas improves the description of ionocytes, pulmonary neuro-endocrine (PNEC) and brush cells, which are likely derived from a common population of precursor cells. We also report a population of KRT13 positive cells with a high percentage of dividing cells which are reminiscent of “hillock” cells previously described in mouse.

2. Legendre team, Aix-Marseille Université, France: the DNA methylation landscape of giant viruses

The DNA Methylation Landscape of Giant Viruses

DNA methylation is an important epigenetic mark that contributes to various regulations in all domains of life. Prokaryotes use it through Restriction-Modification (R-M) systems as a host-defense mechanism against viruses. The recently discovered giant viruses are widespread dsDNA viruses infecting eukaryotes with gene contents overlapping the cellular world. While they are predicted to encode DNA methyltransferases (MTases), virtually nothing is known about the DNA methylation status of their genomes. Using single-molecule real-time sequencing we studied the complete methylome of a large spectrum of families: the Marseilleviridae, the Pandoraviruses, the Molliviruses, the Mimiviridae along with their associated virophages and transpoviron, the Pithoviruses and the Cedratviruses (of which we report a new strain). Here we show that DNA methylation is widespread in giant viruses although unevenly distributed. We then identified the corresponding viral MTases, all of which are of bacterial origins and subject to intricate gene transfers between bacteria, viruses and their eukaryotic host. If some viral MTases undergo pseudogenization, most are conserved, functional and under purifying selection, suggesting that they increase the viruses’ fitness. While the Marseilleviridae, Pithoviruses and Cedratviruses DNA MTases catalyze N6-methyl-adenine modifications, some MTases of Molliviruses and Pandoraviruses unexpectedly catalyze the formation of N4-methyl-cytosine modifications. In Marseilleviridae, encoded MTases are paired with cognate restriction endonucleases (REases) forming complete R-M systems. Our data suggest that giant viruses MTases could be involved in different kind of virus-virus interactions during coinfections.

3. Smash++, a powerful tool for studying genomic rearrangements: will it let you smash through everything?

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements (CC-BY-ND 4.0)

Background: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial due to their role in chromosomal evolution, genetic disorders and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between two DNA sequences. This computational solution extracts information contents of the two sequences, exploiting a data compression technique, in order for finding rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves and mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions complied with previous studies which took alignment-based approaches or performed FISH (Fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ~1 GB, which makes Smash++ feasible to run on present-day standard computers.
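
The abstract describes Smash++ as alignment-free and built on a data compression technique. As a rough illustration of that general idea only (this is not Smash++'s actual algorithm or interface), the Python sketch below uses the standard-library `zlib` compressor and the normalized compression distance to flag a block that resembles a reference only after reverse-complementing, i.e. a candidate inversion. The helper names, sequence length and threshold are assumptions made up for this example.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Size in bytes of the zlib-compressed sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small for similar sequences, close to 1 for unrelated ones."""
    ca, cb = compressed_size(a), compressed_size(b)
    return (compressed_size(a + b) - min(ca, cb)) / max(ca, cb)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def classify_block(reference: str, block: str, threshold: float = 0.5) -> str:
    """Label a query block relative to a reference block (hypothetical helper)."""
    if ncd(reference, block) < threshold:
        return "direct"
    if ncd(reference, reverse_complement(block)) < threshold:
        return "inverted"
    return "unrelated"


# Synthetic test data: two independent random DNA sequences.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(4000))
other = "".join(rng.choice("ACGT") for _ in range(4000))

# Compare the reference with itself, with its reverse complement, and with unrelated DNA.
labels = [classify_block(ref, q) for q in (ref, reverse_complement(ref), other)]
print(labels)
```

A general-purpose compressor is a crude stand-in chosen to keep the example self-contained; it only conveys why no alignment is needed to tell related blocks from unrelated ones.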

4. Stdpopsim: a powerful open-source project for population genetics

A community-maintained standard library of population genetic models (CC-BY 4.0)

The explosion in population genomic data demands ever more complex modes of analysis, and increasingly these analyses depend on sophisticated simulations. Recent advances in population genetic simulation have made it possible to simulate large and complex models, but specifying such models for a particular simulation engine remains a difficult and error-prone task. Computational genetics researchers currently re-implement simulation models independently, leading to duplication of effort and the possibility for error. Population genetics, as a field, also lacks standard benchmarks by which new tools for inference might be measured. Here we describe a new resource, stdpopsim, that attempts to rectify this situation. Stdpopsim is a community-driven open source project, which provides easy access to a standard catalog of published simulation models from a wide range of organisms and supports multiple simulation engine backends. We share some examples demonstrating how stdpopsim can be used to systematically compare demographic inference methods, and we encourage an even broader community of developers to contribute to this growing resource.

5. Suh, Uppsala University, Sweden: a systematic comparison of technologies for assembling a (bird-of-paradise) genome

Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird-of-paradise (CC-BY-NC 4.0)

Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies have opened up a whole new world of genomic biodiversity. Although these technologies generate high-quality genome assemblies, there are still genomic regions difficult to assemble, like repetitive elements and GC-rich regions (genomic “dark matter”). In this study, we compare the efficiency of currently used sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter starting from the same sample. By adopting different de-novo assembly strategies, we were able to compare each individual draft assembly to a curated multiplatform one and identify the nature of the previously missing dark matter with a particular focus on transposable elements, multi-copy MHC genes, and GC-rich regions. Thanks to this multiplatform approach, we demonstrate the feasibility of producing a high-quality chromosome-level assembly for a non-model organism (paradise crow) for which only suboptimal samples are available. Our approach was able to reconstruct complex chromosomes like the repeat-rich W sex chromosome and several GC-rich microchromosomes. Telomere-to-telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects around the completeness of both the coding and non-coding parts of the genomes.

6. Ettema, Uppsala University, Sweden: new archaeal metagenomic sequencing brings us a step closer to unraveling the origin of eukaryotes

Near-complete Lokiarchaeota genomes from complex environmental samples using long and short read metagenomic analyses (CC-BY-NC-ND 4.0)

Asgard archaea is a recently proposed superphylum currently comprised of five recognised phyla: Lokiarchaeota, Thorarchaeota, Odinarchaeota, Heimdallarchaeota and Helarchaeota. Members of this group have been identified based on culture-independent approaches with several metagenome-assembled genomes (MAGs) reconstructed to date. However, most of these genomes consist of several relatively small contigs, and, until recently, no complete Asgard archaea genome is yet available. Large scale phylogenetic analyses suggest that Asgard archaea represent the closest archaeal relatives of eukaryotes. In addition, members of this superphylum encode proteins that were originally thought to be specific to eukaryotes, including components of the trafficking machinery, cytoskeleton and endosomal sorting complexes required for transport (ESCRT). Yet, these findings have been questioned on the basis that the genome sequences that underpin them were assembled from metagenomic data, and could have been subjected to contamination and other assembly artefacts. Even though several lines of evidence indicate that the previously reported findings were not affected by these issues, having access to high-quality and preferentially fully closed Asgard archaea genomes is needed to definitively close this debate. Current long-read sequencing technologies such as Oxford Nanopore allow the generation of long reads in a high-throughput manner making them suitable for their use in metagenomics. Although the use of long reads is still limited in this field, recent analyses have shown that it is feasible to obtain complete or near-complete genomes of abundant members of mock communities and metagenomes of various level of complexity. Here, we show that long read metagenomics can be successfully applied to obtain near-complete genomes of low-abundant members of complex communities from sediment samples. 
We were able to reconstruct six MAGs from different Lokiarchaeota lineages that show high completeness and low fragmentation, with one of them being a near-complete genome only consisting of three contigs. Our analyses confirm that the eukaryote-like features previously associated with Lokiarchaeota are not the result of contamination or assembly artefacts, and can indeed be found in the newly reconstructed genomes.

7. A heated debate: is there a microbiome in the placenta or not?

No consistent evidence for microbiota in murine placental and fetal tissues (CC-BY-NC-ND 4.0)

The existence of a placental microbiota and in utero colonization of the fetus has been the subject of recent debate. The objective of this study was to determine whether the placental and fetal tissues of mice harbor bacterial communities. Bacterial profiles of the placenta and fetal brain, lung, liver, and intestine were characterized through culture, qPCR, and 16S rRNA gene sequencing. These profiles were compared to those of the maternal mouth, lung, liver, uterus, cervix, vagina, and intestine, as well as to background technical controls. Positive bacterial cultures from placental and fetal tissues were rare; of the 165 total bacterial cultures of placental tissues from the 11 mice included in this study, only nine yielded at least a single colony, and five of those nine positive cultures came from a single mouse. Cultures of fetal intestinal tissues yielded just a single bacterial isolate: Staphylococcus hominis, a common skin bacterium. Bacterial loads of placental and fetal brain, lung, liver, and intestinal tissues were not higher than those of DNA contamination controls and did not yield substantive 16S rRNA gene sequencing libraries. From all placental or fetal tissues (N = 49), there was only a single bacterial isolate that came from a fetal brain sample having a bacterial load higher than that of contamination controls and that was identified in sequence-based surveys of at least one of its corresponding maternal samples. Therefore, using multiple modes of microbiologic inquiry, there was not consistent evidence of bacterial communities in the placental and fetal tissues of mice.

Further reading: New study challenges the existence of a human placental microbiome, but is itself widely questioned (生物谷)

8. Przeworski, Columbia University: population genomics moves corals toward precision medicine

Population genetics of the coral Acropora millepora: Towards a genomic predictor of bleaching

Although reef-building corals are rapidly declining worldwide, responses to bleaching vary both within and among species. Because these inter-individual differences are partly heritable, they should in principle be predictable from genomic data. Towards that goal, we generated a chromosome-scale genome assembly for the coral Acropora millepora. We then obtained whole genome sequences for 237 phenotyped samples collected at 12 reefs distributed along the Great Barrier Reef, among which we inferred very little population structure. Scanning the genome for evidence of local adaptation, we detected signatures of long-term balancing selection in the heat-shock co-chaperone sacsin. We further used 213 of the samples to conduct a genome-wide association study of visual bleaching score, incorporating the polygenic score derived from it into a predictive model for bleaching in the wild. These results set the stage for the use of genomics-based approaches in conservation strategies.
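
The abstract mentions deriving a polygenic score from the GWAS and feeding it into a predictive model. In its simplest additive form, such a score is a weighted sum of allele dosages, with one weight (effect size) per SNP. The Python sketch below shows only that calculation; the SNP identifiers, effect sizes and genotypes are invented for illustration and do not come from the preprint, whose actual model may differ.

```python
from typing import Dict


def polygenic_score(effect_sizes: Dict[str, float], dosages: Dict[str, int]) -> float:
    """Additive polygenic score: sum of effect size times alternate-allele dosage (0, 1 or 2).

    SNPs missing from the sample's genotypes are skipped.
    """
    return sum(beta * dosages[snp] for snp, beta in effect_sizes.items() if snp in dosages)


# Hypothetical per-SNP effects on a bleaching score (made-up values).
effects = {"snp_a": 0.25, "snp_b": -0.5, "snp_c": 0.125}

# Hypothetical dosages for two coral colonies.
colony_1 = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
colony_2 = {"snp_a": 0, "snp_b": 2, "snp_c": 1}

score_1 = polygenic_score(effects, colony_1)  # 0.5 + 0.0 + 0.125
score_2 = polygenic_score(effects, colony_2)  # 0.0 - 1.0 + 0.125
print(score_1, score_2)
```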

9. Wellcome Sanger Institute: genome sequencing of 1,142 mosquitoes. Can it help humanity find a good strategy against malaria?

Genome variation and population structure among 1,142 mosquitoes of the African malaria vector species Anopheles gambiae and Anopheles coluzzii (CC-BY-NC 4.0)

Mosquito control remains a central pillar of efforts to reduce malaria burden in sub-Saharan Africa. However, insecticide resistance is entrenched in malaria vector populations, and countries with high malaria burden face a daunting challenge to sustain malaria control with a limited set of surveillance and intervention tools. Here we report on the second phase of a project to build an open resource of high quality data on genome variation among natural populations of the major African malaria vector species Anopheles gambiae and Anopheles coluzzii. We analysed whole genomes of 1,142 individual mosquitoes sampled from the wild in 13 African countries, and a further 234 individuals comprising parents and progeny of 11 lab crosses. The data resource includes high confidence single nucleotide polymorphism (SNP) calls at 57 million variable sites, genome-wide copy number variation calls, and haplotypes phased at biallelic SNPs. We used the SNP data to analyse genetic population structure, compute allele frequencies, and characterise genetic diversity within and between populations. We illustrate the utility of these data by investigating species differences in isolation by distance, genetic variation within proposed gene drive target sequences, and patterns of resistance to pyrethroid insecticides. This data resource provides a foundation for developing new operational systems for molecular surveillance, and for accelerating research and development of new vector control tools.
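
Two of the summary statistics named in the abstract, allele frequencies and genetic diversity, are simple to compute once genotypes have been called. The Python sketch below does so for a toy genotype matrix. The data are invented, and diversity is taken here as the unbiased expected heterozygosity per site, 2p(1-p)·n/(n-1) with n the number of sampled chromosomes, averaged over the listed sites; the project's own pipeline is not reproduced.

```python
from typing import List

# Rows are biallelic SNPs, columns are diploid individuals;
# each entry is the count of alternate alleles (0, 1 or 2). Toy data.
genotypes: List[List[int]] = [
    [0, 1, 2, 1],
    [0, 0, 0, 1],
    [2, 2, 2, 2],
]


def alt_allele_frequency(site: List[int]) -> float:
    """Alternate-allele frequency at one site among the 2N sampled chromosomes."""
    return sum(site) / (2 * len(site))


def site_diversity(site: List[int]) -> float:
    """Unbiased expected heterozygosity at one biallelic site."""
    n = 2 * len(site)  # number of sampled chromosomes
    p = alt_allele_frequency(site)
    return 2 * p * (1 - p) * n / (n - 1)


frequencies = [alt_allele_frequency(s) for s in genotypes]
mean_diversity = sum(site_diversity(s) for s in genotypes) / len(genotypes)
print(frequencies, round(mean_diversity, 4))
```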

10. Tavazoie, Columbia University: a new method for single-cell sequencing of prokaryotes

Prokaryotic Single-Cell RNA Sequencing by In Situ Combinatorial Indexing (CC-BY-NC-ND 4.0)

Despite longstanding appreciation of gene expression heterogeneity in isogenic bacterial populations, affordable and scalable technologies for studying single bacterial cells have been limited. While single-cell RNA sequencing (scRNA-seq) has revolutionized studies of transcriptional heterogeneity in diverse eukaryotic systems, application of scRNA-seq to prokaryotic cells has been hindered by their low levels of mRNA, lack of mRNA polyadenylation, and thick cell walls. Here, we present Prokaryotic Expression-profiling by Tagging RNA In Situ and sequencing (PETRI-seq), a high-throughput prokaryotic scRNA-seq pipeline that overcomes these obstacles. PETRI-seq uses in situ combinatorial indexing to barcode transcripts from tens of thousands of cells in a single experiment. We have demonstrated that PETRI-seq effectively captures single cell transcriptomes of Gram-negative and Gram-positive bacteria with high purity and little bias. Although bacteria express only thousands of mRNAs per cell, captured mRNA levels were sufficient to distinguish between the transcriptional states of single cells within isogenic populations. In E. coli, we were able to identify single cells in either stationary or exponential phase and define consensus transcriptomes for these sub-populations. In wild type S. aureus, we detected a rare population of cells undergoing prophage induction. We anticipate that PETRI-seq will be widely useful for studying transcriptional heterogeneity in microbial communities.
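
The appeal of combinatorial indexing comes down to counting: if cells pass through several rounds of barcoding and pick up one barcode per round, the number of distinct barcode combinations grows multiplicatively. The Python sketch below works through that arithmetic and estimates how often a cell would share its combination with another cell. The figures used (three rounds, 96 barcodes per round, 10,000 cells) are illustrative assumptions, not parameters stated in the abstract.

```python
def barcode_combinations(barcodes_per_round: int, rounds: int) -> int:
    """Number of distinct barcode combinations after split-pool indexing."""
    return barcodes_per_round ** rounds


def expected_collision_fraction(n_cells: int, n_combinations: int) -> float:
    """Probability that a given cell shares its combination with at least one other cell,
    assuming every cell draws a combination uniformly and independently."""
    return 1.0 - (1.0 - 1.0 / n_combinations) ** (n_cells - 1)


combos = barcode_combinations(96, 3)  # assumed: 3 rounds of 96 barcodes
collisions = expected_collision_fraction(10_000, combos)  # assumed: 10,000 cells
print(combos, round(collisions, 4))
```

Under these assumed numbers, roughly one cell in a hundred would be expected to share a barcode combination, which is why the number of combinations is kept much larger than the number of cells.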

References

1. 2019 BREAKTHROUGH of the YEAR https://vis.sciencemag.org/breakthrough2019/

2. 许浙景教育 | Progress and outcomes of France's "excellence university" initiative (法国“卓越大学”建设进程及成效) http://www.sohu.com/a/281521420_764031
