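Last month, as the novel coronavirus continued to spread, more and more related articles appeared on preprint platforms. A search for the keyword "2019-nCoV" shows that bioRxiv took in 21 and 68 articles in the first two months of this year, respectively. Its sister platform medRxiv, the site dedicated to medical preprints, had only three in January but jumped to 178 in February. One likely reason is that medical research needs more time and data to accumulate, so its results come out more slowly. Another is that more researchers may have come to see medRxiv as the better home for coronavirus preprints; the low January count probably also reflects the fact that medRxiv launched only last year and had not yet been widely publicized.
Overall, bioRxiv still covers a broader range than medRxiv. Besides the various fields of biology, it also includes areas such as Scientific Communication and Education that general biology journals rarely touch. Last month, researchers from Deakin University in Australia analyzed the misunderstandings and communication problems caused by mismatched expectations between students and supervisors (selection 11).
bioRxiv sorts articles into three types (note that this is not a classification by subject): new results, confirmatory results, and contradictory results (in other words, bioRxiv in principle does not accept review articles). Confirmatory and contradictory papers are often hard to publish in peer-reviewed journals: the former lack novelty, while the latter are seen as too much of a challenge to accepted findings. Yet the contribution of these results to science should not be overlooked. bioRxiv provides exactly such a platform and is a good complement to existing academic journals. For this issue of 好文速览 (a quick look at good papers) we have also picked two papers of these "special types". In particular, the group of Yang Zhang, the well-known structural bioinformatician at the University of Michigan, disagrees with a recent article that used codon analysis to predict snakes as a host of the novel coronavirus [1]. Let's take a look.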
1. Mark Robinson's team at the University of Zurich: pipeComp, a tool for comparing single-cell sequencing pipelines
pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools
The massive growth of single-cell RNA-sequencing (scRNAseq) and methods for its analysis still lacks sufficient and up-to-date benchmarks that would guide analytical choices. Moreover, current studies are often focused on isolated steps of the process. Here, we present a flexible R framework for pipeline comparison with multi-level evaluation metrics and apply it to the benchmark of scRNAseq analysis pipelines using datasets with known cell identities. We evaluate common steps of such analyses, including filtering, doublet detection (suggesting a new R package, scDblFinder), normalization, feature selection, denoising, dimensionality reduction and clustering. On the basis of these analyses, we make a number of concrete recommendations about analysis choices. The evaluation framework, pipeComp, has been implemented so as to easily integrate any other step or tool, allowing extensible benchmarks and easy application to other fields (https://github.com/plger/pipeComp).
2. Slotkin at the Donald Danforth Plant Science Center: a new annotation of transposable elements using TE-friendly Arabidopsis lines
Long-read cDNA Sequencing Enables a ‘Gene-Like’ Transcript Annotation of Arabidopsis Transposable Elements
High-quality transcript-based annotations of genes facilitates both genome-wide analyses and detailed single locus research. In contrast, transposable element (TE) annotations are rudimentary, consisting of only information on location and type of TE. When analyzing TEs, their repetitiveness and limited annotation prevents the ability to distinguish between potentially functional expressed elements and degraded copies. To improve genome-wide TE bioinformatics, we performed long-read Oxford Nanopore sequencing of cDNAs from Arabidopsis lines deficient in multiple layers of TE repression. We used these uniquely-mapping transcripts to identify the set of TEs able to generate mRNAs, and created a new transcript-based annotation of TEs that we have layered upon the existing high-quality community standard TAIR10 annotation. The improved annotation enables us to test specific standing hypotheses in the TE field. We demonstrate that inefficient TE splicing does not trigger small RNA production, and the cell more strongly targets DNA methylation to TEs that have the potential to make mRNAs. This work provides a transcript-based TE annotation for Arabidopsis, and serves as a blueprint to reduce the genomic complexity associated with repetitive TEs in any organism.
3. Researchers at the University of Sheffield (UK) report widespread lateral gene transfer among grasses
Phylogenetic relatedness, co-occurrence, and rhizomes increase lateral gene transfer among grasses
Here we scan the genomes of a diverse set of grass species that span more than 50 million years of divergence and include major crops. We identify protein coding LGT in a majority of them (13 out of 17). There is variation among species in the amount of LGT received, with rhizomatous species receiving more genes. In addition, the amount of LGT increases with phylogenetic relatedness, which might reflect genomic compatibility among close relatives facilitating successful transfers. However, we also observe genetic exchanges among distantly related species that diverged shortly after the origin of the grass family when they co-occur in the wild, pointing to a role of biogeography. The dynamics of successful LGT in grasses therefore appear to be dependent on both opportunity (co-occurrence and rhizomes) and compatibility (phylogenetic distance). Overall, we show that LGT is a widespread phenomenon in grasses, which is boosted by repeated contact among related lineages. The process has moved functional genes across the entire grass family into domesticated and wild species alike.
4. Hunter Shain's lab at the University of California, San Francisco gives a panoramic view of the genomic landscapes of melanocytes
The genomic landscapes of individual melanocytes from human skin
Every cell in the human body has a unique set of somatic mutations, yet it remains difficult to comprehensively genotype an individual cell. Here, we developed solutions to overcome this obstacle in the context of normal human skin, thus offering the first glimpse into the genomic landscapes of individual melanocytes from human skin. We comprehensively genotyped 133 melanocytes from 19 sites across 6 donors. As expected, sun-shielded melanocytes had fewer mutations than sun-exposed melanocytes. However, within sun-exposed sites, melanocytes on chronically sun-exposed skin (e.g. the face) displayed a lower mutation burden than melanocytes on intermittently sun-exposed skin (e.g. the back). Melanocytes located adjacent to a skin cancer had higher mutation burdens than melanocytes from donors without skin cancer, implying that the mutation burden of normal skin can be harnessed to measure cumulative sun damage and skin cancer risk. Moreover, melanocytes from healthy skin commonly harbor pathogenic mutations, likely explaining the origins of the melanomas that arise in the absence of a pre-existing nevus. Phylogenetic analyses identified groups of related melanocytes, suggesting that melanocytes spread throughout skin as fields of clonally related cells, invisible to the naked eye. Overall, our study offers an unprecedented view into the genomic landscapes of individual melanocytes, revealing key insights into the causes and origins of melanoma.
5. iGenomics: genome analysis you can hold in the palm of your hand
iGenomics: Comprehensive DNA Sequence Analysis on your Smartphone
iGenomics is the first comprehensive mobile genome analysis application, with capabilities to align reads, call variants, and visualize the results entirely on an iOS device. Implemented in Objective-C using the FM-index, banded dynamic programming, and other high-performance bioinformatics techniques, iGenomics is optimized to run in a mobile environment. We benchmark iGenomics using a variety of real and simulated Nanopore sequencing datasets and show that iGenomics has performance comparable to the popular BWA-MEM/Samtools/IGV suite, without needing a laptop or server cluster. iGenomics is available open-source (https://github.com/stuckinaboot/iGenomics) and for free on Apple’s App Store (https://apps.apple.com/us/app/igenomics-mobile-dna-analysis/id1495719841).
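For readers unfamiliar with the FM-index mentioned in the abstract, here is a minimal Python sketch of the idea: build the Burrows-Wheeler transform of a reference, then locate exact matches of a read by backward search. This is not iGenomics code (the app is written in Objective-C); the function names `build_fm_index` and `find` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored in full, so it only suits toy inputs.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (C table, occurrence table, suffix array) for text."""
    text += "$"  # sentinel, sorts before A/C/G/T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)

    # c_table[c] = number of characters in the text strictly smaller than c.
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ, sa


def find(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    c_table, occ, sa = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(find("ATTA", index))  # [1, 8]
```

Each character of the read narrows the interval with two table lookups, so the search cost depends on the read length rather than the reference length. Real aligners additionally sample the occurrence table and suffix array to save memory, which matters on a phone.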
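6. Researchers at the University of British Columbia (Canada) release a hybrid assembler for second- and third-generation sequencing reads, claimed to be the fastest and most accurate of its kind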
HASLR: Fast Hybrid Assembly of Long Reads
Third generation sequencing technologies from platforms such as Oxford Nanopore Technologies and Pacific Biosciences have paved the way for building more contiguous assemblies and complete reconstruction of genomes. The larger effective length of the reads generated with these technologies has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive while faster methods are not as accurate. Therefore, there is still an unmet need for tools that are both fast and accurate for reconstructing small and large genomes. Despite the recent advances in third generation sequencing, researchers tend to generate second generation reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the generated assemblies in terms of contiguity and accuracy are on par with the other tools on most of the samples. Availability: HASLR is an open source tool available at https://github.com/vpc-ccg/haslr.
7. University of Cambridge team: aneuploidies can predict oesophageal cancer years in advance
Genomic copy number predicts oesophageal cancer years before transformation
Cancer arises through a process of somatic evolution and recent studies have shown that aneuploidies and driver gene mutations precede cancer diagnosis by several years to decades. Here, we address the question whether such genomic signals can be used for early detection and pre-emptive cancer treatment. To this end we study Barrett’s oesophagus, a genomic copy number driven neoplastic precursor lesion to oesophageal adenocarcinoma. We use shallow whole genome sequencing of 777 biopsies sampled from 88 patients in surveillance for Barrett’s oesophagus over a period of up to 15 years. These data show that genomic signals exist that distinguish progressive from stable disease with an AUC of 0.87 and a sensitivity of 50% even ten years prior to histopathological disease transformation. These findings are validated on two independent cohorts of 75 and 248 patients. Compared against current patient management guidelines genomic risk classification enables earlier treatment for high risk patients as well as reduction of unnecessary treatment and monitoring for patients who are unlikely to develop cancer.
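As a reminder of what the AUC and sensitivity figures in this abstract measure, the short Python sketch below computes both from risk scores. The scores are made-up toy values, not data from the study, and the function names are hypothetical. AUC is computed here as the fraction of (progressor, non-progressor) pairs in which the progressor gets the higher score, with ties counted as half.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive case outscores a random negative one."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity(pos_scores, threshold):
    """Fraction of positive cases flagged at or above the threshold."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def specificity(neg_scores, threshold):
    """Fraction of negative cases left unflagged (below the threshold)."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)


if __name__ == "__main__":
    # Toy risk scores, invented for illustration only.
    progressors = [0.9, 0.8, 0.4, 0.35]
    stable = [0.7, 0.3, 0.2, 0.1]
    print(auc(progressors, stable))        # 0.875
    print(sensitivity(progressors, 0.75))  # 0.5
    print(specificity(stable, 0.75))       # 1.0
```

AUC summarizes ranking quality across all thresholds, while sensitivity depends on the particular cut-off chosen, so the two figures answer different questions about the same classifier.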
8. Beijing GrandOmics (希望组): PacBio HiFi vs ONT ultralong reads for rice genome assembly
Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacbio Sequel II system and ultralong reads of Oxford Nanopore
The availability of reference genomes has revolutionized the study of biology. Multiple competing technologies have been developed to improve the quality and robustness of genome assemblies during the last decade. The two widely-used long read sequencing providers – Pacbio (PB) and Oxford Nanopore Technologies (ONT) – have recently updated their platforms: PB enable high throughput HiFi reads with base-level resolution with >99% and ONT generated reads as long as 2 Mb. We applied the two up-to-date platforms to one single rice individual, and then compared the two assemblies to investigate the advantages and limitations of each. The results showed that ONT ultralong reads delivered higher contiguity producing a total of 18 contigs of which 10 were assembled into a single chromosome compared to that of 394 contigs and three chromosome-level contigs for the PB assembly. The ONT ultralong reads also prevented assembly errors caused by long repetitive regions for which we observed a total 44 genes of false redundancies and 10 genes of false losses in the PB assembly leading to over/under-estimations of the gene families in those long repetitive regions. We also noted that the PB HiFi reads generated assemblies with considerably less errors at the level of single nucleotide and small InDels than that of the ONT assembly which generated an average 1.06 errors per Kb assembly and finally engendered 1,475 incorrect gene annotations via altered or truncated protein predictions.
9. Confirmatory results: bioinformatic analysis successfully detects a large number of known cancer mutation hotspots
Modeling and analysis of site-specific mutations in cancer identifies known plus putative novel hotspots and bias due to contextual sequences
In cancer, recurrently mutated sites in DNA and proteins, called hotspots, are thought to be raised by positive selection and therefore important due to its potential functional impact. Although recent evidence for APOBEC enzymatic activity have shown that specific types of sequences are likely to be false, the identification of putative hotspots is important to confirm either its functional role or its mechanistic bias. In this work, an algorithm and a statistical model is presented to detect hotspots. The model consists of a beta-binomial component plus fixed effects that efficiently fits the distribution of mutated sites. The algorithm employs an optimal step-wise approach to find the model parameters. Simulations show that the proposed algorithmic model is highly accurate for common hotspots. The approach has been applied to TCGA mutational data from 33 cancer types. The results show that well-known cancer hotspots are easily detected. Besides, novel hotspots are also detected. An analysis of the sequence context of detected hotspots show a preference for TCG sites that may be related to APOBEC or other unknown mechanistic biases. The detected hotspots are available online in http://bioinformatica.mty.itesm.mx/HotSpotsAnnotations.
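To give a feel for the beta-binomial component described in this abstract, here is a minimal Python sketch that scores how surprising it is to see a site mutated in k of n samples. It is not the authors' algorithm: the fixed effects and the step-wise parameter fitting are omitted, the parameters `alpha` and `beta` are assumed to be given, and the site names and counts are invented toy values.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ BetaBinomial(n, alpha, beta)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose
                    + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))


def upper_tail(k, n, alpha, beta):
    """P(X >= k): chance of at least k mutated samples under the background."""
    total = sum(betabinom_pmf(i, n, alpha, beta) for i in range(k, n + 1))
    return min(1.0, total)  # guard against floating-point overshoot


def call_hotspots(site_counts, n, alpha, beta, threshold=1e-6):
    """Return sites whose mutation count is improbable under the background."""
    return [site for site, k in site_counts.items()
            if upper_tail(k, n, alpha, beta) < threshold]


if __name__ == "__main__":
    # Toy input: two hypothetical sites observed across 1000 samples.
    counts = {"site_a": 120, "site_b": 1}
    print(call_hotspots(counts, n=1000, alpha=0.5, beta=500.0))  # ['site_a']
```

Compared with a plain binomial, the beta-binomial lets the per-site mutation rate vary, which absorbs the overdispersion seen in real mutation counts and makes the background less likely to flag merely noisy sites. The threshold of 1e-6 is an arbitrary illustrative choice; a real analysis would also correct for multiple testing.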
10. Contradictory results: bioinformatic evidence that snakes are not a host of the novel coronavirus
Protein structure and sequence re-analysis of 2019-nCoV genome does not indicate snakes as its intermediate host or the unique similarity between its spike protein insertions and HIV-1
As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions shared a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data of coronaviruses, we concluded that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the “novel insertions” observed in the spike protein are naturally evolved from bat coronaviruses.
11. Bonus preprint: a study of how mismatched expectations between students and supervisors affect communication and research
Supervising the PhD: identifying common mismatches in expectations between candidate and supervisor to improve research training outcomes
The relationship between a PhD candidate and their supervisor is influential in not only successful candidate completion, but maintaining candidate satisfaction and mental health. We quantified potential mismatches between the PhD candidates' and supervisors' expectations as a potential mechanism that facilitates poor candidate experiences and research training outcomes. 114 PhD candidates and 52 supervisors ranked the importance of student attributes and outcomes at the beginning and end of candidature. In relation to specific attributes, supervisors indicated the level of guidance they expected to give the candidate and candidates indicated the level of guidance they expected to receive. Candidates also reported on whether different aspects of candidature influenced their mental well-being. We identified differences between candidates' and supervisors' perceived supervisor teaching responsibility and influences on mental well-being. Our results indicate that the majority of candidates were satisfied overall with their supervision, and find alignment of many expectations between both parties. Yet, we find that candidates have much higher expectations of achieving quantitative outcomes than supervisors. Supervisors believed they give more guidance to candidates than candidates perceive they received, and supervisors often only provided guidance when the candidate explicitly asked. Personal expectations and research progress significantly and negatively influenced over 50% of candidates' mental well-being. Our results highlight the importance of candidates and supervisors explicitly communicating the responsibilities and expectations of the roles they play in helping candidates develop research skills. We provide four suggestions to supervisors that may be particularly effective at increasing communication, avoiding potential conflict and promoting candidate success and wellbeing.
References
1. Ji, W., Wang, W., Zhao, X., Zai, J. & Li, X. J. Med. Virol. https://doi.org/10.1002/jmv.25682 (2020).