A Quick Look at Noteworthy Bioinformatics Preprints on bioRxiv, February 2020

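Last month, as the novel coronavirus continued to spread, more and more related papers appeared on preprint platforms. A keyword search for "2019-nCoV" returns 21 and 68 papers on bioRxiv for January and February of this year, respectively. Its sister platform medRxiv, a site dedicated to medical preprints, had only three in January but jumped to 178 in February. One likely reason is that medical research needs more time and data to accumulate, so its papers come out more slowly. Another is that more researchers may have come to see medRxiv as the more suitable home for coronavirus preprints; the slow start also reflects the fact that medRxiv launched only last year and had received little publicity.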

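Overall, bioRxiv still covers a broader range of content than medRxiv. Besides every field of biology, it includes areas such as Scientific Communication and Education that ordinary biology journals rarely cover. Last month, researchers from Deakin University in Australia analyzed the misunderstandings and communication problems caused by differing expectations between students and supervisors (selection 11).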

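bioRxiv sorts papers into three types (by type of result, not by subject): new results, confirmatory results, and contradictory results. In other words, bioRxiv in principle does not accept review articles. Confirmatory and contradictory papers are often hard to publish in peer-reviewed journals: the former lack novelty, while the latter are seen as too confrontational. Yet their contribution to science should not be overlooked. bioRxiv provides a venue for exactly this kind of work and is a good complement to existing academic journals. For this issue we also picked two papers of these "special types". In particular, the group of Yang Zhang (张阳), the well-known structural bioinformatician at the University of Michigan, disputes a recent paper that used codon analysis to predict snakes as a host of the novel coronavirus [1]. Let's take a look.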


1. Mark Robinson's group at the University of Zurich: pipeComp, a tool for comparing single-cell sequencing pipelines

pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools

The massive growth of single-cell RNA-sequencing (scRNAseq) and methods for its analysis still lacks sufficient and up-to-date benchmarks that would guide analytical choices. Moreover, current studies are often focused on isolated steps of the process. Here, we present a flexible R framework for pipeline comparison with multi-level evaluation metrics and apply it to the benchmark of scRNAseq analysis pipelines using datasets with known cell identities. We evaluate common steps of such analyses, including filtering, doublet detection (suggesting a new R package, scDblFinder), normalization, feature selection, denoising, dimensionality reduction and clustering. On the basis of these analyses, we make a number of concrete recommendations about analysis choices. The evaluation framework, pipeComp, has been implemented so as to easily integrate any other step or tool, allowing extensible benchmarks and easy application to other fields (https://github.com/plger/pipeComp).


2. Slotkin lab at the Donald Danforth Plant Science Center: a new annotation of transposable elements using TE-friendly Arabidopsis lines

Long-read cDNA Sequencing Enables a ‘Gene-Like’ Transcript Annotation of Arabidopsis Transposable Elements

High-quality transcript-based annotations of genes facilitates both genome-wide analyses and detailed single locus research. In contrast, transposable element (TE) annotations are rudimentary, consisting of only information on location and type of TE. When analyzing TEs, their repetitiveness and limited annotation prevents the ability to distinguish between potentially functional expressed elements and degraded copies. To improve genome-wide TE bioinformatics, we performed long-read Oxford Nanopore sequencing of cDNAs from Arabidopsis lines deficient in multiple layers of TE repression. We used these uniquely-mapping transcripts to identify the set of TEs able to generate mRNAs, and created a new transcript-based annotation of TEs that we have layered upon the existing high-quality community standard TAIR10 annotation. The improved annotation enables us to test specific standing hypotheses in the TE field. We demonstrate that inefficient TE splicing does not trigger small RNA production, and the cell more strongly targets DNA methylation to TEs that have the potential to make mRNAs. This work provides a transcript-based TE annotation for Arabidopsis, and serves as a blueprint to reduce the genomic complexity associated with repetitive TEs in any organism.

3. Researchers at the University of Sheffield (UK) report widespread lateral gene transfer among grasses

Phylogenetic relatedness, co-occurrence, and rhizomes increase lateral gene transfer among grasses

Here we scan the genomes of a diverse set of grass species that span more than 50 million years of divergence and include major crops. We identify protein coding LGT in a majority of them (13 out of 17). There is variation among species in the amount of LGT received, with rhizomatous species receiving more genes. In addition, the amount of LGT increases with phylogenetic relatedness, which might reflect genomic compatibility among close relatives facilitating successful transfers. However, we also observe genetic exchanges among distantly related species that diverged shortly after the origin of the grass family when they co-occur in the wild, pointing to a role of biogeography. The dynamics of successful LGT in grasses therefore appear to be dependent on both opportunity (co-occurrence and rhizomes) and compatibility (phylogenetic distance). Overall, we show that LGT is a widespread phenomenon in grasses, which is boosted by repeated contact among related lineages. The process has moved functional genes across the entire grass family into domesticated and wild species alike.


4. Hunter Shain's lab at the University of California, San Francisco: a panoramic view of the genomic landscapes of melanocytes

The genomic landscapes of individual melanocytes from human skin

Every cell in the human body has a unique set of somatic mutations, yet it remains difficult to comprehensively genotype an individual cell. Here, we developed solutions to overcome this obstacle in the context of normal human skin, thus offering the first glimpse into the genomic landscapes of individual melanocytes from human skin. We comprehensively genotyped 133 melanocytes from 19 sites across 6 donors. As expected, sun-shielded melanocytes had fewer mutations than sun-exposed melanocytes. However, within sun-exposed sites, melanocytes on chronically sun-exposed skin (e.g. the face) displayed a lower mutation burden than melanocytes on intermittently sun-exposed skin (e.g. the back). Melanocytes located adjacent to a skin cancer had higher mutation burdens than melanocytes from donors without skin cancer, implying that the mutation burden of normal skin can be harnessed to measure cumulative sun damage and skin cancer risk. Moreover, melanocytes from healthy skin commonly harbor pathogenic mutations, likely explaining the origins of the melanomas that arise in the absence of a pre-existing nevus. Phylogenetic analyses identified groups of related melanocytes, suggesting that melanocytes spread throughout skin as fields of clonally related cells, invisible to the naked eye. Overall, our study offers an unprecedented view into the genomic landscapes of individual melanocytes, revealing key insights into the causes and origins of melanoma.


5. iGenomics: genome analysis in the palm of your hand

iGenomics: Comprehensive DNA Sequence Analysis on your Smartphone

iGenomics is the first comprehensive mobile genome analysis application, with capabilities to align reads, call variants, and visualize the results entirely on an iOS device. Implemented in Objective-C using the FM-index, banded dynamic programming, and other high-performance bioinformatics techniques, iGenomics is optimized to run in a mobile environment. We benchmark iGenomics using a variety of real and simulated Nanopore sequencing datasets and show that iGenomics has performance comparable to the popular BWA-MEM/Samtools/IGV suite, without needing a laptop or server cluster. iGenomics is available open-source (https://github.com/stuckinaboot/iGenomics) and for free on Apple’s App Store (https://apps.apple.com/us/app/igenomics-mobile-dna-analysis/id1495719841).
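The abstract names the FM-index as one of the techniques behind read alignment. As a rough illustration of the idea only (this is not iGenomics code, which is written in Objective-C), the following Python sketch builds a toy FM-index and uses backward search to find exact matches. The function names `build_fm_index` and `find_occurrences` are made up for this example, and the naive suffix-array construction is assumed to be acceptable only for very short sequences.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for `text`."""
    text += "$"  # sentinel; assumed absent from the input and smaller than A/C/G/T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[c] = number of characters in the text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return suffix_array, c_table, occ


def find_occurrences(pattern, index):
    """Return sorted start positions of exact matches of `pattern`."""
    suffix_array, c_table, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find_occurrences("TTA", index))
```

Each pattern character costs a constant number of table lookups, so search time depends on the read length rather than the genome length. Real aligners replace the full Occ table with sampled checkpoints to save memory, which matters on a phone.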


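6. Researchers at the University of British Columbia (Canada) release a hybrid assembler for second- and third-generation reads, claiming it is the fastest and most accurate of its kind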

HASLR: Fast Hybrid Assembly of Long Reads

Third generation sequencing technologies from platforms such as Oxford Nanopore Technologies and Pacific Biosciences have paved the way for building more contiguous assemblies and complete reconstruction of genomes. The larger effective length of the reads generated with these technologies has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive while faster methods are not as accurate. Therefore, there is still an unmet need for tools that are both fast and accurate for reconstructing small and large genomes. Despite the recent advances in third generation sequencing, researchers tend to generate second generation reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the generated assemblies in terms of contiguity and accuracy are on par with the other tools on most of the samples. Availability: HASLR is an open source tool available at https://github.com/vpc-ccg/haslr.


7. University of Cambridge team: aneuploidies (abnormal chromosome copy numbers) can predict oesophageal cancer years in advance

Genomic copy number predicts oesophageal cancer years before transformation

Cancer arises through a process of somatic evolution and recent studies have shown that aneuploidies and driver gene mutations precede cancer diagnosis by several years to decades. Here, we address the question whether such genomic signals can be used for early detection and pre-emptive cancer treatment. To this end we study Barrett’s oesophagus, a genomic copy number driven neoplastic precursor lesion to oesophageal adenocarcinoma. We use shallow whole genome sequencing of 777 biopsies sampled from 88 patients in surveillance for Barrett’s oesophagus over a period of up to 15 years. These data show that genomic signals exist that distinguish progressive from stable disease with an AUC of 0.87 and a sensitivity of 50% even ten years prior to histopathological disease transformation. These findings are validated on two independent cohorts of 75 and 248 patients. Compared against current patient management guidelines genomic risk classification enables earlier treatment for high risk patients as well as reduction of unnecessary treatment and monitoring for patients who are unlikely to develop cancer.
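For readers less familiar with the AUC quoted above: it can be read as the probability that a randomly chosen progressor receives a higher risk score than a randomly chosen non-progressor. The sketch below computes it that way. The function name `auc` and the scores are invented for illustration and have nothing to do with the preprint's data or classifier.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical risk scores, for illustration only.
progressors = [0.9, 0.8, 0.4]
non_progressors = [0.7, 0.3, 0.2]
print(auc(progressors, non_progressors))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation. The pairwise loop is quadratic in the number of samples; rank-based formulas give the same value faster on large cohorts.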


8. GrandOmics (希望组, Beijing): PacBio HiFi vs. ONT ultralong reads for rice genome assembly

Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacbio Sequel II system and ultralong reads of Oxford Nanopore

The availability of reference genomes has revolutionized the study of biology. Multiple competing technologies have been developed to improve the quality and robustness of genome assemblies during the last decade. The two widely-used long read sequencing providers – Pacbio (PB) and Oxford Nanopore Technologies (ONT) – have recently updated their platforms: PB enable high throughput HiFi reads with base-level resolution with >99% and ONT generated reads as long as 2 Mb. We applied the two up-to-date platforms to one single rice individual, and then compared the two assemblies to investigate the advantages and limitations of each. The results showed that ONT ultralong reads delivered higher contiguity producing a total of 18 contigs of which 10 were assembled into a single chromosome compared to that of 394 contigs and three chromosome-level contigs for the PB assembly. The ONT ultralong reads also prevented assembly errors caused by long repetitive regions for which we observed a total 44 genes of false redundancies and 10 genes of false losses in the PB assembly leading to over/under-estimations of the gene families in those long repetitive regions. We also noted that the PB HiFi reads generated assemblies with considerably less errors at the level of single nucleotide and small InDels than that of the ONT assembly which generated an average 1.06 errors per Kb assembly and finally engendered 1,475 incorrect gene annotations via altered or truncated protein predictions.


9. Confirmatory results: bioinformatic analysis successfully detects a large number of known cancer mutation hotspots

Modeling and analysis of site-specific mutations in cancer identifies known plus putative novel hotspots and bias due to contextual sequences

In cancer, recurrently mutated sites in DNA and proteins, called hotspots, are thought to be raised by positive selection and therefore important due to its potential functional impact. Although recent evidence for APOBEC enzymatic activity have shown that specific types of sequences are likely to be false, the identification of putative hotspots is important to confirm either its functional role or its mechanistic bias. In this work, an algorithm and a statistical model is presented to detect hotspots. The model consists of a beta-binomial component plus fixed effects that efficiently fits the distribution of mutated sites. The algorithm employs an optimal step-wise approach to find the model parameters. Simulations show that the proposed algorithmic model is highly accurate for common hotspots. The approach has been applied to TCGA mutational data from 33 cancer types. The results show that well-known cancer hotspots are easily detected. Besides, novel hotspots are also detected. An analysis of the sequence context of detected hotspots show a preference for TCG sites that may be related to APOBEC or other unknown mechanistic biases. The detected hotspots are available online in http://bioinformatica.mty.itesm.mx/HotSpotsAnnotations.
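To give a feel for the beta-binomial component mentioned in the abstract, here is a minimal sketch of how such a distribution can score how surprising a recurrent mutation count is. It is not the authors' model: the fixed effects and the step-wise fitting are left out, and the parameter values and function names are assumptions chosen for the example.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) when X ~ BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def upper_tail(k, n, alpha, beta):
    """P(X >= k): chance of seeing at least k mutated samples at one site."""
    return sum(beta_binomial_pmf(i, n, alpha, beta) for i in range(k, n + 1))


# Hypothetical background: 500 samples, low and overdispersed mutation rate.
n_samples, alpha, beta = 500, 0.5, 200.0
print(upper_tail(20, n_samples, alpha, beta))
```

A site whose observed count has a very small upper-tail probability under the fitted background would be a hotspot candidate. Compared with a plain binomial, the beta-binomial lets the per-site mutation rate vary, which gives heavier tails and fewer spurious calls.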


10. Contradictory results: bioinformatic evidence that snakes are not the host of the novel coronavirus

Protein structure and sequence re-analysis of 2019-nCoV genome does not indicate snakes as its intermediate host or the unique similarity between its spike protein insertions and HIV-1

As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions shared a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data of coronaviruses, we concluded that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the “novel insertions” observed in the spike protein are naturally evolved from bat coronaviruses.


11. Bonus preprint: a study of how differing expectations between students and supervisors affect communication and research

Supervising the PhD: identifying common mismatches in expectations between candidate and supervisor to improve research training outcomes

The relationship between a PhD candidate and their supervisor is influential in not only successful candidate completion, but maintaining candidate satisfaction and mental health. We quantified potential mismatches between the PhD candidates and supervisors expectations as a potential mechanism that facilitates poor candidate experiences and research training outcomes. 114 PhD candidates and 52 supervisors ranked the importance of student attributes and outcomes at the beginning and end of candidature. In relation to specific attributes, supervisors indicated the level of guidance they expected to give the candidate and candidates indicated the level of guidance they expected to receive. Candidates also report on whether different aspects of candidature influenced their mental well-being. We identified differences between candidates and supervisors perceived supervisor teaching responsibility and influences on mental well-being. Our results indicate that the majority of candidates were satisfied overall with their supervision, and find alignment of many expectations between both parties. Yet, we find that candidates have much higher expectations of achieving quantitative outcomes than supervisors. Supervisors believed they give more guidance to candidates than candidates perceive they received, and supervisors often only provided guidance when the candidate explicitly asked. Personal expectations and research progress significantly and negatively influenced over 50% of candidate’s mental well-being. Our results highlight the importance of candidates and supervisors explicitly communicating the responsibilities and expectations of the roles they play in helping candidates develop research skills. We provide four suggestions to supervisors that may be particularly effective at increasing communication, avoiding potential conflict and promoting candidate success and wellbeing.


References

1. Ji, W., Wang, W., Zhao, X., Zai, J. & Li, X. J. Med. Virol. https://doi.org/10.1002/jmv.25682 (2020).

