2019-10-04-学习Bioconductor包simpleSingleCell（1）

A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor

准备工作安装包并获得帮助文档

To install this package, start R (version "3.6") and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("simpleSingleCell")

For older versions of R, please refer to the appropriate Bioconductor release.
Documentation
To view documentation for the version of this package installed in your system, start R and enter:

browseVignettes("simpleSingleCell")

得到帮助文档页面http://127.0.0.1:27783/session/Rvig.5cd5b9ac694.html
内容很丰富

Vignettes found by `browseVignettes("simpleSingleCell")`

Vignettes in package `simpleSingleCell`

01. Introduction - HTML source R code
02. Read count data - HTML source R code
03. UMI count data - HTML source R code
04. Droplet-based data - HTML source R code
05. Correcting batch effects - HTML source R code
06. Quality control details - HTML source R code
07. Spike-in normalization - HTML source R code
08. Detecting doublets - HTML source R code
09. Advanced variance modelling - HTML source R code
10. Detecting differential expression - HTML source R code
11. Scalability for big data - HTML source R code
12. Further analysis strategies - HTML source R code
那就按照流程一步一步走吧

01. Introduction

http://127.0.0.1:27783/library/simpleSingleCell/doc/intro.html

Users favouring an R-based approach to read alignment and counting might consider using the methods in the Rsubread package (Liao, Smyth, and Shi 2013, 2014).
Many other scRNA-seq protocols contain bespoke sequence structures that require careful processing:

Unique molecular identifiers (UMIs) (Islam et al. 2014) are widely used to mitigate the effects of amplification biases. Reads with the same UMI mapping to the same gene represent a single underlying transcript molecule and only increment the count of that gene by one. Processing of this data requires extraction of the UMI sequence from each read or read pair, and a method to collapse UMI-based duplicates into a single count (Smith, Heger, and Sudbery 2017).
Protocols may also use custom cell barcodes to improve multiplexing efficiency beyond that offered by the standard sequencing barcodes (e.g., from Illumina). This includes data generated from droplet-based experiments (Zheng et al. 2017) or from very high-throughput plate-based protocols like MARS-seq (Jaitin et al. 2014). Processing of this data usually requires a separate step to extract the barcode sequence from each read and to allocate the read to the correct per-cell sequencing library.
For these data sets, the scPipe package (Tian et al. 2018) provides an R-based processing pipeline for obtaining a count matrix.

从前言中我发现需要先学习2个包，貌似一个是比对用的，一个是有内置数据集的教程。先学习一下吧

Rsubread package

Subread is a general-purpose read aligner, which can be used to map both genomic DNA-seq and RNA-seq read data.
its running time has only very modest increase with the increase of read length.(所以更适合用来比对长读长的的序列)
Note that Subread does not perform global alignments for the exon-spanning reads and it soft clips those read bases which could not be mapped.
However, the Subread mapping result is sufficient for carrying out the gene-level expression analysis using RNA-seq data, because the mapped read bases can be reliably used to assign reads, including both exonic reads and exon-spanning reads, to genes.
To get the full alignments for exon-spanning RNA-seq reads, the Subjunc aligner can be used.（Subread的缺点由Subjunc来补足）
Accuracy of junction detection generally improves when external gene annotation data is provided. The annotation data should include chromosomal coordinates of known exons of each gene.
Note that although Subread aligner does not report exon-exon junctions, providing this annotation is useful for it to map junction reads more accurately.
Subread and Subjunc can be used detect SV events including long indel, duplication, inversion and translocation, in RNA-seq and genomic DNA-seq data.
For the reads that were found to contain SV breakpoints, extra tags will be added for them in mapping output. These tags include CC(chromosome name), CP(mapping position), CG(CIGAR string) and CT(strand)
Subread and Subjunc are the first to employ a two-scan mapping strategy to achieve a superior mapping accuracy. This strategy was later seen in other aligners as well (called ‘two-pass’).

In the first scan, the aligners use seed-and-vote method to identify candidate mapping locations for each read and also discover short indels, exon-exon junctions and structural variants.
In the second scan, they carry out final alignment for each read using the variant and junction information. Variant and junction data (including chromosomal coordinates and number of supporting reads) will be output along with the read mapping results.
Note that for both RNA-seq and genomic DNA-seq data, any alignment reported for a multi-mapping read must not have more than threshold number of mis-matched bases (as specified in ‘-M’ parameter).(看完这个介绍我感觉如果我学会了这个东西应该能自己比对那堆CNV测序的结果了，不用整那个不好用的cyto5.0了)

果然，再往下看我看到了什么！！！

Chapter 4
Mapping reads generated by genomic DNA sequencing technologies

简直找到了宝藏--我要先完成课题任务，这个包要好好学习一下先。

Size of one-block full index built for the human reference genome (GRCh38) is 17.8 GB. The subread-buildindex function needs 15 GB of memory to build this index. Size of a gapped index built for GRCh38 is less than 9 GB and subread-buildindex needs 5.7 GB of memory to build it.

一口老血喷出来，这是什么鬼，实力劝退吗？
还好Options are available to generate index of any size. In Rsubread, a one-block full index is built by default.
参考基因组：
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz

image.png

Read mapping

The Subread aligner (subread-align program in SourceForge Subread package or align func- tion in Bioconductor Rsubread package) extracts a number of subreads from each read and then uses these subreads to vote for the mapping location of the read. It uses the the “seed- and-vote” paradigm for read mapping and reports the largest mappable region for each read. Table 2 describes the arguments used by Subread aligner (also Subjunc and Sublong aligners). Arguments used in Bioconductor Rsubread package are included in parenthesis.
subread-buildindex (buildindex function in Rsubread) needs 15GB of memory to build a full index for human/mouse genome. With this index, subread-align (align in Rsubread) require 17.8GB of memory for read mapping. This enables fastest mapping speed, but it is recommended that the full index should be on a unix server due to relatively large memory use. Mapping rate is ∼14 million reads per minute (10 CPU threads) when full index is used.（呵呵...）
A gapped index is recommended for use on a personal computer, which typically has 16GB of memory or less. subread-buildindex (buildindex function in Rsubread) only needs 5.7GB of memory to build a gapped index for human/mouse genome. subread-align (align in Rsubread) needs 8.2GB of memory for mapping with the gapped index.(似乎还是可以一试)
It takes subread-buildindex (buildindex function in Rsubread) about 40 minutes to build a full index for human/mouse genome, and building a gapped index takes about 15 minutes. （嗯，今天太晚了，明天试试吧）
Memory use for index building and read mapping can be further reduced by building a split index using the -B and -M options in subread-buildindex (indexSplit and memory options in buildindex function in Rsubread).
Mapping of long reads（又有新技术）
We developed a new long-read aligner, called Sublong, for the mapping of long reads that were generated by long-read sequencing technologies such as Nanopore and PacBio sequencers. Sublong is also based on the seed-and-vote mapping strategy. Parameters of Sublong program can be found in Table 2.

然后是
Chapter 5
Mapping reads generated by RNA sequencing technologies

流程是跟DNA的差不多，有些特殊比如关于mircroRNA的比对等等，用到的时候在仔细琢磨。

这个比较重要
Chapter 6
Read summarization
这一章就是教学：如何阅读比对结果

Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.
Care must be taken to ensure that such reads are not over-counted or under-counted. Here we describe the featureCounts program, an efficient and accurate read quantifier. featureCounts has the following features:

• It carries out precise and accurate read assignments by taking care of indels, junctions and structural variants in the reads.
• It takes only half a minute to summarize 20 million reads.
• It supports GTF and SAF format annotation.
• It supports strand-specific read counting.
• It can count reads at feature (eg. exon) or meta-feature (eg. gene) level.
• Highly flexible in counting multi-mapping and multi-overlapping reads. Such reads can be excluded, fully counted or fractionally counted.
• It gives users full control on the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the fragment length falls within the specified range.
• Reduce ambiguity in assigning read pairs by searching features that overlap with both reads from the pair.
• It allows users to specify whether chimeric fragments should be counted.
• Automatically detect input format (SAM or BAM).
• Automatically sort paired-end reads. Users can provide either location-sorted or name- sorted bams files to featureCounts. Read sorting is implemented on the fly and it only incurs minimal time cost.

featureCounts supports strand-specific read counting if strand-specific information is pro- vided. Read mapping results usually include mapping quality scores for mapped reads. Users can optionally specify a minimum mapping quality score that the assigned reads must satisfy.
(看到这里才发现，我那堆数据原来已经走完了比对的流程了，有了QC报告之类的东东了，那应该是可以直接进入这一步了。)
The genomic features can be specified in either GTF/GFF or SAF format. A definition of the GTF format can be found at UCSC website (http://genome.ucsc.edu/FAQ/FAQformat.html#format4). The SAF format includes five required columns for each feature: feature identifier, chromosome name, start position, end position and strand.
A SAF-format annotation file should be a tab-delimited text file. It should also include a header line. An example of a SAF annotation is shown as below:

GeneID Chr Start End Strand
497097 chr1 3204563 3207049 -
497097 chr1 3411783 3411982 -
497097 chr1 3660633 3661579 -
100503874 chr1 3637390 3640590 -
100503874 chr1 3648928 3648985 -
100038431 chr1 3670236 3671869 -

GeneID column includes gene identifiers that can be numbers or character strings. Chromosomal names included in the Chr column must match the chromosomal names of reference sequences to which the reads were aligned.
In Rsubread, users can access these annotations via the getInBuiltAnnotation function. In Subread, these annotations are stored in directory ‘annotation’ under home directory of the package.
Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. featureCounts can summarize reads at either the feature or meta-feature levels.
We recommend to use unique gene identifiers, such as NCBI Entrez gene identifiers, to cluster features into meta-features. Gene names are not recommended to use for this purpose because different genes may have the same names.
Unique gene identifiers were often included in many publicly available GTF annotations which can be readily used for summarization. The Bioconductor Rsubread package also includes NCBI RefSeq annotations for human and mice. Entrez gene identifiers are used in these annotations.
featureCounts preforms precise read assignment by comparing mapping location of every base in the read with the genomic region spanned by each feature. It takes account of any gaps (insertions, deletions, exon-exon junctions or structural variants) that are found in the read. It calls a hit if any overlap is found between read and feature.
Users may use ‘–minOverlap (minOverlap in R)’and ‘–fracOverlap (fracOverlap in R)’ options to specify the minimum number of overlapping bases and minimum fraction of over- lapping bases requried for assigning a read to a feature, respectively. The ‘–fracOverlap’ option might be particularly useful for counting reads with variable lengths.
看到下面这一步发现，还是需要走完全部的流程才行，不然数据格式等等不匹配是走不下来的
Read filtering

A quick start for featureCounts in SourceForge Subread

You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization. The example commands below assume your annotation file is in GTF format.

# Summarize SAM format single-end reads using 5 threads:
featureCounts -T 5 -a annotation.gtf -t exon -g gene id -o counts.txt mapping results SE.sam
# Summarize BAM format single-end read data:
featureCounts -a annotation.gtf -t exon -g gene id -o counts.txt mapping results SE.bam
# Summarize multiple libraries at the same time:
featureCounts -a annotation.gtf -t exon -g gene id
-o counts.txt mapping results1.bam mapping results2.bam
# Summarize paired-end reads and count fragments (instead of reads):
featureCounts -p -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
# Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
featureCounts -p -P -d 50 -D 600 -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
# Count fragments which have both ends successfully aligned without considering the fragment length constraint:
featureCounts -p -B -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
# Exclude chimeric fragments from the fragment counting:
featureCounts -p -C -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam

A quick start for featureCounts in Bioconductor Rsubread

You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization. The example commands below assume your annotation file is in GTF format.

#Load Rsubread library from you R session: 
library(Rsubread)
# Summarize single-end reads using built-in RefSeq annotation for mouse genome ‘mm10’ (‘mm10’ is the default inbuilt genome annotation):
featureCounts(files="mapping_results_SE.sam")
# Summarize single-end reads using a user-provided GTF annotation file:
featureCounts(files="mapping_results_SE.sam",annot.ext="annotation.gtf",
isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
# Summarize single-end reads using 5 threads:
featureCounts(files="mapping_results_SE.sam",nthreads=5)
# Summarize BAM format single-end read data:
featureCounts(files="mapping_results_SE.bam")
# Summarize multiple libraries at the same time:
featureCounts(files=c("mapping_results1.bam","mapping_results2.bam"))
# Summarize paired-end reads and counting fragments (instead of reads):
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE)
# Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,checkFragLength=TRUE,
minFragLength=50,maxFragLength=600)
# Count fragments which have both ends successfully aligned without considering the fragment length constraint:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,requireBothEndsMapped=TRUE)
# Exclude chimeric fragments from fragment counting:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,countChimericFragments=FALSE)

Chapter 7 SNP calling
Chapter 8 Utility programs
Chapter 9 Case studies

9.1 A Bioconductor R pipeline for analyzing RNA-seq data

Note that these libraries only included reads originating from human chromosome 1 (according to Subread aligner). Reads were generated by the MAQC/SEQC Consortium. Data used in this case study can be downloaded from http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz (283MB). Both read data and reference sequence for chromosome 1 of human genome (GRCh37) were included in the data.
After downloading the data, you can uncompress it and save it to your current working directory. Launch R and load Rsubread, limma and edgeR libraries by issuing the following commands at your R prompt. Version of your R should be 3.0.2 or later. Rsubread version should be 1.12.1 or later and limma version should be 3.18.0 or later.
Note that this case study only runs on Linux/Unix and Mac OS X.

library(Rsubread)
library(limma)
library(edgeR)
# Index building. Build an index for human chromosome 1. This typically takes ∼3 minutes. Index files with basename ‘chr1’ will be generated in your current working directory.
buildindex(basename="chr1",reference="hg19_chr1.fa")
# Alignment. Perform read alignment for all four libraries and report uniquely mapped reads only. This typically takes ∼5 minutes. BAM files containing the mapping results will be generated in your current working directory.
targets <- readTargets()
align(index="chr1",readfile1=targets$InputFile,output_file=targets$OutputFile)
#Read summarization. Summarize mapped reads to NCBI RefSeq genes. This will only take a few seconds. Note that the featureCounts function contains built-in RefSeq annotations for human and mouse genes. featureCounts returns an R ‘List’ object, which includes raw read count for each gene in each library and also annotation information such as gene identifiers and gene lengths.
fc <- featureCounts(files=targets$OutputFile,annot.inbuilt="hg19")
fc$counts[1:5,]
fc$annotation[1:5,c("GeneID","Length")]
# Create a DGEList object.
x <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
# Filtering. Only keep in the analysis those genes which had >10 reads per million mapped reads in at least two libraries.
isexpr <- rowSums(cpm(x) > 10) >= 2
x <- x[isexpr,]
# Design matrix. Create a design matrix:
celltype <- factor(targets$CellType)
design <- model.matrix(~0+celltype)
colnames(design) <- levels(celltype)
# Normalization. Perform voom normalization:
y <- voom(x,design,plot=TRUE)
# Sample clustering. Multi-dimensional scaling (MDS) plot shows that sample A libraries are clearly separated from sample B libraries.
plotMDS(y,xlim=c(-2.5,2.5))
# Linear model fitting and differential expression analysis. Fit linear models to genes and assess differential expression using eBayes moderated t statistic. Here we compare sample B vs sample A.
fit <- lmFit(y,design)
contr <- makeContrasts(BvsA=B-A,levels=design)
fit.contr <- eBayes(contrasts.fit(fit,contr))
dt <- decideTests(fit.contr)
summary(dt)
# List top 10 differentially expressed genes:
options(digits=2)
topTable(fit.contr)

上下游串起来了，以后还需要好好练习，这个流程主要还是针对RNA-seq的，不过关于上游的分析应该都是大同小异的

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 216,744评论 6赞 502
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 92,505评论 3赞 392
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 163,105评论 0赞 353
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 58,242评论 1赞 292
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 67,269评论 6赞 389
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 51,215评论 1赞 299
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 40,096评论 3赞 418
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 38,939评论 0赞 274
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 45,354评论 1赞 311
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 37,573评论 2赞 333
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 39,745评论 1赞 348
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 35,448评论 5赞 344
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 41,048评论 3赞 327
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,683评论 0赞 22
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 32,838评论 1赞 269
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 47,776评论 2赞 369
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 44,652评论 2赞 354