从简书作者刘小泽处看到跟着Bioconductor一步一步学习scRNA(一)
A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor
准备工作 安装包并获得帮助文档
- To install this package, start R (version "3.6") and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("simpleSingleCell")
- For older versions of R, please refer to the appropriate Bioconductor release.
Documentation - To view documentation for the version of this package installed in your system, start R and enter:
browseVignettes("simpleSingleCell")
得到帮助文档页面http://127.0.0.1:27783/session/Rvig.5cd5b9ac694.html
内容很丰富
Vignettes found by browseVignettes("simpleSingleCell")
Vignettes in package simpleSingleCell
- 01. Introduction - HTML source R code
- 02. Read count data - HTML source R code
- 03. UMI count data - HTML source R code
- 04. Droplet-based data - HTML source R code
- 05. Correcting batch effects - HTML source R code
- 06. Quality control details - HTML source R code
- 07. Spike-in normalization - HTML source R code
- 08. Detecting doublets - HTML source R code
- 09. Advanced variance modelling - HTML source R code
- 10. Detecting differential expression - HTML source R code
- 11. Scalability for big data - HTML source R code
- 12. Further analysis strategies - HTML source R code
那就按照流程一步一步走吧
01. Introduction
http://127.0.0.1:27783/library/simpleSingleCell/doc/intro.html
Users favouring an R-based approach to read alignment and counting might consider using the methods in the Rsubread package (Liao, Smyth, and Shi 2013, 2014).
-
Many other scRNA-seq protocols contain bespoke sequence structures that require careful processing:
Unique molecular identifiers (UMIs) (Islam et al. 2014) are widely used to mitigate the effects of amplification biases. Reads with the same UMI mapping to the same gene represent a single underlying transcript molecule and only increment the count of that gene by one. Processing of this data requires extraction of the UMI sequence from each read or read pair, and a method to collapse UMI-based duplicates into a single count (Smith, Heger, and Sudbery 2017).
Protocols may also use custom cell barcodes to improve multiplexing efficiency beyond that offered by the standard sequencing barcodes (e.g., from Illumina). This includes data generated from droplet-based experiments (Zheng et al. 2017) or from very high-throughput plate-based protocols like MARS-seq (Jaitin et al. 2014). Processing of this data usually requires a separate step to extract the barcode sequence from each read and to allocate the read to the correct per-cell sequencing library. For these data sets, the scPipe package (Tian et al. 2018) provides an R-based processing pipeline for obtaining a count matrix.
从前言中我发现需要先学习2个包,貌似一个是比对用的,一个是有内置数据集的教程。先学习一下吧
Rsubread package
Subread
is a general-purpose read aligner, which can be used to map both genomic DNA-seq and RNA-seq read data.its running time has only very modest increase with the increase of read length.(所以更适合用来比对长读长的的序列)
Note that
Subread
does not perform global alignments for the exon-spanning reads and it soft clips those read bases which could not be mapped.However, the
Subread
mapping result is sufficient for carrying out the gene-level expression analysis using RNA-seq data, because the mapped read bases can be reliably used to assign reads, including both exonic reads and exon-spanning reads, to genes.To get the full alignments for exon-spanning RNA-seq reads, the
Subjunc
aligner can be used.(Subread
的缺点由Subjunc
来补足)Accuracy of junction detection generally improves when
external gene annotation data
is provided.The annotation data should include chromosomal coordinates of known exons of each gene.
Note that although
Subread
aligner does not report exon-exon junctions, providing this annotation is useful for it to map junction reads more accurately.Subread
andSubjunc
can be used detectSV
events includinglong indel, duplication, inversion and translocation
, in RNA-seq and genomic DNA-seq data.For the reads that were found to contain SV breakpoints, extra tags will be added for them in mapping output. These tags include
CC(chromosome name), CP(mapping position), CG(CIGAR string) and CT(strand)
-
Subread and Subjunc are the first to employ a two-scan mapping strategy to achieve a superior mapping accuracy. This strategy was later seen in other aligners as well (called ‘two-pass’).
In the first scan, the aligners use seed-and-vote method to identify candidate mapping locations for each read and also discover short indels, exon-exon junctions and structural variants.
In the second scan, they carry out final alignment for each read using the variant and junction information. Variant and junction data (including chromosomal coordinates and number of supporting reads) will be output along with the read mapping results. Note that for both RNA-seq and genomic DNA-seq data, any alignment reported for a multi-mapping read must not have more than threshold number of mis-matched bases (as specified in ‘-M’ parameter).(看完这个介绍我感觉如果我学会了这个东西应该能自己比对那堆CNV测序的结果了,不用整那个不好用的cyto5.0了)
果然,再往下看我看到了什么!!!
Chapter 4
Mapping reads generated by genomic DNA sequencing technologies
简直找到了宝藏--我要先完成课题任务,这个包要好好学习一下先。
Size of one-block full index built for the human reference genome (GRCh38) is 17.8 GB. The subread-buildindex function needs 15 GB of memory to build this index. Size of a gapped index built for GRCh38 is less than 9 GB and subread-buildindex needs 5.7 GB of memory to build it.
一口老血喷出来,这是什么鬼,实力劝退吗?
还好Options are available to generate index of any size. In Rsubread, a one-block full index is built by default.
参考基因组:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz
Read mapping
The Subread aligner (subread-align program in SourceForge Subread package or align func- tion in Bioconductor Rsubread package) extracts a number of subreads from each read and then uses these subreads to vote for the mapping location of the read. It uses the the “seed- and-vote” paradigm for read mapping and reports the largest mappable region for each read. Table 2 describes the arguments used by Subread aligner (also Subjunc and Sublong aligners). Arguments used in Bioconductor Rsubread package are included in parenthesis.
subread-buildindex (buildindex function in Rsubread) needs 15GB of memory to build a full index for human/mouse genome. With this index, subread-align (align in Rsubread) require 17.8GB of memory for read mapping. This enables fastest mapping speed, but it is recommended that the full index should be on a unix server due to relatively large memory use. Mapping rate is ∼14 million reads per minute (10 CPU threads) when full index is used.(呵呵...)
A gapped index is recommended for use on a personal computer, which typically has 16GB of memory or less. subread-buildindex (buildindex function in Rsubread) only needs 5.7GB of memory to build a gapped index for human/mouse genome. subread-align (align in Rsubread) needs 8.2GB of memory for mapping with the gapped index.(似乎还是可以一试)
It takes subread-buildindex (buildindex function in Rsubread) about 40 minutes to build a full index for human/mouse genome, and building a gapped index takes about 15 minutes. (嗯,今天太晚了,明天试试吧)
Memory use for index building and read mapping can be further reduced by building a split index using the -B and -M options in subread-buildindex (indexSplit and memory options in buildindex function in Rsubread).
Mapping of long reads(又有新技术)
We developed a new long-read aligner, called
Sublong
, for the mapping of long reads that were generated by long-read sequencing technologies such as Nanopore and PacBio sequencers. Sublong is also based on the seed-and-vote mapping strategy. Parameters of Sublong program can be found in Table 2.
然后是
Chapter 5
Mapping reads generated by RNA sequencing technologies
流程是跟DNA的差不多,有些特殊比如关于mircroRNA的比对等等,用到的时候在仔细琢磨。
这个比较重要
Chapter 6
Read summarization
这一章就是教学:如何阅读比对结果
- Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.
- Care must be taken to ensure that such reads are not over-counted or under-counted. Here we describe the featureCounts program, an efficient and accurate read quantifier. featureCounts has the following features:
• It carries out precise and accurate read assignments by taking care of indels, junctions and structural variants in the reads.
• It takes only half a minute to summarize 20 million reads.
• It supports GTF and SAF format annotation.
• It supports strand-specific read counting.
• It can count reads at feature (eg. exon) or meta-feature (eg. gene) level.
• Highly flexible in counting multi-mapping and multi-overlapping reads. Such reads can be excluded, fully counted or fractionally counted.
• It gives users full control on the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the fragment length falls within the specified range.
• Reduce ambiguity in assigning read pairs by searching features that overlap with both reads from the pair.
• It allows users to specify whether chimeric fragments should be counted.
• Automatically detect input format (SAM or BAM).
• Automatically sort paired-end reads. Users can provide either location-sorted or name- sorted bams files to featureCounts. Read sorting is implemented on the fly and it only incurs minimal time cost.
- featureCounts supports strand-specific read counting if strand-specific information is pro- vided. Read mapping results usually include mapping quality scores for mapped reads. Users can optionally specify a minimum mapping quality score that the assigned reads must satisfy.
(看到这里才发现,我那堆数据原来已经走完了比对的流程了,有了QC报告之类的东东了,那应该是可以直接进入这一步了。) - The genomic features can be specified in either GTF/GFF or SAF format. A definition of the GTF format can be found at UCSC website (http://genome.ucsc.edu/FAQ/FAQformat.html#format4). The SAF format includes five required columns for each feature: feature identifier, chromosome name, start position, end position and strand.
- A SAF-format annotation file should be a tab-delimited text file. It should also include a header line. An example of a SAF annotation is shown as below:
GeneID Chr Start End Strand
497097 chr1 3204563 3207049 -
497097 chr1 3411783 3411982 -
497097 chr1 3660633 3661579 -
100503874 chr1 3637390 3640590 -
100503874 chr1 3648928 3648985 -
100038431 chr1 3670236 3671869 -
- GeneID column includes gene identifiers that can be numbers or character strings. Chromosomal names included in the Chr column must match the chromosomal names of reference sequences to which the reads were aligned.
- In
Rsubread
, users can access these annotations via thegetInBuiltAnnotation
function. InSubread
, these annotations are stored in directory‘annotation’
under home directory of the package. - Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. featureCounts can summarize reads at either the feature or meta-feature levels.
- We recommend to use unique gene identifiers, such as NCBI Entrez gene identifiers, to cluster features into meta-features. Gene names are not recommended to use for this purpose because different genes may have the same names.
- Unique gene identifiers were often included in many publicly available GTF annotations which can be readily used for summarization. The Bioconductor Rsubread package also includes NCBI RefSeq annotations for human and mice. Entrez gene identifiers are used in these annotations.
-
featureCounts
preforms precise read assignment by comparing mapping location of every base in the read with the genomic region spanned by each feature. It takes account of any gaps(insertions, deletions, exon-exon junctions or structural variants)
that are found in the read. It calls a hit if any overlap is found between read and feature. - Users may use
‘–minOverlap (minOverlap in R)’
and‘–fracOverlap (fracOverlap in R)’
options to specify the minimum number of overlapping bases and minimum fraction of over- lapping bases requried for assigning a read to a feature, respectively. The ‘–fracOverlap’ option might be particularly useful for counting reads with variable lengths.
看到下面这一步发现,还是需要走完全部的流程才行,不然数据格式等等不匹配是走不下来的
Read filtering
A quick start for featureCounts in SourceForge Subread
- You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization. The example commands below assume your annotation file is in GTF format.
# Summarize SAM format single-end reads using 5 threads:
featureCounts -T 5 -a annotation.gtf -t exon -g gene id -o counts.txt mapping results SE.sam
# Summarize BAM format single-end read data:
featureCounts -a annotation.gtf -t exon -g gene id -o counts.txt mapping results SE.bam
# Summarize multiple libraries at the same time:
featureCounts -a annotation.gtf -t exon -g gene id
-o counts.txt mapping results1.bam mapping results2.bam
# Summarize paired-end reads and count fragments (instead of reads):
featureCounts -p -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
# Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
featureCounts -p -P -d 50 -D 600 -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
# Count fragments which have both ends successfully aligned without considering the fragment length constraint:
featureCounts -p -B -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
# Exclude chimeric fragments from the fragment counting:
featureCounts -p -C -a annotation.gtf -t exon -g gene id -o counts.txt mapping results PE.bam
A quick start for featureCounts in Bioconductor Rsubread
- You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization. The example commands below assume your annotation file is in GTF format.
#Load Rsubread library from you R session:
library(Rsubread)
# Summarize single-end reads using built-in RefSeq annotation for mouse genome ‘mm10’ (‘mm10’ is the default inbuilt genome annotation):
featureCounts(files="mapping_results_SE.sam")
# Summarize single-end reads using a user-provided GTF annotation file:
featureCounts(files="mapping_results_SE.sam",annot.ext="annotation.gtf",
isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
# Summarize single-end reads using 5 threads:
featureCounts(files="mapping_results_SE.sam",nthreads=5)
# Summarize BAM format single-end read data:
featureCounts(files="mapping_results_SE.bam")
# Summarize multiple libraries at the same time:
featureCounts(files=c("mapping_results1.bam","mapping_results2.bam"))
# Summarize paired-end reads and counting fragments (instead of reads):
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE)
# Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,checkFragLength=TRUE,
minFragLength=50,maxFragLength=600)
# Count fragments which have both ends successfully aligned without considering the fragment length constraint:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,requireBothEndsMapped=TRUE)
# Exclude chimeric fragments from fragment counting:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,countChimericFragments=FALSE)
Chapter 7 SNP calling
Chapter 8 Utility programs
Chapter 9 Case studies
9.1 A Bioconductor R pipeline for analyzing RNA-seq data
- Note that these libraries only included reads originating from human chromosome 1 (according to Subread aligner). Reads were generated by the MAQC/SEQC Consortium. Data used in this case study can be downloaded from http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz (283MB). Both read data and reference sequence for chromosome 1 of human genome (GRCh37) were included in the data.
- After downloading the data, you can uncompress it and save it to your current working directory. Launch R and load
Rsubread, limma and edgeR
libraries by issuing the following commands at your R prompt. Version of your R should be 3.0.2 or later.Rsubread
version should be 1.12.1 or later andlimma
version should be 3.18.0 or later. - Note that this case study only runs on Linux/Unix and Mac OS X.
library(Rsubread)
library(limma)
library(edgeR)
# Index building. Build an index for human chromosome 1. This typically takes ∼3 minutes. Index files with basename ‘chr1’ will be generated in your current working directory.
buildindex(basename="chr1",reference="hg19_chr1.fa")
# Alignment. Perform read alignment for all four libraries and report uniquely mapped reads only. This typically takes ∼5 minutes. BAM files containing the mapping results will be generated in your current working directory.
targets <- readTargets()
align(index="chr1",readfile1=targets$InputFile,output_file=targets$OutputFile)
#Read summarization. Summarize mapped reads to NCBI RefSeq genes. This will only take a few seconds. Note that the featureCounts function contains built-in RefSeq annotations for human and mouse genes. featureCounts returns an R ‘List’ object, which includes raw read count for each gene in each library and also annotation information such as gene identifiers and gene lengths.
fc <- featureCounts(files=targets$OutputFile,annot.inbuilt="hg19")
fc$counts[1:5,]
fc$annotation[1:5,c("GeneID","Length")]
# Create a DGEList object.
x <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
# Filtering. Only keep in the analysis those genes which had >10 reads per million mapped reads in at least two libraries.
isexpr <- rowSums(cpm(x) > 10) >= 2
x <- x[isexpr,]
# Design matrix. Create a design matrix:
celltype <- factor(targets$CellType)
design <- model.matrix(~0+celltype)
colnames(design) <- levels(celltype)
# Normalization. Perform voom normalization:
y <- voom(x,design,plot=TRUE)
# Sample clustering. Multi-dimensional scaling (MDS) plot shows that sample A libraries are clearly separated from sample B libraries.
plotMDS(y,xlim=c(-2.5,2.5))
# Linear model fitting and differential expression analysis. Fit linear models to genes and assess differential expression using eBayes moderated t statistic. Here we compare sample B vs sample A.
fit <- lmFit(y,design)
contr <- makeContrasts(BvsA=B-A,levels=design)
fit.contr <- eBayes(contrasts.fit(fit,contr))
dt <- decideTests(fit.contr)
summary(dt)
# List top 10 differentially expressed genes:
options(digits=2)
topTable(fit.contr)
上下游串起来了,以后还需要好好练习,这个流程主要还是针对RNA-seq的,不过关于上游的分析应该都是大同小异的