RNA注释
这里以inDrop 实验数据举例,spliced/unspliced的RNA可以通过:
1.使用dropEst 输出管道生成 类似10X的 bam 文件:
~/dropEst/build/dropest -m -F -L eiEIBA -o run1 -g cellranger/refdata-cellranger-mm10-1.2.0/genes/genes.gtf -c ~/dropEst/configs/indrop_v3.xml *.bam
2.使用velocyto.py来注释spliced/unspliced的reads,写出标准loom文件:
velocyto run -u Gene -o out -e SCG71 -m mm10_rmsk_srt.gtf -v SCG_71_tophat.filtered.sorted.bam UCSC/mm10/Annotation/Genes/genes.gtf
(注意,也可以使用-V参数直接通过 DropEst 注释spliced/unspliced的reads:)
~/dropEst/dropest -V -C 6000 -m -g ucsc_mm10_exons.gtf.gz -c ~/dropEst/configs/indrop_v3.xml *.aligned.bam)
下面的例子从 velocyto.py 生成的loom文件开始,使用 pagoda2 获取细胞clusters/embedding,然后估计/可视化RNA速率。
加载R包:
library(velocyto.R)
Loading required package: Matrix
数据下载
下载预先计算的loom矩阵
wget http://pklab.med.harvard.edu/velocyto/mouseBM/SCG71.loom
读取矩阵
ldat <-read.loom.matrices(url("http://pklab.med.harvard.edu/velocyto/mouseBM/SCG71.loom"))
#Error: is.character(name) is not TRUE
使用pagoda2标准化和聚类细胞
使用spliced 表达矩阵作为pagoda2的输入,并查看分布。
emat <- ldat$spliced
hist(log10(colSums(emat)),col='wheat',xlab='cell size')
# this dataset has already been pre-filtered, but this is where one woudl do some filtering
emat <- emat[,colSums(emat)>=1e3]
pagoda2处理
pagoda2用于生成细胞嵌入、细胞聚类以及更精确的细胞距离矩阵。您也可以使用其他工具生成这些工具,如 Seurat等。
创建pagoda2对象,调整方差:
library(pagoda2)
r <- Pagoda2$new(emat,modelType='plain',trim=10,log.scale=T)
#2600 cells, 7301 genes; normalizing ... using plain model winsorizing ... log scale ... done.
r$adjustVariance(plot=T,do.par=T,gam.k=10)
#calculating variance fit ... using gam 137 overdispersed genes ... 137 persisting ... done.
运行基本分析步骤以生成细胞嵌入和聚类,并可视化:
r$calculatePcaReduction(nPcs=100,n.odgenes=3e3,maxit=300)
running PCA using 3000 OD genes ..
Loading required package: irlba
.. done
r$makeKnnGraph(k=30,type='PCA',center=T,distance='cosine');
Loading required package: igraph
Attaching package: ‘igraph’
The following objects are masked from ‘package:stats’:decompose, spectrum
The following object is masked from ‘package:base’:union
creating space of type angular done
adding data ... done
building index ... done
querying ... done
r$getKnnClusters(method=multilevel.community,type='PCA',name='multilevel')r$getEmbedding(type='PCA',embeddingType='tSNE',perplexity=50,verbose=T)
calculating distance ... pearson ...running tSNE using 16 cores:
Read the 2600 x 2600 data matrix successfully!
OpenMP is working...
Using no_dims = 2, perplexity = 50.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 9.36 seconds (sparsity = 0.084143)!
Learning embedding...
Iteration 50: error is 73.294295 (50 iterations in 3.70 seconds)
Iteration 100: error is 65.648297 (50 iterations in 3.19 seconds)
Iteration 150: error is 65.151406 (50 iterations in 3.17 seconds)
Iteration 200: error is 65.090414 (50 iterations in 3.24 seconds)
Iteration 250: error is 65.075571 (50 iterations in 3.27 seconds)
Iteration 300: error is 1.814009 (50 iterations in 3.11 seconds)
Iteration 350: error is 1.699950 (50 iterations in 3.08 seconds)
Iteration 400: error is 1.660607 (50 iterations in 3.01 seconds)
Iteration 450: error is 1.644676 (50 iterations in 2.98 seconds)
Iteration 500: error is 1.638744 (50 iterations in 2.96 seconds)
Iteration 550: error is 1.633982 (50 iterations in 3.00 seconds)
Iteration 600: error is 1.629732 (50 iterations in 2.99 seconds)
Iteration 650: error is 1.628101 (50 iterations in 3.12 seconds)
Iteration 700: error is 1.625991 (50 iterations in 3.27 seconds)
Iteration 750: error is 1.624012 (50 iterations in 3.15 seconds)
Iteration 800: error is 1.622847 (50 iterations in 3.27 seconds)
Iteration 850: error is 1.621847 (50 iterations in 3.31 seconds)
Iteration 900: error is 1.620831 (50 iterations in 3.34 seconds)
Iteration 950: error is 1.619936 (50 iterations in 3.21 seconds)
Iteration 1000: error is 1.618511 (50 iterations in 3.16 seconds)
Fitting performed in 63.52 seconds.
绘制嵌入、集群标记(左)和"Xist"表达图(将男性和女性分开展示)
par(mfrow=c(1,2))r$plotEmbedding(type='PCA',embeddingType='tSNE',show.legend=F,mark.clusters=T,min.group.size=10,shuffle.colors=F,mark.cluster.cex=1,alpha=0.3,main='cell clusters')r$plotEmbedding(type='PCA',embeddingType='tSNE',colors=r$depth,main='depth')
treating colors as a gradient with zlim: 1000.9 2939
速率估计
准备矩阵和聚类数据:
emat <- ldat$spliced; nmat <- ldat$unspliced; # disregarding spanning reads, as there are too few of thememat <- emat[,rownames(r$counts)]; nmat <- nmat[,rownames(r$counts)]; # restrict to cells that passed p2 filter# take cluster labelscluster.label <- r$clusters$PCA$multilevel # take the cluster factor that was calculated by p2cell.colors <- pagoda2:::fac2col(cluster.label)# take embedding form p2emb <- r$embeddings$PCA$tSNE
除了聚类和 t-SNE 嵌入,从 p2 处理,我们将采取细胞距离,优于默认的R 通常会使用全转录体相关距离。
cell.dist <- as.dist(1-armaCor(t(r$reductions$PCA)))
基于最低平均表达量(至少在其中一个cluster中)筛选基因,输出生成的有效基因总数:
emat <- filter.genes.by.cluster.expression(emat,cluster.label,min.max.cluster.average = 0.2)nmat <- filter.genes.by.cluster.expression(nmat,cluster.label,min.max.cluster.average = 0.05)length(intersect(rownames(emat),rownames(nmat)))
[1] 2541
估计RNA速率:
fit.quantile <- 0.02rvel.cd <- gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells=25,cell.dist=cell.dist,fit.quantile=fit.quantile)
calculating cell knn ... done
calculating convolved matrices ... done
fitting gamma coefficients ... done. succesfful fit for 743 genes
filtered out 149 out of 743 genes due to low nmat-emat correlation
filtered out 27 out of 594 genes due to low nmat-emat slope
calculating RNA velocity shift ... done
calculating extrapolated cell state ... done
可视化 t-SNE 嵌入上的速率:
show.velocity.on.embedding.cor(emb,rvel.cd,n=200,scale='sqrt',cell.colors=ac(cell.colors,alpha=0.5),cex=0.8,arrow.scale=3,show.grid.flow=TRUE,min.grid.cell.mass=0.5,grid.n=40,arrow.lwd=1,do.par=F,cell.border.alpha = 0.1)
delta projections ... sqrt knn ... transition probs ... done
calculating arrows ... done
grid estimates ... grid.sd= 0.6580196 min.arrow.size= 0.01316039 max.grid.arrow.length= 0.05887327 done
特定基因可视化:
gene <- "Camp"gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells = 25,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=cell.colors,cell.dist=cell.dist,show.gene=gene,old.fit=rvel.cd,do.par=T)
calculating convolved matrices ... done
[1] 1
调整视图
调整kCells,这会给我们一个更理想化的图像视图,差异肉眼可见:
gene <- "Camp"gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells = 100,kGenes=1,fit.quantile=fit.quantile,cell.emb=emb,cell.colors=cell.colors,cell.dist=cell.dist,show.gene=gene,do.par=T)
calculating cell knn ... done
calculating convolved matrices ... done
[1] 1