RNA velocity analysis is a very important kind of analysis, so let's dig into it in depth here (the Python version; I personally prefer Python). For reference, see the Velocyto documentation.
Installing velocyto is straightforward: just pip install it
pip install velocyto
As for usage:
velocyto --help
Usage: velocyto [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
run Runs the velocity analysis outputting a loom file
run10x Runs the velocity analysis for a Chromium Sample
run-dropest Runs the velocity analysis on DropEst preprocessed data
run-smartseq2 Runs the velocity analysis on SmartSeq2 data (independent bam file per cell)
tools helper tools for velocyto
What velocyto actually works from is splicing information in the transcriptome (spliced versus unspliced reads), so the longer the transcript coverage the better, ideally full length. In other words, Smart-seq2 data is the best fit for this analysis, but 10X data is by far the most abundant and popular, so it is well worth trying too.
velocyto run10x --help
Usage: velocyto run10x [OPTIONS] SAMPLEFOLDER GTFFILE
Runs the velocity analysis for a Chromium 10X Sample
10XSAMPLEFOLDER specifies the cellranger sample folder
GTFFILE genome annotation file
Options:
-s, --metadatatable FILE Table containing metadata of the various samples (csv fortmated rows are samples and cols are entries)
-m, --mask FILE .gtf file containing intervals to mask
-l, --logic TEXT The logic to use for the filtering (default: Default)
-M, --multimap Consider not unique mappings (not reccomended)
-@, --samtools-threads INTEGER The number of threads to use to sort the bam by cellID file using samtools
--samtools-memory INTEGER The number of MB used for every thread by samtools to sort the bam file
-t, --dtype TEXT The dtype of the loom file layers - if more than 6000 molecules/reads per gene per cell are expected set uint32 to avoid truncation (default run_10x: uint16)
-d, --dump TEXT For debugging purposes only: it will dump a molecular mapping report to hdf5. --dump N, saves a cell every N cells. If p is prepended a more complete (but huge) pickle report is printed (default: 0)
-v, --verbose Set the vebosity level: -v (only warinings) -vv (warinings and info) -vvv (warinings, info and debug)
--help Show this message and exit.
A simple example of how to run it:
velocyto run10x -m repeat_msk.gtf mypath/sample01 somepath/refdata-cellranger-mm10-1.2.0/genes/genes.gtf
The loom file it produces is what goes into the downstream analysis.
Let's take a detailed look at this file and the analysis built on it. (Still using Python as the example; R also works, but reading loom files in R requires hdf5r, which is very hard to install — if you ever get it installed, write down how you did it. This is one of the reasons I dislike R.)
Step one: let's look at what is inside the generated loom file
import scvelo as scv
scv.set_figure_params()  ## set some plotting defaults
adata = scv.read('TH_possorted_genome_bam_YHVE5.loom', cache=True)  ### read the loom file
Of course, we could just as well read an h5ad file here (produced by the Python analysis package scanpy).
adata = scv.read(filename, cache=True)
Let's see what the loom file contains:
adata
AnnData object with n_obs × n_vars = 70668 × 27998
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
adata.var
Accession Chromosome End Start Strand
Xkr4 ENSMUSG00000051951 1 3671498 3205901 -
Gm37381 ENSMUSG00000102343 1 3986215 3905739 -
Rp1 ENSMUSG00000025900 1 4409241 3999557 -
Rp1 ENSMUSG00000109048 1 4409187 4292981 -
Sox17 ENSMUSG00000025902 1 4497354 4490931 -
... ... ... ... ... ...
Gm28672 ENSMUSG00000100492 Y 89225296 89222067 +
Gm28670 ENSMUSG00000099982 Y 89394761 89391528 +
Gm29504 ENSMUSG00000100533 Y 90277501 90275224 +
Gm20837 ENSMUSG00000096178 Y 90433263 90401248 +
Erdr1 ENSMUSG00000096768 Y 90816464 90784738 +
### var holds per-gene information: Ensembl accession, chromosome, start/end coordinates and strand
adata.layers
Layers with keys: matrix, ambiguous, spliced, unspliced
### layers holds quite a few matrices: the full count matrix plus the spliced, unspliced and ambiguous count matrices.
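A quick sanity check on these layers (a minimal sketch; each layer is a cells × genes matrix with the same shape as the main matrix):
print(adata.shape)
print(adata.layers['spliced'].shape, adata.layers['unspliced'].shape)
### total counts per category give a rough feel for the spliced/unspliced balance
print(adata.layers['spliced'].sum(), adata.layers['unspliced'].sum(), adata.layers['ambiguous'].sum())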
As you can see, the loom file only carries this splicing information; none of the annotation from our single-cell clustering is in there, so we need to merge the clustering results into it. Since we are working in Python, the natural input would be the h5ad file produced by scanpy, but here we have no h5ad file, so instead we read Seurat's exported results and assign them by hand — all we need are the cluster labels and the 2D embedding coordinates. (The csv paths in the code below are placeholders; their barcodes must match the barcodes in the loom file.)
import pandas as pd
import numpy as np

### placeholder paths: csv files exported from Seurat
cluster_file = 'seurat_clusters.csv'   # columns: Barcode, Cluster
tsne_file = 'seurat_tsne.csv'          # columns: Barcode, tSNE_1, tSNE_2

### cluster labels, reordered to match the barcode order in adata
cluster = pd.read_csv(cluster_file, index_col=0)
adata.obs['clusters'] = cluster.loc[adata.obs_names, 'Cluster'].values

### tSNE coordinates: build a barcode -> (x, y) lookup, skipping the header line
tsne_axis_dict = {}
with open(tsne_file, 'r') as tsne_axis:
    all_tsne = tsne_axis.readlines()
for i in all_tsne[1:]:
    fields = i.strip().split(',')
    tsne_axis_dict[fields[0]] = np.array(fields[1:3], dtype='float32')

### stack the coordinates in the same barcode order as adata's cells
tsne_axis_list = [tsne_axis_dict[i] for i in adata.obs_names]
adata.obsm['X_tsne'] = np.array(tsne_axis_list)
### Assigning UMAP coordinates (adata.obsm['X_umap']) works exactly the same way. With that, we have an object that plugs seamlessly into the downstream analysis.
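As an aside, if you do have a scanpy h5ad file, scVelo can do this merge for you: scv.utils.merge matches the cells of the two objects by barcode and returns a single AnnData carrying both the annotations and the spliced/unspliced layers (a minimal sketch; the h5ad filename is a placeholder):
adata_sc = scv.read('my_scanpy_result.h5ad')   ### placeholder: scanpy result with clusters and embeddings
ldata = scv.read('TH_possorted_genome_bam_YHVE5.loom', cache=True)
adata = scv.utils.merge(adata_sc, ldata)   ### cells are matched by barcode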
Next comes the visualization, which is relatively simple
scv.pl.proportions(adata)
This plot shows the proportions of spliced/unspliced counts in each cluster. Note that it depends entirely on the clustering result, so make sure the clustering is in good shape before running this analysis.
Next is the RNA velocity analysis itself. Keep in mind that since we grafted on the Seurat results, we should not filter cells again here.
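Before scv.tl.velocity we still need the first- and second-order moments of the counts; below is a minimal preprocessing sketch, with parameter values taken from the scVelo tutorial. The filtering step here only selects genes (it does not drop cells), so it does not undo the Seurat cell selection.
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)  ### gene selection + normalization
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)  ### KNN-smoothed moments used by the velocity model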
scv.tl.velocity(adata)
###The computed velocities are stored in adata.layers just like the count matrices.
scv.tl.velocity_graph(adata)
###For a variety of applications, the velocity graph can be converted to a transition matrix by applying a Gaussian kernel to transform the cosine correlations into actual transition probabilities. You can access the Markov transition matrix via scv.utils.get_transition_matrix.
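For example, a minimal sketch of pulling out that transition matrix:
T = scv.utils.get_transition_matrix(adata)   ### sparse n_cells x n_cells matrix of transition probabilities
print(T.shape)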
The next steps are just plotting, which is very simple
scv.pl.velocity_embedding_stream(adata, basis='umap')
scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120)
Showing marker genes
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2', 'Adk'], ncols=2)
Identify important genes
We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being significantly higher/lower compared to the remaining population. The module scv.tl.rank_velocity_genes runs a differential velocity t-test and outputs a gene ranking for each cluster. Thresholds can be set (e.g. min_corr) to restrict the test to a selection of gene candidates.
scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()
kwargs = dict(frameon=False, size=10, linewidth=1.5,
add_outline='Ngn3 high EP, Pre-endocrine, Beta')
scv.pl.scatter(adata, df['Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs)
scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs)
The remaining visualizations are all just as simple, so I won't go through them one by one — go give it a try.
Stay angry, and may Wang Duoyu lose his entire fortune.