10X单细胞(10X空间转录组)数据分析之Consensus Non-negative Matrix factorization (cNMF)

最近简书服务器不知道发生了什么,这两天写的文章一直在审核状态中,既不让发表,也不封,不知道什么原因,这一篇来分享一篇代码文,看看服务器会做怎样的处理。

关于Consensus Non-negative Matrix factorization (cNMF) ,之前分享过一篇,在10X单细胞(10X空间转录组)数据分析之约束非负矩阵分解(cNMF),大家可以回顾一下,简单来说就是约束非负矩阵分解(CNMF)算法,该算法将标签信息作为附加的硬约束,使得具有相同类标签信息的数据在新的低维空间中仍然保持一致。

图片.png

我们来看看cNMF在单细胞数据分析中的运用

安装很简单
pip install cnmf
加载
%matplotlib inline

import os
import pandas as pd
import numpy as np
from scipy.io import mmread
import scipy.sparse as sp
import matplotlib.pyplot as plt
from IPython.display import Image
import scanpy as sc
from cnmf import cNMF

if not os.path.exists('example_PBMC'):
    os.mkdir('example_PBMC')
    
np.random.seed(14)
示例数据
! wget -O example_PBMC/pbmc3k_filtered_gene_bc_matrices.tar.gz http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
! tar -zxvf example_PBMC/pbmc3k_filtered_gene_bc_matrices.tar.gz    
! mv filtered_gene_bc_matrices example_PBMC
! rm  example_PBMC/pbmc3k_filtered_gene_bc_matrices.tar.gz
加载数据并质控
adata = sc.read_10x_mtx(
    'example_PBMC/filtered_gene_bc_matrices/hg19/',
    var_names='gene_symbols',
    cache=False)

adata.var_names_make_unique()

sc.pp.filter_cells(adata, min_genes=200) # filter cells with fewer than 200 genes
sc.pp.filter_cells(adata, min_counts=200)  # This is a weaker threshold than above. It is just to population the n_counts column in adata
sc.pp.filter_genes(adata, min_cells=3) # filter genes detected in fewer than 3 cells

## plot log10 # counts per cell
plt.hist(np.log10(adata.obs['n_counts']), bins=100)
_ = plt.xlabel('log10 Counts Per cell')
_ = plt.ylabel('# Cells')



count_adat_fn = 'example_PBMC/counts.h5ad'
sc.write(count_adat_fn, adata)
图片.png
也可以将包含计数矩阵的文本文件传递给 cNMF。 (但还是用scanpy的h5ad会好一点)
dense_data = pd.DataFrame(adata.X.todense(), index=adata.obs.index, columns=adata.var.index).astype(int)
dense_data.head()

## Commented out because it is slow

count_txt_fn = 'example_PBMC/counts.tsv'
#dense_data.to_csv(count_txt_fn, sep='\t')
基因为列,barcode为行的矩阵
运行cNMF
numiter=20 # Number of NMF replicates. Set this to a larger value ~200 for real data. We set this to a relatively low value here for illustration at a faster speed
numhvgenes=2000 ## Number of over-dispersed genes to use for running the actual factorizations

## Results will be saved to [output_directory]/[run_name] which in this example is example_PBMC/cNMF/pbmc_cNMF
output_directory = 'example_PBMC/cNMF'
if not os.path.exists(output_directory):
    os.mkdir(output_directory)
run_name = 'pbmc_cNMF'

## Specify the Ks to use as a space separated list in this case "5 6 7 8 9 10"
K = ' '.join([str(i) for i in range(5,11)])

## To speed this up, you can run it for only K=7-8 with the option below
#K = ' '.join([str(i) for i in range(7,9)])


seed = 14 ## Specify a seed pseudorandom number generation for reproducibility

## Path to the filtered counts dataset we output previously
countfn = 'example_PBMC/counts.h5ad'

## Initialize the cnmf object that will be used to run analyses
cnmf_obj = cNMF(output_dir=output_directory, name=run_name)

## Prepare the data, I.e. subset to 2000 high-variance genes, and variance normalize
cnmf_obj.prepare(counts_fn=countfn, components=np.arange(5,11), n_iter=20, seed=14, num_highvar_genes=2000)

## Specify that the jobs are being distributed over a single worker (total_workers=1) and then launch that worker
cnmf_obj.factorize(worker_i=0, total_workers=1)

也可以

cmd = 'cnmf prepare --output-dir example_PBMC/cNMF --name pbmc_cNMF -c example_PBMC/counts.h5ad -k 5 6 7 8 9 10 --n-iter 20 --total-workers 1 --seed 14 --numgenes 2000 --beta-loss frobenius'
print('Command line - prepare step: %s' % cmd)
!{cmd}
cmd = 'cnmf factorize --output-dir example_PBMC/cNMF --name pbmc_cNMF --worker-index 0'
print('Command line - factorize step: %s' % cmd)
!{cmd}
Compute the stability and error at each choice of K to see if a clear choice jumps out.

请注意,最大稳定性解决方案并不总是最佳选择,具体取决于应用。 然而,即使必须调查 K 的多个选择,这通常也是一个很好的起点

cnmf_obj.k_selection_plot(close_fig=False)
图片.png
print('This saves the corresponding figure to the following file: %s' % cnmf_obj.paths['k_selection_plot'])
或者
kselect_plot_cmd = 'cnmf k_selection_plot --output-dir example_PBMC/cNMF --name pbmc_cNMF'
print('K selection plot command: %s' % kselect_plot_cmd)
#!{kselect_plot_cmd}
In this range, K=7 gave the most stable solution so we will begin by looking at that.

下一步计算给定 K 选择的共识解决方案。首先在没有任何异常值过滤的情况下运行它以查看它的样子。 将密度阈值设置为 >= 2.00(两个单位向量之间的最大可能距离)可确保不会过滤任何内容。
Then we run the consensus with a filter for outliers determined based on inspecting the histogram of distances between components and their nearest neighbors

selected_K = 7
density_threshold = 2.00
cnmf_obj.consensus(k=selected_K, density_threshold=density_threshold, show_clustering=True, close_clustergram_fig=False)
图片.png

上面的共识图表明,在具有一些异常值的复制之间存在相当程度的一致性。 0.1 的异常值阈值似乎是合适的

density_threshold = 0.10
cnmf_obj.consensus(k=selected_K, density_threshold=density_threshold, show_clustering=True, close_clustergram_fig=False)
图片.png
cNMF就运行结束了,会得到一系列的分析结果

pbmc_cNMF.clustering.k_7.dt_0_1.png
pbmc_cNMF.clustering.k_7.dt_2_0.png
pbmc_cNMF.gene_spectra_score.k_7.dt_0_1.txt
pbmc_cNMF.gene_spectra_score.k_7.dt_2_0.txt
pbmc_cNMF.gene_spectra_tpm.k_7.dt_0_1.txt
pbmc_cNMF.gene_spectra_tpm.k_7.dt_2_0.txt
pbmc_cNMF.k_selection.png
pbmc_cNMF.k_selection_stats.df.npz
pbmc_cNMF.overdispersed_genes.txt
pbmc_cNMF.spectra.k_7.dt_0_1.consensus.txt
pbmc_cNMF.spectra.k_7.dt_2_0.consensus.txt
pbmc_cNMF.usages.k_7.dt_0_1.consensus.txt
pbmc_cNMF.usages.k_7.dt_2_0.consensus.txt

The steps below load the AnnData object, TPT normalize it, mean and variance normalize each gene, run PCA, and run UMAP. We use the same set of high-variance genes that were determined prior to running cNMF which we obtain as the columns of the raw output file:
adata = sc.read(countfn)
hvgs = open('./example_PBMC/cNMF/pbmc_cNMF/pbmc_cNMF.overdispersed_genes.txt').read().split('\n')
sc.pp.normalize_per_cell(adata, counts_per_cell_after=10**4) ## TPT normalization

## Set log-normalized data to the raw attribute of the AnnData object to make it easy to plot expression levels of individual genes.
## This does not log normalize the actual AnnData data matrix
adata.raw = sc.pp.log1p(adata.copy(), copy=True)
## Subset out only the high-variance genes

adata = adata[:,hvgs]

## Mean and variance normalize the genes
sc.pp.scale(adata)
## Run PCA
sc.pp.pca(adata)
## Make a scree plot to determine number of PCs to use for UMAP
sc.pl.pca_variance_ratio(adata, log=True)
## Construct the nearest neighbor graph for UMAP
sc.pp.neighbors(adata, n_neighbors=50, n_pcs=15)
## Run UMAP
sc.tl.umap(adata)
## Plot the UMAP with some cannonical marker genes to see that the apparent grouping makes sense
sc.pl.umap(adata, color=['IL7R', 'GZMB', 'CD8A','GNLY',
                         'MS4A1', 'CD14', 'FCGR3A', 'HLA-DRA'],
           use_raw=True, ncols=3)
图片.png
首先,我们加载 cNMF 输出的原始使用文件。 cNMF 不会将每个细胞的使用归一化为 1,因此我们在下面的第二个细胞中将其作为后续步骤
usage_file = cnmf_obj.paths['consensus_usages__txt'] % (selected_K, '0_1')
print(usage_file)
usage = pd.read_csv(usage_file, sep='\t', index_col=0)
usage.columns = ['Usage_%s' % i for i in usage.columns]
usage.head()
图片.png
usage_norm = usage.div(usage.sum(axis=1), axis=0)
usage_norm.head()
图片.png
adata.obs = pd.merge(left=adata.obs, right=usage_norm, how='left', left_index=True, right_index=True)

sc.pl.umap(adata, color=usage_norm.columns,
           use_raw=True, ncols=3, vmin=0, vmax=1)
图片.png

用法与上图中标记基因的表达大致重叠:

  • 2 - CD14+ monocyte identity
  • 4 - CD4 T-cell identity
  • 5 - NK/CD8 T-cell identity
  • 6 - B-cell identity
  • 7 CD16+ monocyte identity

不太清楚的是 GEP 1 和 3 的解释。让我们加载gene_scores 并识别与每个程序最相关的基因,看看我们是否可以解释它们。

selected_K = 7
gene_scores_file = cnmf_obj.paths['gene_spectra_score__txt'] % (selected_K, '0_1')
print(gene_scores_file)
## Load the Z-scored GEPs which reflect how enriched a gene is in each GEP relative to all of the others
gene_scores = pd.read_csv(gene_scores_file, sep='\t', index_col=0).T
gene_scores.head()
图片.png
## Obtain the top 40 genes for each GEP in sorted order and combine them into a single dataframe

top_genes = []
ngenes = 40
for gep in gene_scores.columns:
    top_genes.append(list(gene_scores.sort_values(by=gep, ascending=False).index[:ngenes]))
    
top_genes = pd.DataFrame(top_genes, index=gene_scores.columns).T
top_genes
图片.png

The top expressed genes for GEP 3 include common cell cycle markers such as MKI67 but also plasmablast genes such as MZB1

GEP 1 is a bit more complicated but includes high expression of genes like MMD (MMD monocyte to macrophage differentiation associated) and ESAM (Endothelial cell adhesion molecule) suggesting that is may reflect a distinct stage of monocyte differentiation

sc.pl.umap(adata, color=['CDC6', 'MKI67', 'MMD', 'ESAM'],
           use_raw=True, ncols=4)
图片.png

Let us quickly investigate a larger value of K to see if there are other less robust GEPs that may be worth exploring further, potentially with larger numbers of NMF replicates

cnmf_obj.consensus(k=8, density_threshold=0.1, show_clustering=True, close_clustergram_fig=False)
图片.png
cnmf_obj.consensus(k=8, density_threshold=0.15, show_clustering=True, close_clustergram_fig=False)
图片.png

AS expected, there is considerably more variability in the solutions of distinct NMF replicates with this run. Thus more caution must be taken in interpreting the results, and ideally cNMF should be run with a larger number of replicates
It seems that in some iterations the 3rd GEP is getting split into 2 components, while in other iterations, it is being kept as a single one
Lets look at the resulting programs

usage_file = cnmf_obj.paths['consensus_usages__txt'] % (8, '0_15')
print(usage_file)
usage = pd.read_csv(usage_file, sep='\t', index_col=0)
usage.columns = ['Usage_%s' % i for i in usage.columns]
usage_norm = usage.div(usage.sum(axis=1), axis=0)
# Remove the old usages from the AnnData object
adata.obs = adata.obs.drop([x for x in adata.obs.columns if 'Usage_' in x], axis=1)

adata.obs = pd.merge(left=adata.obs, right=usage_norm, how='left', left_index=True, right_index=True)

sc.pl.umap(adata, color=usage_norm.columns,
           use_raw=True, ncols=3, vmin=0, vmax=1)]
图片.png

示例代码在cNMF

安装包在cNMF

期待一下吧,生活很好,有你更好

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,732评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,496评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,264评论 0 338
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,807评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,806评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,675评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,029评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,683评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 41,704评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,666评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,773评论 1 332
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,413评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,016评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,978评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,204评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,083评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,503评论 2 343