最近简书服务器不知道发生了什么,这两天写的文章一直在审核状态中,既不让发表,也不封,不知道什么原因,这一篇来分享一篇代码文,看看服务器会做怎样的处理。
关于Consensus Non-negative Matrix factorization (cNMF) ,之前分享过一篇,在10X单细胞(10X空间转录组)数据分析之约束非负矩阵分解(cNMF),大家可以回顾一下,简单来说就是约束非负矩阵分解(CNMF)算法,该算法将标签信息作为附加的硬约束,使得具有相同类标签信息的数据在新的低维空间中仍然保持一致。
我们来看看cNMF在单细胞数据分析中的运用
安装很简单
pip install cnmf
加载
%matplotlib inline
import os
import pandas as pd
import numpy as np
from scipy.io import mmread
import scipy.sparse as sp
import matplotlib.pyplot as plt
from IPython.display import Image
import scanpy as sc
from cnmf import cNMF
if not os.path.exists('example_PBMC'):
os.mkdir('example_PBMC')
np.random.seed(14)
示例数据
! wget -O example_PBMC/pbmc3k_filtered_gene_bc_matrices.tar.gz http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
! tar -zxvf example_PBMC/pbmc3k_filtered_gene_bc_matrices.tar.gz
! mv filtered_gene_bc_matrices example_PBMC
! rm example_PBMC/pbmc3k_filtered_gene_bc_matrices.tar.gz
加载数据并质控
adata = sc.read_10x_mtx(
'example_PBMC/filtered_gene_bc_matrices/hg19/',
var_names='gene_symbols',
cache=False)
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=200) # filter cells with fewer than 200 genes
sc.pp.filter_cells(adata, min_counts=200) # This is a weaker threshold than above. It is just to population the n_counts column in adata
sc.pp.filter_genes(adata, min_cells=3) # filter genes detected in fewer than 3 cells
## plot log10 # counts per cell
plt.hist(np.log10(adata.obs['n_counts']), bins=100)
_ = plt.xlabel('log10 Counts Per cell')
_ = plt.ylabel('# Cells')
count_adat_fn = 'example_PBMC/counts.h5ad'
sc.write(count_adat_fn, adata)
也可以将包含计数矩阵的文本文件传递给 cNMF。 (但还是用scanpy的h5ad会好一点)
dense_data = pd.DataFrame(adata.X.todense(), index=adata.obs.index, columns=adata.var.index).astype(int)
dense_data.head()
## Commented out because it is slow
count_txt_fn = 'example_PBMC/counts.tsv'
#dense_data.to_csv(count_txt_fn, sep='\t')
基因为列,barcode为行的矩阵
运行cNMF
numiter=20 # Number of NMF replicates. Set this to a larger value ~200 for real data. We set this to a relatively low value here for illustration at a faster speed
numhvgenes=2000 ## Number of over-dispersed genes to use for running the actual factorizations
## Results will be saved to [output_directory]/[run_name] which in this example is example_PBMC/cNMF/pbmc_cNMF
output_directory = 'example_PBMC/cNMF'
if not os.path.exists(output_directory):
os.mkdir(output_directory)
run_name = 'pbmc_cNMF'
## Specify the Ks to use as a space separated list in this case "5 6 7 8 9 10"
K = ' '.join([str(i) for i in range(5,11)])
## To speed this up, you can run it for only K=7-8 with the option below
#K = ' '.join([str(i) for i in range(7,9)])
seed = 14 ## Specify a seed pseudorandom number generation for reproducibility
## Path to the filtered counts dataset we output previously
countfn = 'example_PBMC/counts.h5ad'
## Initialize the cnmf object that will be used to run analyses
cnmf_obj = cNMF(output_dir=output_directory, name=run_name)
## Prepare the data, I.e. subset to 2000 high-variance genes, and variance normalize
cnmf_obj.prepare(counts_fn=countfn, components=np.arange(5,11), n_iter=20, seed=14, num_highvar_genes=2000)
## Specify that the jobs are being distributed over a single worker (total_workers=1) and then launch that worker
cnmf_obj.factorize(worker_i=0, total_workers=1)
也可以
cmd = 'cnmf prepare --output-dir example_PBMC/cNMF --name pbmc_cNMF -c example_PBMC/counts.h5ad -k 5 6 7 8 9 10 --n-iter 20 --total-workers 1 --seed 14 --numgenes 2000 --beta-loss frobenius'
print('Command line - prepare step: %s' % cmd)
!{cmd}
cmd = 'cnmf factorize --output-dir example_PBMC/cNMF --name pbmc_cNMF --worker-index 0'
print('Command line - factorize step: %s' % cmd)
!{cmd}
Compute the stability and error at each choice of K to see if a clear choice jumps out.
请注意,最大稳定性解决方案并不总是最佳选择,具体取决于应用。 然而,即使必须调查 K 的多个选择,这通常也是一个很好的起点
cnmf_obj.k_selection_plot(close_fig=False)
print('This saves the corresponding figure to the following file: %s' % cnmf_obj.paths['k_selection_plot'])
或者
kselect_plot_cmd = 'cnmf k_selection_plot --output-dir example_PBMC/cNMF --name pbmc_cNMF'
print('K selection plot command: %s' % kselect_plot_cmd)
#!{kselect_plot_cmd}
In this range, K=7 gave the most stable solution so we will begin by looking at that.
下一步计算给定 K 选择的共识解决方案。首先在没有任何异常值过滤的情况下运行它以查看它的样子。 将密度阈值设置为 >= 2.00(两个单位向量之间的最大可能距离)可确保不会过滤任何内容。
Then we run the consensus with a filter for outliers determined based on inspecting the histogram of distances between components and their nearest neighbors
selected_K = 7
density_threshold = 2.00
cnmf_obj.consensus(k=selected_K, density_threshold=density_threshold, show_clustering=True, close_clustergram_fig=False)
上面的共识图表明,在具有一些异常值的复制之间存在相当程度的一致性。 0.1 的异常值阈值似乎是合适的
density_threshold = 0.10
cnmf_obj.consensus(k=selected_K, density_threshold=density_threshold, show_clustering=True, close_clustergram_fig=False)
cNMF就运行结束了,会得到一系列的分析结果
pbmc_cNMF.clustering.k_7.dt_0_1.png
pbmc_cNMF.clustering.k_7.dt_2_0.png
pbmc_cNMF.gene_spectra_score.k_7.dt_0_1.txt
pbmc_cNMF.gene_spectra_score.k_7.dt_2_0.txt
pbmc_cNMF.gene_spectra_tpm.k_7.dt_0_1.txt
pbmc_cNMF.gene_spectra_tpm.k_7.dt_2_0.txt
pbmc_cNMF.k_selection.png
pbmc_cNMF.k_selection_stats.df.npz
pbmc_cNMF.overdispersed_genes.txt
pbmc_cNMF.spectra.k_7.dt_0_1.consensus.txt
pbmc_cNMF.spectra.k_7.dt_2_0.consensus.txt
pbmc_cNMF.usages.k_7.dt_0_1.consensus.txt
pbmc_cNMF.usages.k_7.dt_2_0.consensus.txt
The steps below load the AnnData object, TPT normalize it, mean and variance normalize each gene, run PCA, and run UMAP. We use the same set of high-variance genes that were determined prior to running cNMF which we obtain as the columns of the raw output file:
adata = sc.read(countfn)
hvgs = open('./example_PBMC/cNMF/pbmc_cNMF/pbmc_cNMF.overdispersed_genes.txt').read().split('\n')
sc.pp.normalize_per_cell(adata, counts_per_cell_after=10**4) ## TPT normalization
## Set log-normalized data to the raw attribute of the AnnData object to make it easy to plot expression levels of individual genes.
## This does not log normalize the actual AnnData data matrix
adata.raw = sc.pp.log1p(adata.copy(), copy=True)
## Subset out only the high-variance genes
adata = adata[:,hvgs]
## Mean and variance normalize the genes
sc.pp.scale(adata)
## Run PCA
sc.pp.pca(adata)
## Make a scree plot to determine number of PCs to use for UMAP
sc.pl.pca_variance_ratio(adata, log=True)
## Construct the nearest neighbor graph for UMAP
sc.pp.neighbors(adata, n_neighbors=50, n_pcs=15)
## Run UMAP
sc.tl.umap(adata)
## Plot the UMAP with some cannonical marker genes to see that the apparent grouping makes sense
sc.pl.umap(adata, color=['IL7R', 'GZMB', 'CD8A','GNLY',
'MS4A1', 'CD14', 'FCGR3A', 'HLA-DRA'],
use_raw=True, ncols=3)
首先,我们加载 cNMF 输出的原始使用文件。 cNMF 不会将每个细胞的使用归一化为 1,因此我们在下面的第二个细胞中将其作为后续步骤
usage_file = cnmf_obj.paths['consensus_usages__txt'] % (selected_K, '0_1')
print(usage_file)
usage = pd.read_csv(usage_file, sep='\t', index_col=0)
usage.columns = ['Usage_%s' % i for i in usage.columns]
usage.head()
usage_norm = usage.div(usage.sum(axis=1), axis=0)
usage_norm.head()
adata.obs = pd.merge(left=adata.obs, right=usage_norm, how='left', left_index=True, right_index=True)
sc.pl.umap(adata, color=usage_norm.columns,
use_raw=True, ncols=3, vmin=0, vmax=1)
用法与上图中标记基因的表达大致重叠:
- 2 - CD14+ monocyte identity
- 4 - CD4 T-cell identity
- 5 - NK/CD8 T-cell identity
- 6 - B-cell identity
- 7 CD16+ monocyte identity
不太清楚的是 GEP 1 和 3 的解释。让我们加载gene_scores 并识别与每个程序最相关的基因,看看我们是否可以解释它们。
selected_K = 7
gene_scores_file = cnmf_obj.paths['gene_spectra_score__txt'] % (selected_K, '0_1')
print(gene_scores_file)
## Load the Z-scored GEPs which reflect how enriched a gene is in each GEP relative to all of the others
gene_scores = pd.read_csv(gene_scores_file, sep='\t', index_col=0).T
gene_scores.head()
## Obtain the top 40 genes for each GEP in sorted order and combine them into a single dataframe
top_genes = []
ngenes = 40
for gep in gene_scores.columns:
top_genes.append(list(gene_scores.sort_values(by=gep, ascending=False).index[:ngenes]))
top_genes = pd.DataFrame(top_genes, index=gene_scores.columns).T
top_genes
The top expressed genes for GEP 3 include common cell cycle markers such as MKI67 but also plasmablast genes such as MZB1
GEP 1 is a bit more complicated but includes high expression of genes like MMD (MMD monocyte to macrophage differentiation associated) and ESAM (Endothelial cell adhesion molecule) suggesting that is may reflect a distinct stage of monocyte differentiation
sc.pl.umap(adata, color=['CDC6', 'MKI67', 'MMD', 'ESAM'],
use_raw=True, ncols=4)
Let us quickly investigate a larger value of K to see if there are other less robust GEPs that may be worth exploring further, potentially with larger numbers of NMF replicates
cnmf_obj.consensus(k=8, density_threshold=0.1, show_clustering=True, close_clustergram_fig=False)
cnmf_obj.consensus(k=8, density_threshold=0.15, show_clustering=True, close_clustergram_fig=False)
AS expected, there is considerably more variability in the solutions of distinct NMF replicates with this run. Thus more caution must be taken in interpreting the results, and ideally cNMF should be run with a larger number of replicates
It seems that in some iterations the 3rd GEP is getting split into 2 components, while in other iterations, it is being kept as a single one
Lets look at the resulting programs
usage_file = cnmf_obj.paths['consensus_usages__txt'] % (8, '0_15')
print(usage_file)
usage = pd.read_csv(usage_file, sep='\t', index_col=0)
usage.columns = ['Usage_%s' % i for i in usage.columns]
usage_norm = usage.div(usage.sum(axis=1), axis=0)
# Remove the old usages from the AnnData object
adata.obs = adata.obs.drop([x for x in adata.obs.columns if 'Usage_' in x], axis=1)
adata.obs = pd.merge(left=adata.obs, right=usage_norm, how='left', left_index=True, right_index=True)
sc.pl.umap(adata, color=usage_norm.columns,
use_raw=True, ncols=3, vmin=0, vmax=1)]
示例代码在cNMF
安装包在cNMF
期待一下吧,生活很好,有你更好