hello,大家好,今天给大家分享的是利用10X单细胞数据构建基因表达网络,关注cluster内部和之间的基因网络关系,话不多说,开始分享。文章在Constructing Local Cell Specific Networks from Single Cell Data,感兴趣可以读一读。
图片.png
安装
pip install locCSN
加载数据
There are 51 marker genes and 1018 cells from 7 cell types. The gene expression are stored in logChumaker.txt and corresponding cell types in chutypectname.txt. Cell types are H1, H9, DEC, EC, HFF, NPC and TF. In our paper, we focus on cell type DEC and NPC.(示例数据集)
# Import packages
import locCSN
import os
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set path to data
os.chdir('yourpathtodata/Chutype/')
# read in Chutype dataset
data = sc.read_text('logChumarker.txt')
data.shape
data = data.transpose() # 1018 cells * 51 genes
cell_type = pd.read_csv('chutypectname.txt', sep = ' ')
data.obs = cell_type # The observation are labeled with cell types.
# Plot the Heatmap of gene expression
sc.pl.heatmap(data, data.var.index, groupby= "cell_type", dendrogram = False, swap_axes = True,
show_gene_labels= True, cmap='Wistia', figsize=(8,6))
图片.png
Calculate Pearson’s Correlation
After loading gene expression matrix and cell types, we first show the absolute Pearson’s correlation for DEC and NPC cells.
data_dec = data[data.obs.cell_type == "DEC", ]
X_dec = data_dec.X.transpose()
data_npc = data[data.obs.cell_type == 'NPC', ]
X_npc = data_npc.X.transpose()
corr_dec = np.corrcoef(X_dec)
corr_npc = np.corrcoef(X_npc)
np.fill_diagonal(corr_dec, 0)
np.fill_diagonal(corr_npc, 0)
plt.subplot(1, 2, 1)
plt.imshow(abs(corr_dec), vmin=0, vmax=0.7, cmap='RdPu')
plt.title('DEC', fontweight ="bold")
plt.subplot(1, 2, 2)
plt.imshow(abs(corr_npc), vmin=0, vmax=0.7, cmap='RdPu')
plt.title("NPC", fontweight = "bold")
plt.suptitle("Absolute Pearson`s Correlation", fontsize = 14, fontweight = "bold")
图片.png
Calculate CSN test statistics
import time
start = time.time()
csn_dec = locCSN.csn(X_dec, dev = True)
end = time.time()
print(end - start)
start = time.time()
csn_npc = locCSN.csn(X_npc, dev = True)
end = time.time()
print(end - start)
Now we show what function csn produces. For a specific cell, we compute each pair of genes and store test statistics in an upper diagnol matrix.
type(csn_dec)
# list
len(csn_dec) # 138 cells
# Let's see the test statistics for the first cell in DEC
plt.imshow(csn_dec[0].toarray(), vmin = -6, vmax = 6, cmap = 'coolwarm')
plt.title('DEC one cell', fontweight = "bold")
plt.colorbar()
#plt.savefig('dec_one_cell.png')
图片.png
As we stated in our paper: ‘’Althogh individual CSNs are estimated with considerable noise, average CSNs provide stable estimates of network structure, which provide better estimates of gene block structure.’’ For CSN test statistics matrices within a cell group, we first threshold test statistics and averaged the adjacency matrices with the cell group. The averaged CSN is the estimate of gene co-expression of the cell group. In this example, we thresholded at α=0.05.
from scipy.stats import norm
# Cutoff at norm(0.95)
csn_mat = [(item > norm.ppf(0.95)).astype(int) for item in csn_dec]
avgcsn_dec = sum(csn_mat).toarray()/len(csn_mat) + np.transpose(sum(csn_mat).toarray()/len(csn_mat))
csn_mat = [(item > norm.ppf(0.95)).astype(int) for item in csn_npc]
avgcsn_npc = sum(csn_mat).toarray()/len(csn_mat) + np.transpose(sum(csn_mat).toarray()/len(csn_mat))
plt.subplot(1, 2, 1)
plt.imshow(avgcsn_dec, cmap = "Greens", vmin = 0, vmax = 0.7)
plt.title('DEC', fontweight ="bold")
plt.subplot(1, 2, 2)
plt.imshow(avgcsn_npc, cmap = "Greens", vmin = 0, vmax = 0.7)
plt.title('NPC', fontweight = 'bold')
plt.suptitle("Averaged CSN, cut at alpha = 0.05", fontsize=14, fontweight = "bold")
图片.png
Comparing networks between two groups of cells
为了比较使用 CSN 的两组细胞,我们使用来自 ASD Brain 数据集的数据集 (Velmeshev et al. 2019)。 我们关注 942 个表达的 SFARI ASD 基因,并使用 Pearson 相关性和 CSN 比较对照组和 ASD 组的基因共表达网络。 比较方法是sLED 和DISTp。
数据
ASD Brain 数据集有 16 种细胞类型的 104,559 个细胞,并且非常稀疏。 我们使用 Metacell(Baran et al. 2019) 来减少数据集的稀疏性。 请参阅 Metacell 网站了解如何生成元单元。 下面的分析是基于metacells,它存储在这个文件夹中。
为了演示,我们从 4 种神经元层细胞类型的元细胞表达开始:L2/3、L4、L5/6 和 L5/6-CC 以及 942 个 SFARI 基因。 元单元表达式存储在此文件夹中。 请在您自己的目录中下载它们。
我们来看看神经元层的表达。 有1778个元细胞和942个基因。
# import scanpy as sc
# load data
os.chdir('yourpathtodata/Velme/')
data = sc.read_text('Velme_log_mc_cpm_L.txt')
data = data.transpose()
data.shape # 1778 metacells * 942 genes
meta_L = pd.read_csv('Velme_meta_mc_L.txt', sep = ' ')
meta_L.columns
# Index(['sampleID', 'broad.cluster', 'cluster', 'diagnosis'], dtype='object')
data.obs = meta_L
sc.pl.heatmap(data, data.var.index, groupby= ["cluster", "diagnosis"], dendrogram = False, swap_axes = True,
cmap='Wistia', figsize=(8,4))
图片.png
The metadata of metacells can be accessed in data.obs.
data.obs['cluster'].value_counts()
#L2/3 772
#L4 449
#L5/6-CC 341
#L5/6 216
data.obs['diagnosis'].value_counts()
#ASD 936
#Control 842
grouped_data = data.obs.groupby(['cluster', 'diagnosis'])
grouped_data.describe()['sampleID']['count']
#cluster diagnosis
#L2/3 ASD 414.0
# Control 358.0
#L4 ASD 238.0
# Control 211.0
#L5/6 ASD 107.0
# Control 109.0
#L5/6-CC ASD 177.0
# Control 164.0
CSN Construction of L4 cell-group
CSN test statistics are calcaulted within cell group. Now we focus on one cell-group: L4, which contains 449 metacells (238 ASD + 211 Control). Let’s first subset the neuron layers to L4 cell group.
ct_name = "L4"
data_L4 = data[data.obs.cluster == ct_name, :]
data_L4.shape # 449 metacell * 942 genes
mcknn = pd.read_csv('mcknn100_' + ct_name + '.txt', sep = ' ')
mcknn = mcknn.to_numpy()
X_L4 = data_L4.X.transpose()
The runtime of 942 genes is longer than 1 hour. Therefore we provide a toy example only use a subset of 20 genes and the runtime is approximately 20-40s.
start = time.time()
csn_L4_sub = locCSN.csn_loc(X_L4[0:20, :], mcknn)
end = time.time()
print(end_start)
# 25.824307203292847
为了存储和导出到不同的软件平台,我们将 CSN 测试统计矩阵展平。 对于每个 CSN 矩阵,我们将其向量化为一个向量,然后按单元对向量进行列绑定。 最终的 csn flatten 矩阵是基因对 * 细胞矩阵。 G 基因和 N 细胞的扁平矩阵大小为 G(G−1)/2×N。
csn_L4_sub_flat = locCSN.csntoflat(csn_L4_sub) # 20 genes
csn_L4_sub_flat.shape #190 gene pairs * 449 cells
# np.savetxt('csn_'+ct_name+'_sub_flat.txt', csn_L4_sub_flat, delimiter = '\t')
For analysis and visualization, we threshold the CSN test statistics at α=0.01 and average CSN within ASD and Control group respectively.
csn_mat = [(item > norm.ppf(0.99)).astype(int) for item in csn_L4_sub]
meta_L4 = meta_L[meta_L['cluster'] == ct_name]
c_index = (meta_L4['diagnosis'].to_numpy() == 'Control').tolist()
csn_mat_L4_control = [x for x, y in zip(csn_mat_L4, c_index) if y]
a_index = (meta_L4['diagnosis'].to_numpy() == 'ASD').tolist()
csn_mat_L4_asd = [x for x, y in zip(csn_mat_L4, a_index) if y]
avgcsn_L4_sub_control = sum(csn_mat_L4_control).toarray()/len(csn_mat_L4_control)
avgcsn_L4_sub_control = + np.transpose(avgcsn_L4_sub_control)
avgcsn_L4_sub_asd = sum(csn_mat_L4_asd).toarray()/len(csn_mat_L4_asd)
avgcsn_L4_sub_asd = + np.transpose(avgcsn_L4_sub_asd)
my_dpi = 50
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), dpi=my_dpi)
print(fig)
print(axes)
fig.suptitle("Avg CSN for L4, 20 genes, cut at alpha = 0.01", fontsize = 14, fontweight = "bold")
axes[0].set_title('Control', fontweight = "bold")
axes[0].imshow(avgcsn_L4_sub_control, cmap = "Greens", vmin = 0, vmax = 0.5)
axes[1].set_title('ASD', fontweight = "bold")
axes[1].imshow(avgcsn_L4_sub_asd, cmap = "Greens", vmin = 0, vmax = 0.5)
#fig.savefig('Velme_Avg_csn_L4_sub.png')
图片.png
For visualization in different software platform, we store averaged CSN for ASD and Control group in text file.
# save averaged CSN file control + ASD
avgcsn_flat_L4_sub = csntoflat([avgcsn_L4_sub_control, avgcsn_L4_sub_asd])
np.savetxt('avgcsn_asd_data_'+ct_name+'_sub.txt', avgcsn_flat_L4_sub, delimiter='\t')
Comparison of ASD and Control groups
在本节中,我们将演示如何使用 DISTp 和 sLED 比较基因共表达。 ### DISTp 我们还在locCSN 包中实现了DISTp 的测试。 DISTp 需要按小区组排列的 CSN 邻接矩阵。
# Arrange adjacency matrices by cell group
csn_mat_L4_new = csn_mat_L4_control + csn_mat_L4_asd
n_control = len(csn_mat_L4_control) # number of the first group
start = time.time()
pval = locCSN.DISTp(csn_mat_L4_new, n_control)
end = time.time()
print(end-start)
# 33.536951303482056
pval
# 0.252
这个示例的 DISTp pvalue 是 0.252,这并不重要。 具有 20 个基因的玩具示例的热图也显示了对照组和 ASD 组的相似基因连接强度。
下面提供了所有 942 个基因的代码。 在您完成演示时请不要运行。 需要很长时间才能完成。
# code for 942 genes of L4\. Do not run!!! It will take too long for demo
csn_L4 = locCSN.csn_block_loc(X_L4, mcknn)
csn_L4_flat = locCSN.csntoflat(csn_L4)
# save the flatten the CSN test statistics for sLED. This is a extremely big file
np.savetxt('csn_asd_loc_flat_',ct_name, '.txt', csn_L4_flat, delimiter='\t')
csn_mat_L4 = [(item > norm.ppf(0.99)).astype(int) for item in csn_L4]
meta_L4 = meta_L[meta_L['cluster'] == ct_name]
c_index = (meta_L4['diagnosis'].to_numpy() == 'Control').tolist()
csn_mat_L4_control = [x for x, y in zip(csn_mat_L4, c_index) if y]
a_index = (meta_L4['diagnosis'].to_numpy() == 'ASD').tolist()
csn_mat_L4_asd = [x for x, y in zip(csn_mat_L4, a_index) if y]
avgcsn_L4_control = sum(csn_mat_L4_control).toarray()/len(csn_mat_L4_control)
avgcsn_L4_control = + np.transpose(avgcsn_L4_control)
avgcsn_L4_asd = sum(csn_mat_L4_asd).toarray()/len(csn_mat_L4_asd)
avgcsn_L4_asd = + np.transpose(avgcsn_L4_asd)
# save averaged CSN file control + ASD
avgcsn_flat_L4 = csntoflat([avgcsn_L4_control, avgcsn_L4_asd])
np.savetxt('avgcsn_asd_data_'+ct_name+'.txt', avgcsn_flat_L4, delimiter='\t')
csn_mat_L4_new = csn_mat_L4_control + csn_mat_L4_asd
n_control = len(csn_mat_L4_control)
pval = locCSN.DISTp(csn_mat_L4_new, n_control)
pval
# 0.039
sLED comparison
sLED is a R package for two-sample test for high-dimensional covariance matrices. Details are in this GitHub repo. Let’s get started with installation:
library(sLED)
# read in gene expression and metadata files
setwd('yourpathtodata/Velme/')
log.mc.cpm.L = read.table('Velme_log_mc_cpm_L.txt')
meta.mc.L = read.table('Velme_meta_mc_L.txt')
# Let's take L4 as an example
ct.name = 'L4'
meta.mc.diag = as.numeric(meta.mc.L$diagnosis[meta.mc.L$cluster == ct.name] == 'ASD')
log.mc.L = data.matrix(log.mc.cpm.L[, meta.mc.L$cluster == ct.name])
log.mc.L[1:5, 1:5]
# mc_L_4 mc_L_7 mc_L_10 mc_L_25 mc_L_28
#SAMD11 0.000000 0.000000 0.000000 0.000000 0.000000
#SKI 5.797950 4.036630 5.298243 0.000000 3.842033
#SLC45A1 0.000000 2.814837 0.000000 0.000000 2.269254
#RERE 6.489579 5.775307 5.702040 5.917348 5.959781
#CA6 0.000000 1.965827 0.000000 0.000000 1.894637
# rownames of expression are ASD genes
asd.genes = rownames(log.mc.L)
result.cor = sLED(X = t(log.mc.L[, meta.mc.diag == 0]), Y = t(log.mc.L[, meta.mc.diag == 1]),
sumabs.seq = 0.2, npermute = 100, seeds = c(1:100), adj.beta = 0)
# 100 permutation started:
# 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
result.cor$pVal
尽管 sLED 是为协方差矩阵设计的,但比较差异矩阵的思想可以应用于 CSN 测量的共表达。 我已经修改了 CSN 邻接矩阵的 sLED 代码。
可以通过此链接找到 flatten csn 测试统计数据。 请在执行以下代码之前下载并解压缩文件 csn_asd_loc_flat_L4.txt。
# load functions of sLED for CSNs
source('https://raw.githubusercontent.com/xuranw/locCSN/main/Rcode/sLEDmodify.R')
csn.flat.temp = read.table(paste0('csn_asd_loc_flat_',ct_name, '.txt'))
csn.flat.temp = data.matrix(csn.flat.temp)
csn.t.flat = (csn.flat.temp > qnorm(0.99)) + 0 #Threshold at alpha = 0.01
result.csn = sLED.csn(X = csn.t.flat[, meta.mc.diag == 0], Y = csn.t.flat[, meta.mc.diag == 1],
sumabs.seq = 0.2, npermute = 100, seeds = c(1:100))
result.csn$pVal
# [1] 0
sLED-CSN produces a significant p-value for comparison of ASD and Control groups for L4 cell group.
leverage genes and DN genes
我们还可以识别杠杆基因,它们是稀疏前导特征向量的非零项。 差异网络基因解释了杠杆基因之间 90% 的变异性。 要获得杠杆基因和 DN 基因,请使用以下代码。
# Leverage genes
lev.L4 = asd.genes[result.csn$leverage > 0]
# DN genes (top 90%)
num.dn = min(which(cumsum(sort(result.csn$leverage, decreasing = T)) > 0.9)) # 27
dn.L4.id = which(result.csn$leverage >= sort(result.csn$leverage, decreasing = T)[num.dn])
dn.L4 = asd.genes[dn.L4.id]
We also plot the heatmap of averaged CSN for DN genes plus 30 random selected genes. DN genes are boxed in the heatmaps.
plot.gene.id = c(dn.L4.id, sample(setdiff(1:942, dn.L4, id), 30))
# DN genes + 30 random selected genes
avgcsn.flat = read.table(paste0('avgcsn_asd_data_', ct.name, '.txt'))
avg.csn.ctl = flat.to.matrix(avgcsn_temp[, 1])
avg.csn.asd = flat.to.matrix(avgcsn_temp[, 2])
library(reshape2)
m.data.avgcsn.dn = rbind(melt(avg.csn.ctl[plot.gene.id, plot.gene.id]),
melt(avg.csn.asd[plot.gene.id, plot.gene.id]))
temp.dn = dn.L4.id
temp.non.dn = setdiff(plot.gene.id, dn.L4.id)
# A simple clustering of genes for visualization
dist.dn = dist(cbind(avg.csn.ctl[temp.dn, temp.dn], avg.csn.asd[temp.dn, temp.dn]));
dist.non.dn = dist(cbind(avg.csn.ctl[temp.non.dn, temp.non.dn], avg.csn.asd[temp.non.dn, temp.non.dn]))
hclust.dn = hclust(dist.dn); hclust.non.dn = hclust(dist.non.dn)
match.temp.dn = match(temp.dn, plot.gene.id);
match.temp.non.dn = match(temp.non.dn, plot.gene.id);
order.temp = c(match.temp.dn[hclust.dn$order], match.temp.non.dn[hclust.non.dn$order])
n.dn.withnull = length(plot.gene.id); n.dn = length(dn.L4.id)
m.data.avgcsn.dn$X1 = factor(m.data.avgcsn.dn$X1, levels = order.temp)
m.data.avgcsn.dn$X2 = factor(m.data.avgcsn.dn$X2, levels = order.temp)
colnames(m.data.avgcsn.dn) = c('gene.x', 'gene.y', 'avgcsn')
m.data.avgcsn.dn$group = factor(rep(c('Control', 'ASD'), each = n.dn.withnull^2), levels = c('Control', 'ASD'))
data.rec <- data.frame(y=c(0.5, n.dn+0.5, n.dn+0.5, 0.5), x=c(0.5, 0.5, n.dn+0.5, n.dn+0.5))
p1 = ggplot(m.data.avgcsn.dn, aes(gene.x, gene.y, fill = avgcsn)) + geom_tile(size = 0) +
geom_polygon(data = data.rec, aes(x=x, y=y), colour="black", fill=NA) +
facet_wrap(~group) + coord_fixed() +
scale_fill_distiller(palette = 'Greens', direction = 1) + theme_minimal() +
theme(axis.text.x = element_blank(), axis.text.y = element_blank(),
axis.ticks.x = element_blank(), axis.ticks.y = element_blank()) +
ggtitle(paste0('L4: ', n.dn, ' DN genes')) +
theme(axis.title = element_blank()) + guides(fill = guide_colorbar(barwidth = 0.5, barheight = 3))
diff.avg.csn.dn = avg.csn.asd[plot.gene.id, plot.gene.id] - avg.csn.ctl[plot.gene.id, plot.gene.id]
m.diff.avg.csn = melt(diff.avg.csn.dn)
m.diff.avg.csn$X1 = factor(m.diff.avg.csn$X1, levels = order.temp)
m.diff.avg.csn$X2 = factor(m.diff.avg.csn$X2, levels = order.temp)
colnames(m.diff.avg.csn) = c('gene.x', 'gene.y', 'avgcsn')
lim.max = max(0.7, max(abs(m.diff.avg.csn$avgcsn)))
p2 = ggplot(m.diff.avg.csn, aes(gene.x, gene.y, fill = avgcsn)) + geom_tile(size = 0) +
geom_polygon(data = data.rec, aes(x=x, y=y), colour="black", fill=NA) + coord_fixed() +
scale_fill_distiller(palette = 'RdBu', direction = -1, limit = c(-lim.max, lim.max)) + theme_minimal() +
theme(axis.text.x = element_blank(), axis.text.y = element_blank(),
axis.ticks.x = element_blank(), axis.ticks.y = element_blank(),
axis.title = element_blank()) +
ggtitle('Difference') + labs(fill = expression(paste(Delta, 'avgcsn'))) +
guides(fill = guide_colorbar(barwidth = 0.5, barheight = 3))
library(cowplot)
p_comb = plot_grid(p1, p2, ncol = 2, rel_widths = c(1.6, 1))
p_comb
velme_DN_heat
Trajectory analysis using Brain Cortex Atlas Dataset
对于大脑皮层 Altas 数据(Polioudakis 等人,2019 年),我将仅展示在获得 7 个伪时间点的平均 CSN 后的下游分析以及我们如何获得 D 曲线的桑基图。 平均 CSN 存储在此 文件夹 中。 PisCES (Liu et al. 2018) 是这个 gitHub repo 中的一个 Matlab 包
load('avgcsn_Dcurve_list.mat') %avgcsn_Dcurve_list
% this is a list of averaged CSN for 8 pseudotime points: P, IP, Ex
T = 8;
N = size(avgcsn_Dcurve_list{1}, 1);
net_temp = csnto3dmat(avgcsn_Dcurve_list);
Z = PisCES(net_temp, 'T', 0.05*ones(T, 2));
K = max(max(Z));
A_rec = avgcsn_Dcurve_list;
param.min_flow_size = N/K^2;
param.frac_min_flow_size = 0.15;
[newZ] = layout_timeline(A_rec, Z', param);
This section is not necessary. We manually reordered the gene community so that the dense gene community is the first gene community.
% Manually change gene community order
for i = 1:8
index = find(newZ(:, i) == 3);
newZ(newZ(:, i) == 2, i) = 3;
newZ(newZ(:, i) == 1, i) = 2;
newZ(index, i) = 1;
end
Create sankey plot.
[flow_rec, cluster_rec] = create_sankey_tables(newZ, A_rec);
param.add_class_labels = 1;
param.draw_whole_timeline = 1;
param.draw_paired_plots = 0;
param.which_paired_plots = 1:T-1;
param.draw_whole_timeline_with_heatmap = 0;
param.draw_all_plots = 0;
[paired_param] = make_timeline_and_paired_plots(newZ, A_rec, flow_rec,cluster_rec, param);
sankey_plot
生活很好,有你更好~~~~