10X单细胞 && 10XATAC联合分析之scJoint

开工第一弹，我们来看看最新的10X单细胞联合ATAC的分析方法，文章在scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning，2022年1月发表于nature biotechnology,IF54分，相当高了~~~~我们来看一下，其实这里要解决的就是多组学的联合分析问题，下面列举了一些我之前分享的方法，供大家借鉴和参考。

当然了，10X单细胞 && 10XATAC有两种数据类型，一种是多组学，同时测一个细胞内的转录组和ATAC，另外一种是把样本分成了2份，一份测转录组，一份测ATAC，大家要根据自己的情况来甄别一下。

首先我们来分分类，看看这些10X单细胞 && 10XATAC的联合方法的区别，以及他们的优缺点

manifold alignment（Manifold alignment methods have demonstrated promising results in integrating several modalities from the same tissue. However, requiring distributions to match globally is often too restrictive when different modalities are derived from different tissues and cell types）
matrix factorization (Liger, coupled-NMF)（matrix factorization and correlation-based methods designed for unpaired data require a separate feature selection step before integration for dimension reduction, and the method’s performance is sensitive to which genes are selected）
using correlations to identify nearby cells across modalities(Conos, Seurat)
neural-network approaches（Most existing neural-network methods for multiomics integration are based on autoencoders, which, with a few exceptions,
require paired data. In general, unsupervised training of several autoencoders simultaneously can be very challenging without pairing information across different modalities, with finding a common embedding manifold becomes more difficult as the complexity of the data increases）

因此，当前的方法在处理表征多组学图谱数据的复杂性和规模方面的能力有限。

scJoint的改进之处

一个新的损失函数，明确地将降维作为转移学习中特征工程过程的一部分，允许在整个训练过程中修改低维特征，并且无需选择高度可变的基因。
当细胞类型不完全重叠时，增加了模态对齐的灵活性的相似性损失
weight sharing across encoders for different modalities resulting in stable training

图片.png

方法核心

scJoint 的核心是对标记（scRNA-seq）和未标记（scATAC-seq）数据进行协同训练的半监督方法，解决了通过共同的低维空间对齐这两种不同数据模式的主要挑战。 scJoint 包括三个主要步骤。步骤 1 分别通过新的基于神经网络的降维 (NNDR) 损失和余弦相似度损失在公共嵌入空间中执行联合降维和模态对齐。 NNDR 损失在类似于 PCA 的vein中提取具有最大可变性的正交特征，而余弦相似性损失鼓励神经网络找到嵌入空间的投影，以便可以对齐两种模式的大部分。 scRNA-seq 的嵌入进一步由细胞类型分类损失引导，形成半监督部分。在步骤 2 中，将 scATAC-seq 数据中的每个细胞视为query，通过测量它们在公共嵌入空间中的距离来识别 scRNA-seq 细胞之间的 k 最近邻（KNN），并从 scRNA 中转移细胞类型标签-seq 通过“多数票”通过 scATAC-seq。在第 3 步中，通过在度量学习损失中使用转移标签来进一步改进两种模式之间的混合。使用标准工具（包括 tSNE 和 UMAP）从最终嵌入层获得数据集的联合可视化。 scJoint 需要简单的数据预处理，输入维度等于给定数据集中经过适当过滤后的基因数。 scATAC-seq 数据中的染色质可及性首先转换为基因活性评分，从而允许使用单个编码器，RNA 和 ATAC 的权重共享。

方法比较，我们简单看一下

scJoint shows accurate and robust performance on atlas data

图片.png

Refining scATAC-seq annotations in heterogeneous atlas data

图片.png

Integration of multimodal data across biological conditions

图片.png

scJoint shows versatile performance on paired data

图片.png

当然了，文章写的方法，自然是作者的方法效果最好

核心算法，这个需要大家自己来解读了

图片.png

示例代码

安装

git clone https://github.com/SydneyBioX/scJoint.git

scJoint是python版本，如果分析结果保存成了rds（R版本），数据需要转换一下

library(SingleCellExperiment)
library(DropletUtils)
library(scater)
library(ggplot2)
sce_10xPBMC_atac <- readRDS("data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("data_10x/sce_10xPBMC_rna.rds")
# Only keep common genes between two dataset
common_genes <- intersect(rownames(sce_10xPBMC_atac),
                          rownames(sce_10xPBMC_rna))
length(common_genes)

# Extract the logcounts data from sce object
exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])



source("data_to_h5.R")
write_h5_scJoint(exprs_list = list(rna = exprs_rna,
                                   atac = exprs_atac), 
                 h5file_list = c("data_10x/exprs_10xPBMC_rna.h5", 
                                 "data_10x/exprs_10xPBMC_atac.h5"))
write_csv_scJoint(cellType_list =  list(rna = sce_10xPBMC_rna$cellTypes),
                  csv_list = c("data_10x/cellType_10xPBMC_rna.csv"))

data_to_h5.R这个软件有提供，大家可以下载。

Analysis of PBMC data from 10x Genomics using scJoint

import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
random.seed(1)

rna_h5_files = ["data_10x/exprs_10xPBMC_rna.h5"] 
rna_label_files = ["data_10x/cellType_10xPBMC_rna.csv"] # csv file

atac_h5_files = ["data_10x/exprs_10xPBMC_atac.h5"]
atac_label_files = []

process_db.data_parsing(rna_h5_files, atac_h5_files)
rna_label = pd.read_csv(rna_label_files[0], index_col = 0)
rna_label
print(rna_label.value_counts(sort = False))
process_db.label_parsing(rna_label_files, atac_label_files)

Running scJoint

The scRNA-seq and scATAC-seq have 15463 genes in common. And we have 11 cell types in total in the scRNA_seq adta, so we will set the number of class as 11 in the config.py file. The paths also needed to be edited accordingly. Here are the settings for integration of scRNA-seq and scATAC-seq from 10x Genomics we used:

self.number_of_class = 11 # Number of cell types in scRNA-seq data
self.input_size = 15463 # Number of common genes and proteins between scRNA-seq data and scATAC-seq
self.rna_paths = ['data_10x/exprs_10xPBMC_rna.npz'] # RNA gene expression from scRNA-seq data
self.rna_labels = ['data_10x/cellType_10xPBMC_rna.txt'] # scRNA-seq data cell type labels (coverted to numeric)
self.atac_paths = ['data_10x/exprs_10xPBMC_atac.npz'] # ATAC gene activity matrix from scATAC-seq data
self.atac_labels = []
self.rna_protein_paths = []
self.atac_protein_paths = []

Training config

self.batch_size = 256
self.lr_stage1 = 0.01
self.lr_stage3 = 0.01
self.lr_decay_epoch = 20
self.epochs_stage1 = 20
self.epochs_stage3 = 20
self.p = 0.8
self.embedding_size = 64
self.momentum = 0.9
self.center_weight = 1
self.with_crossentorpy = True
self.seed = 1
self.checkpoint = ''
After modifying config.py, we are ready to run scJoint in terminal with
python3 main.py
This takes ~4 minutes using 1 thread for PyTorch.

Visualisation

rna_embeddings = np.loadtxt('./output/exprs_10xPBMC_rna_embeddings.txt')
atac_embeddings = np.loadtxt('./output/exprs_10xPBMC_atac_embeddings.txt')
print(rna_embeddings.shape)
print(atac_embeddings.shape)
embeddings =  np.concatenate((rna_embeddings, atac_embeddings))
print(embeddings.shape)
tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)
tsne_results.shape
df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
rna_labels = np.loadtxt('./data_10x/cellType_10xPBMC_rna.txt')
atac_predictions = np.loadtxt('./output/exprs_10xPBMC_atac_knn_predictions.txt')
labels =  np.concatenate((rna_labels, atac_predictions))
label_to_idx = pd.read_csv('./data_10x/label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])

data_label = np.array(["scRNA-seq", "scATAC-seq"])
df['data'] = np.repeat(data_label, [rna_embeddings.shape[0], atac_embeddings.shape[0]], axis=0)
df['predicted'] = label_dic[labels.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 2),
    data = df,
    legend = "full",
    alpha = 0.3
)

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)

图片.png

Mouse Primary Motor Cortex Data Integration using scJoint

import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os
import re
random.seed(1)
output_dir = "data_MOp/output/"
embeddings_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_embeddings.txt'])]
embeddings_file_names.sort()
embeddings_file_names
embeddings = np.loadtxt(output_dir + embeddings_file_names[0])
batch = np.repeat(re.sub('_embeddings.txt', '', embeddings_file_names[0]), embeddings.shape[0])
for fn in embeddings_file_names[1:]:
    e = np.loadtxt(output_dir + fn)
    embeddings = np.append(embeddings, e, axis=0)
    batch = np.append(batch, np.repeat(re.sub('_embeddings.txt', '', fn), e.shape[0]))
print(embeddings.shape)
print(batch.shape)

tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)

df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
df['data'] = batch

prediction_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_knn_predictions.txt'])]
#print(prediction_file_names)
atac_prediction = np.loadtxt(output_dir + prediction_file_names[0])
methy_prediction = np.loadtxt(output_dir + prediction_file_names[1])
#print(atac_prediction.shape)
#print(methy_prediction.shape)

# Reading the original labels
data_dir = "data_MOp/"
label_file_names = [fn for fn in os.listdir(data_dir)
                         if any(fn.endswith(ext) for ext in ['cellTypes.txt'])]
label_file_names.sort()
#print(label_file_names)

atac_original = np.loadtxt(data_dir + label_file_names[0])
methy_original = np.loadtxt(data_dir + label_file_names[8])
rna_original = []

for fn in label_file_names[1:8]:
    p = np.loadtxt(data_dir + fn)
    rna_original = np.append(rna_original, p)
    
#print(rna_original.shape)
prediction = np.append(atac_prediction, rna_original)
prediction = np.append(prediction, methy_original)

# matching the numeric labels to cell type annotation
label_to_idx = pd.read_csv(data_dir + 'label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])
    
df['predicted'] = label_dic[prediction.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 9),
    data = df.sample(frac=1),
    legend = "full",
    alpha = 0.3
)

图片.png

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)

图片.png

生活很好，有你更好

10X单细胞 && 10XATAC联合分析之scJoint

10X单细胞 && 10XATAC联合分析之scJoint

10X单细胞转录组整合、转录组 && ATAC整合分析之VIPCCA

10X单细胞（10X空间转录组）多组学（RNA + ATAC）推断Velovyto（MultiVelo

10X单细胞 & 10XATAC 联合分析表征细胞调控网络（MIRA)

10X单细胞 & 10XATAC联合之调控网络分析（IReNA）

10X单细胞（10X空间转录组）数据分析迁移之scGCN

10X单细胞核转录组 + 10X单细胞ATAC的联合分析（WNN）

10X单细胞转录组与10X单细胞ATAC联合分析之Seurat

当然了，10X单细胞 && 10XATAC有两种数据类型，一种是多组学，同时测一个细胞内的转录组和ATAC，另外一种是把样本分成了2份，一份测转录组，一份测ATAC，大家要根据自己的情况来甄别一下。

首先我们来分分类，看看这些10X单细胞 && 10XATAC的联合方法的区别，以及他们的优缺点

因此，当前的方法在处理表征多组学图谱数据的复杂性和规模方面的能力有限。

scJoint的改进之处

方法核心

方法比较，我们简单看一下

scJoint shows accurate and robust performance on atlas data

Refining scATAC-seq annotations in heterogeneous atlas data

Integration of multimodal data across biological conditions

scJoint shows versatile performance on paired data

当然了，文章写的方法，自然是作者的方法效果最好

核心算法，这个需要大家自己来解读了

示例代码

安装

scJoint是python版本，如果分析结果保存成了rds（R版本），数据需要转换一下

data_to_h5.R这个软件有提供，大家可以下载。

Analysis of PBMC data from 10x Genomics using scJoint

Running scJoint

Training config

Visualisation

Mouse Primary Motor Cortex Data Integration using scJoint

推荐阅读更多精彩内容