10X单细胞 && 10XATAC联合分析之scJoint

开工第一弹,我们来看看最新的10X单细胞联合ATAC的分析方法,文章在scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning,2022年1月发表于nature biotechnology,IF54分,相当高了~~~~我们来看一下,其实这里要解决的就是多组学的联合分析问题,下面列举了一些我之前分享的方法,供大家借鉴和参考。

10X单细胞转录组整合、转录组 && ATAC整合分析之VIPCCA

10X单细胞(10X空间转录组)多组学(RNA + ATAC)推断Velovyto(MultiVelo

10X单细胞 & 10XATAC 联合分析表征细胞调控网络(MIRA)

10X单细胞 & 10XATAC联合之调控网络分析(IReNA)

10X单细胞(10X空间转录组)数据分析迁移之scGCN

10X单细胞核转录组 + 10X单细胞ATAC的联合分析(WNN)

10X单细胞转录组与10X单细胞ATAC联合分析之Seurat

当然了,10X单细胞 && 10XATAC有两种数据类型,一种是多组学,同时测一个细胞内的转录组和ATAC,另外一种是把样本分成了2份,一份测转录组,一份测ATAC,大家要根据自己的情况来甄别一下。

首先我们来分分类,看看这些10X单细胞 && 10XATAC的联合方法的区别,以及他们的优缺点

  • manifold alignment(Manifold alignment methods have demonstrated promising results in integrating several modalities from the same tissue. However, requiring distributions to match globally is often too restrictive when different modalities are derived from different tissues and cell types
  • matrix factorization (Liger, coupled-NMF)(matrix factorization and correlation-based methods designed for unpaired data require a separate feature selection step before integration for dimension reduction, and the method’s performance is sensitive to which genes are selected
  • using correlations to identify nearby cells across modalities(Conos, Seurat)
  • neural-network approaches(Most existing neural-network methods for multiomics integration are based on autoencoders, which, with a few exceptions,
    require paired data. In general, unsupervised training of several autoencoders simultaneously can be very challenging without pairing information across different modalities, with finding a common embedding manifold becomes more difficult as the complexity of the data increases)

因此,当前的方法在处理表征多组学图谱数据的复杂性和规模方面的能力有限。

scJoint的改进之处

  • 一个新的损失函数,明确地将降维作为转移学习中特征工程过程的一部分,允许在整个训练过程中修改低维特征,并且无需选择高度可变的基因。
  • 当细胞类型不完全重叠时,增加了模态对齐的灵活性的相似性损失
  • weight sharing across encoders for different modalities resulting in stable training
图片.png

方法核心

scJoint 的核心是对标记(scRNA-seq)和未标记(scATAC-seq)数据进行协同训练的半监督方法解决了通过共同的低维空间对齐这两种不同数据模式的主要挑战。 scJoint 包括三个主要步骤。步骤 1 分别通过新的基于神经网络的降维 (NNDR) 损失和余弦相似度损失在公共嵌入空间中执行联合降维和模态对齐。 NNDR 损失在类似于 PCA 的vein中提取具有最大可变性的正交特征,而余弦相似性损失鼓励神经网络找到嵌入空间的投影,以便可以对齐两种模式的大部分。 scRNA-seq 的嵌入进一步由细胞类型分类损失引导,形成半监督部分。在步骤 2 中,将 scATAC-seq 数据中的每个细胞视为query,通过测量它们在公共嵌入空间中的距离来识别 scRNA-seq 细胞之间的 k 最近邻(KNN),并从 scRNA 中转移细胞类型标签-seq 通过“多数票”通过 scATAC-seq。在第 3 步中,通过在度量学习损失中使用转移标签来进一步改进两种模式之间的混合。使用标准工具(包括 tSNE 和 UMAP)从最终嵌入层获得数据集的联合可视化。 scJoint 需要简单的数据预处理,输入维度等于给定数据集中经过适当过滤后的基因数。 scATAC-seq 数据中的染色质可及性首先转换为基因活性评分,从而允许使用单个编码器,RNA 和 ATAC 的权重共享。

方法比较,我们简单看一下

scJoint shows accurate and robust performance on atlas data
图片.png
Refining scATAC-seq annotations in heterogeneous atlas data
图片.png
Integration of multimodal data across biological conditions
图片.png
scJoint shows versatile performance on paired data
图片.png
当然了,文章写的方法,自然是作者的方法效果最好

核心算法,这个需要大家自己来解读了

图片.png

示例代码

安装
git clone https://github.com/SydneyBioX/scJoint.git
scJoint是python版本,如果分析结果保存成了rds(R版本),数据需要转换一下
library(SingleCellExperiment)
library(DropletUtils)
library(scater)
library(ggplot2)
sce_10xPBMC_atac <- readRDS("data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("data_10x/sce_10xPBMC_rna.rds")
# Only keep common genes between two dataset
common_genes <- intersect(rownames(sce_10xPBMC_atac),
                          rownames(sce_10xPBMC_rna))
length(common_genes)

# Extract the logcounts data from sce object
exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])



source("data_to_h5.R")
write_h5_scJoint(exprs_list = list(rna = exprs_rna,
                                   atac = exprs_atac), 
                 h5file_list = c("data_10x/exprs_10xPBMC_rna.h5", 
                                 "data_10x/exprs_10xPBMC_atac.h5"))
write_csv_scJoint(cellType_list =  list(rna = sce_10xPBMC_rna$cellTypes),
                  csv_list = c("data_10x/cellType_10xPBMC_rna.csv"))

data_to_h5.R这个软件有提供,大家可以下载。

Analysis of PBMC data from 10x Genomics using scJoint

import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
random.seed(1)

rna_h5_files = ["data_10x/exprs_10xPBMC_rna.h5"] 
rna_label_files = ["data_10x/cellType_10xPBMC_rna.csv"] # csv file

atac_h5_files = ["data_10x/exprs_10xPBMC_atac.h5"]
atac_label_files = []

process_db.data_parsing(rna_h5_files, atac_h5_files)
rna_label = pd.read_csv(rna_label_files[0], index_col = 0)
rna_label
print(rna_label.value_counts(sort = False))
process_db.label_parsing(rna_label_files, atac_label_files)

Running scJoint

The scRNA-seq and scATAC-seq have 15463 genes in common. And we have 11 cell types in total in the scRNA_seq adta, so we will set the number of class as 11 in the config.py file. The paths also needed to be edited accordingly. Here are the settings for integration of scRNA-seq and scATAC-seq from 10x Genomics we used:

self.number_of_class = 11 # Number of cell types in scRNA-seq data
self.input_size = 15463 # Number of common genes and proteins between scRNA-seq data and scATAC-seq
self.rna_paths = ['data_10x/exprs_10xPBMC_rna.npz'] # RNA gene expression from scRNA-seq data
self.rna_labels = ['data_10x/cellType_10xPBMC_rna.txt'] # scRNA-seq data cell type labels (coverted to numeric)
self.atac_paths = ['data_10x/exprs_10xPBMC_atac.npz'] # ATAC gene activity matrix from scATAC-seq data
self.atac_labels = []
self.rna_protein_paths = []
self.atac_protein_paths = []

Training config

self.batch_size = 256
self.lr_stage1 = 0.01
self.lr_stage3 = 0.01
self.lr_decay_epoch = 20
self.epochs_stage1 = 20
self.epochs_stage3 = 20
self.p = 0.8
self.embedding_size = 64
self.momentum = 0.9
self.center_weight = 1
self.with_crossentorpy = True
self.seed = 1
self.checkpoint = ''
After modifying config.py, we are ready to run scJoint in terminal with
python3 main.py
This takes ~4 minutes using 1 thread for PyTorch.

Visualisation

rna_embeddings = np.loadtxt('./output/exprs_10xPBMC_rna_embeddings.txt')
atac_embeddings = np.loadtxt('./output/exprs_10xPBMC_atac_embeddings.txt')
print(rna_embeddings.shape)
print(atac_embeddings.shape)
embeddings =  np.concatenate((rna_embeddings, atac_embeddings))
print(embeddings.shape)
tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)
tsne_results.shape
df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
rna_labels = np.loadtxt('./data_10x/cellType_10xPBMC_rna.txt')
atac_predictions = np.loadtxt('./output/exprs_10xPBMC_atac_knn_predictions.txt')
labels =  np.concatenate((rna_labels, atac_predictions))
label_to_idx = pd.read_csv('./data_10x/label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])

data_label = np.array(["scRNA-seq", "scATAC-seq"])
df['data'] = np.repeat(data_label, [rna_embeddings.shape[0], atac_embeddings.shape[0]], axis=0)
df['predicted'] = label_dic[labels.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 2),
    data = df,
    legend = "full",
    alpha = 0.3
)

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)
图片.png
Mouse Primary Motor Cortex Data Integration using scJoint
import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os
import re
random.seed(1)
output_dir = "data_MOp/output/"
embeddings_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_embeddings.txt'])]
embeddings_file_names.sort()
embeddings_file_names
embeddings = np.loadtxt(output_dir + embeddings_file_names[0])
batch = np.repeat(re.sub('_embeddings.txt', '', embeddings_file_names[0]), embeddings.shape[0])
for fn in embeddings_file_names[1:]:
    e = np.loadtxt(output_dir + fn)
    embeddings = np.append(embeddings, e, axis=0)
    batch = np.append(batch, np.repeat(re.sub('_embeddings.txt', '', fn), e.shape[0]))
print(embeddings.shape)
print(batch.shape)

tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)

df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
df['data'] = batch

prediction_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_knn_predictions.txt'])]
#print(prediction_file_names)
atac_prediction = np.loadtxt(output_dir + prediction_file_names[0])
methy_prediction = np.loadtxt(output_dir + prediction_file_names[1])
#print(atac_prediction.shape)
#print(methy_prediction.shape)

# Reading the original labels
data_dir = "data_MOp/"
label_file_names = [fn for fn in os.listdir(data_dir)
                         if any(fn.endswith(ext) for ext in ['cellTypes.txt'])]
label_file_names.sort()
#print(label_file_names)

atac_original = np.loadtxt(data_dir + label_file_names[0])
methy_original = np.loadtxt(data_dir + label_file_names[8])
rna_original = []

for fn in label_file_names[1:8]:
    p = np.loadtxt(data_dir + fn)
    rna_original = np.append(rna_original, p)
    
#print(rna_original.shape)
prediction = np.append(atac_prediction, rna_original)
prediction = np.append(prediction, methy_original)

# matching the numeric labels to cell type annotation
label_to_idx = pd.read_csv(data_dir + 'label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])
    
df['predicted'] = label_dic[prediction.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 9),
    data = df.sample(frac=1),
    legend = "full",
    alpha = 0.3
)
图片.png
plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)

图片.png

生活很好,有你更好

©著作权归作者所有,转载或内容合作请联系作者
禁止转载,如需转载请通过简信或评论联系作者。
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 220,809评论 6 513
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 94,189评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 167,290评论 0 359
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 59,399评论 1 294
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 68,425评论 6 397
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 52,116评论 1 308
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,710评论 3 420
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,629评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 46,155评论 1 319
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 38,261评论 3 339
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,399评论 1 352
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 36,068评论 5 347
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,758评论 3 332
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,252评论 0 23
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,381评论 1 271
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,747评论 3 375
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,402评论 2 358

推荐阅读更多精彩内容