CytoTRACE2-basic

CytoTRACE2

CytoTRACE2是 Newman Lab 开发的用于预测scRNA-seq数据中细胞干性和发育潜能的计算方法。

CytoTRACE2本质是一种深度学习框架，该模型在涵盖 28 种组织类型和整个发育范围的人类和小鼠 scRNA-seq 数据集的训练和验证（总共31个数据）。利用该模型，CytoTRACE2可以将细胞分类为终末分化（differentiated: 0）和全能细胞（totipotent: 1）.

clipboard-768829770.png

本文简单介绍CytoTRACE2 R包的基本用法，具体原理可参考他们目前的预印本 https://doi.org/10.1101/2024.03.19.585637

安装

devtools::install_github("digitalcytometry/cytotrace2", subdir = "cytotrace2_r") #installing

数据准备

CytoTRACE2需要的输入数据包括单细胞表达数据，这里利用一个已发表的胰腺的10X scRNA-seq作为示例（Bastidas-Ponce et al., 2019）

这里下载CytoTRACE2提供的示例数据集，包括了：

表达数据（expression_data）：小鼠胰腺上皮的2280细胞的10X scRNA-seq数据。该数据已经进行归一化处理
注释数据（annotation）：表型注释

# download the .rds file (this will download the file to your working directory)
download.file("https://drive.google.com/uc?export=download&id=1ivi9TBlmzVTDGzNWQrXXeyL68Wug989K", "Pancreas_10x_downsampled.rds")

运行CytoTRACE2

运行 cytotrace2() 进行细胞干性推断，函数运行时执行以下操作：

预处理数据：在每个细胞中对表达数据进行降序排序，使用排序后的秩（rank）进行预测

预测细胞发育状态：读入模型预训练参数进行预测

    # full model
    parameter_dict <- readRDS(system.file("extdata", "parameter_dict_17.rds", package = "CytoTRACE2"))
    # else 
    parameter_dict <- readRDS(system.file("extdata", "parameter_dict_5_best.rds", package = "CytoTRACE2"))

后处理预测结果：利用diffusion方法平滑预测值，并利用kNN和binning的方法重新正则化预测值

library(CytoTRACE2) #loading

## Loading required package: data.table

## Loading required package: doParallel

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: ggplot2

## Loading required package: HiClimR

## Loading required package: magrittr

## Loading required package: Matrix

## Loading required package: plyr

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## Loading required package: RANN

## Loading required package: Rfast

## Loading required package: Rcpp

## Loading required package: RcppZiggurat

## Loading required package: RcppParallel

## 
## Attaching package: 'RcppParallel'

## The following object is masked from 'package:Rcpp':
## 
##     LdFlags

## 
## Rfast: 2.1.0

##  ___ __ __ __ __    __ __ __ __ __ _             _               __ __ __ __ __     __ __ __ __ __ __   
## |  __ __ __ __  |  |  __ __ __ __ _/            / \             |  __ __ __ __ /   /__ __ _   _ __ __\  
## | |           | |  | |                         / _ \            | |                        / /          
## | |           | |  | |                        / / \ \           | |                       / /          
## | |           | |  | |                       / /   \ \          | |                      / /          
## | |__ __ __ __| |  | |__ __ __ __           / /     \ \         | |__ __ __ __ _        / /__/\          
## |    __ __ __ __|  |  __ __ __ __|         / /__ _ __\ \        |_ __ __ __ _   |      / ___  /           
## |   \              | |                    / _ _ _ _ _ _ \                     | |      \/  / /       
## | |\ \             | |                   / /           \ \                    | |         / /          
## | | \ \            | |                  / /             \ \                   | |        / /          
## | |  \ \           | |                 / /               \ \                  | |       / /          
## | |   \ \__ __ _   | |                / /                 \ \     _ __ __ __ _| |      / /          
## |_|    \__ __ __\  |_|               /_/                   \_\   /_ __ __ __ ___|      \/             team

## 
## Attaching package: 'Rfast'

## The following object is masked from 'package:dplyr':
## 
##     nth

## The following object is masked from 'package:data.table':
## 
##     transpose

## Loading required package: Seurat

## Loading required package: SeuratObject

## Loading required package: sp

## 'SeuratObject' was built under R 4.4.0 but the current version is
## 4.4.1; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed

## 
## Attaching package: 'SeuratObject'

## The following objects are masked from 'package:base':
## 
##     intersect, t

## Loading required package: stringr

## Warning: replacing previous import 'data.table::first' by 'dplyr::first' when
## loading 'CytoTRACE2'

## Warning: replacing previous import 'data.table::last' by 'dplyr::last' when
## loading 'CytoTRACE2'

## Warning: replacing previous import 'data.table::between' by 'dplyr::between'
## when loading 'CytoTRACE2'

# load rds
scdata <- readRDS("Pancreas_10x_downsampled.rds")

# extract expression data
expression_data <- scdata$expression_data
dim(expression_data)

## [1] 17326  2280

# running CytoTRACE 2 main function - cytotrace2 - with default parameters
cytotrace2_result <- cytotrace2(expression_data, 
                                species = 'mouse',
                                is_seurat = F)

## cytotrace2: Started loading data

## Dataset contains 17326 genes and 2280 cells.

## The passed subsample size is greater than the number of cells in dataset.
## Now setting subsample size to 2280

## cytotrace2: Running on 1 subsample(s) approximately of length 2280

## cytotrace2: Started running on subsample(s). This will take a few minutes.

## cytotrace2: Started preprocessing.

## 13191 input genes mapped to model genes.

## cytotrace2: Started prediction.

## This section will run using  5 / 11 core(s).

## cytotrace2: Started postprocessing.

## cytotrace2: Running with fast mode (subsamples are processed in parallel)

## This section will run on 2 sub-sample(s) of approximately 1140 cells each using 2 / 11 core(s).

## cytotrace2: Finished

# extract annotation data
annotation <- scdata$annotation
head(annotation)

##                        phenotype1 phenotype2        potency
## CACCAGGCAGGTGCCT-1-0   Alpha cell      Alpha Differentiated
## CCCAGTTAGGGAACGG-1-2   Alpha cell      Alpha Differentiated
## CGGACTGCATCTACGA-1-2 Epsilon cell    Epsilon Differentiated
## TACCTTAGTTCCAACA-1-2   Alpha cell      Alpha Differentiated
## TACTTACCAGACAGGT-1-2   Alpha cell      Alpha Differentiated
## ATGAGGGGTGGTACAG-1-3    Beta cell       Beta Differentiated
##                      absolute_granular_ordering relative_ordering
## CACCAGGCAGGTGCCT-1-0                         23                 6
## CCCAGTTAGGGAACGG-1-2                         23                 6
## CGGACTGCATCTACGA-1-2                         23                 6
## TACCTTAGTTCCAACA-1-2                         23                 6
## TACTTACCAGACAGGT-1-2                         23                 6
## ATGAGGGGTGGTACAG-1-3                         23                 6

# generate prediction and phenotype association plots with plotData function
plots <- plotData(cytotrace2_result = cytotrace2_result, 
                  annotation = annotation,
                  expression_data = expression_data
                  )

## Preparing input for visualization.

## Creating plots.

## Done. You can access any plot directly from the returned list by '$' operator (i.e. plots$CytoTRACE2_Potency_UMAP).

在plotData 这一步会运行seurat流程包括Normalization, FindVariableFeatures, PCA, UMAP.

# pseudocode
seurat <- CreateSeuratObject(counts = as.matrix(expression_data), project = "nn", min.cells = 3, min.features = 200)
# Normalization
seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000) ##defaults
# Find variable features
seurat <- FindVariableFeatures(seurat, selection.method = "vst", nfeatures = 2000)
  
# Scale data
all.genes <- rownames(seurat)
seurat <- ScaleData(seurat, features = all.genes)

# PCA
seurat <- RunPCA(seurat, features = VariableFeatures(object = seurat))

# UMAP
seurat <- RunUMAP(seurat, dims = 1:pc_dims) # default to 30

然后，将cytoTRACE2预测的细胞发育结果进行可视化，包括：

potency_score_umap：UMAP可视化cytoTRACE2预测的细胞潜能打分（CytoTRACE2_Score）
potency_category_umap：UMAP可视化cytoTRACE2预测的细胞干性分类（CytoTRACE2_Potency）
rel_order_umap：UMAP可视化cytoTRACE2预测的细胞潜能打分的相对排序值0-1（CytoTRACE2_Relative）
phenotype_umap：UMAP可视化表型分组，如果有的话。
potencyBoxplot_byPheno：每个表型分组的CytoTRACE2_Score分布

table(annotation$phenotype1)

## 
##                        Alpha cell                         Beta cell 
##                               276                               258 
##                        Delta cell          Endocrine precursor cell 
##                                20                               509 
##              Endocrine progenitor                      Epsilon cell 
##                               499                                46 
##           Immature endocrine cell Multipotent pancreatic progenitor 
##                               427                               245

table(cytotrace2_result$CytoTRACE2_Potency)

## 
## Differentiated      Unipotent    Oligopotent    Multipotent    Pluripotent 
##            952            598            501            229              0 
##     Totipotent 
##              0

plots$CytoTRACE2_UMAP

unnamed-chunk-5-1.png

Potency score distribution by phenotype

plots$CytoTRACE2_Boxplot_byPheno

unnamed-chunk-6-1.png

Potency category

plots$CytoTRACE2_Potency_UMAP

unnamed-chunk-7-1.png

Relative order

plots$CytoTRACE2_Relative_UMAP

unnamed-chunk-8-1.png

Phenotype

plots$Phenotype_UMAP

unnamed-chunk-9-1.png

Seurat object input

library(Seurat)
seu <- CreateSeuratObject(expression_data, meta.data = annotation)

## Warning: Data is of class data.frame. Coercing to dgCMatrix.

seu

## An object of class Seurat 
## 17326 features across 2280 samples within 1 assay 
## Active assay: RNA (17326 features, 0 variable features)
##  1 layer present: counts

run cytotrace2

cytotrace2_result <- cytotrace2(seu, is_seurat = TRUE, slot_type = "counts", species = 'mouse')

## cytotrace2: Started loading data

## Dataset contains 17326 genes and 2280 cells.

## The passed subsample size is greater than the number of cells in dataset.
## Now setting subsample size to 2280

## cytotrace2: Running on 1 subsample(s) approximately of length 2280

## cytotrace2: Started running on subsample(s). This will take a few minutes.

## cytotrace2: Started preprocessing.

## 13191 input genes mapped to model genes.

## cytotrace2: Started prediction.

## This section will run using  5 / 11 core(s).

## cytotrace2: Started postprocessing.

## cytotrace2: Running with fast mode (subsamples are processed in parallel)

## This section will run on 2 sub-sample(s) of approximately 1140 cells each using 2 / 11 core(s).

## cytotrace2: Finished

# plotting
plots <- plotData(cytotrace2_result = cytotrace2_result, 
                  annotation = annotation, 
                  is_seurat = TRUE)

## Preparing input for visualization.

## Creating plots.

## Done. You can access any plot directly from the returned list by '$' operator (i.e. plots$CytoTRACE2_Potency_UMAP).

可视化

plots$CytoTRACE2_Potency_UMAP

unnamed-chunk-12-1.png

也可以用seurat流程跑完后的对象进行UMAP可视化

# Normalize the data
cytotrace2_result <- NormalizeData(cytotrace2_result)

## Normalizing layer: counts

# Find variable features
cytotrace2_result <- FindVariableFeatures(cytotrace2_result, selection.method = "vst", nfeatures = 2000)

## Finding variable features for layer counts

all.genes <- rownames(cytotrace2_result)
# Scale the data
cytotrace2_result <- ScaleData(cytotrace2_result, features=all.genes)

## Centering and scaling data matrix

# Perform PCA
cytotrace2_result <- RunPCA(cytotrace2_result, npcs = 30)

## PC_ 1 
## Positive:  Cpe, Pcsk1n, Pax6, Chga, Fam183b, Hmgn3, Chgb, Scg3, Map1b, Cryba2 
##     Isl1, Slc38a5, Neurod1, Prnp, Gnao1, Scgn, Gnas, Gch1, Aplp1, Fev 
##     Dbpht2, Pcsk2, Vwa5b2, Pyy, Pam, Pim2, Cnih2, Mafb, Abcc8, Tspan7 
## Negative:  H19, Wfdc2, Tead2, Sparc, Smco4, Phgdh, Rbp1, Rpl36a, Cd24a, Mgst1 
##     Adamts1, 1700011H14Rik, Gsta3, Stmn1, Ldha, Rpl39, Hmga2, Spp1, Ptn, Rpl12 
##     Galk1, Wfdc15b, Cdkn1c, Sox9, Serpinh1, Sox4, Habp2, Hes1, Vim, Rps12 
## PC_ 2 
## Positive:  Btbd17, Ppp1r14a, Gadd45a, Neurod2, Igfbpl1, Sult2b1, Mfng, Neurog3, Pax4, Tubb3 
##     Btg2, Mfap4, Smarcd2, Tmsb4x, Plk3, Cbfa2t3, Cck, Notum, Actg1, Hes6 
##     Sulf2, Lynx1, Gfra3, Epb42, Lingo1, Gdpd1, Foxa3, Cdc14b, Hpca, Rcor2 
## Negative:  Atp1b1, Rbp4, Tagln2, Tuba1b, Akr1c19, Id2, Gcsh, Fxyd6, Rnase4, Bex4 
##     1700011H14Rik, Tmem27, Siva1, Phgdh, Tubb2a, Meis2, Cmtm8, Mgst3, Ociad2, Crip1 
##     Adamts1, Parm1, Fkbp2, Fabp3, Id3, Eif4ebp1, Spc25, Sdc4, Cpn1, Cyr61 
## PC_ 3 
## Positive:  Rpl13a, Rpl14, Rps14, mt-Atp6, Eef1a1, Rpl10, Rpl41, Ftl1, mt-Co3, Rps16 
##     Rps18, Rps27a, Rps3a1, Rpl26, mt-Cytb, Rps6, mt-Nd1, Rpl13, Rpl11, Rpl6 
##     Rpl9, mt-Co1, Rps5, Rps7, Ppia, Rps2, Rpl17, Rplp0, Rps9, Rpl32 
## Negative:  Neurog3, Anxa2, Sparc, Col9a3, Col9a1, Fxyd2, Amotl2, Mpc1, Aurka, Tpx2 
##     Cotl1, Ckb, Cenpf, Gpx2, Mia, Spp1, Cd82, Cdk2ap1, Cdc20, Lurap1l 
##     Pbk, Gsta3, Racgap1, Hmmr, Cxcl12, Aurkb, Ccnb1, Aldh1b1, Tpm1, Tpm4 
## PC_ 4 
## Positive:  Cldn4, Runx1t1, Fev, Irf2bpl, Cacna2d1, Hnrnpa1, Pdk3, Slc35d3, Grb10, Gm43861 
##     Akr1c19, Jun, 1110012L19Rik, BC023829, Glud1, Vdr, Tm4sf4, Emb, St18, Celf3 
##     Pgf, Ids, Pcyt1b, Cited2, Maff, Elavl4, Sulf1, Neurod1, Serpini1, Ddc 
## Negative:  Ppp1r1a, Iapp, Pdia6, Scg2, Calr, Ttr, Higd1a, G6pc2, Tmed3, Tmsb15l 
##     Papss2, Sdf2l1, Hspa5, Clu, Ins1, Tmem27, Ly6e, Cd82, Slc30a8, Dap 
##     Igfbp7, Ins2, Ociad2, Gcg, Ttc28, Qsox1, Bcl2, Gapdh, Irx1, Npy 
## PC_ 5 
## Positive:  Cxcl12, Pdzk1ip1, Anxa2, Grin3a, Dcdc2a, Krt18, 8430408G22Rik, Pamr1, Anxa5, Bicc1 
##     Capg, Muc1, Krt8, Krt7, S100a10, Cldn3, Cystm1, Spp1, Gsta3, Trim47 
##     Smoc2, Ppp1r1b, Col9a3, Vim, Ndrg2, Tsc22d1, Cat, Cldn4, Txnip, Nfib 
## Negative:  Vtn, 2810417H13Rik, Hist1h2ap, Chst2, Tuba1a, Top2a, P2rx1, Aurkb, Rrm2, Fbxo5 
##     Ccna2, Cpn1, Spc25, H2afz, Cenpf, Birc5, Ccnb1, Pbk, Gcat, Smc2 
##     Ube2c, Incenp, Mki67, Car14, Rad51ap1, Gc, Cdk1, Plk1, Prc1, Smc4

# Find neighbors
cytotrace2_result <- FindNeighbors(cytotrace2_result, dims = 1:20)

## Computing nearest neighbor graph

## Computing SNN

# Find clusters
cytotrace2_result <- FindClusters(cytotrace2_result, resolution = 0.1)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2280
## Number of edges: 73179
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9505
## Number of communities: 4
## Elapsed time: 0 seconds

# Run UMAP
cytotrace2_result <- RunUMAP(cytotrace2_result, dims = 1:20)

## 14:12:35 UMAP embedding parameters a = 0.9922 b = 1.112

## 14:12:35 Read 2280 rows and found 20 numeric columns

## 14:12:35 Using Annoy for neighbor search, n_neighbors = 30

## 14:12:35 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 14:12:35 Writing NN index file to temp file /var/folders/yk/896pglhn555drshbgn27f71h0000gn/T//RtmpD9l2F4/file95af46be469e
## 14:12:35 Searching Annoy index using 1 thread, search_k = 3000
## 14:12:35 Annoy recall = 100%
## 14:12:35 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 14:12:35 Initializing from normalized Laplacian + noise (using RSpectra)
## 14:12:35 Commencing optimization for 500 epochs, with 87006 positive edges
## 14:12:37 Optimization finished

DimPlot(cytotrace2_result, group.by = 'CytoTRACE2_Potency')

unnamed-chunk-13-1.png

或者看score的分布

FeaturePlot(cytotrace2_result, features = 'CytoTRACE2_Score')

unnamed-chunk-14-1.png

Ref:

https://github.com/digitalcytometry/cytotrace2/tree/main

https://www.biorxiv.org/content/10.1101/2024.03.19.585637v1.full

最后编辑于：2024.12.12 07:49:59

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 213,864评论 6赞 494
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 91,175评论 3赞 387
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 159,401评论 0赞 349
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 57,170评论 1赞 286
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 66,276评论 6赞 385
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 50,364评论 1赞 292
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 39,401评论 3赞 412
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 38,179评论 0赞 269
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 44,604评论 1赞 306
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 36,902评论 2赞 328
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 39,070评论 1赞 341
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 34,751评论 4赞 337
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 40,380评论 3赞 319
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,077评论 0赞 21
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 32,312评论 1赞 267
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 46,924评论 2赞 365
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 43,957评论 2赞 351