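CNV (copy number variation), also called CNP (copy-number polymorphism), is variation in DNA segments between 1 kb and 3 Mb in size caused by genomic rearrangement. It is one component of genomic structural variation (SV), its mutation rate is far higher than that of SNPs (single nucleotide polymorphisms), and it is one of the important causes of human disease. At present, the two tools most commonly used for single-cell CNV analysis are inferCNV and CopyKAT. In my own tests, CopyKAT performed better than inferCNV overall: the workflow is simpler, it does not require specifying malignant/non-malignant cells in advance, and it was both more accurate and faster.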
CopyKAT(Copynumber Karyotyping of Tumors) is a computational tool using integrative Bayesian approaches to identify genome-wide aneuploidy at 5MB resolution in single cells to separate tumor cells from normal cells, and tumor subclones using high-throughput sc-RNAseq data.
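In other words, CopyKAT was built specifically to infer CNVs from single cells, whereas inferCNV was not.
1. Introduction to CopyKAT
A major challenge in single-cell RNA sequencing of tumors is distinguishing malignant from non-malignant cell types, and resolving the multiple tumor subclones that may be present. CopyKAT (Copynumber Karyotyping of Tumors) is a computational tool that uses an integrative Bayesian approach to identify genome-wide aneuploidy at 5 Mb resolution in single cells, so that scRNA-seq data can be used to separate tumor cells from normal cells and to separate tumor subclones from one another.
The basic logic for calling DNA copy number events from RNA-seq data is that the expression levels of many adjacent genes together provide depth information from which the genomic copy number of that region can be inferred. The copy number profiles estimated by CopyKAT reach high concordance (80%) with the actual DNA copy numbers obtained by whole-genome DNA sequencing. The rationale for predicting tumor/normal cell status is that aneuploidy is common in human cancers (90%). Cells with extensive genome-wide copy number aberrations (aneuploidy) are considered tumor cells, whereas normal stromal and immune cells usually have 2N diploid or near-diploid copy numbers.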
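The "adjacent genes provide depth information" idea can be illustrated with a toy simulation. This is only a sketch on made-up data, not CopyKAT's actual algorithm: the gene count, the size of the gain, the 25-gene moving-average window, and all variable names are assumptions chosen for illustration.

```r
# Toy illustration only: NOT CopyKAT's algorithm, just the intuition that
# averaging neighboring genes exposes a regional copy number shift.
set.seed(1)
n_genes  <- 200   # genes ordered by genomic position (hypothetical)
n_normal <- 20    # simulated diploid reference cells
n_tumor  <- 20    # simulated tumor cells

# Simulated log-expression: pure noise around 0 for every gene and cell
expr <- matrix(rnorm(n_genes * (n_normal + n_tumor), sd = 0.5), nrow = n_genes)
tumor_cols  <- (n_normal + 1):(n_normal + n_tumor)
gain_region <- 81:140
# Tumor cells carry a simulated gain: genes 81-140 are shifted upward
expr[gain_region, tumor_cols] <- expr[gain_region, tumor_cols] + 0.6

# 1) Center each gene on its mean across the normal reference cells
relative <- expr - rowMeans(expr[, 1:n_normal])

# 2) Smooth along the genome with a centered moving average of 25 genes
window <- 25
smooth_cell <- function(x) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 2))
}
smoothed <- apply(relative, 2, smooth_cell)

# 3) Compare the regional signal of one tumor cell and one normal cell
tumor_gain  <- mean(smoothed[gain_region, tumor_cols[1]], na.rm = TRUE)
tumor_flat  <- mean(smoothed[1:60, tumor_cols[1]], na.rm = TRUE)
normal_gain <- mean(smoothed[gain_region, 1], na.rm = TRUE)
```

A single gene is too noisy to show the shift, but after smoothing, the tumor cell's signal in the gained region stands clearly above both its own flat region and the normal cell.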
2. Basic workflow (see figure)
3. Installation
# Install the package from GitHub
> devtools::install_github("navinlabcode/copykat")
Downloading GitHub repo navinlabcode/copykat@HEAD
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?
1: All
2: CRAN packages only
3: None
4: MatrixModels (0.5-0 -> 0.5-1 ) [CRAN]
5: RcppArmad... (0.11.2.3.1 -> 0.11.2.4.0) [CRAN]
Enter one or more numbers, or an empty line to skip updates: 3
The downloaded source packages are in
‘/tmp/Rtmp7So8AL/downloaded_packages’
✔ checking for file ‘/tmp/Rtmp7So8AL/remotes510e550b1b04/navinlabcode-copykat-256de33/DESCRIPTION’ (401ms)
─ preparing ‘copykat’:
✔ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
NB: this package now depends on R (>= 3.5.0)
WARNING: Added dependency on R >= 3.5.0 because serialized objects in
serialize/load version 3 cannot be read in older versions of R.
File(s) containing such objects:
‘copykat/data/sysdata.rda’
─ building ‘copykat_1.0.8.tar.gz’
Installing package into ‘/home/Tom/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
* installing *source* package ‘copykat’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (copykat)
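4. Data analysis (follow the steps in the official documentation, otherwise you will run into all sorts of strange errors!)
4.1 Data preparation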
> library(copykat)
> library(Seurat)
# Load the data
> sc_CRC = readRDS('scRNA/sc_CRC.rds')
> counts = as.matrix(sc_CRC@assays$RNA@counts)
Note: use the raw counts matrix as input, because the software log-normalizes the data itself during analysis.
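Before handing a matrix to copykat(), it can help to confirm that it really looks like raw counts. The helper below is a hypothetical convenience function (check_counts is not part of copykat or Seurat), shown on a tiny made-up matrix.

```r
# Hypothetical helper: sanity-check a gene x cell matrix before CNV analysis.
check_counts <- function(mat) {
  stopifnot(is.matrix(mat), is.numeric(mat))
  list(
    n_genes      = nrow(mat),
    n_cells      = ncol(mat),
    # unique gene names in the rows and unique cell names in the columns
    has_gene_ids = !is.null(rownames(mat)) && !anyDuplicated(rownames(mat)),
    has_cell_ids = !is.null(colnames(mat)) && !anyDuplicated(colnames(mat)),
    # raw counts are non-negative integers; normalized data usually are not
    looks_raw    = all(mat >= 0) && all(mat == round(mat))
  )
}

# Made-up example: 3 genes x 2 cells
toy <- matrix(c(0, 3, 1, 0, 7, 2), nrow = 3,
              dimnames = list(c("TP53", "MYC", "EGFR"), c("cell_1", "cell_2")))
toy_report <- check_counts(toy)          # raw counts
log_report <- check_counts(log1p(toy))   # already log-transformed
```

In the workflow above, the same check would be applied as check_counts(counts); if looks_raw is FALSE, the matrix has probably been normalized already.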
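4.2 CNV analysis
Unlike inferCNV, CopyKAT does not need a normal-cell reference to be specified for single-cell CNV analysis. It automatically treats the cells it identifies as diploid as the normal cells and uses them to infer the CNVs of the malignant cells.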
> sc_cnv = copykat(rawmat = counts,ngene.chr = 5,sam.name = 'CRC',n.cores = 20)
[1] "running copykat v1.0.8 updated 02/25/2022 introduced mm10 module, fixed typos"
[1] "step1: read and filter data ..."
[1] "27448 genes, 34001 cells in raw data"
[1] "10118 genes past LOW.DR filtering"
[1] "step 2: annotations gene coordinates ..."
[1] "start annotation ..."
[1] "step 3: smoothing data with dlm ..."
[1] "step 4: measuring baselines ..."
"number of iterations= 360"
"number of iterations= 128"
"number of iterations= 425"
"number of iterations= 3482"
"number of iterations= 3486"
"number of iterations= 2247"
[1] "step 5: segmentation..."
[1] "step 6: convert to genomic bins..."
[1] "step 7: adjust baseline ..."
[1] "step 8: final prediction ..."
[1] "step 9: saving results..."
[1] "step 10: ploting heatmap ..."
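Parameters:
ngene.chr is used to filter cells: each chromosome must contain at least 5 genes.
sam.name sets the prefix of the output file names.
n.cores is the number of cores to run on. The default is 1; if your server is a beast you could set it to 1000... 😱
id.type sets the gene identifier type. By default, the genes in cellranger output are gene symbols, so set it to "Symbol" or "S".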
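LOW.DR and UP.DR are used to filter out lowly expressed genes. The defaults are LOW.DR=0.05 and UP.DR=0.1. If you want more genes to be included in the analysis you can lower them, but make sure the former stays smaller than the latter.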
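KS.cut takes a value between 0 and 1. The closer it is to 1, the lower the sensitivity; it is usually set between 0.05 and 0.15.
output.seg controls whether a seg file is written for direct viewing in IGV. The default is FALSE.
The remaining parameters are less important and are not covered here; see the official documentation if you are interested!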
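Since LOW.DR must stay below UP.DR and KS.cut must lie between 0 and 1, one option is to collect the arguments in a list and check them before a long run starts. The helper below is a hypothetical sketch (make_copykat_args is not part of copykat); it only validates values that are explicitly supplied and leaves everything else to the package defaults.

```r
# Hypothetical helper: collect and validate arguments before calling copykat().
make_copykat_args <- function(rawmat, sam.name, ...) {
  args <- list(rawmat = rawmat, sam.name = sam.name, ...)
  low <- args[["LOW.DR"]]
  up  <- args[["UP.DR"]]
  ks  <- args[["KS.cut"]]
  nc  <- args[["n.cores"]]
  # Only check values the caller actually supplied
  if (!is.null(low) && !is.null(up)) stopifnot(low < up)
  if (!is.null(ks)) stopifnot(ks > 0, ks < 1)
  if (!is.null(nc)) stopifnot(nc >= 1)
  args
}

toy_counts <- matrix(0, nrow = 2, ncol = 2)   # placeholder matrix for illustration
args <- make_copykat_args(toy_counts, sam.name = "CRC",
                          ngene.chr = 5, KS.cut = 0.1, n.cores = 4)

# An inconsistent pair of thresholds is rejected up front
bad <- tryCatch(
  make_copykat_args(toy_counts, sam.name = "CRC", LOW.DR = 0.3, UP.DR = 0.1),
  error = function(e) "rejected"
)

# With library(copykat) loaded and a real counts matrix in rawmat:
# sc_cnv <- do.call(copykat, args)
```

Keeping the arguments in a list also makes it easy to save them next to the results, so a run can be reproduced later.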
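4.3 Downstream analysis
The malignant cells identified by the CNV analysis can be pulled out for deeper analysis, for example by adding the predictions to the Seurat meta.data and visualizing them together with UMAP or t-SNE.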
# Load magrittr for the %>% pipe used below
library(magrittr)
# Add the CopyKAT predictions to the Seurat object
single_RNA <- readRDS('scRNA/CopyKAT/single_RNA.rds')
# sc_cnv is the copykat() result from section 4.2; drop cells labelled 'not.defined'
all_cells <- sc_cnv$prediction[sc_cnv$prediction$copykat.pred != 'not.defined', ]
# Extract the counts matrix of the cells CopyKAT predicted as malignant or non-malignant
sc_counts <- as.matrix(single_RNA@assays$RNA@counts)[, all_cells$cell.names]
# Standard Seurat workflow
CRC.data <- Matrix::Matrix(sc_counts, sparse = TRUE)
single_RNA <- CreateSeuratObject(counts = CRC.data, project = "CRC_Eds", min.cells = 3, min.features = 200)
# Import the predictions into the metadata, matched by cell name
single_RNA[['group_copykat']] <- all_cells$copykat.pred[match(colnames(single_RNA), all_cells$cell.names)]
# Normalize the data, identify highly variable genes (feature selection), and scale
single_RNA <- NormalizeData(single_RNA, normalization.method = "LogNormalize", scale.factor = 10000) %>%
FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
ScaleData(features = rownames(single_RNA)) # with do.center = FALSE, all scaled values are non-negative
# Dimensionality reduction, neighbor graph, and clustering
single_RNA <- RunPCA(single_RNA, features = VariableFeatures(object = single_RNA)) %>%
RunUMAP(dims = 1:20, reduction = 'pca') %>%
RunTSNE(dims = 1:20, reduction = 'pca') %>%
FindNeighbors(dims = 1:20, reduction = 'pca') %>%
FindClusters(resolution = 0.8)
plot1 <- DimPlot(single_RNA, reduction = "umap",label = TRUE,label.box = TRUE,repel = TRUE,pt.size = 1)
plot2 <- DimPlot(single_RNA, reduction = "umap",group.by = 'group_copykat',label = TRUE,label.box = TRUE,pt.size = 1)
plot1 + plot2
plot3 <- DimPlot(single_RNA, reduction = "tsne",label = TRUE,label.box = TRUE,repel = TRUE,pt.size = 1)
plot4 <- DimPlot(single_RNA, reduction = "tsne",group.by = 'group_copykat',label = TRUE,label.box = TRUE,pt.size = 1)
plot3 + plot4
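Besides comparing the plots by eye, the relationship between clusters and CopyKAT predictions can be summarized numerically. The sketch below uses a small made-up metadata table in place of single_RNA@meta.data; the 0.5 cutoff for calling a cluster "mostly aneuploid" is an arbitrary assumption, not a CopyKAT recommendation.

```r
# Made-up stand-in for single_RNA@meta.data (8 cells, 3 clusters)
meta <- data.frame(
  seurat_clusters = factor(c(0, 0, 0, 1, 1, 1, 2, 2)),
  group_copykat   = c("aneuploid", "aneuploid", "diploid",
                      "diploid", "diploid", "diploid",
                      "aneuploid", "aneuploid")
)

# Cross-tabulate cluster membership against the CopyKAT prediction
tab <- table(meta$seurat_clusters, meta$group_copykat)

# Fraction of aneuploid cells within each cluster (rows sum to 1)
frac_aneuploid <- prop.table(tab, margin = 1)[, "aneuploid"]

# Clusters in which more than half of the cells are predicted aneuploid
malignant_clusters <- names(frac_aneuploid)[frac_aneuploid > 0.5]
```

On the real object, replacing the toy table with meta <- single_RNA@meta.data gives the same summary for the clusters shown in the UMAP and t-SNE plots.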
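4.4 Closing remarks
That wraps up the CopyKAT analysis and the basic downstream steps. Deeper follow-up analysis depends on your own needs. If you have questions, message me directly or add me on WeChat to discuss (WeChat ID: abo1028). Feel free to reach out!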