What Is a Pseudocell in Single-Cell Transcriptomics?

While reading Prof. Guo's HCL paper ("Construction of a human cell landscape at single-cell level"), I ran into a concept I had not seen before: the pseudocell.

I had only just figured out pseudotime, and now there is pseudocell. Can a cell really be "pseudo"? Let's look at the original text:

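The passage mentions "attenuating the effects of noise and outliers", and averaging over cells does smooth out noise and outliers to some extent. Let's look at reference [36]: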

Tosches, M. A. et al. Evolution of pallium, hippocampus, and cortical cell types revealed by single-cell transcriptomics in reptiles. Science 360, 881-888, https://doi.org/10.1126/science.aar4237 (2018).

In the supplementary methods of that paper we find:

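Before running WGCNA on single-cell data, they also randomly pick a subset of cells from each cluster and pool them into pseudocells (a "local bulk" approach). No wonder my WGCNA results on the raw count matrix were so poor.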

Prof. Guo's paper also describes the pseudocell analysis in a section of its own:

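In paraphrase: to increase the number of genes and the gene-expression correlations obtained from high-throughput single-cell mRNA data, data from multiple cells in the same cell cluster are pooled into pseudocells, which are then used for network inference. So the pseudocell concept appears to compensate for the weakness of sparse matrices in correlation calculations: with so many zeros, the correlations are distorted.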
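To make the sparsity argument concrete, here is a small simulation. It is not taken from either paper, and every name and parameter in it (`level`, `gene_a`, `gene_b`, 10 clusters of 200 cells, groups of 20) is an assumption chosen for illustration. Two hypothetical genes share a cluster-specific expression level, counts are drawn at low depth so that most entries are zero, and cells are then averaged within each cluster in random groups of 20.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this toy example

n_clusters   <- 10
cluster_size <- 200
group_size   <- 20

cluster <- rep(seq_len(n_clusters), each = cluster_size)

# Hypothetical cluster-specific expression level shared by both genes
level <- seq(0.05, 0.5, length.out = n_clusters)

# Sparse single-cell counts: most entries are zero at these rates
gene_a <- rpois(length(cluster), lambda = level[cluster])
gene_b <- rpois(length(cluster), lambda = level[cluster])

# Within each cluster, randomly assign cells to groups of group_size
pseudo_id <- ave(seq_along(cluster), cluster, FUN = function(idx) {
  sample(ceiling(seq_along(idx) / group_size))
})
pseudo_key <- paste(cluster, pseudo_id, sep = "_")

# Average each group into one pseudocell
pseudo_a <- as.vector(tapply(gene_a, pseudo_key, mean))
pseudo_b <- as.vector(tapply(gene_b, pseudo_key, mean))

cor_single  <- cor(gene_a, gene_b)
cor_pseudo  <- cor(pseudo_a, pseudo_b)
zero_single <- mean(gene_a == 0)
zero_pseudo <- mean(pseudo_a == 0)

print(c(cor_single = cor_single, cor_pseudo = cor_pseudo))
print(c(zero_single = zero_single, zero_pseudo = zero_pseudo))
```

In this toy setup the co-expression signal comes from differences between clusters. Averaging within a cluster shrinks the per-cell sampling noise (for independent cells, the variance of a mean of n cells is about 1/n of the single-cell variance) while leaving the between-cluster differences intact, so the pseudocell correlation comes out higher and the fraction of zeros lower.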

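So when applying bulk RNA-seq analysis methods to high-throughput single-cell transcriptome data, this local-bulk approach does seem worthwhile: it reduces dimensionality (fewer, denser columns), and it makes the data better suited to models designed for bulk data.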

So how are pseudocells actually computed? Let's look at Prof. Guo's code:

load("/home/jingjingw/Jingjingw/Project/2018-MH-new/Pseudocell/FetalStomach1_500more.RData")
name<-"FetalStomach1"
outfile1<-"Human_FetalStomach1_pseudocell20.Rds"
outfile2<-"Human_FetalStomach1_pseudocell20.pheno.out"



Inter<-get(paste(name,"pbmc",sep = "_"))
Inter[Inter<0]=0
idd<-get(paste(name,"Anno1",sep = "_"))
Inter.id<-cbind(rownames(idd),idd$Cluster_id)

rownames(Inter.id)<-rownames(idd)
colnames(Inter.id)<-c("CellID","Celltype")
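Inter.id<-as.data.frame(Inter.id, stringsAsFactors=TRUE)  # levels() below needs factors; the default changed in R 4.0
Inter1<-Inter[,as.character(Inter.id$CellID)]  # index by cell name; a factor index would use its integer codes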
Inter<-as.matrix(Inter1)
pseudocell.size = 20 ## 10 test
new_ids_list = list()
for (i in seq_along(levels(Inter.id$Celltype))) {
    cluster_id = levels(Inter.id$Celltype)[i]
    # cells that belong to this cluster
    cluster_cells <- rownames(Inter.id[Inter.id$Celltype == cluster_id,])
    cluster_size <- length(cluster_cells)
    # group index 0,0,...,1,1,...; with floor(i/size), group 0 gets size-1 cells
    pseudo_ids <- floor(seq_along(cluster_cells)/pseudocell.size)
    pseudo_ids <- paste0(cluster_id, "_Cell", pseudo_ids)
    # randomly assign the cluster's cells to the group labels (no seed is set)
    names(pseudo_ids) <- sample(cluster_cells)
    new_ids_list[[i]] <- pseudo_ids
    }
    
new_ids <- unlist(new_ids_list)
new_ids <- as.data.frame(new_ids)
new_ids_length <- table(new_ids)

new_colnames <- rownames(new_ids)  ###add
all.data<-Inter[,as.character(new_colnames)] ###add
all.data <- t(all.data)###add

new.data<-aggregate(list(all.data[,1:length(all.data[1,])]),
    list(name=new_ids[,1]),FUN=mean)
rownames(new.data)<-new.data$name
new.data<-new.data[,-1]

new_ids_length<-as.matrix(new_ids_length)##
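# keep pseudocells built from at least 10 cells; logical indexing avoids the
# x[-which(...),] pitfall, which drops every row when no pseudocell is short
new_good_ids<-new_ids_length[new_ids_length[,1]>=10,,drop=FALSE]##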
result<-t(new.data)[,rownames(new_good_ids)]
colnames(result)<-paste("Human",colnames(result),sep="")
rownames(result)<-rownames(Inter)
#saveRDS(result,file=outdir1[i]) ###
saveRDS(result,file=outfile1) ###
cellty<-gsub("_Cell[0-9]+$","",colnames(result))  # strip the "_Cell<n>" suffix to recover the cell type
new.phe<-paste(colnames(result),'HumanFetal',cellty,sep="\t")

#write.table(new.phe,file=outdir2[i],quote=F,row.names=F) ###

write.table(new.phe,file=outfile2,quote=F,row.names=F) ###

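As you can see, the code loops over each cluster, uses sample() to randomly assign that cluster's cells to groups of about pseudocell.size (20) cells, and builds the new expression profile for each group with aggregate(..., FUN=mean). Pseudocells made from fewer than 10 cells are discarded before the result is saved.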
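The same idea can be written more compactly. The following is a minimal sketch on a toy matrix, not the HCL pipeline: `make_pseudocells` is a hypothetical helper, and the toy genes, cell names, and cluster labels are made up. It shuffles the cells within each cluster, groups them into blocks of `size`, averages each block with base R `rowsum()`, and drops blocks with fewer than `min_cells` cells.

```r
make_pseudocells <- function(expr, clusters, size = 20, min_cells = 10) {
  # expr: genes x cells numeric matrix; clusters: one label per column
  stopifnot(ncol(expr) == length(clusters))
  clusters <- as.character(clusters)
  group <- character(ncol(expr))
  for (cl in unique(clusters)) {
    idx <- which(clusters == cl)
    # idx[sample.int(n)] avoids the sample(x) pitfall when length(x) == 1
    shuffled <- idx[sample.int(length(idx))]
    # block index 0,0,...,1,1,...: every block but the last has `size` cells
    block <- (seq_along(shuffled) - 1) %/% size
    group[shuffled] <- paste0(cl, "_Cell", block)
  }
  n_per_group <- table(group)
  # rowsum() sums rows by group, so transpose to cells x genes first
  sums  <- rowsum(t(expr), group)
  means <- sums / as.vector(n_per_group[rownames(sums)])
  keep  <- names(n_per_group)[n_per_group >= min_cells]
  t(means[keep, , drop = FALSE])  # back to genes x pseudocells
}

# Toy data: 5 genes, 75 cells, clusters "A" (45 cells) and "B" (30 cells)
set.seed(42)
expr <- matrix(rpois(5 * 75, lambda = 0.3), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("cell", 1:75)))
clusters <- rep(c("A", "B"), times = c(45, 30))

pseudo <- make_pseudocells(expr, clusters)
print(dim(pseudo))
print(colnames(pseudo))
```

With these toy sizes, cluster A yields two full blocks plus a leftover block of 5 cells that is dropped, and cluster B yields one block of 20 and one of 10, both kept. One deliberate difference from the listing above: the sketch uses `(i - 1) %/% size`, so the first block holds exactly `size` cells, whereas `floor(i / size)` leaves the first block one cell short.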


Last edited: 2020.05.13 17:36:40