CNS图表复现07—原来这篇文章有两个单细胞表达矩阵

本文是参考学习CNS图表复现07—原来这篇文章有两个单细胞表达矩阵的学习笔记。可能根据学习情况有所改动。

文章的第一次分群按照 :

  • immune (CD45+,PTPRC),
  • epithelial/cancer (EpCAM+,EPCAM),
  • stromal (CD10+,MME,fibo or CD31+,PECAM1,endo)

的表达量分布,可以拿到如下所示的:

图片

第一次分群</figcaption>

文章提到的各大亚群细胞数量是:(epithelial cells [n = 5,581], immune cells [n = 13,431], stromal cells [n = 4,249]).

我发现自己怎么都重复不出来,因为唯一的质量控制步骤,细胞数量就开始不吻合了:

 # 这里没有绝对的过滤标准
  # Filter cells so that remaining cells have nGenes >= 500 and nReads >= 50000
  # 26485 features across 27489 samples
  main_tiss_filtered <- subset(x=main_tiss, subset = nCount_RNA > 50000 & nFeature_RNA > 500)
  main_tiss_filtered
  # 26485 features across 21444 samples

也就是说27489个细胞,被过滤成为了 21444细胞,而文章的细胞数量是23261,但是我明明是跟文章作者一模一样的处理过程。

真的是百思不得其解。

后来认真看了看它提供的文件列表:

1.4G Aug 30 12:04 S01_datafinal.csv
11M Aug 30 11:34 S01_metacells.csv
3.5K Aug 30 11:35 by_sample_ratios_driver_muts_3.30.20.csv
3.3K Aug 30 11:35 neo-osi_metadata.csv
184M Aug 30 11:35 neo-osi_rawdata.csv
13K Aug 30 11:34 samples_x_genes_3.30.20.csv

原来是有两个表达矩阵文件,neo-osi_rawdata.csv 和 S01_datafinal.csv 文件,都需要走流程~

需要注意的是这个文章作者并没有上传这些文件到GEO上面,仅仅是上传了原始数据在:https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA591860,这个原始数据接近4T的文件。

这些csv文件呢,作者上传到了谷歌云盘,我们人工搬运到了百度云,在交流群可以拿到。如果你也想加入交流群,自己去:你要的rmarkdown文献图表复现全套代码来了(单细胞)找到我们的拉群小助手哈。

我漏掉的这个 neo-osi_rawdata.csv 走流程拿到的是 2070 个单细胞

# 26485 features across 3507 samples within 1 assay 
  osi_object_filtered <- subset(x=osi_object, subset = nCount_RNA > 50000 & nFeature_RNA > 500)
  osi_object_filtered 
  # 26485 features across 2070 samples within 1 assay 

然后把两个单细胞表达矩阵构建好的Seurat 对象合并起来,代码如下:得到的就差不多是文章里面的 23514 个单细胞啦,文章的细胞数量是23261。

main_tiss_filtered
osi_object_filtered
main_tiss_filtered1 <- merge(x = main_tiss_filtered, y = osi_object_filtered)
main_tiss_filtered1 
An object of class Seurat 
26485 features across 23514 samples within 1 assay 
Active assay: RNA (26485 features, 0 variable features)

代码汇总 全部代码可以复制粘贴的
总共是 270 行,基本上涵盖了我们单细胞的方方面面知识点。

rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
pro='first'

# 单细胞项目:来自于30个病人的49个组织样品,跨越3个治疗阶段

# Single-cell RNA sequencing (scRNA-seq) of
# metastatic lung cancer was performed using 49 clinical biopsies obtained from 30 patients before and during
# targeted therapy. 

# before initiating systemic targeted therapy (TKI naive [TN]), 
# at the residual disease (RD) state,
# acquired drug resistance (progression [PD]).

# 首先读取第一个单细胞转录组表达矩阵文件:
if(T){
  
  # Load data 
  dir='./'
  Sys.time()
  raw.data <- read.csv(paste(dir,"Data_input/csv_files/S01_datafinal.csv", sep=""), header=T, row.names = 1)
  Sys.time()
  dim(raw.data) # 
  raw.data[1:4,1:4]
  head(colnames(raw.data)) 
  # Load metadata 
  metadata <- read.csv(paste(dir,"Data_input/csv_files/S01_metacells.csv", sep=""), row.names=1, header=T)
  head(metadata) 
  
  # Find ERCC's, compute the percent ERCC, and drop them from the raw data.
  
  erccs <- grep(pattern = "^ERCC-", x = rownames(x = raw.data), value = TRUE)
  percent.ercc <- Matrix::colSums(raw.data[erccs, ])/Matrix::colSums(raw.data)
  fivenum(percent.ercc)
  ercc.index <- grep(pattern = "^ERCC-", x = rownames(x = raw.data), value = FALSE)
  raw.data <- raw.data[-ercc.index,]
  dim(raw.data) 
  # Create the Seurat object with all the data (unfiltered)
  main_tiss <- CreateSeuratObject(counts = raw.data)
  # add rownames to metadta 
  row.names(metadata) <- metadata$cell_id
  # add metadata to Seurat object 
  main_tiss <- AddMetaData(object = main_tiss, metadata = metadata)
  main_tiss <- AddMetaData(object = main_tiss, percent.ercc, col.name = "percent.ercc")
  # Head to check
  head(main_tiss@meta.data)
  
  # Calculate percent ribosomal genes and add to metadata
  ribo.genes <- grep(pattern = "^RP[SL][[:digit:]]", x = rownames(x = main_tiss@assays$RNA@data), value = TRUE)
  percent.ribo <- Matrix::colSums(main_tiss@assays$RNA@counts[ribo.genes, ])/Matrix::colSums(main_tiss@assays$RNA@data)
  fivenum(percent.ribo)
  main_tiss <- AddMetaData(object = main_tiss, metadata = percent.ribo, col.name = "percent.ribo")
  main_tiss
  
  # 唯一的质量控制步骤:
  # 这里没有绝对的过滤标准
  # Filter cells so that remaining cells have nGenes >= 500 and nReads >= 50000
  # 26485 features across 27489 samples
  main_tiss_filtered <- subset(x=main_tiss, subset = nCount_RNA > 50000 & nFeature_RNA > 500)
  main_tiss_filtered
  # 26485 features across 21444 samples
}
main_tiss_filtered
# 26485 features across 21444 samples

# 然后读取第二个单细胞转录组表达矩阵文件:
if(T){
  # load and clean rawdata
  # Load data 
  dir='./'
  Sys.time()
  osi.raw.data <- read.csv(paste(dir,"Data_input/csv_files/neo-osi_rawdata.csv", sep=""), row.names = 1)
  
  Sys.time()
 
  colnames(osi.raw.data) <- gsub("_S.*.homo", "", colnames(osi.raw.data))
  osi.raw.data[1:4,1:4]
  tail( osi.raw.data[ 1:4])
  dim(osi.raw.data)
  
  # drop sequencing details from gene count table
  # 因为是HTseq这个软件跑出来的表达矩阵
  row.names(osi.raw.data)[grep("__", row.names(osi.raw.data))]
  osi.raw.data <- osi.raw.data[-grep("__", row.names(osi.raw.data)),]
  
  #Make osi.metadata by cell from osi.metadata by plate
  
  osi.metadata <- read.csv(paste(dir, "Data_input/csv_files/neo-osi_metadata.csv", sep = ""))
  osi.meta.cell <- as.data.frame(colnames(osi.raw.data))
  osi.meta.cell <- data.frame(do.call('rbind', strsplit(as.character(osi.meta.cell$`colnames(osi.raw.data)`),'_',fixed=TRUE)))
  rownames(osi.meta.cell) <- paste(osi.meta.cell$X1, osi.meta.cell$X2, sep = "_")
  colnames(osi.meta.cell) <- c("well", "plate")
  osi.meta.cell$cell_id <- rownames(osi.meta.cell)
  library(dplyr)
  osi.metadata <- left_join(osi.meta.cell, osi.metadata, by = "plate")
  rownames(osi.metadata) <- osi.metadata$cell_id
  head(osi.metadata)
  
  unique(osi.metadata$plate)
  
  
  # Find ERCC's, compute the percent ERCC, and drop them from the raw data.
  
  erccs <- grep(pattern = "^ERCC-", x = rownames(x = osi.raw.data), value = TRUE)
  percent.ercc <- Matrix::colSums(osi.raw.data[erccs, ])/Matrix::colSums(osi.raw.data)
  ercc.index <- grep(pattern = "^ERCC-", x = rownames(x = osi.raw.data), value = FALSE)
  osi.raw.data <- osi.raw.data[-ercc.index,]
  dim(osi.raw.data)
  
  
  # Create the Seurat object with all the data (unfiltered)
  
  osi_object <- CreateSeuratObject(counts = osi.raw.data)
  osi_object <- AddMetaData(object = osi_object, metadata = osi.metadata)
  osi_object <- AddMetaData(object = osi_object, percent.ercc, col.name = "percent.ercc")
  # Changing nUMI column name to nReads
  colnames(osi_object@meta.data)[colnames(osi_object@meta.data) == 'nUMI'] <- 'nReads'
  head(osi_object@meta.data)
  
  
  # Calculate percent ribosomal genes and add to osi.metadata
  
  ribo.genes <- grep(pattern = "^RP[SL][[:digit:]]", x = rownames(x = osi_object@assays$RNA@data), value = TRUE)
  percent.ribo <- Matrix::colSums(osi_object@assays$RNA@data[ribo.genes, ])/Matrix::colSums(osi_object@assays$RNA@data)
  osi_object <- AddMetaData(object = osi_object, metadata = percent.ribo, col.name = "percent.ribo")
  osi_object
  # 26485 features across 3507 samples within 1 assay 
  osi_object_filtered <- subset(x=osi_object, subset = nCount_RNA > 50000 & nFeature_RNA > 500)
  osi_object_filtered 
  # 26485 features across 2070 samples within 1 assay 
  VlnPlot(osi_object_filtered, features = "nFeature_RNA")
  VlnPlot(osi_object_filtered, features = "nCount_RNA", log = TRUE)

}
main_tiss_filtered
osi_object_filtered
main_tiss_filtered1 <- merge(x = main_tiss_filtered, y = osi_object_filtered)
main_tiss_filtered1 
raw_sce <- main_tiss_filtered1
if(F){
  # Drop any samples with 10 or less cells
  
  main_tiss_filtered@meta.data$sample_name <- as.character(main_tiss_filtered@meta.data$sample_name)
  sample_name <- as.character(main_tiss_filtered@meta.data$sample_name)
  # Make table 
  tab.1 <- table(main_tiss_filtered@meta.data$sample_name) 
  # Which samples have less than 10 cells 
  samples.keep <- names(which(tab.1 > 10))
  metadata_keep <- filter(main_tiss_filtered@meta.data, sample_name %in% samples.keep)
  # Subset Seurat object 
  tiss_subset <- subset(main_tiss_filtered, cells=as.character(metadata_keep$cell_id))
  tiss_subset
  raw_sce <- tiss_subset
}

# Gene-expression profiles of 23,261 cells were retained after quality control filtering.



raw_sce 
rownames(raw_sce)[grepl('^mt-',rownames(raw_sce),ignore.case = T)]
rownames(raw_sce)[grepl('^Rp[sl]',rownames(raw_sce),ignore.case = T)]

raw_sce[["percent.mt"]] <- PercentageFeatureSet(raw_sce, pattern = "^MT-")
fivenum(raw_sce[["percent.mt"]][,1])
rb.genes <- rownames(raw_sce)[grep("^RP[SL]",rownames(raw_sce),ignore.case = T)]
C<-GetAssayData(object = raw_sce, slot = "counts")
percent.ribo <- Matrix::colSums(C[rb.genes,])/Matrix::colSums(C)*100
fivenum(percent.ribo)
raw_sce <- AddMetaData(raw_sce, percent.ribo, col.name = "percent.ribo")

plot1 <- FeatureScatter(raw_sce, feature1 = "nCount_RNA", feature2 = "percent.ercc")
plot2 <- FeatureScatter(raw_sce, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))

VlnPlot(raw_sce, features = c("percent.ribo", "percent.ercc"), ncol = 2)
VlnPlot(raw_sce, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
VlnPlot(raw_sce, features = c("percent.ribo", "nCount_RNA"), ncol = 2)
raw_sce
 
sce=raw_sce
sce
#sce <- NormalizeData(sce, normalization.method =  "LogNormalize",  scale.factor = 1e6)
sce <- NormalizeData(sce, normalization.method =  "LogNormalize",  scale.factor = 1e5)
GetAssay(sce,assay = "RNA")
sce <- FindVariableFeatures(sce, 
                            selection.method = "vst", nfeatures = 2000) 
# 步骤 ScaleData 的耗时取决于电脑系统配置(保守估计大于一分钟)
sce <- ScaleData(sce) 
sce <- RunPCA(object = sce, pc.genes = VariableFeatures(sce)) 
DimHeatmap(sce, dims = 1:12, cells = 100, balanced = TRUE)
ElbowPlot(sce)
# 跟文章保持一致,第一次分群,resolution 采取 0.5
sce <- FindNeighbors(sce, dims = 1:15)
sce <- FindClusters(sce, resolution = 0.2)
table(sce@meta.data$RNA_snn_res.0.2) 
sce <- FindClusters(sce, resolution = 0.8)
table(sce@meta.data$RNA_snn_res.0.8) 
sce <- FindClusters(sce, resolution = 0.5)
table(sce@meta.data$RNA_snn_res.0.5) 

# 不同的 resolution 分群的结果不一样
library(gplots)
tab.1=table(sce@meta.data$RNA_snn_res.0.2,sce@meta.data$RNA_snn_res.0.8) 
balloonplot(tab.1)

# Check clustering stability at given resolution  
# Set different resolutions 
res.used <- seq(0.1,1,by=0.2)
res.used
# Loop over and perform clustering of different resolutions 
for(i in res.used){
  sce <- FindClusters(object = sce, verbose = T, resolution = res.used)
}
# Make plot 
library(clustree)
clus.tree.out <- clustree(sce) +
  theme(legend.position = "bottom") + 
  scale_color_brewer(palette = "Set1") +
  scale_edge_color_continuous(low = "grey80", high = "red")

clus.tree.out 

res.used <- 0.5
sce <- FindClusters(object = sce, verbose = T, resolution = res.used)


set.seed(123)
sce <- RunTSNE(object = sce, dims = 1:15, do.fast = TRUE)
DimPlot(sce,reduction = "tsne",label=T)
phe=data.frame(cell=rownames(sce@meta.data),
               cluster =sce@meta.data$seurat_clusters)
head(phe)
table(phe$cluster)
tsne_pos=Embeddings(sce,'tsne') 
head(phe)
table(phe$cluster)
head(tsne_pos) 
dat=cbind(tsne_pos,phe)

save(dat,file=paste0(pro,'_for_tSNE.pos.Rdata')) 
load(file=paste0(pro,'_for_tSNE.pos.Rdata'))  
library(ggplot2)
p=ggplot(dat,aes(x=tSNE_1,y=tSNE_2,color=cluster))+geom_point(size=0.95)
p=p+stat_ellipse(data=dat,aes(x=tSNE_1,y=tSNE_2,fill=cluster,color=cluster),
                 geom = "polygon",alpha=0.2,level=0.9,type="t",linetype = 2,show.legend = F)+coord_fixed()
print(p) 
theme= theme(panel.grid =element_blank()) +   ## 删去网格
  theme(panel.border = element_blank(),panel.background = element_blank()) +   ## 删去外层边框
  theme(axis.line = element_line(size=1, colour = "black")) 
p=p+theme+guides(colour = guide_legend(override.aes = list(size=5)))
print(p)
ggplot2::ggsave(filename = paste0(pro,'_tsne.pdf'))

sce <- RunUMAP(object = sce, dims = 1:15, do.fast = TRUE)
DimPlot(sce,reduction = "umap",label=T)
ggplot2::ggsave(filename = paste0(pro,'_umap.pdf')) 
 
sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, min.pct = 0.25, 
                              thresh.use = 0.25)

DT::datatable(sce.markers)
write.csv(sce.markers,file=paste0(pro,'_sce.markers.csv'))
library(dplyr) 
top10 <- sce.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)
DoHeatmap(sce,top10$gene,size=3)
ggsave(filename=paste0(pro,'_sce.markers_heatmap.pdf'))
sce.first=sce
save(sce.first,sce.markers,file = 'first_sce.Rdata')
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 215,794评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,050评论 3 391
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,587评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,861评论 1 290
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,901评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,898评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,832评论 3 416
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,617评论 0 271
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,077评论 1 308
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,349评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,483评论 1 345
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,199评论 5 341
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,824评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,442评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,632评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,474评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,393评论 2 352

推荐阅读更多精彩内容