顶刊分享---超过一千份肿瘤样本的瘤间转录异质性特征

作者，Evil Genius

肿瘤样本的异质性，肿瘤样本分析的关键问题。

正常组织细胞图谱相对好做一点，但是肿瘤样本的图谱，单一组学几乎不能解释任何问题，尤其转录组。

异质性在单细胞最大的体现，就是转录状态，而且转录状态几乎课题间无法复用。

今日参考文献

知识积累

每个肿瘤都包含不同的细胞状态，这些状态是肿瘤内异质性（ITH）的基础，这是癌症治疗的核心挑战。

整理、注释和整合了来自77项不同研究的数据，以揭示涵盖24种肿瘤类型的1163个肿瘤样本的转录ITH模式。

ITH是由遗传学、表观遗传学和微环境影响驱动的肿瘤的基本特性，是治疗失败、转移和其他癌症表型的核心。

在同一癌症类型的肿瘤中发现了类似的ITH程序，在某些情况下甚至在不同的癌症类型中发现了。这些相似之处表明，ITH表达程序反映了肿瘤生物学的基本方面。因此，试图确定不同肿瘤的相关ITH程序之间的共识，将其称为元程序（MP）。

任何特定MP的高表达可被视为定义细胞状态。

值得注意的是，MP往往仅限于数十个基因，这些基因的表达叠加在细胞的基线表达谱上，因此反映了相对的细胞状态。

亚群将处于不同的全局细胞状态（反映肿瘤特性），但处于相同的相对细胞状态（反应特定MP的激活）。

肿瘤单细胞数据

包括1456个样本，涵盖24种癌症类型和2591545个细胞。

细胞注释 + CNV推断恶性细胞类型。

定义和注释MPs

定义了在每个肿瘤内不同的表达程序，随后比较了所有研究中肿瘤的ITH程序（NMF）。

对于每种肿瘤，利用非负矩阵因子分解（NMF）来表征其恶性细胞中不同的ITH程序，每个程序由其得分最高的50个基因表示。

在scRNA-seq平台上检测到的MP具有可比性，包括高表达和非高表达基因

Regulation of MPs

将SCENIC应用于所有肿瘤样本,探索了肿瘤内和肿瘤间MP的遗传调控。

所有肿瘤中，研究了MP表达是否与特定突变或CNA有关。由于只有少数肿瘤通过scRNA-seq和全基因组或全基因组测序进行了分析，分析转向了癌症基因组图谱（TCGA）的大量数据，从而将分析扩展到数千个基因注释肿瘤，但将分析限制在肿瘤的平均MP表达。分析发现MP表达与特定突变或CNA之间存在许多关联。

MP临床相关性

鉴于scRNA-seq数据集的临床注释有限，再次转向分析bulk TCGA样本中的（平均）MP表达，并确定了与总体生存率、分级和分期、淋巴结转移和治疗耐药性的关联。

MP背景特异性

将每种癌症类型中每种MP的频率分为缺失、低、中、高或高和显著富集。

同时检查了卵巢癌症、皮肤鳞状细胞癌和胶质母细胞瘤的空间转录组学（Visium）数据。

转录ITH的特征

细胞周期和应激后最常见的标志是“谱系相关”MP，包括类似于发育和分化细胞类型的程序，如皮肤色素沉着、肺泡细胞、神经元和少突胶质细胞祖细胞。这些MP与ITH概括发展轨迹的概念是一致的。

非恶性TME细胞类型的MP

上述分析来定义这些细胞类型中的MP，并通过功能丰富对其进行注释。

大多数恶性MP与上皮细胞的非恶性MP相似，但与其他细胞类型的MP不同），这表明恶性细胞中的大部分异质性已经存在于起源细胞中。

对癌前组织（腺瘤）的分析表明，MP表达与恶性上皮细胞的相似性高于非恶性上皮细胞。

MHC II与干扰素MP的偶联

MHC II基因的背景特异性偶联尤其常见。

MHC II、干扰素反应和肺泡分化反映了在几种细胞类型和癌症类型中观察到的一致表达程序；然而，这些程序之间的耦合在细胞类型、肿瘤之间甚至同一肿瘤内的空间上都有所不同，这表明其调控复杂。MHC II和干扰素反应的表达对T细胞功能有重要影响，这增加了肿瘤中偶联（或解耦）可能未被选择的可能性。

推断细胞类型关联

使用空间蛋白质组学数据集，验证了几种细胞类型的循环细胞共同定位在“高增殖”生态位的预测。不同细胞类型的相关增殖可能是由TME有丝分裂原或细胞类型之间的相互作用驱动的，例如通过外泌体或其他机制转移蛋白质。

免疫相关TME关联

癌症细胞的转录组可以被认为是其“全局”状态的表征。

肿瘤间整体细胞状态的比较突显了高度的细胞异质性.

最后我们再次复习一下NMF

library(reshape2)
library(NMF)
library(ggplot2)
library(scales)
source("custom_magma.R")
source("robust_nmf_programs.R")   


## Input:
# Genes_nmf_w_basis is a list in which each entry contains NMF gene-scores of a single sample. In our study we ran NMF using ranks 4-9 on the top 7000 genes in each sample. Hence each entry in Genes_nmf_w_basis is a matrix with 7000 rows (genes) X 39 columns (NMF programs)  
# For the code below to run smoothly, please use the following nomenclature:
  # 1) End entry names in Genes_nmf_w_basis (i.e. each sample name) with '_rank4_9_nruns10.RDS' 
  # 2) End each matrix column name with an extension that represents the NMF rank and program index. for example '_rank4_9_nrun10.RDS.4.1' to represent the first NMF program obtained using rank=4, or '_rank4_9_nrun10.RDS.6.5' to represent the fifth NMF program obtained using rank=6    
# See Genes_nmf_w_basis_example.RDS for an example 
# We define MPs in 2 steps:
  # 1) The function robust_nmf_programs.R performs filtering, so that programs selected for defining MPs are:
  #    i) Robust                - recur in more that one rank within the sample 
  #    ii) Non-redundant        - once a NMF program is selected, other programs within the sample that are similar to it are removed
  #    iii) Not sample specific - has similarity to NMF programs in other samples 
  # ** Please see https://github.com/gabrielakinker/CCLE_heterogeneity for more details on how to define robust NMF programs 
  # 2) Selected NMFs are then clustered iteratively, as described in Figure S1. At the end of the process, each cluster generates a list of the 50 genes (i.e. the MP) that represent the NMF programs that contributed to the cluster. Notably, not all initially selected NMFs end up participating in a cluster  


# ----------------------------------------------------------------------------------------------------
# Select NMF programs
# ----------------------------------------------------------------------------------------------------


## Parameters 
intra_min_parameter <- 35
intra_max_parameter <- 10
inter_min_parameter <- 10


# get top 50 genes for each NMF program 
nmf_programs          <- lapply(Genes_nmf_w_basis, function(x) apply(x, 2, function(y) names(sort(y, decreasing = T))[1:50]))
nmf_programs          <- lapply(nmf_programs,toupper) ## convert all genes to uppercase 

# for each sample, select robust NMF programs (i.e. observed using different ranks in the same sample), remove redundancy due to multiple ranks, and apply a filter based on the similarity to programs from other samples. 
nmf_filter_ccle       <- robust_nmf_programs(nmf_programs, intra_min = intra_min_parameter, intra_max = intra_max_parameter, inter_filter=T, inter_min = inter_min_parameter)  
nmf_programs          <- lapply(nmf_programs, function(x) x[, is.element(colnames(x), nmf_filter_ccle),drop=F])
nmf_programs          <- do.call(cbind, nmf_programs)

# calculate similarity between programs
nmf_intersect         <- apply(nmf_programs , 2, function(x) apply(nmf_programs , 2, function(y) length(intersect(x,y)))) 

# hierarchical clustering of the similarity matrix 
nmf_intersect_hc     <- hclust(as.dist(50-nmf_intersect), method="average") 
nmf_intersect_hc     <- reorder(as.dendrogram(nmf_intersect_hc), colMeans(nmf_intersect))
nmf_intersect        <- nmf_intersect[order.dendrogram(nmf_intersect_hc), order.dendrogram(nmf_intersect_hc)]



# ----------------------------------------------------------------------------------------------------
# Cluster selected NMF programs to generate MPs
# ----------------------------------------------------------------------------------------------------


### Parameters for clustering
Min_intersect_initial <- 10    # the minimal intersection cutoff for defining the first NMF program in a cluster
Min_intersect_cluster <- 10    # the minimal intersection cutoff for adding a new NMF to the forming cluster 
Min_group_size        <- 5     # the minimal group size to consider for defining the first NMF program in a cluster 

Sorted_intersection       <-  sort(apply(nmf_intersect , 2, function(x) (length(which(x>=Min_intersect_initial))-1)  ) , decreasing = TRUE)

Cluster_list              <- list()   ### Every entry contains the NMFs of a chosen cluster
MP_list                   <- list()
k                         <- 1
Curr_cluster              <- c()

nmf_intersect_original    <- nmf_intersect

while (Sorted_intersection[1]>Min_group_size) {  
  
  Curr_cluster <- c(Curr_cluster , names(Sorted_intersection[1]))
  
  ### intersection between all remaining NMFs and Genes in MP 
  Genes_MP                    <- nmf_programs[,names(Sorted_intersection[1])] # Genes in the forming MP are first chosen to be those in the first NMF. Genes_MP always has only 50 genes and evolves during the formation of the cluster
  nmf_programs                <- nmf_programs[,-match(names(Sorted_intersection[1]) , colnames(nmf_programs))]  # remove selected NMF
  Intersection_with_Genes_MP  <- sort(apply(nmf_programs, 2, function(x) length(intersect(Genes_MP,x))) , decreasing = TRUE) # intersection between all other NMFs and Genes_MP  
  NMF_history                 <- Genes_MP  # has genes in all NMFs in the current cluster, for redefining Genes_MP after adding a new NMF 
  
  ### Create gene list is composed of intersecting genes (in descending order by frequency). When the number of genes with a given frequency span bewond the 50th genes, they are sorted according to their NMF score.    
  while ( Intersection_with_Genes_MP[1] >= Min_intersect_cluster) {  
    
    Curr_cluster  <- c(Curr_cluster , names(Intersection_with_Genes_MP)[1])
    
    Genes_MP_temp   <- sort(table(c(NMF_history , nmf_programs[,names(Intersection_with_Genes_MP)[1]])), decreasing = TRUE)   ## Genes_MP is newly defined each time according to all NMFs in the current cluster 
    Genes_at_border <- Genes_MP_temp[which(Genes_MP_temp == Genes_MP_temp[50])]   ### genes with overlap equal to the 50th gene
    
    if (length(Genes_at_border)>1){
      ### Sort last genes in Genes_at_border according to maximal NMF gene scores
      ### Run across all NMF programs in Curr_cluster and extract NMF scores for each gene
      Genes_curr_NMF_score <- c()
      for (i in Curr_cluster) {
        curr_study           <- paste( strsplit(i , "[.]")[[1]][1 : which(strsplit(i , "[.]")[[1]] == "RDS")]   , collapse = "."  )
        Q                    <- Genes_nmf_w_basis[[curr_study]][ match(names(Genes_at_border),toupper(rownames(Genes_nmf_w_basis[[curr_study]])))[!is.na(match(names(Genes_at_border),toupper(rownames(Genes_nmf_w_basis[[curr_study]]))))]   ,i] 
        names(Q)             <- names(Genes_at_border[!is.na(match(names(Genes_at_border),toupper(rownames(Genes_nmf_w_basis[[curr_study]]))))])  ### sometimes when adding genes the names do not appear 
        Genes_curr_NMF_score <- c(Genes_curr_NMF_score,  Q )
      }
      Genes_curr_NMF_score_sort <- sort(Genes_curr_NMF_score , decreasing = TRUE)
      Genes_curr_NMF_score_sort <- Genes_curr_NMF_score_sort[unique(names(Genes_curr_NMF_score_sort))]   
      
      Genes_MP_temp             <- c(names(Genes_MP_temp[which(Genes_MP_temp > Genes_MP_temp[50])]) , names(Genes_curr_NMF_score_sort))
      
    } else {
      Genes_MP_temp <- names(Genes_MP_temp)[1:50] 
    }
    
    NMF_history     <- c(NMF_history , nmf_programs[,names(Intersection_with_Genes_MP)[1]]) 
    Genes_MP        <- Genes_MP_temp[1:50]
    
    nmf_programs    <- nmf_programs[,-match(names(Intersection_with_Genes_MP)[1] , colnames(nmf_programs))]  # remove selected NMF
    
    Intersection_with_Genes_MP <- sort(apply(nmf_programs, 2, function(x) length(intersect(Genes_MP,x))) , decreasing = TRUE) # intersection between all other NMFs and Genes_MP  
    
  }
  
  Cluster_list[[paste0("Cluster_",k)]] <- Curr_cluster
  MP_list[[paste0("MP_",k)]]           <- Genes_MP
  k <- k+1
  
  nmf_intersect             <- nmf_intersect[-match(Curr_cluster,rownames(nmf_intersect) ) , -match(Curr_cluster,colnames(nmf_intersect) ) ]  # Remove current chosen cluster
  
  Sorted_intersection       <-  sort(apply(nmf_intersect , 2, function(x) (length(which(x>=Min_intersect_initial))-1)  ) , decreasing = TRUE)   # Sort intersection of remaining NMFs not included in any of the previous clusters
  
  Curr_cluster <- c()
  print(dim(nmf_intersect)[2])
}



####  Sort Jaccard similarity plot according to new clusters:

inds_sorted <- c()

for (j in 1:length(Cluster_list)){
  
  inds_sorted <- c(inds_sorted , match(Cluster_list[[j]] , colnames(nmf_intersect_original)))
  
}
inds_new <- c(inds_sorted   ,   which(is.na( match(1:dim(nmf_intersect_original)[2],inds_sorted)))) ### clustered NMFs will appear first, and the latter are the NMFs that were not clustered

nmf_intersect_meltI_NEW <- reshape2::melt(nmf_intersect_original[inds_new,inds_new]) 

ggplot(data = nmf_intersect_meltI_NEW, aes(x=Var1, y=Var2, fill=100*value/(100-value), color=100*value/(100-value))) + 
  geom_tile() + 
  scale_color_gradient2(limits=c(2,25), low=custom_magma[1:111],  mid =custom_magma[112:222], high = custom_magma[223:333], midpoint = 13.5, oob=squish, name="Similarity\n(Jaccard index)") +                                
  scale_fill_gradient2(limits=c(2,25), low=custom_magma[1:111],  mid =custom_magma[112:222], high = custom_magma[223:333], midpoint = 13.5, oob=squish, name="Similarity\n(Jaccard index)")  +
  theme( axis.ticks = element_blank(), panel.border = element_rect(fill=F), panel.background = element_blank(),  axis.line = element_blank(), axis.text = element_text(size = 11), axis.title = element_text(size = 12), legend.title = element_text(size=11), legend.text = element_text(size = 10), legend.text.align = 0.5, legend.justification = "bottom") + 
  theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + 
  theme(axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank()) + 
  guides(fill = guide_colourbar(barheight = 4, barwidth = 1))


MP_list <-  do.call(cbind, MP_list)

生活很好，有你更好

顶刊分享---超过一千份肿瘤样本的瘤间转录异质性特征

作者，Evil Genius

肿瘤样本的异质性，肿瘤样本分析的关键问题。

正常组织细胞图谱相对好做一点，但是肿瘤样本的图谱，单一组学几乎不能解释任何问题，尤其转录组。

异质性在单细胞最大的体现，就是转录状态，而且转录状态几乎课题间无法复用。

今日参考文献

知识积累

每个肿瘤都包含不同的细胞状态，这些状态是肿瘤内异质性（ITH）的基础，这是癌症治疗的核心挑战。

整理、注释和整合了来自77项不同研究的数据，以揭示涵盖24种肿瘤类型的1163个肿瘤样本的转录ITH模式。

ITH是由遗传学、表观遗传学和微环境影响驱动的肿瘤的基本特性，是治疗失败、转移和其他癌症表型的核心。

任何特定MP的高表达可被视为定义细胞状态。

值得注意的是，MP往往仅限于数十个基因，这些基因的表达叠加在细胞的基线表达谱上，因此反映了相对的细胞状态。

亚群将处于不同的全局细胞状态（反映肿瘤特性），但处于相同的相对细胞状态（反应特定MP的激活）。

肿瘤单细胞数据

包括1456个样本，涵盖24种癌症类型和2591545个细胞。

细胞注释 + CNV推断恶性细胞类型。

定义和注释MPs

定义了在每个肿瘤内不同的表达程序，随后比较了所有研究中肿瘤的ITH程序（NMF）。

对于每种肿瘤，利用非负矩阵因子分解（NMF）来表征其恶性细胞中不同的ITH程序，每个程序由其得分最高的50个基因表示。

在scRNA-seq平台上检测到的MP具有可比性，包括高表达和非高表达基因

Regulation of MPs

将SCENIC应用于所有肿瘤样本,探索了肿瘤内和肿瘤间MP的遗传调控。

MP临床相关性

鉴于scRNA-seq数据集的临床注释有限，再次转向分析bulk TCGA样本中的（平均）MP表达，并确定了与总体生存率、分级和分期、淋巴结转移和治疗耐药性的关联。

MP背景特异性

将每种癌症类型中每种MP的频率分为缺失、低、中、高或高和显著富集。

同时检查了卵巢癌症、皮肤鳞状细胞癌和胶质母细胞瘤的空间转录组学（Visium）数据。

转录ITH的特征

细胞周期和应激后最常见的标志是“谱系相关”MP，包括类似于发育和分化细胞类型的程序，如皮肤色素沉着、肺泡细胞、神经元和少突胶质细胞祖细胞。这些MP与ITH概括发展轨迹的概念是一致的。

非恶性TME细胞类型的MP

上述分析来定义这些细胞类型中的MP，并通过功能丰富对其进行注释。

大多数恶性MP与上皮细胞的非恶性MP相似，但与其他细胞类型的MP不同），这表明恶性细胞中的大部分异质性已经存在于起源细胞中。

对癌前组织（腺瘤）的分析表明，MP表达与恶性上皮细胞的相似性高于非恶性上皮细胞。

MHC II与干扰素MP的偶联

MHC II基因的背景特异性偶联尤其常见。

推断细胞类型关联

使用空间蛋白质组学数据集，验证了几种细胞类型的循环细胞共同定位在“高增殖”生态位的预测。不同细胞类型的相关增殖可能是由TME有丝分裂原或细胞类型之间的相互作用驱动的，例如通过外泌体或其他机制转移蛋白质。

免疫相关TME关联

癌症细胞的转录组可以被认为是其“全局”状态的表征。

肿瘤间整体细胞状态的比较突显了高度的细胞异质性.

最后我们再次复习一下NMF

生活很好，有你更好

相关阅读更多精彩内容

友情链接更多精彩内容