这是ggplot2可视化专题的第三个实例操作

【ggplot2的基本思路见前文总论】：基于ggplot2的RNA-seq转录组可视化：总述和分文目录

【ggplot2绘图具体策略第一篇】：测序结果概览：基因表达量rank瀑布图，高密度表达相关性散点图，注释特定基因及errorbar的表达相关性散点图绘制

【ggplot2绘图具体策略第二篇】：单/多个基因在组间同时展示的多种选择：分组小提琴图、分组/分面柱状图、单基因蜂群点图拼图绘制

有的时候我们需要对配对样品进行某些操作或事件前后的宏观比较，以及各样本单独的变化趋势（如进行治疗前后样本的某指标变化，或给药处理前后平行样本间某基因的变化趋势，或同一病人不同组织样本间某基因的表达情况展示），这时候就可以用到连线散点图了。找到了一篇文章利用它进行了数据阐释：

配对散点连线图

该文章使用了多个GEO数据展示化疗前后患者的免疫相关预后评分，不仅展示了组间在化疗前后的总体改变趋势，配对散点间的连线也展示了个体在化疗前后的评分反应。

致谢文章：Wang, S., Q. Zhang, C. Yu, Y. Cao, Y. Zuo and L. Yang (2020). "Immune cell infiltration-based signature for prognosis and immunogenomic analysis in breast cancer." Brief Bioinform.

下面就尝试来用前文介绍的TCGA白人LUSC肺鳞癌mRNA-seq公共数据复制一下图片大致的形式。因为使用TCGA数据，下载的样本点比较多，为了减少散点重合的情况，接下来介绍使用散点的jitter形式，即散点在横向也展开一定距离，图片的可视化效果更好。

数据获取

基于第一篇文章从TCGA数据库下载并整合清洗高通量肿瘤表达谱-临床性状数据，我们下载并清洗TCGA数据库中white人种的LUSC肺鳞癌mRNA-seq转录组counts数据和FPKM数据。

随后根据第二篇文章TCGA数据整合后进行DESeq2差异表达分析和基于R的多种可视化对counts数据进行了基于DESeq2的差异分析。

现在假设我们已经获得：
（样本1到344为cancer，345到386为normal）
（1）resSigAll: the subset object generated by DESeq2 containing info about all differentially expressed genes.
（2）condition_table: the data frame defining the sample_type and recording the TCGA_IDs and submitter_id of each TCGA sample. It will be used for grouping.
（3）expr_vst: vst transformed normalized counts matrix of genes and samples generated by DESeq2. It is the raw material of downstream visualization.

需要R包：

ggplot2 (作图)、DESeq2 (从差异分析对象中提取一些数据)、customLayout和gridExtra（ggplot2的拼图），和dplyr (进行文字处理)。

install.packages('ggplot2')
install.packages('DESeq2')
install.packages('customLayout')
install.packages('dplyr')
install.packages('gridExtra')
library('ggplot2')
library('DESeq2')
library('customLayout')
library('dplyr')
library('gridExtra')

从所有样本中仅选取配对样本，并建立用于绘图的数据表：

TCGA的正常样本数据也是从部分患者的癌旁组织获取，所以有癌旁组织就会有癌组织，而某些癌组织没有相应的癌旁组织。而TCGA样本编号的规则此前也讲过，配对的癌旁和癌组织样本,TCGA ID前12位字符（也就是submitter ID）是完全相等的。

首先使用condition_table中的submitter_id列筛选匹配样本到新的数据框中。然后用sample函数随机选取四个差异表达基因在配对样本中的vst transformed normalized counts值用于表达水平散点绘制（储存在resSigAll@rownames中）。

对每个样本加上sample_type列，和grouping_info列（也就是submitter_id，每个normal-cancer对间的submitter_id相等，用于连线）。将sample_type列因子化。（告诉ggplot2 x轴分组排列的顺序）

paired_condition_table <- dplyr::filter(condition_table, submitter_id %in% condition_table$submitter_id[condition_table$sample_type=='normal'])
set.seed(2999)
paired_for_trend_vis <- t(expr_vst[sample(resSigAll@rownames, 4) ,paired_condition_table$TCGA_IDs])
paired_for_trend_vis <- data.frame(paired_for_trend_vis, 'grouping_info'=paired_condition_table$submitter_id,
                                   'sample_type'=paired_condition_table$sample_type, stringsAsFactors = F)
paired_for_trend_vis$sample_type <- factor(paired_for_trend_vis$sample_type, levels = c('normal','cancer'), ordered = T)

一个傻瓜但不完美的办法：

直接使用ggplot2包中的geom_jitter函数，将散点横向排开。这里我只可视化一个基因。
ggplot函数输入新建立的数据表。geom_line函数绘制配对散点间的连线，补充group参数输入组间配对，color参数确定连线颜色，alpha参数确定连线透明度。geom_jitter函数设定散点抖动模式，补充color参数确定sample_type组间点的颜色，width确定横向抖动的宽度。theme确定背景分割线白色。

ggplot(paired_for_trend_vis,aes(x=sample_type,y=DLX5))+theme_bw()+
  geom_line(aes(group=grouping_info), color="black",alpha=0.4)+
  geom_jitter(aes(color=sample_type), width = 0.1)+
  scale_color_manual(values = c('blue','orange'))+
  theme(panel.grid = element_line(colour = 'white'))

得到图片如下，可见尽管散点有了抖动，但组间连线并未随散点发生改变，仍旧是组间中线处互连。

an imperfect paired scatterplot

解决办法：

首先我们依旧获得配对数据并随机选取4个基因表达值，整合成用于绘图的数据框。

paired_condition_table <- dplyr::filter(condition_table, submitter_id %in% condition_table$submitter_id[condition_table$sample_type=='normal'])
set.seed(2999)
paired_for_trend_vis <- t(expr_vst[sample(resSigAll@rownames, 4) ,paired_condition_table$TCGA_IDs])
paired_for_trend_vis <- data.frame(paired_for_trend_vis, 'grouping_info'=paired_condition_table$submitter_id,
                                   'sample_type'=paired_condition_table$sample_type, stringsAsFactors = F)

随后的思路是这样的：相比于前面直接将sample_type作为因子型变量离散分组，这里我们将sample_type先转化为连续数字变量（normal=1,cancer=2），然后用jitter函数微调各sample_type的连续数字值，使其在连续x轴上表现为抖动。而同时保留原分类变量sample_type，用于点的分组上色。

paired_for_trend_vis$sample_type <- factor(paired_for_trend_vis$sample_type, levels = c('normal','cancer'), ordered = T)
paired_for_trend_vis <- data.frame(paired_for_trend_vis, 'group_jitter'=as.numeric(paired_for_trend_vis$sample_type))
paired_for_trend_vis$group_jitter <- jitter(paired_for_trend_vis$group_jitter, 0.4)

整理结束后得到的paired_for_trend_vis数据框格式是这样的：

paired_for_trend_vis

这里依旧使用拼图包customLayout进行ggplot2绘图对象的拼图，因此我们建立一个函数，输入一个基因名可产生对应绘图对象。

注：x轴不再是分类sample_type，而是连续值group_jitter。然后使用 scale_x_continuous函数将将x坐标轴上的1、2的刻度改为'normal'和'cancer'字样。

draw_the_plot <- function(intended_gene){
  ggplot(paired_for_trend_vis,aes(x=group_jitter,y=get(intended_gene)))+
    geom_line(aes(group=grouping_info), colour='black',alpha=0.4)+
    geom_point(aes(colour=sample_type), size=1.5)+scale_color_manual(values = c('blue','orange'))+
    scale_x_continuous(breaks = c(1,2), labels=c('1'='normal','2'='cancer'), limits = c(0.7,2.3))+
    theme_bw()+theme(panel.grid = element_line(colour = 'white'), legend.position = 'none')+xlab('groups')+
    ylab(paste0('vst transformed counts of ', intended_gene))
}

然后用lay_new函数新建画布，画布中的图片按matrix的数字排列：高1格，宽4格。使用lapply函数获得含多个ggplot2绘图对象的list作为lay_grid函数的输入。

lay <- lay_new(mat=matrix(1:4, ncol = 4), widths=c(2,2,2,2),heights = 1)
multiple_jitters <- lapply(colnames(paired_for_trend_vis)[1:4], draw_the_plot)
lay_grid(grobs = multiple_jitters, lay = lay)

最终获得的图片输出如下：

the collage containing 4 paired scatterplot showing trends of gene expressions

这样就完美地解决了之前连线和抖动点之间没有横向匹配上的尴尬~

喜欢的朋友就学起来，点个赞吧！

配对样本基因表达趋势：优化前后的散点连线图+拼图绘制