生信分析不只是跑下软件 | TADCompare 差异分析要留心

序言

生信分析真的是跑下别人的软件就ok了么？很明显这样认为有些肤浅。先不说软件好不好用的事，即使勉强能够正常运行得到结果就万事大吉了么？这里也不讨论数据好坏的问题，得到的结果就是准确的么？下面以之前提到过的软件TADCompare来说明，生信分析有时候并不是看到的那么简单！还不了解TADCompare软件的朋友可以戳这里[TADCompare：差异TAD分析]。

案例

TADCompare函数可以直接利用两个HiC矩阵得到差异TAD，这源自于其内嵌类似TAD calling的功能，可以得到可能的TAD边界以及差异情况。除此之外，软件也接受提供预先定义的TAD，以三列bed格式提供给参数pre_tads。bed格式大家应该不陌生，但这里提供的bed应该包含列名，至于为什么要加列名后面再说。格式类似如下：

chr  start    end
chr1 49200000 50150000
chr1 50150000 51250000

TADCompare函数最终使用的矩阵格式类似如下，行列名都是相应分辨率下每个bin的染色体起始位置。

         16050000 16100000 16150000 16200000 16250000
16050000       12        0        0        0        0
16100000        0        0        0        0        0
16150000        0        0        0        0        0
16200000        0        0        0        4        0
16250000        0        0        0        0        0

对于矩阵行列名，TADCompare函数在参数说明中也明确的提醒the column names must correspond to the start point of the corresponding bin。如果要用软件就得按别人要求来，否则可能都没法正常运行。

Usage:

     TADCompare(cont_mat1, cont_mat2, resolution = "auto", z_thresh = 2,
       window_size = 15, gap_thresh = 0.2, pre_tads = NULL)

Arguments:

cont_mat1: Contact matrix in either sparse 3 column, n x n or n x (n+3)
          form where the first three columns are coordinates in BED
          format.  See "Input_Data" vignette for more information.  If
          an n x n matrix is used, the column names must correspond to
          the start point of the corresponding bin. Required.

假设文件格式都没有问题了，就可以仿照软件文档提供的示例来分析了。类似下面的代码，就是用自己输入的矩阵和TAD数据来分析差异TAD：

bed1 <- 'sample1_tad.bed'
bed2 <- 'sample2_tad.bed'
Combined_Bed <-  list(bed1, bed2)
TD_Compare <- TADCompare(mat1, mat2, resolution = 50000, pre_tads = Combined_Bed)

至此，得到结果依然是理所当然，如果还是符合预期的结果那更是心里美滋滋。倘若，结果差强人意，好似刨了半天地什么都没种下，就挖坑了这谁受得了？高低也得整点理由说服自己放弃。
然后，不知道是什么神奇的力量驱使，把矩阵的行列名换成每个bin的染色体结束位置。与想象的一致，毫无意外的正常运行得到结果。有些意外的是这次得到较多的差异TAD，这个时候就有些不安了，这结果准确么？
其实，也许只有矩阵文件的时候，用bin的起始或结束位置差别不大，但加入bed文件结果就不一样了。为什么不一样呢？因为bed也有开始和结束两个位置。那么，软件使用是哪一个呢？咱们一起来一看究竟。

if (!is.null(pre_tads)) {
        pre_tads = lapply(pre_tads, as.data.frame)
        TAD_Frame = TAD_Frame %>% filter(Boundary %in% bind_rows(pre_tads)$end) %>%
            mutate(Differential = ifelse(abs(Gap_Score) > z_thresh,
                "Differential", "Non-Differential"), Enriched_In = ifelse(Gap_Score >
                0, "Matrix 1", "Matrix 2")) %>% arrange(Boundary) %>%
            mutate(Bound_Dist = abs(Boundary - lag(Boundary))/resolution) %>%
            mutate(Differential = ifelse((Differential == "Differential") &
                (Bound_Dist <= 5) & !is.na(Bound_Dist) & (Enriched_In !=
                lag(Enriched_In)) & (lag(Differential) == "Differential"),
                "Shifted", Differential)) %>% mutate(Differential = ifelse(lead(Differential) ==
            "Shifted", "Shifted", Differential)) %>% dplyr::select(-Bound_Dist)
    }

上面就是提供bed文件时执行的代码，中间有这么一句bind_rows(pre_tads)$end，作用是将多个bed整合为一个数据框，也可以看到整合的是bin的结束位置。回到前面提到的软件明确指出矩阵的行列名用bin起始位置，上面结果不一致的问题迎刃而解，但出现了另一个疑问：软件为什么要这么设定，具体的工作原理是什么？

结语

软件有详细的文档固然是好的，能说明原理更是求之不得，但这样的事是可遇不可求的，毕竟最终的解释权在软件作者手里，作为使用者还需谨慎一些。所以，生信分析并没有想象中的那么简单，只能说知道越多越觉得自己无知。。。

生信分析不只是跑下软件 | TADCompare 差异分析要留心

序 言

案 例

结 语

推荐阅读更多精彩内容

序言

案例

结语