FIND tutorial

从FIND包中拷贝出来的
作者见下:


title: 'FIND: difFerential chromatin INteractions Detection using a spatial Poisson process'
author: "Mohamed Nadhir Djekidel djek.nad@gmail.com"
package: FIND
date: 'r Sys.Date()'
output:
BiocStyle::html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{FIND: difFerential chromatin INteractions Detection using a spatial Poisson process}
%\VignetteEngine{knitr::knitr}
%\VignetteEncoding{UTF-8}


BiocStyle::markdown()
knitr::opts_chunk$set(dev="png",fig.show="hold",
               fig.width=8,fig.height=4.5,fig.align="center",
               message=FALSE,collapse=TRUE)
set.seed(1)
suppressMessages(require(png))
suppressMessages(require(grid))

Introduction

Improvements to the genome-wide chromatin conformation analysis techniques
enabled scientists explore the chromatin structure at an unprecedent scale.
labs from different backgrounds are starting to adopt the high-resolution Hi-C protocols
in their studies, which led to an increasing amount of available Hi-C data.

With more data available, scientists would like perform comparative analysis
between the available Hi-C samples,
i.e: finding differential chromatin interactions (DCI).

Previously developed differential chromatin interaction detection techniques
were designed with a low-resolution Hi-C in mind (40kb or more). Thus,
they based their models on the assumption of the independence between
the different interacting pairs (Shown in Fig.1a), which is logical
as the low-resolution Hi-C maps capture interactions between relatively
distant chromatin fibers.

However, when dealing with high-resolution Hi-C (less than 10kb or so) the
independence assumption between the adjacent "sequares" in the Hi-C matrix does
not hold, due to the polymer nature of the chromatin that imply the dependence
between adjacent chromatin fibers (Fig.1b).

To overcome this issue, we developed a new differential chromatin interaction
detection technique that takes into consideration the dependency between
neighboring loci. As depicted in this Fig.1b, when there is a change
in the interaction frequency between the bins (i,j) we expect this movement
to be reflected on the interaction frequencies in the bins around it.

img <- readPNG("Figure1.png")
grid.raster(img)

Principle behind FIND

This part maybe boring, but I put it as summary for people who want to try the
method but didn't go through the paper yet.

FIND is designed for problems in which we have two biological conditions c\in \{1,2\}, each with a certain number of Hi-C replicates \{n_c\}.

FIND is based on the simple idea that the establishment of a chromatin loop between two loci i and j, a "mountain shape" structure should be formed (due to the spatial dependency) in the 3D space defined by the 2D coordinates of the loci and their interaction frequency. If there is a change in the interaction frequency, the height of the tip of the mountain (shown in Fig.2 as \mu_1) will change.

This change, will also influence the height of the points located in the neighborhood of \mu_1 to change also. If this change is significance, the density of the points around \mu_1 should change between the two biological conditions.

img <- readPNG("Hic_3D.PNG")
grid.raster(img)

To estimate this change, we first define a window of size W around our point
of interest (i,j). For example if the window is of size 3, it will include
all the W^2 points located in the coordinates defined by the
Cartesian product \{i-1,i,i+1\} \times \{j-1,j,j+1\}. In this annimation
generated using conv_arithmetic
we show how FIND scans through the Hi-C matrix using a window of size 3.

For the pairwise interactions located at the borders of the matrix, only the
neighbours located in the matrix. So for the interaction (1,1) only 4 neighbours
will be considered.

[图片上传失败...(image-9c7290-1678796374534)]

Then, we compare the content of this window between the two conditions.

To do this comparison, we fist rank the bins in each window and in each replicate according to their 3D distance to the pairwise interaction (i,j,\mu_1), such as \mu_1 is mean interaction frequency of (i,j) in the first condition.

Let x_{n,k}^{(c )} be the distance from the k^{th} nearest neighbor to \mu_1 in the n^{th} replicate of condition c. The probability to observe the k-nearest neighbour at a distance x_{n,k}^{(c )} is euqal to (Burguet et al. 2009, 2011; Burguet and Andrey 2014):
f(x_{n,k}^{(c )})= \frac{(4\lambda^{(c )}\pi)^k}{3^{k-1}(k-1)!} (x_{n,k}^{(c )})^{k-1} exp(\lambda^{(c )}\frac{4\pi}{3}x_{n,k}^{(c )})

The equation, looks a bit confusing, but if you read it carefully, you can notice that the only unkown is \lambda^{(c )}, which represents the estimated density of the k^{th} nearest neighbors around \mu_1 in the replicates of the condition c. To estimate \lambda^{(c )}, we just need to calculate the maximum likelihood given by this equation:
\lambda^{( c )}(\mu_1)=\frac{n_ck-1}{\frac{4\pi}{3}\sum_{n=1}^{n_c}(x_{n,k}^{(c )})^3}

Using this formula, we can then estimate the density of the k-nearest neighbours (KNN) in the 1st condition around \mu_1, \lambda^{(1)}(\mu_1), and the density of the KNN of the 2nd condition around \mu_1, \lambda^{(2)}(\mu_1). The ratio between these two densities R_k(\mu_1) will tell us, if the KNNs change their distribution around \mu_1 or not. It can be shown that this ratio follows a Fisher distribution with 2n_1k and 2n_2k degree of freedome:
R_k(\mu_1) \sim FS(2n_1k,2n_2k)

Thus, we can calculate for each KKN a p-value that indicate it shows a differential distribution around \mu_1 or not. To avoid problems related to the variability between replicates, we also use \mu_2 (the mean frequency of (i,j) in the second condition), and do the same calculations. Hence, at the end we get 2W^2 p-values (W^2 p-values using \mu_1 as a reference and W^2 p-values using \mu_2 as reference). We need then to combine these p-value to calculate a final combined p-value.

Different p-value combination methods exist; however, each p-value combination method has a different hypothesis setting. Some are very stringent and require all the p-values to be significant, others are very loose and require one or more p-values to be significant in order for the combined p-value to be significant. For a short review, you can check (Lun-Ching Chang et al, 2013, BMC Bioinformatics). In our case, we use the r-th ordered p -value (rOP) which requires that at least r out the 2W^2 p-values to be significant in order for the combined p-value to be significant. This statistic will give us more flexibility as the user can select the level of stringency he wants to apply.

Basic usage

loading required libraries

to accelerate data processing and use as less memory as possible, FIND requires the Matrix package. For visualization, in this vignette we prefer to use the rasterVis package. The mvtnorm is needed only in the simulation. We also need the gridExtra to arrange the plots.

require(FIND)
require(Matrix)
require(mvtnorm)
library(raster)
require(rasterVis)
require(gridExtra)
require(HiTC)
require(edgeR)
require(ggsci)

A simulation example

Generate simulated Hi-Cs

To demonstrate the functionalities of FIND, we use this simulation example in
which we load a part of the K562 matrix generated by Rao et al, 2014 and use it
as a base to generate two Hi-C matrices.

In this example, we will generate 2 replicates in each condition. The simulated pairwise interactions will have a 0.1% chance to be diffrential, with 40% of them will be showing a intrease of 5 fold-change
in the interaction frequency in the second condition. We remove the noisy interactions that have a dispersion larger than 0.1. This can be done using the generateSimulation function

data(K562_matrix_small)
simRes <- generateSimulation(K562_matrix_small[1:200,1:200],
                             pDiff = 0.01,pUp = 0.6,
                             foldDiff = 3,
                             extreme.dispersion = 0.1, 
                             nrep1=2,nrep2=2)

The generateSimulation function returns a list that contains: the simulated Hi-C samples and a binarry matrix that contains the position of the DCIs.

simRes

For example, we can use the code bellow to display the simulated data

p1 <- levelplot(as.matrix(simRes@Hic1[[1]]), par.settings=YlOrRdTheme,main="Cond1 Rep1")
p2 <- levelplot(as.matrix(simRes@Hic1[[2]]), par.settings=YlOrRdTheme,main="Cond1 Rep2")
p3 <- levelplot(as.matrix(simRes@Hic2[[1]]), par.settings=YlOrRdTheme,main="Cond2 Rep1")
p4 <- levelplot(as.matrix(simRes@Hic2[[2]]), par.settings=YlOrRdTheme,main="Cond2 Rep2")
p5 <- levelplot(as.matrix(simRes@trueStates), par.settings= YlOrRdTheme, main="DCIs position")


# Here we use ncol=1 due to the limite width of page, 
# if you have a large screen you can put 2 plots in a column
grid.arrange(p1,p2,p3,p4,p5, ncol=2)

Instead you can just display the log2 fold-change between the samples

## Plot the log2 Fold-change between the two conditions
m1 <- simRes@Hic1[[1]] + simRes@Hic1[[2]]
m2 <- simRes@Hic2[[1]] + simRes@Hic2[[2]]
levelplot(as.matrix(log2((m1+0.01)/(m2+0.01))), par.settings = BuRdTheme, main="log2(Cond1/cond2)")

Finding Differential Chromatin Interactions (DCIs)

Calling DCI from a list of Hi-C matrices

In many cases you want to try some newly published Hi-C normalization methods or you developped your own Hi-C processing methods, then you'll probably load it as an R matrix. In this case you can directly call the functin getDiff.

In this example, we will continue with the simulated Hi-C data. Here, we pass two lists of Sparce-matrices and we specifiy that we will be using a window of 3 and we want to calculate the p-value on the

DCis <- getDCIs_fromSim(x = simRes,
                        windowSize =  3,
                        alpha =  0.7, 
                        method = "hardCutoff",
                        qvalue = 1e-2,
                        isrOP_qval =FALSE)

equivantly, if we have two list of matrices we could have loaded them using the getDCIs_fromMat

DCis <- getDCIs_fromMat(Hic1 = simRes@Hic1, 
                        Hic2 = simRes@Hic2,
                        windowSize =  3,
                        alpha =  0.7, 
                        method = "hardCutof",
                        qvalue = 1e-2,
                        isrOP_qval =FALSE)

We can plot the q-values

p1 <- levelplot(as.matrix(-log10(DCis@qvals)), par.settings= YlOrRdTheme,main="-log10(qvalue) after cutoff")
p2 <- levelplot(as.matrix(DCis@DCIPos), par.settings= YlOrRdTheme,main="Predicted DCIs")
p3 = levelplot(as.matrix(log2((m1+0.01)/(m2+0.01))), par.settings = BuRdTheme, main="log2(Cond1/cond2)")
grid.arrange(p3,p1,p2,ncol=2)

After p-value correction diagonal-wise, we select only the regions that cluster together, to avoid any noisy regions.
Probably it is not a very statistical way that doing this, as we are not really sure
how it controls for false-positives, but it tends to select regions that show a big structural change.
Further down-stream analysis should be done.

We can see that it is a hard to select a cutoff, thus, it is better to use a kind of a semi-automatic way to select the q-value.
One solution is to use quantile regression to select the best cutoff given the distance.

Bellow we show the distribution of the q-values given the different distances (THIS IS DEPRECATED AS IT IS VERY SLOW APROACH)


smry <- summary(DCis@qvals)
smry$x <- -log10(smry$x)
smry$dist <- smry$j - smry$i

## remove very low qvals as they are the majority
#smry <- subset(smry , x> -log10(pbeta(0.1,10,18-10+1)) )

o <- order(smry$dist)

smry <- smry[o,]

X <- model.matrix(smry$x ~ splines::bs(smry$dist, df=9))
  
  smoothScatter(smry$dist, smry$x, xlab = "Distance",ylab="-log10(qvalue)")

  quants =seq(0.9,0.99,.01)
  cols = ggsci::pal_d3()(length(quants))
  for(j in 1:length(quants)){
    thr  <- quantreg::rq(x ~ splines::bs(dist,df = 9), data = smry,tau = quants[j])
    thr.fit <- X %*% thr$coef
    smry$threshold <- thr.fit
    lines(smry$dist,smry$threshold,col=cols[j],lwd=2)
  }

  legend(max(smry$dist)-20, max(smry$x)-20, legend=quants,
       col=cols, lty=1, cex=0.5)


As in this simulated examples, the DCIs were generated independently from the interaction span, we notice that the different quantiles show a kind off uniform distribution, indicating the independence.

If the quantile regression is selected (method = "quantreg"), the 99th percentile is selected by defaults.
We can select a lower cutoff using the quantile parameter.

DCis <- getDCIs_fromSim(x =  simRes, windowSize =  3,alpha =  0.7, 
                        method = "quantreg",quantile = 0.99)

We can compare our results to true location of DCIs.

compareWithTrueDCI(DCis@DCIPos,simRes@trueStates)

Try Parallel processing


DCis_par = getDCIs_parallel(Hic1 = simRes@Hic1,
                       Hic2 = simRes@Hic2,
                       windowSize = 3,
                       alpha = 0.7, 
                       chunk_size=100,
                       method = "quantreg",
                       quantile = 0.99,
                       qvalue = 1e-6,
                       isrOP_qval = F,
                       nbProcessors = 4,
                       verbose =  TRUE)

Check that we are getting the correct qvalues

Here we want to make sure that the qvalues calculated in the paraallel version
are similar to the ones obtaines in the sequential version.

smry_par = summary(DCis_par@qvals)
smry_seq = summary(DCis@qvals)
colnames(smry_par)[3] = "qval_par"
colnames(smry_seq)[3] = "qval_seq"

mrg = merge(smry_par, smry_seq, all=TRUE)

# a few sites will show some difference, generally very slight at the decimal positions.
nrow(subset(mrg, qval_par != qval_seq))

loading genomewide chromatin interactions

FIND also provides a function to perform genomewide diffrential chromatin
interaction detection. The internals of the function depend on the HiTC package.
For the moment we support matrix files that are in the same format as the one
generated by HiC-Pro.

Two types of files are needed:

  • A list of Hi-C matrix files: which are a three-columns tab-separated file,
    where the first two columns indicate the bin ID and the last column indicated the
    interaction frequency.
151 151 68.372895
151 152 51.179648
151 153 37.575510
151 154 17.432925
151 155  8.201022
151 156 18.941581
  • Bin indices: Which is a four-columned tab-separated files that indicated
    the genomic position of the bins in the first 3 columns and the binID in the 4-th
    column:
chr1      0  20000  1
chr1  20000  40000  2
chr1  40000  60000  3
chr1  60000  80000  4
chr1  80000 100000  5
chr1 100000 120000  6

Suppose we have a folder with 4 Hi-C matrices, we can first load them using
the loadHicExperiment function:

matrices_lst <- as.list(list.files("HiC_results/",pattern = "*iced.matrix"))
binIndices = "MM9_5000_abs.bed"
gw_interactions <- loadHicExperiment(matrices_lst,binIndices = binIndices,
                                     groups =c(1,1,0,0),type = "HicPro",
                                     genome = "mm9")

Next, we can run the differential analysis. Chromosomes are processed sequenciantially,
but the interaction matrix of each chromosome is processed sepraratly.
This choise is to euilibrate the tradeoff between speed and memory,
we can parallalize everything, but the number of processes will explode.
If you are running of a cluster though, you can generate a HicExperiment object
for each interaction, and processes them in different nodes.

Generally a chunk_size of 300~500 bins, gives a good trade off. You fist test
on the smaller chromosomes and pick the paramters that suits you.

gw_interactions <- getDCIs_HicExpr(gw_interactions,
                                   windowSize = 3,
                                   alpha = 0.7, 
                                   chunk_size=500,
                                   method = "quantreg",
                                   quantile = 0.99,qvalue = 1e-6,isrOP_qval = F,
                                   nbProcessors = 8,verbose =  TRUE)

Loading .hic data

the .hic format is one of the widely adopted chromatin interaction storage
format. Hopefully, Aiden's group have developed a wrapper function
straw available in their github repository.

We took advantage of the available C++ wrapper and interated it into FIND,
unfortunatly due to the way C++ long type are codded between windows and linux,
the code does not work in windows and may lead to a system crash.

Bellow is an example showing how deal with such data using the readJuiceBoxFormat
function provided by FIND.

# We need to  know the chromosomes lengths
require(BSgenome.Hsapiens.UCSC.hg19)

seqlen = seqlengths(BSgenome.Hsapiens.UCSC.hg19)[1:22]

# We load two K562 replicates and GM12878
hic_mats = list(K562_reps1 = "GSM1551620_HIC071_30.hic",
               K562_reps2 = "GSM1551623_HIC074_30.hic",
               GM12878_reps1 = "GSM1551574_HIC025_30.hic",
               GM12878_reps2 = "GSM1551575_HIC026_30.hic")

As it have been reported in many diffrential chromatin interaction analysis,
such as in diffHic, regioins with Copy Number Variation are hard to interprete
and the observed different might be just due to CNV. One can use some methods
to correct for this bias, or can ignore these regions directly.

Here we will just remove these regions. We will use the already calculated CNV
regions for the K562 downloaded from GSM999287


## We use this function as a proxy to call
fct = function(chrom,seqlen, hicfile, resolution){

  intdata = readJuiceBoxFormat(hicfile,
                               chromosome=chrom,
                               normMethod="VC_SQRT",
                               resolution=resolution)

  smry <- summary(intdata)
  bins <- tileGenome(seqlen[chrom],tilewidth=resolution, cut.last.tile.in.chrom=TRUE)
  bins$name <- 1:length(bins)
  
  # correct size
  intdata <- sparseMatrix(i=smry$i,j=smry$j,x=smry$x, dims=c(length(bins),length(bins)))

  HTCexp(intdata, bins, bins) 
  }


## Read the list of .hic matrices
hic_matrices = list()
for(hic in names(hic_mats)){
  
  # here we are using chromosom 22, just to test
  chroms =names(seqlen)[22]
  
  hic_lst  = mclapply(chroms,fct,
  hicfile= hic_mats[[hic]],
  resolution=5000,
  seqlen=seqlen,mc.cores=4
  ) 

  hic_matrices[[hic]] = HTClist(hic_lst)
}

## We can use the MA-plot technique to do a joint normalization between replicates



# define experiment design
groups=c(0,0,1,1)

# Create a HicExperiment object
hicCol <- new("HicExperiment",
                hic_matrices= hic_matrices, 
                groups=groups,
                DCIs=list(), 
                genome="hg19")

Ideally, we should do inter-replicates normalization
otherwise, many highly variable regions will be reported as DCI, some packages
such as HiCcompare already implement the inter-replicates normalization using
the M-Distance plot for normalization. Another good strategy used by HiCcompare,
is the weighted normalization of the M values (log2(f1/f2)) using the A values
(log2((f1+f2)/2)) as low frequency values tend to be very variable and noisy.

# lunch parallel processing.
hicCol <- getDCIs_HicExpr(HicExpr=hicCol,
                                   windowSize = 3,
                                   alpha = 0.7, 
                                   chunk_size=500,
                                   method = "quantreg",
                                   quantile = 0.99,
                                   qvalue = 1e-6,
                                   isrOP_qval = F,
                                   nbProcessors = 8,
                                   verbose =  TRUE)

Displaying Diffrential Chromatin interactions

Some considerations when preparing Hi-C matrices

Hi-C matrices replicability

One concern when dealing with Hi-C replicates is the replicability of the
results. In our manuscript we used the MA-normalization,
which is ok, but it was not designed for the Hi-C context. Hopefully,
some new strategies are getting developped to deal with this concern,
for example we can refer the user to methods such as :

For more details you can the preprint by Nobel's group (https://www.biorxiv.org/content/early/2018/02/05/188755)

Dealing with Cancer samples

Copy number variation can be a source of bias in Cancer cell samples and can
introduce many false positive differential interactions. One option is to try
to correct for this bias using tools shuch as :

Some implementation strtergies

Hi-C matrices are very large, their processing can take time and space. When
developping FIND, we tried to uses different tricks to improve the processing
seed. The usage of a the sparse matrix class dgCMatrix

Potentially Comming Features

  • loading of different formats : There is a juicebox format loading function, but works only on linux for the moment.

Session Info

sessionInfo()
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,417评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,921评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,850评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,945评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,069评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,188评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,239评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,994评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,409评论 1 304
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,735评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,898评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,578评论 4 336
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,205评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,916评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,156评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,722评论 2 363
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,781评论 2 351