从FIND包中拷贝出来的
作者见下:
title: 'FIND: difFerential chromatin INteractions Detection using a spatial Poisson process'
author: "Mohamed Nadhir Djekidel djek.nad@gmail.com"
package: FIND
date: 'r Sys.Date()
'
output:
BiocStyle::html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{FIND: difFerential chromatin INteractions Detection using a spatial Poisson process}
%\VignetteEngine{knitr::knitr}
%\VignetteEncoding{UTF-8}
BiocStyle::markdown()
knitr::opts_chunk$set(dev="png",fig.show="hold",
fig.width=8,fig.height=4.5,fig.align="center",
message=FALSE,collapse=TRUE)
set.seed(1)
suppressMessages(require(png))
suppressMessages(require(grid))
Introduction
Improvements to the genome-wide chromatin conformation analysis techniques
enabled scientists explore the chromatin structure at an unprecedent scale.
labs from different backgrounds are starting to adopt the high-resolution Hi-C protocols
in their studies, which led to an increasing amount of available Hi-C data.
With more data available, scientists would like perform comparative analysis
between the available Hi-C samples,
i.e: finding differential chromatin interactions (DCI).
Previously developed differential chromatin interaction detection techniques
were designed with a low-resolution Hi-C in mind (40kb or more). Thus,
they based their models on the assumption of the independence between
the different interacting pairs (Shown in Fig.1a), which is logical
as the low-resolution Hi-C maps capture interactions between relatively
distant chromatin fibers.
However, when dealing with high-resolution Hi-C (less than 10kb or so) the
independence assumption between the adjacent "sequares" in the Hi-C matrix does
not hold, due to the polymer nature of the chromatin that imply the dependence
between adjacent chromatin fibers (Fig.1b).
To overcome this issue, we developed a new differential chromatin interaction
detection technique that takes into consideration the dependency between
neighboring loci. As depicted in this Fig.1b, when there is a change
in the interaction frequency between the bins we expect this movement
to be reflected on the interaction frequencies in the bins around it.
img <- readPNG("Figure1.png")
grid.raster(img)
Principle behind FIND
This part maybe boring, but I put it as summary for people who want to try the
method but didn't go through the paper yet.
FIND
is designed for problems in which we have two biological conditions , each with a certain number of Hi-C replicates .
FIND
is based on the simple idea that the establishment of a chromatin loop between two loci and , a "mountain shape" structure should be formed (due to the spatial dependency) in the 3D space defined by the 2D coordinates of the loci and their interaction frequency. If there is a change in the interaction frequency, the height of the tip of the mountain (shown in Fig.2 as ) will change.
This change, will also influence the height of the points located in the neighborhood of to change also. If this change is significance, the density of the points around should change between the two biological conditions.
img <- readPNG("Hic_3D.PNG")
grid.raster(img)
To estimate this change, we first define a window of size around our point
of interest . For example if the window is of size 3, it will include
all the points located in the coordinates defined by the
Cartesian product . In this annimation
generated using conv_arithmetic
we show how FIND scans through the Hi-C matrix using a window of size 3.
For the pairwise interactions located at the borders of the matrix, only the
neighbours located in the matrix. So for the interaction (1,1) only 4 neighbours
will be considered.
[图片上传失败...(image-9c7290-1678796374534)]
Then, we compare the content of this window between the two conditions.
To do this comparison, we fist rank the bins in each window and in each replicate according to their 3D distance to the pairwise interaction , such as is mean interaction frequency of in the first condition.
Let be the distance from the nearest neighbor to in the replicate of condition . The probability to observe the k-nearest neighbour at a distance is euqal to (Burguet et al. 2009, 2011; Burguet and Andrey 2014):
The equation, looks a bit confusing, but if you read it carefully, you can notice that the only unkown is , which represents the estimated density of the nearest neighbors around in the replicates of the condition . To estimate , we just need to calculate the maximum likelihood given by this equation:
Using this formula, we can then estimate the density of the k-nearest neighbours (KNN) in the 1st condition around , , and the density of the KNN of the 2nd condition around , . The ratio between these two densities will tell us, if the KNNs change their distribution around or not. It can be shown that this ratio follows a Fisher distribution with and degree of freedome:
Thus, we can calculate for each KKN a p-value that indicate it shows a differential distribution around or not. To avoid problems related to the variability between replicates, we also use (the mean frequency of in the second condition), and do the same calculations. Hence, at the end we get p-values ( p-values using as a reference and p-values using as reference). We need then to combine these p-value to calculate a final combined p-value.
Different p-value combination methods exist; however, each p-value combination method has a different hypothesis setting. Some are very stringent and require all the p-values to be significant, others are very loose and require one or more p-values to be significant in order for the combined p-value to be significant. For a short review, you can check (Lun-Ching Chang et al, 2013, BMC Bioinformatics). In our case, we use the r-th ordered p -value (rOP) which requires that at least out the p-values to be significant in order for the combined p-value to be significant. This statistic will give us more flexibility as the user can select the level of stringency he wants to apply.
Basic usage
loading required libraries
to accelerate data processing and use as less memory as possible, FIND
requires the Matrix
package. For visualization, in this vignette we prefer to use the rasterVis
package. The mvtnorm
is needed only in the simulation. We also need the gridExtra
to arrange the plots.
require(FIND)
require(Matrix)
require(mvtnorm)
library(raster)
require(rasterVis)
require(gridExtra)
require(HiTC)
require(edgeR)
require(ggsci)
A simulation example
Generate simulated Hi-Cs
To demonstrate the functionalities of FIND
, we use this simulation example in
which we load a part of the K562 matrix generated by Rao et al, 2014 and use it
as a base to generate two Hi-C matrices.
In this example, we will generate 2 replicates in each condition. The simulated pairwise interactions will have a 0.1% chance to be diffrential, with 40% of them will be showing a intrease of 5 fold-change
in the interaction frequency in the second condition. We remove the noisy interactions that have a dispersion larger than 0.1. This can be done using the generateSimulation
function
data(K562_matrix_small)
simRes <- generateSimulation(K562_matrix_small[1:200,1:200],
pDiff = 0.01,pUp = 0.6,
foldDiff = 3,
extreme.dispersion = 0.1,
nrep1=2,nrep2=2)
The generateSimulation
function returns a list that contains: the simulated Hi-C samples and a binarry matrix that contains the position of the DCIs.
simRes
For example, we can use the code bellow to display the simulated data
p1 <- levelplot(as.matrix(simRes@Hic1[[1]]), par.settings=YlOrRdTheme,main="Cond1 Rep1")
p2 <- levelplot(as.matrix(simRes@Hic1[[2]]), par.settings=YlOrRdTheme,main="Cond1 Rep2")
p3 <- levelplot(as.matrix(simRes@Hic2[[1]]), par.settings=YlOrRdTheme,main="Cond2 Rep1")
p4 <- levelplot(as.matrix(simRes@Hic2[[2]]), par.settings=YlOrRdTheme,main="Cond2 Rep2")
p5 <- levelplot(as.matrix(simRes@trueStates), par.settings= YlOrRdTheme, main="DCIs position")
# Here we use ncol=1 due to the limite width of page,
# if you have a large screen you can put 2 plots in a column
grid.arrange(p1,p2,p3,p4,p5, ncol=2)
Instead you can just display the log2 fold-change between the samples
## Plot the log2 Fold-change between the two conditions
m1 <- simRes@Hic1[[1]] + simRes@Hic1[[2]]
m2 <- simRes@Hic2[[1]] + simRes@Hic2[[2]]
levelplot(as.matrix(log2((m1+0.01)/(m2+0.01))), par.settings = BuRdTheme, main="log2(Cond1/cond2)")
Finding Differential Chromatin Interactions (DCIs)
Calling DCI from a list of Hi-C matrices
In many cases you want to try some newly published Hi-C normalization methods or you developped your own Hi-C processing methods, then you'll probably load it as an R matrix. In this case you can directly call the functin getDiff
.
In this example, we will continue with the simulated Hi-C data. Here, we pass two lists of Sparce-matrices and we specifiy that we will be using a window of 3 and we want to calculate the p-value on the
DCis <- getDCIs_fromSim(x = simRes,
windowSize = 3,
alpha = 0.7,
method = "hardCutoff",
qvalue = 1e-2,
isrOP_qval =FALSE)
equivantly, if we have two list of matrices we could have loaded them using the getDCIs_fromMat
DCis <- getDCIs_fromMat(Hic1 = simRes@Hic1,
Hic2 = simRes@Hic2,
windowSize = 3,
alpha = 0.7,
method = "hardCutof",
qvalue = 1e-2,
isrOP_qval =FALSE)
We can plot the q-values
p1 <- levelplot(as.matrix(-log10(DCis@qvals)), par.settings= YlOrRdTheme,main="-log10(qvalue) after cutoff")
p2 <- levelplot(as.matrix(DCis@DCIPos), par.settings= YlOrRdTheme,main="Predicted DCIs")
p3 = levelplot(as.matrix(log2((m1+0.01)/(m2+0.01))), par.settings = BuRdTheme, main="log2(Cond1/cond2)")
grid.arrange(p3,p1,p2,ncol=2)
After p-value correction diagonal-wise, we select only the regions that cluster together, to avoid any noisy regions.
Probably it is not a very statistical way that doing this, as we are not really sure
how it controls for false-positives, but it tends to select regions that show a big structural change.
Further down-stream analysis should be done.
We can see that it is a hard to select a cutoff, thus, it is better to use a kind of a semi-automatic way to select the q-value.
One solution is to use quantile regression to select the best cutoff given the distance.
Bellow we show the distribution of the q-values given the different distances (THIS IS DEPRECATED AS IT IS VERY SLOW APROACH)
smry <- summary(DCis@qvals)
smry$x <- -log10(smry$x)
smry$dist <- smry$j - smry$i
## remove very low qvals as they are the majority
#smry <- subset(smry , x> -log10(pbeta(0.1,10,18-10+1)) )
o <- order(smry$dist)
smry <- smry[o,]
X <- model.matrix(smry$x ~ splines::bs(smry$dist, df=9))
smoothScatter(smry$dist, smry$x, xlab = "Distance",ylab="-log10(qvalue)")
quants =seq(0.9,0.99,.01)
cols = ggsci::pal_d3()(length(quants))
for(j in 1:length(quants)){
thr <- quantreg::rq(x ~ splines::bs(dist,df = 9), data = smry,tau = quants[j])
thr.fit <- X %*% thr$coef
smry$threshold <- thr.fit
lines(smry$dist,smry$threshold,col=cols[j],lwd=2)
}
legend(max(smry$dist)-20, max(smry$x)-20, legend=quants,
col=cols, lty=1, cex=0.5)
As in this simulated examples, the DCIs were generated independently from the interaction span, we notice that the different quantiles show a kind off uniform distribution, indicating the independence.
If the quantile regression is selected (method = "quantreg"
), the 99th percentile is selected by defaults.
We can select a lower cutoff using the quantile
parameter.
DCis <- getDCIs_fromSim(x = simRes, windowSize = 3,alpha = 0.7,
method = "quantreg",quantile = 0.99)
We can compare our results to true location of DCIs.
compareWithTrueDCI(DCis@DCIPos,simRes@trueStates)
Try Parallel processing
DCis_par = getDCIs_parallel(Hic1 = simRes@Hic1,
Hic2 = simRes@Hic2,
windowSize = 3,
alpha = 0.7,
chunk_size=100,
method = "quantreg",
quantile = 0.99,
qvalue = 1e-6,
isrOP_qval = F,
nbProcessors = 4,
verbose = TRUE)
Check that we are getting the correct qvalues
Here we want to make sure that the qvalues calculated in the paraallel version
are similar to the ones obtaines in the sequential version.
smry_par = summary(DCis_par@qvals)
smry_seq = summary(DCis@qvals)
colnames(smry_par)[3] = "qval_par"
colnames(smry_seq)[3] = "qval_seq"
mrg = merge(smry_par, smry_seq, all=TRUE)
# a few sites will show some difference, generally very slight at the decimal positions.
nrow(subset(mrg, qval_par != qval_seq))
loading genomewide chromatin interactions
FIND
also provides a function to perform genomewide diffrential chromatin
interaction detection. The internals of the function depend on the HiTC
package.
For the moment we support matrix files that are in the same format as the one
generated by HiC-Pro
.
Two types of files are needed:
-
A list of Hi-C matrix files: which are a three-columns tab-separated file,
where the first two columns indicate the bin ID and the last column indicated the
interaction frequency.
151 151 68.372895
151 152 51.179648
151 153 37.575510
151 154 17.432925
151 155 8.201022
151 156 18.941581
-
Bin indices: Which is a four-columned tab-separated files that indicated
the genomic position of the bins in the first 3 columns and the binID in the 4-th
column:
chr1 0 20000 1
chr1 20000 40000 2
chr1 40000 60000 3
chr1 60000 80000 4
chr1 80000 100000 5
chr1 100000 120000 6
Suppose we have a folder with 4 Hi-C matrices, we can first load them using
the loadHicExperiment
function:
matrices_lst <- as.list(list.files("HiC_results/",pattern = "*iced.matrix"))
binIndices = "MM9_5000_abs.bed"
gw_interactions <- loadHicExperiment(matrices_lst,binIndices = binIndices,
groups =c(1,1,0,0),type = "HicPro",
genome = "mm9")
Next, we can run the differential analysis. Chromosomes are processed sequenciantially,
but the interaction matrix of each chromosome is processed sepraratly.
This choise is to euilibrate the tradeoff between speed and memory,
we can parallalize everything, but the number of processes will explode.
If you are running of a cluster though, you can generate a HicExperiment
object
for each interaction, and processes them in different nodes.
Generally a chunk_size
of 300~500 bins, gives a good trade off. You fist test
on the smaller chromosomes and pick the paramters that suits you.
gw_interactions <- getDCIs_HicExpr(gw_interactions,
windowSize = 3,
alpha = 0.7,
chunk_size=500,
method = "quantreg",
quantile = 0.99,qvalue = 1e-6,isrOP_qval = F,
nbProcessors = 8,verbose = TRUE)
Loading .hic data
the .hic
format is one of the widely adopted chromatin interaction storage
format. Hopefully, Aiden's group have developed a wrapper function
straw available in their github repository.
We took advantage of the available C++ wrapper and interated it into FIND
,
unfortunatly due to the way C++ long
type are codded between windows and linux,
the code does not work in windows and may lead to a system crash.
Bellow is an example showing how deal with such data using the readJuiceBoxFormat
function provided by FIND
.
# We need to know the chromosomes lengths
require(BSgenome.Hsapiens.UCSC.hg19)
seqlen = seqlengths(BSgenome.Hsapiens.UCSC.hg19)[1:22]
# We load two K562 replicates and GM12878
hic_mats = list(K562_reps1 = "GSM1551620_HIC071_30.hic",
K562_reps2 = "GSM1551623_HIC074_30.hic",
GM12878_reps1 = "GSM1551574_HIC025_30.hic",
GM12878_reps2 = "GSM1551575_HIC026_30.hic")
As it have been reported in many diffrential chromatin interaction analysis,
such as in diffHic
, regioins with Copy Number Variation are hard to interprete
and the observed different might be just due to CNV. One can use some methods
to correct for this bias, or can ignore these regions directly.
Here we will just remove these regions. We will use the already calculated CNV
regions for the K562 downloaded from GSM999287
## We use this function as a proxy to call
fct = function(chrom,seqlen, hicfile, resolution){
intdata = readJuiceBoxFormat(hicfile,
chromosome=chrom,
normMethod="VC_SQRT",
resolution=resolution)
smry <- summary(intdata)
bins <- tileGenome(seqlen[chrom],tilewidth=resolution, cut.last.tile.in.chrom=TRUE)
bins$name <- 1:length(bins)
# correct size
intdata <- sparseMatrix(i=smry$i,j=smry$j,x=smry$x, dims=c(length(bins),length(bins)))
HTCexp(intdata, bins, bins)
}
## Read the list of .hic matrices
hic_matrices = list()
for(hic in names(hic_mats)){
# here we are using chromosom 22, just to test
chroms =names(seqlen)[22]
hic_lst = mclapply(chroms,fct,
hicfile= hic_mats[[hic]],
resolution=5000,
seqlen=seqlen,mc.cores=4
)
hic_matrices[[hic]] = HTClist(hic_lst)
}
## We can use the MA-plot technique to do a joint normalization between replicates
# define experiment design
groups=c(0,0,1,1)
# Create a HicExperiment object
hicCol <- new("HicExperiment",
hic_matrices= hic_matrices,
groups=groups,
DCIs=list(),
genome="hg19")
Ideally, we should do inter-replicates normalization
otherwise, many highly variable regions will be reported as DCI, some packages
such as HiCcompare
already implement the inter-replicates normalization using
the M-Distance plot for normalization. Another good strategy used by HiCcompare
,
is the weighted normalization of the M values (log2(f1/f2)
) using the A values
(log2((f1+f2)/2)
) as low frequency values tend to be very variable and noisy.
# lunch parallel processing.
hicCol <- getDCIs_HicExpr(HicExpr=hicCol,
windowSize = 3,
alpha = 0.7,
chunk_size=500,
method = "quantreg",
quantile = 0.99,
qvalue = 1e-6,
isrOP_qval = F,
nbProcessors = 8,
verbose = TRUE)
Displaying Diffrential Chromatin interactions
Some considerations when preparing Hi-C matrices
Hi-C matrices replicability
One concern when dealing with Hi-C replicates is the replicability of the
results. In our manuscript we used the MA-normalization,
which is ok, but it was not designed for the Hi-C context. Hopefully,
some new strategies are getting developped to deal with this concern,
for example we can refer the user to methods such as :
- hicrep (Available on Bioconductor)
- genomeDISCO (https://github.com/kundajelab/genomedisco)
- HiC-spector (https://github.com/gersteinlab/HiC-spector )
- QuASAR-Rep (part of https://github.com/bxlab/hifive)
For more details you can the preprint by Nobel's group (https://www.biorxiv.org/content/early/2018/02/05/188755)
Dealing with Cancer samples
Copy number variation can be a source of bias in Cancer cell samples and can
introduce many false positive differential interactions. One option is to try
to correct for this bias using tools shuch as :
- caICB: (part of https://bitbucket.org/mthjwu/hicapp).
Some implementation strtergies
Hi-C matrices are very large, their processing can take time and space. When
developping FIND
, we tried to uses different tricks to improve the processing
seed. The usage of a the sparse matrix class dgCMatrix
Potentially Comming Features
- loading of different formats : There is a juicebox format loading function, but works only on linux for the moment.
Session Info
sessionInfo()