通常,要产生的snp位点直接并不是独立的,他们之间是条件后的二项分布。
比如P(B|A)=(P(B)*P(A)+D)/P(A)
这样,我们产生模拟数据的时候就要注意位点之间的连锁关系:
for(h in 1:N) #for each individual
{
gMat = matrix(0, nrow=length(freq), ncol=2) #initialize a matrix to store geno for 2 allele on each site
for(i in 1:length(freq)) #for each site
{
for(j in 1:2)#for each allele
{
idx = ifelse(runif(1, 0, 1) < freq[i], 0, 1) #a random value between 0 and 1, if smaller than the freq the idx is 0
if(i == 1) # for the first site
{
gMat[i,j] = idx #the value for this site is idx
}
else # for not first site
{
d = runif(1, 0, 1) #random values for following comparison as a threshold for P(A|B)
a = gMat[i-1,j]
f1 = ifelse(a == 0, freq[i-1], 1-freq[i-1]) #get the allele percentage for the previous site
f2 = ifelse(a == 0, freq[i], 1-freq[i]) # get the allele percentage for present site
gMat[i,j] = ifelse(d < (f1 * f2 +ld[i-1])/f1, gMat[i-1,j], 1-gMat[i-1,j])