全基因组关联分析

1.1 SNP 数据GWAS分析
1.1.1 文件准备
1.1.2 GWAS分析

emmax：gwas分析软件
tassel: gwas分析
admixture: 群体结构分析
plink: pca分析
qqman or CMplot : gwas 绘图R包
beagle: 数据填充

表型文件：Real_trait.C16_0.table
Q矩阵：Real_snp.3.Q

head -n 5 Real_trait.C16_0.table

GA0001      53.25
GA0002      70.04

mkdir ./TMP

## 这里all.vcf 是一组测试数据
$ singularity exec ../../software/Reseq_genek.sif   \
       java   -Xmx4g  -Djava.io.tmpdir=./TMP \
       -jar    /opt/beagle.jar \
       gt = ../data/all.vcf  \   #输入vcf文件
       out=all.impute \         # 输出结果前缀
       impute = true \           # 是否进行基因型填充
       window = 10 \            # 窗口大小
       nthreads = 4              # 线程数


用 emmax 进行 GWAS 分析
## step1 vcf格式转成tped格式（emmax软件要求）
$ singularity exec ../../software/Reseq_genek.sif plink \
         --vcf ../data/Real_snp.vcf \
         --maf 0.05 --geno 0.1 \ # 过滤数据
         --recode 12 transpose \ # genotype转12格式并生成tped文件
         --output-missing-genotype 0 \ # 缺失数据用0表示
         --out snp_filter \ # 数据文件前缀
         --set-missing-var-ids @:# \ # 设置SNP名称
         --allow-extra-chr \
         --keep-allele-order

## 生成文件如下
$ ls snp_filter.*
     snp_filter.log
     snp_filter.nosex
     snp_filter.tfam
     snp_filter.tped

## 表型数据准备, 主要目的是将样品排序
$ perl ../script/sort_pheno.pl snp_filter.tfam \
../data/Real_trait.C16_0.table > trait.sort.txt
## 查看数据，缺失用NA表示
$ head -n 5 trait.sort.txt
   GA0001 GA0001 53.25
   GA0002 GA0002 59.18
   GA0003 GA0003 70.04

## 计算亲缘关系矩阵
$ singularity exec ../../software/Reseq_genek.sif emmax-kin-intel64 \
           -v \ # 打开日志详细模式
          -d 10 \ # 指定精度
          -o ./pop.kinship \ # 指定输出文件名称
           snp_filter # 指定输入SNP数据前缀

## 进行gwas分析
$ singularity exec ../../software/Reseq_genek.sif emmax-intel64 \
         -v -d 10 \ # 指定精度
         -t snp_filter \ # 输入SNP数据前缀
         -p trait.sort.txt \ # 表型数据
         -k pop.kinship \ # 亲缘关系矩阵
         -o emmax.out \ # 输出文件前缀
         1> emmax.log 2>emmax.err

## 结果文件为emmax.out.ps, 最后一列为p-value
$ head -n 5 emmax.out.ps
Chr01:18180 -0.2773197509 0.5047378721 0.583283761
Chr01:18197 -0.1820301152 0.5113534847 0.722210058
Chr01:18241 -0.09952198502 0.5070841116 0.8445912325
Chr01:18260 -0.151943789 0.5086490625 0.7654447133
Chr01:18409 -0.2613732556 0.5014925903 0.6027754005

### 生成绘图输入文件
$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"}' snp_filter.tped | \
paste - emmax.out.ps | \
awk 'BEGIN{print "SNP\tCHR\tBP\tP"}{if($2==$5){print \
$2"\t"$1"\t"$4"\t"$NF}}' > emmax.out.ps.manht_input

### 绘图输入文件需要为4列, SNP名称， 染色体，位置，p-value
$ head -n 5 emmax.out.ps.manht_input
SNP CHR BP P
Chr01:18180 1 18180 0.8540511205
Chr01:18197 1 18197 0.8085936307
Chr01:18241 1 18241 0.2641876378
Chr01:18409 1 18409 0.7272875197

### 绘制Manhattan图和qqplot图
$ singularity exec ../../software/Reseq_genek.sif Rscript \
../script/manhattan_cmplot.R \
emmax.out.ps.manht_input \
emmax.out.ps.manht_figure

曼哈顿图

1.1.2.2 emmax + Q

用Q矩阵作为协变量用emmax 进行GWAS分析

## 准备SNP数据根据Q矩阵 表型数据
$ ln -s ../01.GWAS_emmax/snp_filter.tped
$ ln -s ../01.GWAS_emmax/snp_filter.tfam
$ ln -s ../01.GWAS_emmax/trait.sort.txt
$ ln -s ../data/Real_snp.3.Q
$ ln -s ../01.GWAS_emmax/pop.kinship

snp数据进行GWAS分析

## 修改Q矩阵格式
$ awk '{print $1}' snp_filter.tfam | \
     paste - Real_trait.C16_0.table
     awk '{print $1" "$1 1 "$2" "$3"}' > pop.Qmatrix

## 修改之后的格式
$ head -n 5 pop.Qmatrix
GA0001 GA0001 1 0.010490 0.480657
GA0002 GA0002 1 0.000010 0.864384
GA0003 GA0003 1 0.055588 0.944402
GA0004 GA0004 1 0.632317 0.259245
GA0005 GA0005 1 0.999980 0.000010

## GWAS分析
$ singularity exec ../../software/Reseq_genek.sif emmax-intel64 \
      -v \
     -d 10 \
-t snp_filter \
-p trait.sort.txt \
-k pop.kinship \
-c pop.Qmatrix \
-o emmax.out 1> emmax.log 2>emmax.err

1.1.2.3 tassel + glm

用tassel 软件基于 glm 模型进行 gwas 分析

## 修改表型数据格式
$ awk 'BEGIN{print "<Trait>\tC16_0"}{ print $0 }' \
../data/Real_trait.C16_0.table > trait.txt
## 查看表型数据
$ head -n 5 trait.txt
<Trait> C16_0
GA0001 53.25
GA0002 59.18
GA0003 70.04
GA0004 62.25
## 修改Q矩阵格式
$ awk '{print $1}' ../data/Real_snp.fam | \
paste - ../data/Real_snp.3.Q | \
awk 'BEGIN{print "<Covariate>\n<Trait>\tQ1\tQ2\tQ3" }{print $0}' \
> structure_Q3.txt
## 查看Q矩阵内容

## 修改表型数据格式
$ awk 'BEGIN{print "<Trait>\tC16_0"}{ print $0 }' \
../data/Real_trait.C16_0.table > trait.txt
## 查看表型数据
$ head -n 5 trait.txt
<Trait> C16_0
GA0001 53.25
GA0002 59.18
GA0003 70.04
GA0004 62.25
## 修改Q矩阵格式
$ awk '{print $1}' ../data/Real_snp.fam | \
paste - ../data/Real_snp.3.Q | \
awk 'BEGIN{print "<Covariate>\n<Trait>\tQ1\tQ2\tQ3" }{print $0}' \
> structure_Q3.txt
## 查看Q矩阵内容

$ head -n 5 structure_Q3.txt
<Covariate>
<Trait> Q1 Q2 Q3
GA0001 0.010490 0.480657 0.508852
GA0002 0.000010 0.864384 0.135606
GA0003 0.055588 0.944402 0.000010
## 如果vcf文件未排序，请进行vcf文件sort
$ singularity exec ../../software/Reseq_genek.sif run_pipeline.pl \
-Xms512m -Xmx10g \
-SortGenotypeFilePlugin \
-inputFile ../data/Real_snp.vcf \
-outputFile Real_snp.sort.vcf \
-fileType VCF
## 使用glm模型进行gwas分析
$ singularity exec ../../software/Reseq_genek.sif run_pipeline.pl \
-Xms512m -Xmx10g \ # 设置内存大小
-fork1 -vcf ../data/Real_snp.vcf \ # 读入vcf文件
-fork2 -t trait.txt \ # 读入表型数据
-fork3 -q structure_Q3.txt -excludeLastTrait \ #读入Q矩阵
-combine4 -input1 -input2 -input3 -intersect \ # 数据取交集
-FixedEffectLMPlugin -endPlugin \ # 进行glm分析
-export glm_output
## 分析结果如下
$ ls glm_output*.txt
glm_output1.txt
glm_output2.txt
## 绘制曼哈顿图和QQplot图

$ awk '{print $2"\t"$3"\t"$4"\t"$6}' glm_output1.txt | \
awk '$NF ~ /[0-9]+/ || NR == 1 ' > glm_output.manht_input
$ singularity exec ../../software/Reseq_genek.sif Rscript \
../script/manhattan_cmplot.R \
glm_output.manht_input \
glm_output.manht_figure

曼哈顿

全基因组关联分析

全基因组关联分析

全基因组关联分析

相关阅读更多精彩内容

友情链接更多精彩内容