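Basic concepts
Gene Denovo (基迪奥) has an article that explains the basics very clearly, so I will not repeat them here. Please head over there first and get the fundamentals straight.
Guidelines for using STRUCTURE
STRUCTURE assumes that the markers in the input data are independent of one another, so the markers need to be filtered by some rule before the analysis. The three common approaches are (Nat Rev Genet, 2015):
- Keep one representative marker per fixed physical distance
- Randomly sample a subset of markers from across the genome
- Filter by LD: among markers whose LD exceeds a given threshold, keep only one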
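The first strategy can be sketched with a few lines of awk. This is only an illustration under stated assumptions: the function name `thin_by_distance`, the 1000 bp window and the demo records are invented here, the input is assumed to be sorted by position, and "one per distance" is implemented as fixed-size bins rather than a minimum spacing. PLINK 1.9 also offers `--bp-space` and `--thin` for distance-based and random thinning.

```shell
#!/bin/sh
# Hypothetical helper: keep the first variant seen in each WINDOW-bp bin
# on every chromosome. Assumes positions are sorted within a chromosome.
thin_by_distance() {
    # $1 = input VCF, $2 = window size in bp
    awk -v win="$2" '
        /^#/ { print; next }              # pass header lines through
        {
            bin = $1 ":" int($2 / win)    # chromosome name + window index
            if (!(bin in seen)) {         # first variant in this window
                seen[bin] = 1
                print
            }
        }' "$1"
}

# Tiny made-up input: three variants on ctg1, one on ctg2
demo=$(mktemp)
printf '#CHROM\tPOS\tID\n' > "$demo"
printf 'ctg1\t100\t.\nctg1\t900\t.\nctg1\t1500\t.\nctg2\t200\t.\n' >> "$demo"
thinned=$(thin_by_distance "$demo" 1000)
printf '%s\n' "$thinned"
rm -f "$demo"
```

With a 1000 bp window, the variant at position 900 falls in the same bin as the one at position 100 and is dropped; the other three are kept.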
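STRUCTURE in practice:
Preparation
Adding IDs to the markers
SNP data usually comes as a VCF file, and the first thing to do with a VCF file is to give every SNP site an ID.
First, take a look at the VCF file as originally generated:
As you can see, the ID column contains only ".", so we have to fill it in ourselves. I use a Perl script written by some unknown expert, which you can download from my GitHub. Usage: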
perl path2file/VCF_add_id.pl YourDataName.vcf YourDataName-id.vcf
You can of course also add the IDs by hand in Excel. The file after adding IDs looks like the figure below (format: CHROMID__POS):
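If you do not have the Perl script at hand, the same CHROMID__POS format can be produced with awk. This is a minimal sketch, not the script mentioned above: the function name `add_vcf_ids` and the demo records are made up, and it overwrites whatever is already in the ID column.

```shell
#!/bin/sh
# Hypothetical helper: set the ID column (3rd VCF column) to CHROM__POS.
add_vcf_ids() {
    # $1 = input VCF; the result is written to stdout
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }          # keep header lines unchanged
         { $3 = $1 "__" $2; print }    # replace any existing ID
    ' "$1"
}

# Tiny made-up input with "." in the ID column
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > "$demo"
printf 'ctg1\t100\t.\tA\tG\nctg2\t2500\t.\tC\tT\n' >> "$demo"
with_ids=$(add_vcf_ids "$demo")
printf '%s\n' "$with_ids"
rm -f "$demo"
```

On a real file you would redirect the output, for example `add_vcf_ids YourDataName.vcf > YourDataName-id.vcf`.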
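SNP filtering (missing rate and MAF filtering)
Before filtering SNPs, ask yourself one question: does my data need filtering at all?
It mostly depends on whether you plan to run an association analysis (GWAS) later. If you are only studying population structure, I suggest not filtering, because removing low-frequency variants may change the relationships between some samples. If you will link the data to phenotypes for an association analysis, I suggest filtering, because low-frequency variants are excluded in that later analysis and the two steps should stay consistent.
If you do filter, the powerful PLINK software does the job. Usage: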
plink --vcf YourDataName-id.vcf --maf 0.05 --geno 0.2 --recode vcf-iid --out YourDataName-id-maf0.05 --allow-extra-chr
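Parameters: --maf 0.05 removes sites whose minor allele frequency is below 0.05; --geno 0.2 removes SNP sites whose genotype is missing in more than 20% of the samples; --allow-extra-chr is needed because my reference is at contig level, with far more sequences than the chromosomes used in a typical analysis, and this flag lets PLINK accept the nonstandard sequence names.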
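To make the two thresholds concrete, here is a small awk sketch that computes the missing rate and the minor allele frequency of each site from the GT field. It only illustrates what the numbers mean and is not how PLINK is implemented: the function name `site_stats` and the demo record are made up, and biallelic sites with GT as the first FORMAT field are assumed.

```shell
#!/bin/sh
# Hypothetical helper: print ID, missing rate and MAF for each site.
site_stats() {
    # $1 = VCF with biallelic sites and GT as the first FORMAT field
    awk 'BEGIN { FS = "\t" }
         /^#/ { next }                               # skip header lines
         {
             miss = 0; alt = 0; called = 0
             for (i = 10; i <= NF; i++) {            # sample columns
                 split($i, f, ":")                   # f[1] is the genotype
                 if (f[1] ~ /\./) { miss++; continue }   # missing call
                 n = split(f[1], a, "[/|]")          # alleles of the call
                 for (j = 1; j <= n; j++) {
                     called++
                     if (a[j] != "0") alt++          # non-reference allele
                 }
             }
             nsamp = NF - 9
             af = (called > 0) ? alt / called : 0
             maf = (af > 0.5) ? 1 - af : af          # fold to minor allele
             printf "%s\t%.3f\t%.3f\n", $3, miss / nsamp, maf
         }' "$1"
}

# One made-up site with four samples: 0/0, 0/1, missing, 0/0
demo=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n' > "$demo"
printf 'ctg1\t100\tctg1__100\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t./.\t0/0\n' >> "$demo"
stats=$(site_stats "$demo")
printf '%s\n' "$stats"
rm -f "$demo"
```

The demo site has a missing rate of 0.250 (one of four samples) and a MAF of 0.167 (one alternate allele among six called alleles), so it would pass --maf 0.05 but be removed by --geno 0.2.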
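LD filtering (LD pruning and making the bed file)
As mentioned above, STRUCTURE assumes that every marker in the input is independent, so the markers have to be filtered by some rule. Here I use one of the approaches listed earlier, LD pruning.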
plink --vcf YourDataName-id-maf0.05.vcf --indep-pairwise 100 50 0.2 --out YourDataName-id-maf0.05-LD --allow-extra-chr --make-bed
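100: the window size. Without a kb suffix PLINK counts the window in variants, so this is a window of 100 SNPs (write 100kb for a 100 kb window). 50: the step, that is, the number of SNPs the window moves forward each time. 0.2: the LD threshold, an r² above which one SNP of a pair is pruned. Note that --make-bed here writes all variants to the bed/bim/fam files; the pruned set is listed in the .prune.in file and is applied with --extract in the next step.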
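For intuition about the r² threshold, the sketch below computes the squared Pearson correlation between two SNPs coded as allele dosages (0, 1 or 2 copies). It is a toy calculation with made-up genotypes and a made-up function name `r2`; PLINK's own computation also has to deal with missing genotypes, which are ignored here.

```shell
#!/bin/sh
# Hypothetical helper: squared correlation between two dosage vectors.
r2() {
    # $1, $2 = comma-separated dosages (0/1/2) in the same sample order
    awk -v x="$1" -v y="$2" 'BEGIN {
        n = split(x, a, ","); split(y, b, ",")
        for (i = 1; i <= n; i++) {
            sx += a[i]; sy += b[i]
            sxx += a[i] * a[i]; syy += b[i] * b[i]; sxy += a[i] * b[i]
        }
        cov = sxy - sx * sy / n      # n times the covariance
        vx = sxx - sx * sx / n       # n times the variance of x
        vy = syy - sy * sy / n       # n times the variance of y
        printf "%.2f\n", (vx > 0 && vy > 0) ? cov * cov / (vx * vy) : 0
    }'
}

r2 0,1,2,1 0,1,2,1    # identical genotypes: prints 1.00
r2 0,0,2,2 0,2,0,2    # uncorrelated genotypes: prints 0.00
```

With the 0.2 threshold used above, the first pair would lose one of its two SNPs and the second pair would be kept intact.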
Converting to STRUCTURE format
plink --bfile YourDataName-id-maf0.05-LD --extract YourDataName-id-maf0.05-LD.prune.in --out YourDataName-id-maf0.05-LD-structure --recode structure --allow-extra-chr
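Filling in the STRUCTURE configuration files:
There are two configuration files, mainparams and extraparams. We fill in mainparams and create an empty extraparams file.
Note: the number of mainparams files equals the number of K values times the number of replicates. For example, for K from 1 to 10 with 3 replicates each, you need 30 of these files and 30 matching command lines.
For K from 1 to 10 with 3 replicates each, the 30 configuration files can be named in the form mainparams_K_replicate: mainparams_1_1, mainparams_1_2, mainparams_1_3, and so on.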
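Writing 30 files by hand is tedious, so here is one possible way to generate them from a template with a shell loop. Treat it as a sketch under assumptions: the function name `make_mainparams`, the template file name and the out/K*_rep* output names are my own choices, and the two-line template in the demo is only a stub (a real mainparams file contains many more `#define` lines, which this loop leaves untouched).

```shell
#!/bin/sh
# Hypothetical helper: write one mainparams file per K and replicate,
# plus mainparams.list with one file name per line.
make_mainparams() {
    # $1 = template mainparams file, $2 = largest K, $3 = replicates per K
    : > mainparams.list
    k=1
    while [ "$k" -le "$2" ]; do
        rep=1
        while [ "$rep" -le "$3" ]; do
            out="mainparams_${k}_${rep}"
            # set K and give every run its own output file name
            sed -e "s|^#define MAXPOPS .*|#define MAXPOPS $k|" \
                -e "s|^#define OUTFILE .*|#define OUTFILE out/K${k}_rep${rep}|" \
                "$1" > "$out"
            echo "$out" >> mainparams.list
            rep=$((rep + 1))
        done
        k=$((k + 1))
    done
}

# Demo in a scratch directory with a stub template
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '#define MAXPOPS 2\n#define OUTFILE outfile\n' > mainparams.template
make_mainparams mainparams.template 10 3
wc -l < mainparams.list
```

The resulting mainparams.list can be used directly as the list of file names for a batch run.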
Running STRUCTURE
Running STRUCTURE is simple:
# run one at a time:
structure -m mainparams_1_1 -e extraparams
structure -m mainparams_1_2 -e extraparams
structure -m mainparams_1_3 -e extraparams
...
# run all at once: put the mainparams file names in a list file and call STRUCTURE in a for loop:
for i in $(cat mainparams.list); do nohup structure -m "${i}" -e extraparams & done
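The for loop above starts every run at the same moment, which can overload a machine with fewer cores than runs. One alternative is to let `xargs -P` cap the number of simultaneous jobs. The wrapper below is a sketch: the function name `run_structure_batch` and the `STRUCTURE_BIN` variable are introduced here for illustration, while the `-m` and `-e` options are the same ones used above.

```shell
#!/bin/sh
# Path to the STRUCTURE executable; override it if it is not on PATH.
STRUCTURE_BIN="${STRUCTURE_BIN:-structure}"

# Hypothetical helper: run one STRUCTURE job per line of a list file,
# with at most $2 jobs running at the same time.
run_structure_batch() {
    # $1 = file with one mainparams file name per line, $2 = max parallel jobs
    xargs -P "$2" -I{} "$STRUCTURE_BIN" -m {} -e extraparams < "$1"
}

# Example call (not executed here): four runs at a time
# run_structure_batch mainparams.list 4
```

Choose the job count to fit the number of cores you can spare.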
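Visualizing the results
Visualizing STRUCTURE results relies on an R package, pophelper, which has to be installed and loaded in R. Note: the commands below raise errors with newer pophelper releases, so v2.2.9 is the safest choice.
# install pophelper 2.2.9:
install.packages(c("Cairo","devtools","ggplot2","gridExtra","gtable","tidyr"),dependencies=TRUE)
# as written, this installs the latest version from GitHub; to pin 2.2.9,
# pass the matching release tag through the ref argument of install_github()
devtools::install_github('royfrancis/pophelper')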
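Visualization has two parts: 1) computing the best K and plotting it, and 2) drawing the STRUCTURE stacked bar plots. The approach is simple: put all the results in one folder and call pophelper. The R commands I wrote are listed below; run whichever parts you need:
You also need to prepare a grouping file (pop_list.txt). I used the columns shown in the figure below, and you can adapt them as you like; in the code that follows, column 1 holds the sample names and columns 2 and 3 hold the group labels. Note: the sample order in this file must match the sample order in the VCF.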
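Because a wrong sample order silently mislabels every bar in the plot, it is worth checking before plotting. The sketch below compares the sample names in the VCF header with the first column of the grouping file. The function name `check_sample_order`, the demo files and their column names are invented for illustration; the grouping file is assumed to have one header line, as the R code reads it with header = T.

```shell
#!/bin/sh
# Hypothetical helper: verify that a grouping file lists the samples
# in the same order as the VCF header.
check_sample_order() {
    # $1 = VCF, $2 = grouping file (header line, sample IDs in column 1)
    vcf_ids=$(grep -m 1 '^#CHROM' "$1" | cut -f 10- | tr '\t' '\n')
    grp_ids=$(tail -n +2 "$2" | awk '{ print $1 }')
    if [ "$vcf_ids" = "$grp_ids" ]; then
        echo "OK: sample order matches"
    else
        echo "MISMATCH: sample order differs" >&2
        return 1
    fi
}

# Made-up demo files with three samples
demo_dir=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' > "$demo_dir/demo.vcf"
printf 'sample\tgroup\nS1\tA\nS2\tA\nS3\tB\n' > "$demo_dir/pop_list.txt"
check_sample_order "$demo_dir/demo.vcf" "$demo_dir/pop_list.txt"
```

On real data the call would look like `check_sample_order YourDataName-id-maf0.05.vcf pop_list.txt`.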
# read structure results
# set the working directory (the folder that holds all STRUCTURE run results)
setwd("F:/structure_results")
# load pophelper
library(pophelper)
file_list <- list.files(path = "./out/", full.names = T) # list file directory
qlist <- readQ(file_list) # read result files
# evanno method to calculate deltaK
tbq <- tabulateQ(qlist)
smq <- summariseQ(tbq)
# plot the best-K curve
evannoMethodStructure(smq, exportplot = T, writetable = T,
imgtype = "png", height = 16, width = 18,outputfilename = "evanno")
evannoMethodStructure(smq, exportplot = T, writetable = T,
imgtype = "pdf", height = 16, width = 18,outputfilename = "evanno")
# clumpp repeat results
clumppExport(qlist = qlist, parammode = 3, prefix = "pop", useexe = T) # run clumpp
collectClumppOutput(prefix = "pop", filetype = "both", runsdir = getwd()) # collect clumpp results
# read clumpp merged results
fclum <- list.files(path = "pop-both", full.names = T, pattern = "merge")
qclum <- readQ(fclum)
sample_order <- read.table("./pop_list.txt", header = T, stringsAsFactors = F)
ind_name <- sample_order[,1]
for(i in 1:length(qclum)){
row.names(qclum[[i]]) <- ind_name
}
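# smallest and largest K that were run
mink <- 2
maxk <- 10
# list.files() returns names in alphabetical order, so files for K = 10 and
# above come before K = 2; k_order puts the panels back in increasing K
k_order <- vector()
if(maxk < 10){
k_order <- 1:length(qclum)
} else if (maxk < 20) {
end1 <- maxk - 10 + 1 # number of files with K >= 10
start2 <- end1 + 1
k_order <- c(start2:length(qclum), 1:end1)
}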
klab <- vector()
if(mink == 1){
klab <- 2:maxk
} else {
klab <- mink:maxk
}
# plot global STRUCTURE barplot without group information
prefix <- "demo"
height <- 2
width <- 16
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=0.1,
outputfilename=prefix,imgtype="png", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
# plot global STRUCTURE barplot with group information
prefix <- "demo_label"
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA,
grplab=sample_order[,2:3,drop=FALSE],ordergrp=T, grplabsize=2, grplabheight = 4)
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=0.1,
outputfilename=prefix,imgtype="png", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA,
grplab=sample_order[,2:3,drop=FALSE],ordergrp=T,grplabsize=2, grplabheight = 4)
# plot a separate STRUCTURE barplot for each K
plotQ(qclum, imgoutput = "sep",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
plotQ(qclum, imgoutput = "sep",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA,
grplab=sample_order[,2:3,drop=FALSE],ordergrp=T,grplabsize=2, grplabheight = 4)
## for admixture plot
library(pophelper)
setwd("F:/works/developing/course/gwas/data/lecture07/admixture_results")
file_list_admix <- list.files("admixture_output/", pattern = "\\.Q$", full.names = T) # match only files ending in .Q
info <- read.table("sample_order.txt", header = T, stringsAsFactors = F)
qlist_admix <- readQ(file_list_admix)
for(i in 1:length(qlist_admix)){
row.names(qlist_admix[[i]]) <- info$sample
}
k_order <- vector()
mink <- 1
maxk <- 10
if(maxk < 10){
k_order <- 1:length(qlist_admix)
} else if (maxk < 20) {
end1 <- maxk - 10 + 1
start2 <- end1 + 1
k_order <- c(start2:length(qlist_admix), 1:end1)
}
klab <- vector()
if(mink == 1){
klab <- 2:maxk
} else {
klab <- mink:maxk
}
prefix <- "admix"
height <- 1
width <- 16
plotQ(qlist_admix[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
plotQ(qlist_admix[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="png", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
References:
Population structure plots: the STRUCTURE stacked bar plot
Schraiber J G, Akey J M. Methods and models for unravelling human evolutionary history. Nature Reviews Genetics, 2015