参考:https://www.jianshu.com/p/ade4fb0c152f
一 背景简介
为了预防大家对这个概念或者领域比较陌生,下面会简单的和大家介绍一下其背景知识。
2005年,Tettelin等人提出了微生物泛基因组概念(pangenome,pan源自希腊语‘παν’,全部的意思),泛基因组即某一物种全部基因的总称。2009 年,Li等人首次采用新全基因组组装方法对多个人类个体基因组进行拼接,发现了个体独有的DNA序列和功能基因,并首次提出了“人类泛基因组”的概念,即人类群体基因序列的总和。2009 年泛基因组测序首次应用于人类基因组学研究;2013年泛基因组测序应用于动植物研究领域;2019年第一个泛基因组研究在大规模的(910名)非洲人后裔中开展。源源不断中,还有很多很多泛基因组的研究正在进行中。可以看出,随着测序技术的进步和其价格的下降,泛基因组研究变得越来越热门。
如图,泛基因组进而可以分为,核心基因组(core genome)和可变基因组 (variable genome)。核心基因指的是,在所有动植物品系或者菌株中都存在的基。可变基因组是指,在1个以及1个以上的动植物品系或者菌株中存在的基因。如果某个基因,仅存在某一个动植物品系或者菌株中,该基因还可以细分为品系或者菌株特有基因。一般来说,核心基因组控制着生命体基本生成代谢的功能。另外,结构变异中的存在/缺失变化(presnece/absence variation)是泛基因组的重点研究对象,因为可变基因组可能就是使个体产生不同性状(抗病性,抗寒性等)的原因。
泛基因组的构建法:重头组装构建法
这是构建泛基因组最经典的方法,最早运用于构建细菌的泛基因组。分别对多个个体进行,分别的从头组装和注释,然后将所得的每个个体的新组装的序列与参考序列reference基因组进行比对,找出比对不上的区域,进行整合。对于比较大的基因组,此方法需要耗费更多的电脑资源,因为需要对每一个个体进行分别进行重头组装。那为什么要提这个构建方法呢,因为即将介绍的工具是基于这种泛基因组的构建方法。
二 工具的安装
介绍了那么多,差不多可以开始接触这个工具了。首先是要把这个工具还有其需要依赖的工具一一安装好。
ppsPCP的安装
简单地从git hub上拷贝下来就好了
git clone git@github.com:Zhuxitong/ppsPCP.git
export PATH=/path/to/ppsPCP/bin/:$PATH
依赖包
MUMmer
这里一定要下载并使用 Mummer-4.0.0beta2的版本,如果使用其它版本,根据我的测试是会报错的。估计这个工具写的时候,是按照只适用于该版本的Mummer进行编写的。
$ wget https://github.com/mummer4/mummer/releases/download/v4.0.0beta2/mummer-4.0.0beta2.tar.gz
$ tar -xvzf mummer-4.0.0beta2.tar.gz
$ ./configure --prefix=/path/to/installation
$ make
$ make install
# Add MUMmer tools to your PATH
$ export PATH=/path/to/installation/:$PATH
Blast+
经我测试对版本没有绝对的要求,如果你系统已经按照好了旧的blast+,依然是可以运行的。当然不推荐用很旧很旧的版本(好几年没更新过那种)。
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.9.0+-x64-linux.tar.gz
$ tar zxvf ncbi-blast-2.9.0+-x64-linux.tar.gz
# Add Blast+ tools to your PATH
$ export PATH=/path/to/blast+/bin:$PATH
Bedtools
这里我要说,这个bedtools一定要使用最新的版本,一开始我根据该工具的manual使用bedtoolsv v2.5 结果就是无论怎样,运行到最后一步bedtools要index fasta这一步都会出错。
$ wget https://github.com/arq5x/bedtools2/releases/download/v2.28.0/bedtools-2.28.0.tar.gz
$ tar -zxvf bedtools-2.28.0.tar.gz
$ cd bedtools2
$ make
$ export PATH=/path/to/bedtools/bin:$PATH
Blat和gffread
这几个就正常跟着流程按照就好了,对版本要求不大。
###blat
$ mkdir UCSC_tools
$ rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./
# Add blat to your PATH
export PATH=/path/to/UCSC_tools/blat/:$PATH
###gffread 它是cufflinks中的一个小工具,所以通过按照cufflinks就能调用了
$ wget http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
$ tar zxvf cufflinks-2.2.1.Linux_x86_64.tar.gz
# Add gffread to your PATH
$ export PATH=/path/to/cufflinks-2.2.1.Linux_x86_64/:$PATH
Perl module
最后就是安装bioperl这个模块,因为这个工具其中的几个脚本都有调用到这个模块
$ cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)
$ cpanm Bio::Perl
三 流程概述
接着就让我们更加细致了解一下这个工具所涉及的流程,主要步骤分为十步:
Aligning query to ref with nucmer !
Extracting PAVs from nucmer output,最短的PAV长度是被定义为100bp。
为了确认这些PAV,BLASTn用于在参考基因组上搜索这些PAV序列,确认他们是存在/缺失。
Filtering PAVs from blastn output (1. high coverage and similarity, 2. unmapped/ no similarity)
将上面BLASTn的结果用来判断PAVs序列可不可信,分别有两种情况:
1)与参考序列非常接近(相似度>=95% 和该片段>=90%的序列被覆盖到),这种情况下,这些高度相似的片段会被过滤掉。
2)PAVs序列在参考基因组中找不到。另一句话说他们是缺失的。要被选出来。Extension and correction of filtered PAVs by matching them with query genome to get full gene covering regions
这些被选出来的PAVs,进一步和query片段比较,如果他们是重叠的,就进而延长他们。Filtering and annotating genes overlapped with extracted PAVs
所有被挑选出来并延长的PAVS片段,会挪到一起,每个片段之间使用100bp的N-bases相连.然后准备进行注释Merging filtered information with reference genome and making sequence based draft pan-genome
最后这些新的PAVS片段和参考基因组就组成了泛基因组Realigning the draft pan-genome to query genome as reference using BLAT, to filter less similar genes or genes not fulfill the previous defined criteria
每个材料的基因组和参考基因组比对使用BLAT工具。Including missing/less similar genes to step 5 output for final process
提取的基因区域是从query基因组和新生成二代PAV基因组中挖掘出来的(第五步中的结果)Generating final pan-genome and its annotation file
一个完整的基于参考基因组的泛基因组,是通过合并PAVS序列文件及其与参考基因组的注释来构成泛基因组的
四 使用
1. 工具使用说明
依赖的包不少,但使用的方法却很简单,只要将参考基因组和参考基因组的注释,再加上其它不同的植株所对应的组装和注释放到命令中就行了。换句话讲,这就要求你事先已经组装好每一个不同植株或者个体,并且将其注释好(为了确保准确性和一致性这里一般都会使用相同的工具进行组装和注释)。
Usage:
make_pan.pl [options] --ref [reference_genome] --ref_anno [refernece_anno] --query query1_genome[query2...] --query_anno query1_anno[query2...]
其它比较重要的参数就是--coverage
,--sim_gene
和--sim_pav
,基因存在/缺失的变异,将由序列的覆盖度,相似度来决定。然后就是--thread
,使用更多的CPU能够在mummer和blast的步骤极大提高运行的效率。
***Filter parameters
--coverage The coverage used to filter similar PAVs. Can be any number between 0 and 1. Default: 0.9
--sim_pav The similarity used to filter similar PAVs. Can be any number between 0 and 1. Default: 0.95
--sim_gene Then similarity used to filter mapped genes in blat mapping. Can be any number between 0 and 1. Default: 0.8
--thread The number of threads used for mummer and blastn. Remember not all the phases of ppsPCP are parallelized. Default: 1
2. 输入和输出文件
输入文件
运行ppsPCP至少需要两个基因组序列文件和两个相应的注释文件。基因组序列文件应该是具有以下格式的fasta文件:
>chr1
ATCGATCG...
文件扩展名无关紧要,可以接受“ .fa”,“.fasta”或任何其他后缀。但是序列文件的前缀名称将用于指示临时文件,因此我们建议您使用“ cultivar.fa(例如rice.fa)”来运行ppsPCP。
注释文件应为GFF3格式:
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ppsPCP也可以接受带有“基因”信息行的GFF格式。
输出文件
ppsPCP的主要输出文件是“ pangenome.fa”和“ pangenome.gff3”,以及一些有关泛基因组的有用信息,例如在查询中的PAV数量,merge到泛基因组中的基因数量等等。ppsPCP支持多个查询基因组文件,这些文件将生成“ pangenome1.fa”,“ pangenome2.fa” ...等等,每个文件都有对应的gff3文件。最后一个泛基因组将是最终的泛基因组,代表从每个查询基因组扫描并合并到参考基因组中的PAV /基因的总集合。
3. 使用其提供的测试文件进行测试
make_pan.pl --ref Zmw_sc00394.1.fa --ref_anno Zmw_sc00394.1.gff3 --query Zjn_sc00188.1.fa --query_anno Zjn_sc00188.1.gff3
生成了两个文件:pangenome1.fa
和pangenome1.gff3
。简单解析一下,pangenome1.fa
就是该测试数据所生成的泛基因组,其由参考基因组Zmw_sc00394.1.fa
和另一个isolateZjn_sc00188.1.fa
组装中和参考基因组不相同的序列组成(相当于non-reference sequences,Zjn_sc00188.1 isolate中独有的基因)。pangenome1.gff3
就是由参考基因中的基因注释加上只存在于Zjn_sc00188.1中的PAVS 基因(存在/缺失)注释共同组成。
五 Scripts
内部调用 ppsPCP.sh
###check input files
for my $file ($ref_seq, $ref_anno, @query, @query_anno){
if ( ! -e $file ){
die "\nError: file \"$file\" doesn't exist, please check!\n\n";
}
}
if ( -d $tmp ) {
#die "ERROR: Temporary directory $tmp already exists, please check. Delete it or change a directory to run the command again!\n"
}
else {
mkdir $tmp,0755 or die "Cannot create $tmp: $!";
}
my $pan_num = 1;
my $pan = basename($ref_seq);
if ( $pan =~ /pangenome(\d+).fa/){
$pan_num = $1 + 1;
}
###main funcation
my $ref_file = abs_path( $ref_seq );
my $ref_anno_file = abs_path( $ref_anno );
$sim_pav = $sim_pav * 100;
$sim_gene = $sim_gene * 100;
for my $index (0..$#query) {
my $query_file = abs_path( $query[$index] );
my $query_anno_file = abs_path( $query_anno[$index] );
chdir $tmp;
system("sh $Bin/ppsPCP.sh $ref_file $ref_anno_file $query_file $query_anno_file $pan_num $src_path $tmp $no_tmp $coverage $sim_pav $sim_gene $thread");
$ref_file = "pangenome$pan_num.fa";
$ref_anno_file = "pangenome$pan_num.gff3";
$pan_num++;
chdir "../";
}
if ( $no_tmp ){
unlink glob("$tmp/*");
rmdir $tmp;
}
ppsPCP.sh脚本内部也比较简单
#########################################################################
# File Name: ppsPCP.sh
# Author: Muhammad Tahir ul Qamar
# mail: m.tahirulqamar@hotmail.com
# Created Time: Wed Oct 17 16:15:15 2018
#########################################################################
#!/bin/bash
#########################################################################
# Explain parameter
# $5 = pan_number
# $6 = src_path
# $7 = tmp
# $8 = keep_tmp
# $9 = coverage
# $10 = sim_pav
# $11 = sim_gene
# $12 = thread
#########################################################################
#check for commands
#for com in nucmer delta-filter show-coords perl makeblastdb blastn gffread blat bedtools
for com in perl makeblastdb blastn gffread blat bedtools
do
mg=$(command -v $com)
if [ "$mg" == "" ]
then
echo -e "\n\tError: Command $com is NOT in you PATH. Please check.\n"
exit 1
fi
done
# File name parser
ref=$( basename $1 )
query=$( basename $3 )
refgff=$( basename $2)
querygff=$( basename $4)
refbase=`basename $ref`
querybase=`basename $query`
refbase=${refbase%\.*}
querybase=${querybase%\.*}
res="${querybase}to${refbase}"
# Create link files for genome and annotation file in tmp
if [ ! -e "$ref" ]
then
ln -s $1
ln -s $2
fi
ln -s $3
ln -s $4
#check for 'gene' line
gene=$(grep -m 1 -P "\tgene\t" $4 | head -n 1)
if [ "$gene" == '' ]
then
echo -e "\n\tNo gene line found in $4, please check the README.md for the requirements of input files\n"
exit 1
fi
#--------------------------------------------------------------------------------------------------
echo -e "\n##########################################################################################\n"
echo -e "Step 1: Aligning $query to $ref with nucmer!"
echo -n "Total number of genes in ${refbase}: "
grep -cP "\tgene\t" $2
echo -n "Total number of genes in ${querybase}: "
grep -cP "\tgene\t" $4
if [ ! -e "${res}.delta" ]
then
/share/nas1/dengdj/software/MUMmer/MUMmer4.0/bin/nucmer -p $res -t ${12} $ref $query
fi
if [ ! -e "${res}.rq.delta" ]
then
/share/nas1/dengdj/software/MUMmer/MUMmer4.0/bin/delta-filter -1 ${res}.delta > ${res}.rq.delta
fi
echo "/share/nas1/dengdj/software/MUMmer/MUMmer4.0/bin/show-coords -clrT -I 0.95 -L 100 ${res}.rq.delta > ${res}.rq.coords"
/share/nas1/dengdj/software/MUMmer/MUMmer4.0/bin/show-coords -clrT -I 0.95 -L 100 ${res}.rq.delta > ${res}.rq.coords
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 2: Extracting PAVs from nucmer output!"
perl $6/get_absese_region.pl ${res}.rq.coords ${refbase}_absence.txt ${querybase}_absence.txt
echo "perl $6/get_seq.pl ${ref} ${refbase}_absence.txt ${refbase}_absence.fa"
perl $6/get_seq.pl ${query} ${querybase}_absence.txt ${querybase}_absence.fa
#perl $6/get_seq.pl ${ref} ${refbase}_absence.txt ${refbase}_absence.fa
echo -n "Total number of PAVs extracted from ${querybase}: "
grep -c '>' ${querybase}_absence.fa
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 3: Aligning the extracted PAVs against reference genome using blastn!"
#makeblastdb -in ${ref} -title ${refbase}_DB -dbtype nucl -out ${refbase}_database
#blastn -db ${refbase}_database -query ${querybase}_absence.fa -num_threads ${12} -evalue 1e-5 -outfmt 6 -out ${res}_blastn.txt
perl $6/cut_file_and_make_blast+.pl -draft ${querybase}_absence.fa -ref ${ref} -db 0 -o_dr ${querybase}_BLASTN -p 2
cp ${querybase}_BLASTN/all.blast ${res}_blastn.txt
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 4: Filtering PAVs from blastn output (1. high coverage and similarity, 2. unmapped/ no similarity)!"
perl $6/get_coverage_filter.pl ${res}_blastn.txt ${querybase}_absence.fa ${querybase}_absence_filtered.fa ${querybase}_absence_pavs.txt $9 ${10}
echo "perl $6/sep_pav_bed.pl ${querybase}_absence_pavs.txt ${querybase}"
perl $6/sep_pav_bed.pl ${querybase}_absence_pavs.txt ${querybase}
perl $6/get_unmapped_pavs.pl ${res}_blastn.txt ${querybase}_absence.fa ${querybase}_unmapped_pavs.txt
echo -n "Number of filtered ${querybase} PAVs having definded covergae/similarity: "
grep -c '>' ${querybase}_absence_filtered.fa
echo -n "Number of filtered ${querybase} PAVs having no similarity/hit with ${refbase}: "
cat ${querybase}_unmapped_pavs.txt | wc -l
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 5: Extension and correction of filtered PAVs by matching them with query genome to get full gene covering regions!"
perl $6/sep_pav_bed.pl ${querybase}_unmapped_pavs.txt ${querybase}.2
cat ${querybase}.bed ${querybase}.2.bed | sortBed > ${querybase}_draft.bed
echo -n "Total number of ${querybase} PAVs after boundries correction: "
cat ${querybase}_draft.bed | wc -l
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 6: Filtering and annotating genes overlapped with extracted PAVs!"
grep -P "\tgene\t" ${querygff} > ${querybase}_loci.gff3
/share/nas2/genome/biosoft/bedtools2/2.25/bin/intersectBed -a ${querybase}_draft.bed -b ${querybase}_loci.gff3 -wa -wb > ${querybase}_pav_intersect_gene.gff3
#intersectBed -a ${querybase}_draft.bed -b ${querybase}_loci.gff3 -wa -wb > ${querybase}_pav_intersect_gene.gff3
perl $6/get_pav_for_each.pl ${querybase}_draft.bed ${querybase}_pav_intersect_gene.gff3 ${query} ${querybase}_pav_region.txt ${querybase}_pav_final.fa
perl $6/get_the_pav_seq.pl ${querybase}_pav_region.txt ${querybase}_pav_final.fa ${querybase}_pav_seq.fa ${querybase}_pav.agp ${querybase}_pav.bed ${querybase}
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 7: Merging filtered information with reference genome and making sequence based draft pan-genome!"
cat ${ref} ${querybase}_pav_seq.fa > ${res}_draft_genome.fa
echo -n "Size of the draft pan-genome: "
ls -lh "${querybase}to${refbase}_draft_genome.fa" | awk '{print $5}'
/share/nas2/genome/biosoft/bedtools2/2.25/bin/intersectBed -a ${querybase}_pav.bed -b ${querybase}_loci.gff3 -wa -wb -F 1 > ${querybase}_gene_relocation.txt
#intersectBed -a ${querybase}_pav.bed -b ${querybase}_loci.gff3 -wa -wb -F 1 > ${querybase}_gene_relocation.txt
perl $6/get_gff3_file.pl ${querybase}_gene_relocation.txt ${querygff} ${querybase}_gene_relocation.gff3
echo -n "Number of genes overlapped with ${querybase} PAVs and added into draft genome: "
grep -P "\tgene\t" ${querybase}_gene_relocation.gff3 | cut -f 9 | sort | uniq | wc -l
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 8: Realigning the draft pan-genome to query genome as reference using BLAT, to filter less similar genes or genes not fulfill the previous defined criteria!"
gffread ${querygff} -g ${query} -w ${querybase}_exons.fa
cat ${querybase}_exons.fa | awk '/^>/ {if(N>0) printf("\n"); printf("%s\t",$0);N++;next;} {printf("%s",$0);} END {if(N>0) printf("\n");}' | awk -F ' ' '{printf("%s\t%d\n",$0,length($2));}' | awk '{n=NF-1;print $1" "$2"\t"$n"\t"$NF}' | sort -k2,2 -k4,4nr | sort -k2,2 -u -s | awk '{print $1" "$2"\n"$3}' | fold -w 80 > ${querybase}_exons_longest.fa
blat ${res}_draft_genome.fa ${querybase}_exons_longest.fa ${res}_output.psl
sed '1,5'd ${res}_output.psl | awk -v sim=${11} '$1/$11*100 >= sim{print $1"\t"$1/$11*100"\t"$10}' | sort -k3,3 -k1,1nr | sort -k3,3 -u -s | cut -f 3 | sort > ${querybase}_mapped.gene.list.txt
grep '>' ${querybase}_exons_longest.fa | awk '/>/{sub(/>/,"",$1);print $1}' | sort | uniq > ${querybase}_all.gene.list.txt
comm -1 -3 ${querybase}_mapped.gene.list.txt ${querybase}_all.gene.list.txt > ${querybase}_unmapped.genes.txt
echo -ne "\nNumber of ${querybase} genes mapped to draft pan-genome: "
cat ${querybase}_mapped.gene.list.txt | wc -l
echo -n "Number of ${querybase} genes NOT mapped to draft pan-genome: "
cat ${querybase}_unmapped.genes.txt | wc -l
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 9: Including missing/less similar genes to step 5 output for final process!\n"
perl $6/get_unmapped_gene_bed.pl ${querybase}_unmapped.genes.txt ${querygff} ${querybase} | sortBed > ${querybase}.3.bed
cut -f 1-4 ${querybase}_pav.bed | cat - ${querybase}.3.bed | sortBed > ${querybase}_final.bed
#--------------------------------------------------------------------------------------------------
echo -e "\nStep 10: Generating final pan-genome and its annotation file!"
/share/nas2/genome/biosoft/bedtools2/2.25/bin/intersectBed -a ${querybase}_final.bed -b ${querybase}_loci.gff3 -wa -wb > ${querybase}_pav_intersect_gene_final.gff3
perl $6/get_pav_for_each.pl ${querybase}_final.bed ${querybase}_pav_intersect_gene_final.gff3 ${query} ${querybase}_pav_region_final.txt ${querybase}_pavs_confirm_final.fa
awk '{print $1"\t"$5"\t"$6"\t"$4"\t"$2"\t"$3}' ${querybase}_pav_region_final.txt | /share/nas2/genome/biosoft/bedtools2/2.25/bin/sortBed | /share/nas2/genome/biosoft/bedtools2/2.25/bin/bedtools merge -i stdin -c 4,5,6 -o collapse | awk '{split($4,name,/,/);split($5,start,/,/);split($6,end,/,/);l=length(end);print $1"\t"$2"\t"$3"\t"$1"_"$2"_"$3}' > ${querybase}_pav_region_final2.txt
awk '{print $1"\t"$5"\t"$6"\t"$4"\t"$2"\t"$3}' ${querybase}_pav_region_final.txt | /share/nas2/genome/biosoft/bedtools2/2.25/bin/sortBed | /share/nas2/genome/biosoft/bedtools2/2.25/bin/bedtools merge -i stdin -c 4,5,6 -o collapse | awk '{split($4,name,/,/);split($5,start,/,/);split($6,end,/,/);l=length(end);print $1"\t"start[1]"\t"end[l]"\t"name[1]"\t"$2"\t"$3}' > ${querybase}_pav_region_final3.txt
echo -e "\n/share/nas2/genome/biosoft/bedtools2/2.25/bin/bedtools getfasta -fi ${query} -bed ${querybase}_pav_region_final2.txt -name -fo ${querybase}_pavs_confirm_final2_mid.fa"
/share/nas2/genome/biosoft/bedtools2/2.25/bin/bedtools getfasta -fi ${query} -bed ${querybase}_pav_region_final2.txt -name -fo ${querybase}_pavs_confirm_final2_mid.fa
echo -e "\ncat ${querybase}_pavs_confirm_final2_mid.fa | fold -w 80 > ${querybase}_pavs_confirm_final2.fa"
cat ${querybase}_pavs_confirm_final2_mid.fa | fold -w 80 > ${querybase}_pavs_confirm_final2.fa
echo -e "\nperl $6/get_the_pav_seq.pl ${querybase}_pav_region_final3.txt ${querybase}_pavs_confirm_final2.fa ${querybase}_pav_seq_final.fa ${querybase}_pav_final.agp ${querybase}_pav_final.bed ${querybase}"
perl $6/get_the_pav_seq.pl ${querybase}_pav_region_final3.txt ${querybase}_pavs_confirm_final2.fa ${querybase}_pav_seq_final.fa ${querybase}_pav_final.agp ${querybase}_pav_final.bed ${querybase}
echo -e "\n/share/nas2/genome/biosoft/bedtools2/2.25/bin/intersectBed -a ${querybase}_pav_final.bed -b ${querybase}_loci.gff3 -wa -wb -F 1 > ${querybase}_gene_relocation_final.txt"
/share/nas2/genome/biosoft/bedtools2/2.25/bin/intersectBed -a ${querybase}_pav_final.bed -b ${querybase}_loci.gff3 -wa -wb -F 1 > ${querybase}_gene_relocation_final.txt
echo -e "\nperl $6/get_gff3_file.pl ${querybase}_gene_relocation_final.txt ${querygff} ${querybase}_gene_relocation_final.gff3"
perl $6/get_gff3_file.pl ${querybase}_gene_relocation_final.txt ${querygff} ${querybase}_gene_relocation_final.gff3
cat ${ref} ${querybase}_pav_seq_final.fa > pangenome$5.fa
cat ${refgff} ${querybase}_gene_relocation_final.gff3 > pangenome$5.gff3
echo -n "Number of ${querybase} PAVs added to the pan-genome${5}: "
cat ${querybase}_pav_final.bed | wc -l
echo -n "Number of ${querybase} genes added to the pan-genome${5}: "
grep -cP "\tgene\t" ${querybase}_gene_relocation_final.gff3
echo -n "Average length of ${querybase} PAVs: "
awk '{n=$6-$5;sum+=n}END{print sum/NR/1000"Kb"}' ${querybase}_pav_final.bed
echo -n "Total size of the pan-genome${5}: "
ls -lh pangenome${5}.fa | awk '{print $5}'
echo -n "Total number of genes in the pan-genome${5}: "
grep -Pc "\tgene\t" pangenome${5}.gff3
#--------------------------------------------------------------------------------------------------
cp pangenome$5.fa ../
cp pangenome$5.gff3 ../
echo -e "\nJob successfully completed. Check current directory for final results!"