写在前面
性别被称为“进化生物学问题的皇后”,一直是生命科学领域中最具吸引力和最热门研究方向之一。而有性生物其性染色体的解析是性别研究的基础,这篇文章主要介绍一套鉴定性染色体的流程。有性生物的性别决定系统主要有两种:雄性性别决定系统(XX/XY)和雌性性别决定系统(ZZ/ZW)。因此需要鉴定性染色体X/Y(Z/W)到底是基因组的哪条染色体。目前主流鉴定方式有两种:
一是利用雌雄测序reads(二代三代都行)在基因组上的覆盖度差异来区分常染色体和性染色染体,这种方式适合X和Y(或Z和W)分化差异较大的时候。如果以YY基因组为参考,雄性reads在Y特异的区域的覆盖度是在常染色体/假常染色体(PAR)的一半,而雌性reads而则在Y特异的区域完全没有覆盖度;同样地如果以XX基因组为参考,雄性reads在X特异的区域的覆盖度也是在常染色体/假常染色体(PAR)的一半,而雌性reads而则在X特异的区域的区域与常染色体/假常染色体(PAR)没有区别。
二是利用X和Y(或Z和W)同源重组抑制产生的大量的SNP来鉴定,我们将一定数量的雌雄个人重测序reads比对到参考基因组 ,获得了各个样本的SNP,筛选在符合要求的SNP,看看其主要位于哪条染色体上。比如我们如果以YY基因组为参考,就可以把所有雄性个体其基因型为杂合突变(0/1)而雌性个体其基因型为纯合突变(1/1)的SNP位点筛选出来,看他们位于哪里;同样地如果以XX基因组为参考,就需要筛出所有雄性个体其基因型为杂合突变(0/1)而雌性个体其基因型为未突变(0/0)的SNP位点。
这次我用第二种方式,用雌雄覆盖度下次有空再学习下!
大致的流程:
对比: 用bwa对重测序数据进行比对,拿到个样本的bam文件;
snp calling: 对拿到的bam用picard标记重复,随后将标重的bam用gatk去call SNP,拿到每个样本的gvcf文件;
joint calling: 将所有样本的gvcf还是用gatk合并成一个gvcf并联合分型生成vcf文件;
snp筛选并作图: 对vcf文件进行过滤,并筛选出符合要求的snp并作图展示出来。
1.比对
相关软件:
bwa,ParaFly,samtools。 ParaFly可以批量并行命令,因为这里有多个样本需要比对,彼此之间是独立的可以并行。也可以不装,可以依次进行单个样本比对,或者同时提交多个任务,就是麻烦点。
bwa主页:https://github.com/lh3/bwa
ParaFly主页:https://parafly.sourceforge.net/
## bwa手动安装
git clone https://github.com/lh3/bwa.git
cd bwa
make
## ParaFly自动安装
conda install -c bioconda parafly
#1.对参考基因组建立bwa以及samtools引索index
ln -s ../ref/YY2.fa caYY.fa
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools faidx caYY.fa
/newlustre/home/jfgui/Wangtao/software/bwa/bwa index caYY.fa
#2.生成所有样本的比对列表文件mapping.cmd.list
当然生成mapping.list方法有很多,按照自己方法来就行,实在不行久就手打所有命令行。
方法一:先生成一个所有样本的样本名称文件sample.list,再用awk print生成一个所有样本的比对列表mapping.cmd.list。
#生成样本名称文件
ls ../reseq-data/ca-mc/*_1.* | xargs -n 1 basename | sed 's/_1.fq.gz//' > sample.list # 用ls命令加basename导出一个只含有R1端样本的名称文件,再用sed命令把后面的字符替换掉。
#生成比对列表文件 (这里需要注意的点是-R参数外面的单引号和\t需要反斜杠来转义)
awk '{print "/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '\''@RG\\tID:"$1"\\tSM:"$1"\\tLB:"$1"\\tPL:ILLUMINA'\'' caYY.fa ../reseq-data/ca-mc/"$1"_1.fq.gz ../reseq-data/ca-mc/"$1"_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/"$1".bam"}' sample.list > mapping.cmd.list
#####################################
示例结果如下:
$cat sample.list
female1-728
female2-733
female3-735
female4-736
female5-737
female6-755
female7-729
female8-730
female9-732
male1-752
male2-753
male3-738
male4-739
male5-740
male6-746
male7-750
male8-747
male9-748
$cat mapping.cmd.list
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female1-728\tSM:female1-728\tLB:female1-728\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female1-728_1.fq.gz ../reseq-data/ca-mc/female1-728_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female1-728.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female2-733\tSM:female2-733\tLB:female2-733\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female2-733_1.fq.gz ../reseq-data/ca-mc/female2-733_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female2-733.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female3-735\tSM:female3-735\tLB:female3-735\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female3-735_1.fq.gz ../reseq-data/ca-mc/female3-735_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female3-735.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female4-736\tSM:female4-736\tLB:female4-736\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female4-736_1.fq.gz ../reseq-data/ca-mc/female4-736_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female4-736.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female5-737\tSM:female5-737\tLB:female5-737\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female5-737_1.fq.gz ../reseq-data/ca-mc/female5-737_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female5-737.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female6-755\tSM:female6-755\tLB:female6-755\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female6-755_1.fq.gz ../reseq-data/ca-mc/female6-755_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female6-755.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female7-729\tSM:female7-729\tLB:female7-729\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female7-729_1.fq.gz ../reseq-data/ca-mc/female7-729_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female7-729.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female8-730\tSM:female8-730\tLB:female8-730\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female8-730_1.fq.gz ../reseq-data/ca-mc/female8-730_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female8-730.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:female9-732\tSM:female9-732\tLB:female9-732\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/female9-732_1.fq.gz ../reseq-data/ca-mc/female9-732_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/female9-732.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male1-752\tSM:male1-752\tLB:male1-752\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male1-752_1.fq.gz ../reseq-data/ca-mc/male1-752_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male1-752.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male2-753\tSM:male2-753\tLB:male2-753\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male2-753_1.fq.gz ../reseq-data/ca-mc/male2-753_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male2-753.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male3-738\tSM:male3-738\tLB:male3-738\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male3-738_1.fq.gz ../reseq-data/ca-mc/male3-738_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male3-738.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male4-739\tSM:male4-739\tLB:male4-739\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male4-739_1.fq.gz ../reseq-data/ca-mc/male4-739_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male4-739.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male5-740\tSM:male5-740\tLB:male5-740\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male5-740_1.fq.gz ../reseq-data/ca-mc/male5-740_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male5-740.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male6-746\tSM:male6-746\tLB:male6-746\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male6-746_1.fq.gz ../reseq-data/ca-mc/male6-746_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male6-746.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male7-750\tSM:male7-750\tLB:male7-750\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male7-750_1.fq.gz ../reseq-data/ca-mc/male7-750_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male7-750.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male8-747\tSM:male8-747\tLB:male8-747\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male8-747_1.fq.gz ../reseq-data/ca-mc/male8-747_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male8-747.bam
/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:male9-748\tSM:male9-748\tLB:male9-748\tPL:ILLUMINA' caYY.fa ../reseq-data/ca-mc/male9-748_1.fq.gz ../reseq-data/ca-mc/male9-748_2.fq.gz | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/male9-748.bam
方法二(推荐):用bash脚本循环处理实现,脚本如下:
#!/bin/bash
# 循环处理每个 *_1.fq.gz 文件
fq_directory=../reseq-data/ca-mc
ls $fq_directory/*1.fq.gz | while read i;
do
base=$(basename $i _1.fq.gz)
r2=${i%1.fq.gz}2.fq.gz
# 将 BWA 映射命令追加到比对列表文件
echo "/newlustre/home/jfgui/Wangtao/software/bwa/bwa mem -t 32 -R '@RG\tID:$base\tSM:$base\tLB:$base\tPL:ILLUMINA' caYY.fa $i $r2 | /newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools sort -@24 - > bam/$base.bam " >> mapping.cmd.list
done
#3.用PareFly软件批量并行这些比对命令
mkdir bam ## 新建一个目录用于放置所有样本的比对后的.bam文件
ParaFly -c ./mapping.cmd.list -CPU 64
2.snp calling
相关软件:
picard,gatk,samtools。
picard主页:https://github.com/broadinstitute/picard
gatk主页:https://github.com/broadinstitute/gatk
## picard手动安装
wget https://github.com/broadinstitute/picard/releases/download/3.1.1/picard.jar
chmod 777 java -jar picard.jar
java -jar picard.jar
## gatk4手动安装
wget https://github.com/broadinstitute/gatk/releases/download/4.2.0.0/gatk-4.2.0.0.zip
unzip gatk-4.2.0.0.zip
cd gatk-4.2.0.0/
gatk --help
#1.用Picard生成.dict文件
用Picard的CreateSequenceDictionary
命令生成dict字典文件,这个文件是一个index,GATK很多过程都需要。
java -jar /newlustre/home/jfgui/Wangtao/software/picard.jar CreateSequenceDictionary R=caYY.fa O=caYY.dict
#2.用picard的MarkDuplicates
命令标记重复bam文件
#3.samtolols对以标记重复的bam文件建立引索
#4.用gatk的HaplotypeCaller
命令开始call snp
注1.因为picard是个java软件,所以可能在运行这个软件会出现临时内存不足的报错,我这里遇到了就指定了临时文件的位置TMP_DIR=
。
注2.gatk HaplotypeCaller
这个过程非常久,可能是软件不支持多线程跑得很慢;可用Sentieon替代,但该软件并不开源。
注3.本来也是想用parafly并行跑这些步骤的,但中间老是报错,就分开跑了,同样地用一个bash生成所有样本的命令,脚本如下
#!/bin/bash
ls bam/*.bam | while read i;
do
base=$(basename $i .bam)
echo "java -jar /newlustre/home/jfgui/Wangtao/software/picard.jar MarkDuplicates I=$i O=bam/$base.markdup.bam M=bam/$base.markdup.metrics.txt TMP_DIR=/newlustre/home/jfgui/Wangtao/ca/workdir/" >> $base.call.sh
echo "/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools index bam/$base.markdup.bam" >> $base.call.sh
echo "/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk HaplotypeCaller -R caYY.fa -I bam/$base.markdup.bam -O gvcf/$base.g.vcf.gz -ERC GVCF" >> $base.call.sh
done
随后依次对每个样品进行snp calling,下面是其中一个样本calling命令
java -jar /newlustre/home/jfgui/Wangtao/software/picard.jar MarkDuplicates I=bam/female2-733.bam O=bam/female2-733.markdup.bam M=bam/female2-733.markdup.metrics.txt TMP_DIR=/newlustre/home/jfgui/Wangtao/ca/workdir/
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools index bam/female2-733.markdup.bam
mkdir gvcf
/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk HaplotypeCaller -R caYY.fa -I bam/female2-733.markdup.bam -O gvcf/female2-733.g.vcf.gz -ERC GVCF
#########
所有样本运行完成后,查看中间文件,最后snp calling的结果是每一个样本都生成一个gvcf文件
$ll bam
总用量 1.4T
-rw-rw-r-- 1 jfgui jfgui 37G 11月 25 14:29 female1-728.bam
-rw-rw-r-- 1 jfgui jfgui 38G 11月 25 19:36 female1-728.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 25 19:41 female1-728.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 25 19:36 female1-728.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui 36G 11月 25 13:41 female2-733.bam
-rw-rw-r-- 1 jfgui jfgui 36G 11月 26 23:50 female2-733.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 26 23:55 female2-733.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 26 23:50 female2-733.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui 39G 11月 25 14:38 female3-735.bam
-rw-rw-r-- 1 jfgui jfgui 40G 11月 27 00:12 female3-735.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 27 00:17 female3-735.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 27 00:12 female3-735.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui 40G 11月 25 14:39 female4-736.bam
-rw-rw-r-- 1 jfgui jfgui 40G 11月 27 00:12 female4-736.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 27 00:17 female4-736.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 27 00:12 female4-736.markdup.metrics.txt
-rw-rw-r-- 1 jfgui jfgui 39G 11月 25 14:42 female5-737.bam
-rw-rw-r-- 1 jfgui jfgui 40G 11月 27 00:11 female5-737.markdup.bam
-rw-rw-r-- 1 jfgui jfgui 4.6M 11月 27 00:16 female5-737.markdup.bam.bai
-rw-rw-r-- 1 jfgui jfgui 3.4K 11月 27 00:11 female5-737.markdup.metrics.txt .......
$ll gvcf
总用量 89G
-rw-rw-r-- 1 jfgui jfgui 4.8G 11月 30 17:24 female1-728.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 11月 30 17:24 female1-728.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 5.5G 12月 1 03:20 female2-733.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月 1 03:20 female2-733.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 4.9G 12月 1 04:00 female3-735.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月 1 04:00 female3-735.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 4.7G 12月 1 00:38 female4-736.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月 1 00:38 female4-736.g.vcf.gz.tbi
-rw-rw-r-- 1 jfgui jfgui 4.8G 12月 1 05:09 female5-737.g.vcf.gz
-rw-rw-r-- 1 jfgui jfgui 2.7M 12月 1 05:09 female5-737.g.vcf.gz.tbi .......
3. joint calling
相关软件:
gatk,bcftools。
bcftools主页: https://github.com/samtools/bcftools
手动安装
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
tar -xvjf bcftools-1.18.tar.bz2
cd bcftools-1.18
make
make install
#1.用gatk的CombineGVCFs
命令将所有样本的gvcf合并
#2.用gatk的GenotypeGVCFs
命令对合并后单个gvcf进行联合变异检测,生成vcf文件
ls $(pwd)/gvcf/*.g.vcf.gz > gvcf.list ## 先生成一个所有样本的gvcf的路径文件
mkdir vcf
/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk CombineGVCFs -R caYY.fa --variant gvcf.list -O vcf/merge.g.vcf.gz
/newlustre/home/jfgui/Wangtao/software/gatk-4.2.0.0/gatk GenotypeGVCFs -R caYY.fa -V vcf/merge.g.vcf.gz -O vcf/merge.vcf.gz
#3.对vcf文件进行初步过滤
这一步需要对vcf文件格式有一个了解,进而可以按照自己的需求过滤掉”低质量“的变异位点,有很多说得很详细的文章了,这里不做赘述。此外,很多软件都能对vcf文件坐过滤可以根据自己喜爱来,这里用bcftools。
/newlustre/home/jfgui/Wangtao/software/bcftools-1.18/bcftools view -v snps -m2 -M2 -Oz --threads 24 -o vcf/merge.snp.vcf.gz vcf/merge.vcf.gz # -v只保留snp变异信息
/newlustre/home/jfgui/Wangtao/software/bcftools-1.18/bcftools view -i 'MAF >= 0.05 && F_MISSING <= 0.2 && INFO/DP > 4' vcf/merge.snp.vcf.gz -Oz -o vcf/merge.filtered.snp.vcf.gz # -i保留条件参数
4. snp筛选并作图
相关软件:
python3,bedtools,samtools,R。
python和R后面我从头学习这两门编程语言再仔细记录下吧,现在这里就会用脚本并能简单修改就行,这里用到服务器的python脚本筛选雄性特异的SNP以及用本地的R/Rstdio画图。
bedtools主页: https://github.com/arq5x/bedtools2
## bedtools自动安装
conda install -c bioconda bedtools
## bedtools手动安装
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
tar -xvzf bedtools-2.30.0.tar.gz
cd bedtools-2.30.0
make
#1.用一个python脚本筛出符合条件的snp
我们这里使用YY基因组参考的,因此筛出的SNP的基因组需要满足:雄鱼基因型为0/1
,而雌鱼的基因型为1/1
。我这里用的师兄写的一个python脚本SexFinder.py
,当前能力确实还不足以自己写python脚本或R脚本😂。这个脚本把≥8个雄鱼样本的基因型为0/1
同时满足≥8个雌鱼的基因型为1/1
所有snp挑出了,具体内容如下:
import sys
import gzip
def filter_vcf(femalelist, malelist, vcf):
female = [line.strip() for line in open(femalelist)]
male = [line.strip() for line in open(malelist)]
head = ""
with gzip.open(vcf, "rt", encoding="utf-8") as txt:
for line in txt:
if not line.startswith("#"):
if not head:
header = header.strip().split()
femalePos = []
malePos = []
for index, value in enumerate(header[9:], start=9):
if value in female:
femalePos.append(index)
elif value in male:
malePos.append(index)
head = "done"
line = line.strip().split()
p = line[6]
maleP = 0
femaleP = 0
for i in malePos:
sample = line[i].split(':')
tp = sample[0]
# dp = int(sample[2])
if (tp == "0/1") or (tp == "0|1"):
maleP += 1
for i in femalePos:
sample = line[i].split(':')
tp = sample[0]
# dp = int(sample[2])
if (tp == "1/1") or (tp == "1|1"):
femaleP += 1
if (maleP >= 8) and (femaleP >= 8):
print("\t".join(line))
else:
header = line
def main():
if len(sys.argv) != 4:
print('''Usage: for REF YY python3 SexFinder.py <femalelist> <malelist> <vcf>
for REF XX python3 SexFinder.py <malelist> <femalelist> <vcf>''')
sys.exit(1)
filter_vcf(sys.argv[1], sys.argv[2], sys.argv[3])
if __name__ == "__main__":
main()
运行这个脚本之前需要将雌雄样本名分别写入female.list
和male.list
中:
## 基于前面的sample.list生成female.list和male.list文件
head -n 9 sample.list > female.list
tail -n 9 sample.list > male.list
## 运行python脚本
mkdir sexfinder
python3 SexFinder.py female.list male.list vcf/merge.filtered.snp.vcf.gz > sexfinder/sex.snp.vcf
#2.将筛好vcf文件转换成适合画图的bed文件
上一步我们拿到了我们包含雄性特异snp相应的vcf文件,但是vcf文件里的信息太多了,事实上我们画图只需要snp的位置信息就行了,因此我们只需要保留前两列染色体号CHROM
和 具体位置 POS
就够了;此外为了直观表现雄性特异snp集中富集在哪里,我们对染色体画窗口,窗口的位置作为横坐标,并对每个窗口数计数,每个窗口内包含的snp数量为纵坐标,以此来画一个曼哈顿图。
## 保留vcf文件的前两列,但打印3列信息,第3列和第2列的信息一致均为snp的位置,这是为了后面取交集用。
awk '{print $1"\t"$2"\t"$2}' sexfinder/sex.snp.vcf > sexfinder/sex.snp.bed
## 对参考基因组建一个samtools faidx引索,其实前面做过了,这里可不做。
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/samtools faidx caYY.fa
## 基于caYY.fa.fai引索,提取其以Chr开头的行的前两个字段,其实也就是生成2两信息,染色体编号及其长度。
grep "^Chr" caYY.fa.fai | cut -f 1,2 > sexfinder/caYY.fa.size
## 利用bedtools makewindows工具对染色体依次画50k大小的窗口
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/bedtools makewindows -g sexfinder/caYY.fa.size -w 50000 > sexfinder/caYY.fa.50000.bed
## 利用bedtools intersect 工具对两个bed文件取范围交集并计数。-a 后接染色体窗口文件,-b后接snp坐标信息文件,-wa以a文件为原始条目,-c对b文件计数。
/newlustre/home/jfgui/wy/anaconda3/envs/pengfang/bin/bedtools intersect -a sexfinder/caYY.fa.50000.bed -b sexfinder/sex.snp.bed -wa -c > sexfinder/sex.snp.50000.bed
#3.将上一步生成的以窗口计数snp的bed文件作为输入在R画图
将上一步生成的以窗口计数snp的bed文件作为导入R中绘制曼哈顿图,也是师兄给的R脚本,按照自己的需求修改图片参数就行。
library(ggplot2)
library(dplyr)
library(readr)
library(stringr)
library(ggpubr)
df <- read.delim("C:/Users/WIIT/Documents/Rworkdir/sex.snp.50000.bed" , header = F)
unique(df$V1)
df$V2 <- df$V2/1000000
df$V3 <- df$V3/1000000
df$V1 <- str_replace(df$V1 , "Chr0" , "") %>% str_replace( "Chr" , "")
df$V1 <- factor(df$V1 , levels = paste0(rep(seq(1,25) , each = 2), rep(c("A","B") , 25)))
df.tmp <- df %>%
group_by(V1) %>%
summarise(chr_len=max(V3)) %>%
mutate(tot=cumsum(chr_len)-chr_len) %>%
dplyr::select(-chr_len) %>%
left_join(df, by = c('V1' = 'V1')) %>%
arrange(V1, V3) %>%
mutate( BPcum=V3+tot)
axisdf <- df.tmp %>% group_by(V1) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )
# comment the line below if you want to show all chromosome on x-axis
axisdf <- axisdf %>% filter(as.numeric(V1)%%2==1)
axisdf_tick <- df.tmp %>% group_by(V1) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )
color <- c("grey", "black")
manhattan <- ggplot(df.tmp, aes(x=BPcum, y=V4)) +
geom_point(aes(color=as.factor(V1)), size=0.5) +
scale_color_manual(values = rep(color , 25)) +
scale_x_continuous( label = axisdf$V1, breaks= axisdf$center ) +
theme_pubr(
legend = "none"
) +
theme(axis.title = element_text(size = 10),
axis.text = element_text(size = 7)) +
xlab("Chromosome") +
ylab("Male-specific SNPs count")
ggsave("C:/Users/WIIT/Documents/Rworkdir/manhattan.pdf" ,plot = manhattan, width = 10 , height = 3)
最终成图:
其他相关推荐
VCF文件参数解读 - 简书 (jianshu.com)
讨厌又迷人的reads去重复 - 简书 (jianshu.com)
bedtools intersect用法 (intersectBed) - 简书 (jianshu.com)