# wes分析时的bed文件相关问题

在WES中，因为测序只针对外显子区域。因此，我们就当只分析外显子区域，这就需要引入一个表示外显子区域的INTERVAL文件，既可加快分析速度，也可以降低OFF-TARGET，那么，在什么步骤引入INTERVAL合适呢？关于这个问题，可以参见GATK的官方说明：When should I restrict my analysis to specific intervals?

总结下来就是四个理由interval的引入

针对子集进行快速分析，常常用于排队故障；
并行进行跨基因组分析；
排除数据质量低的区域；
数据集只针对某个特定区域。

一般来讲，常见的invterval文件为bed类型的文件（扩展名是bed）

bed文件类型的格式是固定的

1	chrom	Chromosome (e.g. chr3, chrY, chr2_random)
2	chromStart	Start coordinate on the chromosome or scaffold for the sequence considered (the first base on the chromosome is numbered 0)
3	chromEnd	End coordinate on the chromosome or scaffold for the sequence considered. This position is non-inclusive, unlike chromStart.
4	name	genesymbol（我常用这个）
5	score	Score between 0 and 1000（具体含义不理解）
6	strand	DNA strand orientation (positive ["+"] or negative ["-"] or "." if no strand)

获取方法如下

1. 从官网获取

根据试剂盒kit的规格不同从官网下载（但一般AGILENT公司的试剂盒，官网下载的都不太适配），如agilent官网。

这里以agilent官网的试剂盒对应的bed为例子，选择不同型号的kit后，对应的文件如下表，但一般选用的是region.bed

total 389768
-rw-rw-r--@ 1 stevenjefferson  staff    49M  5 14 22:43 S31285117_AllTracks.bed
-rw-rw-r--@ 1 stevenjefferson  staff    44M  5 14 22:43 S31285117_Covered.bed
-rw-rw-r--@ 1 stevenjefferson  staff    44M  5 14 22:43 S31285117_MergedProbes.bed
-rw-rw-r--@ 1 stevenjefferson  staff    44M  5 14 22:43 S31285117_Padded.bed
-rw-rw-r--@ 1 stevenjefferson  staff   401B  5 14 22:43 S31285117_README
-rw-r--r--@ 1 stevenjefferson  staff   4.9M  5 15 12:41 S31285117_Regions.bed
-rw-rw-r--@ 1 stevenjefferson  staff   3.6M  5 14 22:43 S31285117_Targets.txt

对于我来说，官网的bed文件不太适配，原因：regions.bed缺少第4列和第5列信息。

2. 从UCSC官网下载相关数据，然后整理对应的bed文件，参考的biostar论坛的方法。

相关内容粘贴如下（！！！可能因为版本的不同，为了满足最终bed文件六列信息，我更改了一些内容）

Download a bed file for the canonical transcripts using UCSC Table Browser:

track: UCSC Genes
table: knownCanonical
output format: select fields from primary and related tables
press get output
select fields from hg19.knownCanonical: chrom, chromStart, chromEnd,transcript
transcript select fields from hg19.kgXref: geneSymbol
press get output

The file UCSC_canonical.bed looks like:

#hg19.knownCanonical.chrom      hg19.knownCanonical.chromStart  hg19.knownCanonical.chromEnd    hg19.knownCanonical.transcript  hg19.kgXref.geneSymbol
chr1    11873   14409   uc010nxq.1      DDX11L1
chr1    14361   19759   uc009viu.3      WASH7P
chr1    14406   29370   uc009viw.2      WASH7P
chr1    34610   36081   uc001aak.3      FAM138F
chr1    69090   70008   uc001aal.1      OR4F5
chr1    134772  140566  uc021oeg.2      LOC729737
chr1    321083  321115  uc001aaq.2      DQ597235
chr1    321145  321207  uc001aar.2      DQ599768
chr1    322036  326938  uc009vjk.2      LOC100133331

Download a bed file for all UCSC exons using UCSC Table Browser:

track: UCSC Genes
table: knownGene
output format: BED - browser extensible data
press get output
select option Exons
press get BED

The file UCSC_exons.bed looks like that:

chr1    11873   12227   uc001aaa.3_exon_0_0_chr1_11874_f        0       +
chr1    12612   12721   uc001aaa.3_exon_1_0_chr1_12613_f        0       +
chr1    13220   14409   uc001aaa.3_exon_2_0_chr1_13221_f        0       +
chr1    11873   12227   uc010nxr.1_exon_0_0_chr1_11874_f        0       +
chr1    12645   12697   uc010nxr.1_exon_1_0_chr1_12646_f        0       +
chr1    13220   14409   uc010nxr.1_exon_2_0_chr1_13221_f        0       +
chr1    11873   12227   uc010nxq.1_exon_0_0_chr1_11874_f        0       +
chr1    12594   12721   uc010nxq.1_exon_1_0_chr1_12595_f        0       +
chr1    13402   14409   uc010nxq.1_exon_2_0_chr1_13403_f        0       +
chr1    14361   14829   uc009vis.3_exon_0_0_chr1_14362_r        0       -

Modify the file to separate the transcript name of the rest of information:

awk '{split ($4,a,"_"); {print $1"\t"$2"\t"$3"\t"a[1]"\t"a[3]"\t"$5"\t"$6}}' UCSC_exons.bed > UCSC_exons_modif.bed

The file UCSC_exons_modif.bed:

chr1    11873   12227   uc001aaa.3      0       +
chr1    12612   12721   uc001aaa.3      1       +
chr1    13220   14409   uc001aaa.3      2       +
chr1    11873   12227   uc010nxr.1      0       +
chr1    12645   12697   uc010nxr.1      1       +
chr1    13220   14409   uc010nxr.1      2       +
chr1    11873   12227   uc010nxq.1      0       +
chr1    12594   12721   uc010nxq.1      1       +
chr1    13402   14409   uc010nxq.1      2       +
chr1    14361   14829   uc009vis.3      0       -

Join the sorted files based on the transcript identificator:

join -1 4 -2 4 <(sort -k4 UCSC_exons_modif.bed ) <(sort -k4 UCSC_canonical.bed) | awk '{print $2"\t"$3"\t"$4"\t"$10"\t"$5"\t"$6}' | bedtools sort -i "-" > UCSC_exons_modif_canonical.bed

The final file contains exons of the canonical transcripts:

chr1    11873   12227   DDX11L1 0       +
chr1    12594   12721   DDX11L1 1       +
chr1    13402   14409   DDX11L1 2       +
chr1    14361   14829   WASH7P  0       -
chr1    14406   16765   WASH7P  0       -
chr1    14969   15038   WASH7P  1       -
chr1    15795   15947   WASH7P  2       -
chr1    16606   16765   WASH7P  3       -
chr1    16857   17055   WASH7P  4       -
chr1    16857   17055   WASH7P  1       -

虽然上述方法也获得了需要的bed文件，但是使用qualimap对比对后的bam进行coverage 计算发现覆盖率特别低，我觉得大概率还是bed文件的问题，所以我又换了一个方法。。。

3. 从CCDS官网获取

在上述两个方法失利后我最后通过这个方法获取了想要的bed文件。

CCDS官网

进入官网后进入其ftp服务器

在进入ftp后首先看一下readme文件，里面有对应的老版本文件放在哪里（最新版本的应该是hg38的bed文件）根据需要下载hg19对应的文件和hg38对应的文件

注意这里我为了下载hg19对应的bed文件我下载的是CCDS.20131129.txt，即是2013年11月29日释放的file。

conda activate exon
cat CCDS.20131129.txt|perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{print "chr"$0"\t0\t+"}'  > hg19.exon.bed