前言
以这篇文章 Chromatin Accessibility-Based Characterization of the Gene Regulatory Network Underlying Plasmodium falciparum Blood-Stage Development 为例,发表于 Cell Host Microbe. (IF=15.793)。
这篇文章主要是利用染色质可接近信息性来解析恶性疟原虫在红内期基因调控网络的特征 ,选这篇文章的原因是因为其用到了ATAC-seq和RNA-seq联合分析,其中也结合了部分Chip-seq的信息,比较全面。另外,其主要研究物种恶性疟原虫的基因组较小(23Mb左右),适合使用小服务器或者自己的电脑分析。
背景资料
恶性疟原虫(Plasmodium falciparum)是一种原生动物寄生虫,是引发人类疟疾的疟原虫的一种,由雌性疟蚊传播。该型疟原虫引发的疟疾是所有疟疾中死亡率最高的一种。屠呦呦团队研究的青蒿素就是对抗这种恶性疟原虫引起的疟疾的药品。
数据资料
看一下该文章中用到的数据,发现其用到了ATAC-seq、RNA-seq和chip-seq的数据。恶性疟原虫的基因组、转录本、和蛋白组也给出了相应的链接。基因组、转录本、和蛋白组信息直接点开链接或者复制地址用wget就可以下载,这里就不赘述了。下面看一下组学的数据。
在NCBI中找到Project: GSE104075 ,我们进入其 Send results to Run selector 数据里看一下,发现其一共有26个数据,包含了18个ATAC-seq数据和8个RNA-seq数据。详细信息可以点击RunInfo Table下载。
同样找到GSE80293, 其有9个Chip-seq的实验数据,具体内容还要参见Red Blood Cell Invasion by the Malaria Parasite Is Coordinated by the PfAP2-I Transcription Factor这篇文章。
看一下18个ATAC-seq的数据,其用的是双端测序,150 bp,测了8个不同时期的数据,另外还有2个野生型ring stage 的疟原虫的gDNA来进行校正(对照?)
To control for sequence bias, the same ATAC protocol was applied to genomic DNA from synchronous wild-type 3D7 P. falciparum ring stage parasites using 547.0 ng or 60.8 ng of input DNA.
Run | Assay Type | AvgSpotLen | developmental_stage | LibraryLayout | LibrarySource | source_name | input |
---|---|---|---|---|---|---|---|
SRR6055335 | ATAC-seq | 150 | ring stage | PAIRED | GENOMIC | red blood cell stage, 0-20 hours post invasion | 547.0 ng genomic DNA |
SRR6055336 | ATAC-seq | 150 | ring stage | PAIRED | GENOMIC | genomic DNA red blood cell stage, 0-20 hours post invasion | 60.8 ng of genomic DNA |
SRR6055333 | ATAC-seq | 150 | medium schizont | PAIRED | GENOMIC | red blood cell stage, 27-35 hours post invasion | |
SRR6689450 | ATAC-seq | 150 | medium schizont | PAIRED | GENOMIC | red blood cell stage, 27-35 hours post invasion | |
SRR6055328 | ATAC-seq | 150 | medium ring stage | PAIRED | GENOMIC | red blood cell stage, 2-10 hours post invasion | |
SRR6689445 | ATAC-seq | 150 | medium ring stage | PAIRED | GENOMIC | red blood cell stage, 2-10 hours post invasion | |
SRR6055331 | ATAC-seq | 150 | late trophozoite | PAIRED | GENOMIC | red blood cell stage, 17-25 hours post invasion | |
SRR6689448 | ATAC-seq | 150 | late trophozoite | PAIRED | GENOMIC | red blood cell stage, 17-25 hours post invasion | |
SRR6055334 | ATAC-seq | 150 | late schizont | PAIRED | GENOMIC | red blood cell stage, 32-40 hours post invasion | |
SRR6689451 | ATAC-seq | 150 | late schizont | PAIRED | GENOMIC | red blood cell stage, 32-40 hours post invasion | |
SRR6055329 | ATAC-seq | 150 | late ring stage | PAIRED | GENOMIC | red blood cell stage, 7-15 hours post invasion | |
SRR6689446 | ATAC-seq | 150 | late ring stage | PAIRED | GENOMIC | red blood cell stage, 7-15 hours post invasion | |
SRR6055330 | ATAC-seq | 150 | early trophozoite | PAIRED | GENOMIC | red blood cell stage, 12-20 hours post invasion | |
SRR6689447 | ATAC-seq | 150 | early trophozoite | PAIRED | GENOMIC | red blood cell stage, 12-20 hours post invasion | |
SRR6055332 | ATAC-seq | 150 | early schizont | PAIRED | GENOMIC | red blood cell stage, 22-30 hours post invasion | |
SRR6689449 | ATAC-seq | 150 | early schizont | PAIRED | GENOMIC | red blood cell stage, 22-30 hours post invasion | |
SRR6055327 | ATAC-seq | 150 | early ring stage | PAIRED | GENOMIC | red blood cell stage, 40-5 hours post invasion | |
SRR6689444 | ATAC-seq | 150 | early ring stage | PAIRED | GENOMIC | red blood cell stage, 40-5 hours post invasion |
看一下8个RNA-seq的数据,其用的是链特异性(Strand-specific )RNA-seq,关于链特异性RNA-seq可以参考这两篇博文:什么是链特异性的RNA-Seq?和 链特异性测序那点事(后续还需自己学习)。单端测序,75 bp,测了8个不同时期的数据,和ATAC-seq的8个时期相对应。
Run | Assay Type | AvgSpotLen | developmental_stage | LibraryLayout | LibrarySource | source_name |
---|---|---|---|---|---|---|
SRR6055343 | RNA-Seq | 75 | medium schizont | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 27-35 hours post invasion |
SRR6055338 | RNA-Seq | 75 | medium ring stage | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 2-10 hours post invasion |
SRR6055341 | RNA-Seq | 75 | late trophozoite | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 17-25 hours post invasion |
SRR6055344 | RNA-Seq | 75 | late schizont | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 32-40 hours post invasion |
SRR6055339 | RNA-Seq | 75 | late ring stage | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 7-15 hours post invasion |
SRR6055340 | RNA-Seq | 75 | early trophozoite | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 12-20 hours post invasion |
SRR6055342 | RNA-Seq | 75 | early schizont | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 22-30 hours post invasion |
SRR6055337 | RNA-Seq | 75 | early ring stage | SINGLE | TRANSCRIPTOMIC | red blood cell stage, 40-5 hours post invasion |
文章中对于chip-seq测序的描述。
Magna ChIP protein A+G magnetic beads (Millipore) (ChIP replicates 1 and 2) or protein A salmon sperm agarose beads (Millipore) (ChIP replicate 3), an input sample was collected and the rest of the chromatin was incubated overnight at 4C with 1 mg of polyclonal anti-GFP or, as control, the same amount of IgG.
两种抗体(anti-GFP 和anti-IgG,其中anti-IgG是对照),还有一组没有没有用抗体。3次重复,用了两种磁珠,rep1 rep2用的是 A+G magnetic beads,rep3用的是agarose beads。
关于chip-seq的对照使用参见Kai的博文,摘抄如下:
- 什么是input:样本经过超声,但是没有进行ChIP,包含样本超声后总DNA,开头进行的电泳,检测超声效果,同时,可以与最后ChIP样本进行比较,判断ChIP的效率(如果用同一引物进行PCR,ChIP组和input组亮度差不多,说明ChIP效率高,样本中所有的目的基因片段都被ChIP下来了,繁殖,说明效率低,实验条件有待改进)
- 什么是IgG:用普通的IgG做为抗体,理论上不会ChIP下来任何DNA片段,但是由于非特异结合,或者实验过程中,没有发生结合的DNA清除不完全,可能也会出现条带。(ps.似乎IgG的抗体也无需上机测序,只是判断IP试验中所选取的抗体是否特异。)
阳性对照:
一般用anti-RNA polymerase II抗体,因为RNA polymerase II是通用转录因子,在所有细胞中都能结合基因的核心启动子区,因此理论上,ChIP后PCR会有条带。一般阳性对照不进行测序。
9个数据,单端测序,150bp
Run | Assay Type | AvgSpotLen | chip_antibody | LibraryLayout | LibrarySelection | LibrarySource | source_name |
---|---|---|---|---|---|---|---|
SRR5114667 | ChIP-Seq | 150 | None | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, input |
SRR5114670 | ChIP-Seq | 150 | None | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, input |
SRR5114673 | ChIP-Seq | 150 | None | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, input |
SRR5114666 | ChIP-Seq | 150 | Anti-IgG (Abcam ab27472) | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, IgG ChIP |
SRR5114669 | ChIP-Seq | 150 | Anti-IgG (Abcam ab27472) | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, IgG ChIP |
SRR5114672 | ChIP-Seq | 150 | Anti-IgG (Abcam ab27472) | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, IgG ChIP |
SRR5114665 | ChIP-Seq | 150 | Anti-GFP (Abcam ab290) | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, GFP ChIP |
SRR5114668 | ChIP-Seq | 150 | Anti-GFP (Abcam ab290) | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, GFP ChIP |
SRR5114671 | ChIP-Seq | 150 | Anti-GFP (Abcam ab290) | SINGLE | ChIP | GENOMIC | PfAP2-I-GFP DNA, GFP ChIP |
资料下载
所需软件:sra-tools,可直接用conda下载
conda install -c bioconda sra-tools
所需表单:sra.list
将表格中的run ID 复制到一个文档里,命名为sra.list
less sra.list
SRR6055327
SRR6689444
SRR6055332
SRR6689449
SRR6055343
SRR6055330
SRR6689447 .....
这里踩过一个坑,提取的list里有CR,下一步使用prefetch的时候会报错。这个图是没有CR的,在notepad里显示所有字符就可以看到是否含有CR,关于CR和LF的区别可以参考这里。
命令:
-
直接使用prefetch
prefetch --option-file sra.list
-
使用脚本,命名为download_srr.sh,内容如下:
for id in `cat sra.list`;do nohup prefetch $id done
之后会自动下载所有文件,下载过程中输出的内容会存入nohup.out 里,下载的sra文件储存在/home/ncbi/public/sra 下
#2019-11-19T11:11:22 prefetch.2.9.6: 1) Downloading 'SRR6689446'...
#2019-11-19T11:11:22 prefetch.2.9.6: Downloading via https...
#2019-11-19T11:15:38 prefetch.2.9.6: https download succeed
#2019-11-19T11:15:38 prefetch.2.9.6: 1) 'SRR6689446' was downloaded successfully
#2019-11-19T11:15:38 prefetch.2.9.6: 'SRR6689446' has 0 unresolved dependencies
ls *.sra
#SRR6055327.sra SRR6689444.sra SRR6689445.sra
资料解压
从NCBI中下到的是sra文件,需要使用fastq-dump将sra文件转化成fastq文件。
具体使用方法参见洲更大神的简书Fastq-dump: 一个神奇的软件
其实根据他文章所说,可以都按照双端测序的流程就行解压。具体我没有测试,我是把双端和单端数据(也就是ATAC-seq和RNA-seq)分别进行处理的。将单端的ID命名为single.txt, 双端测序的ID命名为paired.txt。对于双端测序的文件。
for id in `cat ../paired.txt`; do
fastq-dump --gzip --split-3 ./${id}.sra
done
对于单端测序的文件
for id in `cat ../single.txt`; do
fastq-dump --gzip ./${id}.sra
done
看一下解压出来的结果,一共44个文件,都是fastq.gz,也就是fastq的压缩文件。
ls *.gz
#SRR6055327_1.fastq.gz SRR6055331_2.fastq.gz SRR6055336_1.fastq.gz SRR6055344.fastq.gz #SRR6689448_1.fastq.gz SRR6055327_2.fastq.gz SRR6055332_1.fastq.gz .....
ls *.gz |wc -l
#44
看看有多少双端测序文件,因为双端测序会以_分开,故可以使用以下代码统计个数。
ls *_*.gz| wc -l #wc -l 统计行数
#36
ls *_*.gz| grep -c _
#36
ATAC-seq一共18*2, 36个没问题。再看看有多少单端测序文件。
ls *.gz| grep -cv _ #grep -c 个数 -v不匹配的项目
#17
RNA-seq一共8个, chip-seq9个,也没有问题。
结语
好了,现在拿到了fastq的原始数据。下一步应该对原始数据进行质控了,下一篇见!加油!努力记录自己的点滴。另外还要多学习一下grep、awk和sed的用法,实在是用的太不熟练了。