miRNA Series | Data Filtering, Alignment & Target Gene Prediction

There are three ways to extract the mature miRNA sequences of a species of interest:

Perl or Python scripts
R scripts
Regex find-and-replace in Notepad++ or EmEditor
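
As a rough illustration of the script route, here is a minimal Python sketch. It assumes the input is a FASTA file whose record IDs start with a species prefix such as `ath-` (the style of the `ath-MIR773a` record shown later in this post). The function names and the toy records are made up for this example, and converting U to T is an optional convenience for aligning against a DNA genome, not something the original workflow prescribes.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_species(lines, prefix, rna_to_dna=False):
    """Keep records whose header starts with the species prefix, e.g. 'ath-'."""
    records = []
    for header, seq in parse_fasta(lines):
        if header.startswith(prefix):
            if rna_to_dna:
                # Optional: rewrite RNA letters as DNA letters.
                seq = seq.upper().replace("U", "T")
            records.append((header, seq))
    return records


if __name__ == "__main__":
    # Toy records invented for the demo; they are not real database entries.
    demo = [
        ">ath-demo1 toy record",
        "UGACAGAAGAGAGUGAGCAC",
        ">osa-demo1 toy record",
        "UGACAGAAGAGAGUGAGCAC",
    ]
    for header, seq in extract_species(demo, "ath-", rna_to_dna=True):
        print(">" + header)
        print(seq)
```

For a real file, pass an open file handle instead of the list, for example `extract_species(open("mature.fa"), "ath-")`, where `mature.fa` stands in for whatever FASTA file you downloaded.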

miRNA analysis workflow

For details, see [miRNA data filtering: I use cutadapt](miRNA 数据过滤我使用cutadapt). I have reorganized parts of it here; thanks to the author for sharing.

I. miRNA data filtering (plants: 18-30 nt)

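cutadapt -a AGATCGGAAGAGCACACGTCT -m 15 -q 20 --discard-untrimmed -o outname.fa input.fastq

input.fastq is a placeholder name for the raw reads file.
--discard-untrimmed discards reads in which no adapter was found.
-a trims the 3' adapter (the first read in paired-end data); appending $ to the sequence anchors the adapter at the 3' end of the read. The adapter sequence can be obtained from the sequencing provider.
-g trims the 5' adapter (the first read in paired-end data); prefixing the sequence with ^ anchors the adapter at the 5' end of the read.
-q trims low-quality bases from the 3' end of each read using the given quality cutoff (20 here); this happens before adapter removal.
-m discards reads that are shorter than 15 nt after trimming.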

This yields reads of a suitable length.
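
The command above only enforces a minimum length of 15 nt, while the heading mentions 18-30 nt for plants. One way to apply that window afterwards is sketched below in Python. The function name and the default bounds are my own choices for illustration; cutadapt's `-m` and `-M` options can apply the same bounds during trimming instead.

```python
def filter_by_length(records, min_len=18, max_len=30):
    """Keep (read_id, sequence) pairs whose sequence length is within
    [min_len, max_len], inclusive on both ends."""
    return [
        (read_id, seq)
        for read_id, seq in records
        if min_len <= len(seq) <= max_len
    ]


if __name__ == "__main__":
    # Toy reads invented for the demo.
    reads = [
        ("read1", "A" * 17),  # too short
        ("read2", "A" * 21),  # kept
        ("read3", "A" * 31),  # too long
    ]
    for read_id, seq in filter_by_length(reads):
        print(read_id, len(seq))
```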

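II. miRNA alignment

Option 1. Align the reads to the ncRNAs in Rfam and remove snRNA, snoRNA, rRNA, tRNA, and similar sequences.
Option 2. Align the miRNA reads to the reference genome of the target species and discard the reads that do not map.

To reduce alignment time, the reads of each sample can be collapsed before alignment into a FASTA file in which each record is named sample_rN_xM, where N after r is the serial number of the read and M after x is the number of times that read occurs.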
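
The collapsing step can be sketched in a few lines of Python. This is only an illustration of the naming rule described above: the function name is hypothetical, and starting the serial number at 0 and ordering reads by abundance are assumptions on my part rather than requirements stated in this post.

```python
from collections import Counter


def collapse_reads(sequences, sample):
    """Collapse identical reads into (name, sequence) pairs, where name
    follows the sample_rN_xM pattern (N: serial number, M: copy count)."""
    counts = Counter(sequences)
    records = []
    # most_common() lists the most abundant reads first.
    for index, (seq, copies) in enumerate(counts.most_common()):
        name = "{}_r{}_x{}".format(sample, index, copies)
        records.append((name, seq))
    return records


if __name__ == "__main__":
    # Toy reads invented for the demo.
    reads = ["TGACAGAAGAGAGTGAGCAC", "TTGACAGAAGATAGAGAGCAC", "TGACAGAAGAGAGTGAGCAC"]
    for name, seq in collapse_reads(reads, "sample1"):
        print(">" + name)
        print(seq)
```

Collapsing this way means each distinct sequence is aligned once, and the copy number survives in the read name for later expression counting.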

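Using miR-PREFeR

Introduction: miR-PREFeR (microRNA PREdiction From small RNAseq data). This post mainly follows the tutorial on GitHub.
With the reads aligned to the reference genome, miR-PREFeR is used to identify novel miRNAs.

Workflow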

1. Required programs

a. Install ViennaRNA first, preferably version 1.8.5, 2.1.2, or 2.1.5 and above.

wget https://www.tbi.univie.ac.at/RNA/download/sourcecode/2_4_x/ViennaRNA-2.4.18.tar.gz
tar zxvf ViennaRNA-2.4.18.tar.gz
cd ViennaRNA-2.4.18
./configure --prefix="/user/tools/ViennaRNA/" --without-perl
make
make install

b. Install samtools (0.1.15 or later)

cd /manager/biosoft/
tar jxf samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make

☝Note: miR-PREFeR is written for Python 2, so running it under Python 3 will raise errors!

The current version is only tested under Python 2.6.7, Python 2.7.2 and Python 2.7.3 and should work under Python 2.6.x and Python 2.7.x.

2. Obtain and install the pipeline

git clone https://github.com/hangelwen/miR-PREFeR.git

☝If you cannot download it from GitHub, you can get it from my network drive.
Link: https://pan.baidu.com/s/1UqkKYDOGcjv13dHm9pi9ew
Extraction code: volh

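3. Test the pipeline (for checking the installation; can be skipped)

The authors helpfully provide test data (example/exampledata.tar.gz) and a walkthrough for testing the whole pipeline (HOW_TO_RUN_EXAMPLE.txt).

The contents of HOW_TO_RUN_EXAMPLE.txt are reproduced below; let's go through them.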

================================================================================
1. Test the pipeline.

# The package provides a small example dataset for testing the pipeline. The
# dataset is for Arabidopsis, chromosome 1. To run the example, first change
# directory to the example folder:

cd  example
tar  xvf  exampledata.tar.gz       #  Then decompress the exampledata.tar.gz file:

# Then open the config.example file, change the PIPELINE_PATH to the path where
# you put the miR-PREFeR package folder. For example, if you put miR-PREFeR at
# /home/username/tools/miR-PREFeR-v0.09, then set PIPELINE_PATH as:
PIPELINE_PATH=/home/username/tools/miR-PREFeR-v0.09

# Save the config.example file. In the example folder, execute command:
python  ../miR_PREFeR.py  -L  -k  pipeline  config.example

# The -L option generates a log file in the output directory example-result. The
# -k option keeps the temp directory used to store the intermediate files. The
# temp directory is in the example-result directory.

# If you have python, samtools, RNALfold installed and in the PATH, you should be
# able to run the test program. It takes about one or two minutes to
# finish. You'll be able to see the result in the example-result folder.



================================================================================
2. Test how to do checkpointing.

# Before testing this, if you have run the pipeline with the config.example file
# in this folder, please remove the example-result folder first.

# Then change the 'CHECKPOINT_SIZE' option to a smaller value (30, for
# example). The reason to do this is that by default the pipeline makes a
# checkpoint after finishing folding every 3000 sequences, but the sample data is
# so small that the total number of sequences is smaller than the default.

# Then run the pipeline with 'pipeline' command:
python  ../miR_PREFeR.py  -L  -k  pipeline  config.example

# After running for a while (10 seconds, for example. You should let it run for
# enough time to do at least one checkpoint. A "Done" is shown when a checkpoint
# is applied), kill the process by "Ctrl-C". To check where the pipeline was stopped,
# run:
python ../miR_PREFeR.py -L check config.example

# This will show the checkpoint information.

# To restart the pipeline from where it was stopped, run:
python  ../miR_PREFeR.py  -L  recover  config.example

# The pipeline will continue to finish the job specified in the config.example
# file.
================================================================================
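
The walkthrough above assumes that python, samtools and RNALfold are all on the PATH. Before running the example, a quick pre-flight check like the following sketch can confirm that. The function name is hypothetical, and this helper should be run with Python 3 because `shutil.which` does not exist in Python 2; it is separate from miR-PREFeR itself, which still needs Python 2.

```python
import shutil


def missing_tools(tools=("python", "samtools", "RNALfold")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required programs were found on PATH.")
```

If RNALfold is reported missing after installing ViennaRNA with a custom `--prefix`, the `bin` directory under that prefix probably still needs to be added to PATH.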

4. How to run the pipeline (now for the real work)

a. Prepare input data for the pipeline.

A fasta file, which contains the genome sequences of the species under study.
One or more SAM files which contain the alignments of small RNAseq data with the genome.
(Optional) A GFF (http://www.sanger.ac.uk/resources/software/gff/spec.html) file which lists regions in the genome sequences that should be ignored from miRNA analysis.

a). Genome fasta file (explains `A fasta file` above)

Fasta format specification can be found at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. In miR-PREFeR, for the string following ">", only the first word that is delimited by any white space characters (space, tab, etc.) is used. For example, for the following sequence, 'ath-MIR773a' is used as the identifier of the sequence. Thus, please ensure that all the sequences in the FASTA files have different identifiers.

>ath-MIR773a MI0005103
AGGAGGCAAUAGCUUGAGCAAAUAAUUGAUUGCAGAAGUCCAUCGACUAAAGCUGUCACCUGUUUGCUUCCAGCUUUUGUCUCCU
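
To check a genome FASTA file against this rule before running the pipeline, something like the following Python sketch can help. It takes the first whitespace-delimited word after ">" as the identifier, as described above, and reports identifiers that occur more than once. The function names are my own and are not part of miR-PREFeR.

```python
from collections import Counter


def fasta_identifiers(lines):
    """Return the first whitespace-delimited word of every '>' header line."""
    identifiers = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()  # splits on spaces, tabs, newlines
            identifiers.append(fields[0] if fields else "")
    return identifiers


def duplicated_identifiers(lines):
    """Return the sorted identifiers that appear in more than one header."""
    counts = Counter(fasta_identifiers(lines))
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    demo = [
        ">ath-MIR773a MI0005103",
        "AGGAGGCAAUAGCUUGAGCAAAUAAUUGAUUGCAGAAGUCCAUCGACUAAAGCUGUCACCUGUUUGCUUCCAGCUUUUGUCUCCU",
    ]
    print(fasta_identifiers(demo))
    print(duplicated_identifiers(demo))
```

An empty list from `duplicated_identifiers` means every sequence has a distinct identifier under the first-word rule.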

b). SAM alignment files (explains `SAM files` above)

The miR-PREFeR pipeline takes SAM format alignment files. SAM alignment files can be generated by many aligners. Here we use Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) as an example.

$\color{green}{\it\small{Note}}$

I'm tired for today; to be continued....
