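This post mainly follows the Zhihu post "Three ways to extract mature miRNA sequences".
There are three ways to extract the mature miRNA sequences of a species of interest:
- Perl or Python scripts
- R scripts
- Regular-expression find and replace in Notepad++ or EmEditor
miRNA analysis workflow
This section is adapted, with some reorganization, from the post "miRNA 数据过滤我使用cutadapt" (filtering miRNA data with cutadapt). Thanks to its author for sharing.
I. miRNA data filtering (plants: 18-30 nt)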
cutadapt -a AGATCGGAAGAGCACACGTCT -m 15 -q 20 --discard-untrimmed -o outname .fa
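- `--discard-untrimmed`: discard reads that do not contain the adapter.
- `-a`: trim the 3' adapter (for paired-end data, from the first read). Appending `$` to the adapter sequence anchors it at the 3' end of the read. The adapter sequence can be requested from the sequencing company.
- `-g`: trim the 5' adapter (for paired-end data, from the first read). Prefixing the adapter sequence with `^` anchors it at the 5' end of the read.
- `-q`: quality cutoff; low-quality bases (here, quality below 20) are trimmed from the 3' end of each read.
- `-m`: discard reads shorter than 15 nt after trimming.
This yields reads of suitable length.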
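The heading above gives 18-30 nt as the plant miRNA range, while the command only sets a lower bound of 15 nt. A minimal Python sketch of a length post-filter on already-trimmed reads follows. The function name, the read names, and the in-memory list of (name, sequence) pairs are illustrative assumptions, not part of cutadapt, which also has its own `-M` maximum-length option.

```python
def filter_by_length(records, min_len=18, max_len=30):
    """Keep (name, sequence) pairs whose length is within [min_len, max_len]."""
    return [(name, seq) for name, seq in records if min_len <= len(seq) <= max_len]


# Hypothetical trimmed reads, for illustration only.
trimmed = [
    ("read1", "ACGT" * 5),         # 20 nt: kept
    ("read2", "ACGTACGTACGTACG"),  # 15 nt: shorter than 18, dropped
    ("read3", "A" * 35),           # 35 nt: longer than 30, dropped
]

kept = filter_by_length(trimmed)
print([name for name, _ in kept])
```

Both bounds are inclusive, so reads of exactly 18 nt or 30 nt are kept.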
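II. miRNA alignment
- Option 1. Align the reads to the ncRNAs in Rfam and remove snRNA, snoRNA, rRNA, tRNA, and so on.
- Option 2. Align the miRNA reads to the reference genome of the target species and discard sequences that fail to align.
To reduce alignment time, the identical reads in each sample can be collapsed before alignment and written in FASTA format, with sequence names of the form:
sample_rN_xM, where the number N after r is the read index and the number M after x is how many times that read occurs.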
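The collapsing step can be sketched in a few lines of Python. This is only an illustration of the naming rule described above, not the script shipped with any particular tool. The function name is hypothetical, and two details are my own assumptions: the read index starts at 1, and reads are ordered by decreasing count.

```python
from collections import Counter


def collapse_reads(sequences, sample):
    """Collapse identical reads into FASTA records named sample_rN_xM.

    N is a running read index (assumed here to start at 1) and M is the
    number of times the sequence occurs in the input.
    """
    counts = Counter(sequences)
    lines = []
    # most_common() sorts by decreasing count; ties keep first-seen order.
    for index, (seq, count) in enumerate(counts.most_common(), start=1):
        lines.append(">{}_r{}_x{}".format(sample, index, count))
        lines.append(seq)
    return "".join(line + "\n" for line in lines)


# Hypothetical reads from one sample.
reads = [
    "TGACAGAAGAGAGTGAGCAC",
    "TTGACAGAAGATAGAGAGCAC",
    "TGACAGAAGAGAGTGAGCAC",
]
print(collapse_reads(reads, "sampleA"), end="")
```

With these three reads the output holds two records: the duplicated sequence gets the suffix `_x2` and the singleton gets `_x1`.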
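Using miR-PREFeR
Introduction: miR-PREFeR stands for microRNA PREdiction From small RNAseq data. This section mainly follows the tutorial on GitHub.
Here miR-PREFeR is used, with reads aligned to the reference genome, to identify novel miRNAs.
Workflow
1. Required programs (prerequisite packages)
a. Install ViennaRNA beforehand, preferably version 1.8.5, 2.1.2, or 2.1.5 and later.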
wget https://www.tbi.univie.ac.at/RNA/download/sourcecode/2_4_x/ViennaRNA-2.4.18.tar.gz
tar zvxf ViennaRNA-2.4.18.tar.gz
cd ViennaRNA-2.4.18
./configure --prefix="/user/tools/ViennaRNA/" --without-perl
make
make install
b. Install samtools (version 0.1.15 or later)
cd /manager/biosoft/
tar jfx samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
☝ Note: miR-PREFeR is written for Python 2, so running it under Python 3 raises errors. From the project's documentation:
The current version is only tested under Python 2.6.7, Python 2.7.2 and Python 2.7.3 and should work under Python 2.6.* and Python 2.7.*.
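Before going further, it can help to confirm that the external programs are actually reachable. Below is a minimal sketch that reports which tools are missing from the PATH. The helper name `missing_tools` is my own, and it uses `shutil.which`, which needs Python 3.3 or later, so run it separately from the Python 2 interpreter used for the pipeline itself. `RNALfold` is the ViennaRNA program the example instructions expect to find in the PATH.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without the
    real programs installed.
    """
    return [tool for tool in tools if which(tool) is None]


# Programs the pipeline expects on the PATH; the result depends on your machine.
print(missing_tools(["samtools", "RNALfold"]))
```

An empty list means both programs were found. Anything printed still needs to be installed or added to the PATH.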
2. Obtain and install the pipeline (download and install miR-PREFeR)
git clone https://github.com/hangelwen/miR-PREFeR.git
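☝ If you cannot download it via git, you can get it from my Baidu Netdisk share.
Link: https://pan.baidu.com/s/1UqkKYDOGcjv13dHm9pi9ew
Extraction code: volh
3. Test the pipeline (for checking the installation; can be skipped)
The authors thoughtfully provide test data (example/exampledata.tar.gz) along with instructions for testing the whole pipeline (HOW_TO_RUN_EXAMPLE.txt).
The contents of HOW_TO_RUN_EXAMPLE.txt are reproduced below; let's go through them.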
================================================================================
1. Test the pipeline.
# The package provides a small example dataset for testing the pipeline. The
# dataset is for Arabidopsis, chromosome 1. To run the example, first change
# directory to the example folder:
cd example
# Then decompress the exampledata.tar.gz file:
tar xvf exampledata.tar.gz
# Then open the config.example file, change the PIPELINE_PATH to the path where
# you put the miR-PREFeR package folder. For example, if you put miR-PREFeR at
# /home/username/tools/miR-PREFeR-v0.09, then set PIPELINE_PATH as:
PIPELINE_PATH=/home/username/tools/miR-PREFeR-v0.09
# Save the config.example file. In the example folder, execute command:
python ../miR_PREFeR.py -L -k pipeline config.example
# The -L option generates a log file in the output directory example-result. The
# -k option keeps the temp directory used to store the intermediate files. The
# temp directory is in the example-result directory.
# If you have python, samtools, RNALfold installed and in the PATH, you should be
# able to run the test program. It takes about one or two minutes to
# finish. You'll be able to see the result in the example-result folder.
================================================================================
2. Test how to do checkpointing.
# Before testing this, if you have run the pipeline with the config.example file
# in this folder, please remove the example-result folder first.
# Then change the 'CHECKPOINT_SIZE' option to a smaller value (30, for
# example). The reason to do this is that by default the pipeline makes a
# checkpoint after finishing folding every 3000 sequences, but the sample data is
# so small that the total number of sequences is smaller than the default.
# Then run the pipeline with 'pipeline' command:
python ../miR_PREFeR.py -L -k pipeline config.example
# After running for a while (10 seconds, for example. You should let it run for
# enough time to do at least one checkpoint. A "Done" is shown when a checkpoint
# is applied), kill the process by "Ctrl-C". To check where the pipeline was stopped,
# run:
python ../miR_PREFeR.py -L check config.example
# This will show the checkpoint information.
# To restart the pipeline from where it was stopped, run:
python ../miR_PREFeR.py -L recover config.example
# The pipeline will continue to finish the job specified in the config.example
# file.
================================================================================
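Both tests above involve hand-editing config.example (setting PIPELINE_PATH, lowering CHECKPOINT_SIZE). If you would rather script that edit, here is a rough sketch. It assumes the file consists of KEY=VALUE lines with `#` comments, which is only inferred from the `PIPELINE_PATH=...` line shown above. The function name and the toy config text are hypothetical.

```python
def set_config_value(config_text, key, value):
    """Return config_text with `key` set to `value`.

    Assumes KEY=VALUE lines (optional spaces around '=') and '#' comments.
    If the key is absent, a new KEY=VALUE line is appended.
    """
    new_line = "{}={}".format(key, value)
    out_lines = []
    replaced = False
    for line in config_text.splitlines():
        stripped = line.strip()
        is_setting = "=" in stripped and not stripped.startswith("#")
        if is_setting and stripped.split("=", 1)[0].strip() == key:
            out_lines.append(new_line)
            replaced = True
        else:
            out_lines.append(line)
    if not replaced:
        out_lines.append(new_line)
    return "".join(line + "\n" for line in out_lines)


# Toy config text for illustration, not the real config.example.
example = (
    "# toy config\n"
    "PIPELINE_PATH=/old/path\n"
    "CHECKPOINT_SIZE = 3000\n"
)
updated = set_config_value(example, "CHECKPOINT_SIZE", 30)
print(updated, end="")
```

Comment lines and unrelated settings pass through unchanged, so the function can be applied once per option you want to change.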
4. How to run the pipeline (now for the real work)
a. Prepare input data for the pipeline.
- A fasta file, which contains the genome sequences of the species under study.
- One or more SAM files, which contain the alignments of small RNAseq data with the genome.
- (Optional) A GFF (http://www.sanger.ac.uk/resources/software/gff/spec.html) file, which lists regions in the genome sequences that should be ignored from miRNA analysis.
a). Genome fasta file (details on "A fasta file" above)
Fasta format specification can be found at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. In miR-PREFeR, for the string following ">", only the first word that is delimited by any white space characters (space, tab, etc.) is used. For example, for the following sequence, 'ath-MIR773a' is used as the identifier of the sequence. Thus, please ensure that all the sequences in the FASTA files have different identifiers.
>ath-MIR773a MI0005103
AGGAGGCAAUAGCUUGAGCAAAUAAUUGAUUGCAGAAGUCCAUCGACUAAAGCUGUCACCUGUUUGCUUCCAGCUUUUGUCUCCU
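Because only the first word of each header counts, two headers that look different can still collide. Here is a small sketch for checking a FASTA file's identifiers before running the pipeline. The function names and the second record are made up for illustration, and this is not how miR-PREFeR itself parses the file.

```python
from collections import Counter


def fasta_identifiers(fasta_text):
    """Return the first whitespace-delimited word of every '>' header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return ids


def duplicate_identifiers(fasta_text):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(fasta_identifiers(fasta_text))
    return sorted(name for name, n in counts.items() if n > 1)


# The first record is the header from the example above; the second is a
# made-up header that shares the same first word.
fasta = (
    ">ath-MIR773a MI0005103\n"
    "AGGAGGCAAUAGCUUGAGC\n"
    ">ath-MIR773a\tanother description\n"
    "ACGUACGUACGU\n"
)
print(fasta_identifiers(fasta))
print(duplicate_identifiers(fasta))
```

Any identifier reported by `duplicate_identifiers` should be renamed in the FASTA file before it is used as pipeline input.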
b). SAM alignment files (details on "SAM files" above)
The miR-PREFeR pipeline takes SAM format alignment files. SAM alignment files can be generated by many aligners. Here we use Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) as an example.
That's all for today; to be continued...