1. Extract the unmapped reads

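# For each SAM file, write the unmapped reads (-f 4) to paired and singleton FASTQ files, then compress them.
# -N always appends /1 and /2 to the read names.
for id in *bt2.sam
do
    samtools fastq -@ 2 -f 4 -N -1 "${id%.*}_1.fq" -2 "${id%.*}_2.fq" -s "${id%.*}_single.fq" "$id"
    gzip "${id%.*}_1.fq" "${id%.*}_2.fq" "${id%.*}_single.fq"
done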
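
The output names in the loops throughout these notes are built with Bash parameter expansion. As a quick illustration (the file name below is made up for this example only), `%` removes the shortest matching suffix and `%%` the longest:

```shell
#!/usr/bin/env bash
# Hypothetical file name, used only to illustrate the expansions.
id="sampleA_rep1_trim.bt2.sam"

short_dot=${id%.*}    # strip shortest suffix matching ".*" -> sampleA_rep1_trim.bt2
long_dot=${id%%.*}    # strip longest  suffix matching ".*" -> sampleA_rep1_trim
short_us=${id%_*}     # strip shortest suffix matching "_*" -> sampleA_rep1
long_us=${id%%_*}     # strip longest  suffix matching "_*" -> sampleA

printf '%s\n' "$short_dot" "$long_dot" "$short_us" "$long_us"
```

Which form is appropriate depends on how the input files are named, so it is worth echoing the expanded names once before running a long job.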

2. Align with bowtie2

# Align each read pair; the mate-2 file name is derived from the mate-1 file name.
for id in *1.fq.gz
do
    bowtie2 -t -p 2 --very-sensitive --score-min=C,-15,0 --mm \
        -x /media/luozhixin/0000678400004823/circRNA/1download_data/index/hg19.fa \
        -1 "$id" -2 "${id%_*}_2.fq.gz" \
        -S "${id%_*}.pebt2.sam" \
        2> "${id%_*}.bowtie2.log"
done
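
The loop above assumes that every *_1.fq.gz file has a matching *_2.fq.gz file. The following is a minimal sketch of a guard that could be added under that naming assumption; the helper names `mate2_of` and `have_pair` are hypothetical and not part of bowtie2 or the original notes.

```shell
#!/usr/bin/env bash
# Derive the mate-2 file name from a mate-1 file name.
# Assumes the <sample>_1.fq.gz / <sample>_2.fq.gz naming used in these notes.
mate2_of() {
    local r1=$1
    printf '%s_2.fq.gz\n' "${r1%_*}"
}

# Succeed only if both mate files exist and are non-empty.
have_pair() {
    local r1=$1 r2
    r2=$(mate2_of "$r1")
    [ -s "$r1" ] && [ -s "$r2" ]
}
```

Inside the loop it could be used as `have_pair "$id" || { echo "skipping $id: mate file missing" >&2; continue; }` before the bowtie2 call.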

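bowtie2 options:
-p: number of threads;
--very-sensitive: same as -D 20 -R 3 -N 0 -L 20 -i S,1,0.50.
By default bowtie2 searches for multiple alignments of each read and reports the best one;
-N: number of mismatches allowed in a seed alignment, 0 or 1; the higher value is slower but more sensitive.
-L: seed length; smaller values make alignment slower but more sensitive.
-i: function governing the interval between the seed substrings used during multiseed alignment.
--score-min=C,-15,0: function giving the minimum alignment score; here a constant, -15;
--mm: use memory-mapped I/O to load the index.

samtools view options:
-h: include the header lines in the output;
-b: output in BAM format;
-u: output uncompressed BAM;
-S: ignored; accepted only for compatibility with older samtools versions.
To get unmapped reads: -f keeps reads with the given flag bits set, -F filters them out (so -f 4 keeps the unmapped reads).
Reference: https://www.jianshu.com/p/f5636a0121a6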
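
To make the -f / -F distinction concrete: the SAM FLAG field is a bit mask, and bit 0x4 means "read unmapped". The sketch below only illustrates that bit test in Bash; the function name `is_unmapped` and the example flag values are chosen for illustration, and samtools itself performs this test internally.

```shell
#!/usr/bin/env bash
# Succeed if the given SAM FLAG value has the 0x4 (read unmapped) bit set.
is_unmapped() {
    local flag=$1
    (( (flag & 4) != 0 ))
}

# 77 = 1 + 4 + 8 + 64 (paired, unmapped, mate unmapped, first in pair)
# 99 = 1 + 2 + 32 + 64 (paired, properly paired, mate reverse, first in pair)
for flag in 77 99 4 16; do
    if is_unmapped "$flag"; then
        echo "$flag: kept by -f 4, dropped by -F 4"
    else
        echo "$flag: dropped by -f 4, kept by -F 4"
    fi
done
```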

3. Extract the unmapped reads, split them, and generate FASTQ (FASTA) files

3.1 The official pipeline

3.1.1 Generate BAM files

# Keep only the unmapped reads (-f 4) and write them, with the header, as BAM.
for id in *sebt2.sam
do
    samtools view -@ 2 -hbf 4 "$id" > "${id%%_*}.unmapped.bam"
done

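-h: include the header lines (the lines starting with @) in the output;
-f: keep only reads with the given flag bits set (4 = unmapped);
-b: output in BAM format.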

3.1.2 Extract anchor sequences from both ends of each read

Install the required packages

conda install numpy
conda install -c bioconda pysam

Test that the script runs

./unmapped2anchors.py -h

# Write 20-nt anchors from each unmapped read to a FASTQ file, then compress it.
for id in *unmapped.bam
do
    ./unmapped2anchors.py -a 20 "$id" > "${id%%.*}.anchor.fq"
    gzip "${id%%.*}.anchor.fq"
done

-a ASIZE: anchor length, 20 nt by default;
gzip: compresses the output file.
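
To illustrate what "anchors from both ends" means, here is a minimal awk sketch that takes the first and last N bases (and qualities) of each FASTQ record. It is not the unmapped2anchors.py script: the function name `fastq_anchors`, the `_A` / `_B` name suffixes, and the rule of skipping reads shorter than twice the anchor length are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Read 4-line FASTQ records on stdin and print two anchors per read:
# <name>_A = first N bases, <name>_B = last N bases (N defaults to 20).
# Reads shorter than 2*N are skipped (an assumption of this sketch).
fastq_anchors() {
    local asize=${1:-20}
    awk -v a="$asize" '
        NR % 4 == 1 { name = substr($1, 2) }   # drop "@" and any comment after the name
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            qual = $0
            n = length(seq)
            if (n < 2 * a) next
            printf "@%s_A\n%s\n+\n%s\n", name, substr(seq, 1, a), substr(qual, 1, a)
            printf "@%s_B\n%s\n+\n%s\n", name, substr(seq, n - a + 1), substr(qual, n - a + 1)
        }'
}

# Example with a 3-nt anchor on a 10-nt read.
printf '@r1\nACGTACGTAC\n+\nIIIIIHHHHH\n' | fastq_anchors 3
```

For the example read, the head anchor is ACG and the tail anchor is TAC, each with the matching slice of the quality string.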

3.2 One-step method

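# Write the unmapped reads to FASTQ, extract 20-nt anchors from that file, then compress the anchors.
for id in *sebt2.sam
do
    samtools fastq -@ 2 -f 4 "$id" > "${id%%.*}.unmapped.fq"
    ./unmapped2anchors-V3.py -Q -a 20 "${id%%.*}.unmapped.fq" > "${id%%.*}.anchor.fq"
    gzip "${id%%.*}.anchor.fq"
done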

Note: unmapped2anchors.py does not accept input through a pipe ("|"), so the unmapped reads are written to a file first.

4. Align the anchor sequences to the reference genome

# Align the anchors as unpaired reads (-U), keeping the output in input order (--reorder).
for id in *anchor.fq.gz
do
    bowtie2 -p 2 \
        --reorder \
        --mm \
        --score-min=C,-15,0 \
        -q -x /media/luozhixin/0000678400004823/Indexs/bowtie2/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index \
        -U "$id" \
        -S "${id%_*}.bt2.sam" \
        2> "${id%_*}.bt2.log"
done

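--reorder: guarantees that the SAM records are written in the same order as the reads in the input file, even when -p is greater than 1. With --reorder and -p greater than 1, Bowtie 2 runs somewhat slower and uses somewhat more memory than without --reorder. With -p 1 it has no effect, because the output order already follows the input order.
--mm: loads the index with memory-mapped I/O instead of normal file I/O. Memory mapping lets many concurrent bowtie processes on the same computer share one memory image of the index, so the memory cost is paid only once. This keeps parallel runs memory-efficient when using -p is not possible or not desirable.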
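
Options such as --score-min and -i take a function specification of the form TYPE,constant,coefficient, where TYPE is C (constant), L (linear), S (square root) or G (natural log) and the argument is the read length. As a sketch for checking what a given specification evaluates to, the hypothetical helper `bt2_func` below reproduces that arithmetic with awk; it is not part of bowtie2.

```shell
#!/usr/bin/env bash
# Evaluate a bowtie2-style function spec "TYPE,B,A" at read length x:
#   f(x) = B + A * g(x), with g = 0 (C), x (L), sqrt(x) (S) or ln(x) (G).
bt2_func() {
    local spec=$1 x=$2
    awk -v spec="$spec" -v x="$x" 'BEGIN {
        split(spec, p, ",")
        if      (p[1] == "C") g = 0
        else if (p[1] == "L") g = x
        else if (p[1] == "S") g = sqrt(x)
        else if (p[1] == "G") g = log(x)
        else { print "unknown function type: " p[1] > "/dev/stderr"; exit 1 }
        printf "%.2f\n", p[2] + p[3] * g
    }'
}

bt2_func C,-15,0 20     # minimum score used above: -15.00 for any read length
bt2_func S,1,0.50 20    # seed interval of --very-sensitive for a 20-nt anchor: 3.24
```

With C,-15,0 the threshold does not depend on read length, which is why the same setting is used for both the full-length reads and the 20-nt anchors.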

5. Predict circRNAs

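# Activate the Python 2.7 environment once, before the loop.
source activate python27
# The genome given to -G should be the same assembly, with the same sequence names, as the bowtie2 index used in step 4.
# Note: *bt2.sam matches every file ending in bt2.sam in this directory.
for id in *bt2.sam
do
    python find_circ.py "$id" \
        -G /media/luozhixin/本地磁盘/bioinfomatics/genome/human/Homo_sapiens.GRCh38.dna_sm.toplevel.fa \
        -p hsa_ -s "${id%%.*}.find_circ.log" \
        > "${id%%.*}.find_circ.bed" \
        2> "${id%%.*}.find_circ.reads"
done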

Using find_circ