minimap2 简单笔记

帮助：https://lh3.github.io/minimap2/minimap2.html

1、使用建议
对于短reads 比对建议使用bwa ；由于设计算法问题，默认设置上对gap 容忍更高，bwa相对minimap 比对 short reads 准确性更高；

对于基因组之间的比较建议用minimap；长reads 中有较多错误，minimap中可以更好的比对；但是针对ont,pacbio 最好使用预设的参数集选项；

没有笨蛋，只有放错地方的天才；没有不好的软件，只有不会用软件的人

2、有用的参数

--secondary=yes|no
    Whether to output secondary alignments [yes]
#是否输出第二比对结果，默认yes
#用于确定最好比对，不考虑低分的结果
#对于[基因组比对]，第二比对，一般比对打分都很低，而且是片段上一小部分 比对到了其他地方


-c  Generate CIGAR. In PAF, the CIGAR is written to the ‘cg’ custom tag.
--cs[=STR]
    Output the cs tag. STR can be either short or long. If no STR is given, short is assumed. [none]
#输出 cs tag [cs 标签]
#如果给了 -c 参数，默认输出短形式的cs tag，这种大多数啊软件都支持识别；但是 long cs tag 很多软件不识别，一般long cs tag使用与三代reads 的比对；不够由于识别软件还不多，这里默认是不会输出 cs tags的


都用于call 变异，建议加上下列参数
-a 生成sam 格式，有CIGAR
--MD  输出 MD tag，与sam 保持格式一致；Sniffles 需要 有-a --MD 参数的minimap2比对
-L  CIGAR with >65535 operators at the CG tag；保证旧软件兼容，因为就软件不会去多CG tag 区域

3、默认输出paf 文件；格式了解下；分析马上安排上

Col Type    Description
1   string  Query sequence name
2   int Query sequence length
3   int Query start coordinate (0-based)
4   int Query end coordinate (0-based)
5   char    ‘+’ if query/target on the same strand; ‘-’ if opposite
6   string  Target sequence name
7   int Target sequence length
8   int Target start coordinate on the original strand
9   int Target end coordinate on the original strand
10  int Number of matching bases in the mapping
11  int Number bases, including gaps, in the mapping
12  int Mapping quality (0-255 with 255 for missing)

格式上开始12列相对好理解
【比对序列，长度，比对坐标[stat ,end ]】，方向是否一致，【目标序列，长度，比对坐标[stat ,end ]】，准确碱基数量，粗略比对上的碱基数量，比对质量[0-255]

12列以后会有几列有趣的详细信息

Tag Type    Description
tp  A   Type of aln: P/primary, S/secondary and I,i/inversion
cm  i   Number of minimizers on the chain
s1  i   Chaining score
s2  i   Chaining score of the best secondary chain
NM  i   Total number of mismatches and gaps in the alignment
MD  Z   To generate the ref sequence in the alignment
AS  i   DP alignment score
ms  i   DP score of the max scoring segment in the alignment
nn  i   Number of ambiguous bases in the alignment
ts  A   Transcript strand (splice mode only)
cg  Z   CIGAR string (only in PAF)
cs  Z   Difference string
dv  f   Approximate per-base sequence divergence
de  f   Gap-compressed per-base sequence divergence
rl  i   Length of query regions harboring repetitive seeds

一个例子慢慢看

tp:A:P  cm:i:38442      s1:i:398614     s2:i:1347       dv:f:0.0039     rl:i:4767142

# tp
#当前行，是否是第一还是第二比对，
#P/primary, S/secondary and I,i/inversion
#inversion倒置的还没见过
tp:A:P
tp:A:S
tp:A:i


cg:Z:83M1D232M7I1M6D552M1D6M1I550M1D2M1D140M    
“M”表示 match或 mismatch；
“I”表示 insert；
“D”表示 deletion；
“N”表示 skipped（跳过这段区域）；
“S”表示 soft clipping（被剪切的序列存在于序列中）；
“H”表示 hard clipping（被剪切的序列不存在于序列中）；
“P”表示 padding；
# 使用--eqx  参数后“M” 被“=”与“X” 替代，修改版的pbmm2 固定使用--eqx 
“=”表示 match；
“X”表示 mismatch（错配，位置是一一对应的，但是碱基不同）；

#The hard clipping operation H indicates that the clipped sequence is not present in the sequence field.

以下参考自：[https://www.cnblogs.com/leezx/p/6105646.html](https://www.cnblogs.com/leezx/p/6105646.html)

clipped alignment因为着在比对过程中，并没有用到全部的read的序列，read两段的序列被截取了（clip or trim）。如下表示，即为clip alignment。

Alignment:
Read:          ACGGTTGCGTTAA-TCCGCCACG
                               |                           ||||||||| ||||||
Reference: TAACTTGCGTTAAATCCGCCTGG

与clipped alignment对应的是spliced alignment，即read的中间没有比对到而两段比对上了。对应的表示如下：

Alignment:
Read:          ACGGTTGCGTTAAGCTCATCCGCCACG
|                 |||||||||||||         |||||||||
Reference: ACGGTTGCGTTAA…..TCCGCCACG

clip alignment对应的CIGAR表示有两种S （soft clip） 和H （hard clip）

注意cigar 在不同软件中的兼容性；可用软件jvarkit下biostar234081 转换

Chaining score 链打分，reads拆分为 minizimers，多个连续比对的minizmier称为anchors, colinear anchors 作为 chains，之后根据chains 作为分析比较打分的对象；对照ngmlr ,Reads 根据比对中大的indel拆分为 subsegment ,colinear subsegment alignments 作为long segments ，以long segments作为比对打分的对象。

如三代reads 比对，总会有有一些indel 导致reads 不能很好比对而分为多个部分比对到参考基因组；而这多个部分构成的打分则是Chaining score
cigar 值中就可以看到频繁的在M之间插入了其它比对状态

cs 标签和sam 格式中的类似，比较正要，sam 格式输出中 short 短形式可被旧软件识别， long cs tag 长标签很多软件还不能完全支持

The cs tag encodes difference sequences in the short form or the entire query AND reference sequences in the long form. It consists of a series of operations:

Op  Regex   Description
=   [ACGTN]+    Identical sequence (long form)
:   [0-9]+  Identical sequence length
*   [acgtn][acgtn]  Substitution: ref to query
+   [acgtn]+    Insertion to the reference
-   [acgtn]+    Deletion from the reference
~   [acgtn]{2}[0-9]+[acgtn]{2}  Intron length and splice signal



上个例子
cs:Z::48*ag:2*ag:21*tc*tc:8-t:2*ag:5*tc:2*ag:4*ct*ac*ct:2*tc*ct:2*ag:4*ag:17*ga:2*ag*tc:9*ag:5*tc:23*ag:11*ta:1*ag*ct:21*tc:10*tc:21*tc:30*ag:7*ct:6*ag*ga*ct:7*ag:13+ttctttt:1-cacccc:19*ct:1*ct:12*ag:6*ct*ca:2*tc:1*ga:1*ct:8*ga:19*ct:24*ag:42*ct:1*ct:134*tc:47*tc:2*ag:13*tc:1*ct:4*ga:30*ag*ga:28*ct*ct:20*tc:7*tc:1*ag:89*ct:9*ct:3-a:1*ag:4+t:1*tc*tc:13*tc:6*tc:12*ag:19*tc:14*ag:16*tg:21*ag*ag:7*tc:1*ag:11*ct:1*tc:3*tc:2*ag:2*gc:11*ac:16*ca*tc:5*tc:8*ga*ac:5*tc:2*ga:10*tc:7*tc:21*ga*ac:2*ct:1*tc:11*tc:1*tc:7*ct:17*ta*tc:1*tc:9*tc:2*tc:19*ag:6*ct:22*tc:23*tc:27*tc:1*at*ga:52*ag:49*tc:25*ga:1*ct*ct:9-t:2-a:8*tc:10*ct:2*tc:5*ct:18*ga:19*ga*ta:1*ga:13*tc*ta*tc:12*ag:4*tc:11*ct:23

参考：
# Difference between Hard Clip（H） and Soft Clip（S） in Samtools CIGAR string

https://lh3.github.io/minimap2/minimap2.html
sam/bam格式：源于别人的最全的