SAM/BAM文件中的MD标签

我用bowtie比对了序列，想查看reads的错配情况。SAM flag中的M包含了比对上的碱基和错位的碱基，不能区分错配。

参考bowtie的文档，可以看到XM的标签可以指示mismatch的个数，MD标签可以查看具体的错配情况。


`XN:i:<N>`	The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.
`XM:i:<N>`	The number of mismatches in the alignment. Only present if SAM record is for an aligned read.
`MD:Z:<S>`	A string representation of the mismatched reference bases in the alignment. See SAM Tags format specification for details. Only present if SAM record is for an aligned read.

SAM手册对与MD的介绍

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been mapped. This can enable variant calling without requiring access to the entire original reference.
编码错配或del的字符串，与CIGAR和SEQ一起使用，重建read的比对情况。可以在不需要整个参考序列的情况下，用于突变检测。

The MD string consists of the following items, concatenated without additional delimiter characters:
MD字符串由以下不含分隔符的条目组成。

• [0-9]+, indicating a run of reference bases that are identical to the corresponding SEQ bases;表示与相应SEQ碱基相同的一系列参考碱基；这里[0-9]+是正则表示。
• [A-Z], identifying a single reference base that differs from the SEQ base aligned at that position;在这个位置SEQ碱基与参考碱基不一致
• \^[A-Z]+, identifying a run of reference bases that have been deleted in the alignment.以^开头，表明有个del，注意del前有^。

总结：数字表示匹配，碱基表示错配，^碱基表示del。

As shown in the complete regular expression above, numbers alternate with the other items. Thus if two mismatches or deletions are adjacent without a run of identical bases between them, a ‘0’ (indicating a 0-length run) must be used to separate them in the MD string. 数字和字符交替出现，如果两个连续错配，中间需要用0来隔开。

Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations) are not represented in the MD string. When reconstructing the reference sequence, inserted and soft-clipped SEQ bases are omitted as determined by tracking ‘I’ and ‘S’ operations in the CIGAR string. (If the CIGAR string contains ‘N’ operations, then the corresponding skipped parts of the reference sequence cannot be reconstructed.)
Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations)不体现在MD字符串中，所以要与CIGAR结合。

For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches.
10A5^AC6表示10个匹配（10），一个与A不匹配（A），2bp的del（^AC），6碱基的匹配。

但需要注意的MD不包含insertion，含有insertion的字符串表示更复杂，需要与CIGAR联合看。

Example 1：包含插入

read seq length: 101 CIGAR: 89M1I11M MD: 100

MD为100，但测序长度为101，第89号后有一个插入（1I）。

Example 2：

read length: 101 CIGAR: 9M1I91M MD: 48T42G8

CIGAR表明：9个M匹配之后1个insertion然后91个M，M可能是匹配或者错配。结合MD，可以发现，

先匹配MD里面的48

1，MD48的前9个完全匹配；

2，1个insertion；

3，然后48-9=39个完全匹配（属于CIGAR里面的91M的前半部分，此时MD里面的48匹配完）；

91M已经被MD 48里面的39匹配，还剩52M，其中

4，然后是1个和T的错配；

5，然后是42完全匹配；

6，然后是1个和G不匹配；

7，然后是91M里面剩下的91-39-1-42-1=8个完全匹配。

一共是9+1+39+1+42+1+8=101个碱基

参考

https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

https://vincebuffalo.com/notes/2014/01/17/md-tags-in-bam-files.html

https://samtools.github.io/hts-specs/SAMtags.pdf

####################################################################

#Author: Jason

#####################################################################

文章目录

参考