Describe MM and MP methylation tags.
jkbonfield committed Jun 10, 2019
1 parent 267de28 commit a6cb469
Showing 1 changed file with 86 additions and 0 deletions.
86 changes: 86 additions & 0 deletions SAMtags.tex
@@ -90,6 +90,8 @@ \section{Standard tags}
{\tt MD} & Z & String for mismatching positions \\
{\tt MF} & ? & Reserved for backwards compatibility reasons \\
{\tt MI} & Z & Molecular identifier; a string that uniquely identifies the molecule from which the record was derived \\
{\tt MM} & Z & Base modifications / methylation \\
{\tt MP} & Z & Base modification qualities \\
{\tt MQ} & i & Mapping quality of the mate/next segment \\
{\tt NH} & i & Number of reported alignments that contain the query in the current record \\
{\tt NM} & i & Edit distance to the reference \\
@@ -453,6 +455,90 @@ \subsubsection{Color space}
Color read quality on the original strand of the read. Same encoding as {\sf QUAL}; same length as {\tt CS}.
\end{description}

\subsection{Base modifications}

Base modifications, including base methylation, are represented as a
series of edits from the primary unmodified sequence stored in the
main SAM {\sf SEQ} field. If the modified base has no natural
unmodified form then this should be stored as ``N''.

Each modified base listed also has an associated quality value.
Given that the unmodified base call already has a Phred quality, this
base modification quality should be interpreted as the likelihood of
the stated modification being correct, as opposed to the base being
unmodified.

\begin{description}
\item[MM:Z:\tagregex{([ACGTN][-+]([a-z]|[0-9]+)(,[0-9]+)+;)*}]
\hfill\\
The first character is the unmodified base as seen in the {\sf SEQ}
field, one of {\tt A}, {\tt C}, {\tt G}, {\tt T} or {\tt N}, with
the exception that {\tt N} matches any base rather than
strictly {\tt N}. This is followed by either plus or minus,
indicating the strand the modification was observed on (relative to
the recorded strand of {\sf SEQ}, with plus meaning the same
orientation), and a base modification symbol. This is then followed
by a comma-separated list of counts, each giving how many {\sf SEQ}
bases of the stated base type to skip before the next modified base.
Each count is a delta from the previous modified base, so 0 means
that the first (or immediately following) base of that type is
modified. Hence this number series is comparable to the numbers in
an {\tt MD} tag.

For example, {\tt C+m,5,12,0;} tells us there are three 5-Methylcytosine
bases in the original {\sf SEQ}. The first 5 {\tt C} bases are
unmodified and the 6th is modified, as are the 19th (12 unmodified
{\tt C} bases lie in between the 6th and 19th) and the 20th. Similarly,
{\tt G-m,14;} indicates that the 15th {\tt G} corresponds to a
5-Methylcytosine on the opposite strand.

This permits modifications to be listed on either strand with the rare
potential for both strands to have a modification at the same site.

If the modification is not one of the standard common types (listed
below) it can be specified as a numeric ChEBI code. For example
{\tt C+76792,57;} is the same as {\tt C+h,57;}.

An unmodified base of {\tt N} means count any base in {\sf SEQ}, not
only those of {\tt N}. Thus {\tt N+n,100;} means the 101st base is
Xanthosine (n), irrespective of the sequence composition.

The standard code types and their associated ChEBI values are listed
below, taken from \emph{Modeling methyl-sensitive transcription factor
motifs with an expanded epigenetic alphabet}, Coby Viner
et al., \url{https://www.biorxiv.org/content/10.1101/043794v1}.

\begin{center}
\begin{tabular}{lllll}
{\bf Unmodified base} & {\bf Code} & {\bf Abbreviation} & {\bf Name} & {\bf ChEBI} \\
\hline
C & m & 5mC & 5-Methylcytosine & 27551 \\
C & h & 5hmC & 5-Hydroxymethylcytosine & 76792 \\
C & f & 5fC & 5-Formylcytosine & 76794 \\
C & c & 5caC & 5-Carboxylcytosine & 76793 \\
\hline
T & g & 5hmU & 5-Hydroxymethyluracil & 16964 \\
T & e & 5fU & 5-Formyluracil & 80961 \\
T & b & 5caU & 5-Carboxyluracil & ? \\
\hline
A & a & 6mA & 6-Methyladenine & 28871 \\
\hline
G & o & 8oxoG & 8-Oxoguanine & 44605 \\
\hline
N & n & Xao & Xanthosine & 18107 \\
\end{tabular}
\end{center}

\item[MP:Z:\tagvalue{qualities}]
\hfill\\
The {\tt MP} tag, if present, lists the Phred qualities of each
modification listed in the {\tt MM} tag. Its length should match the
total number of position deltas across all entries in {\tt MM}, in the
same order. The qualities are encoded in the same manner as the
primary {\sf QUAL} field: one byte per quality, with ASCII value Phred
score + 33. No separators should be present.

For example {\tt MM:Z:C+m,5,12,3;C+h,57;} may have an associated
quality tag of {\tt MP:Z:5EB/}.


\end{description}
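
As an illustration of the delta encoding described above, the following
Python sketch expands an {\tt MM} value into 0-based {\sf SEQ}
coordinates and pairs each call with its {\tt MP} quality. It is not
part of the specification: the function name {\tt parse\_mm} and the
tuple layout are hypothetical, the regular expression assumes that
numeric ChEBI codes are accepted as described in the text, and the strand
symbol is reported as-is without complementing the base.

```python
import re

# One MM entry: base, strand, code (letter or numeric ChEBI), delta list.
# Assumption: numeric ChEBI codes are allowed in place of the letter code.
MM_ENTRY = re.compile(r"([ACGTN])([-+])([a-z]|[0-9]+)((?:,[0-9]+)+);")


def parse_mm(seq, mm, mp=None):
    """Hypothetical helper: expand MM (and optional MP) into a list of
    (seq_index, base, strand, code, phred) tuples, seq_index 0-based."""
    calls = []
    pos = 0
    while pos < len(mm):
        m = MM_ENTRY.match(mm, pos)
        if m is None:
            raise ValueError("malformed MM entry at offset %d" % pos)
        base, strand, code, deltas = m.groups()
        pos = m.end()
        # Indices of SEQ bases that count for this entry; N counts any base.
        candidates = [i for i, b in enumerate(seq) if base == "N" or b == base]
        idx = -1
        for delta in deltas[1:].split(","):
            # Skip `delta` bases of this type, then take the next one.
            idx += int(delta) + 1
            if idx >= len(candidates):
                raise ValueError("MM refers past the end of SEQ")
            calls.append((candidates[idx], base, strand, code))
    if mp is None:
        return [call + (None,) for call in calls]
    if len(mp) != len(calls):
        raise ValueError("MP length does not match number of MM deltas")
    # MP uses the same encoding as QUAL: ASCII value = Phred + 33.
    return [call + (ord(q) - 33,) for call, q in zip(calls, mp)]
```

For instance, calling {\tt parse\_mm} on a {\sf SEQ} with at least 20
{\tt C} bases and {\tt C+m,5,12,0;} yields the positions of the 6th,
19th and 20th {\tt C}, matching the worked example in the {\tt MM}
description.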

\section{Locally-defined tags}

You can freely add new tags.