Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend Mm tag to allow callers to indicate unknown positions #592

Closed
wants to merge 4 commits into from

Conversation

jts
Copy link
Contributor

@jts jts commented Aug 25, 2021

This draft PR addresses the issue with the Mm tag raised in #582. It adds an optional . or ? to change how downstream tools should interpret the skipped bases. When ? is present, nothing should be assumed about the modification status of the skipped bases. This proposal is slightly different than what we originally discussed in #582:

  • the flag is now optional as suggested by @jkbonfield
  • the flag now comes after the modification code, as I found this to be nicer when optional
  • I added an example of how ? should be interpreted
  • I added a footnote to guide interpretation of the (lack of) modification probabilities for skipped bases in . mode

Also, I reworded some existing text to reflect the uncertainty present in either mode. In some places the existing text implied that any bases not skipped are modified, but this needs to be determined by looking at the Ml tag to see if the probability is > 0.5. So I added words like "potentially modified" in a number of places.

@hts-specs-bot
Copy link

Changed PDFs as of 59b785e: SAMtags (diff).

@jts
Copy link
Contributor Author

jts commented Sep 27, 2021

Hi @jkbonfield, could I gently bump this? I'd like to release a new version of nanopolish with modbam support, and would like to see this merged in (if acceptable) before I make that release.

@jkbonfield
Copy link
Contributor

My apologies Jared. I've been concentrating my efforts elsewhere and this has slipped. I'll take a look.

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with the principle, but there a number of errors in the descriptions to fix.

Following the base modification codes is an optional {\tt .} or {\tt ?} describing how skipped seq bases of the stated base type should be interpreted by downstream tools.
When this flag is {\tt ?} there is no information about the modification status of the skipped bases provided.
When this flag is not present, or it is {\tt .}, these bases should be assumed to have low probability of modification\footnote{The decision whether a base is assumed to be unmodified or has a probability explicitly provided is up to the modification calling program. Some programs will elide calls with modification probabilites below a threshold to provide a more compact modification tag.}.
This is then followed by a comma separated list of how many seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, starting from the uncomplemented 5' end of the {\sf SEQ} field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the wording of the [.?]? optional code. Thanks.

Also good catch on the "unmodified seq bases" to "seq bases" change. If we have multiple modification strings, we're counting the ones not listed as potentially being modified by this code, note all codes, plus obviously we don't actually know with certainty which really are modified.

SAMtags.tex Outdated
\hfill\\
The first character is the unmodified ``fundamental'' base as reported
by the sequencing instrument for the top strand.
It must be one of {\tt A}, {\tt C}, {\tt G}, {\tt T}, {\tt U} (if RNA) or {\tt N} for anything else, including any IUPAC ambiguity codes in the reported SEQ field.
Note {\tt N} may be used to match any base rather than specifically an {\tt N} call by the sequencing instrument.
This may be used in situations where the base modification is not a derivation of a standard base type.
This is followed by either plus or minus indicating the strand the modification was observed on (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt Mm} and {\tt Ml} tags.} and one or more base modification codes.
This is then followed by a comma separated list of how many unmodified seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, starting from the uncomplemented 5' end of the {\sf SEQ} field.
This is followed by either plus or minus indicating the strand the modification was analysed for (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt Mm} and {\tt Ml} tags.} and one or more base modification codes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like the "observed on" to "analysed for" change. My instinct is we're recording data observations, and how that observation was derived is neither here nor there. This isn't a strong feeling though, so if you are particularly focused on wanting "analysed for" then I could be persuaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert this back to "observed on". I changed this as "the modification was observed on" felt too definitive to me but I think it is clear enough from that context that it is OK

SAMtags.tex Outdated
The first 5 {\tt C} bases are unmodified and the 6th, 19th and 20th have modification status indicated by the corresponding probabilities in the {\tt Ml} tag. The 12 cytosines between the 6th and 19th cytosine are unmodified. Modification probabilities for the 17 skipped cytosines are not provided.

When the {\tt ?} flag is present the tag {\tt C+.m?,5,12,0;} tells us the modification status of the first five
cytosine bases is unknown, the sixth cytosine is called, followed by 12 more unknown cytosines, and the 19th and 20th are called.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One slight wording change here maybe is that with C+m? when we say "the sixth cytosine is called" we mean "the status of sixth cytosince has been recorded" (which may or may not be called as a likely base modification). I get what you mean by "called", but it may be misinterpreted by others as evidence of a positive modification event. Or perhaps keep the word called but with an addition: eg "is called (as either modified or unmodified)"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll take your suggestion to add "(as either modified or unmodified)"

SAMtags.tex Outdated
potential 5-Methylcytosine bases on the top strand of {\sf SEQ}.
The first 5 {\tt C} bases are unmodified and the 6th, 19th and 20th have modification status indicated by the corresponding probabilities in the {\tt Ml} tag. The 12 cytosines between the 6th and 19th cytosine are unmodified. Modification probabilities for the 17 skipped cytosines are not provided.

When the {\tt ?} flag is present the tag {\tt C+.m?,5,12,0;} tells us the modification status of the first five
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C+.m?,5,12,0; has both . and ? present, in two different locations. I don't understand this syntax. Did you mean C+?m,5,12,0;?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the . were leftover from my first pass, which had a mandatory flag. I'll fix these to be C+m?.

SAMtags.tex Outdated
Given the unmodified base already has a phred likelihood, this base modification quality should be interpreted as the likelihood of this modification being correct given an assumption the original call is correct.

\begin{description}
\item[Mm:Z:\tagregex{([ACGTUN][-+]([a-z]+|[0-9]+)(,[0-9]+)*;)*}]
\item[Mm:Z:\tagregex{([ACGTUN][-+]([a-z]+|[0-9]+)(,[0-9]+)*[.?]?;)*}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regexp is wrong.

Breaking it down, the original was canonical_base [ACGTUN], strand [+-], a list of modified base types or CHEBI ID ([a-z]+|[0-9]+), followed by a comma-separated series of deltas: (,[0-9]+)* ending in a semicolon.

As far as I understand the intention, the . or ? appear between the strand and base type list, so we'd want:

([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. My intention is for the flag to appear after the base modification code - which I think is what your proposed regex has? My regex is quite rusty...

SAMtags.tex Outdated
@@ -530,10 +537,10 @@ \subsection{Base modifications}
When multiple modifications are listed, for example {\tt C+mh,5,12,0;}, it indicates the modification may be any of the stated bases.
The associated confidence values in the {\tt Ml} tag may be used to determine the relative likelihoods between the options.
The example above is equivalent to {\tt C+m,5,12,0;C+h,5,12,0;}, although this will have a different ordering of confidence values in {\tt Ml}.
Note ChEBI codes cannot be used in the multi-modification form (such as the {\tt C+mh} example above).
Note ChEBI codes cannot be used in the multi-modification form (such as the {\tt C+.mh} example above).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a C+.mh example above? We have (if fixed) a C+?mh one though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this to C+mh, the . shouldn't have been there.

@jts
Copy link
Contributor Author

jts commented Sep 29, 2021

Thanks James! I'm teaching a workshop today but will address these soon.

@jkbonfield
Copy link
Contributor

Thanks James! I'm teaching a workshop today but will address these soon.

No problem. I'm in no position to comment on slow responses. :)

@hts-specs-bot
Copy link

Changed PDFs as of 6854fe0: SAMtags (diff).

@jkbonfield
Copy link
Contributor

Thanks Jared. I'm happy with these amendments now.

As the SAM spec co-maintainer, I'll ping @jmarshall incases he wishes to comment too. However consider the general gist of it as accepted as we've previously discussed it in our meetings and the general consensus is it's the right move. If we change anything it's likely to be minor wording.

@jts
Copy link
Contributor Author

jts commented Oct 12, 2021

Great, thanks James.

@jkbonfield jkbonfield added the sam label Oct 15, 2021
Copy link
Member

@jmarshall jmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the proposed notation, and pleased that the existing syntax without any .? flag character remains accepted.

I've pointed out a couple of formatting issues. There are a few more, but as some of those are already present in the existing text I'm happy to merge this as is (ideally with the suggested formatting fixes) and address those other concerns myself separately.

This is then followed by a comma separated list of how many unmodified seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, starting from the uncomplemented 5' end of the {\sf SEQ} field.
Following the base modification codes is an optional {\tt .} or {\tt ?} describing how skipped seq bases of the stated base type should be interpreted by downstream tools.
When this flag is {\tt ?} there is no information about the modification status of the skipped bases provided.
When this flag is not present, or it is {\tt .}, these bases should be assumed to have low probability of modification\footnote{The decision whether a base is assumed to be unmodified or has a probability explicitly provided is up to the modification calling program. Some programs will elide calls with modification probabilites below a threshold to provide a more compact modification tag.}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention in this document is for footnotes to go after the punctuation.

\hfill\\
The first character is the unmodified ``fundamental'' base as reported
by the sequencing instrument for the top strand.
It must be one of {\tt A}, {\tt C}, {\tt G}, {\tt T}, {\tt U} (if RNA) or {\tt N} for anything else, including any IUPAC ambiguity codes in the reported SEQ field.
Note {\tt N} may be used to match any base rather than specifically an {\tt N} call by the sequencing instrument.
This may be used in situations where the base modification is not a derivation of a standard base type.
This is followed by either plus or minus indicating the strand the modification was observed on (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt Mm} and {\tt Ml} tags.} and one or more base modification codes.
This is then followed by a comma separated list of how many unmodified seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, starting from the uncomplemented 5' end of the {\sf SEQ} field.
Following the base modification codes is an optional {\tt .} or {\tt ?} describing how skipped seq bases of the stated base type should be interpreted by downstream tools.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quote marks around the literal characters, as so (and similarly in nearby lines):

Suggested change
Following the base modification codes is an optional {\tt .} or {\tt ?} describing how skipped seq bases of the stated base type should be interpreted by downstream tools.
Following the base modification codes is an optional `{\tt .}' or `{\tt ?}' describing how skipped seq bases of the stated base type should be interpreted by downstream tools.

@jkbonfield
Copy link
Contributor

Thanks for the review John.

I'm happy with the suggested formatting changes, so if you wish to edit and merge yourself please do so. Or if you want me to make the changes let me know and I'll do so (and merge).

@jkbonfield
Copy link
Contributor

We did also say we were going to make them official, so mM to MM etc. We gave warning on this (#418 (comment) - long overdue), and this (welcomed) tweak was one reason for delay. I think once this is merged we should also make the switch to official name space too, in its own commit.

@jkbonfield
Copy link
Contributor

Rebased, squashed and merged as 71241fe, with John's footnote fix and quoting tidyups (and a few more nearby).

@jkbonfield jkbonfield closed this Dec 9, 2021
@jmarshall jmarshall added the Merged For marking PRs that have been merged outwith GitHub and are shown as merely “Closed” label Dec 13, 2021
@jmarshall
Copy link
Member

Pushed 6835cd3 with a title page recording a commit that exists on master. PRs don't need to include updated PDFs (this will be done separately when they are merged), and draft PDFs containing commit hashes that exist only pre-rebasing should be undone during merging… 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged For marking PRs that have been merged outwith GitHub and are shown as merely “Closed” sam sam-tags
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants