Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Describe MM and MP methylation tags. #418

Merged
merged 1 commit into from
Jul 14, 2021

Conversation

jkbonfield
Copy link
Contributor

This provides a couple tags for encoding base modifications as a side-channel on top of the unmodified canonical sequence. It's a slightly updated implementation of the ideas presented in #362

The tag names are somewhat arbitrarily chosen tag names; feel free to suggest improvements, but chosen "M" for modification and M/P as both are free and adjacent in the sorted list.

I'm still in two minds as to whether this method is the best place to be. We could use a simpler form which is more MD style where modifications are recorded as positional deltas to any base rather than the appropriate canonical base. This is larger, but simpler to work with. How much value do we put in the storage cost vs complexity?

A pictorial example on a tiny sequence:

CCAGCTATAGGCCCCCC (canonical)
mCAomTATAGomCmCCm (modified)
m...m......m.m..m
...o......o......

C+m,0,3,6,1,2;G+o,3,6;  (count all bases)

CCAGCTATAGGCCCCCC (canonical)
mCAomTATAGomCmCCm (modified)
m.  m      m.m..m
   o     .o

C+m,0,1,0,1,2;G+o,0,1;  (count canonical bases only)

If we call the method in this PR ConDelta and the simpler method AllDelta then for two levels of simulated modification on long read data (3% C->m vs 30% C->m with other C/G modification types adjusted in scale too), then our sizes are as follows:

          Size (3% C->m)               Size (30% C->m)
          BAM          CRAM            BAM            CRAM
No mods   1327060      815861          1327060        815861
ConDelta  +25049/1.9%  +9482 /1.2%     +92938 /7.0%   +39775/4.9%
AllDelta  +30912/2.3%  +12386/1.5%     +134178/10.1%  +66382/8.1%

The delta sizes here are shown as +extra amount. Ie ConDelta for BAM is 1327060+25049 bytes long. (This is obviously a small test set just for experimentation.)

The /1.9% bit in +25049/1.9% is showing what percentage of the original file that represents - ie 25049/1327060 is ~0.019.

Thus by comparing AllDelta vs ConDelta total file sizes, where see the simpler method will use in total between 0.4% (3% C meth) to 3.0% (30% C meth) extra storage. Quite small overall, despite the actual chunk of data being used for methylation signal itself being considerably larger.

NB: the tests above were all done without the MP modification quality value tag, but that doesn't change size and would have minimal impact on total file sizes.

So, discuss...

@hts-specs-bot
Copy link

Changed PDFs as of a6cb469: SAMtags (diff).

@jts
Copy link
Contributor

jts commented Jun 13, 2019

I think this is a good start and I like the proposal. I'm not sure whether I favor AllDelta or ConDelta yet.

Something I'm unclear about, and I think is worth discussing at this point, is how the upstream tools generate these tags. Right now there are two approaches to modification calling - directly by the basecaller and by a post-processing tool like nanopolish. For nanopolish the workflow would be pretty simple - it could take in a BAM with canonical bases and augment the records to contain these tags.

For basecalling it is less clear, as the basecaller needs to represent the modified bases directly in its output. Should it output FASTQ with DNAMod characters, or write out these mod strings in some auxiliary file? Perhaps for modification analysis the basecaller should write out unaligned SAM/BAM with canonical bases and these mod strings.

@gringer
Copy link

gringer commented Jun 13, 2019

When implementing a method, it can help to think about known extremes, and a good example of that is RNA modifications (over 110 known). Would those be treated as separate, distinct changes (e.g [5m] vs [6m] vs [6,6m]), or multi-layered / hierarchical modifications (e.g. [m;5] vs [m;6] vs [m;6,6])?

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 14, 2019

Actually I realise I forgot to add any form of ambiguity codes, because these weren't in the paper (Coby Vine et al.). However I think they're probably useful. Eg "we know this base is methylated, but cannot tell which form" is a realistic output for some technologies. Being forced to state all modified C are 5mC is incorrect.

Would just 1 ambiguity code per canonical base type be sufficient?

Edit: perhaps for a later update, but would it be useful to base-callers outputting unaligned SAM to have the ability to specify (maybe in an RG header tag?) a list of methylation codes they can possibly emit - eg what they've been trained on? There's no way to do such global meta-data in FASTQ though, but that's why it should die :-)

@jkbonfield
Copy link
Contributor Author

Quoting @jts:

For basecalling it is less clear, as the basecaller needs to represent the modified bases directly in its output. Should it output FASTQ with DNAMod characters, or write out these mod strings in some auxiliary file? Perhaps for modification analysis the basecaller should write out unaligned SAM/BAM with canonical bases and these mod strings.

It's clear we'll need separate PRs for SAM, VCF, and hmm - FASTQ isn't represented here. (Should it be? We discussed and rejected it in the past but I'm not entirely sure that's helpful.)

My own personal view is FASTQ should be auxiliary tags after the @name part of the header, as already supported by tools such as bwa. This means the sequence isn't modified and we don't need custom aligners to read the sequence. It also solves the problem of limited alphabet. However that's a separate issue and is best to continue such discussion in issue #362 rather than the SAM-tags specific PR.

Quoting @gringer

When implementing a method, it can help to think about known extremes, and a good example of that is RNA modifications (over 110 known). Would those be treated as separate, distinct changes (e.g [5m] vs [6m] vs [6,6m]), or multi-layered / hierarchical modifications (e.g. [m;5] vs [m;6] vs [m;6,6])?

Thanks for the discussion.

I think multi-layered / hierarchical is going to get messy and it may not be clear for all possible modifications out there, so it wouldn't be my personal preference. We did think about the extremes, which is what revealed to us that the letters in the sequence just isn't going to cut it. (We'd run out of symbols unless we go Unicode, and dear god let's not go down that route!) Hence the approach to permitting ChEBI IDs instead. It's also what lead us to wanting to not replicate a long ChEBI value over and over again, hence putting it up-front once and listing the coordinates it refers to.

@jkbonfield
Copy link
Contributor Author

Given I haven't yet included ambiguity codes in the SAM PR (#418), and they're not well described in the source document either, I'm thinking of what symbols to use.

https://dnamod.hoffmanlab.org/
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0349-4
https://www.biorxiv.org/content/10.1101/043794v1

The biorxiv link does have z for unknown cytosine modification and w / x / y for certain classes of C mods, but it doesn't seem to cover other base types. Additionally, z means C or modified C. I think the or can be encompassed in the quality score so ideally we'd have z being any of the known modifications (not just m h, f, c, although others are far rarer I assume) but not an unmodified C, and the quality score being the likelihood of it not being modified at all. I'm not familiar enough with the science and instruments to know how useful w, x and y are. Anyone closer to the coal-face here able to elaborate?

Given in SAM we're encoding this as a side channel, we can reuse bases here that exist in the canonical channel. What do people think to using A, C, G and T in MM tag as generic modified ambiguity codes, that don't state explicitly the type of modification? That's not going to work in FASTQ though, but that's not really our bailiwick for this PR.

Other ambiguities are already covered by ChEBI I think. Eg mG and mA in the dnamod plot have associated ChEBI values with links to the specific versions of mG. Again I'm not familiar with the science, so don't know how useful adding single letter codes would be. My gut feeling is keep it simple for now and we can extend if we need to later on.

@hts-specs-bot
Copy link

Changed PDFs as of 873b592: SAMtags (diff).

@hts-specs-bot
Copy link

Changed PDFs as of 6b06097: SAMtags (diff).

@timp0
Copy link

timp0 commented Jun 26, 2019

There's another ambiguity not mentioned above - bases that can't be confidently called as either modified or unmodified. This happens sometimes currently when looking at 5mC - we have three possible calls - either unmodified, modified, or unsure. The above proposes looking at unmodified, known modifications and unknown modifications (but definitely modified).

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 26, 2019

Isn't that covered by the associated phred score? Eg if you think it's 70% likely then call 5mC and set a qual of 5 (-10 * log10 (.3)).

Similarly if you think it's 20% 5hmC, 50% 5mC and 30% unmodified then use "C" (modified but not specifying which) and also keep the qual at 5 indicating 70% likelihood of the base being one of the known modifications. There isn't a way though to indicate how much more likely it is with one confidence than another, unless we explicitly make multiple calls at the same location (which is possible, but I'm not sure we should permit it.)

@kpalin
Copy link

kpalin commented Oct 19, 2019

I have a technical issue with the strand specification in the tag. The strand is currently defined with "either plus or minus indicating the strand the modification was observed on (relative to the recorded strand of SEQ with plus meaning same orientation),". The thing is that SEQ tends to get reverse complemented after mapping and the coordinates get messed up unless the downstream tools know how to reverse the MM/MP tags (which they don't).

Would it be better if the strand would be defined in the original strand of SEQ? (I checked the SAM specification and "recorded" SEQ does mean the possibly reverse complemented one.)

EDIT: In relation to the tag syntax is defined as MM:Z:([ACGTN][-+][a-z](,[0-9]+)+;)*

  • How to interpret empty tag MM:Z: which carries no information? (I guess it's correct to ignore in reading and avoid in writing.)
  • Asking more of recommendation: How to encode information that "Yes, we tried to find 5mC modifications in this read but couldn't find any"

SAMtags.tex Outdated Show resolved Hide resolved
SAMtags.tex Outdated Show resolved Hide resolved
SAMtags.tex Outdated Show resolved Hide resolved
SAMtags.tex Outdated
the same manner as the primary {\sf QUAL} field; one byte per quality
with ASCII value Phred score + 33. No separators should be present.

For example {\tt MM:Z:C+m,5,12,3;C+h,57;} may have an associated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better to separate (e.g. with a semi-colon) the qualities that pertain to different modifications? in the example, the qualities would thus become 5EB;/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(perhaps with a ~ which has been used in the past to separate lists of qualities)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have precedence elsewhere then it is good to follow convention.

Reviewing what we have already:

  • OX/BZ. OX is hyphen-concatenated sequence. BZ is defined to be the same length and BZ uses space to separate them (blergh, but non quality value hence chosen).
  • RX/QX. As above
  • CR/CY. As above
  • BC/QT. As above

I've tried hard to expunge that period from my mind, but it seems we got to a consensus somewhere along the line. :-)

So possibly we should do the same here? It's not quite analogous as we have a comma separated list instead of a sequence, but the same deal applies of choosing an invalid ASCII phred character as a separator. We can't however claim the strings need to be equal length of course.

Thoughts anyone? Space at least shows up really well when reading it.

@yfarjoun
Copy link
Contributor

I think that @kpalin is onto something. perhaps the modifications should always be written relative to the original (non-rev comped) strand? that will mean that aligners will not have to know how to flip MM and MP, and there will be no doubt about what strand the annotation pertains to.

In addition, we should mention whether all the bases need to be accounted for and what it means if there are more bases in MM than are present in SEQ.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Oct 20, 2019

I have a technical issue with the strand specification in the tag. The strand is currently defined with "either plus or minus indicating the strand the modification was observed on (relative to the recorded strand of SEQ with plus meaning same orientation),". The thing is that SEQ tends to get reverse complemented after mapping and the coordinates get messed up unless the downstream tools know how to reverse the MM/MP tags (which they don't).

Would it be better if the strand would be defined in the original strand of SEQ? (I checked the SAM specification and "recorded" SEQ does mean the possibly reverse complemented one.)

Good point. If I recall my intention was simply that the aligners could always emit "+" for single-stranded DNA, which is basically everything(?) at the moment, but allowing room for future developments. However you're right - it's easier for aligners that don't understand methylation data to simply pass this tag through as-is irrespective of whether they've reverse complemented the sequence, thus it needs to relate to the original DNA strand. I'm not sure how best to phrase it hough.

EDIT: In relation to the tag syntax is defined as MM:Z:([ACGTN][-+][a-z](,[0-9]+)+;)*

How to interpret empty tag MM:Z: which carries no information? (I guess it's correct to ignore in reading and avoid in writing.)

That would be logical interpretation, but it probably needs to be explicit.

Asking more of recommendation: How to encode information that "Yes, we tried to find 5mC modifications in this read but couldn't find any"

Well that would ideally be MM:Z:C+m; so maybe that last + in the regexp should be * so we can accurately distinguish between didn't look and looked but didn't find.

Thanks for the comments.

@hts-specs-bot
Copy link

Changed PDFs as of 11d7fb9: SAMtags (diff).

@jkbonfield
Copy link
Contributor Author

I've updated it following comments. Thanks all. A few key points:

  • Coordinate lists are now optional, so we have a mechanism to explicitly say "looked for but did not find modifications".

  • Strand is relative to the original. This was my original intention I think, but poorly thought out. The expectation is that all current methods will be using "+" everywhere as they're detecting single stranded signals only.

  • There is a new example showing explicit ambiguity of two different possible modification types. This is messy and hard to represent!

On that last case, there is room for improvement. Eg consider a modified ONT guppy caller that was trained to distinguish 5mC and 5hmC. We don't really want to have to use two separate lists C+m and C+h with many coordinates duplicated up. Equally so we may not wish to have every coordinate present in both C+m and C+h because we ideally only want to record the calls which have a sufficiently high probability that the phred score is above 1.

This does rather come down to the core of it and whether the representation is correct. Eg I could imagine completely different mechanisms where we list the codes inline instead - more MD style - but with bracketing to handle multiple choices like a regexp character class. Eg MM:Z:10m20m15[mh]40h. The corresponding MP would note 5 possible bases (m m, mh and h) and so contain 5 confidence values. We may still want to keep A, C, G and T "fundamental base" coordinates distinct as it reduces the magnitude of the numbers and hence helps compression, but that too is debatable: it trades compression ratio against ease of processing.

This was one of the ideas discussed in #362, but at the time of writing I hadn't seen any real data in the wild and there was minimal external engagement.

So it's still open for discussion. This PR was just my guess based on lack of any external prodding.

SAMtags.tex Outdated Show resolved Hide resolved
SAMtags.tex Outdated Show resolved Hide resolved
SAMtags.tex Outdated
This permits modifications to be listed on either strand with the rare
potential for both strands to have a modification at the same site.

Note it is permitted for the coordinate list to be empty (for example {\tt MM:Z:C+m;}), which may be used as an explicit indicator that this base modification is not present.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would a technology emit modifications on + and - at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any can right now, but there were (are?) technologies that unwound the double stranded DNA and sequenced one strand followed by the other strand. IIRC PacBio SmartBells did this at one point, or maybe still do, and ONT earlier had their "2D" reads which did this.

The point being the format has to cope with what may come along down the road and this is conceivable, albeit a rather unlikely event to need to encode.

Copy link

@gringer gringer Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ONT has had 1D^2 (essentially from when they stopped 2D), from which the forward and reverse-complement reads from a single double-stranded sequence (as separate reads) can be combined into a single sequence. They also plan to re-introduce fully-connected 2D sequences in the future.

SAMtags.tex Outdated

To represent several possible modifications at the same site the {\tt MP} tag can be used to indicate the probabilities of each possibility.
The values used should be absolute probabilities, not relative between the alternatives.
For example, a C base that has 90\% chance of being modified with 5mC being three times more likely than 5hmC will encode 5mC with 67.5\% probability ($0.9 * 0.75$)and 5hmC with 22.5\% probability ($0.9 * 0.25$).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really want to be able to encode probabilities in the range of 10% - 90% perhaps we should not use Phred which has terrible resolution in that range (10%=10 and 90%=".45"...1?0?)
Please be explicit that the " and & are encoding phred 1 and 5.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time of writing this I hadn't seen any real data.

However I thought about that once I saw how ONT handle things. They report probabilities as p*255; ie p=0.33 is 84. They have arrays of values, so if trained on C, 5mC and 5hmC then they'd have 3 score for C which jointly add up to 255 (irrespective of whether they believe the call to be C - this is to be used after combining with the base call probabilities themselves), giving the relative likelihoods of the various modified and unmodified forms.

However it also means they cannot give higher certainties than 99.6%, or around phred 24. That may well be sufficient.

The question is, do we acknowledge this is different data and is never going to be extremely accurate? Do we actually care about small fluctuations in probabilities at the low end? Eg p=0.15 vs p=0.2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot to add the big about explicit " and &.

Quoting those is messy too as we have latex quotes vs ascii quotes mixed together! I'll try and find some different numbers that don't produce a quote. :-)

Copy link
Contributor Author

@jkbonfield jkbonfield Oct 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually removing that phred ASCII quote is simply not possible, which comes down to the phred range again.

If we wish to emit probabilities for more than one possibility, which I believe is how ONT Guppy may go in the future (https://twitter.com/AlbertVilella/status/1151395294879395845), then phred scores will always generate a score of 0, 1 or 2 somewhere. That's because inherently there will be one call with p < 0.5.

The Illumina answer to this was to do phred+64 and use log-odds instead of log. Ie log(p/(1-p)) instead of log(1-p). That gives 0 being p=0.5 and is symmetric so choice A if p=.99 and choice B of p=0.01 gives phred scores of +20 and -20.

However it still begs the question of whether a log scale is appropriate, or whether even knowing the other values helps. When I was doing consensus calling algorithms I always grumbled that phred isn't good enough. If you're trying to make a call and you get A with P=0.9 but everything else is C, then it really helps to know whether than remaining 0.1 probabiity on the A was evenly spread across C, G and T or predominantly a remainder for C. (I just had to go with flat distribution in lieu of anything better.)

I simply don't have enough understanding of the downstream tools to know whether emitting the score for the most likely call is correct (the phred approach), or whether we the benefits of knowing 2nd, 3rd best choices outweighs the extra storage cost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we do not know what people will want to write (and the resolution they need) should we hold-off on defining the quality tag? perhaps suggest that the qualities be written into a lower-case tag until we have a better understanding? I don't want to "waste" a tag that will be unusable in the future.....

@yfarjoun
Copy link
Contributor

Can ONT or other technolgies read Uracil directly? At some point we will have modifications of U...should we allow U in the modified bases as well?

SAMtags.tex Show resolved Hide resolved
SAMtags.tex Show resolved Hide resolved
SAMtags.tex Outdated Show resolved Hide resolved
@marcus1487
Copy link

I view reference based analysis more as a downstream format, such as mpileup, consensus callers, VCF, etc.

The issue is that most of the downstream processing formats lose the per-read long range connection information. So I do think that swapping out the SEQ field for the reference sequence should be supported. I personally don’t have an issue with the practice of swapping out the SEQ field with the reference sequence, but I appreciate the concern that this is a relatively uncommon practice. If there were another solution to this issue I’d be open to it, but given the issues raised here, the current implementation of this tag seems to be the best solution at the moment.

Note that for Nanopore data, processing higher accuracy calls are achieved by anchoring to the reference than anchoring to the basecalls. But this is still done on a per-read level. I mentioned these issues a while back in this thread #418 (comment)

@jts
Copy link
Contributor

jts commented May 7, 2021

Thanks all, it seems the consensus is to not add a flag to MM so I won't push on that.

I took a step back to evaluate how well projecting reference calls back onto the read sequence would work on two datasets, one a few years old (NA12878 WGS consortium) and the other very recent (a sample sequenced here last week).

For NA12878 84.7% of sites cleanly project onto the read (the reference C matches a read C). 8.9% of sites project to a mismatch, but would be handled by @jkbonfield's suggestion to use N as the canonical base. The remaining 6.4% of calls project to a deletion and would be lost.

For the recent sample (using a recent version of guppy) the numbers are 89.3%, 6.6%, 4.1% for match, mismatch, and deletion. I think 4% loss of calls is too high so would like either a new tag, or to emit the reference as SEQ.

While I could live with using reference as SEQ, it would cause downstream difficulties. Some workflows (e.g. haplotype specific methylation) would be locked into a particular order of steps (align -> phase -> call methylation, as align -> call methylation -> phase won't work). This is incompatible with live methylation calling (calling during a run) that both nanopolish and megalodon support. One solution would be to keep two sets of BAMs around but this is inefficient and would be difficult to keep in sync. I'd prefer a new set of tags like @jrobinso suggested.

@marcus1487
Copy link

@jts I’m not sure if this would play nice with downstream tools, but would the MD field suffice to transfer the sequence variant information in a single BAM file? I’m also not sure if this solution plays nicely with a perfectly matching cigar string. Figured it might be worth bringing up though.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented May 7, 2021

Maybe I'm missing something here.

How is it possible to call a modified C on a read that doesn't even have a C there (it's a deletion)? The only way I can think of if is you're essentially redoing the base-calling (in the light of all the other sequences around it), in which case you should be emitting the recalled sequence too surely.

It doesn't make any sense to me to have SEQ and base modifications coming from different length reads.

@marcus1487
Copy link

Within megalodon, a read is basecalled and then mapped to the reference but the link between the raw nanopore signal and the reference is maintained. Then for modified base calling, the canonical basecalls are discarded and modified base calls are made using the reference bases and the raw signal in the region around a reference base of interest. The idea here is that the reference should be correct. If the reference for a sample is wrong this should be fixed before performing modified base calling. This process results in more accurate modified base calls than anchoring the mod calls to the original basecalls.

@jts
Copy link
Contributor

jts commented May 7, 2021

How is it possible to call a modified C on a read that doesn't even have a C there (it's a deletion)?

By anchoring to the reference sequence, nanopolish/megalodon turn what is a largely unconstrained task (basecalling can emit any sequence of length L over the alphabet) into a binary classification problem (is this reference C methylated or not?). As @marcus1487 notes above this improves accuracy but leads to inconsistencies wrt the basecalled read (making calls at deletions).

@jkbonfield
Copy link
Contributor Author

I'm not questioning that it gives more accurate results, but saying that if you can correctly call a base is methylated then you can also correctly call that base exists.

Ie if you want to write this data back to BAM et al again, then why not write out the polished sequence field, as well as the polished methylation calls. It makes no sense to only emit one without the other, as that leads to all the issues we're discussing here.

Or putting it another way, why wouldn't you want to also correct the base calls if you have evidence it was previously called incorrect?

@jts
Copy link
Contributor

jts commented May 7, 2021

There are a few reasons we wouldn't want to output the corrected sequence:

  1. downstream tools (medaka, etc) all expect (and are trained on) basecalled reads, so would not work well with the sequence nanopolish determines. The original bam would still be needed for input into other tools.

  2. it is prohibitively expensive to polish the entire read anyway (only ~1% of the genome is CpG)

These issues are not a concern if we're happy for this bam to only be used for methylation workflows (generating a bed file with frequencies, visualization) but if the plan is to provide a lightweight way to carry the methylation information through a workflow, all the way to archival in ENA, then I'm against modifying SEQ.

@jrobinso
Copy link

jrobinso commented May 7, 2021

@jkbonfield I have a question about allowed characters for the modification code. The motivation is to derive a rule to distinguish between mulit-modifications and modifications that are not one of the standard code values. From the draft spec I infer that the exahaustive list of standard 1-character codes is

'C', 'm','h','f','c','T','g','e','b','U','A','a','G','o','N','n'

The only other modification string mentioned are "chebi" codes, which appear to be integers. Thus given a string like 'Cmhf' I might infer it is a multi-modification because all the individual characters are in the standard code set(4 modifications in this instance, not likely but)

Or I might infer it is not a chebi code because its not an integer, but could it just be a single modification that happens to be named 'Cmhf'?. Its unclear what the rule here is.

@jkbonfield
Copy link
Contributor Author

Or I might infer it is not a chebi code because its not an integer, but could it just be a single modification that happens to be named 'Cmhf'?. Its unclear what the rule here is.

It's a modified base and has a similar single-character encoding as bases, so Cmhf is unambiguously 4 nucleotides. The chebi codes are a bit of a shoe-in as a way to avoid having to code to larger UTF-* style alphabets.

Also your description is presumably something related to a third-party format (eg fasta) and not this spec. We never mix standard IUPAC codes and the modification codes together in SAM. One is a layer on top of the other. So you wouldn't be seeing C and mhf next to each other.

@jrobinso
Copy link

@jkbonfield Thanks for the response, I got it, my contrived example was a poor one. I understand its a modified base but a base can have multiple modifications. (multiple modification mode), but I suppose no more than 2? By my reading "mh" could be 2 modifications of 1 base or, some code called "mh", I was unsure of what the rules were for characters making up a ChEBI code, or for that matter if ChEBI code is the only allowed type of code. Reading it again I think that it is. Still when seeing multiple characters for the modification one has to decide is this s ChEBI code or a multi-modification, I'm using the rule that if all the characters are among the table of codes in the spec this is a multi-modification.

@cjw85
Copy link

cjw85 commented May 10, 2021

@jkbonfield

I'll unceremoniously wade in here, since I think both @jts and @marcus1487 implied it but stopped short of saying it:

The only way I can think of if is you're essentially redoing the base-calling

In essence, yes. The process that nanopolish et. al. others take is akin to variant calling. The resultant "basecall" for the read is the reference sequence with some Cs (other other base) swapped out for modified forms. Megalodon already does write out a BAM as you suggest with the SEQ field swapped out to a sequence concomitant with the methylation tag: this sequence is precisely the reference sequence.

My only concern here is the knock-on effect this has for users. I can imagine users loading these files into IGV and immediately sending @jrobinso a bug report that the reads all display perfectly as the reference sequence. Perhaps this is just a bit of user education, but my gut feeling is that it would be preferable to store both the basecall and the methylation data w.r.t. the reference sequence if we can find a way. @jts already raised the idea of a flag within the MM tag, perhaps another idea is to define a separate reserved tag?

SAMtags.tex Outdated
In the example used above, the 6th {\tt C} has 80\% chance of being {\tt 5mC}, 10\% chance of being {\tt 5hmC} and 10\% chance of being an unmodified {\tt C}.

{\tt Ml} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code.
For example {\tt Mm:Z:C+C,10; Ml:Z:229} indicates a C call with a probability of 90\% of having some form of unspecified modification.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this Ml tag start with Ml:Z:C, as all the others do in the specification? Or is the C optional?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fbrennen is correct. The Z is clearly an error as the quality values are a byte array.

@marcus1487
Copy link

marcus1487 commented May 19, 2021

A point of confusion has come up in a couple conversations recently so thought it worth mentioning here. Given the current file format it is allowed to have multiple modifications on the same base in the same read (all good here) and each of these probabilities could take values between 0 and 1 (encoded on 0-255 scale). The issue here is if these probabilities sum to a value greater than 1, what should be the interpretation?

For example it seems legal to have tags such as MM:Z:C+mh,0 ML:B:C,250,250 indicating 0.98 probability of both 5mC and 5hmC. This seems invalid, but may be output by some programs.

The note on lines 608-611 of SAMtags.tex comes close to making this point, but stops short of indicating that such probabilities summing to greater than 1 are considered invalid. I think a note explicitly indicating that this is considered invalid within the format would be helpful.

Note that there is a chance that floating point and rounding errors could result in implied values slightly over 1. I don't think these cases are of grave concern, but cases where the probabilities sum to values much larger than 1 would be hard to interpret for downstream tools.

@jkbonfield
Copy link
Contributor Author

An interesting point. The intention was to describe them as summing to something <= 256 (or as you point out perhaps marginally above is OK due to rounding errors, so we need to be clear). I'll see what I can do to improve the wording, but describing it as sums of probabilities rather than sums of integer encoding would help I think.

@hts-specs-bot
Copy link

Changed PDFs as of 039c151: SAMtags (diff).

@jkbonfield
Copy link
Contributor Author

Just a heads up to people, we (SAM spec maintainers) plan to merge this in around 3-weeks, as a "Draft tag" (ie half upper, half lower).

Given it has already been implemented in a number of tools, we don't envisage it will stay in draft form for particularly long, perhaps in the 1-2 month time frame, before becoming officially adopted. We didn't wish to jump straight to the final tag definition as it's not until it's been actively used for a while that we have confidence there are no nasty corner cases lurking.

Thank you all for your patience and help in pulling this together.

@fbrennen
Copy link

Fantastic news! =)

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jul 8, 2021

Squashed, rebased, and updated the SAMtags history from June 2019 to July 2021!

(Ie ready to merge.)

These are currently in draft form, as Mm and Ml, to be migrated to MM
and ML in a month or so.
@hts-specs-bot
Copy link

Changed PDFs as of 0bd2976: SAMtags (diff).

@jmarshall
Copy link
Member

This introduced the term “top strand” without really saying what it means — see #639.

@PanZiwei
Copy link

PanZiwei commented Mar 9, 2023

The thing is that SEQ tends to get reverse complemented after mapping and the coordinates get messed up unless the downstream tools know how to reverse the MM/MP tags (which they don't).

Since the quote part was raised by @kpalin in 2019, I would like to follow up with the current strategy that samtools used to deal with the strand orientation, and any potential way to reverse the MM/MP tags?

What's the corresponding relationship of the MM /ML tags between these 2 reads below?

Read1   5'-XXCXXCXCXXCXCXXCXCXCXX-3'
Read2   3'-XXGXXGXGXXGXGXXGXGXGXX-5'

Let's say read1 is XXCXXCXCXXCXCXXCXCXCXX(containing 8 C), and giving MM:Z: C+m?,2,4; ML:B:C,256,256 and I am only interested in the C, not G. Can I infer the MM and ML tag for C in its complementary read (read2)? If so, how can I do that?

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Mar 9, 2023

I don't think there is any way you can infer what modifications another read has. If it's not measured, it's assumed to be unrecorded (unless there is an MM tag explicitly stating nothing was found).

However if you were to record modifications on read 2 and found them to have modifications at the same locations that read 1 did, then it would be counting the Gs starting from the right hand end. So for read1 you skipped 2 Cs, called a C, skipped 4 more, and then called the next. Ie 3rd and 8th C:

Read1   5'-XXCXXCXCXXCXCXXCXCXCXX-3'
             -  - M  - -  - - M 

Read2   3'-XXGXXGXGXXGXGXXGXGXGXX-5'
             -  - M  - -  - - M

The opposite strand is calling the 1st and 6th going 5' to 3', which would be C+m?,0,4; ML obviously also working right to left.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.