-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sam: Add PL value "MGIG400" as seen in the wild #718
Comments
Nevermind, MGI doesn't seem to ship aligners. PL field most probably introduced on third party pipeline downstream erroneously. |
It's also worth noting even if this was being generated by MGI and we hadn't previously decided on DNBSEQ, it would be rejected. The problem with MGIG400 is it conflates platform with model, which are two different tags. |
We did also discuss this sort of issue in the most recent conference call. The issue was one of validation. Having an invalid field here does not invalidate the rest of the file. Syntacically it would all make sense. The point was raised what if we check these fields and reject files that don't match, but the specification then gets updated? We haven't updated the SAM version number when we've added extra fields here as the syntax is identical, so programs cannot check that either. That's a valid point. We felt the correct process would be, if validation is performed, to make it a warning only. This fits with the point above that unknown data here does not invalidate any remainder of the file. You could argue then what's the point of having a controlled vocabulary, as it doesn't stop people from just adding anything there anyway (as demonstrated). We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc. That's even ignoring the issue of case sensitivity. With it, well we may still get invalid fields, but hopefully it is significantly reduced. Note that this also ties in with sequence submissions as the archives have a controlled vocabularly in their schemas. |
A file can either be well-formed or not. I'm not sure why you would want or trust a somewhat valid file w.r.t. a specification, especially in the scientific domain.
Appending to a list of known values changes the syntax.
In this case, the spec shouldn't define them as valid values but as suggested values. |
The syntax of header lines is described in the first paragraph of §1.3: the fields are TAB-separated, and the line matches the regexp shown (notwithstanding the minor UTF-8-related issue you noted elsewhere). So regardless of whatever characters are in the PL field value, there is no difficulty parsing it: there is a TAB before the The list of keywords in the PL description is a list of semantically valid values. The syntax of the header line is unchanged when this list is appended to, as doing that does not affect parsing of that line or of the rest of that file or even (generally speaking) the interpretation of the rest of the file. Note for example that ULTIMA was added to the spec when the SAM VN version number was already 1.6. Nonetheless a SAM file that says We've discussed this any number of times, e.g. on #454. I don't see any particular need to relitigate it, but if we do let's do it on a new open issue rather than a closed WONTFIX based on someone mistakenly specifying MGIG400 on their |
The BAM header below was refusing to be parsed by noodles due to the PL field (via htsget-rs), as reported by NESI. This BAM field came from this machine. Non-related fields have been manually masked/anonymized with
****
:Interestingly, this PL field should have been
DNBSEQ
but the software machine outputs its model (MGIG400
) instead, nowadays? Perhaps the vendor should be notified and prompted to update their tooling instead of introducing this change to the standard.Tangentially related issue worth exploring along this one: #679
/cc @ohofmann @mmalenic @zaeleus
The text was updated successfully, but these errors were encountered: