encoding/xml: proposed fixes for namespaces #14407

Open
@pdw-mb

Issue #13400 lists a number of issues related to namespace handling in encoding/xml. It's been noted that this area needs a bit of a rethink. This issue documents a set of proposed fixes to address the problems currently seen. I've grouped the current set of bugs into 7 separate topics to be addressed:

1. Lack of control over prefixes

Encoder currently provides no mechanism for the user to specify namespace/prefix bindings. In theory, this shouldn't matter: documents using different prefixes for the same namespaces are equivalent. In practice, people often do care.

For namespaced attributes, Encoder generates namespace prefixes based on the last part of the URI, and for elements, it redeclares the default namespace as required. This results in documents that are technically correct, but not what the user wants, and the generated prefixes may be cumbersome.
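To make the current behaviour concrete, here is a minimal sketch that marshals one namespaced element carrying one namespaced attribute. The Item type and both example.com URIs are made up for illustration; only xml.Marshal is real API, and the exact prefix the encoder picks is left for the reader to observe rather than asserted here.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Item is a hypothetical type: its element and its attribute live in
// two different (made-up) namespaces.
type Item struct {
	XMLName xml.Name `xml:"http://example.com/ns item"`
	ID      string   `xml:"http://example.com/attrs id,attr"`
}

// encodeItem marshals a sample Item and reports whether the element's
// namespace was declared as the default namespace on the element itself.
func encodeItem() (out string, usesDefaultNS bool, err error) {
	b, err := xml.Marshal(Item{ID: "1"})
	if err != nil {
		return "", false, err
	}
	out = string(b)
	return out, strings.Contains(out, `<item xmlns="http://example.com/ns"`), nil
}

func main() {
	out, usesDefaultNS, err := encodeItem()
	if err != nil {
		panic(err)
	}
	// The caller has no say in the prefix used for the attribute.
	fmt.Println(out)
	fmt.Println("element uses default namespace:", usesDefaultNS)
}
```

Note that nothing in the struct tags or on the Encoder lets the caller choose the attribute's prefix, which is the gap this topic is about.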

This raises the question of how much control we should give over prefixes.

XML allows quite a lot of complexity: prefixes can be rebound to different namespace URIs within a document, and the same namespace URI can simultaneously be bound to multiple prefixes. There's a very old post on xml-dev that makes a plea to produce "sane" documents, that is, ones where a given prefix is only ever used to refer to a single namespace URI, and a given namespace URI is only ever represented by a single prefix:

http://lists.xml.org/archives/xml-dev/200204/msg00170.html

I suggest that it is sufficient that we support the creation of "sane" documents, noting that we should allow a single URI to be represented by both a prefix and the default namespace within a document (this is effectively what Encoder does currently).

I think that the current approach of using the default namespace for elements, and generated prefixes for attributes is a good default: generated prefixes may be ugly, so we should use them only where needed (i.e. attributes), but should provide a mechanism for users to specify their own.

Proposed approach:

  • Allow users to specify a map of "preferred prefixes" on Encoder that obeys the "sane" constraints, e.g. Encoder.AddNamespaceBindings(map[string]string)
  • When marshaling a namespaced attribute, use a preferred prefix, if available, and a generated one if not.
  • When marshaling a namespaced element, use a preferred prefix if available, and the default namespace otherwise.

Notes/questions:

  • The code changes required for this are pretty simple, although it does mean maintaining a notion of "preferred" and "generated" prefixes.
  • We probably want to allow users to specify additional preferred prefixes during the marshaling process (see point 2). This raises a question of whether we want to enforce that the document is "sane" - a user could potentially specify different bindings for the same namespace in different subtrees, e.g.:
  <foo>
    <x:bar xmlns:x="blort" />
    <y:baz xmlns:y="blort" />
  </foo>
  • Should preferred prefixes be output as soon as they're defined, or only when used? The former should probably be the default, but we might want to provide an option to control this, as it would allow the user to chuck a big bucket of "preferred prefixes" at the encoder without bloating the document with unused prefixes.
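The "sane" constraint on preferred prefixes could be enforced roughly as follows. This is only a sketch: Bindings, NewBindings, Add and Prefix are hypothetical names, not part of encoding/xml, and the proposal does not prescribe this shape.

```go
package main

import "fmt"

// Bindings is a hypothetical one-to-one mapping between prefixes and
// namespace URIs. It is not part of encoding/xml.
type Bindings struct {
	byPrefix map[string]string // prefix -> namespace URI
	byURI    map[string]string // namespace URI -> prefix
}

// NewBindings returns an empty set of bindings.
func NewBindings() *Bindings {
	return &Bindings{byPrefix: map[string]string{}, byURI: map[string]string{}}
}

// Add records prefix -> uri. It rejects a prefix that is already bound
// to a different URI, and a URI that already has a different prefix.
func (b *Bindings) Add(prefix, uri string) error {
	if prefix == "" || uri == "" {
		return fmt.Errorf("prefix and URI must be non-empty")
	}
	if old, ok := b.byPrefix[prefix]; ok && old != uri {
		return fmt.Errorf("prefix %q is already bound to %q", prefix, old)
	}
	if old, ok := b.byURI[uri]; ok && old != prefix {
		return fmt.Errorf("namespace %q already has prefix %q", uri, old)
	}
	b.byPrefix[prefix] = uri
	b.byURI[uri] = prefix
	return nil
}

// Prefix returns the preferred prefix for uri, if one was registered.
func (b *Bindings) Prefix(uri string) (string, bool) {
	p, ok := b.byURI[uri]
	return p, ok
}

func main() {
	b := NewBindings()
	fmt.Println(b.Add("x", "blort")) // accepted
	fmt.Println(b.Add("y", "blort")) // rejected: "blort" already has prefix "x"
}
```

With a check like this, the x/y example above would be refused at the point where the second binding is added. Whether bindings should be scoped per subtree instead is exactly the open question.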

Issues addressed:
#11496: Serializing XML with namespace prefix
#9519: support for XML namespace prefixes

I think #9519 is based on a misunderstanding of how the current system works, but it seems likely that the user actually wants control over prefix names. I'm not sure if the reporter realises that a prefix is meaningless (and illegal) without a namespace URI binding.

2. Inability to access/set namespace declaration (handling QName values)

Namespace bindings are sometimes used by element and attribute values. For example:

  <foo xmlns:a="bar">a:blort</foo>

In order to correctly understand "a:blort" you need to know the currently effective namespace bindings. The same problem exists in reverse when encoding: you need to make sure that necessary namespace declarations are in place in the document.
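As an illustration of what callers have to do by hand today, the sketch below resolves a QName in element content by collecting the xmlns:* attributes that Token currently reports (see point 6). resolveQNameText is a hypothetical helper; it looks only at declarations seen so far in a single flat scope and leaves unprefixed values unresolved, so it is not a general solution.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// resolveQNameText reads doc, gathers xmlns:prefix declarations from
// start elements, and resolves the document's text content as a QName.
// Scoping of nested declarations is ignored to keep the sketch short.
func resolveQNameText(doc string) (xml.Name, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	bindings := map[string]string{} // prefix -> namespace URI
	var text strings.Builder
	for {
		tok, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return xml.Name{}, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			for _, a := range t.Attr {
				// Token reports xmlns:a="bar" as {Space: "xmlns", Local: "a"}.
				if a.Name.Space == "xmlns" {
					bindings[a.Name.Local] = a.Value
				}
			}
		case xml.CharData:
			text.Write(t)
		}
	}
	prefix, local, ok := strings.Cut(strings.TrimSpace(text.String()), ":")
	if !ok {
		// No prefix: default-namespace handling is left to the caller.
		return xml.Name{Local: prefix}, nil
	}
	uri, bound := bindings[prefix]
	if !bound {
		return xml.Name{}, fmt.Errorf("prefix %q is not bound", prefix)
	}
	return xml.Name{Space: uri, Local: local}, nil
}

func main() {
	name, err := resolveQNameText(`<foo xmlns:a="bar">a:blort</foo>`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("namespace=%q local=%q\n", name.Space, name.Local)
}
```

A custom Unmarshaler has no access to this information when it is handed only the value, which is why exposing the Decoder's bindings is proposed below.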

Proposed approach:

We need to allow Unmarshalers and Marshalers to obtain and insert namespace bindings respectively. This means:

  1. A method on Decoder to expose current namespace bindings (trivial - it's already present privately)
  2. A change to UnmarshalerAttr, as its UnmarshalXMLAttr method doesn't currently receive the decoder. The safe way to make this change would be to create a new interface (e.g. UnmarshalerXMLAttrWithDecoder).
  3. Provide a method for Marshalers to inject namespace bindings. I suggest doing this by providing a method on Encoder to obtain a prefix for a namespace (GetPrefix ?), which will then take care of declaring the namespace if it hasn't yet been used. If the user cares what prefix they get, they should provide a preferred prefix prior to making the call to obtain one.
  4. As a convenience, we should make XMLName a Marshaler/Unmarshaler.

Issues addressed:
#12406: support QName values / expose namespace bindings

3. Specifying namespaces in tags is cumbersome

Currently namespaces for elements may only be specified by including the full namespace URI, e.g.:

  `xml:"http://www.example.com/some/namespace/v1 foo"`

Aside from being verbose and repetitive, it means URIs can't be changed at runtime. It's not uncommon to want to use the same struct for different namespaces, for example, where the version number in the namespace has changed, or as per #12624, to cope with documents using a subtly wrong namespace.

Proposed solution:

Given the mechanism in (1) that lets the user specify namespace/prefix mappings, a struct can unambiguously use prefixes to reference namespaces. The obvious notation is QName notation:

  `xml:"nsv1:foo"`

Under this proposal it would be an error to use a prefix that hadn't been explicitly specified by the user (i.e. it won't use prefixes picked up from the document when decoding). Users might be surprised that the above wouldn't match the following document unless they'd explicitly set the prefix "nsv1" on the Decoder:

   <nsv1:foo xmlns:nsv1="...">bar</nsv1:foo>

but doing so would be inherently fragile, as it wouldn't work with the entirely equivalent:

  <foo xmlns="...">bar</foo>
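The tag expansion described here might look something like the sketch below, which resolves the name part of a tag against user-supplied bindings only. expandTagName is a hypothetical helper, and the v2 URI is an assumed variation on the v1 URI used earlier.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// expandTagName turns the name part of a struct tag, such as "nsv1:foo",
// into an xml.Name using only the bindings the caller registered.
// Prefixes seen in a document being decoded are deliberately not consulted.
func expandTagName(tag string, bindings map[string]string) (xml.Name, error) {
	prefix, local, ok := strings.Cut(tag, ":")
	if !ok {
		return xml.Name{Local: tag}, nil
	}
	uri, bound := bindings[prefix]
	if !bound {
		return xml.Name{}, fmt.Errorf("tag %q uses undeclared prefix %q", tag, prefix)
	}
	return xml.Name{Space: uri, Local: local}, nil
}

func main() {
	// The same tag can be pointed at a different URI at runtime.
	v1 := map[string]string{"nsv1": "http://www.example.com/some/namespace/v1"}
	v2 := map[string]string{"nsv1": "http://www.example.com/some/namespace/v2"}
	for _, b := range []map[string]string{v1, v2} {
		name, err := expandTagName("nsv1:foo", b)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s in %s\n", name.Local, name.Space)
	}
}
```

Because matching is done on the expanded (URI, local name) pair, both document forms above match the tag equally once "nsv1" has been bound by the user.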

Notes:

This proposal changes the behaviour of Encoder/Decoder for tags with a colon in them, which existing code may rely on. On the other hand, the current behaviour of such tags is clearly a source of confusion and bugs, and doesn't work for decoding anyway (see #11496).

Issues addressed:
#9775: Unmarshal does not properly handle NCName in XML namespaces

I think the bug as described is invalid: it's not clear what you'd expect to happen given that the namespace being used is undeclared.

#12624: brittle support for matching a namespace by identifier or url

The exact requirement behind this bug is not totally clear: it appears that the user wants to unmarshal elements that may be in any one of a number of namespaces. I don't understand "Xmlns wasn't defined, but the namespace was used (ie. for mRSS with media namespace)" - that sounds like invalid XML.

4. "No namespace" indistinguishable from "any namespace" in struct tags

When decoding, the tag xml:"foo" means element "foo" in any namespace. There's no way to say that you want foo in the null namespace. i.e.

   <foo xmlns="" />

This is a problem if a namespaced sibling of the same localname also exists. #8535 demonstrates this quite clearly.
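A small sketch of the decoding behaviour described above: a field tagged only with a local name accepts the element whether or not it is namespaced. The Doc type and the urn:example namespace are invented for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Doc has a single field whose tag names no namespace.
type Doc struct {
	Foo string `xml:"foo"`
}

// decodeFoo unmarshals doc and returns whatever landed in Foo.
func decodeFoo(doc string) (string, error) {
	var d Doc
	err := xml.Unmarshal([]byte(doc), &d)
	return d.Foo, err
}

func main() {
	docs := []string{
		`<doc><foo>plain</foo></doc>`,
		`<doc><foo xmlns="urn:example">namespaced</foo></doc>`,
	}
	for _, doc := range docs {
		v, err := decodeFoo(doc)
		if err != nil {
			panic(err)
		}
		// Both documents populate the same field.
		fmt.Println(v)
	}
}
```

Since both forms are accepted by the same tag, there is nothing a struct author can write today to select only the un-namespaced one.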

Proposed approach:

Introduce a way of explicitly referencing the null namespace, e.g.

  `xml:"_ foo"`

We could go for the logical, but horribly subtle:

  `xml:" foo"`

(note the space before foo)

Issues addressed:
#8535: failure to handle conflicting tags in different namespaces
#11724: namespaced and non-namespaced attributes conflict

5. Bug: default namespace not set to null on un-namespaced children

It's not currently possible to produce the following XML:

<a xmlns="b">
  <c xmlns=""/>
</a>

If you produce <c> with a tag of:

`xml:"c"`

No xmlns declaration will be added, so <c> will inherit the namespace of its parent <a>. This is related to issue (4): we don't currently distinguish between "any namespace" and "no namespace".
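The following sketch reproduces this with the a/b/c names from the example above, then reads the output back to show which namespace <c> ends up in. The A and C types and the namespaceOf helper are made up for the demonstration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// C is an empty child element; its tag names no namespace.
type C struct{}

// A is declared in namespace "b", matching the example above.
type A struct {
	XMLName xml.Name `xml:"b a"`
	C       C        `xml:"c"`
}

// marshalA returns the encoder's output for an empty A.
func marshalA() (string, error) {
	b, err := xml.Marshal(A{})
	return string(b), err
}

// namespaceOf decodes doc and returns the namespace the Decoder assigns
// to the first element whose local name is local.
func namespaceOf(doc, local string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := d.Token()
		if err != nil {
			return "", err // includes io.EOF when no such element exists
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == local {
			return se.Name.Space, nil
		}
	}
}

func main() {
	out, err := marshalA()
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	ns, err := namespaceOf(out, "c")
	if err != nil {
		panic(err)
	}
	fmt.Printf("c is read back in namespace %q\n", ns)
}
```

The round trip makes the inheritance visible: the child was written without a namespace in its tag but is read back in its parent's namespace.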

I can see two possible solutions here:

  1. Treat xml:"c" as meaning "no namespace" and insert xmlns="" as required to make that so.
  2. Treat xml:"c" as meaning "any namespace" and make it inherit the namespace of its parent. If you really want no namespace, use xml:"_ c" (or whatever notation we settle on for (4))

I can see arguments both ways.

Issues addressed:
#7113: encoding/xml: missing nested namespace not handled

6. Bug: xmlns attributes not removed from Token (Decode/Encode not idempotent)

Decoder includes xmlns attributes in start element tokens. For example, an attribute of xmlns:foo="bar" would be included as an attribute with a name of {Space: "xmlns", Local: "foo"}. This is very dubious. xmlns attributes are special, but whichever way you look at it, "xmlns" is not a namespace URI - if anything, it's a prefix.

This creates problems if you feed the output of a Decoder into an Encoder, as it treats "xmlns" as a namespace URI, and introduces namespace declarations for it.

There's no good reason to include these attributes. It's reasonable to expose the current set of namespace bindings (see point 2), but the attributes themselves are not needed. If a user really wants to do their own namespace processing, they should use RawToken.

Proposed solution:

  • xmlns and xmlns:foo attributes should be stripped from the list of attributes returned by Token. They should be retained on RawToken.
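Until something like this lands, callers who pipe Decoder output into an Encoder can filter the declarations themselves. The sketch below shows one way; stripNamespaceDecls and firstStart are hypothetical helpers, and the sample document is invented.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// stripNamespaceDecls returns attrs without xmlns="..." and
// xmlns:prefix="..." entries, leaving ordinary attributes untouched.
func stripNamespaceDecls(attrs []xml.Attr) []xml.Attr {
	var out []xml.Attr
	for _, a := range attrs {
		isPrefixed := a.Name.Space == "xmlns"
		isDefault := a.Name.Space == "" && a.Name.Local == "xmlns"
		if isPrefixed || isDefault {
			continue
		}
		out = append(out, a)
	}
	return out
}

// firstStart returns a copy of the first start element in doc.
func firstStart(doc string) (xml.StartElement, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := d.Token()
		if err != nil {
			return xml.StartElement{}, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			return se.Copy(), nil
		}
	}
}

func main() {
	se, err := firstStart(`<a xmlns="urn:d" xmlns:foo="bar" foo:x="1" y="2"/>`)
	if err != nil {
		panic(err)
	}
	fmt.Println("attributes from Token:", len(se.Attr))
	for _, a := range stripNamespaceDecls(se.Attr) {
		fmt.Printf("kept {%q %q} = %q\n", a.Name.Space, a.Name.Local, a.Value)
	}
}
```

The remaining attributes already carry resolved namespace URIs in Name.Space, so the declarations are not needed to interpret them.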

Issues addressed:
#7535: Encoder duplicates namespace tags

7. Specifying xmlns attributes manually: allow or disallow?

Should we allow users to manually insert xmlns:* or xmlns="..." attributes?
#8167: disallow attributes named xmlns:*
#11431: encoding/xml: loss of xmlns= in encoding since Go 1.4

I don't think we need to support this, given the mechanism introduced under (1) and (3) above. One of the reasons why you might want to do it, is because namespace URIs are otherwise hard-coded into struct tags. The solution to (3) gives us a mechanism to avoid this.

That said, I'm struggling to see why we couldn't treat this as a call to add a preferred prefix - although there's a question of whether it should force the creation of the xmlns declaration if it's already in scope.

8. Other issues

#11735: empty namespace conventions are badly documented

Yes, this should be clearer.

#8068: encoding/xml: empty namespace prefix definitions should be illegal

It sounds like this should be resolved as invalid.

Labels: NeedsDecision (feedback is required from experts, contributors, and/or the community before a change can be made); early-in-cycle (a change that should be done early in the 3 month dev cycle).
