Description
Initial checklist
- I read the support docs
- I read the contributing guide
- I agree to follow the code of conduct
- I searched issues and couldn’t find anything (or linked relevant results below)
Problem
The way DOM clobbering prevention is implemented, this plugin leaves sanitized HTML broken: it correctly rewrites the attributes that can cause clobbering (id
and name
, configured by default) and by default some attributes that reference those IDs (ariaDescribedBy
and ariaLabelledBy
).
Notably missing is the href
attribute, which can be used for internal links using a fragment which value matches the id
or name
attribute of an element. This breaks GFM footnotes (cf. #30, #29) but also internal links such as those targetting headings slugified by rehype-slug
.
The current solution has been deemed acceptable because 1) it keeps the internal links "pretty" for the users; and 2) github likes it like that.
I think however that it is not very good, because it requires awareness of the issue otherwise the plugin just appears broken, and by default breaks the document and requires a workaround to be applied later. And the suggested workaround (the github script) also has limitations, like it wouldn't support multiple generated documents being included with different clobberPrefix
to avoid collisions.
I would much prefer if the sanitizer wouldn't break my documents by default where there is no reason to. Pretty links are nice, but it should be a conscious decision to enable them as it requires extra effort to un-break the document afterwards.
Solution
I propose to improve the rewriting of attributes that reference another element via their ID as follows:
While visiting nodes:
- Prefix the
id
andname
attributes as currently; - Add the original value of the rewritten
id
andname
attributes to a list of targets; - If the node contains an attribute that can reference an ID (e.g.
ariaDescribedBy
,ariaLabelledBy
but alsohref
), add it to a list of references.
Now that the whole tree has been traversed, we know all the IDs in the sanitized document, and can start rewriting the references. Doing it after the traversal instead of during allows to be a bit smarter with the rewriting, and allows (maybe as an option) to keep references untouched when they don't target an ID within the document (if the document is later included into another one, this would allow referencing elements in the enclosing document, e.g. #top
for an internal link, or some common aria labels or descriptions).
For the href
attribute, I propose to only rewrite them if they contain just the fragment. So for example href="foobar.htm#section-xyz"
would not be rewritten, because without extra context we can't know whether foobar.htm
is the current page, but href="#section-xyz"
would be rewritten to href="#user-content-section-xyz"
. This would at least fix the broken GFM footnotes, and generally internal links, which I think is a common case when converting from markdown.
This would require to special-case href
because the assumption on the format used currently for the two aria attributes doesn't hold. Instead of that, I also propose to slightly change the format of the configuration to allow passing a function to handle the rewriting, which would let users add support for more exotic attributes, such as custom data-*
attributes (not that anybody requested such a thing yet) or simply let the user handle the difficult things like the URL fragment not being on its own.
Alternatives
- Do nothing: If github is happy, everyone's happy 😿
- Monkey-patching the
hast-util-sanitize
lib, as mentioned by the reporter of Don't apply the clobber prefix when it's already present #29: Yes, it works, but we're basically all on our own. Since sanitization is security-relevant, I think it would be better to do it oncerightbetter. It would also be quite brittle, and could break with any update. - Write another plugin just to fix the links: This plugin would have less context to work with than the sanitize plugin, e.g. no idea which IDs have been rewritten (it would need to assume that specifically
hast-util-sanitize
was used and all IDs have been rewritten). Also an extra walk of the tree. And the plugin's name would be somewhat stupid:hast-util-sanitize-clobbering-fix
. - Write another plugin to handle just clobbering: This would make the functionality in
hast-util-sanitize
redundant, while probably ending up lifting a big chunk of the current code, would require configuring another plugin to make it work (i.e. disabling theclobber
handling inhast-util-sanitize
), extra tree walk, annoying plugin name.
I hope you are at least open to the idea that my proposal can be better than the alternatives. On my part, I am also open to suggestions if my proposition doesn't quite fit the bill. I can implement it, but I can't promise to implement it fast =)