What should mf2 textContent parsing result in? User expectation vs. DOM specification. #15

Open
@Zegnat

Summary

At several points the parsing specification says to return the textContent, but it never defines what this means. I personally always assumed the DOM textContent property for the current element, but this does not seem to match with what parsers have been doing.

Discussion

@aaronpk wrote a blog post today containing the following, emphasis mine:

I think my only solution for this is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by this relatively long function that does this in the PHP parser. However that might be the technically better option anyway, since XRay can’t be sure exactly what method was used to generate the plaintext value from the original HTML anyway.

I replied to the emphasised statement in chat:

DOM’s textContent should be used, IIRC, else the parser is broken.

This started a discussion in the #indieweb-dev chat that is best read in the chat logs. The discussion continued in the #microformats chat. The important takeaway is that the PHP parser includes its own text extraction implementation, added after a user filed an issue about expected white space missing from the output.

It turned out that the JavaScript parser (glennjones/microformat-shiv) was already doing something like that.

The important part here is user expectation. The user who opened the issue on the PHP parser was expecting to see a line break in the plain text value where a <br> used to be. It is also what aaronpk would expect. From chat:

no, I would definitely expect newlines in the plaintext
given that's how a browser will render it
and if you copypaste the text from the browser it will have newlines

I don’t have any real personal preference. I do feel that the parsing specification should define which behaviour it wants, in order to guarantee compatibility between parsers.

If we end up defining our own textContent algorithm for HTML→plain-text, I do think we should take a good look at what browsers are doing, especially plain text browsers such as lynx and w3m.

Parser behaviour

Test:

<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>

Tested through microformats.io. Output shortened to only the affected h-entry. Node and Ruby were not available for testing.

PHP

        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "name": [
                    "Wow\nThis Is Interesting"
                ]
            }
        }

Python

  {
   "type": [
    "h-entry"
   ], 
   "properties": {
    "name": [
     "WowThisIs Interesting"
    ]
   }
  }

Go

    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "name": [
          "WowThisIs Interesting"
        ]
      }
    }
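
To make the difference concrete, here is a minimal Python sketch run against the test markup above. It is not taken from any of the parsers tested. `TextExtractor` and `extract_text` are hypothetical names, and the `br_as_newline` switch is only an assumption about one possible rule. By default it concatenates text nodes, which is roughly what DOM textContent does; with the switch on, it also emits a newline for each `<br>`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes in document order; optionally maps <br> to a newline."""

    def __init__(self, br_as_newline=False):
        super().__init__(convert_charrefs=True)
        self.br_as_newline = br_as_newline
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Hypothetical rule: treat <br> as a line break in the plain text.
        if tag == "br" and self.br_as_newline:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text nodes are kept verbatim, with no whitespace added between elements.
        self.parts.append(data)


def extract_text(html, br_as_newline=False):
    parser = TextExtractor(br_as_newline)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


TEST_HTML = (
    '<div class="h-entry"><p>Wow<br><span>This</span></p>'
    "<p>Is Interesting</p></div>"
)

print(repr(extract_text(TEST_HTML)))                      # 'WowThisIs Interesting'
print(repr(extract_text(TEST_HTML, br_as_newline=True)))  # 'Wow\nThisIs Interesting'
```

The first result matches the Python and Go output above. The second only handles `<br>`: it still lacks the space between the two paragraphs that the PHP parser produces, so a shared algorithm would also need a rule for block-level elements.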
