What should mf2 textContent parsing result in? User expectation vs. DOM specification.

## Summary

At several points the parsing specification says to return the `textContent`, but it never defines what this means. I have always assumed it meant [the DOM `textContent` property](https://dom.spec.whatwg.org/#dom-node-textcontent) of the current element, but that does not match what parsers have been doing.

## Discussion

@aaronpk wrote [a blog post](https://aaronparecki.com/2018/01/12/3/xray) today containing the following, emphasis mine:

> I think my only solution for this is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by [this relatively long function](https://github.com/indieweb/php-mf2/blob/master/Mf2/Parser.php#L436) that does this in the PHP parser. However that might be the technically better option anyway, since **XRay can’t be sure exactly what method was used to generate the plaintext value from the original HTML** anyway.

I replied to the emphasised statement [in chat](https://chat.indieweb.org/dev/2018-01-12/1515772228975700):

> DOM’s textContent should be used, IIRC, else the parser is broken.

This started [a discussion in the #indieweb-dev chat](https://chat.indieweb.org/dev/2018-01-12#t1515772228975700) that is best read in the chat logs. The discussion continued in [the #microformats chat](https://chat.indieweb.org/microformats/2018-01-12#t1515774252519800). The important take-away is that [the PHP parser](https://github.com/indieweb/php-mf2) includes [its own text extraction implementation](https://github.com/indieweb/php-mf2/pull/82), added after [an issue](https://github.com/indieweb/php-mf2/issues/69) was filed by a user who was missing expected whitespace in the output.

It turned out that [the JavaScript parser (glennjones/microformat-shiv)](https://github.com/glennjones/microformat-shiv) was already doing [something like that](https://github.com/glennjones/microformat-shiv/blob/dev/lib/text.js).

The important part here is **user expectation**. The user who opened the issue on the PHP parser was expecting to see a line break in the plain text value where a `<br>` used to be. It is also [what aaronpk would expect](https://chat.indieweb.org/microformats/2018-01-12/1515775140341300). [From chat](https://chat.indieweb.org/microformats/2018-01-12#t1515775140341300):

> no, I would definitely expect newlines in the plaintext
> given that's how a browser will render it
> and if you copypaste the text from the browser it will have newlines

I don’t have a strong personal preference. I do feel that the parsing specification should define which result it wants, so that parsers stay compatible with each other.

If we end up defining our own textContent algorithm for HTML→plain-text, I think we should take a good look at what browsers are doing, especially plain-text browsers such as lynx and w3m.
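To make the "newline where a `<br>` used to be" expectation concrete, here is a minimal sketch of what such an algorithm could look like. The rule set (`BLOCK_TAGS`, the whitespace collapsing) and the names `PlainTextSketch` and `plain_text` are assumptions made up for illustration; they are not taken from the PHP parser, the JavaScript parser, or any specification.

```python
import re
from html.parser import HTMLParser

# Hypothetical rule set, for illustration only: elements treated as line boundaries.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class PlainTextSketch(HTMLParser):
    """Collects text, inserting a newline for <br> and around block elements."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br" or tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse runs of source whitespace into a single space.
        self.parts.append(re.sub(r"\s+", " ", data))


def plain_text(html):
    parser = PlainTextSketch()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    # Collapse consecutive line breaks (and spaces around them) into one newline.
    text = re.sub(r" *\n[ \n]*", "\n", text)
    return text.strip()


print(repr(plain_text("Wow<br><span>This</span>")))
```

Whether a paragraph boundary should become a newline, a space, or a blank line is exactly the kind of choice the specification would have to make; this sketch picks a single newline only to keep the example short.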

## Parser behaviour

Test:

```html
<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>
```

Tested through microformats.io, with the output shortened to only the affected h-entry. The Node and Ruby parsers were not available for testing.
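For comparison, the DOM definition of `textContent` on an element amounts to concatenating the data of all descendant text nodes in tree order. The sketch below approximates that for the test input using Python's standard-library `html.parser`, which is a tokenizer rather than a DOM tree builder; the names `TextNodeCollector` and `text_content` are hypothetical, and this is not how any of the parsers below are implemented.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects character data in document order and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_content(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    # No separators are added: <br> and element boundaries contribute nothing.
    return "".join(collector.chunks)


SAMPLE = ('<div class="h-entry"><p>Wow<br><span>This</span></p>'
          '<p>Is Interesting</p></div>')
print(repr(text_content(SAMPLE)))  # 'WowThisIs Interesting'
```

Plain concatenation yields the same string as the Python and Go results below, while the PHP result differs because of its own text extraction.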

### PHP

```json
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "Wow\nThis Is Interesting"
    ]
  }
}
```

### Python

```json
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "WowThisIs Interesting"
    ]
  }
}
```

### Go

```json
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "WowThisIs Interesting"
    ]
  }
}
```
