Description
Summary
At several points the parsing specification says to return the textContent, but it never defines what this means. I personally always assumed it meant the DOM textContent property of the current element, but this does not seem to match what parsers have been doing.
Discussion
@aaronpk wrote a blogpost today containing the following, emphasis mine:
I think my only solution for this is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by this relatively long function that does this in the PHP parser. However that might be the technically better option anyway, since XRay can’t be sure exactly what method was used to generate the plaintext value from the original HTML anyway.
I replied to the emphasised statement in chat:
DOM’s textContent should be used, IIRC, else the parser is broken.
This started a discussion in the #indieweb-dev chat that is best read in the chat logs. The discussion continued in the #microformats chat. The important take-away is that the PHP parser includes its own text extraction implementation, added after a user filed an issue about expected white space missing from the output.
It turned out that the JavaScript parser (glennjones/microformat-shiv) was already doing something like that.
The important part here is user expectation. The user who opened the issue on the PHP parser expected to see a line break in the plain text value where a <br> used to be. It is also what aaronpk would expect. From chat:
no, I would definitely expect newlines in the plaintext
given that's how a browser will render it
and if you copypaste the text from the browser it will have newlines
I don’t have any real personal preference. I do feel that the parsing specification should define which behaviour it wants, so that parsers stay compatible with each other.
If we end up defining our own textContent algorithm for HTML→plain-text, I do think we should take a good look at what browsers are doing, especially plain text browsers such as lynx and w3m.
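To make the idea of a custom algorithm concrete, here is a minimal sketch in Python of an extraction that turns <br> into a newline. It is only an illustration: the names BrAwareCollector and plain_text are hypothetical, it is not the algorithm used by the PHP parser or by microformat-shiv, and it deliberately ignores block-level elements and white space collapsing, which a real definition would have to cover.

```python
from html.parser import HTMLParser


class BrAwareCollector(HTMLParser):
    """Hypothetical helper: collects text and emits a newline for each <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # <br/> also ends up here, because the default
        # handle_startendtag() delegates to handle_starttag().
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        # Text is kept exactly as it appears in the source.
        self.chunks.append(data)


def plain_text(html):
    """Return the text of an HTML fragment with <br> rendered as a newline."""
    parser = BrAwareCollector()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.chunks)


print(repr(plain_text("<p>Line one<br>Line two</p>")))  # 'Line one\nLine two'
```

Even this small sketch shows why the specification needs to say something: whether <br> becomes a newline is a choice that each parser currently makes on its own.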
Parser behaviour
Test:
<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>
Tested through microformats.io. Output shortened to only the affected h-entry. Node and Ruby were not available for testing.
PHP
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "Wow\nThis Is Interesting"
    ]
  }
}
Python
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "WowThisIs Interesting"
    ]
  }
}
Go
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "WowThisIs Interesting"
    ]
  }
}
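The Python and Go results are what a plain concatenation of text nodes produces, which is how DOM textContent behaves for an element. The sketch below approximates that with Python's standard html.parser and reproduces the same string for the test markup. The names TextContentCollector and text_content are hypothetical, and this is not claimed to be how either parser is implemented.

```python
from html.parser import HTMLParser


class TextContentCollector(HTMLParser):
    """Hypothetical helper: collects text in document order, ignoring markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Every piece of text is appended unchanged; tags such as <br>
        # and <p> contribute nothing to the result.
        self.chunks.append(data)


def text_content(html):
    """Approximate DOM textContent for an HTML fragment."""
    parser = TextContentCollector()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.chunks)


html = '<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>'
print(repr(text_content(html)))  # 'WowThisIs Interesting'
```

The PHP result differs in two places: a newline where the <br> was, and a space between the two paragraphs.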