Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistently handle inline elements with spaces #281

Merged

Conversation

kevindew
Copy link
Contributor

@kevindew kevindew commented Apr 3, 2019

This resolves some odd situations that can occur when there are inline
elements that contain spaces in sentences.

The first situation is when there is an element that includes a space
between words, for example Test<span> </span>content. This would
previously have produced a two space result: Test content because
this element would have matched both leading and trailing whitespace
tests.

The second situation is when there is an element that includes a space
outside the tests, which is the case of a non-breaking space character
(unicode U+00A0), then the space is removed. An example of this is
Test<span>&nbsp;</span>content which would result in Testcontent as
this wouldn't match the tests for leading/trailing whitespace.

This resolves these problems by changing the whitespace tests to use \s
rather than a subset of space characters (which is consistent with the
blank test 1) and only allows a leading space if the test for both
leading and trailing whitespace passes on a blank element.

This resolves some odd situations that can occur when there are inline
elements that contain spaces in sentences.

The first situation is when there is an element that includes a space
between words, for example 'Test<span> </span>content'. This would
previously have produced a two space result: 'Test  content' because
this element would have matched both leading and trailing whitespace
tests.

The second situation is when there is an element that includes a space
outside the tests, which is the case of a non-breaking space character
(unicode U+00A0), then the space is removed. An example of this is
'Test<span>&nbsp;</span>content' which would result in 'Testcontent' as
this wouldn't match the tests for leading/trailing whitespace.

This resolves these problems by changing the whitespace tests to use \s
rather than a subset of space characters (which is consistent with the
blank test [1]) and only allows a leading space if the test for both
leading and trailing whitespace passes on a blank element.

[1]: https://github.com/domchristie/turndown/blob/80297cebeae4b35c8d299b1741b383c74eddc7c1/src/node.js#L14
kevindew added a commit to alphagov/paste-html-to-govspeak that referenced this pull request Apr 4, 2019
This was a rather hard problem to solve, so brace yourself as this
message might be complex.

We were having cases of HTML pasted from browsers adding span elements in
for spaces. So we could end up with HTML like:
`Some text<span> </span><a href="">Link</a>` where, depending on the
environment, the contents of the span could be `<span>&nbsp;</span>` or
`<span> </span>`. In browsers it was the former, whereas we experienced
the latter on JSDOM.

This causes some poor behaviour in turndown where the presence of a
nbsp; character (which has an ASCII code of 160) causes the whole space
to be stripped out, resulting in `Some text[Link]()` as output. On the
other hand the presence of a normal space (ASCII code 32) causes a
different problem of two spaces, resulting in output such as `Some text
[Link]()`.

This problem is due to Turndowns whitespace rules where they only match
a normal space character: https://github.com/domchristie/turndown/blob/80297cebeae4b35c8d299b1741b383c74eddc7c1/src/node.js#L25-L33

With our change to blankReplacement we can fix the case for &nbsp;
(which is the more common on we're witnessing) however we can't easily
fix the double space issue. We have left tests for both cases to help
any future developers that may venture into this hole.

We've raised a PR with turndown to fix both of these issues,
mixmark-io/turndown#281, so hopefully in the
non-too-distant future we can remove this code.
kevindew added a commit to alphagov/paste-html-to-govspeak that referenced this pull request Apr 4, 2019
This was a rather hard problem to solve, so brace yourself as this
message might be complex.

We were having cases of HTML pasted from browsers adding span elements in
for spaces. So we could end up with HTML like:
`Some text<span> </span><a href="">Link</a>` where, depending on the
environment, the contents of the span could be parsed as `<span>&nbsp;</span>`
or `<span> </span>`. In browsers it was the former, whereas we experienced
the latter on JSDOM.

This causes some poor behaviour in turndown where the presence of a
nbsp character (which has an ASCII code of 160) causes the whole space
to be stripped out, resulting in `Some text[Link]()` as output. On the
other hand the presence of a normal space (ASCII code 32) causes a
different problem of two spaces, resulting in output such as `Some text
[Link]()`.

This problem is due to Turndowns whitespace rules where they only match
a normal space character: https://github.com/domchristie/turndown/blob/80297cebeae4b35c8d299b1741b383c74eddc7c1/src/node.js#L25-L33

With our change to blankReplacement we can fix the case for nbsp
(which is the more common on we're witnessing) however we can't easily
fix the double space issue. We have left tests for both cases to
document this scenario.

We've raised a PR with turndown to fix both of these issues,
mixmark-io/turndown#281, so hopefully in the
non-too-distant future we can remove this code.
@domchristie domchristie merged commit cae7098 into mixmark-io:master Aug 23, 2019
@domchristie
Copy link
Collaborator

Thanks for tracking this down!

@kevindew kevindew deleted the vanishing-or-duplicating-spacing branch September 8, 2022 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants