@mpkorstanje mpkorstanje commented Jul 27, 2025

🤔 What's changed?

Define whitespace as any character in Unicode category Zs or with bidirectional class WS, B, or S.
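
As a rough illustration of that definition, and not the code in this pull request, the check can be sketched with the JDK's `Character` API. The class and method names (`WhitespaceSketch`, `isWhitespace`) are hypothetical.

```java
// Hypothetical sketch of the whitespace definition described above.
// It is not taken from the Gherkin code base.
final class WhitespaceSketch {
    private WhitespaceSketch() {}

    /** True if the code point is in category Zs or has bidi class WS, B, or S. */
    static boolean isWhitespace(int codePoint) {
        // General category Zs (space separator), e.g. U+0020 and U+00A0.
        if (Character.getType(codePoint) == Character.SPACE_SEPARATOR) {
            return true;
        }
        // Bidirectional class WS, B (paragraph separator) or S (segment separator),
        // e.g. \f, NEL (U+0085) and \t respectively.
        byte directionality = Character.getDirectionality(codePoint);
        return directionality == Character.DIRECTIONALITY_WHITESPACE
                || directionality == Character.DIRECTIONALITY_PARAGRAPH_SEPARATOR
                || directionality == Character.DIRECTIONALITY_SEGMENT_SEPARATOR;
    }
}
```

Under this sketch, space, tab, NBSP and NEL all count as whitespace, while the BOM (U+FEFF, category Cf) does not.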

⚡️ What's your motivation?

Gherkin lines were trimmed according to the regex pattern \s + NEL + NBSP, while comments on tag lines were assumed to be delimited by just \s#. This leads to inconsistent behaviour: adding a comment to the end of a tag line can make the tag line invalid.
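
The mismatch can be reproduced in isolation. The sketch below is only an illustration under the assumption that the comment delimiter is a plain Java `\s#` pattern; `TagCommentSketch` and `findsComment` are hypothetical names, not part of the parser.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: shows why a \s# delimiter disagrees with trimming
// that also treats NBSP as whitespace. Not the parser's actual code.
final class TagCommentSketch {
    // Without UNICODE_CHARACTER_CLASS, Java's \s matches only [ \t\n\x0B\f\r].
    private static final Pattern COMMENT_DELIMITER = Pattern.compile("\\s#");

    private TagCommentSketch() {}

    /** True if the line contains a comment start as seen by the \s# delimiter. */
    static boolean findsComment(String tagLine) {
        return COMMENT_DELIMITER.matcher(tagLine).find();
    }
}
```

With an ordinary space, `findsComment("@smoke #comment")` finds the comment. With an NBSP before the `#`, as in `"@smoke\u00A0#comment"`, it does not, even though trimming would treat that NBSP as whitespace.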

Each Gherkin implementation uses a different definition of whitespace.

These can be roughly categorized as using Unicode:

  • C uses a hardcoded set from Wikipedia Whitespace character Unicode table[1]
  • .NET uses Unicode category Zs, Zl and Zp and \t, \v, \f, \r and NEL[3].
  • JavaScript uses the regex pattern \s, which matches the set used by C + BOM[4]
  • Python uses Unicode category Zs or its bidirectional class WS, B, or S[5].

And the other category:

  • CPP uses the default locale: space, \f, \n, \r, \t and \v[2].
  • Ruby uses the same as CPP + null[6]
  • Go only includes space and \t.

Within the Unicode categorization there is significant overlap. So for Java I have chosen to match the Python definition of whitespace as it is completely defined in Unicode terms.

  1. https://en.wikipedia.org/wiki/Whitespace_character#Unicode
  2. https://en.cppreference.com/w/cpp/string/byte/isspace.html
  3. https://learn.microsoft.com/en-us/dotnet/api/system.char.iswhitespace?view=net-9.0
  4. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes
  5. https://docs.python.org/3/library/stdtypes.html#str.isspace
  6. https://ruby-doc.org/3.4.1/String.html#class-String-label-Whitespace+in+Strings

🏷️ What kind of change is this?

  • 🐛 Bug fix (non-breaking change which fixes a defect)

📋 Checklist:

  • I agree to respect and uphold the Cucumber Community Code of Conduct
  • I've changed the behaviour of the code
    • I have added/updated tests to cover my changes.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly.
  • Users should know about my change
    • I have added an entry to the "Unreleased" section of the CHANGELOG, linking to this pull request.

@mpkorstanje mpkorstanje requested a review from jkronegg July 27, 2025 14:34
@mpkorstanje mpkorstanje marked this pull request as ready for review July 27, 2025 14:35
@mpkorstanje mpkorstanje mentioned this pull request Jul 27, 2025
7 tasks