
add handling for CJK text around emphasis delimiters #78


Open. Wants to merge 8 commits into base branch gfm.

Conversation

QuietMisdreavus

Resolves rdar://148884965

Problem

Consider the following markdown, taken from commonmark/commonmark-spec#650:

**テスト。**テスト

**テスト**。テスト

**テスト、**テスト

**テスト**、テスト

**テスト?**テスト

**テスト**?テスト

CommonMark, as specified today, sees the first line of each pair as "punctuation, emphasis delimiter, text", which forces the asterisks to be treated as a left-flanking delimiter run, unable to close emphasis. This shortcoming exists in GitHub-Flavored Markdown, and thus in swift-cmark as well, since both projects are ultimately based on CommonMark.
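To make the failure mode concrete, here is a minimal sketch of the current CommonMark flanking rules, reduced to the four character properties they consult. This is illustrative only, not cmark's actual code; the names are made up for this example:

```c
#include <stdbool.h>

/* A delimiter run is left-flanking when it is not followed by whitespace,
 * and either not followed by punctuation or preceded by whitespace or
 * punctuation. Right-flanking is the mirror image. */
static bool left_flanking(bool before_space, bool before_punct,
                          bool after_space, bool after_punct) {
  return !after_space && (!after_punct || before_space || before_punct);
}

static bool right_flanking(bool before_space, bool before_punct,
                           bool after_space, bool after_punct) {
  return !before_space && (!before_punct || after_space || after_punct);
}
```

For the second `**` in `**テスト。**テスト`, the preceding character `。` is punctuation and the following `テ` is neither whitespace nor punctuation, so `right_flanking(false, true, false, false)` is false: the run cannot close emphasis, even though `left_flanking` is true for it.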

The discussion in the issue linked above has continued for years as the precise rules for what should or should not count are hammered out, but a contributor has published an interim proposed specification erratum at https://github.com/tats-u/markdown-cjk-friendly, specifically the updates in its specification.md. While he has patched Markdown parsers available in his repo, they are all TypeScript-based, and thus not usable from swift-cmark, which is written in C.

This PR implements the proposed implementation updates in swift-cmark.

Implementation

Rather than perform all the necessary calculations of what is and is not a "CJK character" according to the proposed specification in C, this PR takes a cue from the author and pre-computes the list of code points in a Swift script, which is run to generate a cjk-ranges.inc file that swift-cmark then uses. This script reads a handful of Unicode data files, which i have also included in this PR alongside the existing Case Folding data file inherited from CommonMark.
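The generated table can then be consumed from C with a simple range lookup. The sketch below is illustrative, not the PR's actual cjk-ranges.inc, which covers far more ranges and may use a different encoding; the four sample entries are just well-known CJK blocks:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative shape of a generated ranges table: sorted, non-overlapping
 * [lo, hi] code-point ranges. */
static const struct { int32_t lo, hi; } cjk_ranges[] = {
    {0x3040, 0x309F}, /* Hiragana */
    {0x30A0, 0x30FF}, /* Katakana */
    {0x4E00, 0x9FFF}, /* CJK Unified Ideographs */
    {0xAC00, 0xD7A3}, /* Hangul Syllables */
};

/* Binary search over the sorted ranges; O(log n) per lookup. */
static bool is_cjk_codepoint(int32_t uc) {
  size_t lo = 0, hi = sizeof(cjk_ranges) / sizeof(cjk_ranges[0]);
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (uc < cjk_ranges[mid].lo)
      hi = mid;
    else if (uc > cjk_ranges[mid].hi)
      lo = mid + 1;
    else
      return true;
  }
  return false;
}
```

With these sample ranges, `is_cjk_codepoint(0x30C6)` (テ) is true and `is_cjk_codepoint(0x61)` ('a') is false.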

Copying in part of the proposed specification:

A CJK character is a character (Unicode code point) that meets at least one of the following criteria:

While the use of "fully-qualified emoji" is a bit problematic for code that only processes a single code point at a time (that definition includes multi-code-point sequences like skin-tone variants and regional flags), the adapted code focuses on code points that are valid emoji sequences on their own. This is enough to filter out any confounding code points that would otherwise throw off the calculation, based on some rough smoke testing i performed.

Once the calculated "CJK character" ranges are saved, the remainder of the PR is relatively straightforward: Handle a few more ranges of code points mentioned in the proposed spec, then update the calculation for left- and right-flanking delimiters and you're done.
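As a rough sketch of the direction of that flanking-rule change (simplified; the actual proposed spec also handles variation selectors and other details, and these are not cmark's real function names): punctuation adjacent to the run only blocks it when the punctuation is non-CJK, and any CJK character on the "outside" of the run now also satisfies the condition:

```c
#include <stdbool.h>

/* Adjusted left-flanking rule, per the proposed CJK-friendly erratum
 * (simplified sketch). Punctuation after the run only blocks opening when
 * it is non-CJK; any CJK character before the run satisfies the check. */
static bool left_flanking_cjk(bool before_space, bool before_punct,
                              bool before_cjk, bool after_space,
                              bool after_punct, bool after_cjk) {
  return !after_space &&
         (!(after_punct && !after_cjk) ||
          before_space || before_punct || before_cjk);
}

/* Mirror image for closing emphasis. */
static bool right_flanking_cjk(bool before_space, bool before_punct,
                               bool before_cjk, bool after_space,
                               bool after_punct, bool after_cjk) {
  return !before_space &&
         (!(before_punct && !before_cjk) ||
          after_space || after_punct || after_cjk);
}
```

Under this sketch, the closing `**` in `**テスト。**テスト` becomes right-flanking again, because the `。` before it is CJK punctuation, while the classic Latin case (a period before the run, a letter after) still fails to close, as in CommonMark today.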

Acknowledgements

I would like to thank @tats-u from the bottom of my heart for having put in the work to develop this proposed specification and all the supporting code present in his repository. It feels to me like a lot of work has been put into it, and i really appreciate it.


@patshaughnessy patshaughnessy left a comment


Great work 👏 👏

I have various nitpicks about the Swift tool you added (which is a brilliant idea!) and also a few questions about the core algorithm itself.

return (cjkRanges, nonCjkRanges)
}

/// Returns the ranges of code points with a Script property of Hangul.


Are we sure Hangul is the only East Asian script in this file? I see various Indian scripts and also Tibetan, as well as Hiragana, Katakana, and various other Asian scripts.

I'm sure this is correct; but maybe add a comment explaining why these other scripts are not needed here. Maybe their code points are included in EastAsianWidth.txt?


@tats-u tats-u May 23, 2025


Chinese and Japanese are covered by the East Asian Width (EAW) values Wide, Fullwidth, and Halfwidth.
There are some Hangul characters whose EAW is Neutral.


I see various Indian scripts and also Tibetan.

I've not seen actual complaints from speakers of those languages. Thai might suffer from this CommonMark specification bug, but I've never heard complaints.

@@ -315,3 +315,20 @@ int cmark_utf8proc_is_punctuation(int32_t uc) {
uc == 92917 || (uc >= 92983 && uc <= 92987) || uc == 92996 ||
uc == 113823);
}

int cmark_utf8proc_is_cjk_character(int32_t uc) {


I'd have the Swift tool produce the function definition also; this would make it more clear where the parameter uc comes from, and where the generated code goes to.

@tats-u

Probably "Unicode Code Point".

@QuietMisdreavus (Author)

I disagree with having the Swift code generate the function definition, since it makes it more difficult to ensure that the function signature matches the header's definition. It also allows me to change the name of the function later without having to change the generation code and rerun it.

uc was chosen to match the other UTF-8 functions in cmark; as @tats-u mentioned, the etymology is likely "Unicode Codepoint".

before_before_char = 10;
} else {
before_before_char_pos = before_char_pos - 1;
// walk back to the beginning of the previous UTF-8 sequence again:


Is this entirely the same as the previous walk-back code? Should we extract a function?

@QuietMisdreavus (Author)

From a quick search, it looks like there's one other place that performs reverse UTF-8 iteration (cmark_inline_parser_scan_delimiters, also in inlines.c). It might be worth factoring out a separate reverse-iteration function, but i would hesitate to do so, just to keep the overall diff from GFM smaller.
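For reference, the walk-back both call sites perform amounts to skipping UTF-8 continuation bytes (those matching 0b10xxxxxx). A minimal standalone sketch, illustrative rather than cmark's actual helper:

```c
#include <stddef.h>
#include <stdint.h>

/* Step back from `pos` to the start of the previous UTF-8 code point by
 * skipping continuation bytes. Returns 0 if already at the start. */
static size_t utf8_prev(const uint8_t *buf, size_t pos) {
  if (pos == 0)
    return 0;
  pos--;
  while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
    pos--;
  return pos;
}
```

For the bytes of "aテ" (0x61, 0xE3, 0x83, 0x86), `utf8_prev(buf, 4)` lands on index 1, the lead byte of テ.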

cmark_utf8proc_is_space(before_char) ||
cmark_utf8proc_is_punctuation(before_char));
cmark_utf8proc_is_non_cjk_punctuation_character(before_char) ||
cmark_utf8proc_is_cjk_character(before_char) ||


Question about the core algorithm behind this: Why do we consider the delimiter to be left flanking if the preceding character is any cjk character? Should we distinguish between cjk punctuation and other cjk characters?

Put a different way: Before this change we consider the delimiter to be left flanking if:

  • After is not whitespace AND
  • After is not punctuation OR Before is whitespace OR Before is punctuation

This PR expands on the "Before is whitespace OR Before is punctuation" logic, additionally allowing the before character to be:

  • a CJK character
  • an Ideographic Variation Selector
  • a Standard Variation Selector

But this seems very generic: Any preceding CJK character would match. Should we check if the CJK character is punctuation? Similar to the original logic?

@tats-u

Before: After is NOT punctuation OR Before is (whitespace OR punctuation)

The strategy: After is (NOT punctuation OR CJK) OR Before is (whitespace OR punctuation OR CJK)

Rewriting step by step:

After is NOT (punctuation AND NOT CJK) OR Before is (whitespace OR ((punctuation AND NOT CJK) OR (punctuation AND CJK)) OR CJK)

After is NOT (punctuation AND NOT CJK) OR Before is (whitespace OR (punctuation AND NOT CJK) OR ((punctuation AND CJK) OR CJK))

After is NOT (punctuation AND NOT CJK) OR Before is (whitespace OR (punctuation AND NOT CJK) OR CJK)

The "punctuation AND CJK" term vanishes in this transformation.
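The transformation is easy to machine-check: the "Before" condition in both forms agrees on all eight combinations of the three booleans. A small exhaustive check, with illustrative names:

```c
#include <stdbool.h>

/* The naive form: whitespace OR punctuation OR CJK. */
static bool before_naive(bool space, bool punct, bool cjk) {
  return space || punct || cjk;
}

/* The rewritten form: whitespace OR (punctuation AND NOT CJK) OR CJK. */
static bool before_rewritten(bool space, bool punct, bool cjk) {
  return space || (punct && !cjk) || cjk;
}

/* Enumerate all 2^3 combinations and compare the two forms. */
static bool forms_agree(void) {
  for (int bits = 0; bits < 8; bits++) {
    bool space = bits & 1, punct = bits & 2, cjk = bits & 4;
    if (before_naive(space, punct, cjk) !=
        before_rewritten(space, punct, cjk))
      return false;
  }
  return true;
}
```

`forms_agree()` returns true, confirming the "punctuation AND CJK" term is redundant.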

// - followed by one of the following:
// - whitespace
// - a non-CJK punctuation character
// - a CJK character


Same question here: Should we refine the check for any CJK character to instead look for CJK punctuation characters only?

Also: Why don't we include Ideographic Variation Selectors and Standard Variation Selector like we do for the left flanking logic?


@tats-u tats-u May 23, 2025


Why don't we include Ideographic Variation Selectors and Standard Variation Selector like we do for the left flanking logic?

Nobody but penetration testers ever inputs an ideographic variation selector after a non-Han character; 99.9+% of ideographic variation selectors follow a Han character. This is just an optimization.
"Standard Variation Selectors" only follow non-variation-selector characters.

@tats-u

Addition: these variation selectors won't appear as the first code point of a grapheme cluster, i.e. they won't follow * or _.

@QuietMisdreavus (Author)

To expand a little bit: Variation Selector characters are meant to go after another code point, the way skin-tone selectors go after a base emoji. For that not to be true, a file would need to be either corrupted or specially crafted.

The one special code point i would expect to see right after an asterisk is the Emoji Variation Selector, which (IMO) should render the asterisk incapable of creating emphasis. As far as i know, though, that discussion hasn't happened as part of the greater "CJK emphasis" thread.

@patshaughnessy

I've also tested this with a large CJK markdown document, and it worked perfectly as far as I could tell (I don't speak any of the target languages, however.)


@tats-u tats-u left a comment


I had forgotten to enable Discussions in my repo. If you have additional questions about the specs, you may want to post them there:

https://github.com/tats-u/markdown-cjk-friendly/discussions

Note: my inline comments above answer questions about the specs.


@QuietMisdreavus

I pushed up some commits to respond to some of @patshaughnessy's review comments, and to update the algorithm to match the most recent release of markdown-cjk-friendly, and its use of punctuation variation sequences. Hopefully the code should be a bit easier to understand now. 😅

@tats-u

tats-u commented Jun 5, 2025

You should remove test cases that are already recognized correctly, as intended, by pure CommonMark/GFM parsers.

A**BCD**E

Given A**BCD**E, we only need to care whether B or D is punctuation, and whether A, B, D, or E is CJK.

Comment on lines +1648 to +1674
"テスト✌🏻**テスト?**テスト\n"
"\n"
"テスト✌🏻**テスト**?テスト\n"
"\n"
"テスト🇯🇵**テスト?**テスト\n"
"\n"
"テスト🇯🇵**テスト**?テスト\n"
"\n"
"テスト🏴󠁧󠁢󠁳󠁣󠁴󠁿**テスト?**テスト\n"
"\n"
"テスト🏴󠁧󠁢󠁳󠁣󠁴󠁿**テスト**?テスト\n"
"\n"
"テスト*️⃣**テスト?**テスト\n"
"\n"
"テスト*️⃣**テスト**?テスト\n"
"\n"
"テスト©️**テスト?**テスト\n"
"\n"
"テスト©️**テスト**?テスト\n"
"\n"
"テスト©**テスト?**テスト\n"
"\n"
"テスト©**テスト**?テスト\n"
"\n"
"テスト⌛**テスト?**テスト\n"
"\n"
"テスト⌛**テスト**?テスト\n";

These emojis and symbols have no effect. The following is better:

テスト**テスト?**test

テスト**✌🏻テスト**テスト

テスト**テスト?**テスト

test**「テスト」**テスト
