@ksylvan ksylvan commented Jun 18, 2025

CJK Support Added for Markdown Slug Generation

Summary

This PR adds support for CJK (Chinese, Japanese, Korean) characters in markdown slug generation, so that these characters are preserved in generated filenames and anchors when using the explode command.

Related Issues

Closes #5

Files Changed

bin/md-tree.js

Modified the sanitizeText method to preserve CJK characters instead of stripping them out. The regex now includes Unicode ranges for Chinese, Japanese (Hiragana and Katakana), and Korean characters.

package.json

Bumped the version from 1.5.1 to 1.6.0 to reflect the new feature addition.

test/test-cjk.md

Added a new test markdown file containing CJK characters in headings to verify the functionality works correctly.

test/test-cli.js

Added a comprehensive test case to verify that the explode command correctly handles CJK characters in headings, generates appropriate filenames, and creates proper links.

Code Changes

bin/md-tree.js

// Before
.replace(/[^a-z0-9\s-]/g, '')

// After
.replace(
  /[^a-z0-9\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\s-]/g,
  ''
)

The regex now includes Unicode ranges:

  • \u4e00-\u9fff: CJK Unified Ideographs (Chinese characters, also used as Japanese Kanji and Korean Hanja)
  • \u3040-\u309f: Japanese Hiragana
  • \u30a0-\u30ff: Japanese Katakana
  • \uac00-\ud7af: Korean Hangul syllables
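As a rough illustration of how the extended character class behaves, here is a minimal sketch. Only the middle `.replace()` call comes from this PR; the function name `sanitizeTextSketch`, the lowercasing step, and the whitespace-to-hyphen steps are assumptions made for the example and may differ from the real `sanitizeText` in bin/md-tree.js.

```javascript
// Hypothetical stand-in for sanitizeText; not the actual method from bin/md-tree.js.
function sanitizeTextSketch(text) {
  return text
    .toLowerCase() // assumption: input is lowercased before filtering
    .replace(
      // Character class quoted from the PR: keep a-z, 0-9, CJK ranges, whitespace, hyphen
      /[^a-z0-9\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\s-]/g,
      ''
    )
    .replace(/\s+/g, '-') // assumption: runs of whitespace become one hyphen
    .replace(/-+/g, '-') // assumption: repeated hyphens are collapsed
    .replace(/^-|-$/g, ''); // assumption: leading/trailing hyphens are trimmed
}

console.log(sanitizeTextSketch('章二')); // 章二
console.log(sanitizeTextSketch('セクション 2.1')); // セクション-21
console.log(sanitizeTextSketch('Hello, World!')); // hello-world
```

Under these assumptions the output for 'セクション 2.1' matches the `#セクション-21` anchor that the CLI test looks for in the index file.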

test/test-cli.js

await test('CLI explode command with CJK characters', async () => {
  // Test verifies:
  // 1. Files are created with CJK characters in filenames
  // 2. Index file contains correct links to CJK-named files
  // 3. Subsection anchors with CJK characters work correctly
});

Reason for Changes

Previously, the markdown tree parser stripped every character other than a-z, 0-9, whitespace, and hyphens when generating slugs for filenames and anchors. This made the tool unsuitable for documentation written in CJK languages, as headings like "章节一" would be converted to empty strings or hyphens only, resulting in non-descriptive or conflicting filenames.
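The collision described above can be reproduced with the old character class on its own. This is a small sketch rather than the tool's code: the helper name `oldStrip` and the lowercasing step are assumptions, and only the regex is taken from the "Before" snippet.

```javascript
// Character class from the "Before" snippet: everything outside
// a-z, 0-9, whitespace, and hyphen is removed.
const oldPattern = /[^a-z0-9\s-]/g;

// Hypothetical helper; lowercasing first is an assumption for this example.
function oldStrip(heading) {
  return heading.toLowerCase().replace(oldPattern, '');
}

console.log(JSON.stringify(oldStrip('章节一'))); // ""
console.log(JSON.stringify(oldStrip('章二'))); // ""
console.log(JSON.stringify(oldStrip('Intro 章节一'))); // "intro "
```

Two different CJK-only headings both reduce to the empty string, which is why their generated filenames could conflict.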

Impact of Changes

  1. Internationalization: The tool now supports markdown documents written in Chinese, Japanese, and Korean, making it accessible to a much wider user base.
  2. Backward Compatibility: Existing functionality for Latin-based text remains unchanged. The change is additive and non-breaking.
  3. File Naming: Generated filenames will now preserve CJK characters, making them more meaningful and readable for users working with these languages.

Test Plan

A comprehensive test case has been added that:

  1. Creates a test markdown file with CJK headings
  2. Runs the explode command on this file
  3. Verifies that files are created with CJK characters in their names
  4. Checks that the generated index file contains correct links to these files
  5. Ensures subsection anchors with CJK characters are properly formatted

The test covers Chinese characters in main headings and Japanese characters in subsections, alongside regular English headings to ensure mixed-language documents work correctly.

Additional Notes

  • The Unicode ranges included cover the most common CJK character sets but not every character: the CJK Extension blocks (for example Extension A, U+3400 to U+4DBF) fall outside them. Additional ranges can be added if needed.
  • The version bump to 1.6.0 indicates this is a minor feature addition with no breaking changes.
  • This change aligns with modern web standards where URLs and filenames increasingly support Unicode characters.
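On the first note above, one possible way to add further ranges is sketched below. It is not part of this PR: `extendedPattern` and `keepAllowed` are hypothetical names, and the choice of Extension A (U+3400 to U+4DBF) and Extension B (U+20000 to U+2A6DF) is only an example. Code points above U+FFFF need the `u` flag and the `\u{...}` escape form, which is the main difference from the merged regex.

```javascript
// Character class as merged in this PR (no `u` flag, BMP ranges only).
const prPattern =
  /[^a-z0-9\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\s-]/g;

// Hypothetical extension: adds CJK Extension A and Extension B.
// The `u` flag makes the regex match whole code points, not UTF-16 code units.
const extendedPattern =
  /[^a-z0-9\u4e00-\u9fff\u3400-\u4dbf\u{20000}-\u{2a6df}\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af\s-]/gu;

// Hypothetical helper: remove everything the pattern matches.
function keepAllowed(text, pattern) {
  return text.replace(pattern, '');
}

const extA = '\u3400'; // first code point of CJK Extension A
const extB = '\u{20000}'; // first code point of CJK Extension B

console.log(keepAllowed(extA, prPattern) === ''); // true: stripped by the merged regex
console.log(keepAllowed(extA, extendedPattern) === extA); // true: preserved
console.log(keepAllowed(extB, extendedPattern) === extB); // true: preserved
```

Whether such rare characters are worth the longer pattern is a judgment call for the maintainers; the sketch only shows that the change would be confined to the character class.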

**CHANGES:**
- Update sanitizeText to handle Chinese, Japanese, Korean characters.
- Extend regex to include Unicode ranges for CJK.
- Add test for CLI explode command with CJK support.
- Bump version to 1.6.0 for new feature release.
@ksylvan ksylvan requested a review from Copilot June 18, 2025 14:39
Copilot AI left a comment

Pull Request Overview

This PR adds support for CJK characters in markdown slug generation, ensuring that filenames and anchors retain Chinese, Japanese, and Korean characters instead of stripping them.

  • Updated the regex in the slug generation logic in bin/md-tree.js to include Unicode ranges for CJK characters.
  • Added new tests in test/test-cli.js and test/test-cjk.md to verify that the explode command correctly handles CJK headings.
  • Bumped the package version to 1.6.0 to reflect the feature addition.

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

File              Description
test/test-cli.js  Added a new test case to verify filename and link generation with CJK text
test/test-cjk.md  Introduced a markdown file containing CJK headings for testing
package.json      Updated version to 1.6.0 as part of the new feature integration

Comments suppressed due to low confidence (1)

test/test-cli.js:479

  • [nitpick] Consider adding a clarifying comment that explains the Japanese subsection 'セクション 2.1' is intentionally nested under the parent section '章二', ensuring maintainers understand the test's intent.
      indexContent.includes('[セクション 2.1](./章二.md#セクション-21)'),

@ksylvan ksylvan merged commit 3c47539 into main Jun 18, 2025
4 checks passed
@ksylvan ksylvan deleted the 0618-support-cjk-characters branch August 22, 2025 23:44

Development

Successfully merging this pull request may close these issues.

CJK characters are stripped from markdown headings
