Owner

@jquast jquast commented Jan 28, 2026

  • do not distribute unicode data files.
  • do not distribute universal declaration of human rights.
  • delete from git.
  • exclude from sdist.

Closes #198

@jquast jquast force-pushed the jq/do-not-distribute-data-files branch 4 times, most recently from 5771238 to 3767634 Compare January 28, 2026 07:45
@jquast jquast changed the title from 'Do not "distribute" any data files in any forms' to 'Do not distribute any data files' Jan 28, 2026
codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (17b0bad) to head (d252084).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #199   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines          885       885           
  Branches       225       225           
=========================================
  Hits           885       885           

@jquast jquast force-pushed the jq/do-not-distribute-data-files branch 2 times, most recently from 8590c51 to 6472455 Compare January 28, 2026 08:12
- lists of emojis may require licenses in your jurisdiction.
- please check with local laws.
- do not distribute unicode data files.
- do not distribute universal declaration of human rights.
- delete them from git.
- delete from sdist.

Closes #198
@jquast jquast force-pushed the jq/do-not-distribute-data-files branch from 6472455 to 8fb879e Compare January 28, 2026 08:22
codspeed-hq bot commented Jan 28, 2026

Merging this PR will degrade performance by 99.97%

⚡ 2 improved benchmarks
❌ 5 regressed benchmarks
✅ 48 untouched benchmarks
🆕 1 new benchmark
⏩ 1 skipped benchmark [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| test_center_udhr_lines | 1.6 ms | 4,391.7 ms | -99.96% |
| test_width_ascii | 22.5 µs | 19.5 µs | +15.45% |
| test_wrap_udhr | 577.5 ms | 25,917.8 ms | -97.77% |
| test_grapheme_boundary_before_short | 76.8 µs | 59.6 µs | +28.8% |
| 🆕 test_width_udhr_lines | N/A | 4.3 s | N/A |
| test_ljust_udhr_lines | 1.5 ms | 4,348.3 ms | -99.97% |
| test_width_udhr | 115.4 ms | 8,815 ms | -98.69% |
| test_rjust_udhr_lines | 1.5 ms | 4,346.7 ms | -99.97% |

Comparing jq/do-not-distribute-data-files (d252084) with master (17b0bad)

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

Owner Author

jquast commented Jan 28, 2026

The benchmarks were modified to run over the full UDHR text corpus, so they are slower by design. This is the new baseline; there is no actual performance impact.
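For anyone reading the CodSpeed table, the reported percentages look consistent with `base / head - 1`. That formula is inferred from the numbers in this thread, not something CodSpeed states here, and `relative_change` is a hypothetical helper name. A minimal sketch:

```python
def relative_change(base: float, head: float) -> float:
    """Return (base / head - 1) as a percentage; negative means HEAD is slower."""
    return (base / head - 1.0) * 100.0


# (BASE, HEAD) timings in milliseconds, copied from the CodSpeed report.
timings_ms = {
    "test_center_udhr_lines": (1.6, 4391.7),
    "test_wrap_udhr": (577.5, 25917.8),
    "test_width_udhr": (115.4, 8815.0),
}

for name, (base, head) in timings_ms.items():
    print(f"{name}: {relative_change(base, head):+.2f}%")
```

Because the benchmark inputs changed between BASE and HEAD, these figures compare different workloads and say nothing about the library's speed.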

@jquast jquast marked this pull request as ready for review January 28, 2026 18:50
@jquast jquast merged commit 8d27c08 into master Jan 29, 2026
42 of 43 checks passed
@jquast jquast deleted the jq/do-not-distribute-data-files branch January 29, 2026 00:04
Comment on lines -111 to -118
-  - name: Prepare sdist and source-dir
+  - name: Build wheel
     shell: bash
     run: |
       python -Im pip install build
-      python -Im build
+      python -Im build --wheel
-
-      mkdir source-dir
-      tar -xzvf dist/wcwidth-*.tar.gz -C source-dir --strip-components=1
Collaborator

This effectively reverts #100. The purpose of #100 is to make sure downstream users can directly run tests from the sdist without the need to clone the git repo, and to avoid something like #99.

Copy link

#99 excludes #198 because cOpYrIgHt

Collaborator

> #99 excludes #198 because cOpYrIgHt

I know. IMO even if we want downstream users to download those files themselves, we should still make sure no other files are missing, so that users can simply download those files instead of cloning wcwidth's git repo.
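One way to check that "no other files are missing" is to compare the tracked file list against the sdist contents while ignoring the deliberately excluded data files. The sketch below is only an illustration: `missing_from_sdist`, the file names, and the `data/*` pattern are all hypothetical, not taken from the wcwidth repository.

```python
from fnmatch import fnmatch


def missing_from_sdist(tracked, sdist_files, excluded_patterns):
    """Return tracked paths absent from the sdist, ignoring excluded patterns."""
    present = set(sdist_files)
    return sorted(
        path
        for path in tracked
        if path not in present
        and not any(fnmatch(path, pattern) for pattern in excluded_patterns)
    )


# Hypothetical inputs for illustration only.
tracked = ["pkg/__init__.py", "tests/test_core.py", "tests/conftest.py", "data/emoji-list.txt"]
sdist_files = ["pkg/__init__.py", "tests/test_core.py"]
excluded = ["data/*"]  # files that downstream users are expected to fetch themselves

print(missing_from_sdist(tracked, sdist_files, excluded))
```

In practice the tracked list could come from `git ls-files` and the sdist list from the tarball's member names with the top-level directory stripped; an empty result would mean only the excluded data files need to be downloaded.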

@evrial evrial Jan 29, 2026

Well, to me it is absurd that you have to deal with licensing simply to run the tests.

> The Consortium’s software and data files are generally licensed under the OSI-approved Unicode License v3, a free, open source, highly permissive license based on the MIT License. The primary difference between the MIT License and the Unicode License is that the Unicode License expressly covers data and data files.

Owner Author

I will make a PR to change CI to use sdist again if you approve. I chose bdist because after removing data files, sdist and bdist packages have no functional difference for tests, except that bdist is faster. I changed this to reduce test time.

Collaborator

> I will make a PR to change CI to use sdist again if you approve

That would be good.

> I chose bdist because after removing data files, sdist and bdist packages have no functional difference for tests, except that bdist is faster. I changed this to reduce test time.

The key idea of #100 is to prepare the sdist once; then all later steps run in the source directory unpacked from the sdist and no longer use files from git. Indeed it's a bit slower, but I think it's worth it.
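As a rough illustration of the "unpack once, then work only in that directory" idea, here is a minimal Python sketch of what `tar -xzvf ... --strip-components=1` does. `unpack_sdist` is a hypothetical helper, not part of the wcwidth CI, and it handles only regular files and directories (symlinks and permissions are ignored).

```python
import tarfile
from pathlib import Path, PurePosixPath


def unpack_sdist(sdist_path, dest):
    """Extract a .tar.gz sdist into dest, dropping the top-level directory."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(sdist_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Drop the leading "name-version/" component of each member path.
            parts = PurePosixPath(member.name).parts[1:]
            if not parts or ".." in parts:
                continue  # skip the top-level directory itself and unsafe paths
            target = dest.joinpath(*parts)
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                with archive.extractfile(member) as source:
                    target.write_bytes(source.read())
    return dest
```

Later CI steps would then run with the returned directory as their working directory, so anything missing from the sdist shows up as a test failure rather than being silently picked up from the git checkout.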

Owner Author

Our CI does exercise the data files: it runs update-tables.py --fetch-only prior to the Python 3.14 tests, see https://github.com/jquast/wcwidth/actions/runs/21465737084/job/61827437773#step:6:28

Downstream dependencies can do the same. Maybe someone who really cares about it can work with the many popular packaging systems to package up the Unicode data files. They are useful to have, just like /usr/share/unicode/, and maybe some of them already do.

I had great trouble fetching even a single data file last week because of Cloudflare: my ISP uses CGNAT, and the whole web is terribly hostile to my IP. I also expect CI will have trouble from time to time, so I will add an "ok if fetch fails" allowance to the CI step that fetches data files.
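To make the "ok if fetch fails" idea concrete, here is a minimal sketch of a tolerant fetch. It is not how update-tables.py works; `fetch_or_none`, its parameters, and the retry count are assumptions made for illustration.

```python
import urllib.request


def fetch_or_none(url, attempts=3, timeout=30.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:
            # URLError, HTTPError, timeouts, and connection resets all derive from OSError.
            print(f"fetch attempt {attempt}/{attempts} failed for {url}: {exc}")
    return None
```

A CI step could treat a `None` result as "skip the data-dependent tests" instead of failing the whole job, which keeps a blocked or rate-limited runner from turning the build red.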

I don't wish the wcwidth license to change. Most downstreams would trigger a re-evaluation before accepting version changes. Even if it's another MIT-style license, the cost is still incurred for review, categorization of metadata, and so on. And then there are also quarterly SOX2 happening in private all over the world.

Development

Successfully merging this pull request may close these issues.

Unicode-3.0 license text missing

3 participants