Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: unicode support #804

Merged
merged 42 commits into from
Jan 12, 2023
Merged

feat: unicode support #804

merged 42 commits into from
Jan 12, 2023

Conversation

bindreams
Copy link
Contributor

@bindreams bindreams commented Nov 19, 2022

Fixes #14 as discussed.

Some notes:

  • Since necessary facilities were already in place, I also added support for wide strings as options.
  • I had difficulties coming up with a good API for this. It would not be a good idea to conditionally throw away argc and argv in favor of GetCommandLineW, because argv passed in to parse may have been already modified. Instead, I figured out a way to retrieve "original" argv on other systems and made a parallel interface: app.parse(argc, argv) stays the same and app.parse() uses "original as launched" args.
  • I exposed the following useful functions to the user:
    • argc() and argv() on all systems for "original" args.
    • widen and narrow in case the user uses a wide-string API with our UTF-8 string.
    • to_path on all systems which creates a filesystem::path, widening the string on Windows.
  • I had to remove a test which checks existence of a file with Unicode in the name because:
    1. File needs to exist beforehand and not get created during test - Unicode can get corrupted the same way during creation and during checking, the test will erroneously pass.
    2. CI chokes trying to copy a unicode-named file to the build directory.
  • Some design decisions which I considered, let me know if you prefer these over what's in the PR:
    • app.parse(CLI::system_args) instead of app.parse(), where CLI::system_args is value of a tag type CLI::system_args_t (feel free to suggest a better name for the tag)
    • Initialize argc() and argv() as a global variable before main. This is already being done for macOS. Would improve syntax from (e.g. from CLI::argv()[1] to CLI::argv[1]). Would remove any performance costs during main but will push them on users who don't care about Unicode or convert Unicode themselves.
  • Both cpplint and clang-tidy raised a lot of warnings in preexisting code when running locally. Maybe I am doing something wrong? I had to exclude build/include_subdir from cpplint just to catch actual warnings in terminal.

Also, I made some benchmarks to measure the performance impact of supporting Unicode, since this was a concern. Measuring with google benchmark, all benchmarks receive <program-name> --option1 1234 --option2 "hello world" as commandline:

  • simple_main - create a CLI::App, options, and parse the old way. Just for reference when judging performance.
  • get_global_args - call CLI::argv() (without caching into a static variable)
  • parse_normal - app.parse(argc, argv), i.e. old speed
  • parse_global - app.parse(), i.e. new speed

Windows:

Run on (24 X 3748.57 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 512 KiB (x12)
  L3 Unified 32768 KiB (x2)
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
simple_main           4557 ns         4541 ns       144516
get_global_args        566 ns          578 ns      1000000
parse_normal           568 ns          578 ns      1000000
parse_global          1262 ns         1256 ns       560000

get_global_args and parse_normal being the same time is a little sus, I suspect that maybe there are some optimizations in the benchmark loop which I failed to prevent.

Linux:

Run on (24 X 3700.04 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 512 KiB (x12)
  L3 Unified 32768 KiB (x1)
Load Average: 10.96, 3.19, 1.10
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
simple_main           2666 ns         2666 ns       259722
get_global_args       2104 ns         2104 ns       340016
parse_normal           521 ns          521 ns      1024941
parse_global          2818 ns         2818 ns       248291

Overall, for a function called once in the application, I think +2μs of time for Unicode support is acceptable.

@bindreams bindreams force-pushed the unicode-support branch 18 times, most recently from 8df6899 to 0974481 Compare November 20, 2022 21:41
@codecov
Copy link

codecov bot commented Nov 21, 2022

Codecov Report

Base: 99.45% // Head: 99.45% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (9bda850) compared to base (48624ea).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #804   +/-   ##
=======================================
  Coverage   99.45%   99.45%           
=======================================
  Files          16       18    +2     
  Lines        4019     4069   +50     
=======================================
+ Hits         3997     4047   +50     
  Misses         22       22           
Impacted Files Coverage Δ
include/CLI/App.hpp 100.00% <ø> (ø)
include/CLI/Validators.hpp 99.30% <ø> (ø)
include/CLI/TypeTools.hpp 100.00% <100.00%> (ø)
include/CLI/impl/App_inl.hpp 99.31% <100.00%> (+<0.01%) ⬆️
include/CLI/impl/Argv_inl.hpp 100.00% <100.00%> (ø)
include/CLI/impl/Encoding_inl.hpp 100.00% <100.00%> (ø)
include/CLI/impl/Validators_inl.hpp 99.50% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bindreams
Copy link
Contributor Author

There are Codacy reports which I disagree with but don't have permissions to ignore.

Ready for review.

@bindreams bindreams marked this pull request as ready for review November 21, 2022 02:41
@phlptp
Copy link
Collaborator

phlptp commented Nov 21, 2022

for codacy you might be able to to put //NOLINTNEXTLINE on a few depending on the exact tool that is triggering them.

@bindreams
Copy link
Contributor Author

bindreams commented Dec 10, 2022

One of the "2 workflows awaiting approval" which someone approved on f7e8901 was a github action "cmake config check", which failed because apparently libidn.so.11 was missing on that particular system. I have no idea why that library was required in the first place (my best guess is the locale-based fallback for older compilers requires it?). Looks like cmake <3.7 is not aware that on ubuntu-latest libidn.so.12 exists instead.

Anyway, since this looks like a cmake bug, I modified github actions to run cmake versions on "appropriate" ubuntu distributions: cmake 3.4~3.10 on ubuntu 18.04, cmake 3.10~3.16 on ubuntu 20.04, and everything newer on ubuntu 22.04. Let me know if this is an ok solution.

I ran cmake config tests on my fork, so they should be good to go. Feel free to approve the remaining workflows again, I'm excited to merge this soon 😅

@phlptp
Copy link
Collaborator

phlptp commented Dec 10, 2022

This was also fixed in #813, so once that is merged and this branch updated it should be good as well.

@bindreams
Copy link
Contributor Author

Looks like coverage failed (even though it succeeded earlier with the same code).
Looking at the report, it seems like codecov decided to stop counting closing braces in unrelated functions. Certainly seems like a bug. Here is what I found: bug report for this in LCOV -> related pull request which was merged 23 days ago. So I guess this can be fixed soon.

In the meantime, I was going to suggest just letting coverage fall, but then I found out that codecov.yml already has informational: true, so I shouldn't ever fail the CI. Not sure what is going on...

@phlptp
Copy link
Collaborator

phlptp commented Dec 10, 2022

I noticed the codecov thing as well, not entirely sure what changed but I think that is resolved in #813 as well. Waiting for @henryiii to review that one, then I think that can be merged and you can rebase this one. This PR is something I want @henryiii to review and approve since it is a pretty big change.

@henryiii
Copy link
Collaborator

Merged that one. Won't have time to review this till early next week, though.

@bindreams
Copy link
Contributor Author

Re-ran checks after CI had unrelated third-party connection errors.

@@ -14,7 +14,7 @@
#include <string>
#include <utility>
#include <vector>
// [CLI11:public_includes:set]
// [CLI11:public_includes:end]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why did this work before, actually? It should search to the first "end", which I would have thought would have created a broken single header include. Testing locally, it does look like it just causes #include <cctype> to be missing; I guess this is included through something else so that's why it still works.

Copy link
Contributor Author

@bindreams bindreams Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a brief look at the single header script, it seems that the multiline regex for collecting headers needs both "set" and "end" comments. So it just didn't match and by lucky coincidence the necessary headers were included elsewhere.

@henryiii
Copy link
Collaborator

henryiii commented Jan 6, 2023

I wrote a longer reply yesterday but lost it when switching around. TL;DWA (too long; didn't write again): I think the design decisions are reasonable, I like the argument-less form (looks like Python's argparse), and keeping the cost low for projects that don't care about unicode.

@bindreams bindreams mentioned this pull request Jan 12, 2023
@bindreams
Copy link
Contributor Author

Thanks for your feedback. Does that mean this PR is ready to merge now?

@henryiii
Copy link
Collaborator

Yes, I think we can put it in, and let it sit for a while. I'm hoping this doesn't affect weird/niche systems negatively by including something they don't have, especially if they aren't using it. But maybe the best way to find out is to see if it breaks anyone in main for a bit.

@Spixmaster
Copy link

@henryiii Support for ensure_utf8() was released but is stated as a feature only in HEAD in the README.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs changelog Hasn't been added to the changelog yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode
4 participants