Skip to content

add unicode regex for paths and fix path encoding #1420

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 16, 2021

Conversation

Panaetius
Copy link
Contributor

This should support filenames with unicode words and emojis in them. I adjusted the regex to use \w word characters, which matches most unicode symbols (I tested it locally with chinese, korean, arabic and cuneiform) that can appear in words. I also added the emoji range (one of our users actually had an emoji in a filename which lead to us having an issue in the first place). Also spaces are allowed in filenames, so i added that as well.

I had to change how path is stored, as the implementation url-encoded the filepath (e.g. file:///tmp/myfile) and then removed the file:// prefix, but without unencoding special characters, so you'd get a path like /tmp/%E2%98%95coffee.txt instead of /tmp/☕coffee.txt, which doesn't exist on disk.

I had to change some dependencies to get tests running, but quite a few of them still fail when running the whole suite, with errors like FileNotFoundError: [Errno 2] No such file or directory: '/home/user/DEV/cwltool/tmp/test_cid_file_dir___parallel__0/tmpz2stpzft/tmpgpo291r4', but when I tried running a sample of the failing ones individually, they passed. So there might be some concurrency/fixture issue in the tests? I'm not sure how I could run the whole suite in a way that works to see if there's any that actually fail.

Closes #110

@codecov
Copy link

codecov bot commented Mar 15, 2021

Codecov Report

Merging #1420 (f7275b3) into main (8bdc152) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1420   +/-   ##
=======================================
  Coverage   74.60%   74.60%           
=======================================
  Files          43       43           
  Lines        7828     7829    +1     
  Branches     1992     1992           
=======================================
+ Hits         5840     5841    +1     
- Misses       1416     1417    +1     
+ Partials      572      571    -1     
Impacted Files Coverage Δ
cwltool/command_line_tool.py 78.64% <100.00%> (+0.03%) ⬆️
cwltool/job.py 71.57% <0.00%> (-0.43%) ⬇️
cwltool/docker.py 68.93% <0.00%> (+0.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8bdc152...f7275b3. Read the comment docs.

Comment on lines 85 to 86
ACCEPTLIST_EN_STRICT_RE = re.compile(r"^[a-zA-Z0-9._+-]+$")
ACCEPTLIST_EN_RELAXED_RE = re.compile(r".*") # Accept anything
Copy link
Member

@mr-c mr-c Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ACCEPTLIST_EN_STRICT_RE = re.compile(r"^[a-zA-Z0-9._+-]+$")
ACCEPTLIST_EN_RELAXED_RE = re.compile(r".*") # Accept anything
ACCEPTLIST_EN_STRICT_RE = re.compile(r"^[\w0-9._+\- \u2600-\u26FF]+$") # accept unicode word characters and emojis
ACCEPTLIST_EN_RELAXED_RE = re.compile(r".*") # Accept anything

Better to fix this here, I think?

Also, use a r"" string to fix the following flake8 complaints:

cwltool/command_line_tool.py:89:31: W605 invalid escape sequence '\w'
cwltool/command_line_tool.py:89:39: W605 invalid escape sequence '\-'

https://github.com/common-workflow-language/cwltool/pull/1420/checks?check_run_id=2114176915#step:9:16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it like that at first, but thought the EN implied english and this is decidedly not english 😄

I tried tox but that exits on the first failure (Maybe because of --strict for pytest, but I tried removing that with no luck). test_udocker_should_display_memory_usage fails because psutil removed travis support in version 5.8.0 and I can't install the pinned version on my system. I pinned that version of psutil, as that might get it to work on others' systems.

Removing the TRAVIS tests and running with tox works normally, as does invoking pytest manually, but there's many failures like e.g.

_________________________________________________ test_dont_require_inputs __________________________________________________

    def test_dont_require_inputs():
        if sys.version_info[0] < 3:
            stream = BytesIO()
        else:
            stream = StringIO()

        script = None
        try:
>           script = NamedTemporaryFile(mode="w", delete=False)

../../tests/test_toolargparse.py:134:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/zenon/.pyenv/versions/3.6.12/lib/python3.6/tempfile.py:551: in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

dir = '/home/zenon/DEV/ETH/cwltool/tmp/test_cid_file_dir___parallel__0/tmpaip03ivw', pre = 'tmp', suf = '', flags = 131266
output_type = <class 'str'>

    def _mkstemp_inner(dir, pre, suf, flags, output_type):
        """Code common to mkstemp, TemporaryFile, and NamedTemporaryFile."""

        names = _get_candidate_names()
        if output_type is bytes:
            names = map(_os.fsencode, names)

        for seq in range(TMP_MAX):
            name = next(names)
            file = _os.path.join(dir, pre + name + suf)
            try:
>               fd = _os.open(file, flags, 0o600)
E               FileNotFoundError: [Errno 2] No such file or directory: '/home/zenon/DEV/ETH/cwltool/tmp/test_cid_file_dir___parallel__0/tmpaip03ivw/tmpkssk2xd5'

/home/zenon/.pyenv/versions/3.6.12/lib/python3.6/tempfile.py:262: FileNotFoundError

but then

$ pytest -k test_dont_require_inputs
==================================================== test session starts ====================================================
platform linux -- Python 3.6.12, pytest-4.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/zenon/DEV/ETH/cwltool, inifile: setup.cfg
plugins: mock-3.5.1, cov-2.11.1
collected 392 items / 391 deselected / 1 selected

tests/test_toolargparse.py .                                                                                          [100%]

========================================= 1 passed, 391 deselected in 2.62 seconds ==============================

So I'm not sure which failures are real failures.

@mr-c
Copy link
Member

mr-c commented Mar 15, 2021

Thank you @Panaetius !

For local testing, try tox

I'll investigate some of the errors from my side as well

@mr-c
Copy link
Member

mr-c commented Mar 15, 2021

There is a test that wants filenames with spaces to be rejected, but this PR "fixes" that.

This is because CWL implementations are allowed to reject filenames with spaces:

If the path contains POSIX shell metacharacters (|, &, ;, <, >, (, ), $ ,`, \, ", ', <space>, <tab>, and <newline>) or characters not allowed for Internationalized Domain Names for Applications then implementations may terminate the process with a permanentFailure.

https://www.commonwl.org/v1.0/CommandLineTool.html#File

@Panaetius
Copy link
Contributor Author

There is a test that wants filenames with spaces to be rejected, but this PR "fixes" that.

This is because CWL implementations are allowed to reject filenames with spaces:

If the path contains POSIX shell metacharacters (|, &, ;, <, >, (, ), $ ,, `, ", ', <space>, <tab>, and <newline>) or characters not allowed for Internationalized Domain Names for Applications then implementations may terminate the process with a permanentFailure.

https://www.commonwl.org/v1.0/CommandLineTool.html#File

Oh I see, I was not aware of that. Should I remove that from the regex?

@mr-c
Copy link
Member

mr-c commented Mar 15, 2021

Oh I see, I was not aware of that. Should I remove that from the regex?

Yes please. Users who still want to use filenames with spaces with cwltool can pass --relax-path-checks

@Panaetius
Copy link
Contributor Author

Ok, I'll remove that tomorrow.

Also this PR is still missing test(s) but I wasn't sure where to add them. Is there maybe an existing test I could add more cases to?

@mr-c
Copy link
Member

mr-c commented Mar 15, 2021

Ok, I'll remove that tomorrow.

Thanks!

Also this PR is still missing test(s) but I wasn't sure where to add them. Is there maybe an existing test I could add more cases to?

I think tests/test_relax_path_checks.py could be renamed and expanded

@Panaetius Panaetius force-pushed the add_unicode_support branch 3 times, most recently from bf09c24 to 1e91134 Compare March 16, 2021 10:35
@Panaetius Panaetius force-pushed the add_unicode_support branch from 1e91134 to 8485091 Compare March 16, 2021 11:02
@Panaetius
Copy link
Contributor Author

I added some tests. I also changed the regex to remove the space and I removed _ and 0-9 since that's already covered by \w

I also had to change the locale to C.UTF8 to get the tests to pass for python3.6 (they passed without it for 3.7+ but no harm in adding it)

Copy link
Member

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run make format ?

@Panaetius
Copy link
Contributor Author

done

@mr-c mr-c merged commit 0053a86 into common-workflow-language:main Mar 16, 2021
@mr-c
Copy link
Member

mr-c commented Mar 16, 2021

Woohoo! Thank you very much @Panaetius !

@Panaetius
Copy link
Contributor Author

Thank you for the help and merging it so fast 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upgrade filename validation regex
2 participants