Unicode in branch names gives unexpected error #3060

Closed
azhipaigor opened this issue Aug 21, 2017 · 9 comments

@azhipaigor

Details

Expected Result

Documentation built

Actual Result

Almost immediately after starting, the build fails with the message "An unexpected error occurred". This happens for most of the latest builds.

@umeshksingla

Looks similar to #2991

@agjohnson
Contributor

I addressed the issue in #3073. I triggered your build, and it seems the hotfix resolved the issue.

@agjohnson
Contributor

It actually looks like there is another unicode bug with our branch handling:

Sep 15 19:52:47 build01 readthedocs/readthedocs.doc_builder.environments[20809]: ERROR (Build) [diadocsdk-1c:latest] 'ascii' codec can't decode byte 0xd0 in position 197: ordinal not in range(128) [readthedocs.doc_builder.environments:317]
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/readthedocs/projects/tasks.py", line 145, in run_setup
    self.setup_vcs()
  File "/home/docs/checkouts/readthedocs.org/readthedocs/projects/tasks.py", line 280, in setup_vcs
    update_imported_docs(self.version.pk)
  File "/home/docs/local/lib/python2.7/site-packages/celery/app/trace.py", line 439, in __protected_call__
    return orig(self, *args, **kwargs)
  File "/home/docs/local/lib/python2.7/site-packages/newrelic/hooks/application_celery.py", line 80, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/docs/local/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/home/docs/checkouts/readthedocs.org/readthedocs/projects/tasks.py", line 530, in update_imported_docs
    } for v in version_repo.branches
  File "/home/docs/checkouts/readthedocs.org/readthedocs/vcs_support/backends/git.py", line 137, in branches
    return self.parse_branches(stdout)
  File "/home/docs/checkouts/readthedocs.org/readthedocs/vcs_support/backends/git.py", line 155, in parse_branches
    data = str(data)
  File "/home/docs/local/lib/python2.7/site-packages/future/types/newstr.py", line 102, in __new__
    return super(newstr, cls).__new__(cls, value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 197: ordinal not in range(128)
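
The traceback shows `str(data)` implicitly decoding raw git output as ASCII. Byte 0xd0 is the lead byte of a two-byte UTF-8 sequence (Cyrillic letters, for example), so any such branch name trips the decoder. A minimal sketch of explicit decoding follows. It is not the actual `parse_branches` code: `parse_branch_names` is a hypothetical helper, and the input format is assumed to look like `git branch -r` output.

```python
def parse_branch_names(raw_output):
    """Return branch names from raw git output given as bytes.

    Hypothetical helper: assumes one branch per line, as printed by
    `git branch -r`, and UTF-8 encoded names.
    """
    # Decode explicitly as UTF-8 instead of relying on an implicit
    # ASCII default; undecodable bytes become U+FFFD rather than raising.
    text = raw_output.decode("utf-8", errors="replace")
    names = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and symbolic refs such as
        # "origin/HEAD -> origin/master".
        if not line or " -> " in line:
            continue
        names.append(line)
    return names


raw = "  origin/HEAD -> origin/master\n  origin/master\n  origin/ветка\n".encode("utf-8")
print(parse_branch_names(raw))
```

Decoding the same bytes with `raw.decode("ascii")` raises a `UnicodeDecodeError` like the one logged above.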

@agjohnson agjohnson changed the title An unexpected error occurred Unicode in branch names gives unexpected error Sep 15, 2017
@agjohnson agjohnson added the Bug A bug label Sep 15, 2017
@stsewd
Member

stsewd commented May 17, 2018

If #4052 is implemented for branches, this should be solved.

@ericholscher
Member

@stsewd We should check that the rest of the workflow handles unicode branch names properly. We might fix one thing, just to have it break further down the line. Have you tested #4052 with a unicode branch name, and confirmed doc serving, etc. work?

@ericholscher ericholscher self-assigned this May 22, 2018
@stsewd
Member

stsewd commented May 22, 2018

@ericholscher #4052 is just for tags; unicode branches are still failing. Using py3 also solves this :p

@stsewd stsewd mentioned this issue Jun 27, 2018
@humitos humitos added the Accepted Accepted issue on our roadmap label Jul 26, 2018
@humitos
Member

humitos commented Aug 2, 2018

I just tested this in our Azure instance, which is running Python 3.6.5:

  1. Imported this project https://gitlab.com/humitos/rtd-project-a
  2. d- tag was created. All Unicode chars were removed completely. The tag name is úńìĉõdê
  3. A branch called aúbranchńwithìweirdĉcharactersõanddê was enabled in Read the Docs. The name was converted to a slug: a-branch-with-weird-characters-andd-
  4. Links to PDF download work properly
  5. Links to "Edit on GitLab" work properly (I'm linked to https://gitlab.com/humitos/rtd-project-a/blob/a%C3%BAbranch%C5%84with%C3%ACweird%C4%89characters%C3%B5andd%C3%AA/index.rst which is the correct branch name under GitLab)

So, we can say that "It works under Python 3.6" without #4433 merged.
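
For reference, the "Edit on GitLab" link above is the branch name percent-encoded as UTF-8. A minimal sketch that reproduces it with the standard library follows. `edit_url` is a hypothetical helper for illustration, not the code Read the Docs uses to build the link.

```python
from urllib.parse import quote, unquote


def edit_url(repo_url, branch, path):
    """Build a GitLab-style blob URL (hypothetical helper).

    quote() encodes non-ASCII characters as UTF-8 percent escapes and
    leaves "/" untouched by default.
    """
    return "{}/blob/{}/{}".format(repo_url, quote(branch), path)


branch = "aúbranchńwithìweirdĉcharactersõanddê"
url = edit_url("https://gitlab.com/humitos/rtd-project-a", branch, "index.rst")
print(url)

# The encoding is reversible, so the original branch name is not lost.
print(unquote(quote(branch)) == branch)
```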

Now, I have some questions,

  1. do we want to remove these characters?
  2. would we have a problem by allowing them in the slug? filesystem issues? URL issues?

@humitos
Member

humitos commented Aug 2, 2018

do we want to remove these characters?

We could use what I suggested at #1410 (comment) some time ago. Instead of removing the chars, it tries to use a "similar one".

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aúbranchńwithìweirdĉcharactersõanddê').encode('ascii', 'ignore')
b'aubranchnwithiweirdccharactersoandde'

On the other hand, if there is no "easy replacement", it just skips the character:

>>> unicodedata.normalize('NFKD', u'Straße').encode('ascii', 'ignore')
b'Strae'
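
One way to soften the `Straße` case is a small substitution table applied before normalizing. This is only a sketch: the `FALLBACKS` table and the `to_ascii` name are hypothetical, and the table would need to be extended for real use.

```python
import unicodedata

# Hypothetical table for characters that NFKD cannot decompose into
# an ASCII base letter plus combining marks.
FALLBACKS = {"ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE"}


def to_ascii(name):
    """Transliterate a name to ASCII, dropping anything left over."""
    # Apply explicit replacements first.
    replaced = "".join(FALLBACKS.get(ch, ch) for ch in name)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop the combining marks and any remaining non-ASCII characters.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("aúbranchńwithìweirdĉcharactersõanddê"))
print(to_ascii("Straße"))
```

With the table in place, `Straße` becomes `Strasse` instead of `Strae`. Characters with no entry and no decomposition are still silently dropped.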

@stsewd
Member

stsewd commented Aug 24, 2018

would we have a problem by allowing them in the slug? filesystem issues? URL issues?

As long as the server is correctly set up, I don't think so. I guess we can live with non-ASCII chars in the URLs: https://stackoverflow.com/questions/6625035/utf-8-characters-in-urls#6625474
