-
Notifications
You must be signed in to change notification settings - Fork 12
Decode stdout using preferred encoding not ascii, and combine multibyte as necessary #13
base: master
Are you sure you want to change the base?
Conversation
…te as necessary Original use case was using the rich.inspect within epdb session. https://pypi.org/project/rich/ is coloring output and uses UTF-8 characters for tables etc. epdb client was crashing unable to decode: (Epdb) rich.inspect(self, value=False) Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/yoh/proj/datalad/datalad-mihextras/venvs/dev3/lib/python3.9/site-packages/epdb/__init__.py", line 1091, in connect t.interact() File "/home/yoh/proj/datalad/datalad-mihextras/venvs/dev3/lib/python3.9/site-packages/epdb/epdb_client.py", line 146, in interact sys.stdout.write(text.decode('ascii')) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128) - I think hardcoding to `ascii` is the wrong thing to do since AFAIK there is no guarantee that it would be UTF-8 - instead of leaving to default (not providing encoding) decided to go for prefered encoding - there is a code block which follows for reading/writing stdin, and I encoded also into preferable encoding which allows now to pass unicode to be encoded to utf-8 happen someone needs to enter it. Signed-off-by: Yaroslav Halchenko <debian@onerussian.com>
|
This looks okay to me, but I'd like at least one other reviewer. @mibanescu or @mtharp, what do you think? |
|
ping @mibanescu @mtharp , your feedback would be appreciated. |
| remaining_text = b"" | ||
| except UnicodeDecodeError as e: | ||
| unicode_text = all_text[:e.start].decode(encoding) | ||
| remaining_text = all_text[e.start:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, i now see i had added my comment on the wrong line. Here, remaining_text gets reinitialized with a slice from all_text. If e.start is 0 and the first char is not a valid Unicode start sequence, then remaining_text will be appended to, but will always fail to decode, which, if I am right, leads to the infinite loop.
Just re-raising the exception if e.start <= 0 should be safe i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just had to use this patch again to realize that we never finished this PR.
For e.start < 0 - I accepted my own assertion above since I think it should simply never happen.
For e.start == 0 -- I think it is unlikely but legit case whenever we read only few bytes which are beginning of an incomplete unicode. Then we just store it all and append to next block -- I do not see how we could get an infinite loop here since we are not adding any looping and should exit at eof (would still do) ... the only thing would be left is at the end to assert that we have no remaining_text left -- as if incomplete unicode at the end was provided and we failed to decode it. I will add a stab for that.
|
ping on this PR -- I keep coming to the need to use patched version. Please let me know if I should improve on anything |
Original use case was using the rich.inspect within epdb session.
https://pypi.org/project/rich/ is coloring output and uses UTF-8
characters for tables etc. epdb client was crashing unable to
decode:
I think hardcoding to
asciiis the wrong thing to do since AFAIKthere is no guarantee that it would not be UTF-8
instead of leaving to default (not providing encoding) decided
to go for prefered encoding
there is a code block which follows for reading/writing stdin,
and I encoded also into preferable encoding which allows now
to pass unicode to be encoded to utf-8 happen someone needs
to enter it.
Signed-off-by: Yaroslav Halchenko debian@onerussian.com
demo of it working:
