Decode stdout using preferred encoding not ascii, and combine multibyte as necessary #13

yarikoptic · 2022-04-18T20:53:22Z

The original use case was running rich.inspect within an epdb session.
rich (https://pypi.org/project/rich/) colors its output and uses UTF-8
characters for tables etc. The epdb client was crashing, unable to
decode:

(Epdb) rich.inspect(self, value=False)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yoh/proj/datalad/datalad-mihextras/venvs/dev3/lib/python3.9/site-packages/epdb/__init__.py", line 1091, in connect
	t.interact()
  File "/home/yoh/proj/datalad/datalad-mihextras/venvs/dev3/lib/python3.9/site-packages/epdb/epdb_client.py", line 146, in interact
	sys.stdout.write(text.decode('ascii'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)

- I think hardcoding `ascii` is the wrong thing to do since, AFAIK,
  there is no guarantee that the output would not be UTF-8.
- Instead of leaving it to the default (not providing an encoding), I decided
  to go for the preferred encoding.
- The code block which follows handles reading/writing stdin; I also encode
  there using the preferred encoding, which now allows unicode input
  to be encoded (to UTF-8 here) should someone need to enter it.
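
For readers who want to see the idea in isolation, here is a minimal sketch of decoding and encoding with the locale's preferred encoding instead of a hardcoded `ascii`. The helper names (`preferred_encoding`, `decode_output`, `encode_input`) are hypothetical and are not taken from epdb; the use of `locale.getpreferredencoding(False)` is an assumption about how the preferred encoding could be obtained, not a quote of the patch.

```python
import locale
from typing import Optional


def preferred_encoding() -> str:
    """Return the locale's preferred encoding (hypothetical helper)."""
    # Passing False asks Python not to call setlocale() as a side effect.
    return locale.getpreferredencoding(False)


def decode_output(data: bytes, encoding: Optional[str] = None) -> str:
    """Decode bytes received from the debugged process."""
    return data.decode(encoding or preferred_encoding())


def encode_input(text: str, encoding: Optional[str] = None) -> bytes:
    """Encode what the user typed before sending it to the debugged process."""
    return text.encode(encoding or preferred_encoding())


# UTF-8 bytes of a box-drawing character; the first byte is 0xe2,
# the same byte the traceback above complains about.
data = b"\xe2\x94\x8c"
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print("ascii fails:", e.reason)

# Printed via ascii() so the demo itself works on any terminal encoding.
print(ascii(decode_output(data, "utf-8")))
```

The explicit `encoding` argument is only there so the demo does not depend on the locale of the machine running it.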

Signed-off-by: Yaroslav Halchenko <debian@onerussian.com>

demo of it working: [screenshot]

wfscheper · 2022-04-19T14:46:22Z

This looks okay to me, but I'd like at least one other reviewer. @mibanescu or @mtharp, what do you think?

yarikoptic · 2022-06-07T21:20:29Z

ping @mibanescu @mtharp, your feedback would be appreciated.

mibanescu · 2022-06-29T16:06:33Z

epdb/epdb_client.py

+                            remaining_text = b""
+                        except UnicodeDecodeError as e:
+                            unicode_text = all_text[:e.start].decode(encoding)
+                            remaining_text = all_text[e.start:]


I am sorry, I now see I had added my comment on the wrong line. Here, `remaining_text` gets reinitialized with a slice from `all_text`. If `e.start` is 0 and the first byte is not a valid start of a multibyte sequence, then `remaining_text` will keep being appended to but will always fail to decode, which, if I am right, leads to an infinite loop.

Just re-raising the exception if `e.start <= 0` should be safe, I think.
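
To make the two situations behind `e.start == 0` concrete, here is a small sketch that tells a truncated multibyte sequence (more bytes may still arrive) apart from a byte that can never start a valid sequence. The `classify` function is hypothetical, and matching on the `reason` string relies on the wording CPython's UTF-8 decoder uses, which is an assumption rather than a documented contract.

```python
def classify(data: bytes, encoding: str = "utf-8") -> str:
    """Return 'ok', 'incomplete' or 'invalid' for a block of bytes."""
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError as e:
        # A truncated sequence fails at the very end of the data; CPython
        # reports it as "unexpected end of data" (assumed wording).
        if e.end == len(data) and e.reason == "unexpected end of data":
            return "incomplete"
        # Anything else (e.g. "invalid start byte") will keep failing
        # no matter how many bytes are appended afterwards.
        return "invalid"


print(classify(b"\xe2\x94"))      # first two bytes of a three-byte character
print(classify(b"\xff"))          # 0xff can never start a UTF-8 sequence
print(classify(b"\xe2\x94\x8c"))  # the complete character
```

Only the "invalid" case is the one where buffered bytes would never become decodable.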

yarikoptic · 2024-05-07

I just had to use this patch again and realized that we never finished this PR.

For `e.start < 0`: I accepted my own assertion above, since I think it should simply never happen.

For `e.start == 0`: I think it is an unlikely but legitimate case, occurring whenever we read only a few bytes that are the beginning of an incomplete multibyte character. Then we just store them all and prepend them to the next block. I do not see how we could get an infinite loop here, since we are not adding any looping and should still exit at EOF. The only thing left would be to assert at the end that no `remaining_text` is left, as would happen if an incomplete character was provided at the end and we failed to decode it. I will take a stab at that.
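
The buffering described above can be sketched as a standalone generator. This is an illustration under assumptions, not the code in `epdb/epdb_client.py`: the function name `decode_chunks` and the idea of feeding it an iterable of byte blocks are hypothetical, and only the `all_text` / `remaining_text` / `e.start` handling mirrors the diff excerpt.

```python
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode a stream of byte blocks, carrying undecodable tails forward."""
    remaining_text = b""
    for chunk in chunks:
        all_text = remaining_text + chunk
        try:
            unicode_text = all_text.decode(encoding)
            remaining_text = b""
        except UnicodeDecodeError as e:
            assert e.start >= 0, "start of a decode error should never be negative"
            # Keep everything that decoded cleanly, buffer the rest.
            unicode_text = all_text[:e.start].decode(encoding)
            remaining_text = all_text[e.start:]
        if unicode_text:
            yield unicode_text
    if remaining_text:
        # End of input with bytes still buffered: decoding them again
        # re-raises UnicodeDecodeError instead of dropping them silently.
        yield remaining_text.decode(encoding)


# A three-byte character split across three reads.
parts = list(decode_chunks([b"a\xe2", b"\x94", b"\x8cb"]))
print([ascii(p) for p in parts])
```

Note that a byte which is invalid rather than merely incomplete stays in `remaining_text` until end of input, so output after it is held back until then. The standard library's `codecs.getincrementaldecoder` is an alternative that keeps this state internally.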

yarikoptic · 2024-06-04T20:12:46Z

ping on this PR -- I keep coming back to needing the patched version. Please let me know if I should improve anything.

mibanescu reviewed Jun 20, 2022

mibanescu reviewed Jul 25, 2022

Assert that e.start should never be negative

1524249

yarikoptic commented May 7, 2024

Add handling of remaining_text

dc97bd9
