
Fixes for encoding issues #495

Merged · 7 commits · Mar 16, 2015

Conversation

@fviard (Contributor) commented Mar 11, 2015

Fixes encoding/decoding issues and adds support for all exotic characters (Chinese, Japanese, ...) in file names, folders, or path arguments (should fix bugs like #417) when UTF-8 is not the system default encoding.
Now, in theory, everything that comes in is converted to unicode, and things are converted back to byte strings (in the user-defined encoding) at the outermost parts when using the os functions.

When talking to S3, convert unicode to strings in utf-8 encoding.

Every input output of S3Uri is now supposed to be unicode.
There are still some redundant unicodise/deunicodise calls on the path between input and output, but that should not be an issue and can be fixed gradually.
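The boundary pattern described above can be sketched as follows. This is a minimal illustration with simplified hypothetical helpers, not s3cmd's actual implementation: decode to text at every input, work in text internally, and encode back to bytes only when calling os-level functions.

```python
import sys

def unicodise(s, encoding=None):
    # Decode bytes coming from the OS into text; pass text through unchanged.
    if not isinstance(s, bytes):
        return s
    return s.decode(encoding or sys.getfilesystemencoding())

def deunicodise(u, encoding=None):
    # Encode text back to bytes, in the user's encoding, at the os-call boundary.
    if isinstance(u, bytes):
        return u
    return u.encode(encoding or sys.getfilesystemencoding())

# Inside the program everything is text ...
name = unicodise(b"caf\xc3\xa9.txt", encoding="utf-8")
# ... and it becomes bytes again only when handed to os functions.
raw = deunicodise(name, encoding="utf-8")
```

The failsafe `isinstance` checks make both helpers idempotent, so a stray double conversion does not crash.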

I hope I have not broken anything in the exotic features that I don't use and haven't tested (SSE, CloudFront, ...).
In theory, though, the changes should be transparent for people who have UTF-8 as their system default encoding.
For the others, don't hesitate to open issues if anything still raises unicode decode errors.

Validated with run-test.py on a computer with UTF-8 and on one with ANSI C as the default encoding.

RUN-TEST.PY

System encoding: UTF-8
Using bucket prefix: 'flo-'
1 Remove test buckets ..... OK
2 Verify no test buckets .. OK
3 Create one bucket (EU) .. OK
4 Create multiple buckets . OK
5 Invalid bucket name ..... OK
6 Buckets list ............ OK
7 Sync to S3 .............. OK
8 Sync UTF-8 to S3 ........ OK
9 List bucket content ..... OK
10 List bucket recursive ... OK
11 Clean testsuite-out/(rm) SKIP
12 Clean testsuite-out/(mk) OK
13 Put from stdin .......... OK
14 Put multipart ........... OK
15 Multipart large put from stdin OK
16 Clean testsuite-out/(rm) OK
17 Clean testsuite-out/(mk) OK
18 Sync from S3 ............ OK
19 Remove 'dir-test/' ...... OK
20 Create file-dir dir ..... OK
21 Skip over dir ........... OK
22 Clean testsuite-out/(rm) OK
23 Clean testsuite-out/(mk) OK
24 Put public, guess MIME .. OK
25 Retrieve from URL ....... OK
26 Change ACL to Private ... OK
27 Verify Private ACL ...... OK
28 Change ACL to Public .... OK
29 Verify Public ACL ....... OK
30 Sync more to S3 ......... OK
31 Change file cksum1.txt .. OK
32 Change file cksum33.txt . OK
33 Don't check MD5 ......... OK
34 Check MD5 ............... OK
35 Rename within S3 ........ OK
36 Rename (NoSuchKey) ...... OK
37 Sync more from S3 (invalid src) OK
38 Sync more from S3 ....... OK
39 Remove dst dir for get .. OK
40 Get multiple files ...... OK
41 Put unicode filenames ... OK
42 Make dst dir for get .... OK
43 Get unicode filenames ... OK
44 Get multiple files ...... OK
45 blah.txt / Blah.txt ..... OK
46 Copy between buckets .... OK
47 Recursive copy, set ACL . OK
48 Verify ACL and MIME type OK
49 Modify MIME type ........ OK
50 Verify ACL and MIME type OK
51 Modify MIME type back ... OK
52 Verify ACL and MIME type OK
53 Add cache-control header OK
54 HEAD check Cache-Control present OK
55 Remove cache-control header OK
56 HEAD check Cache-Control OK
57 sign string ............. OK
58 signurl time ............ OK
59 signurl time offset ..... OK
60 Rename within S3 ........ OK
61 Sync remote2remote ...... OK
62 Don't put symbolic links OK
63 Put symbolic links ...... OK
64 Sync symbolic links ..... OK
65 Multi-source move ....... OK
66 Verify move ............. OK
67 List all ................ OK
68 Simple delete ........... OK
69 Simple delete with rm ... OK
70 Create expiration rule with days and prefix OK
71 Create expiration rule with date and prefix OK
72 Create expiration rule with days only OK
73 Create expiration rule with date only OK
74 Get current expiration setting OK
75 Delete expiration rule .. OK
76 Recursive delete maximum exceeded OK
77 Recursive delete ........ OK
78 Recursive delete with rm OK
79 Recursive delete all .... OK
80 Remove empty bucket ..... OK
81 Remove remaining buckets OK

fviard added 5 commits March 11, 2015 22:00
…ers (chinese, japanese, ...) in file names, folder or path arguments (Should fix bug like #417) when UTF-8 is not the system def$

Now, in theory, everything that gets in should be converted to unicode, and then things would be converted back to strings (in the user-defined encoding) at the outermost parts when dealing with the system.
When talking to S3, convert unicode to strings in utf-8 encoding.

Validated with run-test.py on a computer with utf-8 as default encoding and on one with ANSI C as default encoding.
… unicodise conversion + removed most of the useless or invalid unicodise/deunicodise calls in the code.
@fviard (Contributor, Author) commented Mar 12, 2015

Note: only the entries in the configuration file (config.py) could not be rightfully unicodised, because it is a chicken-and-egg problem: the encoding to use is itself defined inside the configuration file.
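To illustrate the chicken-and-egg problem, here is a hypothetical sketch (the dictionary and option names are invented for the example, not s3cmd's real config layout): the encoding option must first be read as raw bytes, and only then can it be used to decode the remaining entries.

```python
# Raw config entries as read from disk, before any decoding (invented data).
raw_options = {b"encoding": b"utf-8", b"bucket_location": b"EU"}

# Step 1: the encoding name itself can safely be read as ASCII,
# since codec names are ASCII by convention.
encoding = raw_options[b"encoding"].decode("ascii")

# Step 2: every other entry can now be unicodised with that encoding.
options = {key.decode(encoding): val.decode(encoding)
           for key, val in raw_options.items()}
```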

@mdomsch (Contributor) commented Mar 12, 2015

This is excellent work, thank you! I started looking at the patch this
morning and got through about half of it. Will continue reviewing shortly.

Thanks,
Matt


@fviard (Contributor, Author) commented Mar 12, 2015

Thanks. By the way, note that there is a second part, 'Pass2 ...', that I added to this branch half an hour ago. Things were working OK before, but I noticed today that some pieces were still wrong. It is much cleaner now.

@@ -707,7 +711,7 @@ def object_modify(self, src_uri, dst_uri, extra_headers = None):
     headers = self._sanitize_headers(headers)
     acl = self.get_acl(src_uri)

-    headers['x-amz-copy-source'] = "/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object()))
+    headers['x-amz-copy-source'] = deunicodise("/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object())), encoding="UTF-8")
Review comment (Contributor):

Why use deunicodise() here instead of .encode('UTF-8'), as you did with the same header at line 682?

Reply (Contributor, Author):

You are right about that one. I made the two changes at different times, so they were not consistent.
I just pushed an additional commit to fix that and use deunicodise in both cases.
I prefer deunicodise because it has a failsafe check that the string really is unicode.

I was a little lazy at the time, but I think a future patch could add functions like s3_encode/s3_decode, or deunicodise_s3/unicodise_s3, and use those instead of the .encode('UTF-8')/deunicodise(..., encoding="UTF-8") calls that are used throughout the code to prepare strings to be sent to the S3 server.

@fviard (Contributor, Author) commented Mar 16, 2015

I have added encode_to_s3 and decode_from_s3 functions.
The only remaining thing that could be cleaned up is the unicodisation inside the S3Error class in Exception.py. (But I don't have the time to do that; maybe it will be done in a future pull request.)
Note: I replaced calls to "string.decode" with "unicode(" as it appears to be more efficient (source: a Google search :p)
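Conceptually, the encode_to_s3/decode_from_s3 pair pins the wire format to UTF-8 regardless of the local system encoding. A minimal sketch of that idea, not the exact s3cmd code:

```python
def encode_to_s3(s):
    # The S3 wire format is always UTF-8, whatever the local encoding is.
    if isinstance(s, bytes):
        return s  # failsafe: already encoded
    return s.encode("utf-8")

def decode_from_s3(b):
    # Responses from S3 come back as UTF-8 bytes; bring them back to text.
    if isinstance(b, bytes):
        return b.decode("utf-8")
    return b

key = "\u65e5\u672c\u8a9e.txt"  # a Japanese file name
wire_key = encode_to_s3(key)    # UTF-8 bytes for the request
```

Because the wire encoding is fixed, a key round-trips cleanly even on a machine whose locale is ANSI C.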

@mdomsch (Contributor) commented Mar 16, 2015

This looks really good. Are you comfortable with me merging it now? It does pass the tests, and being in upstream master would get it more test coverage.

At some point I'd like to consider putting the encode_to_s3() / decode_from_s3() calls directly into the send_request() and send_file() paths (really, everywhere we call conn.c.getresponse() and nearby) and do all the conversion/marshalling adjacent to the socket sends/receives. Fortunately there aren't that many of those. With signing, though, we'd need to do the conversion prior to signing (as you've done). So this may be "good enough" and actually desired as-is.
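The socket-boundary idea could look roughly like this. Everything here (build_request_bytes, its signature, the helper) is invented for the sketch and is not s3cmd's real request path; the point is only that str-to-bytes conversion happens in one place, right before data would hit the socket.

```python
def encode_to_s3(s):
    # Wire format is always UTF-8; pass through anything already encoded.
    return s if isinstance(s, bytes) else s.encode("utf-8")

def build_request_bytes(method, resource, headers):
    # All text -> bytes marshalling happens here, adjacent to the socket
    # send; callers everywhere else deal only in text.
    lines = [b"%s %s HTTP/1.1" % (encode_to_s3(method), encode_to_s3(resource))]
    for key, value in headers.items():
        lines.append(b"%s: %s" % (encode_to_s3(key), encode_to_s3(value)))
    return b"\r\n".join(lines) + b"\r\n\r\n"
```

As noted above, signing complicates this: the signature must be computed over the already-encoded bytes, so the conversion cannot always be deferred all the way to the socket.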


@fviard (Contributor, Author) commented Mar 16, 2015

Thanks. Yes, I think it is good to merge.
I did a lot of testing and everything seems to work as expected.
Also, getting it upstream is the only way to discover the bugs that may remain.
:-)

mdomsch added a commit that referenced this pull request Mar 16, 2015
@mdomsch merged commit 80c82f7 into s3tools:master Mar 16, 2015