Fixes for encoding issues #495
Conversation
…ified on the command line.
…ers (Chinese, Japanese, ...) in file names, folder or path arguments (should fix bugs like #417) when UTF-8 is not the system default encoding. Now, in theory, everything that gets in should be converted to unicode and then converted back to strings (user-defined encoding) at the outermost parts when dealing with the system. When talking to S3, convert unicode to strings in UTF-8 encoding. Validated with run-test.py on a computer with UTF-8 as the default encoding and on one with ANSI C as the default encoding.
… unicodise conversion + removal of most of the useless or invalid unicodise/deunicodise calls in the code.
This is excellent work, thank you! I started looking at the patch this …
Thanks. By the way, take care that there is a second part, 'Pass2 ...', that I added to this branch half an hour ago. Things were working OK before, but I noticed today that some pieces were still bad. It is much cleaner now.
@@ -707,7 +711,7 @@ def object_modify(self, src_uri, dst_uri, extra_headers = None):
         headers = self._sanitize_headers(headers)
         acl = self.get_acl(src_uri)

-        headers['x-amz-copy-source'] = "/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object()))
+        headers['x-amz-copy-source'] = deunicodise("/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object())), encoding="UTF-8")
Why use deunicodise() here instead of .encode('UTF-8'), as you did with the same header at line 682?
You are right about that one. I did the two changes at different times, so they were not consistent.
I just pushed an additional commit to fix that and use deunicodise in both cases.
I prefer using deunicodise because it has a failsafe check that the string really is unicode before encoding (sketched below).
I was a little lazy at the time, but for a future patch I am thinking of adding functions like
s3_encode / s3_decode (or deunicodise_s3 / unicodise_s3)
and using those instead of the encode / deunicodise(..., encoding="UTF-8") calls that are used throughout the code to prepare strings to be sent to the S3 server.
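A minimal sketch of that failsafe, assuming the Python 2 idiom of the time (this is not the project's exact implementation): a bare .encode('UTF-8') on a byte string first decodes it implicitly with the ASCII codec and raises UnicodeDecodeError on non-ASCII bytes, while the guarded helper passes byte strings through untouched.

```python
# Python 2 sketch (assumed, not s3cmd's exact code): only encode when
# the input really is a unicode object, so an already-encoded byte
# string cannot trigger an implicit ASCII decode.
def deunicodise(string, encoding="UTF-8", errors="replace"):
    if not isinstance(string, unicode):
        return string  # failsafe: already bytes, leave untouched
    return string.encode(encoding, errors)

# By contrast, '\xc3\xa9'.encode('UTF-8') raises UnicodeDecodeError in
# Python 2, because the str is first decoded with the ASCII codec.
```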
…hed to a "unicode" call instead of "decode", as it is supposed to be more efficient.
I have added encode_to_s3 and decode_from_s3 functions.
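The encode_to_s3 / decode_from_s3 names come from the PR; the bodies below are an assumed sketch of such thin wrappers, which pin the wire encoding to UTF-8 regardless of the user's locale:

```python
# Assumed sketch (Python 2): S3 always speaks UTF-8, so these wrappers
# fix the encoding instead of threading encoding="UTF-8" through every
# call site.
def encode_to_s3(string, errors="replace"):
    if isinstance(string, unicode):
        return string.encode("UTF-8", errors)
    return string  # already bytes: pass through

def decode_from_s3(string, errors="replace"):
    if isinstance(string, str):
        return string.decode("UTF-8", errors)
    return string  # already unicode: pass through
```

With helpers like these, the header assignment in the diff above could read headers['x-amz-copy-source'] = encode_to_s3("/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object()))).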
This looks really good. Are you comfortable with me merging it now? At some point I'd like to consider putting the encode_to_s3 / decode_from_s3 …
Thanks. Yes, I think it is good to merge.
Fixes encoding/decoding issues and adds support for all exotic characters (Chinese, Japanese, ...) in file names, folder or path arguments (should fix bugs like #417) when UTF-8 is not the system default encoding.
Now, in theory, everything that gets in should be converted to unicode and then converted back to byte strings (in the user-defined encoding) at the outermost parts when using the os functions; a sketch of this boundary model follows this description.
When talking to S3, convert unicode to byte strings in UTF-8 encoding.
Every input and output of S3Uri is now supposed to be unicode.
There are still some useless unicodise/deunicodise calls in the path between input and output, but that should not be an issue and can be fixed gradually.
I hope not to have broken anything in the exotic features that I don't use and haven't tested (SSE, CloudFront, ...).
But, in theory, the changes should be transparent for people having UTF-8 as their system default encoding.
For the others, don't hesitate to create issues if some things still raise unicode decode errors.
Validated with run-test.py on a computer with UTF-8 and on one with ANSI C as the default encoding.
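A sketch of that boundary model (Python 2; only the unicodise/deunicodise names come from the patch, the bodies and the example flow are assumptions): decode at the locale boundary on the way in, keep unicode internally, and encode on the way out, in the locale encoding for the OS and in UTF-8 for S3.

```python
# -*- coding: utf-8 -*-
import locale
import sys

SYSTEM_ENCODING = locale.getpreferredencoding() or "UTF-8"

def unicodise(s, encoding=SYSTEM_ENCODING, errors="replace"):
    # Boundary in: CLI arguments, os.listdir() results, config values.
    return s if isinstance(s, unicode) else s.decode(encoding, errors)

def deunicodise(u, encoding=SYSTEM_ENCODING, errors="replace"):
    # Boundary out: bytes for open()/os.* (locale) or for S3 (UTF-8).
    return u.encode(encoding, errors) if isinstance(u, unicode) else u

# Everything in between stays unicode:
local_path = unicodise(sys.argv[1])                 # locale bytes -> unicode
data = open(deunicodise(local_path), "rb").read()   # unicode -> locale bytes for the OS
s3_key = deunicodise(local_path, encoding="UTF-8")  # unicode -> UTF-8 bytes for S3
```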
run-test.py output:
System encoding: UTF-8
Using bucket prefix: 'flo-'
1 Remove test buckets ..... OK
2 Verify no test buckets .. OK
3 Create one bucket (EU) .. OK
4 Create multiple buckets . OK
5 Invalid bucket name ..... OK
6 Buckets list ............ OK
7 Sync to S3 .............. OK
8 Sync UTF-8 to S3 ........ OK
9 List bucket content ..... OK
10 List bucket recursive ... OK
11 Clean testsuite-out/(rm) SKIP
12 Clean testsuite-out/(mk) OK
13 Put from stdin .......... OK
14 Put multipart ........... OK
15 Multipart large put from stdin OK
16 Clean testsuite-out/(rm) OK
17 Clean testsuite-out/(mk) OK
18 Sync from S3 ............ OK
19 Remove 'dir-test/' ...... OK
20 Create file-dir dir ..... OK
21 Skip over dir ........... OK
22 Clean testsuite-out/(rm) OK
23 Clean testsuite-out/(mk) OK
24 Put public, guess MIME .. OK
25 Retrieve from URL ....... OK
26 Change ACL to Private ... OK
27 Verify Private ACL ...... OK
28 Change ACL to Public .... OK
29 Verify Public ACL ....... OK
30 Sync more to S3 ......... OK
31 Change file cksum1.txt .. OK
32 Change file cksum33.txt . OK
33 Don't check MD5 ......... OK
34 Check MD5 ............... OK
35 Rename within S3 ........ OK
36 Rename (NoSuchKey) ...... OK
37 Sync more from S3 (invalid src) OK
38 Sync more from S3 ....... OK
39 Remove dst dir for get .. OK
40 Get multiple files ...... OK
41 Put unicode filenames ... OK
42 Make dst dir for get .... OK
43 Get unicode filenames ... OK
44 Get multiple files ...... OK
45 blah.txt / Blah.txt ..... OK
46 Copy between buckets .... OK
47 Recursive copy, set ACL . OK
48 Verify ACL and MIME type OK
49 Modify MIME type ........ OK
50 Verify ACL and MIME type OK
51 Modify MIME type back ... OK
52 Verify ACL and MIME type OK
53 Add cache-control header OK
54 HEAD check Cache-Control present OK
55 Remove cache-control header OK
56 HEAD check Cache-Control OK
57 sign string ............. OK
58 signurl time ............ OK
59 signurl time offset ..... OK
60 Rename within S3 ........ OK
61 Sync remote2remote ...... OK
62 Don't put symbolic links OK
63 Put symbolic links ...... OK
64 Sync symbolic links ..... OK
65 Multi-source move ....... OK
66 Verify move ............. OK
67 List all ................ OK
68 Simple delete ........... OK
69 Simple delete with rm ... OK
70 Create expiration rule with days and prefix OK
71 Create expiration rule with date and prefix OK
72 Create expiration rule with days only OK
73 Create expiration rule with date only OK
74 Get current expiration setting OK
75 Delete expiration rule .. OK
76 Recursive delete maximum exceeded OK
77 Recursive delete ........ OK
78 Recursive delete with rm OK
79 Recursive delete all .... OK
80 Remove empty bucket ..... OK
81 Remove remaining buckets OK