
Fixes for encoding issues #495

Merged · 7 commits · Mar 16, 2015

Conversation

@fviard (Contributor) commented Mar 11, 2015

Fixes encoding/decoding issues and adds support for all exotic characters (Chinese, Japanese, ...) in file names, folders, or path arguments (should fix bugs like #417) when UTF-8 is not the system default encoding.
Now, in theory, everything that comes in is converted to unicode, and things are converted back to byte strings (in the user-defined encoding) at the outermost parts when using the os functions.

When talking to S3, convert unicode to strings in utf-8 encoding.

Every input output of S3Uri is now supposed to be unicode.
There are still some redundant unicodise/deunicodise calls on the path between input and output, but that should not be an issue and can be fixed gradually.
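The boundary pattern described above can be sketched as follows. This is a minimal illustration with simplified hypothetical helpers, not s3cmd's actual implementation: decode to text at every input, work in text internally, and encode back to bytes only when calling os-level functions.

```python
import sys

def unicodise(s, encoding=None):
    # Decode bytes coming from the OS into text; pass text through unchanged.
    if not isinstance(s, bytes):
        return s
    return s.decode(encoding or sys.getfilesystemencoding())

def deunicodise(u, encoding=None):
    # Encode text back to bytes, in the user's encoding, at the os-call boundary.
    if isinstance(u, bytes):
        return u
    return u.encode(encoding or sys.getfilesystemencoding())

# Inside the program everything is text ...
name = unicodise(b"caf\xc3\xa9.txt", encoding="utf-8")
# ... and it becomes bytes again only when handed to os functions.
raw = deunicodise(name, encoding="utf-8")
```

The failsafe `isinstance` checks make both helpers idempotent, so a stray double conversion does not crash.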

I hope I have not broken anything in the exotic features that I don't use and haven't tested (SSE, CloudFront, ...).
In theory, though, the changes should be transparent for people who have UTF-8 as their system default encoding.
For the others, don't hesitate to open issues if anything still raises unicode decode errors.

Validated with run-test.py on a computer with UTF-8 and on one with ANSI C as the default encoding.

RUN-TEST.PY

System encoding: UTF-8
Using bucket prefix: 'flo-'
1 Remove test buckets ..... OK
2 Verify no test buckets .. OK
3 Create one bucket (EU) .. OK
4 Create multiple buckets . OK
5 Invalid bucket name ..... OK
6 Buckets list ............ OK
7 Sync to S3 .............. OK
8 Sync UTF-8 to S3 ........ OK
9 List bucket content ..... OK
10 List bucket recursive ... OK
11 Clean testsuite-out/(rm) SKIP
12 Clean testsuite-out/(mk) OK
13 Put from stdin .......... OK
14 Put multipart ........... OK
15 Multipart large put from stdin OK
16 Clean testsuite-out/(rm) OK
17 Clean testsuite-out/(mk) OK
18 Sync from S3 ............ OK
19 Remove 'dir-test/' ...... OK
20 Create file-dir dir ..... OK
21 Skip over dir ........... OK
22 Clean testsuite-out/(rm) OK
23 Clean testsuite-out/(mk) OK
24 Put public, guess MIME .. OK
25 Retrieve from URL ....... OK
26 Change ACL to Private ... OK
27 Verify Private ACL ...... OK
28 Change ACL to Public .... OK
29 Verify Public ACL ....... OK
30 Sync more to S3 ......... OK
31 Change file cksum1.txt .. OK
32 Change file cksum33.txt . OK
33 Don't check MD5 ......... OK
34 Check MD5 ............... OK
35 Rename within S3 ........ OK
36 Rename (NoSuchKey) ...... OK
37 Sync more from S3 (invalid src) OK
38 Sync more from S3 ....... OK
39 Remove dst dir for get .. OK
40 Get multiple files ...... OK
41 Put unicode filenames ... OK
42 Make dst dir for get .... OK
43 Get unicode filenames ... OK
44 Get multiple files ...... OK
45 blah.txt / Blah.txt ..... OK
46 Copy between buckets .... OK
47 Recursive copy, set ACL . OK
48 Verify ACL and MIME type OK
49 Modify MIME type ........ OK
50 Verify ACL and MIME type OK
51 Modify MIME type back ... OK
52 Verify ACL and MIME type OK
53 Add cache-control header OK
54 HEAD check Cache-Control present OK
55 Remove cache-control header OK
56 HEAD check Cache-Control OK
57 sign string ............. OK
58 signurl time ............ OK
59 signurl time offset ..... OK
60 Rename within S3 ........ OK
61 Sync remote2remote ...... OK
62 Don't put symbolic links OK
63 Put symbolic links ...... OK
64 Sync symbolic links ..... OK
65 Multi-source move ....... OK
66 Verify move ............. OK
67 List all ................ OK
68 Simple delete ........... OK
69 Simple delete with rm ... OK
70 Create expiration rule with days and prefix OK
71 Create expiration rule with date and prefix OK
72 Create expiration rule with days only OK
73 Create expiration rule with date only OK
74 Get current expiration setting OK
75 Delete expiration rule .. OK
76 Recursive delete maximum exceeded OK
77 Recursive delete ........ OK
78 Recursive delete with rm OK
79 Recursive delete all .... OK
80 Remove empty bucket ..... OK
81 Remove remaining buckets OK

fviard added 5 commits March 11, 2015 22:00
…ers (chinese, japanese, ...) in file names, folder or path arguments (Should fix bug like #417) when UTF-8 is not the system def$

Now, in theory, everything that gets in should be converted to unicode, and then things would be converted back to strings (in the user-defined encoding) at the outermost parts when dealing with the system.
When talking to S3, convert unicode to strings in utf-8 encoding.

Validated with run-test.py on a computer with utf-8 as default encoding and on one with ANSI C as default encoding.
… unicodise conversion + removed most of the useless or invalid unicodise/deunicodise calls in the code.
@fviard (Contributor, Author) commented Mar 12, 2015

Note: only the entries in the configuration file (config.py) could not be rightfully unicodised, because it is a chicken-and-egg problem: the encoding to use is itself defined inside the configuration file.
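To illustrate the chicken-and-egg problem, here is a hypothetical sketch (the dictionary and option names are invented for the example, not s3cmd's real config layout): the encoding option must first be read as raw bytes, and only then can it be used to decode the remaining entries.

```python
# Raw config entries as read from disk, before any decoding (invented data).
raw_options = {b"encoding": b"utf-8", b"bucket_location": b"EU"}

# Step 1: the encoding name itself can safely be read as ASCII,
# since codec names are ASCII by convention.
encoding = raw_options[b"encoding"].decode("ascii")

# Step 2: every other entry can now be unicodised with that encoding.
options = {key.decode(encoding): val.decode(encoding)
           for key, val in raw_options.items()}
```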

@mdomsch (Contributor) commented Mar 12, 2015

This is excellent work, thank you! I started looking at the patch this
morning and got through about half of it. Will continue reviewing shortly.

Thanks,
Matt


@fviard (Contributor, Author) commented Mar 12, 2015

Thanks. By the way, note that there is a second part, 'Pass2 ...', that I added to this branch half an hour ago. Things were working OK before, but I noticed today that some pieces were still wrong. It is much cleaner now.

@@ -707,7 +711,7 @@ def object_modify(self, src_uri, dst_uri, extra_headers = None):
     headers = self._sanitize_headers(headers)
     acl = self.get_acl(src_uri)

-    headers['x-amz-copy-source'] = "/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object()))
+    headers['x-amz-copy-source'] = deunicodise("/%s/%s" % (src_uri.bucket(), self.urlencode_string(src_uri.object())), encoding="UTF-8")
Review comment (Contributor):

Why use deunicodise() here instead of .encode('UTF-8'), as you did with the same header at line 682?

Reply (Contributor, Author):

You are right about that one. I made the two changes at different times, so they were not consistent.
I just pushed an additional commit to fix that and use deunicodise in both cases.
I prefer deunicodise because it has a failsafe check that the string really is unicode.

I was a little lazy at the time, but I think a future patch could add functions like s3_encode/s3_decode, or deunicodise_s3/unicodise_s3, and use those instead of the .encode('UTF-8')/deunicodise(..., encoding="UTF-8") calls that are used throughout the code to prepare strings to be sent to the S3 server.

@fviard (Contributor, Author) commented Mar 16, 2015

I have added encode_to_s3 and decode_from_s3 functions.
The only remaining thing that could be cleaned up is the unicodisation inside the S3Error class in Exception.py. (But I don't have the time to do that; maybe it will be done in a future pull request.)
Note: I replaced calls to "string.decode" with "unicode(" as it appears to be more efficient (source: a Google search :p)
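Conceptually, the encode_to_s3/decode_from_s3 pair pins the wire format to UTF-8 regardless of the local system encoding. A minimal sketch of that idea, not the exact s3cmd code:

```python
def encode_to_s3(s):
    # The S3 wire format is always UTF-8, whatever the local encoding is.
    if isinstance(s, bytes):
        return s  # failsafe: already encoded
    return s.encode("utf-8")

def decode_from_s3(b):
    # Responses from S3 come back as UTF-8 bytes; bring them back to text.
    if isinstance(b, bytes):
        return b.decode("utf-8")
    return b

key = "\u65e5\u672c\u8a9e.txt"  # a Japanese file name
wire_key = encode_to_s3(key)    # UTF-8 bytes for the request
```

Because the wire encoding is fixed, a key round-trips cleanly even on a machine whose locale is ANSI C.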

@mdomsch (Contributor) commented Mar 16, 2015

This looks really good. Are you comfortable with me merging it now? It does pass the tests, and being in upstream master would get it more test coverage.

At some point I'd like to consider putting the encode_to_s3() / decode_from_s3() calls directly into the send_request() and send_file() paths (really, everywhere we call conn.c.getresponse() and nearby) and do all the conversion/marshalling adjacent to the socket sends/receives. Fortunately there aren't that many of those. With signing, though, we'd need to do the conversion prior to signing (as you've done). So this may be "good enough" and actually desired as-is.
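The socket-boundary idea could look roughly like this. Everything here (build_request_bytes, its signature, the helper) is invented for the sketch and is not s3cmd's real request path; the point is only that str-to-bytes conversion happens in one place, right before data would hit the socket.

```python
def encode_to_s3(s):
    # Wire format is always UTF-8; pass through anything already encoded.
    return s if isinstance(s, bytes) else s.encode("utf-8")

def build_request_bytes(method, resource, headers):
    # All text -> bytes marshalling happens here, adjacent to the socket
    # send; callers everywhere else deal only in text.
    lines = [b"%s %s HTTP/1.1" % (encode_to_s3(method), encode_to_s3(resource))]
    for key, value in headers.items():
        lines.append(b"%s: %s" % (encode_to_s3(key), encode_to_s3(value)))
    return b"\r\n".join(lines) + b"\r\n\r\n"
```

As noted above, signing complicates this: the signature must be computed over the already-encoded bytes, so the conversion cannot always be deferred all the way to the socket.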


@fviard (Contributor, Author) commented Mar 16, 2015

Thanks. Yes, I think it is good to merge.
I did a lot of testing and everything seems to work as expected.
Also, getting it upstream is the only way to discover the bugs that may remain.
:-)

mdomsch added a commit that referenced this pull request Mar 16, 2015
@mdomsch merged commit 80c82f7 into s3tools:master Mar 16, 2015