This repository was archived by the owner on Dec 27, 2018. It is now read-only.
Force utf-8 encoding explicitly #23
Merged
The Confluence REST API returns UTF-8 encoded data, but the Requests module's default encoding behavior corrupts many Unicode characters above codepoint 127 (for example, U+00A0 (non-breaking space) and U+00A2 (cent sign)). PythonConfluenceAPI currently relies on that default, so it consistently mis-decodes many characters (among them the two listed above).
I encountered the problem while developing an application that reads a page's body, makes modifications, and writes the updated body back to the page. An existing non-breaking space on the page became corrupted during the round trip due to the encoding error on the page read.
I created the following Gist (encoding_test.py) to reproduce the bug and validate this pull request (which corrects the encoding by explicitly setting it to utf-8):
https://gist.github.com/epmoyer/e4e9b09e9af38478e8df1c9506222626
To run the encoding_test.py script, create a local config.ini file per the instructions in the comments at the top of the file. The config file should identify a test page containing at least one “¢” character (the cent sign alone is sufficient to demonstrate the error).
Here are the results of executing encoding_test.py on the current PythonConfluenceAPI using a test page containing a single cent sign:
The page body contains the cent sign (U+00A2), whose UTF-8 encoding is 0xC2 0xA2. The API incorrectly returns the two codepoints U+00C2 and U+02D8 instead of the correct codepoint U+00A2.
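The mis-decoding described above can be reproduced offline with the standard library alone. This is a minimal sketch, not the Gist's test: the choice of ISO-8859-2 is an assumption, picked only because it maps the two bytes to the codepoints reported above. The description does not say which encoding Requests actually selected.

```python
# The cent sign, U+00A2, encoded as UTF-8 gives the bytes 0xC2 0xA2.
raw = u"\u00a2".encode("utf-8")

# Assumption: ISO-8859-2 stands in for the wrongly applied encoding.
wrong = raw.decode("iso-8859-2")
# Decoding the same bytes as UTF-8 recovers the original character.
right = raw.decode("utf-8")


def codepoints(s):
    """Format each character of s as a U+XXXX codepoint string."""
    return ["U+%04X" % ord(ch) for ch in s]


print(codepoints(wrong))
print(codepoints(right))
```

The first list holds two codepoints and the second holds one, which is the same one-character-becomes-two pattern reported for the test page.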
The fix is to add response.encoding = 'utf-8' before referencing response.text.
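The effect of that one-line change can be sketched without a live Confluence server. `FakeResponse` and `utf8_text` below are hypothetical names invented for illustration and are not part of Requests or PythonConfluenceAPI. The stand-in only mimics the one behavior that matters here: `text` decodes `content` using whatever `encoding` is currently set. The initial ISO-8859-2 value is likewise an assumption.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a Requests response (not the real class)."""

    def __init__(self, content, encoding):
        self.content = content    # raw bytes as received
        self.encoding = encoding  # encoding used when building .text

    @property
    def text(self):
        # Decode on every access, so a changed encoding takes effect.
        return self.content.decode(self.encoding)


def utf8_text(response):
    """Force UTF-8 before reading the text, as the fix describes."""
    response.encoding = 'utf-8'
    return response.text


# Assumption: the response starts with a wrongly chosen encoding.
resp = FakeResponse(u"\u00a2".encode("utf-8"), "iso-8859-2")
before = resp.text        # decoded with the wrong encoding
after = utf8_text(resp)   # decoded after forcing UTF-8
print(repr(before), repr(after))
```

Setting the encoding before the first read of the text matters in this sketch for the same reason it does in the fix: the decoded string depends on the encoding in effect at the time of access.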
Here is the output of the same test using the pull request fix:
I’ve tested the fix with Python 2.7, 3.5, and 3.6 and it behaves properly in all cases.