This repository was archived by the owner on Dec 27, 2018. It is now read-only.

Force utf-8 encoding explicitly #23

Merged
merged 1 commit into from
Mar 14, 2017

@epmoyer commented Mar 14, 2017

The Confluence REST API returns data encoded as UTF-8, but the Requests module's default encoding behavior corrupts many Unicode characters above code point 127 (for example, U+00A0 (non-breaking space) and U+00A2 (cent sign)). PythonConfluenceAPI currently relies on that default, so it consistently mis-decodes many characters (among them the two listed above).

I encountered the problem while developing an application that reads a page's body, makes modifications, and writes the updated body back to the page. An existing non-breaking space on the page became corrupted during the round trip due to the encoding error on the page read.

I created the following Gist (encoding_test.py) to reproduce the bug and validate this pull request (which corrects the encoding by explicitly setting it to utf-8):
https://gist.github.com/epmoyer/e4e9b09e9af38478e8df1c9506222626

To run the encoding_test.py script, create a local config.ini file per the instructions in the comments at the top of the script. The config file should identify a test page containing at least one “¢” character (the cent sign is sufficient to demonstrate the error).

Here are the results of executing encoding_test.py on the current PythonConfluenceAPI using a test page containing a single cent sign:

$ python encoding_test.py
Testing ConfluenceAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>¢</p>
--- Page body codepoints ---------------------
<p>\xc2\x2d8</p>
----------------------------------------------
Testing ConfluenceFuturesAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>¢</p>
--- Page body codepoints ---------------------
<p>\xc2\x2d8</p>
----------------------------------------------
$

The page body contains the cent sign (U+00A2), whose UTF-8 encoding is the two bytes 0xC2 0xA2. The API incorrectly returns two code points, U+00C2 and U+02D8, instead of the single correct code point U+00A2.
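The mismatch can be reproduced offline with nothing but the standard library. The sketch below is illustrative only: it assumes the wrongly chosen charset was ISO-8859-2, which is a guess based on the fact that decoding the bytes 0xC2 0xA2 with that charset yields exactly the two code points in the failing output. The helper name `codepoints` is made up for this example.

```python
# UTF-8 bytes for the cent sign (U+00A2), as described above.
raw = b"\xc2\xa2"


def codepoints(text):
    """Return each character of `text` as a 'U+XXXX' string."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]


# Correct: treat the two bytes as one UTF-8 sequence.
correct = raw.decode("utf-8")

# Assumption: ISO-8859-2 stands in for the wrongly selected charset.
# Each byte is then decoded as a separate character.
wrong = raw.decode("iso-8859-2")

print(codepoints(correct))  # ['U+00A2']
print(codepoints(wrong))    # ['U+00C2', 'U+02D8']
```

Any single-byte charset would turn the two bytes into two characters; ISO-8859-2 is simply one that matches the observed values.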

The fix is to add response.encoding = 'utf-8' before referencing response.text.
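A minimal sketch of that pattern follows. It is not the code from this pull request: `FakeResponse` is a hypothetical stand-in that mimics only the `content`, `encoding`, and `text` attributes of a Requests response so the example runs without a network or a Confluence server, and `body_as_utf8` is an illustrative helper name. The ISO-8859-2 fallback inside the stand-in is likewise an assumption used to simulate a bad guess.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a Requests response (illustration only)."""

    def __init__(self, content, encoding=None):
        self.content = content    # raw bytes from the server
        self.encoding = encoding  # None means "not set by the caller"

    @property
    def text(self):
        # Assumption: simulate a wrong guess with ISO-8859-2 when no
        # encoding has been set explicitly.
        return self.content.decode(self.encoding or "iso-8859-2")


def body_as_utf8(response):
    """Force UTF-8 before reading the text, as the fix describes."""
    response.encoding = "utf-8"
    return response.text


page = b"<p>\xc2\xa2</p>"
print(repr(FakeResponse(page).text))          # mis-decoded: two characters
print(repr(body_as_utf8(FakeResponse(page)))) # decoded as a single cent sign
```

With a real Requests response the same two steps apply: assign `response.encoding`, then read `response.text`.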

Here is the output of the same test using the pull request fix:

$ python encoding_test.py
Testing ConfluenceAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>¢</p>
--- Page body codepoints ---------------------
<p>\xa2</p>
----------------------------------------------
Testing ConfluenceFuturesAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>¢</p>
--- Page body codepoints ---------------------
<p>\xa2</p>
----------------------------------------------
$

I’ve tested the fix with Python 2.7, 3.5, and 3.6 and it behaves properly in all cases.

@rpcope1 merged commit b7f0ca2 into rpcope1:master on Mar 14, 2017