Force utf-8 encoding explicitly #23

epmoyer · 2017-03-14T19:08:39Z

The Confluence REST API returns UTF-8 encoded data, but the Requests module defaults to an encoding behavior that corrupts many Unicode characters above code point 127 (for example, U+00A0 (non-breaking space) and U+00A2 (cent sign)). PythonConfluenceAPI currently relies on that default, so it consistently mis-decodes many characters (among them the two listed above).

I encountered the problem while developing an application that reads a page's body, modifies it, and writes the updated body back to the page. An existing non-breaking space on the page was corrupted during the round trip because the body was mis-decoded when the page was read.

I created the following Gist (encoding_test.py) to reproduce the bug and validate this pull request (which corrects the encoding by explicitly setting it to utf-8):
https://gist.github.com/epmoyer/e4e9b09e9af38478e8df1c9506222626

To run the encoding_test.py script you will need to create a local config.ini file, following the instructions in the comments at the top of the script. The config file should identify a test page that contains at least one “¢” character (the cent sign is sufficient to demonstrate the error).

Here are the results of executing encoding_test.py on the current PythonConfluenceAPI using a test page containing a single cent sign:

$python encoding_test.py
Testing ConfluenceAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>Â˘</p>
--- Page body codepoints ---------------------
<p>\xc2\x2d8</p>
----------------------------------------------
Testing ConfluenceFuturesAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>Â˘</p>
--- Page body codepoints ---------------------
<p>\xc2\x2d8</p>
----------------------------------------------
$

The page body contains the cent sign (U+00A2), which is encoded in UTF-8 as the two bytes 0xC2 0xA2. The API incorrectly returns the two code points U+00C2 and U+02D8 instead of the single correct code point U+00A2.
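The mismatch can be reproduced offline with nothing but the standard library. The sketch below decodes the same two bytes twice. The choice of ISO-8859-2 as the wrong codec is an assumption inferred from the observed output (it is the codec that maps 0xA2 to U+02D8); the pull request does not say which encoding Requests actually picked. The `codepoints` helper is hypothetical and exists only for illustration.

```python
# UTF-8 bytes for a page body containing a single cent sign (U+00A2).
raw = b"<p>\xc2\xa2</p>"

# Decoding with the encoding the server actually used.
correct = raw.decode("utf-8")

# Assumption: ISO-8859-2 is inferred from the test output, not confirmed
# by the pull request. It turns each byte into a separate character.
wrong = raw.decode("iso-8859-2")


def codepoints(text):
    """Hypothetical helper: list the non-ASCII code points in text."""
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 127]


print(codepoints(correct))  # ['U+00A2']
print(codepoints(wrong))    # ['U+00C2', 'U+02D8']
```

Writing `wrong` back to the server as UTF-8 would store four bytes where the page originally had two, which is how a read-modify-write round trip ends up corrupting the page.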

The fix is to set response.encoding = 'utf-8' before referencing response.text.
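As a minimal sketch of that pattern (not the actual diff in this pull request), the helper below forces the encoding and then reads the text. `utf8_text` and `StubResponse` are hypothetical names. The stub only mimics the `content`, `encoding`, and `text` attributes of a Requests response so the example runs without a network connection or a Confluence server.

```python
class StubResponse(object):
    """Hypothetical stand-in for a Requests response (illustration only)."""

    def __init__(self, content, encoding):
        self.content = content    # raw bytes from the server
        self.encoding = encoding  # encoding that would be used by default

    @property
    def text(self):
        # Like Requests, decode the raw bytes with the current encoding.
        return self.content.decode(self.encoding)


def utf8_text(response):
    """Force UTF-8 before the body is decoded, then return the text."""
    response.encoding = "utf-8"
    return response.text


# Assumption: the stub starts with a wrong guess (ISO-8859-2) to imitate
# the failure described above.
stub = StubResponse(b"<p>\xc2\xa2</p>", "iso-8859-2")
body = utf8_text(stub)
print(repr(body))
```

The ordering matters: the encoding has to be assigned before `text` is read, since `text` is decoded from the raw bytes using whatever encoding is set at that moment.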

Here is the output of the same test using the pull request fix:

$python encoding_test.py
Testing ConfluenceAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>¢</p>
--- Page body codepoints ---------------------
<p>\xa2</p>
----------------------------------------------
Testing ConfluenceFuturesAPI...
Found page id: 84152851
--- Raw page body ----------------------------
<p>¢</p>
--- Page body codepoints ---------------------
<p>\xa2</p>
----------------------------------------------
$

I’ve tested the fix with Python 2.7, 3.5, and 3.6 and it behaves properly in all cases.

Force utf-8 encoding explicitly

4000302

rpcope1 merged commit b7f0ca2 into rpcope1:master Mar 14, 2017
