BUG: Fix #15344 by backporting ujson usage of PEP 393 API
Make use of the PEP 393 API to avoid expanding single byte ascii
characters into four byte unicode characters when encoding objects to
json.

closes #15344
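
For context, a minimal sketch of the storage behaviour the commit message relies on. Under PEP 393, CPython stores each string with 1, 2, or 4 bytes per character depending on the widest character it holds, so an ASCII string is kept in the compact one-byte form. The helper name `bytes_per_char` is hypothetical, and the measurement assumes CPython 3.3 or later, where `sys.getsizeof` reflects this layout.

```python
import sys


def bytes_per_char(ch, n=100):
    """Hypothetical helper: storage cost of one extra copy of `ch` in a str.

    Compares two strings made only of `ch` that differ in length by one,
    so the fixed object header cancels out of the difference.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


ascii_cost = bytes_per_char("a")            # compact ASCII form
bmp_cost = bytes_per_char("\u20ac")         # euro sign, needs the 2-byte form
astral_cost = bytes_per_char("\U0001F600")  # outside the BMP, needs the 4-byte form

print(ascii_cost, bmp_cost, astral_cost)
```

The bug in #15344 amounts to the encoder pushing ASCII strings from the first of these cases toward the last, which is why the regression test compares memory usage before and after `to_json()`.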

Author: Tobias Gustafsson <tobias.l.gustafsson@gmail.com>

Closes #15360 from tobgu/backport-ujson-compact-ascii-encoding and squashes the following commits:

44de133 [Tobias Gustafsson] Fix C-code formatting to pass linting of GH15344
b7e404f [Tobias Gustafsson] Merge branch 'master' into backport-ujson-compact-ascii-encoding
4e8e2ff [Tobias Gustafsson] BUG: Fix #15344 by backporting ujson usage of PEP 393 APIs for compact ascii
tobgu authored and jreback committed Feb 10, 2017
1 parent 3d6fcdc commit e884072
Showing 3 changed files with 24 additions and 1 deletion.
5 changes: 4 additions & 1 deletion doc/source/whatsnew/v0.20.0.txt
@@ -538,6 +538,8 @@ Bug Fixes
- Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)

- Bug in ``.to_json()`` where ``lines=True`` and contents (keys or values) contain escaped characters (:issue:`15096`)
- Bug in ``.to_json()`` causing single byte ascii characters to be expanded to four byte unicode (:issue:`15344`)
- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)
- Bug in ``.rolling/expanding()`` functions where ``count()`` was not counting ``np.Inf``, nor handling ``object`` dtypes (:issue:`12541`)
- Bug in ``DataFrame.resample().median()`` if duplicate column names are present (:issue:`14233`)

@@ -561,7 +563,6 @@ Bug Fixes
- Bug in ``DataFrame.fillna()`` where the argument ``downcast`` was ignored when fillna value was of type ``dict`` (:issue:`15277`)


- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)

- Bug in ``pd.read_csv()`` with ``float_precision='round_trip'`` which caused a segfault when a text entry is parsed (:issue:`15140`)

@@ -574,4 +575,6 @@ Bug Fixes

- Bug in ``DataFrame.boxplot`` where ``fontsize`` was not applied to the tick labels on both axes (:issue:`15108`)
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)


- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
10 changes: 10 additions & 0 deletions pandas/io/tests/json/test_pandas.py
@@ -1044,3 +1044,13 @@ def roundtrip(s, encoding='latin-1'):

        for s in examples:
            roundtrip(s)

    def test_data_frame_size_after_to_json(self):
        # GH15344
        df = DataFrame({'a': [str(1)]})

        size_before = df.memory_usage(index=True, deep=True).sum()
        df.to_json()
        size_after = df.memory_usage(index=True, deep=True).sum()

        self.assertEqual(size_before, size_after)
10 changes: 10 additions & 0 deletions pandas/src/ujson/python/objToJSON.c
@@ -402,6 +402,16 @@ static void *PyStringToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,
static void *PyUnicodeToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,
                             size_t *_outLen) {
    PyObject *obj = (PyObject *)_obj;

#if (PY_VERSION_HEX >= 0x03030000)
    if (PyUnicode_IS_COMPACT_ASCII(obj)) {
        Py_ssize_t len;
        char *data = PyUnicode_AsUTF8AndSize(obj, &len);
        *_outLen = len;
        return data;
    }
#endif

    PyObject *newObj = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(obj),
                                            PyUnicode_GET_SIZE(obj), NULL);
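
The effect of the fast path above can be observed from Python. This is a rough sketch, not part of the commit: it calls the same C API function, `PyUnicode_AsUTF8AndSize`, through `ctypes.pythonapi` and checks that an ASCII string's reported size does not change. The helper name `utf8_view` is hypothetical, and the sketch assumes CPython 3.3 or later.

```python
import ctypes
import sys

# Bind the CPython C API function used by the fast path in the diff.
_as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8_and_size.argtypes = [ctypes.py_object,
                              ctypes.POINTER(ctypes.c_ssize_t)]
# Take the result as a raw pointer: the buffer is owned by the str object.
_as_utf8_and_size.restype = ctypes.c_void_p


def utf8_view(text):
    """Hypothetical helper: return the UTF-8 bytes CPython exposes for `text`."""
    length = ctypes.c_ssize_t()
    ptr = _as_utf8_and_size(text, ctypes.byref(length))
    # Copy the borrowed buffer so the result does not depend on `text`.
    return ctypes.string_at(ptr, length.value)


sample = "".join(["pan", "das"])  # built at runtime, so a fresh str object
size_before = sys.getsizeof(sample)
encoded = utf8_view(sample)
size_after = sys.getsizeof(sample)

print(encoded, size_before == size_after)
```

For a compact ASCII string the stored bytes are already valid UTF-8, so the call can hand back a pointer into the object without allocating a second representation. That is the property the new branch in `PyUnicodeToUTF8` depends on.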
