
Conversation

@methane methane commented Jul 6, 2018

Stop using the key=lambda idiom. This behavior is the same as in the C version of the encoder.

https://bugs.python.org/issue23493
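
For context, the change boils down to dropping the key function from the sort call in the pure-Python encoder (json/encoder.py). A minimal illustrative sketch of the before/after behavior (not the exact diff; the dict here is just an example):

# Illustrative sketch only, not the exact diff from json/encoder.py.
dct = {"b": 1, "a": 2}

# Before: sort the items by key only, via a lambda.
items_before = sorted(dct.items(), key=lambda kv: kv[0])

# After: sort the (key, value) pairs directly, as the C encoder does.
# With unique keys the first tuple elements always differ, so the values
# are never compared and the result is identical.
items_after = sorted(dct.items())

assert items_before == items_after == [("a", 2), ("b", 1)]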

methane added 2 commits July 6, 2018 16:48
Stop using key=lambda idiom.  This behavior is same to
C version encoder.
@methane methane added the performance Performance or resource usage label Jul 6, 2018

@serhiy-storchaka serhiy-storchaka left a comment

Confirm that there is a benefit from this change.

@vstinner

vstinner commented Jul 6, 2018

Is it possible that sorted() compares two values and that the comparison raises a TypeError with this change? (Is it possible to have a dictionary with two keys which have the same order?)

@vstinner vstinner left a comment

LGTM since the C code calls list(dct.items()).sort() (no key argument).

items = list(dct.items())
items.sort()

and

items = sorted(dct.items())

are equivalent and sorted() is shorter, so LGTM :-)

@methane

methane commented Jul 6, 2018

Is it possible that sorted() compares two values and that the comparison raises a TypeError with this change? (Is it possible to have a dictionary with two keys which have the same order?)

It happens with the C encoder.

$ python3
Python 3.7.0 (default, Jun 28 2018, 15:11:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class D(dict):
...     def items(self):
...         return [('foo', 42), ('foo', 'bar')]
...
>>> d = D({'foo': None})
>>> d
{'foo': None}
>>> import json
>>> json.dumps(d)
'{"foo": 42, "foo": "bar"}'
>>> json.dumps(d, sort_keys=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/inada-n/pyenv/versions/3.7.0/lib/python3.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/inada-n/pyenv/versions/3.7.0/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/inada-n/pyenv/versions/3.7.0/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
TypeError: '<' not supported between instances of 'str' and 'int'

>>> json.encoder.c_make_encoder = None
>>> json.dumps(d)
'{"foo": 42, "foo": "bar"}'

@tirkarthi

Some numbers with ./configure --enable-optimizations

Before patch:

cpython git:(master)   rlwrap ./python
Python 3.8.0a0 (heads/master:c929df3, Jul  6 2018, 11:38:33)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> import json
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3216385352425277
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.38790418207645416
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3015676769427955
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3441896280273795

After patch:

➜  cpython git:(master) ✗ rlwrap ./python
Python 3.8.0a0 (heads/master-dirty:c929df3, Jul  6 2018, 11:44:57)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> import json
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3076137569732964
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3111478630453348
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3163580968976021
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3175586462020874

>>> # itemgetter
>>> import operator
>>> timeit.timeit('sorted(d.items(), key=operator.itemgetter(0))', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.37424366315826774
>>> timeit.timeit('sorted(d.items(), key=operator.itemgetter(0))', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.37700869189575315
>>> timeit.timeit('sorted(d.items(), key=operator.itemgetter(0))', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3674369710497558

I'm a little confused: the original issue proposed operator.itemgetter as the fastest option, while here it's the slowest. Any information on the difference between the proposed change and the actual one? Maybe things have changed since Python 3.4, when the issue was initially posted, and that is why the key argument is removed in this version?

Thanks

@methane

methane commented Jul 6, 2018

https://tools.ietf.org/html/rfc7159#section-4
The RFC says: "The names within an object SHOULD be unique."

@methane

methane commented Jul 6, 2018

@tirkarthi You compared sorted(d.items()) with sorted(d.items(), key=itemgetter(0)).
The original issue compared sorted(d.items(), key=lambda x: x[0]) with sorted(d.items(), key=itemgetter(0)).
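
For reference, a small self-contained script that times all three variants side by side (timings depend on the machine and build, so no numbers are claimed here; the 1000-key dict mirrors the setup used above):

# Times the three sort variants discussed in this thread.
import timeit

setup = "from operator import itemgetter; d = dict.fromkeys(map(str, range(1000)))"

for stmt in (
    "sorted(d.items())",                        # this PR (matches the C encoder)
    "sorted(d.items(), key=itemgetter(0))",     # proposed in the original issue
    "sorted(d.items(), key=lambda kv: kv[0])",  # current pure-Python encoder
):
    print(stmt, timeit.timeit(stmt, setup=setup, number=1000))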

@vstinner

vstinner commented Jul 6, 2018

>>> class D(dict):
...     def items(self):
...         return [('foo', 42), ('foo', 'bar')]

Honestly, if you use such a weird dictionary, you are going to get worse issues than a TypeError on sorting, no?

Ignore my question, your PR is fine.

@tirkarthi

Thanks @methane, I missed it. It makes sense now.

➜  cpython git:(master) ✗ rlwrap ./python
Python 3.8.0a0 (heads/master-dirty:c929df3, Jul  6 2018, 11:44:57)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> import json
>>> timeit.timeit('sorted(d.items(), key=lambda x: x[0])', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.41919384989887476
>>> timeit.timeit('sorted(d.items(), key=lambda x: x[0])', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.4452404109761119
>>> timeit.timeit('sorted(d.items(), key=lambda x: x[0])', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.42263348307460546

@vstinner

vstinner commented Jul 6, 2018

Some numbers with ./configure --enable-optimizations (...) import timeit

Noooo, please! If you spend hours compiling Python with PGO, don't use the timeit module, which is not reliable! Use perf timeit:
http://perf.readthedocs.io/en/latest/cli.html#timeit-versus-perf-timeit

Your results have a large deviation:

0.3216385352425277
0.38790418207645416
0.3015676769427955
0.3441896280273795

perf timeit should help to reduce the deviation. My doc to get reliable benchmarks:
http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

@tirkarthi

tirkarthi commented Jul 6, 2018

Thanks @vstinner. My "after patch" numbers don't make sense; I misread the posted benchmark as being about the sorted function rather than the json module, so I was only comparing sorted calls. Sorry for the noise.

@methane

methane commented Jul 6, 2018

Even with perf, the performance of json.dumps(d, sort_keys=True, indent=" ") is unstable.

$ ./python -m perf timeit -s 'import json; d = {"k%d" % i: True for i in range(100)}' -- 'json.dumps(d, indent=" ", sort_keys=True)'
.....................
Mean +- std dev: 64.7 us +- 4.5 us

$ ./python -m perf timeit -s 'import json; d = {"k%d" % i: True for i in range(100)}' -- 'json.dumps(d, indent=" ", sort_keys=True)'
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (8.89 us) is 13% of the mean (68.1 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m perf system tune' command to reduce the system jitter.
Use perf stats, perf dump and perf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 68.1 us +- 8.9 us

$ ./python -m perf timeit -s 'import json; d = {"k%d" % i: True for i in range(100)}' -- 'json.dumps(d, indent=" ", sort_keys=True)'
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (8.02 us) is 12% of the mean (67.7 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m perf system tune' command to reduce the system jitter.
Use perf stats, perf dump and perf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 67.7 us +- 8.0 us

It seems sorting is not the bottleneck. We can close the issue without merging this.

@methane methane closed this Jul 6, 2018
@methane methane deleted the json-sort-keys branch July 6, 2018 12:26
@vstinner

vstinner commented Jul 6, 2018

It seems sorting is not the bottleneck. We can close the issue without merging this.

I suggest merging this change anyway, to respect PEP 399: the Python and C implementations must behave the same. Currently, the Python implementation is not the default, but it behaves differently :-(
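
A minimal sketch of how the two implementations can be spot-checked against each other, reusing the json.encoder.c_make_encoder trick shown earlier in this thread (no indent is passed because the C fast path is only used when indent is None; the input dict and expected string are just an example):

# Sketch: compare C and pure-Python encoder output for sort_keys=True.
import json

data = {"b": 1, "a": [2, 3], "c": {"y": None, "x": True}}

c_output = json.dumps(data, sort_keys=True)   # C encoder fast path

json.encoder.c_make_encoder = None            # force the pure-Python encoder
py_output = json.dumps(data, sort_keys=True)

assert c_output == py_output == '{"a": [2, 3], "b": 1, "c": {"x": true, "y": null}}'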

@methane methane restored the json-sort-keys branch July 6, 2018 17:15
@methane methane added skip news and removed performance Performance or resource usage labels Jul 6, 2018
@methane methane reopened this Jul 6, 2018
@methane methane changed the title bpo-23493: json: Optimize sort_keys in Python encoder bpo-23493: json: Make sort_keys in Python encoder same to C Jul 6, 2018
@methane methane changed the title bpo-23493: json: Make sort_keys in Python encoder same to C bpo-23493: json: Change sort_keys in Python encoder same to C Jul 6, 2018
@methane methane merged commit e25399b into python:master Jul 6, 2018
@methane methane deleted the json-sort-keys branch July 6, 2018 23:55