
bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. #15265

Merged
merged 5 commits into python:master from gnprice's pr-makeud-dataclass branch
Sep 12, 2019

Conversation

@gnprice (Contributor) commented Aug 14, 2019

Now the fields have names! Much easier to keep straight as a
reader than 18 distinct indexes into a list.

Runs about 10-15% slower: from 10.8s to 12.3s, on my laptop.
Fortunately that's perfectly fine for this maintenance script.

https://bugs.python.org/issue37760

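To make the change concrete before diving into the review, here is a minimal sketch of the kind of conversion this PR makes. It is not the PR's actual code: only codepoint, name, binary_properties, and the from_row helper are named in the discussion below; the remaining fields are illustrative stand-ins for the 18 UnicodeData-derived columns.

import dataclasses
from typing import Optional, Set

@dataclasses.dataclass
class UcdRecord:
    codepoint: str
    name: str
    category: str                      # illustrative; the real record has 18 fields
    east_asian_width: Optional[str]
    binary_properties: Set[str]

def from_row(row):
    # Build a record from one parsed UnicodeData.txt row, giving every record
    # its own binary_properties set (a detail that comes up in the review).
    return UcdRecord(*row, None, set())

# Before: record[0], record[1], ... index into an 18-element list.
# After:  record.codepoint, record.name, ... read like what they mean.
rec = from_row(['0041', 'LATIN CAPITAL LETTER A', 'Lu'])
assert rec.name == 'LATIN CAPITAL LETTER A'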
@benjaminp (Contributor) left a comment:

re commit message: Wasn't each record actually a heterogeneous list?

if s[1].startswith("<CJK Ideograph"):
if s.name[-6:] == "First>":
    s.name = ""
    field = dataclasses.astuple(s)[:15]
@benjaminp (Contributor):

Is the sole reason for this tuple conversion line 1025? Can you use the replace method of dataclasses there? The hardcoded [:15] somewhat defeats the nice benefit of naming fields you've achieved here.

(Also, I wonder if the first two loops of this function can be merged but that's orthogonal.)

@gnprice (Contributor, Author):

> Is the sole reason for this tuple conversion line 1025? Can you use the replace method of dataclasses there? The hardcoded [:15] somewhat defeats the nice benefit of naming fields you've achieved here.

Yeah. I think I originally did this with replace and that was too slow. Let's see if that reproduces...

Oh right, another bit is that we need a distinct set for the binary_properties attribute of each record, so the code ends up not quite as clean as you might hope anyway. In this version from_row is the place that knows that fact, and we use it in both these loops.

OK, the result is it makes the whole script about 20% slower: on my (fast) desktop, it's 8.3s in this version and 10.0s with .replace. Here's the patch:

             if s:
                 if s.name[-6:] == "First>":
                     s.name = ""
-                    field = dataclasses.astuple(s)[:15]
+                    field = s
                 elif s.name[-5:] == "Last>":
                     if s.name.startswith("<CJK Ideograph"):
-                        cjk_ranges_found.append((field[0],
+                        cjk_ranges_found.append((field.codepoint,
                                                  s.codepoint))
                     s.name = ""
                     field = None
             elif field:
-                table[i] = from_row(('%X' % i,) + field[1:])
+                table[i] = dataclasses.replace(field,
+                    codepoint='%X' % (i,), binary_properties=set())

Certainly cleaner code that way, but the 20% speed hit (~2 extra seconds) is noticeably painful when doing development on this script and consequently running it all the time to try changes.

This tuple is quite local -- it doesn't escape this loop of 20 lines or so -- so I'd prefer to accept the short stretch of lower-level fiddling in exchange for the speedup when working on the rest of the script.

@benjaminp (Contributor):

Okay, I suppose the [:15] is also unlikely to break in the future, given the stability of UnicodeData.txt.

@gnprice (Contributor, Author):

Agreed :)
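As an aside on the exchange above, the two dataclasses calls being weighed behave like this (a self-contained toy, not the script's UcdRecord): dataclasses.astuple() flattens a record back to a positional tuple, and dataclasses.replace() copies fields shallowly, which is exactly why each expanded record needs its own binary_properties set.

import dataclasses

@dataclasses.dataclass
class Rec:                 # toy stand-in for UcdRecord
    codepoint: str
    name: str
    binary_properties: set

first = Rec('3400', '<CJK Ideograph Extension A, First>', set())

# astuple() hands back a plain tuple, so positional slicing like [:15] works:
assert dataclasses.astuple(first)[:2] == ('3400', '<CJK Ideograph Extension A, First>')

# replace() copies fields shallowly; without overriding it, the set is shared:
shared = dataclasses.replace(first, codepoint='3401')
shared.binary_properties.add('Ideographic')
assert 'Ideographic' in first.binary_properties       # mutated through the copy

# Passing binary_properties=set(), as in the patch above, avoids that sharing:
fresh = dataclasses.replace(first, codepoint='3402', binary_properties=set())
fresh.binary_properties.add('Alphabetic')
assert 'Alphabetic' not in first.binary_properties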

import os
import sys
import zipfile

from dataclasses import dataclass
@benjaminp (Contributor):

Can we just qualify with dataclasses. everywhere? (seems this is already mostly happening aside from the decorator application)

@gnprice (Contributor, Author):

Sure, WFM.
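For readers outside the diff, the two styles under discussion look roughly like this (a sketch; the class body is elided):

# Style in the diff context above:
from dataclasses import dataclass

@dataclass
class UcdRecord: ...

# Style the review asks for: qualify with the module name everywhere, matching
# the dataclasses.astuple() / dataclasses.replace() calls elsewhere in the script.
import dataclasses

@dataclasses.dataclass
class UcdRecord: ...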

# decimal, digit, numeric, bidi-mirrored, Unicode-1-name, (11)
# ISO-comment, uppercase, lowercase, titlecase, ea-width, (16)
# derived-props] (17)
# table: Dict[int, UcdRecord] # key is codepoint
@benjaminp (Contributor):

The right type is actually List[Optional[UcdRecord]], right?

@gnprice (Contributor, Author):

... So it is! Good catch, thanks.

Certainly this illustrates the danger in type annotations that aren't checked. 😉 At least it was commented, so if it'd gotten committed this way the reader would have had some warning to take it skeptically.
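Spelled out, the corrected type looks like this (a sketch with a stand-in record type, not the script's code): the table has one slot per codepoint, and unassigned codepoints stay None, hence List[Optional[UcdRecord]] rather than Dict[int, UcdRecord].

import dataclasses
from typing import List, Optional

@dataclasses.dataclass
class UcdRecord:            # stand-in with just two fields
    codepoint: str
    name: str

# One slot per codepoint (sys.maxunicode is 0x10FFFF), None where unassigned:
table: List[Optional[UcdRecord]] = [None] * 0x110000
table[0x41] = UcdRecord('0041', 'LATIN CAPITAL LETTER A')
assert table[0x42] is None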

@bedevere-bot:

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@gnprice (Contributor, Author) left a comment:

Thanks @benjaminp for the review!

Just pushed an update with all these changes except for the astuple bit; details are in the review threads above.

> Wasn't each record actually a heterogeneous list?

Hmm true! I guess logically I think of that as a "tuple", but it's certainly not a Python tuple. Updated the NEWS entry, and I'll edit the PR description/title too.


@gnprice changed the title from "bpo-37760: Convert from giant tuples to a dataclass, in makeunicodedata." to "bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata." on Sep 12, 2019
@benjaminp merged commit a65678c into python:master on Sep 12, 2019
DinoV pushed a commit to DinoV/cpython that referenced this pull request Sep 12, 2019
…edata. (pythonGH-15265)

Now the fields have names!  Much easier to keep straight as a
reader than the elements of an 18-tuple.

Runs about 10-15% slower: from 10.8s to 12.3s, on my laptop.
Fortunately that's perfectly fine for this maintenance script.
@gnprice deleted the pr-makeud-dataclass branch on September 18, 2019 05:36
anthrotype added a commit to anthrotype/unicodedata2 that referenced this pull request Mar 17, 2020
…5265)

Adapted from: python/cpython#15265

commit a65678c5c90002c5e40fa82746de07e6217df625
Author: Greg Price <gnprice@gmail.com>
Date:   Thu Sep 12 02:23:43 2019 -0700

    bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265)

    Now the fields have names!  Much easier to keep straight as a
    reader than the elements of an 18-tuple.

    Runs about 10-15% slower: from 10.8s to 12.3s, on my laptop.
    Fortunately that's perfectly fine for this maintenance script.

~~~

The original patch uses dataclasses but I use namedtuple here so that it
works on both python 2 and 3.
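A sketch of that namedtuple substitution (field names illustrative; the point is only that collections.namedtuple exists on both Python 2 and 3, while dataclasses requires 3.7+):

import collections

UcdRecord = collections.namedtuple(
    'UcdRecord', ['codepoint', 'name', 'category', 'binary_properties'])

rec = UcdRecord('0041', 'LATIN CAPITAL LETTER A', 'Lu', frozenset())
assert rec.name == 'LATIN CAPITAL LETTER A'     # fields still have names
rec2 = rec._replace(codepoint='0042')           # _replace() stands in for dataclasses.replace()
assert tuple(rec2)[:2] == ('0042', 'LATIN CAPITAL LETTER A')   # and it's already a tuple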
anthrotype added a commit to fonttools/unicodedata2 that referenced this pull request Mar 19, 2020
* Clean up and reduce visual clutter in the makeunicodedata scripts

python/cpython#7558

commit faa2948654d15a859bc4317e00730ff213295764
Author: Stefan Behnel <stefan_ml@behnel.de>
Date:   Sat Jun 1 21:49:03 2019 +0200

    Clean up and reduce visual clutter in the makeunicode.py script. (GH-7558)

* bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130)

python/cpython#15130
commit ef2af1ad44be0542a47270d5173a0b920c3a450d
Author: Greg Price <gnprice@gmail.com>
Date:   Mon Aug 12 22:20:56 2019 -0700

    bpo-37760: Factor out the basic UCD parsing logic of makeunicodedata. (GH-15130)

    There were 10 copies of this, and almost as many distinct versions of
    exactly how it was written.  They're all implementing the same
    standard.  Pull them out to the top, so the more interesting logic
    that remains becomes easier to read.

~~~

I removed the type hints from UcdFile class to apply the same patch to both python 2 and 3
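The "basic UCD parsing logic" that commit factors out comes down to the shared format of the UCD data files; a hedged sketch of that shared logic (not the actual UcdFile class):

def parse_ucd_lines(lines):
    # UCD data files use ';'-separated fields; '#' starts a comment and
    # blank lines are ignored.
    for line in lines:
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        yield [field.strip() for field in line.split(';')]

sample = [
    "# a comment line",
    "0041          ; Alphabetic   # L&  LATIN CAPITAL LETTER A",
]
assert list(parse_ucd_lines(sample)) == [['0041', 'Alphabetic']]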

* bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129)

python/cpython#15129
commit 99d208efed97e02d813e8166925b998bbd0d3993 (HEAD)
Author: Greg Price <gnprice@gmail.com>
Date:   Mon Aug 12 22:59:30 2019 -0700

    bpo-37760: Constant-fold some old options in makeunicodedata. (GH-15129)

    The `expand` option was introduced in 2000 in commit fad27aee1.
    It appears to have been always set since it was committed, and
    what it does is tell the code to do something essential.  So,
    just always do that, and cut the option.

    Also cut the `linebreakprops` option, which isn't consulted anymore.

* bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248)

python/cpython#15248
commit c03e698c344dfc557555b6b07a3ee2702e45f6ee (HEAD)
Author: Greg Price <gnprice@gmail.com>
Date:   Tue Aug 13 19:28:38 2019 -0700

    bpo-37760: Factor out standard range-expanding logic in makeunicodedata. (GH-15248)

    Much like the lower-level logic in commit ef2af1ad4, we had
    4 copies of this logic, written in a couple of different ways.
    They're all implementing the same standard, so write it just once.
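The "standard range-expanding logic" referred to here is the UCD convention where a data file's first field is either a single codepoint or a range written XXXX..YYYY; a hedged sketch (not the actual helper):

def expand_range(field):
    # 'XXXX..YYYY' covers every codepoint from XXXX through YYYY inclusive;
    # a bare 'XXXX' is a one-codepoint range.
    if '..' in field:
        first, last = field.split('..')
    else:
        first = last = field
    return range(int(first, 16), int(last, 16) + 1)

assert list(expand_range('0041')) == [0x41]
assert len(expand_range('3400..4DBF')) == 0x4DBF - 0x3400 + 1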

* bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128)

python/cpython#15128
commit 3e4498d35c34aeaf4a9c3d57509b0d3277048ac6
Author: Greg Price <gnprice@gmail.com>
Date:   Wed Aug 14 18:18:53 2019 -0700

    bpo-37760: Avoid cluttering work tree with downloaded Unicode files. (GH-15128)

* Convert from length-18 lists to namedtuple, in makeunicodedata. (GH-15265)

Adapted from: python/cpython#15265

commit a65678c5c90002c5e40fa82746de07e6217df625
Author: Greg Price <gnprice@gmail.com>
Date:   Thu Sep 12 02:23:43 2019 -0700

    bpo-37760: Convert from length-18 lists to a dataclass, in makeunicodedata. (GH-15265)

    Now the fields have names!  Much easier to keep straight as a
    reader than the elements of an 18-tuple.

    Runs about 10-15% slower: from 10.8s to 12.3s, on my laptop.
    Fortunately that's perfectly fine for this maintenance script.

~~~

The original patch uses dataclasses but I use namedtuple here so that it
works on both python 2 and 3.

* closes bpo-39926: Update Unicode to 13.0.0. (GH-18910)

Fixes #34

Adapted from: python/cpython#18910
commit 051b9d08d1e6a8b1022a2bd9166be51c0b152698
Author: Benjamin Peterson <benjamin@python.org>
Date:   Tue Mar 10 20:41:34 2020 -0700

    closes bpo-39926: Update Unicode to 13.0.0. (GH-18910)

* Update some www.unicode.org URLs to use HTTPS. (GH-18912)

Adapted from: python/cpython#18912
commit 51796e5d2632e6ada81ca677b4153f4ccd490702
Author: Benjamin Peterson <benjamin@python.org>
Date:   Tue Mar 10 21:10:59 2020 -0700

    Update some www.unicode.org URLs to use HTTPS. (GH-18912)

* Update checksum test for Unicode 13; extend test to all of Unicode

This commit combines the following two upstream patches:

python/cpython#18913
commit c77aa2d60b420747886f4258cf159bdbb7354100
Author: Benjamin Peterson <benjamin@python.org>
Date:   Tue Mar 10 21:18:33 2020 -0700

    bpo-39926: Update unicodedata checksum tests for Unicode 13.0 update. (GH-18913)

    I forget these tests required the cpu resource.

python/cpython#15125
commit 6954be815a16fad11d1d66be576865bbbeb2b97d
Author: Greg Price <gnprice@gmail.com>
Date:   Thu Sep 12 02:25:25 2019 -0700

    closes bpo-37758: Extend unicodedata checksum tests to cover all of Unicode. (GH-15125)

    Unicode has grown since Python first gained support for it,
    when Unicode itself was still rather new.

    This pair of test cases was added in commit 6a20ee7de back in 2000,
    and they haven't needed to change much since then.  But do change
    them to look beyond the Basic Multilingual Plane (range(0x10000))
    and cover all 17 planes of Unicode's final form.

    This adds about 5 seconds to the test suite's runtime.  Mark the
    tests as CPU-using accordingly.
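The kind of check described there, sketched in simplified form (not the actual test_unicodedata code): hash a Unicode property over every codepoint, covering all 17 planes rather than just range(0x10000), and compare against a known-good digest.

import hashlib
import unicodedata

h = hashlib.sha1()
for cp in range(0x110000):            # all of Unicode, not just the BMP
    h.update(unicodedata.category(chr(cp)).encode('ascii'))
print(h.hexdigest())                  # would be compared against an expected checksum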

* test_unicodedata2: add unichr for 'narrow' python builds

* Update multibuild to latest 'devel' branch

* Build and run tests on Python 3.8

* .travis.yml: remove implicit job

or else is rejected with "Build config did not create any jobs"

travis-ci/travis-ci#8536

* test_unicodedata2: do not import test.support.requires_resource

import fails for some reason on some older 2.7 versions, see
https://travis-ci.org/github/mikekap/unicodedata2/jobs/663493029

It should not make any difference without this.