Skip to content

Improve performance of serialize/deserialize by caching type information of fields in class #46

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 24, 2020

Conversation

jameslan
Copy link
Contributor

No description provided.

@jameslan
Copy link
Contributor Author

Here's my test result.

The performance test is executed in WSL using the following command:

$ time pytest --repeat 3000 betterproto/tests/test_inputs.py

which repeats serializing and deserializing 3000 times in the test case after the data is loaded.

Result:

env before after
py3.6 155.74s 22.660s
py3.7 72.57s 20.889s
py3.8 55.678s 16.739s

@cetanu cetanu self-assigned this May 21, 2020
@cetanu
Copy link
Collaborator

cetanu commented May 21, 2020

This is pretty awesome. I assume you did some profiling and found that _type_hint and get_type_hints take up a bit of time when creating these dataclasses?

I've read through the entire diff but I'm just going to give another look for thoroughness

if meta.group not in group_map["groups"]:
group_map["groups"][meta.group] = {"current": None, "fields": set()}
group_map["groups"][meta.group]["fields"].add(field)
group_map.setdefault(meta.group)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me grok this? is this the same result before and after, or does this dict have very different structure with this change?

From what I can tell, some keys are now removed. There is no concern for these groups being overwritten? or the possibility is not there because such behavior cannot occur in the protocol buffer that this will process. Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, the group_map is like:

{
    "fields": {
        "field1": "group1",
        "field2": "group1",
    },
    "groups": {
        "group1": {
            "current": Field1,
            "fields": set(Field1, Field2),
        },
    },
}

Within this dict, only current is object scoped, others are all class metadata. So that I kept current in this dict and moved others into ProtoClassMetadata, as oneof_group_by_field and oneof_field_by_group. Please refer to method __setattr__, L555 of changed __init__.py file.

In this way, building the lookup table between fields and groups is executed once per class. And also, all the object s of the class share the same lookup table.

Now in the object, _group_map indicates the current value of oneof group:

{
    "group1": Field1,
}

__setattr__ will keep this updated

setattr(cls, "_betterproto", ProtoClassMetadata(cls))


def protoclass(*args, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in my understanding, that this causes users to have to use the created protoclass instead of creating dataclasses?

However:

  1. It's auto-generated
  2. It behaves like a dataclass

So this doesn't seem to break anything.

Copy link
Collaborator

@nat-n nat-n May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something this looks like classes that inherit from betterproto.Message with a regular dataclass decorator will break if the user upgrades due to the _betterproto attribute not being set.

Breaking changes may be OK if this comes as a major version bump, but would still be better to avoid it or at least try and handle it gracefully.

I suggest accessing _betterproto via a property that checks whether it is set, and degrades gracefully for dataclasses lacking the extra annotation, maybe with a warning message. If going this route then there should be a test for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protoclass is a wrapper of dataclasses.dataclass. When this decorator is applied, the dataclass is still applied and the data class is still generated. The extra thing protoclass does, is to call make_protoclass to scan metadata of fields, and store in attribute _betterproto of the generated class.

In this way, the scan operation is done once per class, instead of per object, or per method call.

So this change requires to replace @dataclasses.dataclass with @betterproto.protoclass in generated python files. Which means users need to rerun protoc to update.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslan Thanks for clarifying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the extra decorator is removed. The metadata is populated at the first access, instead of when data class is created.

@cetanu cetanu added the enhancement New feature or request label May 21, 2020
@cetanu
Copy link
Collaborator

cetanu commented May 21, 2020

Can you go through and update the README? I think that's the main source of docs at this time

Copy link
Collaborator

@nat-n nat-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm mistaken this looks like a breaking change, see other comment for details.

@jameslan
Copy link
Contributor Author

jameslan commented May 21, 2020

Yes the result of profiling shows that get_type_hints , as well as functions in module dataclasses, are kind of pricey (especially in python 3.6).

So this PR is to make it not call them on accessing each field since they will never change.

And the next step is to remove invocations to dataclasses.fields from instance method, which may improve performance 10%~20%

@jameslan jameslan force-pushed the perf/class-cache branch from ef00e72 to d484405 Compare May 22, 2020 02:33
@jameslan
Copy link
Contributor Author

Can you go through and update the README? I think that's the main source of docs at this time

I checked README and think that it needn't any change, since now this PR is fully transparent and backward compatible.

@jameslan jameslan force-pushed the perf/class-cache branch from d484405 to 7c29a54 Compare May 23, 2020 07:57
@jameslan
Copy link
Contributor Author

Rebased to head of master

Copy link
Collaborator

@nat-n nat-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Don't forget to run black.

def test_message_can_instantiated(test_case_name: str) -> None:
plugin_module = importlib.import_module(f"{plugin_output_package}.{test_case_name}.{test_case_name}")
@pytest.fixture(scope="module", params=test_case_names)
def test_data(request):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice.

sys.path.remove(reference_module_root)


def test_message_can_be_imported(test_data: TestData) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks even more pointless now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@jameslan jameslan force-pushed the perf/class-cache branch 2 times, most recently from f603e2f to ab7687b Compare May 23, 2020 19:12
@jameslan
Copy link
Contributor Author

LGTM. Don't forget to run black.

Updated. Thanks!

@boukeversteegh
Copy link
Collaborator

boukeversteegh commented May 23, 2020

Performance is great! I've already prepared an update to #55 to be compatible with the changes done in test_inputs.py. Let's get this merged ~

@boukeversteegh
Copy link
Collaborator

Hey @nat-n ~ your last commit 5e2d9fe created merge conflicts for this Pr and mine.

Unfortunately running black on the PR branches doesn't solve the conflicts because the lines can turn out to be different.

I was able to work around it by reverting 5e2d9fe temporarily on a local copy of master, merging that copy into my branch, then rerunning black on the same files. you could do the same, @jameslan

git branch # check that you are on your PR branch
git checkout master
git pull danielgtaylor master # pull from danielgtaylor repository, use your remote name
git checkout -b master_no_blacken
git revert 5e2d9febea2a2982b7cc372ef5f4b1b7d2233e79
git checkout - # go back to PR branch
git merge master_no_blacken
black betterproto/tests/test_inputs.py # rerun black on affected files (if needed)
git commit -am 'Blacken'

jameslan added 4 commits May 23, 2020 17:36
Cached data include,
- lookup table between groups and fields of "oneof" fields
- default value creator of each field
- type hint of each field
@jameslan jameslan force-pushed the perf/class-cache branch from ab7687b to 917de09 Compare May 24, 2020 00:37
@jameslan
Copy link
Contributor Author

Hey @nat-n ~ your last commit 5e2d9fe created merge conflicts for this Pr and mine.

Unfortunately running black on the PR branches doesn't solve the conflicts because the lines can turn out to be different.

I was able to work around it by reverting 5e2d9fe temporarily on a local copy of master, merging that copy into my branch, then rerunning black on the same files. you could do the same, @jameslan

git branch # check that you are on your PR branch
git checkout master
git pull danielgtaylor master # pull from danielgtaylor repository, use your remote name
git checkout -b master_no_blacken
git revert 5e2d9febea2a2982b7cc372ef5f4b1b7d2233e79
git checkout - # go back to PR branch
git merge master_no_blacken
black betterproto/tests/test_inputs.py # rerun black on affected files (if needed)
git commit -am 'Blacken'

Thanks!

I just

$ git rebase master
$ git checkout --theirs betterproto/tests/test_inputs.py
$ git add betterproto/tests/test_inputs.py
$ git rebase --continue

To override the conflict.

@nat-n
Copy link
Collaborator

nat-n commented May 24, 2020

@jameslan Sorry about that. I did it to simplify some other merging, but should have been more cautious about touching so many lines.

Maybe there should be build time check for black correctness to keep everything in sync after the current round of PRs.

@nat-n nat-n merged commit 4a2baf3 into danielgtaylor:master May 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants