Skip to content

gh-98766: Lazier ChainMap methods #98767

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

zrothberg
Copy link

@zrothberg zrothberg commented Oct 27, 2022

This was the best I was able to get performance without making any new methods on dict to prevent multiple hashes being calculated. I can take a stab at adding a new function to dict if there is desire to get back to the original performance of set unions. I imagine chainmap is one of the most heavily used types in the Collections module and iter is probably one of the most heavily called functions from it due to ABC.Mapping using it internally. The changes should be safe to backport to anywhere #23569 was merged as it makes no outward behavioral changes.

I can take a stab at making a better function in either c that can be imported or as a method on the dict class later. I have been working with a lot of custom dictionary like objects late and have run into this ordering breaking issue a bunch of times so I imagine some kind of standard library solution to it would be helpful for anyone making something along these lines.

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@ghost
Copy link

ghost commented Oct 27, 2022

All commit authors signed the Contributor License Agreement.
CLA signed

@AlexWaygood AlexWaygood added type-feature A feature request or enhancement performance Performance or resource usage labels Oct 27, 2022
@AlexWaygood AlexWaygood linked an issue Oct 27, 2022 that may be closed by this pull request
@rhettinger rhettinger self-assigned this Oct 27, 2022
@zrothberg
Copy link
Author

@pochmann Yea I don't think I can make it lazier without some kind of unique hash method being added to itertools. You have to construct a set to track all the keys seen minimally and adding them lazily will be much slower then constructing a dict from an iterable just from the jumps back and forth. I tried like 8 different ways and this was the fastests outside of the original set union that doesn't preserve ordering.

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

Copy link
Contributor

@rhettinger rhettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the PR can be salvaged. The ONLY way for dict.fromkeys() to reuse the hash values is to have inputs that are exact sets or dicts and only the latter will preserve order.

Also, this doesn't make "lazier chainmap methods". It simply creates an iterator and then consumes it immediately. It doesn't confer some extra magic that isn't already done by feeding in a dictionary directly. Instead, it does the opposite and loses the magic of reusing the stored hash values and taking advantage of the dict.update() fast paths.

d = {}
for mapping in reversed(self.maps):
d.update(dict.fromkeys(mapping)) # reuses stored hash values if possible
lazy_views = (key_view.keys() for key_view in reversed(self.maps))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variables here are misnamed and don't reflect an understanding of what is being done. The contents of self.maps are mappings, not views. The views aren't created until the keys() method is called. The views themselves aren't "lazy", they are a pass-through to the underlying dict.

This line should read:

views = (mapping.keys() for mapping in reversed(self.maps))

Also, note this is an API change. We've tried to make minimal assumptions about the underlying mappings which may not be real dicts. In particular, we haven't demanded that they support having views.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. I am going to split this into two different PRs one for iter and one for the copy methods.

For iter I went through and profiled multiple dict and dict subclass objects and found a method that performs the same number of hashes in total as the current fromkeys method while also reducing the number of intermediate dicts like the above does. It appears to maintain the current behavior and reduces down to the original basecase for a list of dictionaries (d.update). Do you want me to open a new PR for that right away or to just link the repo?

There is some strange (to me) inconsistencies between update and fromkeys when it comes to hashreuse. I am unsure if that is intentional or not.

I can make a gist for when set, and dict and their subclasses reuse hash or not.

for mapping in reversed(self.maps):
d.update(dict.fromkeys(mapping)) # reuses stored hash values if possible
lazy_views = (key_view.keys() for key_view in reversed(self.maps))
d = dict.fromkeys(_chain.from_iterable(lazy_views))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't do what you think it does. The chain extracts the keys from the views and then passes those through to dict.fromkeys(). That defeats the goal of retaining the precomputed hash values stored in the underlying mapping.

The dict.fromkeys() class method can use the precomputed hash values ONLY if the input is an exact dict or exact set. An iterable of keys will not suffice.

Also, there isn't any "laziness" at all in these two lines. You make a generator expression and then consume it immediately. It is not much different from converting sum([10, 20, 30]) to sum(iter([10, 20, 30]))` and calling it "lazy".

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@@ -1032,7 +1032,7 @@ def fromkeys(cls, iterable, *args):

def copy(self):
'New ChainMap or subclass with a new copy of maps[0] and refs to maps[1:]'
return self.__class__(self.maps[0].copy(), *self.maps[1:])
return self.__class__(self.maps[0].copy(), *_islice(self.maps, 1, None))
Copy link
Contributor

@rhettinger rhettinger Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't do anything useful. The time saved by not making a list is offset by the time to create the islice object.

Also, this code is stylistically bad. The old code is plain and beautiful. In contrast, the proposed change is arcane and is well outside normal code practices. It is rough on the eyes as well.

One other note. The *args form is immediately converted to a tuple by the interpreter. So, you aren't getting any "laziness" from this change. To me, it seems mostly pointless.

@@ -1050,7 +1050,7 @@ def new_child(self, m=None, **kwargs): # like Django's Context.push()
@property
def parents(self): # like Django's Context.pop()
'New ChainMap from maps[1:].'
return self.__class__(*self.maps[1:])
return self.__class__(*_islice(self.maps, 1, None))
Copy link
Contributor

@rhettinger rhettinger Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't do anything useful. The time saved by not making a list is offset by the time to create the islice object.

Also, this code is stylistically bad. The old code is plain and beautiful. In contrast, the proposed change is arcane and is well outside normal code practices. It is rough on the eyes as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think there is a better way to handle this anyway. I probably should have left it off the PR as it wasn't as heavily tested as the iter behavior.

@rhettinger rhettinger closed this Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting changes performance Performance or resource usage type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speedup ChainMap
5 participants