-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
gh-98766: Lazier ChainMap methods #98767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool. |
@pochmann Yea I don't think I can make it lazier without some kind of unique hash method being added to itertools. You have to construct a set to track all the keys seen minimally and adding them lazily will be much slower then constructing a dict from an iterable just from the jumps back and forth. I tried like 8 different ways and this was the fastests outside of the original set union that doesn't preserve ordering. |
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the PR can be salvaged. The ONLY way for dict.fromkeys()
to reuse the hash values is to have inputs that are exact sets or dicts and only the latter will preserve order.
Also, this doesn't make "lazier chainmap methods". It simply creates an iterator and then consumes it immediately. It doesn't confer some extra magic that isn't already done by feeding in a dictionary directly. Instead, it does the opposite and loses the magic of reusing the stored hash values and taking advantage of the dict.update()
fast paths.
d = {} | ||
for mapping in reversed(self.maps): | ||
d.update(dict.fromkeys(mapping)) # reuses stored hash values if possible | ||
lazy_views = (key_view.keys() for key_view in reversed(self.maps)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variables here are misnamed and don't reflect an understanding of what is being done. The contents of self.maps
are mappings, not views. The views aren't created until the keys()
method is called. The views themselves aren't "lazy", they are a pass-through to the underlying dict.
This line should read:
views = (mapping.keys() for mapping in reversed(self.maps))
Also, note this is an API change. We've tried to make minimal assumptions about the underlying mappings which may not be real dicts. In particular, we haven't demanded that they support having views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback. I am going to split this into two different PRs one for iter and one for the copy methods.
For iter I went through and profiled multiple dict and dict subclass objects and found a method that performs the same number of hashes in total as the current fromkeys method while also reducing the number of intermediate dicts like the above does. It appears to maintain the current behavior and reduces down to the original basecase for a list of dictionaries (d.update). Do you want me to open a new PR for that right away or to just link the repo?
There is some strange (to me) inconsistencies between update and fromkeys when it comes to hashreuse. I am unsure if that is intentional or not.
I can make a gist for when set, and dict and their subclasses reuse hash or not.
for mapping in reversed(self.maps): | ||
d.update(dict.fromkeys(mapping)) # reuses stored hash values if possible | ||
lazy_views = (key_view.keys() for key_view in reversed(self.maps)) | ||
d = dict.fromkeys(_chain.from_iterable(lazy_views)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't do what you think it does. The chain extracts the keys from the views and then passes those through to dict.fromkeys()
. That defeats the goal of retaining the precomputed hash values stored in the underlying mapping.
The dict.fromkeys()
class method can use the precomputed hash values ONLY if the input is an exact dict or exact set. An iterable of keys will not suffice.
Also, there isn't any "laziness" at all in these two lines. You make a generator expression and then consume it immediately. It is not much different from converting sum([10, 20, 30])
to sum(iter([10, 20, 30]))` and calling it "lazy".
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
@@ -1032,7 +1032,7 @@ def fromkeys(cls, iterable, *args): | |||
|
|||
def copy(self): | |||
'New ChainMap or subclass with a new copy of maps[0] and refs to maps[1:]' | |||
return self.__class__(self.maps[0].copy(), *self.maps[1:]) | |||
return self.__class__(self.maps[0].copy(), *_islice(self.maps, 1, None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change doesn't do anything useful. The time saved by not making a list is offset by the time to create the islice
object.
Also, this code is stylistically bad. The old code is plain and beautiful. In contrast, the proposed change is arcane and is well outside normal code practices. It is rough on the eyes as well.
One other note. The *args
form is immediately converted to a tuple by the interpreter. So, you aren't getting any "laziness" from this change. To me, it seems mostly pointless.
@@ -1050,7 +1050,7 @@ def new_child(self, m=None, **kwargs): # like Django's Context.push() | |||
@property | |||
def parents(self): # like Django's Context.pop() | |||
'New ChainMap from maps[1:].' | |||
return self.__class__(*self.maps[1:]) | |||
return self.__class__(*_islice(self.maps, 1, None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change doesn't do anything useful. The time saved by not making a list is offset by the time to create the islice
object.
Also, this code is stylistically bad. The old code is plain and beautiful. In contrast, the proposed change is arcane and is well outside normal code practices. It is rough on the eyes as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think there is a better way to handle this anyway. I probably should have left it off the PR as it wasn't as heavily tested as the iter behavior.
This was the best I was able to get performance without making any new methods on dict to prevent multiple hashes being calculated. I can take a stab at adding a new function to dict if there is desire to get back to the original performance of set unions. I imagine chainmap is one of the most heavily used types in the Collections module and iter is probably one of the most heavily called functions from it due to ABC.Mapping using it internally. The changes should be safe to backport to anywhere #23569 was merged as it makes no outward behavioral changes.
I can take a stab at making a better function in either c that can be imported or as a method on the dict class later. I have been working with a lot of custom dictionary like objects late and have run into this ordering breaking issue a bunch of times so I imagine some kind of standard library solution to it would be helpful for anyone making something along these lines.