-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Collection with lightweight _SearchIndexer. #667
Conversation
Codecov Report
@@ Coverage Diff @@
## next #667 +/- ##
==========================================
+ Coverage 86.06% 86.07% +0.01%
==========================================
Files 51 52 +1
Lines 5038 5122 +84
Branches 1109 1131 +22
==========================================
+ Hits 4336 4409 +73
- Misses 498 515 +17
+ Partials 204 198 -6
Continue to review full report at Codecov.
|
4620806
to
0086f9c
Compare
Benchmarks below! I only compared the "find/search" benchmarks below -- all the other benchmarks are basically unchanged between master, next, and this branch. Overall I would say the time spent searching appears to have dropped by 20-30% for lean search filters over small projects, and by 7-15% for richer search filters over larger projects. Not bad. edit: I added benchmarks for a newer commit, which are even better. More like 30-40% gains for some queries.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on!
Since this is largely a copy of the full Collection
class, is the plan to remove it in a subsequent step or are we keeping both? If we keep both I think we should make an attempt at using composition to avoid the code duplication.
signac/contrib/_slim_collection.py
Outdated
_PRIMARY_KEY = "_id" | ||
|
||
|
||
class _SlimCollection(dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some ideas for a slightly better name.
Note that the return value is often called index
. I think we had a similar class name somewhere else, but all of the indexing stuff will be removed, right? The full collection could use this class internally (composition).
class _SlimCollection(dict): | |
class _Index(dict): |
This would express a bit more directly what the difference between this and the Collection
class is:
class _SlimCollection(dict): | |
class _InMemoryCollection(dict): |
The full collection class could also be based on this, right? (inheritance)
class _SlimCollection(dict): | |
class _BaseCollection(dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@csadorf I appreciate these suggestions! I had a similar series of ideas -- I kept _SlimCollection
for the initial round review partly because I knew the name should change but didn't want to commit to a name until reviewing its use cases and getting feedback like what you provided. I lean towards _Index
, which should not conflict with any other internals now that we've removed other forms of indexing from the next
branch. My only hesitation is that it has a method build_index
, which returns a dict
that is organized by the given key. I wonder if that method (or the class) should be named differently. Otherwise the result of _Index.build_index(key)
is just a dict
with a particular structure, while it sounds like it should be a class factory method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I've renamed this class to _Index
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would strongly prefer a more disambiguated name. While I agree that _Index
is not currently conflicting with anything, I don't want to hobble our ability to use related names in the future. Otherwise we could run into the same issues that we now have in signac-flow where vague naming leads to ambiguity between concepts like groups, aggregates, bundles, etc. Having reviewed the code, since the critical operation is the ability to search the elements maybe it's something like a _SearchableIndex
? I would definitely like to avoid the term Collection
given a) the overloads associated with collections.abc (and now SyncedCollections) and b) we are no longer trying to specifically make this class behave so much like a MongoDB collection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree a more disambiguated name would be nice, but I was unable to come up with one that felt right. I am not opposed to _SearchableIndex
but would like some consensus. I suppose something like _SearchableKeystore
could also be used to avoid the term index
as I previously mentioned the awkwardness of build_index
returning something that is not related to the _Index
class. See also: https://en.wikipedia.org/wiki/Search_data_structure
As I mentioned in our dev meeting today, my plan is to remove the |
I will get around to reviewing this soon, but in the meantime @bdice do you still think
is viable and worth pursuing? What would be the pros and cons of doing that vs the current approach? I'm all for using the most stripped down solution that we can in 2.0. The other outstanding items in the PR description's checklist can definitely be addressed in a follow-up PR, but if there's a chance that we're going to end up removing this new class again because there's a simpler alternative it would be good to figure that out now. |
@vyasr dict-only with free functions for find/build_index is viable. “Worth pursuing” is up for debate. Pro: no class = simpler? Con: pure dict inputs must have a particular structure that has no centralized home for documentation without a class. The new class inherits directly from dict, so the semantics of the class would not change significantly if we switched to using free functions. There is no internal state besides the data in the index. |
But would those indexes not be the best argument for a dedicated class to maintain that state? Building those indexes are the primary purpose of the class, no? Also, keep in mind that the |
The For simplicity, I removed key-based indices from the internal state of the signac/signac/contrib/schema.py Lines 88 to 93 in 047333f
For us to see any kind of benefit from re-using the key-based indices, we would need to heavily refactor the |
047333f
to
c485021
Compare
@csadorf @vyasr This is ready for review. I would like to keep the current PR scope and hold off on switching to a pure |
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
752cebc
to
6187c01
Compare
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
* Add optimized _SlimCollection. * Further minimize _SlimCollection. * Remove unused methods from _SlimCollection. * Make _SlimCollection a dict. * Add tests. * Add test for invalid filter. * Change _SlimCollection to dict initialization semantics, remove _id from document in favor of being the dict key. * Update _SlimCollection tests to use dict semantics. * Improve docs, move build_index to method. * Update error handling. * Add test coverage to Collection for invalid filter. * Remove handling of unexpected errors. This exception handling was introduced in bfb7a27 but its purpose is not clear. * Remove if statement that is impossible to reach. * Remove optimization note about searching by primary key. * Add early exit for logical expressions. * Rename _SlimCollection to _Index. * Update docs. * Update docs. * Refactoring/PR comments. * More refactoring/PR comments. * Simplify error check. * Use tighter logic in build_index. * Fix test name. * Update signac/contrib/_index.py Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Use copy and pop optimization. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com> * Revise documentation. * Use isinstance when True is expected for micro-optimization. * Remove extraneous helper function and unnecessary data copy operation. * Apply suggestions from code review * Rename to _SearchIndexer. Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Description
This PR (draft) minimizes the Collection class for the two use cases it has in signac's internals: detecting schema and finding job ids.
The
Collection
class is 1174 lines and the_SearchIndexer
class is 273 lines (including a helper method that I inlined, which makes it appear larger).To-do:
_id
key like[{"a": 0, "_id": "abcdef"}]
to a dict like{"abcdef": {"a": 0}}
. This simplifies several areas of the code and makes_SearchIndexer
's semantics dict-like._SearchIndexer
class to bare minimum amount of code_index.py
_SearchIndexer
and use a plaindict
with helper functions?find
andbuild_index
, and instead defining them as free functions that act on adict
with certain assumptions about structure. Do we like this idea?Collection
from public APICollection
.Motivation and Context
Resolves #664.
Types of Changes
1The change breaks (or has the potential to break) existing functionality.
Checklist:
If necessary: