Search stopgap #1014
Conversation
```python
fullcount = {}
# rearrange all samples so that each metadata column found is its own list
meta_vals = zip(*double_comprehension(results))
```
might as well use izip
izip breaks the Counter call below
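The full method isn't shown here, but the likely failure mode is that `izip` returns a one-shot iterator, so once the zipped columns have been walked they are exhausted and a later `Counter` call sees nothing. A minimal Python 3 sketch of that exhaustion problem (Python 3's built-in `zip` is lazy like Python 2's `izip`; the sample `results` dict is invented for illustration):

```python
from collections import Counter

# Dummy data: two studies, each with one value per metadata column.
results = {'s1': ['a', 'x'], 's2': ['a', 'y']}

# zip() in Python 2 materializes a list, so the columns can be walked
# more than once; izip() (like Python 3's zip) yields a one-shot
# iterator that is empty after the first pass.
lazy_cols = zip(*results.values())
first_pass = [Counter(col) for col in lazy_cols]
second_pass = [Counter(col) for col in lazy_cols]  # iterator already exhausted

assert len(first_pass) == 2
assert second_pass == []
```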
Ah, I see. There is an incredible amount of work being done with complex nested data structures in this method, and it smells like much of it is redundant. Given the inputs, the return value, and the docstring, wouldn't the following suffice? It should require far less iteration and allocation:
```python
n_cols = len(meta_cols)
studycount = {}
fullcount = {mc: defaultdict(int) for mc in meta_cols}
for study_id, meta_vals in viewitems(results):
    metadata_counts = [defaultdict(int) for i in range(n_cols)]
    for cat_idx, meta_col, meta_val in izip(range(n_cols), meta_cols, meta_vals):
        metadata_counts[cat_idx][meta_val] += 1
        fullcount[meta_col][meta_val] += 1
    studycount[study_id] = {mc: mcounts for mc, mcounts in izip(meta_cols, metadata_counts)}
```
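The sketch above uses Python 2 helpers (`izip`, `viewitems`); a runnable Python 3 rendering with dummy data (`meta_cols` and `results` are invented here purely for illustration):

```python
from collections import defaultdict

# Dummy inputs: each study maps to one value per metadata column,
# in the same order as meta_cols.
meta_cols = ['env', 'host']
results = {'study1': ['soil', 'mouse'], 'study2': ['soil', 'human']}

n_cols = len(meta_cols)
studycount = {}
fullcount = {mc: defaultdict(int) for mc in meta_cols}
for study_id, meta_vals in results.items():  # viewitems() in Python 2
    counts = [defaultdict(int) for _ in range(n_cols)]
    # izip() in Python 2; builtin zip is already lazy in Python 3
    for idx, (col, val) in enumerate(zip(meta_cols, meta_vals)):
        counts[idx][val] += 1
        fullcount[col][val] += 1
    studycount[study_id] = dict(zip(meta_cols, counts))

assert fullcount['env']['soil'] == 2
assert studycount['study2']['host']['human'] == 1
```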
The above is still placing extreme faith that `meta_cols` is in order with `meta_vals`, performing redundant work, and holding a redundant representation. The latter two points are because `fullcount` really should be expressed as `reduce(add, studycount.values())`, however `dict` doesn't support that. What I'm wondering is actually if these data should be coming in as either `np.array` or `sp.sparse.spmatrix`?
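On the `reduce(add, ...)` point: plain `dict` has no `+`, but `collections.Counter` does, so holding the per-study counts as `Counter`s would let `fullcount` be a straight reduction. A sketch with invented data, flattened to a single column rather than the nested per-column structure used above:

```python
from collections import Counter
from functools import reduce
from operator import add

# Per-study counts held as Counter instead of plain dict;
# Counter supports +, so the full count is a simple reduction.
studycount = {
    's1': Counter({'a': 1}),
    's2': Counter({'a': 1, 'b': 2}),
}
fullcount = reduce(add, studycount.values())

assert fullcount == Counter({'a': 2, 'b': 2})
```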
The reason this is so expensive is that it's something tacked onto the search object and was not originally a function expected of it. As stated, this is a stopgap so the results will be formatted the way we expect the new search engine to work, without doing the actual refactoring, which will happen once all the massive database changes stop landing. During that refactor we may move to numpy or some other holder, but for now we have this.
```
Parameters
----------
results : dict of lists of list
```
it is misleading to name an input parameter `results`
completely refactored, should be good for merge
👍
Can someone do a final review/merge?
This takes the current search engine output and formats it into what we expect the refactored search to produce. This is needed because the processed data samples available may be a subset of the total samples, and the samples found by the search engine need to reflect that. It is also a stopgap: the search engine needs refactoring, but database and logic changes are blocking that, so this will sit on top for now to give us the required output until we can refactor the search engine.