Skip to content

Search stopgap #1014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 1, 2015
Merged

Search stopgap #1014

merged 8 commits into from
Apr 1, 2015

Conversation

squirrelo
Copy link
Contributor

This takes the current search engine output and formats it into what we expect the refactored search to spit out. This is needed because the processed data samples available may be a subset of the total samples available, and this needs to be reflected by the found samples in the search engine. This is also a stopgap because the search engine needs refactoring but there are database and logic changes blocking that, so this will sit on top for now to give us the output required until we can refactor the search engine.


fullcount = {}
# rearrange all samples so that each metadata column found is its own list
meta_vals = zip(*double_comprehension(results))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might as well use izip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

izip breaks the Counter call below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see. There really is an incredible amount of work being done with complex nested data structures in this method. It smells like there is quite a bit of redundant work. Given the inputs, returns, and the docstring, wouldn't the following suffice? It should require a lot less iteration and allocation

n_cols = len(meta_cols)
studycount = {}
fullcount = {mc: defaultdict(int) for mc in meta_cols}
for study_id, meta_vals in viewitems(results):
    metadata_counts = [defaultdict(int) for i in range(n_cols)]
    for cat_idx, meta_col, meta_val in izip(range(n_cols), meta_cols, meta_vals):
        metadata_counts[cat_idx][meta_val] += 1
        fullcount[meta_col][meta_val] += 1
    studycount[study_id] = {mc: mcounts for mc, mcounts in izip(meta_cols, metadata_counts)}

The above is still placing extreme faith that meta_cols is in order with meta_vals, performing redundant work, and redundant representation. The latter two points are because fullcount really should be expressed as reduce(add, studycount.values()), however dict doesn't support that. What I'm wondering is actually if these data should be coming in as either np.array or sp.sparse.spmatrix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this is so expensive is because it's something tacked onto the search object and was not originally a function expected of it. As stated, this is kind of a stopgap so the results will be formatted how we expect the new search engine to work, without the actual re factoring that is going to happen once all the massive database changes stop happening. During that refactor we may move to numpy or some other holder, but for now we have this.


Parameters
----------
results : dict of lists of list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is misleading to name an input parameter as results

@coveralls
Copy link

Coverage Status

Coverage increased (+0.08%) to 79.17% when pulling 8edfa56 on squirrelo:search-stopgap into 54f63e9 on biocore:master.

@squirrelo
Copy link
Contributor Author

completely refactored, should be good for merge

@wasade
Copy link
Contributor

wasade commented Mar 30, 2015

👍

@squirrelo
Copy link
Contributor Author

Can someone do a final review/merge?

antgonza added a commit that referenced this pull request Apr 1, 2015
@antgonza antgonza merged commit e122e49 into qiita-spots:master Apr 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants