Search stopgap #1014
@@ -0,0 +1,101 @@
from collections import Counter, defaultdict

from future.utils import viewvalues, viewitems

from qiita_db.study import Study
from qiita_db.data import ProcessedData


def count_metadata(results, meta_cols):
    """Counts the metadata found in a search, and returns these counts

    Parameters
    ----------
    results : dict of lists of list
        results in the format returned by the qiita_db search obj
Reviewer: This doesn't indicate what expectations there are about the keys and values, just types. If this method should only ever, under any circumstance, be called from the search object, then it should be a private member method of that object. Otherwise, you have to assume that you do not know who the caller is. This is important both for the documentation here and for the wide possibility of a bug on line 36.
    meta_cols : list
        metadata column names searched for, as returned by qiita_db search obj

    Returns
    -------
    fullcount : dict of dicts
        counts for each found metadata value over all studies, in the format
        {meta_col1: {value1: count, value2: count, ...}, ...}
    studycount : dict of dict of dicts
        counts for each found metadata value for each study, in the format
        {study_id: {meta_col1: {value1: count, value2: count, ...}, ...}, ...}
    """
    def double_comprehension(results):
        for samples in viewvalues(results):
            for sample in samples:
Reviewer: Is this actually an iterable of sample IDs, or is it an iterable of metadata values?
Author: Both. First position is sample_id, rest are metadata values.
Reviewer: It would be more appropriate to describe this as metadata values then, since a sample id is a metadata value for the SampleID category.
                yield sample

    fullcount = {}
    # rearrange all samples so that each metadata column found is its own list
    meta_vals = zip(*double_comprehension(results))
Reviewer: Might as well use izip.
Author: izip breaks the Counter call below.
Reviewer: Ah, I see. There really is an incredible amount of work being done with complex nested data structures in this method. It smells like there is quite a bit of redundant work. Given the inputs, returns, and the docstring, wouldn't the following suffice? It should require a lot less iteration and allocation:

    n_cols = len(meta_cols)
    studycount = {}
    fullcount = {mc: defaultdict(int) for mc in meta_cols}
    for study_id, meta_vals in viewitems(results):
        metadata_counts = [defaultdict(int) for i in range(n_cols)]
        for cat_idx, meta_col, meta_val in izip(range(n_cols), meta_cols, meta_vals):
            metadata_counts[cat_idx][meta_val] += 1
            fullcount[meta_col][meta_val] += 1
        studycount[study_id] = {mc: mcounts for mc, mcounts in izip(meta_cols, metadata_counts)}

The above is still placing extreme faith that …
Author: The reason this is so expensive is that it's something tacked onto the search object and was not originally a function expected of it. As stated, this is kind of a stopgap so the results will be formatted how we expect the new search engine to work, without the actual refactoring that is going to happen once all the massive database changes stop happening. During that refactor we may move to numpy or some other holder, but for now we have this.
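The author's objection ("izip breaks the Counter call below") can be reproduced in isolation: itertools.izip returns an iterator, and an iterator cannot be indexed the way the materialized zip result is indexed later in the function. A minimal Python 3 analogue, where zip itself is already lazy, using made-up rows:

```python
from collections import Counter

# Made-up rows: [sample_id, ph] pairs standing in for search results.
rows = [["s1", 6.8], ["s2", 7.1]]

# Python 3's zip is lazy, like Python 2's itertools.izip: it yields an
# iterator, so indexing it (as meta_vals[pos + 1] does) raises TypeError.
meta_vals = zip(*rows)
try:
    meta_vals[1]
except TypeError:
    print("zip object is not subscriptable")

# Materializing the transpose restores indexing, and Counter can then
# count one metadata column at a time.
meta_vals = list(zip(*rows))
print(meta_vals[1])           # (6.8, 7.1)
print(Counter(meta_vals[1]))  # Counter({6.8: 1, 7.1: 1})
```

This is why the diff calls `zip` (a list on Python 2) rather than adopting the lazy variant.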
    for pos, cat in enumerate(meta_cols):
Reviewer: It isn't assured that …
Author: It is by the search object, not this function.
Reviewer: This method is putting a lot of faith in its caller. You can see why this is a giant risk for nasty bugs going forward, right?
Author: Yes, but this is a stopgap since we need it to go forward, and hopefully the refactor will come soon after. It's a standard assumption we've always made with the search, though.
Reviewer: While the search method may state the values are in order on return, this method has no assurances about who is calling it and does not perform any validation on its inputs. What you're indicating is that this method ought to be a private static member method of the search object. That reduces the exposure, and the potential for a bug in this situation, but does not remove it.
Reviewer: Nevermind, I see what's going on here now.
Author: OK, I can move it into the search object.
        # use Counter object to count all metadata values for a column
        # pos+1 so we skip the sample names list
        fullcount[cat] = Counter(meta_vals[pos + 1])

    # Now get metadata counts for each study, removing sample ids as before
    studycount = {}
    for study_id in results:
        hold = {}
        # zip all samples for a given study so that each metadata column found
        # is its own list
        meta_vals = zip(*(sample for sample in results[study_id]))
Reviewer: This is the second time this type of operation is being performed. It's likely going to be pretty expensive, and suggests that …
Author: The issue is that the info comes out as metadata value by sample_id, and it needs to get rotated 90 degrees to count it by metadata column. This is the fastest way I could think of to do that.
        for pos, cat in enumerate(meta_cols):
            # use Counter object to count all metadata values for a column
            # pos+1 so we skip the sample names list
            hold[cat] = Counter(meta_vals[pos + 1])
        studycount[study_id] = hold

    return fullcount, studycount
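Stripped of the Python 2 compatibility shims, the transpose-then-count pattern the function uses can be sketched on a hypothetical result set (the study ids, sample ids, and categories below are invented for illustration):

```python
from collections import Counter

# Hypothetical search results: each row is [sample_id, metadata values...],
# keyed by study id, matching the shape count_metadata expects.
results = {
    1: [['1.S1', 'soil', 6.8], ['1.S2', 'soil', 6.8]],
    2: [['2.S1', 'water', 7.1]],
}
meta_cols = ['env', 'ph']

# Flatten every study's rows, then rotate rows into columns with zip(*...).
all_samples = [row for rows in results.values() for row in rows]
columns = list(zip(*all_samples))

# columns[0] holds the sample ids; columns[pos + 1] holds the values of
# one metadata category, which Counter tallies in a single pass.
fullcount = {cat: Counter(columns[pos + 1])
             for pos, cat in enumerate(meta_cols)}
print(fullcount['env'])  # Counter({'soil': 2, 'water': 1})
```

The per-study counts are the same idea applied to each study's rows separately, which is why the diff performs the transpose twice.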


def filter_by_processed_data(results, datatypes=None):
    """Filters results to what is available in each processed data

    Parameters
Reviewer: Space please.
    ----------
    results : dict of lists of list
Reviewer: Same comment here, and on its description.
        results in the format returned by the qiita_db search obj
    datatypes : list of str, optional
        Datatypes to selectively return. Default all datatypes available

    Returns
Reviewer: Space please.
    -------
    study_proc_ids : dict of dicts of lists
        Processed data ids with samples for each study, in the format
        {study_id: {datatype: [proc_id, proc_id, ...], ...}, ...}
    proc_data_samples : dict of lists
        Samples available in each processed data id, in the format
        {proc_data_id: [samp_id1, samp_id2, ...], ...}
    samples_meta : dict of lists
Reviewer: From the code below, the sample IDs contained in here will be a superset of the sample IDs described in …
        sample metadata in same order as the metadata given by search
        Format {samp_id: [meta1, meta2, ...], ...}
    """
    study_proc_ids = {}
    proc_data_samples = {}
    samples_meta = {}
    for study_id, study_samples in viewitems(results):
        study = Study(study_id)
        samples_meta.update({s[0]: s[1:] for s in study_samples})
        study_proc_ids[study_id] = defaultdict(list)
        for proc_data_id in study.processed_data():
            proc_data = ProcessedData(proc_data_id)
            datatype = proc_data.data_type()
            # skip processed data if it doesn't fit the given datatypes
            if datatypes is not None and datatype not in datatypes:
Reviewer: This conditional can be simplified if …
Reviewer: Actually, that would not work based on the description in the docstring. You would instead want to set …
                continue
            samps_available = proc_data.samples
Reviewer: Is this variable lookup friendly? E.g., …
Author: Yes, it's a set.
            hold = [s[0] for s in study_samples if s[0] in samps_available]
Reviewer: So the goal here is to see if the intersection of the sample IDs in the study and the processed/available samples is not the empty set? What about:

    study_sample_ids = {s[0] for s in study_samples}
    if samps_available.intersection(study_sample_ids):
        proc_data_samples[proc_data_id] = study_sample_ids
        study_proc_ids[study_id][datatype].append(proc_data_id)

Author: I had it that way originally but thought a list comprehension was faster. Testing now.
Reviewer: In general it is better to focus on the expressiveness of the code rather than performance. If performance were the concern for this method, there are quite a few other things that should be addressed first.
Author: Fair point.
            if len(hold) == 0:
                # all samples filtered so remove it as a result
                del(proc_data_samples[proc_data_id])
Reviewer: There is no assurance that …
            else:
                proc_data_samples[proc_data_id] = hold
                # add the processed data to the list for the study
                study_proc_ids[study_id][datatype].append(proc_data_id)
    return study_proc_ids, proc_data_samples, samples_meta
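The filtering logic above, including the datatype guard and the membership test the review debates, can be sketched without the qiita_db objects. Everything here is hypothetical stand-in data; `samps_available` plays the role of `ProcessedData.samples`, which the author confirms is a set:

```python
from collections import defaultdict

# Hypothetical inputs for a single study.
study_samples = [['1.S1', 6.8], ['1.S2', 7.1], ['1.S3', 6.9]]
proc_data = {10: ('18S', {'1.S2', '1.S3'}),   # proc_id: (datatype, samples)
             11: ('16S', {'1.S1'})}
datatypes = ['18S']                            # optional datatype filter

proc_data_samples = {}
study_proc_ids = defaultdict(list)
for proc_id, (datatype, samps_available) in proc_data.items():
    # skip processed data that doesn't match the requested datatypes
    if datatypes is not None and datatype not in datatypes:
        continue
    # keep only the searched samples present in this processed data,
    # preserving search order (the list-comprehension form from the diff)
    hold = [s[0] for s in study_samples if s[0] in samps_available]
    if hold:
        proc_data_samples[proc_id] = hold
        study_proc_ids[datatype].append(proc_id)

print(proc_data_samples)     # {10: ['1.S2', '1.S3']}
print(dict(study_proc_ids))  # {'18S': [10]}
```

Note the sketch also sidesteps the `del` on a key that was never inserted: when `hold` is empty, the entry is simply never added. The reviewer's set-intersection variant answers a slightly different question, since it would store all of the study's sample ids rather than just the matching ones.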
@@ -0,0 +1,53 @@
from unittest import TestCase, main

from qiita_core.util import qiita_test_checker
from qiita_db.search import QiitaStudySearch
from qiita_db.user import User
from qiita_ware.search import count_metadata, filter_by_processed_data


@qiita_test_checker()
class SearchTest(TestCase):
    """Tests that the search helpers work as expected"""
    def test_count_metadata(self):
        search = QiitaStudySearch()
        results, meta_cols = search('study_id = 1 AND ph > 0',
Reviewer: It would be much more useful if the inputs to …
                                    User('test@foo.bar'))
        fullcount, studycount = count_metadata(results, meta_cols)
        expfull = {'study_id': {1: 27}, 'ph': {6.82: 10, 6.8: 9, 6.94: 8}}
        expstudy = {1: {'study_id': {1: 27},
                        'ph': {6.82: 10, 6.8: 9, 6.94: 8}}}
        self.assertEqual(fullcount, expfull)
        self.assertEqual(studycount, expstudy)

    def test_filter_by_processed_data(self):
        search = QiitaStudySearch()
        results, meta_cols = search('study_id = 1', User('test@foo.bar'))
Reviewer: Same comment here.
        study_proc_ids, proc_data_samples, meta = filter_by_processed_data(
            results)
        exp_spid = {1: {'18S': [1]}}
        exp_pds = {1: [
            '1.SKM7.640188', '1.SKD9.640182', '1.SKM8.640201', '1.SKB8.640193',
            '1.SKD2.640178', '1.SKM3.640197', '1.SKM4.640180', '1.SKB9.640200',
            '1.SKB4.640189', '1.SKB5.640181', '1.SKB6.640176', '1.SKM2.640199',
            '1.SKM5.640177', '1.SKB1.640202', '1.SKD8.640184', '1.SKD4.640185',
            '1.SKB3.640195', '1.SKM1.640183', '1.SKB7.640196', '1.SKD3.640198',
            '1.SKD7.640191', '1.SKD6.640190', '1.SKB2.640194', '1.SKM9.640192',
            '1.SKM6.640187', '1.SKD5.640186', '1.SKD1.640179']}
        exp_meta = {
            '1.SKM7.640188': [1], '1.SKD9.640182': [1], '1.SKM8.640201': [1],
            '1.SKB8.640193': [1], '1.SKD2.640178': [1], '1.SKM3.640197': [1],
            '1.SKM4.640180': [1], '1.SKB9.640200': [1], '1.SKB4.640189': [1],
            '1.SKB5.640181': [1], '1.SKB6.640176': [1], '1.SKM2.640199': [1],
            '1.SKM5.640177': [1], '1.SKB1.640202': [1], '1.SKD8.640184': [1],
            '1.SKD4.640185': [1], '1.SKB3.640195': [1], '1.SKM1.640183': [1],
            '1.SKB7.640196': [1], '1.SKD3.640198': [1], '1.SKD7.640191': [1],
            '1.SKD6.640190': [1], '1.SKB2.640194': [1], '1.SKM9.640192': [1],
            '1.SKM6.640187': [1], '1.SKD5.640186': [1], '1.SKD1.640179': [1]}
        self.assertEqual(study_proc_ids, exp_spid)
        self.assertEqual(proc_data_samples, exp_pds)
        self.assertEqual(meta, exp_meta)


if __name__ == '__main__':
    main()
Reviewer: It is misleading to name an input parameter as results.