Search stopgap #1014

Merged: 8 commits, Apr 1, 2015
Changes from 5 commits
21 changes: 21 additions & 0 deletions qiita_db/data.py
@@ -1173,3 +1173,24 @@ def processed_date(self):
return conn_handler.execute_fetchone(
"SELECT processed_date FROM qiita.{0} WHERE "
"processed_data_id=%s".format(self._table), (self.id,))[0]

@property
def samples(self):
"""Return the samples available according to prep template

Returns
-------
set
all sample_ids available for the processed data
"""
conn_handler = SQLConnectionHandler()
# Get the prep template id for the dynamic table lookup
sql = """SELECT ptp.prep_template_id FROM
qiita.prep_template_preprocessed_data ptp JOIN
qiita.preprocessed_processed_data ppd USING (preprocessed_data_id)
WHERE ppd.processed_data_id = %s"""
prep_id = conn_handler.execute_fetchone(sql, [self._id])[0]

# Get samples from dynamic table
sql = "SELECT sample_id FROM qiita.prep_%d" % prep_id
return set(s[0] for s in conn_handler.execute_fetchall(sql))
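
(Hypothetical usage of the new property — the sample id assumes the demo study fixture used in the tests below:)

from qiita_db.data import ProcessedData

pd = ProcessedData(1)
'1.SKB2.640194' in pd.samples  # True; samples is a set, so membership checks are cheap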
12 changes: 12 additions & 0 deletions qiita_db/test/test_data.py
@@ -868,6 +868,18 @@ def test_processed_date(self):
pd = ProcessedData(1)
self.assertEqual(pd.processed_date, datetime(2012, 10, 1, 9, 30, 27))

def test_samples(self):
pd = ProcessedData(1)
obs = pd.samples
exp = {
'1.SKB2.640194', '1.SKM4.640180', '1.SKB3.640195', '1.SKB6.640176',
'1.SKD6.640190', '1.SKM6.640187', '1.SKD9.640182', '1.SKM8.640201',
'1.SKM2.640199', '1.SKD2.640178', '1.SKB7.640196', '1.SKD4.640185',
'1.SKB8.640193', '1.SKM3.640197', '1.SKD5.640186', '1.SKB1.640202',
'1.SKM1.640183', '1.SKD1.640179', '1.SKD3.640198', '1.SKB5.640181',
'1.SKB4.640189', '1.SKB9.640200', '1.SKM9.640192', '1.SKD8.640184',
'1.SKM5.640177', '1.SKM7.640188', '1.SKD7.640191'}
self.assertEqual(obs, exp)

if __name__ == '__main__':
main()
101 changes: 101 additions & 0 deletions qiita_ware/search.py
@@ -0,0 +1,101 @@
from collections import Counter, defaultdict

from future.utils import viewvalues, viewitems

from qiita_db.study import Study
from qiita_db.data import ProcessedData


def count_metadata(results, meta_cols):
"""Counts the metadata found in a search, and returns these counts

Parameters
----------
results : dict of lists of list
Contributor: it is misleading to name an input parameter as results

results in the format returned by the qiita_db search obj
Contributor: this doesn't indicate what expectations there are about the keys and values, just the types. If this method should only ever, under any circumstance, be called from the search object, then it should be a private member method of that object. Otherwise, you have to assume that you do not know who the caller is. This is important both for the documentation here and for the wide possibility of a bug on line 36.
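
(For illustration, a hypothetical input echoing the test fixture — this is the assumed shape, not something documented by the search object itself:)

# keyed by study_id; each row is [sample_id, meta_val, meta_val, ...],
# with positions 1..n aligned with meta_cols
results = {1: [['1.SKB2.640194', 1, 6.82],
               ['1.SKM4.640180', 1, 6.8]]}
meta_cols = ['study_id', 'ph']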

meta_cols : list
metadata column names searched for, as returned by qiita_db search obj

Returns
-------
fullcount : dict of dicts
counts for each found metadata value over all studies, in the format
{meta_col1: {value1: count, value2: count, ...}, ...}
studycount : dict of dict of dicts
counts for each found metadata value for each study, in the format
{study_id: {meta_col1: {value1: count, value2: count, ...}, ...}, ...}
"""
def double_comprehension(results):
for samples in viewvalues(results):
for sample in samples:
Contributor: is this actually an iterable of sample IDs or is it an iterable of metadata values?

Contributor Author: Both. First position is sample_id, rest are metadata values.

Contributor: it would be more appropriate to describe this as metadata values then, since a sample id is a metadata value for the SampleID category

yield sample

fullcount = {}
# rearrange all samples so that each metadata column found is its own list
meta_vals = zip(*double_comprehension(results))
Contributor: might as well use izip

Contributor Author: izip breaks the Counter call below
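
(A minimal Python 2 sketch of why, with hypothetical rows — zip materializes a list that the meta_vals[pos + 1] indexing below can use, while izip returns a one-shot iterator that cannot be indexed:)

from itertools import izip

rows = [['s1', 1, 6.8], ['s2', 1, 6.9]]
zip(*rows)[1]   # (1, 1) -- a real list, so indexing works
izip(*rows)[1]  # TypeError -- iterators do not support indexing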

Contributor: ah, I see. There really is an incredible amount of work being done with complex nested data structures in this method. It smells like there is quite a bit of redundant work. Given the inputs, returns, and the docstring, wouldn't the following suffice? It should require a lot less iteration and allocation:

n_cols = len(meta_cols)
studycount = {}
fullcount = {mc: defaultdict(int) for mc in meta_cols}
for study_id, meta_vals in viewitems(results):
    metadata_counts = [defaultdict(int) for i in range(n_cols)]
    for cat_idx, meta_col, meta_val in izip(range(n_cols), meta_cols, meta_vals):
        metadata_counts[cat_idx][meta_val] += 1
        fullcount[meta_col][meta_val] += 1
    studycount[study_id] = {mc: mcounts for mc, mcounts in izip(meta_cols, metadata_counts)}

The above is still placing extreme faith that meta_cols is in order with meta_vals, performing redundant work, and using a redundant representation. The latter two points are because fullcount really should be expressed as reduce(add, studycount.values()); however, dict doesn't support that. What I'm actually wondering is whether these data should be coming in as either np.array or sp.sparse.spmatrix.

Contributor Author: The reason this is so expensive is that it's something tacked onto the search object and was not originally a function expected of it. As stated, this is kind of a stopgap so the results will be formatted the way we expect the new search engine to work, without the actual refactoring that is going to happen once all the massive database changes stop happening. During that refactor we may move to numpy or some other holder, but for now we have this.

for pos, cat in enumerate(meta_cols):
Contributor: it isn't assured that meta_cols is in the same order as meta_vals

Contributor Author: It is by the search object, not this function.

Contributor: This method is putting a lot of faith in its caller. You can see why this is a giant risk for nasty bugs going forward, right?

Contributor Author: Yes, but this is a stopgap since we need it to go forward, and hopefully the refactor will come soon after. It's a standard assumption we've always made with the search, though.

Contributor: While the search method may state the values are in order on return, this method has no assurances about who is calling it and does not perform any validation on its inputs. What you're indicating is that this method ought to be a private static member method of the search object. That reduces the exposure, and the potential for a bug in this situation, but does not remove it.

Contributor: In addition, since results is a dict, no order is assured and it is mutable. What that means is that IF the object is unmolested, then the order will remain unchanged. If ANY modification happens, the order is not assured to remain the same. Since this is mutable, you cannot assume it will never be modified.

Contributor: Never mind, I see what's going on here now. The lists of list contained in results are assured to remain in the same order. This is also a dense representation of a matrix, as it isn't jagged. Numpy ops here would probably be much more natural.

Contributor Author: ok, I can move it into the search object.

# use Counter object to count all metadata values for a column
# pos+1 so we skip the sample names list
fullcount[cat] = Counter(meta_vals[pos + 1])

# Now get metadata counts for each study, removing sample ids as before
studycount = {}
for study_id in results:
hold = {}
# zip all samples for a given study so that each metadata column found
# is its own list
meta_vals = zip(*(sample for sample in results[study_id]))
Contributor: this is the second time this type of operation is being performed. It's likely going to be pretty expensive, which suggests that results is not formatted in as usable a data structure as it should be?

Contributor Author: the issue is that the info comes out as metadata value by sample_id, and it needs to get rotated 90 degrees to count it by metadata column. This is the fastest way I could think of to do that.
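
(A small sketch of that rotation, with hypothetical rows:)

rows = [['s1', 1, 6.8], ['s2', 1, 6.9]]  # one row per sample
zip(*rows)  # [('s1', 's2'), (1, 1), (6.8, 6.9)] -- one tuple per metadata column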

for pos, cat in enumerate(meta_cols):
# use Counter object to count all metadata values for a column
# pos+1 so we skip the sample names list
hold[cat] = Counter(meta_vals[pos + 1])
studycount[study_id] = hold

return fullcount, studycount


def filter_by_processed_data(results, datatypes=None):
"""Filters results to what is available in each processed data

Parameters
Contributor: space please

----------
results : dict of lists of list
Contributor: same comment here, and on its description

results in the format returned by the qiita_db search obj
datatypes : list of str, optional
Datatypes to selectively return. Default all datatypes available

Returns
Contributor: space please

-------
study_proc_ids : dict of dicts of lists
Processed data ids with samples for each study, in the format
{study_id: {datatype: [proc_id, proc_id, ...], ...}, ...}
proc_data_samples : dict of lists
Samples available in each processed data id, in the format
{proc_data_id: [samp_id1, samp_id2, ...], ...}
samples_meta : dict of lists
Contributor: from the code below, the sample IDs contained here will be a superset of the sample IDs described in proc_data_samples. Are those objects expected to be 1-1? If not, it would be useful to include a mention of this in the docstring

sample metadata in same order as the metadata given by search
Format {samp_id: [meta1, meta2, ...], ...}
"""
study_proc_ids = {}
proc_data_samples = {}
samples_meta = {}
for study_id, study_samples in viewitems(results):
Contributor: is study_samples an iterable of the sample ids? From its use, it looks like it is the metadata values.

study = Study(study_id)
samples_meta.update({s[0]: s[1:] for s in study_samples})
Contributor: this structure holds {sample_id: [metadata_values]} without any correspondence to the metadata category the value is associated with?

study_proc_ids[study_id] = defaultdict(list)
for proc_data_id in study.processed_data():
proc_data = ProcessedData(proc_data_id)
datatype = proc_data.data_type()
# skip processed data if it doesn't fit the given datatypes
if datatypes is not None and datatype not in datatypes:
Contributor: this conditional can be simplified if datatypes is made the empty set at the beginning of the method in the case where datatypes is None

Contributor: actually, that would not work based on the description in the docstring. You would instead want to set datatypes = set(all_datatypes)

continue
samps_available = proc_data.samples
Contributor: Is this variable lookup friendly? E.g., set or dict or something comparable?

Contributor Author: yes, it's a set.

hold = [s[0] for s in study_samples if s[0] in samps_available]
Contributor: so the goal here is to see if the intersection of the sample IDs in the study and the processed/available samples is not the empty set? What about:

study_sample_ids = {s[0] for s in study_samples}
if samps_available.intersection(study_sample_ids):
    proc_data_samples[proc_data_id] = study_sample_ids
    study_proc_ids[study_id][datatype].append(proc_data_id)

Contributor Author: I had it that way originally but thought a list comprehension was faster. Testing now.

Contributor: In general it is better to focus on the expressiveness of the code rather than on performance. If performance were the concern for this method, then there are quite a few other things that should be addressed first.

Contributor Author: fair point.

if len(hold) == 0:
# all samples filtered so remove it as a result
del(proc_data_samples[proc_data_id])
Contributor: There is no assurance that proc_data_id has been set in proc_data_samples from what I can see, which would lead to a KeyError

else:
proc_data_samples[proc_data_id] = hold
# add the processed data to the list for the study
study_proc_ids[study_id][datatype].append(proc_data_id)
return study_proc_ids, proc_data_samples, samples_meta
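
(For reference, a sketch of the inner loop that avoids the KeyError flagged above by assigning only on the non-empty branch — not the code merged here:)

hold = [s[0] for s in study_samples if s[0] in samps_available]
if hold:
    proc_data_samples[proc_data_id] = hold
    study_proc_ids[study_id][datatype].append(proc_data_id)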
53 changes: 53 additions & 0 deletions qiita_ware/test/test_search.py
@@ -0,0 +1,53 @@
from unittest import TestCase, main

from qiita_core.util import qiita_test_checker
from qiita_db.search import QiitaStudySearch
from qiita_db.user import User
from qiita_ware.search import count_metadata, filter_by_processed_data


@qiita_test_checker()
class SearchTest(TestCase):
"""Tests that the search helpers work as expected"""
def test_count_metadata(self):
search = QiitaStudySearch()
results, meta_cols = search('study_id = 1 AND ph > 0',
Contributor: It would be much more useful if the inputs to count_metadata were explicitly described instead of fetched from the database. It is difficult to evaluate this test without knowing the inputs.

User('test@foo.bar'))
fullcount, studycount = count_metadata(results, meta_cols)
expfull = {'study_id': {1: 27}, 'ph': {6.82: 10, 6.8: 9, 6.94: 8}}
expstudy = {1: {'study_id': {1: 27},
'ph': {6.82: 10, 6.8: 9, 6.94: 8}}}
self.assertEqual(fullcount, expfull)
self.assertEqual(studycount, expstudy)

def test_filter_by_processed_data(self):
search = QiitaStudySearch()
results, meta_cols = search('study_id = 1', User('test@foo.bar'))
Contributor: same comment here

study_proc_ids, proc_data_samples, meta = filter_by_processed_data(
results)
exp_spid = {1: {'18S': [1]}}
exp_pds = {1: [
'1.SKM7.640188', '1.SKD9.640182', '1.SKM8.640201', '1.SKB8.640193',
'1.SKD2.640178', '1.SKM3.640197', '1.SKM4.640180', '1.SKB9.640200',
'1.SKB4.640189', '1.SKB5.640181', '1.SKB6.640176', '1.SKM2.640199',
'1.SKM5.640177', '1.SKB1.640202', '1.SKD8.640184', '1.SKD4.640185',
'1.SKB3.640195', '1.SKM1.640183', '1.SKB7.640196', '1.SKD3.640198',
'1.SKD7.640191', '1.SKD6.640190', '1.SKB2.640194', '1.SKM9.640192',
'1.SKM6.640187', '1.SKD5.640186', '1.SKD1.640179']}
exp_meta = {
'1.SKM7.640188': [1], '1.SKD9.640182': [1], '1.SKM8.640201': [1],
'1.SKB8.640193': [1], '1.SKD2.640178': [1], '1.SKM3.640197': [1],
'1.SKM4.640180': [1], '1.SKB9.640200': [1], '1.SKB4.640189': [1],
'1.SKB5.640181': [1], '1.SKB6.640176': [1], '1.SKM2.640199': [1],
'1.SKM5.640177': [1], '1.SKB1.640202': [1], '1.SKD8.640184': [1],
'1.SKD4.640185': [1], '1.SKB3.640195': [1], '1.SKM1.640183': [1],
'1.SKB7.640196': [1], '1.SKD3.640198': [1], '1.SKD7.640191': [1],
'1.SKD6.640190': [1], '1.SKB2.640194': [1], '1.SKM9.640192': [1],
'1.SKM6.640187': [1], '1.SKD5.640186': [1], '1.SKD1.640179': [1]}
self.assertEqual(study_proc_ids, exp_spid)
self.assertEqual(proc_data_samples, exp_pds)
self.assertEqual(meta, exp_meta)


if __name__ == '__main__':
main()