
Conversation

@antgonza
Collaborator

No description provided.

@AmandaBirmingham
Collaborator

Hi @antgonza : the timestamp change has just been merged into master :) so could you please integrate that into this PR and address any check failures that remain afterwards? Would love to review and merge this by the end of the week!

@antgonza
Collaborator Author

Wait on review, need to do another pass ...

@antgonza
Collaborator Author

Just restarted the build because locally the error is random. Not sure if it's related to other issues, but it seems like the plate number changes from 2 to 4 on some occasions. @AmandaBirmingham, any ideas?

AssertionError: 'samp[661 chars]late 2\tCannabis Soils\t1.SKB1.640202\tTest pl[40336 chars]\t\n' != 'samp[661 chars]late 4\tCannabis Soils\t1.SKB1.640202\tTest pl[40336 chars]\t\n'

@AmandaBirmingham
Collaborator

Yep, I strongly suspect I know the answer to this one. Having looked through the code, I believe that the ORDER BY study_id, sample_id, row_num, col_num clause in the big sql statement in generate_sample_sheet is not adequately constrained to ensure a deterministic output. Within the same study, the exact same sample_id (with the exact same row_num and col_num) can appear on multiple different plates, and without more constraint, which one is returned first will be random. There are apparently a lot of situations like this in the codebase, which were masked prior to the merge of PR 331 (Corrections of and expansions to unit test database) because all the tests were written using only a single test plate :(

However, that is not the only issue with this sql statement :), because I see that the known-good prep file does not order samples by sample id at all but rather by plate and then by position (first by column, then by row--the opposite of what the above sql does):

screen shot 2018-08-24 at 9 53 25 am

Since the goal is to reproduce the known-good file, generate_sample_sheet should follow this order.

@antgonza
Collaborator Author

Thanks, and sounds good; any suggestions on how to adequately constrain the SQL?

@AmandaBirmingham
Collaborator

That said, if you decide you want to open a new issue for the plate/column/row sorting instead of handling it in this PR, I'm ok with that. In that case, it would still be necessary to add an additional constraint to the sql ORDER BY statement, or else to explicitly sort the results in python afterwards, so that the output order becomes deterministic and is thus stable every time we run the tests :)

@AmandaBirmingham
Collaborator

Ooops, sorry, our messages crossed ...

@AmandaBirmingham
Collaborator

My initial guess would be that ordering by plate, then col_num, then row_num would do it. I don't know if there is actually supposed to be a study ordering ... the known-good file I have all comes from a single study, so I'd have to check that with the wet lab.

@wasade
Member

wasade commented Aug 24, 2018

Adding in the plate to ordering should do it, assuming the plate number is globally unique (to labman).

Organizing the output by study could be nice for subsequent debugging and human eyes. But, I can also see the value in having it ordered by physical location too.

@AmandaBirmingham
Collaborator

I think we will want to use the plate external_id (the user-assigned plate id), which I think is what @antgonza already has in plate and is constrained to be unique within labman during the table set-up in db_patch.sql:

CONSTRAINT idx_plate UNIQUE ( external_id )

I've reached out to Gail and the wet lab about whether they want additional ordering by study, but until/unless they say they do, I say we stick with straight-up positional ordering for now.
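A toy sqlite3 sketch of the nondeterminism being discussed (the table and values below are illustrative, not labman's actual schema): adding the unique plate external_id to the ORDER BY pins down which plate's row comes back first.

```python
import sqlite3

# Illustrative schema only -- not labman's real tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wells (external_id TEXT, sample_id TEXT, "
    "row_num INT, col_num INT)")
# The same sample at the same position on two different plates:
conn.executemany(
    "INSERT INTO wells VALUES (?, ?, ?, ?)",
    [("Test plate 2", "1.SKB1.640202", 1, 1),
     ("Test plate 4", "1.SKB1.640202", 1, 1)])

# Ordering only by sample/position leaves the plate order unspecified;
# the engine is free to return either plate first.
ambiguous = conn.execute(
    "SELECT external_id FROM wells "
    "ORDER BY sample_id, row_num, col_num").fetchall()

# Adding the unique external_id to the ORDER BY makes the order deterministic.
deterministic = conn.execute(
    "SELECT external_id FROM wells "
    "ORDER BY external_id, col_num, row_num").fetchall()
print(deterministic)  # always [('Test plate 2',), ('Test plate 4',)]
```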

@antgonza
Collaborator Author

For consistency, should we do sample_plate, primer_plate, col_num, row_num or do you have any other suggestions?

@antgonza
Collaborator Author

Note that the sorting in SQL is ignored once everything moves to a dataframe, so I decided to remove that sorting and use the pandas sorting ...
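For illustration, a pandas sketch of the positional sort discussed above (the column names and values here are assumptions, not necessarily what the real frame contains):

```python
import pandas as pd

# Hypothetical query results; the real frame comes from generate_sample_sheet.
df = pd.DataFrame([
    {"sample_plate": "Test plate 2", "primer_plate": "P1", "col_num": 10, "row_num": 7},
    {"sample_plate": "Test plate 2", "primer_plate": "P1", "col_num": 1, "row_num": 7},
    {"sample_plate": "Test plate 1", "primer_plate": "P1", "col_num": 12, "row_num": 1},
])

# Sort by plate, then column, then row (the positional order of the known-good
# file); sorting numeric col_num avoids the G1 < G10 < G2 problem that
# lexicographic well ids have.
df = df.sort_values(
    ["sample_plate", "primer_plate", "col_num", "row_num"]).reset_index(drop=True)
print(df.iloc[0]["sample_plate"])  # Test plate 1
```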

@AmandaBirmingham
Collaborator

Ok, so the latest test failure says:

screen shot 2018-08-24 at 2 06 25 pm

I began debugging by simply pulling your test's TARGET_EXAMPLE "known-good" string into Excel so I could read it more easily, and it seems like there are some quite weird things going on in it:

screen shot 2018-08-24 at 2 11 51 pm

For example, it looks like:

  1. It is not sorted by column and then by row within plate, but rather by well id--which leads to G10 coming after G1, etc.
  2. There are only 24 rows for the first plate, which should have 95 (not 96 because H12 of each 96-well plate in the test db is empty).

Given this, I think the best place to start fixing this test failure is probably returning to the TARGET_EXAMPLE string and ensuring that it matches what you really think the code should be producing given the specs and discussion so far; for the info in TARGET_EXAMPLE currently, I'd be worried if this test WAS passing :)

data[study][content] = result
# if we want the sample_name.well_id, just replace sid
# for content
data[study][sid] = result
Collaborator

@antgonza I may have found your bug. Because the same sample can appear multiple times (even on a single plate, let alone on multiple plates throughout a sequencing run), if data is a dictionary of dictionaries (looks like it?) then the added line here,

data[study][sid] = result

... will overwrite info about those multiple samples each time the same sample id is encountered, leaving you with just one record for each sample (which, given that the SQL results aren't ordered, could be any one of them randomly--just whichever happens to be returned last).

The original code,

data[study][content] = result

would not be anticipated to have this problem because sid(sample_id) and content are not equivalent: content always (per code referenced below) has plate.id and well.well_id appended to it. Thus, in the case where a sample is plated multiple times, a single sample_id, say "1.SKB3.640195", ends up associated with multiple distinct contents such as "1.SKB3.640195.21.C9", "1.SKB3.640195.21.C12", etc, etc.

https://github.com/jdereus/labman/blob/19dd95f4e5578c1d9e2c5642fbf7479ccf573730/labman/db/composition.py#L580-L621
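A minimal sketch of the clobbering (the sample ids, wells, and result payloads below are made up for illustration):

```python
# Two wells holding the same sample on (hypothetical) plate 21.
results = [
    ("1.SKB3.640195", "1.SKB3.640195.21.C9", {"well": "C9"}),
    ("1.SKB3.640195", "1.SKB3.640195.21.C12", {"well": "C12"}),
]

by_sid, by_content = {}, {}
for sid, content, result in results:
    by_sid[sid] = result          # same key twice -> second write clobbers the first
    by_content[content] = result  # distinct keys -> both wells survive

assert len(by_sid) == 1      # only whichever row happened to be returned last
assert len(by_content) == 2  # one record per plated well
```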

Collaborator Author

@AmandaBirmingham and I discussed this offline and agreed on data[study][content] = result, even though this will generate duplicated sample names, which are not supported downstream.

Collaborator

I also checked this one with Gail. She confirms that:

  1. The sample_name column must exactly match what is in Qiita (so we can't use sample content there, even if there are duplicates) and
  2. Every sample that is in the run must be represented in the prep info file (so, even if there are duplicates, we can't leave those out of the file)
    * Note that while duplicate samples on the same plate or even in the same run is a very undesirable case in the real world, it is the norm in the test data. If you want to change the test db population and attendant unit tests so that they represent a more realistic case (no duplicate samples), that would be great!

In the case of duplicate samples in the same run, Gail manually goes into Qiita and adds a new sample (to represent the duplicate) and then manually modifies the name of the duplicate in the prep file to match the name she created for the new sample in Qiita. She has asked for Labman to show a warning when users download a prep info file that has duplicate samples in it, telling them that they will need to go through this process. I have added this as #343 Warn user if prep info file contains duplicate samples.

Collaborator Author

Kewl; just note that in the current state we will have a difference between the blanks and spikes in the prep file and those in the sample information file, which we will need to add to the sample info in Qiita somehow.

Collaborator

@antgonza You are absolutely right that the foregoing applies only to experimental samples--thanks for pointing that out! The known-good example we have for the prep info file shows that blanks are qualified by their well id while samples are not:

screen shot 2018-08-28 at 10 49 49 am

Furthermore, the labman.sample_composition table doesn't even store a sample_id for blanks and other controls:

screen shot 2018-08-28 at 10 52 03 am

Therefore, it looks like the sample_name column in the prep info file will need to contain the sample_id for experimental samples (i.e., those with sample_composition_type_id=1) and the content for everything else. Good catch!

Collaborator

Related to @antgonza 's point above about blanks, etc: it looks like handling them requires a modification to our way of generating the Well_Description, too. The known-good file we got from the lab shows well descriptions like this:

screen shot 2018-08-28 at 11 04 35 am

However, since as noted above the sample_composition table doesn't even store a sample_id for blanks and controls, following our approach of building the well description from the sample_id in these cases gives well descriptions with None in them, like this:

screen shot 2018-08-28 at 11 05 05 am

I'm coming to the conclusion that maybe everywhere in the prep info file generation that we incorporate sample_id for experimental samples (e.g., in sample name, well description, original name, more?), we need to instead incorporate content for the non-experimental samples to match the known-good. @antgonza , what do you think?

Collaborator Author

I think this should be a case by case scenario:

  • orig_name = name from qiita or empty for blanks/spikes
  • sample_name = original name for unique samples, if samples are duplicated, use content; for blanks/spikes use content
  • rest, like well description: use the sample_name as defined above

This is an easy change in the current code; however, it's not easy to test, as we will need to create a test with only unique sample ids, and I have no idea how to do that ...
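A hypothetical sketch of the sample_name rule proposed above (the function and variable names are illustrative, not the actual labman code):

```python
from collections import Counter

def choose_sample_name(sample_id, content, dup_counts):
    """Sketch of the proposed rule: blanks/spikes have no Qiita sample_id,
    so they fall back to content; experimental samples use their Qiita id
    unless that id is plated more than once."""
    if sample_id is None:          # blank or spike-in control
        return content
    if dup_counts[sample_id] > 1:  # duplicated experimental sample
        return content
    return sample_id

rows = [("1.SKB1.640202", "1.SKB1.640202.21.A1"),
        ("1.SKB1.640202", "1.SKB1.640202.21.A2"),  # duplicate
        ("1.SKB2.640194", "1.SKB2.640194.21.B1"),  # unique
        (None, "blank.21.H1")]                     # blank
counts = Counter(sid for sid, _ in rows if sid is not None)
names = [choose_sample_name(sid, c, counts) for sid, c in rows]
# duplicates and blanks get content; the unique sample keeps its Qiita id
```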

Collaborator

From the known-good we got from the lab, it looks like they currently expect orig_name to match sample_name even for blanks, etc (see below) so let's follow their convention on that. I agree with the suggestions for all the rest!

screen shot 2018-08-28 at 11 28 29 am

Re testing: Yeah, testing is going to be sticky :( The labman test db only knows about ONE Qiita study ("Identification of the Microbiomes for Cannabis Soils"), and says that study only has 27 samples (not enough to make a full plate without either repetition or lots of blanks). It looks like this info comes from qiita/qiita_db/support_files/populate_test_db.sql, rather than from anything within labman, so I'm hesitant to try to modify its content in the labman db population scripts.

I don't want to change the existing test plates in the db to not have any duplicate samples, because we need to be able to test the behavior of full plates. Alternatively, adding new test plates that don't have duplicate samples on them would be a big effort in expanding the db population script; that effort should really be put in at some point, but it is going to be enough of a pain that I don't want to do it until such time as we have access to a larger test study in the qiita tables to "plate".

Given all this, I think the best of a host of bad options may be to change the db population script so that there are a small number of unique samples on the existing plates. This will be a pain and kind of hacky (and will require some changes to the existing unit tests--ugh!) but will mean that we will be able to test all the use cases (unique experimental samples, duplicated experimental samples, and blanks/etc) at least once. I can probably get a PR ready for that today.

Collaborator Author

Note that it's pretty easy to create new studies in qiita via the qiita-api, or even to add new samples to an existing study, without having to modify qiita itself, but this will need to be done via the qiita_client (which is part of the db reset for tests).

Collaborator

A PR that puts one unique sample on each of the sample plates in the test db (replacing one of the samples that was previously a repeat) is now available: #346 Test database extension: some unique samples

@wasade (Member) left a comment

Some questions, some of which may be out-of-scope for the PR

-- orig_name
sample_id AS orig_name,
-- experiment_design_description
'' AS experiment_design_description,
Member

The design description is always empty?

Member

...same question for some of those below

Collaborator

Actually, this one is a mistake on my part--thanks for noticing it! I classed this in the "leave empty" category (see below) because it is empty in the known-good example I have, but going back to my notes, I see that Gail suggested that we should pull it from the Qiita study description.

Gail instructed us to always leave the following fields empty:
RUN_DATE
RUN_PREFIX
center_project_name
INSTRUMENT_MODEL
RUNID

I assume that someday in the future, labman or qiita or something will be expanded to close the loop by filling in these values after the sequencing is done ...

Member

Okay, thanks! Some of this is out-of-scope for labman too (e.g., the actual sequencing date)

Collaborator Author

K, using qiita.study.study_description

-- TM1000_8_tool
epmotion_tool_id,
-- mAStermix_lot
mASter_mix_id,
Member

should the AS be capitalized?

Collaborator Author

nope, changing ...

-- experiment_design_description
'' AS experiment_design_description,
-- run_center
'UCSDMI' AS run_center,
Member

Is the center the same if we run on the miseq that sits in our lab space vs. the instruments that are part of the sequencing core? Or those at rady?

Collaborator

When I asked "Is this always UCSDMI ?", the answer I got from Gail was "For our lab YES". I interpreted that to mean that it was true for all prep info sheets created in the lab; should I be asking her a more nuanced question here, and if so, what?

Member

Okay. I think it would be useful to ask Gail if UCSDMI encompasses any sequencing instrument associated with the UCSD campus. If yes, then we're good. If no, then more granularity may be warranted.

Collaborator

Ok, I've put a question out to Gail. If I learn that we need to make the behavior for this field more nuanced, I'll share that here.

Collaborator Author

Will go with UCSDMI; we can add an issue, or replace it if we get an answer before merging ...

Collaborator

Got an answer from Gail, which fortunately doesn't require a code change :)

Q: Is the center the same if we run on the miseq that sits in our lab space vs. the instruments that are part of the sequencing core? Or those at rady?
A: UCSDMI refers to all of those machines - in my opinion. Each machine has an identifier that is in the RUNID, allowing us to determine which particular site at UCSDMI (UCSD Microbiome Initiative) was used.

@AmandaBirmingham
Collaborator

BTW, any @wasade comments I didn't reply to I'm leaving to @antgonza because I don't have their answers :)

@antgonza
Collaborator Author

@AmandaBirmingham, just wondering if there is anything missing from this PR? Thanks!

@AmandaBirmingham
Collaborator

AmandaBirmingham commented Aug 31, 2018

@antgonza It looks to me like we are still getting Nones incorporated in the Well_description for non-experimental samples like blanks and controls (see example in #328 (comment) ); as far as I know, that's the last remaining item!

@antgonza
Copy link
Collaborator Author

Good catch, fixing.

@AmandaBirmingham AmandaBirmingham merged commit bb3e574 into biocore:master Aug 31, 2018