
Conversation

@antgonza
Collaborator

No description provided.

@AmandaBirmingham
Collaborator

Hi @antgonza : the timestamp change has just been merged into master :) so could you please integrate that into this PR and address any check failures that remain afterwards? Would love to review and merge this by the end of the week!

@antgonza
Collaborator Author

Wait on review, need to do another pass ...

@antgonza
Collaborator Author

Just restarted the build because locally the error is random. Not sure if it's related to other issues, but it seems like the plate number changes from 2 to 4 on some occasions. @AmandaBirmingham, any ideas?

AssertionError: 'samp[661 chars]late 2\tCannabis Soils\t1.SKB1.640202\tTest pl[40336 chars]\t\n' != 'samp[661 chars]late 4\tCannabis Soils\t1.SKB1.640202\tTest pl[40336 chars]\t\n'

@AmandaBirmingham
Collaborator

Yep, I strongly suspect I know the answer to this one. Having looked through the code, I believe that the ORDER BY study_id, sample_id, row_num, col_num clause in the big sql statement in generate_sample_sheet is not adequately constrained to ensure a deterministic output. Within the same study, the exact same sample_id (with the exact same row_num and col_num) can appear on multiple different plates, and without more constraint, which one is returned first will be random. There are apparently a lot of situations like this in the codebase, which were masked prior to the merge of PR 331 (Corrections of and expansions to unit test database) because all the tests were written using only a single test plate :(

However, that is not the only issue with this sql statement :), because I see that the known-good prep file does not order samples by sample id at all but rather by plate and then by position (first by column, then by row--the opposite of what the above sql does):

screen shot 2018-08-24 at 9 53 25 am

Since the goal is to reproduce the known-good file, generate_sample_sheet should follow this order.

@antgonza
Collaborator Author

Thanks, and sounds good; any suggestions on how to adequately constrain the SQL?

@AmandaBirmingham
Collaborator

That said, if you decide you want to open a new issue for the plate/column/row sorting instead of handling it in this PR, I'm ok with that. In that case, it would still be necessary to add an additional constraint to the sql ORDER BY statement, or else to explicitly sort the results in python afterwards, so that the output order becomes deterministic and is thus stable every time we run the tests :)

@AmandaBirmingham
Collaborator

Ooops, sorry, our messages crossed ...

@AmandaBirmingham
Collaborator

My initial guess would be that ordering by plate, then col_num, then row_num would do it. I don't know if there is actually supposed to be a study ordering ... the known-good file I have all comes from a single study, so I'd have to check that with the wet lab.

@wasade
Member

wasade commented Aug 24, 2018

Adding in the plate to ordering should do it, assuming the plate number is globally unique (to labman).

Organizing the output by study could be nice for subsequent debugging and human eyes. But, I can also see the value in having it ordered by physical location too.

@AmandaBirmingham
Collaborator

I think we will want to use the plate external_id (the user-assigned plate id), which I think is what @antgonza already has in plate and is constrained to be unique within labman during the table set-up in db_patch.sql:

CONSTRAINT idx_plate UNIQUE ( external_id )

I've reached out to Gail and the wet lab about whether they want additional ordering by study, but until/unless they say they do, I say we stick with straight-up positional ordering for now.
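A toy sqlite3 sketch of the nondeterminism being discussed (the table and values below are illustrative, not labman's actual schema): adding the unique plate external_id to the ORDER BY pins down which plate's row comes back first.

```python
import sqlite3

# Illustrative schema only -- not labman's real tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wells (external_id TEXT, sample_id TEXT, "
    "row_num INT, col_num INT)")
# The same sample at the same position on two different plates:
conn.executemany(
    "INSERT INTO wells VALUES (?, ?, ?, ?)",
    [("Test plate 2", "1.SKB1.640202", 1, 1),
     ("Test plate 4", "1.SKB1.640202", 1, 1)])

# Ordering only by sample/position leaves the plate order unspecified;
# the engine is free to return either plate first.
ambiguous = conn.execute(
    "SELECT external_id FROM wells "
    "ORDER BY sample_id, row_num, col_num").fetchall()

# Adding the unique external_id to the ORDER BY makes the order deterministic.
deterministic = conn.execute(
    "SELECT external_id FROM wells "
    "ORDER BY external_id, col_num, row_num").fetchall()
print(deterministic)  # always [('Test plate 2',), ('Test plate 4',)]
```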

@antgonza
Collaborator Author

For consistency, should we do sample_plate, primer_plate, col_num, row_num or do you have any other suggestions?

@antgonza
Collaborator Author

Note that the sorting in SQL is ignored once everything moves to a dataframe, so I decided to remove that sorting and use the pandas sorting ...
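For illustration, a pandas sketch of the positional sort discussed above (the column names and values here are assumptions, not necessarily what the real frame contains):

```python
import pandas as pd

# Hypothetical query results; the real frame comes from generate_sample_sheet.
df = pd.DataFrame([
    {"sample_plate": "Test plate 2", "primer_plate": "P1", "col_num": 10, "row_num": 7},
    {"sample_plate": "Test plate 2", "primer_plate": "P1", "col_num": 1, "row_num": 7},
    {"sample_plate": "Test plate 1", "primer_plate": "P1", "col_num": 12, "row_num": 1},
])

# Sort by plate, then column, then row (the positional order of the known-good
# file); sorting numeric col_num avoids the G1 < G10 < G2 problem that
# lexicographic well ids have.
df = df.sort_values(
    ["sample_plate", "primer_plate", "col_num", "row_num"]).reset_index(drop=True)
print(df.iloc[0]["sample_plate"])  # Test plate 1
```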

@AmandaBirmingham
Collaborator

Ok, so the latest test failure says:

screen shot 2018-08-24 at 2 06 25 pm

I began debugging by simply pulling your test's TARGET_EXAMPLE "known-good" string into Excel so I could read it more easily, and it seems like there are some quite weird things going on in it:

screen shot 2018-08-24 at 2 11 51 pm

For example, it looks like:

  1. It is not sorted by column and then by row within plate, but rather by well id--which leads to G10 coming after G1, etc.
  2. There are only 24 rows for the first plate, which should have 95 (not 96 because H12 of each 96-well plate in the test db is empty).

Given this, I think the best place to start fixing this test failure is probably returning to the TARGET_EXAMPLE string and ensuring that it matches what you really think the code should be producing given the specs and discussion so far; for the info in TARGET_EXAMPLE currently, I'd be worried if this test WAS passing :)

data[study][content] = result
# if we want the sample_name.well_id, just replace sid
# for content
data[study][sid] = result
Collaborator

@antgonza I may have found your bug. Because the same sample can appear multiple times (even on a single plate, let alone on multiple plates throughout a sequencing run), if data is a dictionary of dictionaries (looks like it?) then the added line here,

data[study][sid] = result

... will overwrite info about those multiple samples each time the same sample id is encountered, leaving you with just one record for each sample (which, given that the SQL results aren't ordered, could be any one of them randomly--just whichever happens to be returned last).

The original code,

data[study][content] = result

would not be anticipated to have this problem because sid(sample_id) and content are not equivalent: content always (per code referenced below) has plate.id and well.well_id appended to it. Thus, in the case where a sample is plated multiple times, a single sample_id, say "1.SKB3.640195", ends up associated with multiple distinct contents such as "1.SKB3.640195.21.C9", "1.SKB3.640195.21.C12", etc, etc.

https://github.com/jdereus/labman/blob/19dd95f4e5578c1d9e2c5642fbf7479ccf573730/labman/db/composition.py#L580-L621
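A minimal sketch of the clobbering (the sample ids, wells, and result payloads below are made up for illustration):

```python
# Two wells holding the same sample on (hypothetical) plate 21.
results = [
    ("1.SKB3.640195", "1.SKB3.640195.21.C9", {"well": "C9"}),
    ("1.SKB3.640195", "1.SKB3.640195.21.C12", {"well": "C12"}),
]

by_sid, by_content = {}, {}
for sid, content, result in results:
    by_sid[sid] = result          # same key twice -> second write clobbers the first
    by_content[content] = result  # distinct keys -> both wells survive

assert len(by_sid) == 1      # only whichever row happened to be returned last
assert len(by_content) == 2  # one record per plated well
```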

Collaborator Author

@AmandaBirmingham and I discussed this offline and agreed on data[study][content] = result, even though this will generate duplicated sample names, which are not supported downstream.

Collaborator

I also checked this one with Gail. She confirms that:

  1. The sample_name column must exactly match what is in Qiita (so we can't use sample content there, even if there are duplicates) and
  2. Every sample that is in the run must be represented in the prep info file (so, even if there are duplicates, we can't leave those out of the file)
    * Note that while duplicate samples on the same plate or even in the same run is a very undesirable case in the real world, it is the norm in the test data. If you want to change the test db population and attendant unit tests so that they represent a more realistic case (no duplicate samples), that would be great!

In the case of duplicate samples in the same run, Gail manually goes into Qiita and adds a new sample (to represent the duplicate) and then manually modifies the name of the duplicate in the prep file to match the name she created for the new sample in Qiita. She has asked for Labman to show a warning when users download a prep info file that has duplicate samples in it, telling them that they will need to go through this process. I have added this as #343 Warn user if prep info file contains duplicate samples.

Collaborator Author

Kewl; just note that in the current state we will have a difference between the blanks and spikes in the prep file and those in the sample information file, which we will need to add to the sample info in Qiita somehow.

Collaborator

@antgonza You are absolutely right that the foregoing applies only to experimental samples--thanks for pointing that out! The known-good example we have for the prep info file shows that blanks are qualified by their well id while samples are not:

screen shot 2018-08-28 at 10 49 49 am

Furthermore, the labman.sample_composition table doesn't even store a sample_id for blanks and other controls:

screen shot 2018-08-28 at 10 52 03 am

Therefore, it looks like the sample_name column in the prep info file will need to contain the sample_id for experimental samples (i.e., those with sample_composition_type_id=1) and the content for everything else. Good catch!

Collaborator

Related to @antgonza 's point above about blanks, etc: it looks like handling them requires a modification to our way of generating the Well_Description, too. The known-good file we got from the lab shows well descriptions like this:

screen shot 2018-08-28 at 11 04 35 am

However, since as noted above the sample_composition table doesn't even store a sample_id for blanks and controls, following our approach of building the well description from the sample_id in these cases gives well descriptions with None in them, like this:

screen shot 2018-08-28 at 11 05 05 am

I'm coming to the conclusion that maybe everywhere in the prep info file generation that we incorporate sample_id for experimental samples (e.g., in sample name, well description, original name, more?), we need to instead incorporate content for the non-experimental samples to match the known-good. @antgonza , what do you think?

Collaborator Author

I think this should be a case by case scenario:

  • orig_name = name from qiita or empty for blanks/spikes
  • sample_name = original name for unique samples, if samples are duplicated, use content; for blanks/spikes use content
  • rest, like well description: use the sample_name as defined above

This is an easy change in the current code; however, it's not easy to test, as we will need to create a test with only unique sample ids, and I have no idea how to do that ...
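A hypothetical sketch of the sample_name rule proposed above (the function and variable names are illustrative, not the actual labman code):

```python
from collections import Counter

def choose_sample_name(sample_id, content, dup_counts):
    """Sketch of the proposed rule: blanks/spikes have no Qiita sample_id,
    so they fall back to content; experimental samples use their Qiita id
    unless that id is plated more than once."""
    if sample_id is None:          # blank or spike-in control
        return content
    if dup_counts[sample_id] > 1:  # duplicated experimental sample
        return content
    return sample_id

rows = [("1.SKB1.640202", "1.SKB1.640202.21.A1"),
        ("1.SKB1.640202", "1.SKB1.640202.21.A2"),  # duplicate
        ("1.SKB2.640194", "1.SKB2.640194.21.B1"),  # unique
        (None, "blank.21.H1")]                     # blank
counts = Counter(sid for sid, _ in rows if sid is not None)
names = [choose_sample_name(sid, c, counts) for sid, c in rows]
# duplicates and blanks get content; the unique sample keeps its Qiita id
```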

Collaborator

From the known-good we got from the lab, it looks like they currently expect orig_name to match sample_name even for blanks, etc (see below) so let's follow their convention on that. I agree with the suggestions for all the rest!

screen shot 2018-08-28 at 11 28 29 am

Re testing: Yeah, testing is going to be sticky :( The labman test db only knows about ONE Qiita study ("Identification of the Microbiomes for Cannabis Soils"), and says that study only has 27 samples (not enough to make a full plate without either repetition or lots of blanks). It looks like this info comes from qiita/qiita_db/support_files/populate_test_db.sql, rather than from anything within labman, so I'm hesitant to try to modify its content in the labman db population scripts.

I don't want to change the existing test plates in the db to not have any duplicate samples, because we need to be able to test the behavior of full plates. Alternatively, adding new test plates that don't have duplicate samples on them would be a big effort in expanding the db population script; that effort should really be put in at some point, but it is going to be enough of a pain that I don't want to do it until such time as we have access to a larger test study in the qiita tables to "plate".

Given all this, I think the best of a host of bad options may be to change the db population script so that there are a small number of unique samples on the existing plates. This will be a pain and kind of hacky (and will require some changes to the existing unit tests--ugh!) but will mean that we will be able to test all the use cases (unique experimental samples, duplicated experimental samples, and blanks/etc) at least once. I can probably get a PR ready for that today.

Collaborator Author

Note that it's pretty easy to create new studies in qiita via the qiita-api, or even to add new samples to an existing study, without having to modify qiita itself, but this will need to be done via the qiita_client (which is part of the db reset for tests).

Collaborator

A PR that puts one unique sample on each of the sample plates in the test db (replacing one of the samples that was previously a repeat) is now available: #346 Test database extension: some unique samples

@wasade (Member) left a comment

Some questions, some of which may be out-of-scope for the PR

-- orig_name
sample_id AS orig_name,
-- experiment_design_description
'' AS experiment_design_description,
Member

The design description is always empty?

Member

...same question for some of those below

Collaborator

Actually, this one is a mistake on my part--thanks for noticing it! I classed this in the "leave empty" category (see below) because it is empty in the known-good example I have, but going back to my notes, I see that Gail suggested that we should pull it from the Qiita study description.

Gail instructed us to always leave the following fields empty:
RUN_DATE
RUN_PREFIX
center_project_name
INSTRUMENT_MODEL
RUNID

I assume that someday in the future, labman or qiita or something will be expanded to close the loop by filling in these values after the sequencing is done ...

Member

Okay, thanks! Some of this is out-of-scope for labman too (e.g., the actual sequencing date)

Collaborator Author

K, using qiita.study.study_description

-- TM1000_8_tool
epmotion_tool_id,
-- mAStermix_lot
mASter_mix_id,
Member

should the AS be capitalized?

Collaborator Author

nope, changing ...

-- experiment_design_description
'' AS experiment_design_description,
-- run_center
'UCSDMI' AS run_center,
Member

Is the center the same if we run on the miseq that sits in our lab space vs. the instruments that are part of the sequencing core? Or those at rady?

Collaborator

When I asked "Is this always UCSDMI ?", the answer I got from Gail was "For our lab YES". I interpreted that to mean that it was true for all prep info sheets created in the lab; should I be asking her a more nuanced question here, and if so, what?

Member

Okay. I think it would be useful to ask Gail if UCSDMI encompasses any sequencing instrument associated with the UCSD campus. If yes, then we're good. If no, then more granularity may be warranted.

Collaborator

Ok, I've put a question out to Gail. If I learn that we need to make the behavior for this field more nuanced, I'll share that here.

Collaborator Author

Will go with UCSDMI; we can add an issue, or replace it if we get an answer before merging ...

Collaborator

Got an answer from Gail, which fortunately doesn't require a code change :)

Q: Is the center the same if we run on the miseq that sits in our lab space vs. the instruments that are part of the sequencing core? Or those at rady?
A: UCSDMI refers to all of those machines - in my opinion. Each machine has an identifier that is in the RUNID, allowing us to determine which particular site at UCSDMI (UCSD Microbiome Initiative) was used.

@AmandaBirmingham
Collaborator

BTW, any @wasade comments I didn't reply to I'm leaving to @antgonza because I don't have their answers :)

@antgonza
Collaborator Author

@AmandaBirmingham, just wondering if there is anything missing from this PR? Thanks!

@AmandaBirmingham
Collaborator

AmandaBirmingham commented Aug 31, 2018

@antgonza It looks to me like we are still getting Nones incorporated in the Well_description for non-experimental samples like blanks and controls (see example in #328 (comment) ); as far as I know, that's the last remaining item!

@antgonza
Copy link
Collaborator Author

Good catch, fixing.

@AmandaBirmingham AmandaBirmingham merged commit bb3e574 into biocore:master Aug 31, 2018