WIP Gen shotgun prep files #502

charles-cowart · 2019-05-08T16:01:53Z

Initial query and code for shotgun prep info file generation. My assumption is that we will be tweaking the columns, format of the output, etc. and getting approval from Gail, etc. before merging.

charles-cowart · 2019-05-09T06:11:30Z

Partial implementation of final output using DataFrames instead of CSV for munging. Will pick it up in the morning.

Work in Progress

Partial implementation of munging using DataFrame.

All columns are present and filled out. There are some partial null columns where rows represent control data; this will be resolved in a later push.

wasade · 2019-05-09T22:56:43Z

@charles-cowart, note I think this needs to be rebased

charles-cowart · 2019-05-09T23:04:15Z

Sample output from the generate_metagenomics_prep_info_file() method. This is currently being used as the expected result for testing.

shotgun_prep_info_sample_output.txt

labcontrol/db/process.py

labcontrol/db/tests/test_process.py

labcontrol/db/process.py

AmandaBirmingham · 2019-05-10T00:22:30Z

labcontrol/db/process.py

+
+            for d in results:
+                # assert that study_id and orig_name are only None when both
+                # are None; making this a control entry and not a sample entry


So, the problem is that this isn't necessarily true--at least not currently. We definitely DO have real, experimental samples coming through and getting plated that happen NOT to be in qiita (which is all that these two pieces of being None indicates). What is this field being used for? Should we name it more correct, like "is_not_in_qiita" or something? I wouldn't want to use it for anything that required that its contents really, truly be controls ...

It looks like the reason this line works, at least for the test data, is that the raw results from the SQL query can be organized into two groups. The first group is 288 in number, looks like sample data, and has no missing values.

The second group number 92 and uniformly have NULL values for the following five columns:
['orig_name', 'study_id', 'sample_id', 'project_name', 'experiment_design_description']

Hence, rows where both study_id and 'orig_name' are NULL matches all of the second set and none of the first.

It appears I arrived at this by negating the condition we use to test for a 'sample row' in Amplicon prep info file generation:
if result['orig_name2'] is not None and study_id is not None:
Note that orig_name2 is the same as orig_name at this point in the code, and both orig_name and study_id have not been modified since coming off the DB. However, I got it wrong; the negation should be if orig_name is None OR study_id is None.

With respect to the test data, it's amusing that both statements will work equally well in this particular case. IMHO, we could have an is_control column if determining which row represents a control is potentially fuzzy. I could also determine what is a control (or at least not a sample) by parsing the content column, and targeting rows with content starting with 'vibrio' or 'blank' or another word in the table.

I'll leave this one alone until @AmandaBirmingham comments, and then I can make the change easily enough.

Thanks, @charles-cowart

Please
(a) take the creation of the is_control field out. It is not clearly named and thus has the potential to lead to gnarly bugs down the road when someone inaccurately assumes it is what it says it is (especially since our existing test conditions don't exercise the ways in which it secretly isn't what it says it is). Since it is currently used in only one place, we have the option to just drop it until we come up with a better naming/handling of this weird "maybe controls but maybe just unloved experimental samples" situation.
(b) replace the current use of is_control to decide when to modify orig_name2 with the check we use for that in generate_amplicon_prep_info:

if result['orig_name2'] is not None and study_id is not None:

As I recall, it took you and me about a week of back-and-forth to get this condition right, and so I don't want to mess with it again (at least not till we do a fuller refactor someday).

Too true! 😆

Fair enough, although it seems like we could really benefit from easily identifying states. It is something I hope we can get back to at some point.

labcontrol/db/process.py

labcontrol/db/support_files/patches/1.sql

AmandaBirmingham

This PR is looking really good! I think the last noteworthy thing to put in is the small front-end change. Right now, users only see the option to download prep sheets for amplicon sequencing runs:

This limitation is implemented here (look, we even remembered to reference this issue :D )

https://github.com/jdereus/labman/blob/7bb9106555143bb4877c49516e92c792ea9349a5/labcontrol/gui/templates/sequence_run_list.html#L27-L35

I suggest that we remove the if, replacing the entire block referenced above with

        var preparationSheets = "<a href='/process/sequencing/" +
            row[sequencing_process_id] +
            "/preparation_sheets' class='btn btn-success'>" +
            "<span class='glyphicon glyphicon-download'></span> " +
            "Download Preparation Sheets</a>";

@charles-cowart , can you make this change in your PR?

Note: this change this means that IF in the future we add a new kind of assay (like the BS new type of assay shown in the screenshot, which is of the test database :) AND we fail to implement support for it in SequencingProcess.generate_prep_information, THEN users will see a "Download Preparation Sheets" button for sequencing runs of that new assay type, but if they click the button they will get an error message (because SequencingProcess.generate_prep_information throws a ValueError for any unsupported assay type). However, there would be no possibility of generating garbage output (just an ugly error) and we really should be supporting prep sheets going forward for every new assay type, so I'm ok with this minimal risk.

charles-cowart · 2019-05-10T07:04:13Z

@AmandaBirmingham Thanks! Will do!

Added additional requested changes, as well as a few statements to write both raw and final output to disk for debugging purposes. Will remove from final version. Note that this comment in this issue needs to be resolved to everyone's satisfaction: biocore#502 (comment)

wasade · 2019-05-10T15:40:51Z

I'm 👍 pending the reply / edit to this comment

charles-cowart · 2019-05-10T17:39:33Z

@AmandaBirmingham @wasade Thanks for all your comments! I put the most recent request changes in, and it's currently going through Travis...

AmandaBirmingham · 2019-05-10T17:45:29Z

This looks good to me! @charles-cowart can I just confirm that you've run LabControl locally and verified that you can now see and successfully use the "Download Preparation Sheets" button for the an existing shotgun sequencing run?

charles-cowart · 2019-05-10T17:53:56Z

Thanks @AmandaBirmingham ! Yup, downloads as expected. :)

charles-cowart mentioned this pull request May 8, 2019

Generate correct Preparation Sheet for shotgun runs #327

Closed

Charles Cowart added 9 commits May 9, 2019 13:17

Shotgun prep file generation

934c2ff

Work in Progress

Documentation/flake8 pass

a612070

An initial test file for prep file output testing

c16fd02

Addressing change requests made in meeting

fdc8aaa

Migration from CSV module to DataFrame

48f9c83

Partial implementation of munging using DataFrame.

All columns are present and filled out.

90823be

All columns are present and filled out. There are some partial null columns where rows represent control data; this will be resolved in a later push.

All columns completed

a8f1f03

Remove debugging code

853daee

Fixed testing

677f696

charles-cowart force-pushed the gen_shotgun_prep_files branch from b4c94c3 to 677f696 Compare May 9, 2019 22:51

charles-cowart requested review from AmandaBirmingham and wasade May 9, 2019 22:51

Merge branch 'master' into gen_shotgun_prep_files

b5c87e5

wasade requested changes May 9, 2019

View reviewed changes