-
Notifications
You must be signed in to change notification settings - Fork 15
WIP Gen shotgun prep files #502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP Gen shotgun prep files #502
Conversation
|
Partial implementation of final output using DataFrames instead of CSV for munging. Will pick it up in the morning. |
Work in Progress
Partial implementation of munging using DataFrame.
All columns are present and filled out. There are some partial null columns where rows represent control data; this will be resolved in a later push.
b4c94c3 to
677f696
Compare
|
@charles-cowart, note I think this needs to be rebased |
|
Sample output from the generate_metagenomics_prep_info_file() method. This is currently being used as the expected result for testing. |
labcontrol/db/process.py
Outdated
|
|
||
| for d in results: | ||
| # assert that study_id and orig_name are only None when both | ||
| # are None; making this a control entry and not a sample entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the problem is that this isn't necessarily true--at least not currently. We definitely DO have real, experimental samples coming through and getting plated that happen NOT to be in qiita (which is all that these two pieces of being None indicates). What is this field being used for? Should we name it more correct, like "is_not_in_qiita" or something? I wouldn't want to use it for anything that required that its contents really, truly be controls ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the reason this line works, at least for the test data, is that the raw results from the SQL query can be organized into two groups. The first group is 288 in number, looks like sample data, and has no missing values.
The second group number 92 and uniformly have NULL values for the following five columns:
['orig_name', 'study_id', 'sample_id', 'project_name', 'experiment_design_description']
Hence, rows where both study_id and 'orig_name' are NULL matches all of the second set and none of the first.
It appears I arrived at this by negating the condition we use to test for a 'sample row' in Amplicon prep info file generation:
if result['orig_name2'] is not None and study_id is not None:
Note that orig_name2 is the same as orig_name at this point in the code, and both orig_name and study_id have not been modified since coming off the DB. However, I got it wrong; the negation should be if orig_name is None OR study_id is None.
With respect to the test data, it's amusing that both statements will work equally well in this particular case. IMHO, we could have an is_control column if determining which row represents a control is potentially fuzzy. I could also determine what is a control (or at least not a sample) by parsing the content column, and targeting rows with content starting with 'vibrio' or 'blank' or another word in the table.
I'll leave this one alone until @AmandaBirmingham comments, and then I can make the change easily enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @charles-cowart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please
(a) take the creation of the is_control field out. It is not clearly named and thus has the potential to lead to gnarly bugs down the road when someone inaccurately assumes it is what it says it is (especially since our existing test conditions don't exercise the ways in which it secretly isn't what it says it is). Since it is currently used in only one place, we have the option to just drop it until we come up with a better naming/handling of this weird "maybe controls but maybe just unloved experimental samples" situation.
(b) replace the current use of is_control to decide when to modify orig_name2 with the check we use for that in generate_amplicon_prep_info:
if result['orig_name2'] is not None and study_id is not None:
As I recall, it took you and me about a week of back-and-forth to get this condition right, and so I don't want to mess with it again (at least not till we do a fuller refactor someday).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too true! 😆
Fair enough, although it seems like we could really benefit from easily identifying states. It is something I hope we can get back to at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is looking really good! I think the last noteworthy thing to put in is the small front-end change. Right now, users only see the option to download prep sheets for amplicon sequencing runs:
This limitation is implemented here (look, we even remembered to reference this issue :D )
I suggest that we remove the if, replacing the entire block referenced above with
var preparationSheets = "<a href='/process/sequencing/" +
row[sequencing_process_id] +
"/preparation_sheets' class='btn btn-success'>" +
"<span class='glyphicon glyphicon-download'></span> " +
"Download Preparation Sheets</a>";
@charles-cowart , can you make this change in your PR?
Note: this change this means that IF in the future we add a new kind of assay (like the BS new type of assay shown in the screenshot, which is of the test database :) AND we fail to implement support for it in SequencingProcess.generate_prep_information, THEN users will see a "Download Preparation Sheets" button for sequencing runs of that new assay type, but if they click the button they will get an error message (because SequencingProcess.generate_prep_information throws a ValueError for any unsupported assay type). However, there would be no possibility of generating garbage output (just an ugly error) and we really should be supporting prep sheets going forward for every new assay type, so I'm ok with this minimal risk.
|
@AmandaBirmingham Thanks! Will do! |
Added additional requested changes, as well as a few statements to write both raw and final output to disk for debugging purposes. Will remove from final version. Note that this comment in this issue needs to be resolved to everyone's satisfaction: biocore#502 (comment)
|
I'm 👍 pending the reply / edit to this comment |
|
@AmandaBirmingham @wasade Thanks for all your comments! I put the most recent request changes in, and it's currently going through Travis... |
|
This looks good to me! @charles-cowart can I just confirm that you've run LabControl locally and verified that you can now see and successfully use the "Download Preparation Sheets" button for the an existing shotgun sequencing run? |
|
Thanks @AmandaBirmingham ! Yup, downloads as expected. :) |

Initial query and code for shotgun prep info file generation. My assumption is that we will be tweaking the columns, format of the output, etc. and getting approval from Gail, etc. before merging.