-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INF-435/mz/extend germline to produce desired output structure #6
INF-435/mz/extend germline to produce desired output structure #6
Conversation
…embers, update the Sample loader to populate
…arisons for the time being
…optionally generate this output. If not using --new-samples or --samples-to-compare-to, the germline code will generate the usual single .match file
@@ -108,6 +117,8 @@ class Individual | |||
streamoff offset; | |||
|
|||
Match ** all_matches; | |||
ofstream* individualMatchFile; | |||
ofstream* individualHomozFile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice... this is definitely cleaner than map<string, ofstream*>
fout << endl; | ||
if ( ALL_SAMPLES.useEmbarkRFGermlineOutput ) { | ||
int key1 = stoi(node[0]->single_id); | ||
int key2 = stoi(node[1]->single_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth handling case where this can't convert? I don't think so, because that means something is truly wrong? i.e., something was passed in that wasn't really a proxy key?
string end_pos = to_string(ALL_SNPS.getSNP(snp_end).getPhysPos()); | ||
|
||
// homoz | ||
if ( key1 == key2 && node[0]->is_new ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could DRY out these if statements significantly, which I think would improve readability. The only thing that changes between the branches is ofs
, and whether we get the 0th or 1st element of node. You could just set these in an if statement, then all the oline pushes could be shared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new dog new dog has some extra logic though which you'll need to keep in the if statements (maybe this is why you didn't DRY this out?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do this mostly because I wanted the logic to be very visible and easily comparable to the logic in the rust code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done! I'm working on the test harness + install into the container now, will give this a try later this morning
*ofs << joined_oline << endl; | ||
} | ||
// newdog : newdog comparison, write out same record twice | ||
else if ( node[0]->is_new && node[1]->is_new && key1 != key2 ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: Is this block reachable? I'm a little hazy on what node[*]
is. For instance, in the line above:
key1 > key2 && node[0]->is_new
... Does this necessarily imply that only one of the keys is new
?
I'm naively assuming node[0]->is_new
asserts that the first key is new
, but doesn't say anything about the second...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, so this specific case is the newdog
newdog
comparison, where both keys are new and not the same. This would require us to output the match twice, one for each new dog. This block is definitely reachable only when we run the compset 9999
.
key1 > key2 && node[0]->is_new
... Does this necessarily imply that only one of the keys is `new`? I'm naively assuming `node[0]->is_new` asserts that the first key is `new`, but doesn't say anything about the second...
Correct, this is checking only if the first key is new. The per-dog matches file require(?) that the new dog be first, but germline doesn't care, so each line needs to be checked to see which key is the new dog and put that one first. This logic is lifted from the rust script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this only matters when we are comparing a set of old dogs against new dogs, and using the -samples-to-compare
and -new-samples
flags. Otherwise, germline will do all dogs against all dogs in the ped file by default. This doesn't entirely rely on the fact that new keys will be greater than old keys as I wanted to be more explicit about checking if a key is new dog or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! One small question about the logic for splitting the lines.
…ual results to, adhere to existing pattern for easier integration into python
Thanks Dennis, addressed your comment. Merging now |
Objective
This PR extends the Germline tool to produce the same set of outputs as the
split_germline_output_for_new_dog
rust binary. During profiling, we saw about a 4x speedup when using C++ over rust, the toy script can be found in this PR.The goal is to lift all the logic for the rust code directly into germline, so we can output per-dog matches directly from memory, without the need to split the single
.match
file which contains all matchesChanges
-chromosome
and-individual_output
parameters now set a boolean flaguseEmbarkRFGermlineOutputs
and dictate output structure/namingnew dog
is assigned a file handle for a match file and homoz filenew dog
:new dog
matches are handled such that the same record is output twice, once for each new dognew_key, haplotype, old_key, haplotype, chrom, start, end
Testing
A minimal test set was generated with 5 old dogs and 5 new dogs with 2 tests being executed.
ToDo
minimal_test
, maybe even bake into theMakefile
relative_finder_germline.py
codeWill integrate into
pybark
in separate PRParity
old results + rust code:
new results: