Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INF-435/mz/extend germline to produce desired output structure #6

Merged

Conversation

myz540
Copy link

@myz540 myz540 commented Dec 17, 2021

Objective

This PR extends the Germline tool to produce the same set of outputs as the split_germline_output_for_new_dog rust binary. During profiling, we saw about a 4x speedup when using C++ over rust, the toy script can be found in this PR.

The goal is to lift all the logic for the rust code directly into germline, so we can output per-dog matches directly from memory, without the need to split the single .match file which contains all matches

Changes

  1. the -chromosome and -individual_output parameters now set a boolean flag useEmbarkRFGermlineOutputs and dictate output structure/naming
  2. Each new dog is assigned a file handle for a match file and homoz file
  3. new dog : new dog matches are handled such that the same record is output twice, once for each new dog
  4. Homoz tracts are handled and output to a separate file
  5. Ordering is handled such that the columns are always new_key, haplotype, old_key, haplotype, chrom, start, end

Testing

A minimal test set was generated with 5 old dogs and 5 new dogs with 2 tests being executed.

  1. Compare 5 old dogs against 5 new dogs: results show parity against the original germline code, though parity against the rust outputs are TBD.
  2. Compare 5 new dogs against the same 5 new dogs: results show parity against the original germline code with homoz tracts and match files being output to different files. Parity against rust outputs are TBD.

ToDo

  • Minor cleanup and ?refactor?
  • Add README to the minimal_test, maybe even bake into the Makefile
  • Properly integrate into the relative_finder_germline.py code
  • Properly set output file handle names and paths to mimic current rust code

Will integrate into pybark in separate PR

Parity

old results + rust code:
image

new results:
image

Mike Zhong added 2 commits December 16, 2021 21:27
…optionally generate this output. If not using --new-samples or --samples-to-compare-to, the germline code will generate the usual single .match file
GERMLINE_0001.cpp Outdated Show resolved Hide resolved
Individual.cpp Outdated Show resolved Hide resolved
Individual.cpp Outdated Show resolved Hide resolved
Individual.cpp Outdated Show resolved Hide resolved
@@ -108,6 +117,8 @@ class Individual
streamoff offset;

Match ** all_matches;
ofstream* individualMatchFile;
ofstream* individualHomozFile;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice... this is definitely cleaner than map<string, ofstream*>

fout << endl;
if ( ALL_SAMPLES.useEmbarkRFGermlineOutput ) {
int key1 = stoi(node[0]->single_id);
int key2 = stoi(node[1]->single_id);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth handling case where this can't convert? I don't think so, because that means something is truly wrong? i.e., something was passed in that wasn't really a proxy key?

string end_pos = to_string(ALL_SNPS.getSNP(snp_end).getPhysPos());

// homoz
if ( key1 == key2 && node[0]->is_new ) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could DRY out these if statements significantly, which I think would improve readability. The only thing that changes between the branches is ofs, and whether we get the 0th or 1st element of node. You could just set these in an if statement, then all the oline pushes could be shared

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new dog new dog has some extra logic though which you'll need to keep in the if statements (maybe this is why you didn't DRY this out?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do this mostly because I wanted the logic to be very visible and easily comparable to the logic in the rust code

Copy link

@CalvinLeather CalvinLeather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done! I'm working on the test harness + install into the container now, will give this a try later this morning

*ofs << joined_oline << endl;
}
// newdog : newdog comparison, write out same record twice
else if ( node[0]->is_new && node[1]->is_new && key1 != key2 ) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: Is this block reachable? I'm a little hazy on what node[*] is. For instance, in the line above:

key1 > key2 && node[0]->is_new

... Does this necessarily imply that only one of the keys is new?

I'm naively assuming node[0]->is_new asserts that the first key is new, but doesn't say anything about the second...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, so this specific case is the newdog newdog comparison, where both keys are new and not the same. This would require us to output the match twice, one for each new dog. This block is definitely reachable only when we run the compset 9999.

key1 > key2 && node[0]->is_new


... Does this necessarily imply that only one of the keys is `new`?

I'm naively assuming `node[0]->is_new` asserts that the first key is `new`, but doesn't say anything about the second...

Correct, this is checking only if the first key is new. The per-dog matches file require(?) that the new dog be first, but germline doesn't care, so each line needs to be checked to see which key is the new dog and put that one first. This logic is lifted from the rust script

Copy link
Author

@myz540 myz540 Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this only matters when we are comparing a set of old dogs against new dogs, and using the -samples-to-compare and -new-samples flags. Otherwise, germline will do all dogs against all dogs in the ped file by default. This doesn't entirely rely on the fact that new keys will be greater than old keys as I wanted to be more explicit about checking if a key is new dog or not

Copy link

@ddunne-embarkvet ddunne-embarkvet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! One small question about the logic for splitting the lines.

@myz540
Copy link
Author

myz540 commented Dec 22, 2021

Looks good to me! One small question about the logic for splitting the lines.

Thanks Dennis, addressed your comment. Merging now

@myz540 myz540 merged commit cff6dbc into master Dec 22, 2021
@myz540 myz540 deleted the INF-435/mz/extend-germline-to-produce-desired-output-structure branch December 22, 2021 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants