INF-435/mz/extend germline to produce desired output structure #6

myz540 · 2021-12-17T05:17:16Z

Objective

This PR extends the Germline tool to produce the same set of outputs as the split_germline_output_for_new_dog Rust binary. During profiling, we saw about a 4x speedup when using C++ over Rust; the toy script can be found in this PR.

The goal is to lift all the logic from the Rust code directly into germline, so we can output per-dog matches directly from memory, without needing to split the single .match file that contains all matches.

Changes

The -chromosome and -individual_output parameters now set a boolean flag useEmbarkRFGermlineOutput and dictate output structure/naming
Each new dog is assigned a file handle for a match file and homoz file
new dog : new dog matches are handled such that the same record is output twice, once for each new dog
Homoz tracts are handled and output to a separate file
Ordering is handled such that the columns are always new_key, haplotype, old_key, haplotype, chrom, start, end
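
As an illustration of the column ordering described above, here is a minimal sketch. The MatchSide struct, the helper names, and the tab delimiter are all assumptions made for this example; they are not taken from the Germline source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One side of a match: the dog's key and the haplotype that matched.
// Hypothetical struct for illustration only.
struct MatchSide {
    std::string key;
    std::string haplotype;
    bool is_new;
};

// Join fields with a delimiter (the tab used below is an assumption).
std::string join_fields(const std::vector<std::string>& fields, char delim) {
    std::string out;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) out += delim;
        out += fields[i];
    }
    return out;
}

// Build one record in the fixed order:
//   new_key, haplotype, old_key, haplotype, chrom, start, end
// At least one side must be a new dog; if only the second side is new,
// the two sides are swapped so the new dog comes first.
std::string format_record(MatchSide a, MatchSide b,
                          const std::string& chrom,
                          const std::string& start,
                          const std::string& end) {
    assert(a.is_new || b.is_new);
    if (!a.is_new) std::swap(a, b);
    return join_fields({a.key, a.haplotype, b.key, b.haplotype,
                        chrom, start, end}, '\t');
}
```

Under these assumptions, the same line comes out regardless of which side germline happens to report first, as long as exactly one side is a new dog.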

Testing

A minimal test set was generated with 5 old dogs and 5 new dogs, and 2 tests were executed:

Compare 5 old dogs against 5 new dogs: results show parity against the original germline code, though parity against the Rust outputs is TBD.
Compare 5 new dogs against the same 5 new dogs: results show parity against the original germline code, with homoz tracts and matches being output to different files. Parity against the Rust outputs is TBD.

ToDo

Minor cleanup and ?refactor?
Add README to the minimal_test, maybe even bake into the Makefile
Properly integrate into the relative_finder_germline.py code
Properly set output file handle names and paths to mimic current rust code

Will integrate into pybark in separate PR

Parity

old results + rust code:

new results:

GERMLINE_0001.cpp

Individual.cpp

CalvinLeather · 2021-12-17T12:23:03Z

Individual.h

@@ -108,6 +117,8 @@ class Individual
 	streamoff offset;

 	Match ** all_matches;
+	ofstream* individualMatchFile;
+	ofstream* individualHomozFile;


Nice... this is definitely cleaner than map<string, ofstream*>

CalvinLeather · 2021-12-17T12:27:19Z

Match.cpp

-		fout << endl;
+		if ( ALL_SAMPLES.useEmbarkRFGermlineOutput ) {
+			int key1 = stoi(node[0]->single_id);
+			int key2 = stoi(node[1]->single_id);


Is it worth handling the case where this can't convert? I don't think so, because that would mean something is truly wrong, i.e., something was passed in that wasn't really a proxy key?
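
For reference, if handling the failure were ever wanted, a minimal sketch could look like the following. parse_proxy_key is a hypothetical helper, not part of Germline. It relies on std::stoi throwing std::invalid_argument for non-numeric input and std::out_of_range when the value does not fit in an int.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a proxy key without letting std::stoi throw.
// Returns true and stores the value in `out` only if the whole string is
// a valid int; otherwise returns false and leaves `out` untouched.
bool parse_proxy_key(const std::string& id, int& out) {
    try {
        std::size_t consumed = 0;
        const int key = std::stoi(id, &consumed);
        if (consumed != id.size()) {
            return false;  // trailing characters, e.g. "12ab"
        }
        out = key;
        return true;
    } catch (const std::invalid_argument&) {
        return false;  // no digits at all, e.g. "" or "abc"
    } catch (const std::out_of_range&) {
        return false;  // value does not fit in an int
    }
}
```

With something like this, a record whose single_id fails to parse could be skipped or logged instead of terminating the run. Whether that is preferable to failing fast is the open question in the comment above.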

CalvinLeather · 2021-12-17T12:32:54Z

Match.cpp

+			string end_pos = to_string(ALL_SNPS.getSNP(snp_end).getPhysPos());
+
+			// homoz
+			if ( key1 == key2 && node[0]->is_new ) {


You could DRY out these if statements significantly, which I think would improve readability. The only thing that changes between the branches is ofs, and whether we get the 0th or 1st element of node. You could just set these in an if statement, then all the oline pushes could be shared

new dog new dog has some extra logic though which you'll need to keep in the if statements (maybe this is why you didn't DRY this out?)

I didn't do this mostly because I wanted the logic to be very visible and easily comparable to the logic in the rust code

minimal_test/run_new.sh

CalvinLeather

Nicely done! I'm working on the test harness + install into the container now, will give this a try later this morning

…onally create the individual file handles

…ns per existing spec

… append mode

… on ubuntu 20.04

ddunne-embarkvet · 2021-12-21T20:34:11Z

Match.cpp

+				*ofs << joined_oline << endl;
+			}
+			// newdog : newdog comparison, write out same record twice
+			else if ( node[0]->is_new  &&  node[1]->is_new  && key1 != key2 ) {


QQ: Is this block reachable? I'm a little hazy on what node[*] is. For instance, in the line above:

key1 > key2 && node[0]->is_new

... Does this necessarily imply that only one of the keys is new?

I'm naively assuming node[0]->is_new asserts that the first key is new, but doesn't say anything about the second...

Correct, so this specific case is the newdog : newdog comparison, where both keys are new and not the same. This requires us to output the match twice, once for each new dog. This block is definitely reachable, but only when we run the compset 9999.

key1 > key2 && node[0]->is_new

... Does this necessarily imply that only one of the keys is `new`? I'm naively assuming `node[0]->is_new` asserts that the first key is `new`, but doesn't say anything about the second...

Correct, this is checking only whether the first key is new. The per-dog matches files require(?) that the new dog be first, but germline doesn't care, so each line needs to be checked to see which key is the new dog, and that one is put first. This logic is lifted from the Rust script.

Again, this only matters when we are comparing a set of old dogs against new dogs, and using the -samples-to-compare and -new-samples flags. Otherwise, germline will do all dogs against all dogs in the ped file by default. This doesn't entirely rely on the fact that new keys will be greater than old keys, as I wanted to be more explicit about checking whether a key is a new dog or not.
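
One way to make the reachability question easy to reason about is to classify each pair explicitly, testing the most specific condition first so that no earlier branch can shadow a later one. This is only a sketch of that idea, not the branch order used in Match.cpp; PairKind and classify are hypothetical names.

```cpp
// Hypothetical classification of a matched pair, using only each key and
// its is_new flag (no assumption that new keys are larger than old keys).
enum class PairKind { Homoz, NewNew, NewFirst, NewSecond, OldOld };

PairKind classify(int key1, bool new1, int key2, bool new2) {
    if (key1 == key2 && new1) {
        return PairKind::Homoz;      // same new dog on both sides
    }
    if (new1 && new2) {
        return PairKind::NewNew;     // keys must differ here: write twice
    }
    if (new1) {
        return PairKind::NewFirst;   // already in new, old order
    }
    if (new2) {
        return PairKind::NewSecond;  // swap so the new dog comes first
    }
    return PairKind::OldOld;         // no per-dog output
}
```

Because the NewNew test comes before the single-new tests, a pair of two distinct new dogs always lands in that branch regardless of which key is larger.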

ddunne-embarkvet

Looks good to me! One small question about the logic for splitting the lines.

…ual results to, adhere to existing pattern for easier integration into python

myz540 · 2021-12-22T17:09:34Z

Looks good to me! One small question about the logic for splitting the lines.

Thanks Dennis, addressed your comment. Merging now

Mike Zhong added 10 commits December 16, 2021 18:26

Add individual match and homoz files, with setters and getters

a7c30cc

fix matches builder header guard

836f075

Add single_id, haplotype, and is_new flag to Individual.h as public members, update the Sample loader to populate

2b951d8

Have Sample loader set is_new

08fb9ef

Add prints for old vs new sample loading, testing

4bc5267

Ok, so ALL_SAMPLES is used as a ref and individualsP is the working object

680060d

Add file handle initialization during sample load

47dbf7e

Write out records for new keys only, print for homoz and new-new comparisons for the time being

6db6010

fix file handle output name

4d6dea0

completed implementation for new-new and homoz files

377af45

myz540 added the wip label Dec 17, 2021

myz540 requested a review from CalvinLeather December 17, 2021 05:17

Mike Zhong added 2 commits December 16, 2021 21:27

Added minimal test set, will add instructions later

537032f

properly parameterized the useEmbarkRFGermlineOutput boolean flag to optionally generate this output. If not using --new-samples or --samples-to-compare-to, the germline code will generate the usual single .match file

b899ea8

myz540 requested a review from mattbarton December 17, 2021 05:48