Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancy Function #317

Draft
wants to merge 27 commits into
base: main
Choose a base branch
from
Draft

Discrepancy Function #317

wants to merge 27 commits into from

Conversation

hfr1tz3
Copy link

@hfr1tz3 hfr1tz3 commented Sep 22, 2023

This discrepancy functions gives us a metric for comparing two tree sequences.
The tree_discrepancy method returns for the discrepancy between tsa and tsb, the tuple of
a. the total shared span divided by total node span in tsa (which we need to compute); so this is proprotion of the span in tsa that is accurately represented in tsb; and
b. the root-mean-squared discrepancy in time, with the average weighted by span in tsa.

@nspope
Copy link
Contributor

nspope commented Sep 25, 2023

Awesome, let me know when you want feedback.

@hfr1tz3
Copy link
Author

hfr1tz3 commented Sep 25, 2023

So we need to pass over edges and pass over roots. To do so we should write a separate function.
Proposal: Need to get per node total span in ts ( ie. ) w_i in sum (t_x-t_y)**2 w_i/ sum (w_i)
count up in each node the total span from where it is a child which we can get from the edge table np.bincount...
doesn't account for nodes which are roots so we need to include contribution of nodes who are roots.

@petrelharp
Copy link

petrelharp commented Oct 2, 2023


def node_spans(ts):
    """
    Returns the array of "node spans", i.e., the `j`th entry gives
    the total span over which node `j` is in the tree (i.e., does
    not have 'missing data' there).
    """
    child_spans = np.bincount(
            ts.edges_child,
            weights=ts.edges_right - ts.edges_left,
            minlength=ts.num_nodes,
    )

    for t in ts.trees():
        span = t.interval[1] - t.interval[0]
        for r in t.roots:
            # do this check to exempt 'missing data'
            if t.num_children[r] > 0:
                child_spans[r] += span

@hfr1tz3
Copy link
Author

hfr1tz3 commented Oct 2, 2023

I think that is everything we need for tree_discrepancy. I think I will be ready for feedback now @nspope .

@nspope
Copy link
Contributor

nspope commented Oct 9, 2023

Oh sorry I just saw this! Will look ASAP.

@hfr1tz3
Copy link
Author

hfr1tz3 commented Oct 9, 2023

I still haven't checked if the tests from test_evaluation.py have passed yet, but I am working on trying to do that.

@petrelharp
Copy link

Hold off @nspope - we looked at this today and @hfr1tz3 has homework yet.

@hfr1tz3
Copy link
Author

hfr1tz3 commented Oct 25, 2023

All tests in test_evaluation.py for tree_discrepancy now pass!
@petrelharp does this mean we are ready for review?

dis, err = evaluation.tree_discrepancy(ts, other)
true_error = np.sqrt((2 * 6 * 300**2 + 2 * 2 * 150**2) / 46)
assert dis == 0.0
assert np.isclose(err, true_error)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add another test that tests with both time and span True?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention it, that would make sense to do.

tsdate/evaluation.py Outdated Show resolved Hide resolved
tsdate/evaluation.py Outdated Show resolved Hide resolved
tsdate/evaluation.py Outdated Show resolved Hide resolved
tsdate/evaluation.py Outdated Show resolved Hide resolved
@petrelharp
Copy link

I have some minor suggestions, then yes - ready for review!

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
hfr1tz3 and others added 2 commits October 26, 2023 10:34
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
@petrelharp
Copy link

Have a look, @nspope ?

Copy link
Contributor

@nspope nspope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! Just have a couple questions about the "one step" procedure to find the best matching node.

if shared_spans[i, j] == max_span[i]:
match[i, j] = shared_spans[i, j]
time_array[i, j] = np.abs(ts.nodes_time[i] - other.nodes_time[j])
discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j])
Copy link
Contributor

@nspope nspope Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me of the rationale here?

The best match should be the node that has the greatest shared span, if this is unambiguous. If it is ambiguous (nodes are tied in max shared span), then we could choose the best match to be the tied node with with the closest age to the focal node.

But that two-step procedure isn't happening here, right? So we could conceivably get a "best-matching" node with a relatively small shared span but a very similar age? (and the chance of this happening would depend on the choice of time scaling?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see-- you are doing the two step procedure, sorry! What threw me off was discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j]) ... does this need to include match?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose not, I could just clean it up and use shared_spans[i,j].

time_difference = np.absolute(np.asarray(ts_times - other_times))
best_match_matrix = scipy.sparse.coo_matrix(
(
1 / (1 + (match_matrix.data * time_difference)),
Copy link
Contributor

@nspope nspope Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as earlier comment -- couldn't we end up with best matches that don't satisfy the "max shared span" criterion, but are very similar in terms of age?

And, the choice of time scaling would impact the chance of this happening, right? In that, if the time scaling is such that the absolute difference between ages is large relative to typical node spans, then this one-step criterion would be dominated by the difference in ages?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I see how this works-- very nice.

match = shared_spans.data == max_span[row_ind]
# Construct a matrix of potiential matches and
match_matrix = scipy.sparse.coo_matrix(
(shared_spans.data[match], (row_ind[match], col_ind[match])),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work as intended if shared_spans.data[match] was np.ones(len(match)), correct?

shared_spans = naive_shared_node_spans(ts, other).toarray()
max_span = np.max(shared_spans, axis=1)
assert len(max_span) == ts.num_nodes
match = np.zeros((ts.num_nodes, other.num_nodes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is match needed? Seems like this would work fine with

if shared_spans[i, j] == max_span[i]:
  time_array[i, j] = np.abs(...)
  discrepancy[i, j] = 1 / (1 + time_array[i, j])

(also, maybe add a comment about the necessity of 1 / (1 + x) for the sparse matrix format)

match[i, j] = shared_spans[i, j]
time_array[i, j] = np.abs(ts.nodes_time[i] - other.nodes_time[j])
discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j])
best_match = np.argmax(discrepancy, axis=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nitpicking here, but I think a more naive (e.g. clearer to read) implementation would not use sparse matrix operations -- instead, go over shared_spans row by row, find the max, find ties, calculate time differences for those ties, and append the best match to to a list.

@nspope
Copy link
Contributor

nspope commented Oct 27, 2023

Nevermind, I misunderstood on my initial read-through. Looks great!

The only real suggestion I have is to rewrite the naive test suite implementation to be more naive (e.g. clearer on a quick glance). That is, don't use sparse matrix operations, but instead loop over rows of shared_spans, find the set of nodes matching the row max with np.where, get the age difference for these, and append the node with the min age difference to a list.

@petrelharp
Copy link

Nevermind, I misunderstood on my initial read-through. Looks great!

Perhaps a short explanatory comment could clarify how it works - I thought it was wrong last time I looked at it also. =)

@petrelharp
Copy link

We've got line ending problems; for the record here's what I did to fix them:

sed -i 's/^M$//' tsdate/evaluation.py
sed -i 's/^M$//' tests/test_evaluation.py 

@hyanwong
Copy link
Member

hyanwong commented Nov 8, 2023

Is this ready for merging? I'm happy to do so if @petrelharp and/or @nspope give the OK (assume it's fine for the moment in the tsdate repo). If we merge it this week, it should make it into the next release (although I don't think it's a documented thing yet anyway, right?)

I have a feeling it might require a tskit release too, though?

@nspope
Copy link
Contributor

nspope commented Nov 8, 2023

I think this PR is waiting on extend_edges for the tests to pass? Which might now be merged into the github head for tskit. But maybe we should wait until this is in a release -- I don't think there's an immediate rush.

@petrelharp
Copy link

It is merged to tskit, but we need a release.

@nspope
Copy link
Contributor

nspope commented Dec 11, 2023

Note: we need to ensure that the discrepancy function and time RMSE are calculated correctly when nodes have no matches. Currently, the tie would be broken by comparing to all nodes in the tree sequence (we think?). @hfr1tz3 will fix

Added unit test.
Edited both naive_discrepancy and tree_discrepancy to exclude non-matched nodes.
Had to remove -n argument in pytest.ini file. (I have no idea why its there)
@petrelharp
Copy link

Note that when we add stuff to a tree sequence we add both right and wrong stuff; if the proprotion of wrong stuff we add is greater than the proportion of wrong stuff already there, we increase discrepancy, even if this proportion is very low. (Example: extending a true but simplified tree sequence.)

So, proposal is that tree_discrepancy(a, b) also returns the proportion of the total span of b that is matched in a; this would be

(1 - discrepancy) * total_span(a) / total_span(b)

@petrelharp
Copy link

Also, a evaluation.total_span(ts) function that just does np.sum(evaluation.node_spans(ts)).

tsdate/evaluation.py Outdated Show resolved Hide resolved
tsdate/evaluation.py Outdated Show resolved Hide resolved
@petrelharp
Copy link

I think the code does what we want but the docs had it the other way around? I could be mixed up, though - please check?

@petrelharp
Copy link

Okay - if you agree with my suggested changes, then please hit the "commit change" button on them.

hfr1tz3 and others added 2 commits February 23, 2024 09:20
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
@petrelharp
Copy link

TODO:

  • squash commits
  • rebase to main
  • double check things are ready to merge

However, this currently uses extend_edges, which we are renaming extend_haplotypes; so let's (a) make the change to extend_haplotypes; (b) check the tests run locally; (c) mark skip them so this passes; (d) make an issue to un-skip these when a new tskit with extend_haplotypes is released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants