Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docs discussing how to get a matrix of pairwise comparisons #2776

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 21 additions & 8 deletions docs/stats.md
Original file line number Diff line number Diff line change
Expand Up @@ -280,12 +280,10 @@ and also inefficient, because the allele frequencies for one population
gets used in computing many of those values.
So, statistics that take a `sample_sets` argument also take an `indexes` argument,
which for a statistic that operates on `k` sample sets will be a list of `k`-tuples.
If `indexes` is a length `n` list of `k`-tuples,
then the output will have `n` columns,
and if `indexes[j]` is a tuple `(i0, ..., ik)`,
then the `j`-th column will contain values of the statistic computed on
`(sample_sets[i0], sample_sets[i1], ..., sample_sets[ik])`.

If `indexes` is a length `n` list of `k`-tuples, then the output will have `n`
columns, so if `indexes[j]` is the tuple `(i0, ..., ik)`
the `j`-th column will contain values of the statistic computed on
`(sample_sets[i0], sample_sets[i1], ..., sample_sets[ik])`.

How multiple statistics are handled differs slightly between statistics
that operate on single sample sets and multiple sample sets.
Expand Down Expand Up @@ -338,14 +336,29 @@ The `indexes` parameter is interpreted in the following way:
statistic selecting the specified sample sets and we remove the last dimension
in the result array as described in the {ref}`sec_stats_output_dimensions` section.

- If if is `None` and `sample_sets` contains exactly `k` sample sets,
- If it is `None` and `sample_sets` contains exactly `k` sample sets,
this is equivalent to `indexes=range(k)`. **Note
that we also drop the outer dimension of the result array in this case**.

- If is is a list of `k`-tuples (each consisting of integers
- If it is a list of `k`-tuples (each consisting of integers
of integers between `0` and `len(sample_sets) - 1`) of length `n` we
compute `n` statistics based on these selections of sample sets.

For instance, a common requirement in the case of a pairwise (i.e. `k=2`) symmetrical
statistic such as `divergence` is to obtain the statistic between all pairs of
sample sets. This can be done by indexing all possible pairs as follows:

```{code-cell} ipython3
# Compute all pairwise divergences between diploid individuals. Or e.g. for pairwise
# divergences between each sampled genome, use `sample_sets=[[s] for s in ts.samples()]`
sample_sets=[i.nodes for i in ts.individuals()]
index_all_pairs = np.tril_indices(len(sample_sets)) # ((0,0), (1,0), (1,1), (2,0), ...)
d = ts.divergence(sample_sets, indexes=tuple(zip(*index_all_pairs)))
pairwise_matrix = np.zeros((len(sample_sets), len(sample_sets)))
pairwise_matrix[index_all_pairs] = d

print(pairwise_matrix) # NB: only lower triangular values filled
```

(sec_stats_windows)=

Expand Down