
Allow for single-state alignments and remove misleading warnings #73


Open
wants to merge 1 commit into
base: master

StefanFlaumberg
Contributor

The current version does not allow an alignment or a partition to have only one state: it raises a hardcoded error. The motivation for this restriction is probably the assumption that states not present at the tree leaves are excluded from the substitution process. For a single-state alignment, that would leave only one state for modelling the substitution process, making the observed alignment a certain event given any tree topology and branch lengths (LL = 0).

Here I want to draw your attention to the fallacy of assuming that unrepresented states get excluded from the substitution process. Actually, they do not! Below is a simple example as proof:

Let's assume a dummy alignment dummy.phy:

5 1
1    A
2    A
3    A
4    A
5    A

And use it for tree reconstruction after disabling the error-raising expression in the Alignment::checkAbsentStates function:
iqtree3 -seed 123 -nt 1 -s dummy.phy --seqtype AA --keep-ident -m "LG" -pre test_dummy_allstates
iqtree3 -seed 123 -nt 1 -s dummy.phy --seqtype AA --keep-ident -m "LG+F" -pre test_dummy_onestate
Both runs finish successfully. The run using the LG model results in LL = -2.537, while the run using the LG+F model results in LL = 0.0. All the inferred tree branches have the minimum allowed length (1e-6) in both runs.

The explanation is straightforward. The built-in state frequencies of the LG matrix used in the first run allow all states to occur in the substitution process. This implies that alignments other than the given one can also evolve on any tree to be estimated, hence the non-zero LL of the tree estimated for the given alignment. In contrast, the observed +F state frequencies freq(A, other) = (1.0, 0.0) used in the second run allow only the A state in the substitution process, making the evolution of the given alignment inevitable on any tree.
Were unrepresented states really excluded from the substitution process, we would observe the second-run situation in both runs.
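
The two LL values can be reproduced by hand with a minimal Python sketch. It is not IQ-TREE code and makes several simplifying assumptions: a star tree instead of the inferred topology, an equal-input (F81-style) model instead of the LG exchangeabilities, and a hypothetical frequency vector in which A has the LG frequency (assumed here to be 0.079066) and the other 19 states share the remainder equally. With near-zero branches only the frequency of A matters, so LL is close to ln(freq(A)).

```python
import math

def constant_site_log_likelihood(pi, observed, n_leaves, t):
    """Log-likelihood of one constant site on a star tree.

    Equal-input model: P(i -> j | t) = exp(-t) * [i == j] + (1 - exp(-t)) * pi[j].
    Every leaf shows the state `observed` and hangs off the root by a branch of length t.
    """
    decay = math.exp(-t)
    likelihood = 0.0
    for root_state, p_root in enumerate(pi):
        same = 1.0 if root_state == observed else 0.0
        p_trans = decay * same + (1.0 - decay) * pi[observed]
        likelihood += p_root * p_trans ** n_leaves
    return math.log(likelihood)

N_STATES = 20
A = 0                 # index of the observed state
PI_A_LG = 0.079066    # assumed LG equilibrium frequency of alanine

# "LG" run: all states have non-zero frequency (the split among the rest is a placeholder)
pi_all_states = [PI_A_LG] + [(1.0 - PI_A_LG) / (N_STATES - 1)] * (N_STATES - 1)
# "LG+F" run: empirical frequencies from the alignment, freq(A, other) = (1.0, 0.0)
pi_one_state = [1.0] + [0.0] * (N_STATES - 1)

ll_all = constant_site_log_likelihood(pi_all_states, A, n_leaves=5, t=1e-6)
ll_one = constant_site_log_likelihood(pi_one_state, A, n_leaves=5, t=1e-6)
print(f"all states: LL = {ll_all:.3f}")
print(f"one state:  LL = {ll_one:.3f}")
```

Under these assumptions the first value is close to ln(0.079066), consistent with the LL = -2.537 reported for the LG run, and the second is essentially zero, as in the LG+F run.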

The conclusion is:

  • States unrepresented in the alignment are not excluded from the substitution process and never have been. Apparently, this is fine and usually does not lead to numerical issues.
  • A single-state alignment/partition is suitable for an analysis including branch length optimization, whether only one state or all states are included in the substitution process. It likewise seems not to lead to any issues, but the estimated branch lengths are certainly not reliable (due to the scarcity of information).

This pull request allows for single-state alignments/partitions and modifies the warnings in accordance with the conclusion.
The Alignment::checkAbsentStates and SuperAlignment::checkAbsentStates functions now return void because 1) all the relevant information is already printed inside Alignment::checkAbsentStates, and printing the sum of the numbers of unobserved states in SuperAlignment::checkAbsentStates (which can easily exceed num_states) is somewhat misleading, and 2) the number of unobserved states is not used anywhere else.
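
To illustrate why the summed count is misleading, here is a small Python sketch. It is not the IQ-TREE implementation; the function name and the toy partitions are made up for illustration. It counts unobserved amino-acid states per partition and then sums the counts.

```python
AA_STATES = "ARNDCQEGHILKMFPSTWYV"  # the 20 amino-acid states

def count_absent_states(sequences, states=AA_STATES):
    """Number of states that never occur in the given sequences."""
    observed = set("".join(sequences))
    return sum(1 for s in states if s not in observed)

# Hypothetical partitions, each a tiny alignment of two sequences
partitions = {
    "part1": ["AAAA", "AAAA"],
    "part2": ["RRNN", "RRNN"],
    "part3": ["DDCC", "DDCQ"],
}

per_partition = {name: count_absent_states(seqs) for name, seqs in partitions.items()}
total_absent = sum(per_partition.values())
print(per_partition, total_absent, len(AA_STATES))
```

Each per-partition count stays below 20, but their sum is larger than the number of states, so it cannot be read as "the number of states missing from the alignment".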

@StefanFlaumberg
Contributor Author

StefanFlaumberg commented Jun 27, 2025

Why this is important:

I am interested in working with structure-based partitioned alignments. In such alignments, a partition for a specific structure type can comprise a single constant site. The phylogenetic impact of such a partition is negligible, and it is not easy to filter out in advance. Thus, if it doesn't lead to an error, it had better be included in the analysis.

Other users have also enquired about the "States ... not present in ... and thus removed from Markov process" warning (see iqtree/iqtree2#454 and #33). So I think it would be useful either to remove this misleading warning, if my conclusions are right, or otherwise to discuss what really happens with the unrepresented states in light of the example I presented above.

@StefanFlaumberg
Contributor Author

A possible problem:

I've tested the modified IQ-Tree version against RAxML-NG (single-state partitions are allowed there by default!) on the dummy alignment and on several partitioned real alignments with different options for branch length optimization. The results were similar except for the case of length-proportional partition models: here IQ-Tree fails to properly optimize the tree scaling factor of the single-state partition, estimating the factor to be ca. 1.0 when it should be ca. 0.0.

The problem stems from the optimization constraints for the tree scaling factor in the PartitionModelPlen::optimizeGeneRate and PhyloTree::optimizeTreeLengthScaling functions.
The first function imposes the following constraints: min_scaling = 1.0/tree->at(i)->getAlnNSite() and max_scaling = nsites / tree->at(i)->getAlnNSite(). To me, these are very odd heuristics. For one thing, they imply that a single-site partition cannot be much slower than the average (for one site, min_scaling = 1.0), but why?
The second function's constraints are designed to keep branch lengths within the min_branch_length and max_branch_length boundaries during tree scaling. They become a problem for a tree that has both near-zero and normal branch lengths, since for such a tree min_scaling is constrained to be no less than 1.0.
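
The effect of both sets of constraints can be seen in a short Python sketch. This is an illustration, not the IQ-TREE code. The first helper mirrors the two formulas quoted above. The second is only an assumed reading of the tree-scaling constraint (the smallest and largest factors that keep every branch within bounds), and the function names and the max_len default of 10.0 are placeholders.

```python
def gene_rate_bounds(partition_nsites, total_nsites):
    """Bounds quoted for PartitionModelPlen::optimizeGeneRate."""
    min_scaling = 1.0 / partition_nsites
    max_scaling = total_nsites / partition_nsites
    return min_scaling, max_scaling

def tree_scaling_bounds(branch_lengths, min_len=1e-6, max_len=10.0):
    """Assumed form of the constraint: every scaled branch must stay in [min_len, max_len]."""
    min_scaling = min_len / min(branch_lengths)
    max_scaling = max_len / max(branch_lengths)
    return min_scaling, max_scaling

# A single-site partition in a 1000-site alignment cannot be scaled below 1.0
single_site = gene_rate_bounds(partition_nsites=1, total_nsites=1000)
# A 500-site partition may go down to 0.002
large_part = gene_rate_bounds(partition_nsites=500, total_nsites=1000)
# One branch already at the minimum length pins the lower bound at 1.0
mixed_tree = tree_scaling_bounds([1e-6, 0.1, 0.2])
print(single_site, large_part, mixed_tree)
```

In both cases the lower bound for the single-state partition is 1.0, which would explain a factor estimated at ca. 1.0 when it should be ca. 0.0.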

My solution would be to use no constraints in the first function and to rewrite the rescaling procedure so that it always performs rescaling with length saturation (i.e. any rescaling factor is allowed, but no resulting branch length falls outside the preset min and max values). This would make the rescaling procedure more biologically sensible. Changing the rescaling procedure would also require several changes in scale factor optimization.
I think I could implement it myself, but only in a separate pull request once/if this one gets accepted.
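
A minimal Python sketch of what rescaling with length saturation could look like is given below. It is only an illustration of the idea, not a proposed patch: the function name is hypothetical, min_len uses the 1e-6 minimum mentioned earlier, and max_len = 10.0 is a placeholder.

```python
def rescale_with_saturation(branch_lengths, factor, min_len=1e-6, max_len=10.0):
    """Scale every branch by `factor`, then clamp the result to [min_len, max_len]."""
    return [min(max(length * factor, min_len), max_len) for length in branch_lengths]

tree = [1e-6, 0.1, 0.2]  # one near-zero branch and two normal ones

# Any factor is accepted; the near-zero branch simply stays at the minimum
shrunk = rescale_with_saturation(tree, 0.01)
# Long branches saturate at the maximum instead of blocking the rescaling
stretched = rescale_with_saturation(tree, 100.0)
print(shrunk, stretched)
```

With this approach the scaling factor itself needs no bounds, because the clamping is applied per branch after scaling. The cost is that the operation is no longer exactly invertible once a branch saturates, which is presumably why scale factor optimization would need changes too.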

Finally, there are two important things I'd like to draw attention to here:

  1. The problem of single-site partition scaling constraints is likely to have little effect in real cases, given that such partitions usually comprise only a small share of the alignment sites. Thus, the current pull request is still relevant and complete.
  2. The current implementation of tree scaling is suboptimal regardless of the current pull request topic. The example of a tree with both near-zero and normal branch lengths is quite realistic and shows that rescaling with length saturation might be a better option.
