Skip to content

Conversation

@gnikolenyi
Copy link
Collaborator

@gnikolenyi gnikolenyi commented Dec 28, 2025

Fixes #72

Copy link
Contributor

@jnwei jnwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting a fix for this issue @gnikolenyi ! I understand that this fix might be a smaller part of a set of larger fixes planned with the later release. Thank you for adding this fix separately first.

To ensure correctness of the check_sequence logic, it would be extremely helpful to have a few test cases. Simple examples like the one provided by @ECalfeeAdaptive in #72 are perfect for testing this function as it is easy to see the contents of the examples, and easy to check what the expected output should be.

Please let me know if I can provide any assistance with adding the the tests / documentation.

Minimum required template hit length. Defaults to 10 residues.
Returns:
bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we update the return description to reflect the 3 values that are now being returned?

Also, it doesn't appear that we use the other return values of query_aln and hit_aln. Perhaps it would be easier to add these return values later if/when they are needed?

# Template cache construction
def check_sequence(
query_seq: str,
query: TemplateHit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly related to this PR, but could we update docstring for TemplateHit? Some of the fields seem out of date:

class TemplateHit(NamedTuple):
"""Tuple containing template hit information.
Attributes:
index (str):
Row index of the hit in the alignment.
name (str):
PDB-chain ID of the hit.
aligned_cols (int):
Number of
hit_sequence (str):
The PDB ID of the hit.
indices_hit (str):
The PDB ID of the hit.
e_value (str):
The PDB ID of the hit.
"""

min_align: float = 0.1,
min_len: int = 10,
) -> bool:
"""Applies sequence filters to template hits following AF3 SI Section 2.4.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a quick description of the template filters from this section. AFAICT from the code, these filters are:

  • Fails if coverage < min_align threshold.
  • Fails if coverage >= max_subseq AND covered == identical.

Digging into the second statement, the anticipated outputs are:

  • covered == identical -- this suggests that the template hit is identical to the query hit, because the non-gaps are located in the same places in the query / template hit. This hit would fail the filter
  • covered != identical -- this suggests that some of the gap tokens in the matching sequence are not in the same locations, and thus the sequence is not a perfect match. This hit would pass the filter.

If the above understanding is correct, I am not sure if this would resolve the issue raised in the test example given in #72 . In that case, we have a hit which has 100% coverage, but has a different sequence value. In that test case, I believe the function would still fail the checks in this function.

Copy link

@ECalfeeAdaptive ECalfeeAdaptive Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sequence positions that are aligned but not identical (AA substitutions) don't seem to be covered by this PR. I think the template 'duplicate' logic here from openfold could be re-used for openfold3 and would resolve the issue I raised in #72



# Template cache construction
def check_sequence(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we consider making this function name more descriptive. Perhaps something like "check_seqence_similarity_within_range"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Top templates incorrectly filtered out by training preprocessing script

3 participants