3 changes: 2 additions & 1 deletion openfold3/core/data/pipelines/preprocessing/template.py
@@ -454,7 +454,8 @@ def create_template_cache_for_query(
             continue

         # 1. Apply sequence filters: AF3 SI Section 2.4
-        if check_sequence(query_seq=query.hit_sequence.replace("-", ""), hit=hit):
+        fails_sequence_filters, _, _ = check_sequence(query=query, hit=hit)
+        if fails_sequence_filters:
             template_process_logger.info(
                 f"Template {hit_pdb_id} sequence does not pass sequence"
                 " filters. Skipping this template."
33 changes: 26 additions & 7 deletions openfold3/core/data/primitives/sequence/template.py
@@ -143,7 +143,7 @@ def parse_representatives(

 # Template cache construction
 def check_sequence(
Contributor

Could we consider making this function name more descriptive? Perhaps something like "check_sequence_similarity_within_range"?

-    query_seq: str,
+    query: TemplateHit,
Contributor

Not strictly related to this PR, but could we update the docstring for TemplateHit? Some of the fields seem out of date:

class TemplateHit(NamedTuple):
    """Tuple containing template hit information.
    Attributes:
        index (str):
            Row index of the hit in the alignment.
        name (str):
            PDB-chain ID of the hit.
        aligned_cols (int):
            Number of
        hit_sequence (str):
            The PDB ID of the hit.
        indices_hit (str):
            The PDB ID of the hit.
        e_value (str):
            The PDB ID of the hit.
    """

     hit: TemplateHit,
     max_subseq: float = 0.95,
     min_align: float = 0.1,
@@ -152,8 +152,8 @@ def check_sequence(
     """Applies sequence filters to template hits following AF3 SI Section 2.4.
Contributor

Could we add a quick description of the template filters from this section? AFAICT from the code, these filters are:

  • Fails if coverage < min_align threshold.
  • Fails if coverage >= max_subseq AND covered == identical.

Digging into the second statement, the anticipated outputs are:

  • covered == identical -- this suggests that the template hit is identical to the query hit, because the non-gaps are located in the same places in the query / template hit. This hit would fail the filter.
  • covered != identical -- this suggests that some of the gap tokens in the matching sequence are not in the same locations, and thus the sequence is not a perfect match. This hit would pass the filter.

If the above understanding is correct, I am not sure if this would resolve the issue raised in the test example given in #72. In that case, we have a hit which has 100% coverage, but has a different sequence value. In that test case, I believe the hit would still fail the checks in this function.
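
To make the concern above concrete, here is a minimal standalone sketch that mirrors the coverage and identity logic from this diff on plain alignment strings. The function name `fails_sequence_filters_sketch` and the toy sequences are illustrative assumptions, the `min_len` check is omitted, and the query length is taken from the aligned query row instead of a `TemplateHit`.

```python
import numpy as np


def fails_sequence_filters_sketch(
    query_aln: str,
    hit_aln: str,
    max_subseq: float = 0.95,
    min_align: float = 0.1,
) -> bool:
    """Hypothetical standalone copy of the filter logic in this diff."""
    # Encode both aligned rows as arrays of single bytes, treating "." as a gap.
    q = np.frombuffer(query_aln.replace(".", "-").encode("ascii"), dtype="S1")
    h = np.frombuffer(hit_aln.replace(".", "-").encode("ascii"), dtype="S1")

    query_not_gap = q != b"-"
    hit_not_gap = h != b"-"

    # Columns where both the query and the hit have a residue.
    columns_to_keep = query_not_gap & hit_not_gap
    covered = columns_to_keep.sum()

    query_len = len(query_aln.replace("-", ""))
    coverage = covered / (query_len or 1)
    if coverage < min_align:
        return True

    # As in the diff: this compares gap masks, not residue letters.
    identical = (columns_to_keep & (query_not_gap == hit_not_gap)).sum()
    return bool(coverage >= max_subseq and identical == covered)


exact_duplicate = fails_sequence_filters_sketch("ACDEFGHIKL", "ACDEFGHIKL")
one_substitution = fails_sequence_filters_sketch("ACDEFGHIKL", "ACDEFGHIKW")
half_covered = fails_sequence_filters_sketch("ACDEFGHIKL", "ACDEF-----")
```

In this sketch `identical` is derived from the gap masks only, so on the kept columns it always equals `covered`. A fully covered hit that differs from the query by a substitution (`one_substitution`) is therefore rejected in the same way as an exact duplicate, which matches the behavior questioned above.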

ECalfeeAdaptive Dec 30, 2025

Sequence positions that are aligned but not identical (AA substitutions) don't seem to be covered by this PR. I think the template 'duplicate' logic here from openfold could be re-used for openfold3 and would resolve the issue I raised in #72.
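
One way to account for substitutions is to compare residues, not just gap positions, in the columns where both rows are aligned. The sketch below is an illustration only: `aligned_identity` is a hypothetical helper, it is not the openfold duplicate logic referenced above, and treating both `-` and `.` as gaps is an assumption carried over from the diff.

```python
def aligned_identity(query_aln: str, hit_aln: str) -> tuple[int, int]:
    """Return (covered, identical) over columns where neither row has a gap.

    Hypothetical helper for illustration; not part of this PR.
    """
    if len(query_aln) != len(hit_aln):
        raise ValueError("Aligned rows must have the same length.")

    gap_chars = {"-", "."}
    covered = 0
    identical = 0
    for query_res, hit_res in zip(query_aln, hit_aln):
        if query_res in gap_chars or hit_res in gap_chars:
            continue  # Skip columns where either row is gapped.
        covered += 1
        if query_res.upper() == hit_res.upper():
            identical += 1
    return covered, identical


covered, identical = aligned_identity("ACDEFGHIKL", "ACDEFGHIKW")
is_exact_duplicate = covered == identical
```

With a residue-level count like this, a fully covered hit with a single substitution gives `identical < covered`, so a duplicate check of the form `identical == covered` would no longer reject it.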


     Args:
-        query_seq (str):
-            The query sequence.
+        query (TemplateHit):
+            The query template_hit.
         hit (TemplateHit):
             Candidate template hit.
         max_subseq (float, optional):
@@ -167,13 +167,32 @@ def check_sequence(
         bool:
Contributor

Could we update the return description to reflect the 3 values that are now being returned?

Also, it doesn't appear that we use the other return values of query_aln and hit_aln. Perhaps it would be easier to add these return values later if/when they are needed?
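
If the three return values are kept, one option is to name and document them in a small result type. This is only a sketch: `SequenceFilterResult` is a hypothetical name, and the aligned rows are typed as plain strings here to keep the example dependency-free, whereas the diff returns NumPy byte arrays.

```python
from typing import NamedTuple, Optional


class SequenceFilterResult(NamedTuple):
    """Hypothetical documented return type; not part of this PR.

    Attributes:
        fails: True if the hit does not pass the sequence filters.
        query_aln: Aligned query row, or None if the hit was rejected early.
        hit_aln: Aligned hit row, or None if the hit was rejected early.
    """

    fails: bool
    query_aln: Optional[str] = None
    hit_aln: Optional[str] = None


# An early rejection carries no alignments.
early_reject = SequenceFilterResult(fails=True)

# Callers can still unpack it like the 3-tuple in the diff.
fails_sequence_filters, _, _ = early_reject
```

Returning only the boolean until the alignments are needed, as suggested above, would avoid the extra type altogether.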

             Whether the hit passes the sequence filters.
     """
+    query_seq = query.hit_sequence.replace("-", "")
     hit_seq = hit.hit_sequence.replace("-", "")
-    return (
-        ((len(hit_seq) / len(query_seq)) > max_subseq)
-        | ((hit.aligned_cols / len(query_seq)) < min_align)
-        | (len(hit_seq) < min_len)
+    if len(hit_seq) < min_len:
+        return True, None, None
+    query_aln = np.frombuffer(
+        query.hit_sequence.replace(".", "-").encode("ascii"), dtype="S1"
+    )
+    hit_aln = np.frombuffer(
+        hit.hit_sequence.replace(".", "-").encode("ascii"), dtype="S1"
     )
+
+    query_not_gap = query_aln != b"-"
+    hit_not_gap = hit_aln != b"-"
+
+    columns_to_keep = query_not_gap & hit_not_gap
+    covered = columns_to_keep.sum()
+
+    coverage = covered / (len(query_seq) or 1)
+
+    if coverage < min_align:
+        return True, None, None
+
+    identical = (columns_to_keep & (query_not_gap == hit_not_gap)).sum()
+
+    return coverage >= max_subseq and identical == covered, query_aln, hit_aln


 def parse_release_date(cif_file: CIFFile) -> datetime:
     """Parses the release date of a structure from its CIF file.