-
Notifications
You must be signed in to change notification settings - Fork 55
Fix check_sequence filter #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
jnwei
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for submitting a fix for this issue @gnikolenyi ! I understand that this fix might be a smaller part of a set of larger fixes planned with the later release. Thank you for adding this fix separately first.
To ensure correctness of the check_sequence logic, it would be extremely helpful to have a few test cases. Simple examples like the one provided by @ECalfeeAdaptive in #72 are perfect for testing this function as it is easy to see the contents of the examples, and easy to check what the expected output should be.
Please let me know if I can provide any assistance with adding the the tests / documentation.
| Minimum required template hit length. Defaults to 10 residues. | ||
| Returns: | ||
| bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we update the return description to reflect the 3 values that are now being returned?
Also, it doesn't appear that we use the other return values of query_aln and hit_aln. Perhaps it would be easier to add these return values later if/when they are needed?
| # Template cache construction | ||
| def check_sequence( | ||
| query_seq: str, | ||
| query: TemplateHit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly related to this PR, but could we update docstring for TemplateHit? Some of the fields seem out of date:
openfold-3/openfold3/core/data/io/sequence/template.py
Lines 69 to 85 in 0b9df93
| class TemplateHit(NamedTuple): | |
| """Tuple containing template hit information. | |
| Attributes: | |
| index (str): | |
| Row index of the hit in the alignment. | |
| name (str): | |
| PDB-chain ID of the hit. | |
| aligned_cols (int): | |
| Number of | |
| hit_sequence (str): | |
| The PDB ID of the hit. | |
| indices_hit (str): | |
| The PDB ID of the hit. | |
| e_value (str): | |
| The PDB ID of the hit. | |
| """ |
| min_align: float = 0.1, | ||
| min_len: int = 10, | ||
| ) -> bool: | ||
| """Applies sequence filters to template hits following AF3 SI Section 2.4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a quick description of the template filters from this section. AFAICT from the code, these filters are:
- Fails if
coverage < min_alignthreshold. - Fails if
coverage >= max_subseq AND covered == identical.
Digging into the second statement, the anticipated outputs are:
- covered == identical -- this suggests that the template hit is identical to the query hit, because the non-gaps are located in the same places in the query / template hit. This hit would fail the filter
- covered != identical -- this suggests that some of the gap tokens in the matching sequence are not in the same locations, and thus the sequence is not a perfect match. This hit would pass the filter.
If the above understanding is correct, I am not sure if this would resolve the issue raised in the test example given in #72 . In that case, we have a hit which has 100% coverage, but has a different sequence value. In that test case, I believe the function would still fail the checks in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
|
|
||
| # Template cache construction | ||
| def check_sequence( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we consider making this function name more descriptive. Perhaps something like "check_seqence_similarity_within_range"?
Fixes #72