feat(inference): Add support for custom residue numbering (resolves #58) #69

chodec · 2025-12-06T21:25:40Z

This Pull Request implements full support for custom residue numbering in the inference output. This feature allows users to define specific residue numbering in the input JSON, resolving Issue #58.

Summary of Changes

The primary goal was to let users define residue numbering in the input JSON instead of relying on the default numbering that starts at 1. Non-sequential numbers and PDB-style insertion codes (e.g., '103A') are both supported.

The implementation required coordinated changes across three key modules:

Input Definition (inference_query_format.py):
- Added two optional string fields to the Chain class: starting_residue_number (for simple offset) and residue_ids (for explicit lists).

Data Processing (inference.py):
- Implemented logic that ensures residue_ids takes precedence over starting_residue_number. If a valid explicit list is provided, it is used; otherwise, a sequential list is generated based on the start number. The final list is stored in the data batch.

Output Writing (post-processing in writer.py):
- Implemented the static method OF3OutputWriter._renumber_atom_array. This method executes after model inference but before writing the PDB/mmCIF file.
- It uses regular expressions (re) to safely parse string IDs (e.g., separating '103A' into the integer ID 103 and the insertion code 'A').
- The new IDs are applied directly to the Biotite AtomArray's res_id and ins_code annotations. This ensures the output structure reflects the desired numbering without affecting core model calculations.
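
The parsing step described for writer.py can be sketched in isolation. This is a minimal illustration and not the code in this PR: the pattern, the helper names parse_residue_id and split_residue_ids, and the use of an empty string for "no insertion code" are all assumptions. It accepts an optional minus sign, digits, and at most one trailing letter.

```python
import re

# Assumed pattern: optional sign, digits, then at most one letter as insertion code.
_RES_ID_PATTERN = re.compile(r"(-?\d+)([A-Za-z]?)")


def parse_residue_id(raw: str) -> tuple[int, str]:
    """Split a string such as '103A' into (103, 'A'); hypothetical helper."""
    match = _RES_ID_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"Cannot parse residue ID: {raw!r}")
    number, ins_code = match.groups()
    return int(number), ins_code


def split_residue_ids(raw_ids: list[str]) -> tuple[list[int], list[str]]:
    """Return parallel lists of integer residue numbers and insertion codes."""
    parsed = [parse_residue_id(raw) for raw in raw_ids]
    res_ids = [number for number, _ in parsed]
    ins_codes = [code for _, code in parsed]
    return res_ids, ins_codes
```

The two returned lists hold one entry per residue, so they would still need to be expanded to one entry per atom before being assigned to the res_id and ins_code annotations mentioned above.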

Related Issues

Resolves: #58

Testing and Validation

Note on Testing: Because of local environment configuration issues (missing model checkpoints), I could not perform an end-to-end test run.

However, the logic has been manually validated to ensure:

- Priority and Consistency: The implementation correctly prioritizes residue_ids and handles a sequence-length mismatch by falling back to the standard numbering (1, 2, 3...).
- Parsing Robustness: The regex parsing logic in writer.py correctly extracts insertion codes, which is critical for PDB compliance.
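
The priority rule can be written down as a small standalone function. This is a sketch under stated assumptions, not the logic in inference.py: the function name resolve_residue_ids is hypothetical, residue_ids is assumed to arrive as a list of strings, and a list whose length does not match the sequence is assumed to be ignored in favour of sequential numbering from the start number (or from 1 when no start number is given).

```python
from typing import Optional


def resolve_residue_ids(
    seq_len: int,
    residue_ids: Optional[list[str]] = None,
    starting_residue_number: Optional[str] = None,
) -> list[str]:
    """Pick the final residue ID list for a chain; hypothetical helper."""
    # An explicit list wins, but only if it has one entry per residue.
    if residue_ids is not None and len(residue_ids) == seq_len:
        return [str(res_id) for res_id in residue_ids]

    # Otherwise number sequentially from the start number, defaulting to 1.
    start = 1 if starting_residue_number is None else int(starting_residue_number)
    return [str(start + offset) for offset in range(seq_len)]
```

For example, resolve_residue_ids(3, ["10", "11", "11A"], "5") keeps the explicit list, while resolve_residue_ids(3, None, "5") numbers the residues 5, 6, 7.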

jnwei

Hi @chodec,

Thank you for working on this PR. The custom numbering feature is a tricky feature to get right, but as #58 indicates, it would be very useful for researchers.

I took a first pass over this PR today, and I have a few suggestions / questions:

First, a check of my understanding: In the current implementation, a new method for creating a custom residue ID list (get_custom_residue_ids) is added to the InferenceDataset class. It looks like the intention is for the new residue IDs to be read by the output writer, but I don't see where the new residue IDs are added to the batch?
I would recommend that the custom residue_id list be created upon construction of the Chain class in inference_query_format.py rather than being generated in the InferenceDataset. This way, the logic around parsing the residue ids can be kept in one place, rather than adding extra logic to the InferenceDataset.
- For an example of how to use pydantic validators to create the residue_id list given an input that is either a full list or an int, you might be able to borrow the logic used in InferenceExperimentSettings to generate random seeds from a list or an initial integer seed here
- The InferenceDataset can then be used to create batch features of the custom residue list if it is provided in a chain.
Could you please add unit tests to test the creation of the custom residue numbering? I think it could be helpful to have two tests:
- One test for generating the optional residue_id list in the Chain class, perhaps added here
- One test for writing the outputs, which could be added here.
I would guess that some of the examples you used for manual validation of the numbering might be suitable test cases.
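
To make the two suggestions above concrete, here is a rough sketch of resolving the ID list when the chain object is constructed, together with unit tests for it. It is not the openfold3 Chain class: ChainSketch is a hypothetical stand-in, and a standard-library dataclass with __post_init__ is used in place of a pydantic validator so the example runs without extra dependencies. The fallback on a length mismatch follows the behaviour described in the PR description and is an assumption about the intended design.

```python
import unittest
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ChainSketch:
    """Hypothetical stand-in for Chain; resolves residue IDs at construction."""

    sequence: str
    starting_residue_number: Optional[Union[int, str]] = None
    residue_ids: Optional[list[str]] = None

    def __post_init__(self) -> None:
        n_residues = len(self.sequence)
        if self.residue_ids is not None and len(self.residue_ids) == n_residues:
            # A valid explicit list takes precedence over the start number.
            self.residue_ids = [str(res_id) for res_id in self.residue_ids]
            return
        # Missing or wrong-length list: number sequentially (start defaults to 1).
        start = 1
        if self.starting_residue_number is not None:
            start = int(self.starting_residue_number)
        self.residue_ids = [str(start + i) for i in range(n_residues)]


class ChainSketchTest(unittest.TestCase):
    def test_explicit_list_takes_precedence(self):
        chain = ChainSketch(
            "ACD", starting_residue_number=50, residue_ids=["103", "103A", "104"]
        )
        self.assertEqual(chain.residue_ids, ["103", "103A", "104"])

    def test_start_number_generates_sequential_ids(self):
        chain = ChainSketch("ACD", starting_residue_number=50)
        self.assertEqual(chain.residue_ids, ["50", "51", "52"])

    def test_length_mismatch_falls_back_to_default(self):
        chain = ChainSketch("ACD", residue_ids=["1", "2"])
        self.assertEqual(chain.residue_ids, ["1", "2", "3"])
```

With the list resolved on the chain itself, the dataset code would only need to copy chain.residue_ids into the batch, and the writer test could focus on parsing and renumbering.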

Also assigning @ljarosch to review, as he has more experience working with biotite and renumbering chains and may have additional suggestions regarding the organization.

Please let us know if you have any questions, and thank you again for your work on this issue!

jnwei · 2025-12-08T10:43:56Z

openfold3/core/runners/writer.py

        elif out_fmt == "npz":
            np.savez_compressed(out_file_full, **full_confidence_scores)

+# openfold3/core/runners/writer.py


Please remove this stray comment

feat(inference): Add support for custom residue numbering (resolves a…

395f347

…qlaboratory#58)

jnwei requested changes Dec 8, 2025

jnwei requested a review from ljarosch December 8, 2025 11:21

Fix: Compatibility and Core Logic for Python 3.13 / Pydantic V4

eabfc13

chodec marked this pull request as draft December 12, 2025 15:26
