@chodec chodec commented Dec 6, 2025

This pull request adds support for custom residue numbering in the inference output, resolving Issue #58.

Summary of Changes

The primary goal was to allow users to define specific residue numbering in the input JSON, rather than relying on the default numbering starting from 1. This includes support for non-sequential numbers and PDB-style insertion codes (e.g., '103A').

The implementation required coordinated changes across three key modules:

  1. Input Definition (inference_query_format.py):
    • Added two optional string fields to the Chain class: starting_residue_number (for simple offset) and residue_ids (for explicit lists).
  2. Data Processing (inference.py):
    • Implemented logic that ensures residue_ids takes precedence over starting_residue_number. If a valid explicit list is provided, it is used; otherwise, a sequential list is generated based on the start number. The final list is stored in the data batch.
  3. Output Writing (Post-processing in writer.py):
    • Implemented the static method OF3OutputWriter._renumber_atom_array. This method executes after model inference but before writing the PDB/mmCIF file.
    • It uses regular expressions (re) to safely parse string IDs (e.g., separating '103A' into the integer ID 103 and the insertion code 'A').
    • The new IDs are applied directly to the Biotite AtomArray's res_id and ins_code annotations. This ensures the output structure reflects the desired numbering without affecting core model calculations.
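
The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the pattern, the function names `parse_residue_id` and `expand_to_atoms`, and the use of plain lists in place of a Biotite AtomArray are all assumptions made to keep the example self-contained. In the actual writer, the resulting per-atom values would be assigned to the array's res_id and ins_code annotations.

```python
import re

# Hypothetical pattern: optional sign, digits, then at most one insertion-code letter.
_RES_ID_PATTERN = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def parse_residue_id(label: str) -> tuple[int, str]:
    """Split a PDB-style residue label such as '103A' into (103, 'A')."""
    match = _RES_ID_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"Cannot parse residue ID: {label!r}")
    number, ins_code = match.groups()
    # An absent insertion code is represented by the empty string.
    return int(number), ins_code


def expand_to_atoms(labels, atoms_per_residue):
    """Repeat each residue's (number, insertion code) once per atom it contains."""
    if len(labels) != len(atoms_per_residue):
        raise ValueError("labels and atoms_per_residue must have the same length")
    res_ids, ins_codes = [], []
    for label, n_atoms in zip(labels, atoms_per_residue):
        number, ins_code = parse_residue_id(label)
        res_ids.extend([number] * n_atoms)
        ins_codes.extend([ins_code] * n_atoms)
    return res_ids, ins_codes
```

Because only the output annotations are rewritten, a malformed label would surface as a ValueError at write time rather than affecting the model inputs.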

Related Issues

Resolves: #58


Testing and Validation

Note on Testing: Due to local environment configuration issues (missing model checkpoints), I could not perform an end-to-end test run.

However, the logic has been manually validated to ensure:

  • Priority and Consistency: The implementation correctly prioritizes residue_ids and handles a sequence-length mismatch by falling back to standard numbering (1, 2, 3...).
  • Parsing Robustness: The regex parsing logic in writer.py correctly extracts insertion codes, which is critical for PDB compliance.
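
For reference, the priority rule validated above could look something like the sketch below. The function name `resolve_residue_ids` and its signature are hypothetical, and the choice to fall back to 1, 2, 3... on a length mismatch (even when a start number is also given) is one reading of the description rather than confirmed behavior.

```python
from typing import Optional, Sequence


def resolve_residue_ids(
    sequence_length: int,
    residue_ids: Optional[Sequence[str]] = None,
    starting_residue_number: Optional[str] = None,
) -> list[str]:
    """Return one residue label per position, following the described precedence."""
    # 1. An explicit list wins, provided its length matches the sequence.
    if residue_ids is not None:
        if len(residue_ids) == sequence_length:
            return [str(r) for r in residue_ids]
        # Assumption: on a length mismatch, fall back to 1, 2, 3, ...
        return [str(i) for i in range(1, sequence_length + 1)]
    # 2. Otherwise number sequentially from the start value (default 1).
    start = int(starting_residue_number) if starting_residue_number is not None else 1
    return [str(start + i) for i in range(sequence_length)]
```

For example, `resolve_residue_ids(3, None, "100")` numbers the chain from 100, while passing a two-element list for a three-residue chain triggers the fallback.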

@jnwei jnwei left a comment

Hi @chodec,

Thank you for working on this PR. Custom numbering is a tricky feature to get right, but as #58 indicates, it would be very useful for researchers.

I took a first pass over this PR today, and I have a few suggestions / questions:

  • First, a check of my understanding: the current implementation adds a new method for creating a custom residue ID list (get_custom_residue_ids) to the InferenceDataset class. It looks like the intention is for the new residue IDs to be read by the output writer, but I don't see where they are added to the batch.

  • I would recommend that the custom residue_id list be created upon construction of the Chain class in inference_query_format.py rather than being generated in the InferenceDataset. This way, the logic around parsing the residue ids can be kept in one place, rather than adding extra logic to the InferenceDataset.

    • For an example of how to use pydantic validators to create the residue_id list from an input that is either a full list or an int, you might be able to borrow the logic used in InferenceExperimentSettings to generate random seeds from a list or an initial integer seed here
    • The InferenceDataset can then be used to create batch features of the custom residue list if it is provided in a chain.
  • Could you please add unit tests to test the creation of the custom residue numbering? I think it could be helpful to have two tests:
    - One test for generating the optional residue_id list in the Chain class, perhaps added here
    - One test for writing the outputs, which could be added here.

  • I would guess that some of the examples you used for manual validation of the numbering might be suitable test cases.
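
To make the suggestion concrete, here is a rough sketch of normalizing the residue IDs when the chain is constructed, along with two pytest-style tests. It uses a standard-library dataclass with `__post_init__` as a stand-in for a pydantic validator so that it runs without dependencies; `ChainSketch`, its `sequence` field, and the decision to raise on a length mismatch are assumptions for illustration, not the existing Chain definition or the behavior in this PR.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ChainSketch:
    """Hypothetical stand-in for Chain; only residue_ids comes from the PR."""

    sequence: str
    # Either an explicit list of labels or a single starting number.
    residue_ids: Optional[Union[int, list]] = None

    def __post_init__(self):
        n = len(self.sequence)
        if self.residue_ids is None:
            # No custom numbering requested: default to 1..n.
            self.residue_ids = [str(i) for i in range(1, n + 1)]
        elif isinstance(self.residue_ids, int):
            # A single integer is expanded into a sequential list.
            start = self.residue_ids
            self.residue_ids = [str(start + i) for i in range(n)]
        else:
            ids = [str(r) for r in self.residue_ids]
            if len(ids) != n:
                # Assumption: fail loudly here instead of silently renumbering.
                raise ValueError(f"Expected {n} residue IDs, got {len(ids)}")
            self.residue_ids = ids


def test_int_start_expands_to_list():
    assert ChainSketch("ACD", residue_ids=100).residue_ids == ["100", "101", "102"]


def test_explicit_list_is_kept():
    chain = ChainSketch("ACD", residue_ids=["102", "103", "103A"])
    assert chain.residue_ids == ["102", "103", "103A"]
```

With the list always populated at construction, the InferenceDataset would only need to copy it into the batch, and the writer would only need to parse it.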

Also assigning @ljarosch to review, as he has more experience working with biotite and renumbering chains and may have additional suggestions regarding the organization.

Please let us know if you have any questions, and thank you again for your work on this issue!

elif out_fmt == "npz":
    np.savez_compressed(out_file_full, **full_confidence_scores)

# openfold3/core/runners/writer.py

Please remove this stray comment

@jnwei jnwei requested a review from ljarosch December 8, 2025 11:21
@chodec chodec marked this pull request as draft December 12, 2025 15:26