[Train] Richer Train Run Metadata by JasonLi1909 · Pull Request #59186 · ray-project/ray

JasonLi1909 · 2025-12-04T22:54:35Z

Overview

This PR adds richer metadata to Ray Train runs, improving train run observability and making runs easier to reproduce. The new metadata consists of the following:

Datasets
DataConfig
RunConfig (FailureConfig, CheckpointConfig, name, worker runtime environment, storage_filesystem, storage_path)
BackendConfig
ScalingConfig
Framework versions (ray and framework related module versions)
User defined train_loop_config
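As a rough illustration of the "framework versions" item, the sketch below collects installed package versions with the standard library. The function name `collect_framework_versions` and the package names are assumptions for illustration; the PR's own helper (`_get_framework_version`) may work differently.

```python
from importlib import metadata
from typing import Dict, Iterable


def collect_framework_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Return {package: version} for each installed package; skip missing ones."""
    versions: Dict[str, str] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment, so there is nothing to report.
            continue
    return versions


# "ray" and "torch" are example distribution names, not a list taken from the PR.
print(collect_framework_versions(["ray", "torch"]))
```

Skipping missing packages keeps the resulting `Dict[str, str]` free of placeholder values, which lines up with the `map<string, string> framework_versions` field in the proto schema.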

Implementation

Diagram: New metadata collection and export flow

This PR:

Implements logic to collect new metadata from the StateManager callback
Updates Train Run pydantic models and protobuf schemas to accommodate the new fields
Implements logic used in TrainStateActor's export API for converting pydantic TrainRuns into Protobuf, while also sanitizing fields to be human-readable
Refactors BackendConfig to be an abstract base class to allow for a new to_dict abstract method and implements a DefaultBackendConfig class
Adds tests for new metadata collection and state export
Increments export API TRAIN_SCHEMA_VERSION from 2 to 3
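The BackendConfig refactor listed above can be pictured with a minimal sketch. The class names below carry a `Sketch` suffix because they are illustrative stand-ins, not Ray's actual classes, and the `timeout_s` field is a hypothetical example.

```python
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Any, Dict


class BackendConfigSketch(ABC):
    """Stand-in for an abstract backend config; not Ray's actual class."""

    @abstractmethod
    def to_dict(self) -> Dict[str, Any]:
        """Return a plain-dict view of the config for metadata export."""


@dataclass
class DefaultBackendConfigSketch(BackendConfigSketch):
    """Default behavior: no backend-specific settings to report."""

    def to_dict(self) -> Dict[str, Any]:
        return {}


@dataclass
class TimeoutBackendConfigSketch(BackendConfigSketch):
    """Hypothetical framework-specific config with a single field."""

    timeout_s: int = 1800

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)
```

Making `to_dict` abstract forces every backend to state what it exports, while the default subclass gives callers something concrete to instantiate when no framework-specific backend applies.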

Pydantic `TrainRun` to Protobuf `ExportTrainRunEventData` Conversion

The table below details the type mapping from the pydantic TrainRun model to the protobuf ExportTrainRunEventData schema. The protobuf types were chosen to allow lossless conversion while preserving the original pydantic model's clear type contract.

Pydantic TrainRun to Protobuf ExportTrainRunEventData Type Mappings

Pydantic TrainRun Field	Protobuf ExportTrainRunEventData Field
`framework_versions: Dict[str, str]`	`map<string, string> framework_versions`
`run_settings: RunSettings`	`RunSettings run_settings`
`train_loop_config: Optional[Dict]`	`optional google.protobuf.Struct train_loop_config = 1;`
`backend_config: BackendConfig`	`BackendConfig backend_config`
`framework: Optional[TrainingFramework]`	`(enum) TrainingFramework framework`
`config: Dict[str, Any]`	`google.protobuf.Struct config`
`scaling_config: ScalingConfig`	`ScalingConfig scaling_config`
`num_workers: Union[int, Tuple[int, int]]`	`message IntRange {` `int32 min = 1;` `int32 max = 2;` `}` `// The number of workers for the Train run, can be a range with elastic training enabled` `oneof num_workers {` `int32 num_workers_fixed = 1;` `IntRange num_workers_range = 2;` `}`
`use_gpu: bool`	`bool use_gpu`
`resources_per_worker: Optional[Dict[str, float]]`	`message StringFloatMap {` `map<string, double> values = 1;` `}` `StringFloatMap resources_per_worker = 3;`
`placement_strategy: str`	`string placement_strategy`
`accelerator_type: Optional[str]`	`optional string accelerator_type`
`use_tpu: bool`	`bool use_tpu`
`topology: Optional[str]`	`optional string topology`
`bundle_label_selector: Optional[Union[Dict[str, str], List[Dict[str, str]]]]`	`message StringMap {` `map<string, string> values = 1;` `}` `repeated StringMap bundle_label_selector;`
`datasets: List[str]`	`repeated string datasets`
`data_config: DataConfig`	`DataConfig data_config`
`datasets_to_split: Union[Literal["all"], List[str]]`	`message All {}` `message StringList {` `repeated string values = 1;` `}` `oneof datasets_to_split {` `All all = 1;` `StringList datasets = 2;` `}`
`execution_options: Optional[Dict]`	`google.protobuf.Struct execution_options`
`enable_shard_locality: bool`	`bool enable_shard_locality`
`run_config: RunConfig`	`RunConfig run_config`
`name: str`	`string name`
`failure_config: FailureConfig`	`FailureConfig failure_config`
`worker_runtime_env: Dict[str, Any]`	`google.protobuf.Struct worker_runtime_env`
`checkpoint_config: CheckpointConfig`	`CheckpointConfig checkpoint_config`
`num_to_keep: int`	`optional uint64 num_to_keep`
`checkpoint_score_attribute: Optional[str]`	`optional string checkpoint_score_attribute`
`checkpoint_score_order: Literal["max", "min"]`	`(enum) CheckpointScoreOrder checkpoint_score_order`
`storage_path: str`	`string storage_path`
`storage_filesystem: Optional[str]`	`optional string storage_filesystem`
`max_failures: int`	`int max_failures`
`controller_failure_limit: int`	`int controller_failure_limit`

The table below outlines the underlying type usage patterns that guided the mappings above.

Protobuf Type Usage Patterns and Reasoning Table

Protobuf Type	Description / Usage
`google.protobuf.Struct`	Represents dictionaries with values of arbitrary types that are only known at runtime. Keys must always be strings.
`string`	A standard string field.
`oneof`	Used to model `Union[...]` types in pydantic where the field can be one of several mutually exclusive types that cannot be mapped to a single protobuf type.
`optional`	Mirrors optional pydantic fields (`Optional[T]` / nullable fields). The proto field can be unset and will not implicitly default. Provides a 1-1 mapping from pydantic to proto so the UI can differentiate values that have and have not been set.
`repeated`	Used for lists.
`map`	Used for dictionaries with enforced key and value types.
Scalar types (int, bool, etc.)	Map directly and behave as expected from their pydantic counterparts.
`Enum`	Used for pydantic enums or literals.

Note: These patterns should be maintained in any future schema changes.
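To make the `oneof` and `repeated` patterns concrete, here is a small sketch that maps the Python union types from the table onto plain dicts shaped like the proto messages. The helper names are hypothetical, the dicts only mimic the proto layout, and treating a `None` selector as an empty list is an assumption rather than behavior confirmed by the PR.

```python
from typing import Dict, List, Literal, Optional, Tuple, Union

NumWorkers = Union[int, Tuple[int, int]]
LabelSelector = Optional[Union[Dict[str, str], List[Dict[str, str]]]]


def num_workers_to_oneof(num_workers: NumWorkers) -> dict:
    """Pick exactly one branch, mirroring `oneof num_workers`."""
    if isinstance(num_workers, tuple):
        low, high = num_workers
        return {"num_workers_range": {"min": low, "max": high}}
    return {"num_workers_fixed": num_workers}


def datasets_to_split_to_oneof(value: Union[Literal["all"], List[str]]) -> dict:
    """Mirror `oneof datasets_to_split`: the `All` marker or a string list."""
    if value == "all":
        return {"all": {}}
    return {"datasets": {"values": list(value)}}


def bundle_label_selector_to_repeated(selector: LabelSelector) -> List[dict]:
    """Normalize None, one dict, or a list of dicts to StringMap-like dicts."""
    if selector is None:
        return []
    if isinstance(selector, dict):
        selector = [selector]
    return [{"values": dict(item)} for item in selector]
```

For example, `num_workers_to_oneof((2, 8))` selects the range branch, while `num_workers_to_oneof(4)` selects the fixed branch, so a consumer can always tell which side of the union was set.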

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

gemini-code-assist

Code Review

This pull request introduces new Pydantic models to enrich the metadata for Ray Train runs, covering dataset details, runtime configuration, and execution configuration. These changes are confined to the schema definition in python/ray/train/v2/_internal/state/schema.py. My review focuses on improving the maintainability of the TrainRun model by suggesting more concise descriptions for the newly added fields. The current descriptions are redundant as they repeat details from the nested models' docstrings, which could become a maintenance issue. Overall, the changes are a good step towards better observability.

matthewdeng · 2025-12-18T02:39:10Z

python/ray/train/v2/_internal/state/schema.py

+class TrainingExecutionConfiguration(BaseModel):
+    """Configuration parameters for executing the training loop,
+    including details about the training configs, scaling configs, and backend settings."""


What goes in here vs. as a direct attribute of the TrainRun?

JasonLi1909 · Dec 30, 2025

Yeah, I agree that the naming here is vague, given the loose relationship between the training loop config, scaling config, and backend config. It was a somewhat forced grouping in an attempt to categorize the remaining fields. To resolve this, I created a new "Run Configuration" schema that captures the new metadata and places these three (train loop, scaling, and backend config) at the top level along with Dataset Details and Runtime Configuration. Note that this matches the nesting of the TorchTrainer args where these fields are defined. We can discuss further whether this is the best approach.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

matthewdeng

This is awesome, thanks for the multiple iterations on this!

matthewdeng · 2026-03-10T04:34:07Z

python/ray/train/v2/_internal/state/export.py

+        # Fallback: string representation
+        return str(value)


This is probably okay for now but be aware that if there is no __str__/__repr__ defined it might end up looking something like <__main__.MyClass object at 0x7f3c2a1b4d30>. We can follow up here in the future if we want to clean it up.
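A minimal sketch of the concern above and one possible follow-up: when a class defines neither `__str__` nor `__repr__`, fall back to the class name instead of the default object repr. `readable_fallback` and `NoRepr` are hypothetical names, and this is not what the PR currently does.

```python
class NoRepr:
    """A class that defines neither __str__ nor __repr__."""


def readable_fallback(value: object) -> str:
    """Use str(value) unless it would be the default object repr."""
    cls = type(value)
    uses_default_str = cls.__str__ is object.__str__
    uses_default_repr = cls.__repr__ is object.__repr__
    if uses_default_str and uses_default_repr:
        # Avoid output such as "<module.NoRepr object at 0x...>".
        return cls.__name__
    return str(value)


print(str(NoRepr()))                # default repr, includes a memory address
print(readable_fallback(NoRepr()))  # just the class name
```

Checking the methods on the type rather than pattern-matching the string keeps the check independent of how the default repr happens to be formatted.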

matthewdeng · 2026-03-10T04:46:52Z

python/ray/train/v2/_internal/state/export.py

+            if depth - 1 <= 0:
+                # Collapse the list/tuple/set to "..."
+                return ["..."]


Why special casing for this? I was looking at the unit test and typically I would expect the items in the list to have the same behavior as the keys of a dict.

    # max_depth=2
    assert json.loads(
        MessageToJson(_dict_to_human_readable_struct(obj, max_depth=2))
    ) == {
        "native": 42,
        "nested": {"inner": "..."},
        "obj": "CustomObj",
        "sequence": ["..."],
    }

JasonLi1909 · Mar 14, 2026

Just to avoid long lists of ellipses: ["..."] looks better than ["...", "...", "..."]. The latter also seems redundant, and in the front-end it will be easier to render ["..."] as, say, [...].

JasonLi1909 · Mar 16, 2026

Updated this behavior to only account for dict depth.
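One way to picture "only account for dict depth" is the sketch below, where only dict nesting counts toward `max_depth` and sequence items are converted at the depth of their parent. `summarize` is a hypothetical helper that returns plain Python data rather than a protobuf Struct, and its exact semantics are an assumption, not the code from export.py.

```python
from typing import Any


class CustomObj:
    def __repr__(self) -> str:
        return "CustomObj"


def summarize(value: Any, max_depth: int, _depth: int = 0) -> Any:
    """Convert to JSON-friendly data, collapsing dicts nested deeper than max_depth."""
    if isinstance(value, dict):
        if _depth >= max_depth:
            return "..."
        return {str(k): summarize(v, max_depth, _depth + 1) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        # Sequences do not consume depth; items are handled at the current level.
        return [summarize(item, max_depth, _depth) for item in value]
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    # Fallback: string representation.
    return str(value)


obj = {
    "native": 42,
    "nested": {"inner": {"deep": 99}},
    "obj": CustomObj(),
    "sequence": [1, CustomObj()],
}
print(summarize(obj, max_depth=2))
```

With `max_depth=2` the innermost dict collapses to "..." while the list keeps its items; with `max_depth=3` the full structure is preserved.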

edoakes

stamp proto changes

cursor · 2026-03-16T21:28:04Z

src/ray/protobuf/export_train_state.proto

+    // Either “max” or “min”. If “max”/”min”, then checkpoints with highest/lowest values
+    // of the checkpoint_score_attribute will be kept.
+    CheckpointScoreOrder checkpoint_score_order = 3;
+  }


PR modifies .proto files — review RPC standards

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

^{Triggered by project rule: Bugbot Rules}

cursor · 2026-03-17T00:41:10Z

src/ray/protobuf/export_train_state.proto

 package ray.rpc;

-// Metadata for a Ray Train run, including its details and status
+import "google/protobuf/struct.proto";


Proto file modified — RPC fault-tolerance review required

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

^{Triggered by project rule: Bugbot Rules}

cursor · 2026-03-17T00:41:10Z

python/ray/train/v2/tests/test_state_export.py

+        "nested": {"inner": {"deep": 99}},
+        "obj": "CustomObj",
+        "sequence": [1, "CustomObj"],
+    }


Test missing inf_float key in expected output

Medium Severity

The test_dict_to_human_readable_struct_max_depth test includes "inf_float": float("inf") in the input dict, but the expected outputs for both max_depth=2 and max_depth=3 are missing the corresponding "inf_float": "inf" entry. The _dict_to_human_readable_struct function correctly converts non-finite floats to their string representation, so the Struct will contain this key. The == assertion will fail because the actual output includes a key the expected dict does not.
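The behavior described in this comment can be sketched as follows: non-finite floats are replaced by their string form, so the key stays in the output and must therefore appear in any expected dict. `sanitize_floats` is a hypothetical helper for illustration, not the function from export.py.

```python
import json
import math
from typing import Any


def sanitize_floats(value: Any) -> Any:
    """Replace inf, -inf, and nan with their string forms; leave the rest alone."""
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # "inf", "-inf", or "nan"
    if isinstance(value, dict):
        return {k: sanitize_floats(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_floats(v) for v in value]
    return value


raw = {"inf_float": float("inf"), "native": 42}
clean = sanitize_floats(raw)
# Strict JSON has no literal for infinity, so the unsanitized dict would raise
# ValueError here; the sanitized one serializes without error.
print(json.dumps(clean, allow_nan=False))
```

Because the key is kept rather than dropped, an equality assertion against a dict without "inf_float" would fail, which is the mismatch the comment points out.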

updated schema

0a5403d

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

JasonLi1909 requested a review from a team as a code owner December 4, 2025 22:54

gemini-code-assist bot reviewed Dec 4, 2025

python/ray/train/v2/_internal/state/schema.py (outdated, resolved)

added new train run metadata to state creation/update logic

8a20ba4

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

JasonLi1909 requested a review from a team as a code owner December 4, 2025 22:55

cursor bot reviewed Dec 4, 2025

python/ray/train/v2/_internal/state/state_manager.py (resolved)

python/ray/data/_internal/execution/interfaces/execution_options.py (outdated, resolved)

ray-gardener bot added the train and observability labels Dec 5, 2025

JasonLi1909 added 2 commits December 15, 2025 12:02

fixed dict construction for train_loop_configs and ExecutionOptions

98abfe0

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

tests

de257e1

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Dec 17, 2025

python/ray/train/v2/_internal/state/state_manager.py (outdated, resolved)

matthewdeng reviewed Dec 18, 2025

JasonLi1909 added 2 commits December 31, 2025 16:34

refactored schemas and to_dict() logic

20e1c7d

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

implemented abstract class for BackendConfig and DefaultBackendConfig for default behavior

16b56b2

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Jan 3, 2026

python/ray/train/backend.py (outdated, resolved)

JasonLi1909 added 2 commits January 4, 2026 02:55

updated BackendConfig schema

f9e8278

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

added BackendConfig framework property, updated tests, and added defaults to schema

a95f14e

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Jan 5, 2026

python/ray/train/v2/_internal/state/state_manager.py (resolved)

JasonLi1909 added 2 commits January 5, 2026 13:34

fix

85c04b1

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

handle non-JSON-serializable train_loop_config

ea1e3cd

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

matthewdeng reviewed Jan 6, 2026

renamed RunConfiguration schema to RunContext, removed default fields in schemas

eeb44a6

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Jan 8, 2026

python/ray/train/v2/_internal/state/state_manager.py (resolved)

JasonLi1909 added 5 commits January 7, 2026 17:30

schema fix

182b88c

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

added proto schema

3bae151

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

updated export api to accommodate new protos

8671052

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

train_loop_config optional

727a252

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

added test for export of new fields

a1a9e93

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

JasonLi1909 requested a review from a team as a code owner January 22, 2026 23:47

cursor bot reviewed Jan 22, 2026

python/ray/data/_internal/execution/interfaces/execution_options.py (outdated, resolved)

JasonLi1909 added 3 commits March 3, 2026 15:18

added tests for construct_data_config

124c71e

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

updated _get_framework_version to get version of relevant modules per framework

876e31b

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

updated _get_framework_version tests and renamed vars in export.py

e26188c

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 4, 2026

python/ray/train/v2/_internal/state/util.py (resolved)

python/ray/train/v2/_internal/state/export.py (outdated, resolved)

changed train_loop_config to Struct, added RunConfig storage_filesystem and name fields, refactored _to_human_readable_json to _to_human_readable_struct and updated tests accordingly

b599aa1

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 5, 2026

python/ray/train/v2/_internal/state/state_manager.py (outdated, resolved)

python/ray/train/v2/_internal/state/export.py (resolved)

JasonLi1909 added 2 commits March 5, 2026 17:04

changed _to_human_readable_struct to _dict_to_human_readable_struct, tests, nits

0b8c896

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

added test

70e8a8e

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 6, 2026

python/ray/train/v2/tests/test_state.py (outdated, resolved)

JasonLi1909 and others added 2 commits March 5, 2026 17:16

Merge branch 'master' into add-new-train-metadata

33df588

Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>

fixed test

2934a26

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 6, 2026

src/ray/protobuf/export_train_state.proto (outdated, resolved)

python/ray/train/v2/_internal/state/state_manager.py (resolved)

added support for elastic training and a unit test

94f6d9f

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 6, 2026

src/ray/protobuf/export_train_state.proto (resolved)

python/ray/train/v2/_internal/state/state_manager.py (outdated, resolved)

matthewdeng approved these changes Mar 10, 2026

edoakes approved these changes Mar 12, 2026

iamjustinhsu approved these changes Mar 12, 2026

goutamvenkat-anyscale reviewed Mar 13, 2026

python/ray/train/v2/tests/test_state_export.py (outdated, resolved)

Merge branch 'master' into add-new-train-metadata

882555c

cursor bot reviewed Mar 16, 2026

python/ray/train/v2/_internal/state/state_manager.py (outdated, resolved)

src/ray/protobuf/export_train_state.proto (outdated, resolved)

JasonLi1909 added 4 commits March 16, 2026 14:20

updated train_loop_config type hints to accommodate integer keys

f45c609

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

remove locality_with_output from execution options dict

3e66a2f

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

remove unnecessary test

8dc1f52

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

fix clang-format errors

466d3c5

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 16, 2026

JasonLi1909 added 2 commits March 16, 2026 15:39

update _dict_to_human_readable_struct to only account for dict depth

2a1610c

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

fix error handling

49de6a3

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 16, 2026

src/ray/protobuf/export_train_state.proto (resolved)

python/ray/train/v2/_internal/state/export.py (resolved)

fix edge case with inf float values

d29ae2e

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>

cursor bot reviewed Mar 17, 2026
