[Train] Richer Train Run Metadata#59186

Open
JasonLi1909 wants to merge 43 commits intoray-project:masterfrom
JasonLi1909:add-new-train-metadata

@JasonLi1909 JasonLi1909 commented Dec 4, 2025

Overview

This PR adds richer metadata to Ray Train runs, improving run observability and making runs easier to reproduce. The new metadata consists of the following:

  • Datasets
  • DataConfig
  • RunConfig (FailureConfig, CheckpointConfig, name, worker runtime environment, storage_filesystem, storage_path)
  • BackendConfig
  • ScalingConfig
  • Framework versions (ray and framework related module versions)
  • User defined train_loop_config

Implementation

Diagram: New metadata collection and export flow

This PR:

  • Implements logic to collect new metadata from the StateManager callback
  • Updates Train Run pydantic models and protobuf schemas to accommodate the new fields
  • Implements logic used in TrainStateActor's export API for converting pydantic TrainRuns into Protobuf, while also sanitizing fields to be human-readable
  • Refactors BackendConfig to be an abstract base class to allow for a new to_dict abstract method and implements a DefaultBackendConfig class
  • Adds tests for new metadata collection and state export
  • Increments export API TRAIN_SCHEMA_VERSION from 2 to 3
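The BackendConfig refactor listed above can be pictured with a minimal sketch. This is not the PR's code: only the names BackendConfig, DefaultBackendConfig, and to_dict come from the description, while ExampleTorchLikeConfig and its fields are hypothetical and exist purely for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Any, Dict


class BackendConfig(ABC):
    """Stand-in for the abstract base class described in the PR (assumed shape)."""

    @abstractmethod
    def to_dict(self) -> Dict[str, Any]:
        """Return the config fields as a plain dictionary for export."""


@dataclass
class DefaultBackendConfig(BackendConfig):
    """Concrete config with no backend-specific fields (assumption)."""

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


@dataclass
class ExampleTorchLikeConfig(BackendConfig):
    """Hypothetical subclass; the field names are made up for this sketch."""

    backend: str = "nccl"
    timeout_s: int = 1800

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)
```

Making to_dict abstract means a subclass that forgets to implement it fails at instantiation time rather than at export time.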

Pydantic TrainRun to Protobuf ExportTrainRunEventData Conversion

The table below details the type mapping from the pydantic TrainRun model to the protobuf ExportTrainRunEventData schema. The protobuf types were chosen to allow lossless conversion while preserving the original pydantic model's clear type contract.

Pydantic TrainRun to Protobuf ExportTrainRunEventData Type Mappings

| Pydantic TrainRun Field | Protobuf ExportTrainRunEventData Field |
| --- | --- |
| framework_versions: Dict[str, str] | map<string, string> framework_versions |
| run_settings: RunSettings | RunSettings run_settings |
| train_loop_config: Optional[Dict] | optional google.protobuf.Struct train_loop_config = 1; |
| backend_config: BackendConfig | BackendConfig backend_config |
| framework: Optional[TrainingFramework] | (enum) TrainingFramework framework |
| config: Dict[str, Any] | google.protobuf.Struct config |
| scaling_config: ScalingConfig | ScalingConfig scaling_config |
| num_workers: Union[int, Tuple[int, int]] | message IntRange { int32 min = 1; int32 max = 2; } oneof num_workers { int32 num_workers_fixed = 1; IntRange num_workers_range = 2; } (the number of workers for the Train run; can be a range with elastic training enabled) |
| use_gpu: bool | bool use_gpu |
| resources_per_worker: Optional[Dict[str, float]] | message StringFloatMap { map<string, double> values = 1; } StringFloatMap resources_per_worker = 3; |
| placement_strategy: str | string placement_strategy |
| accelerator_type: Optional[str] | optional string accelerator_type |
| use_tpu: bool | bool use_tpu |
| topology: Optional[str] | optional string topology |
| bundle_label_selector: Optional[Union[Dict[str, str], List[Dict[str, str]]]] | message StringMap { map<string, string> values = 1; } repeated StringMap bundle_label_selector; |
| datasets: List[str] | repeated string datasets |
| data_config: DataConfig | DataConfig data_config |
| datasets_to_split: Union[Literal["all"], List[str]] | message All {} message StringList { repeated string values = 1; } oneof datasets_to_split { All all = 1; StringList datasets = 2; } |
| execution_options: Optional[Dict] | google.protobuf.Struct execution_options |
| enable_shard_locality: bool | bool enable_shard_locality |
| run_config: RunConfig | RunConfig run_config |
| name: str | string name |
| failure_config: FailureConfig | FailureConfig failure_config |
| worker_runtime_env: Dict[str, Any] | google.protobuf.Struct worker_runtime_env |
| checkpoint_config: CheckpointConfig | CheckpointConfig checkpoint_config |
| num_to_keep: int | optional uint64 num_to_keep |
| checkpoint_score_attribute: Optional[str] | optional string checkpoint_score_attribute |
| checkpoint_score_order: Literal["max", "min"] | (enum) CheckpointScoreOrder checkpoint_score_order |
| storage_path: str | string storage_path |
| storage_filesystem: Optional[str] | optional string storage_filesystem |
| max_failures: int | int max_failures |
| controller_failure_limit: int | int controller_failure_limit |
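
As a rough illustration of the oneof rows, the num_workers mapping can be sketched with plain dictionaries standing in for the generated protobuf classes. The function names num_workers_to_oneof and oneof_to_num_workers are hypothetical; only the field names num_workers_fixed, num_workers_range, min, and max are taken from the mapping above.

```python
from typing import Any, Dict, Tuple, Union


def num_workers_to_oneof(num_workers: Union[int, Tuple[int, int]]) -> Dict[str, Any]:
    """Populate exactly one branch of the oneof, depending on the Python type."""
    if isinstance(num_workers, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("num_workers must be an int or a (min, max) tuple")
    if isinstance(num_workers, int):
        return {"num_workers_fixed": num_workers}
    if isinstance(num_workers, tuple) and len(num_workers) == 2:
        low, high = num_workers
        return {"num_workers_range": {"min": low, "max": high}}
    raise TypeError("num_workers must be an int or a (min, max) tuple")


def oneof_to_num_workers(message: Dict[str, Any]) -> Union[int, Tuple[int, int]]:
    """Inverse mapping, showing that the conversion loses no information."""
    if "num_workers_fixed" in message:
        return message["num_workers_fixed"]
    rng = message["num_workers_range"]
    return (rng["min"], rng["max"])
```

Because each Python type lands in its own branch, a consumer can tell a fixed worker count from an elastic range without inspecting the values.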

The table below outlines the underlying type usage patterns that guided the mappings above.

Protobuf Type Usage Patterns and Reasoning Table

| Protobuf Type | Description / Usage |
| --- | --- |
| google.protobuf.Struct | Represents dictionaries with values of arbitrary types that are only known at runtime. Keys must always be strings. |
| string | A standard string field. |
| oneof | Used to model Union[...] types in pydantic where the field can be one of several mutually exclusive types that cannot be mapped to a single protobuf type. |
| optional | Mirrors optional pydantic fields (Optional[T] / nullable fields). The proto field can be unset and will not implicitly default. Provides a 1-1 mapping from pydantic to proto so the UI can differentiate values that have and have not been set. |
| repeated | Used for lists. |
| map | Used for dictionaries with enforced key and value types. |
| Scalar types (int, bool, etc.) | Map directly and behave as expected from their pydantic counterparts. |
| Enum | Used for pydantic enums or literals. |

Note: These patterns should be maintained in any future schema changes.

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner December 4, 2025 22:54
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new Pydantic models to enrich the metadata for Ray Train runs, covering dataset details, runtime configuration, and execution configuration. These changes are confined to the schema definition in python/ray/train/v2/_internal/state/schema.py. My review focuses on improving the maintainability of the TrainRun model by suggesting more concise descriptions for the newly added fields. The current descriptions are redundant as they repeat details from the nested models' docstrings, which could become a maintenance issue. Overall, the changes are a good step towards better observability.

@JasonLi1909 JasonLi1909 requested a review from a team as a code owner December 4, 2025 22:55
@ray-gardener ray-gardener bot added the train (Ray Train Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Dec 5, 2025
Comment on lines +243 to +245
    class TrainingExecutionConfiguration(BaseModel):
        """Configuration parameters for executing the training loop,
        including details about the training configs, scaling configs, and backend settings."""
What goes in here vs. as a direct attribute of the TrainRun?

@JasonLi1909 JasonLi1909 Dec 30, 2025

Yeah, I agree that the naming here is vague because the training loop config, scaling config, and backend config are only loosely related. It was a somewhat forced grouping in an attempt to categorize the remaining fields. To resolve this, I created a new "Run Configuration" schema that captures the new metadata and places these three (train loop, scaling, and backend config) at the top level along with Dataset Details and Runtime Configuration. Note that this matches the nesting of the TorchTrainer args where these fields are defined. We can discuss further whether this is the best approach.

…for default behavior

…ults to schema

…s in schemas

@JasonLi1909 JasonLi1909 requested a review from a team as a code owner January 22, 2026 23:47
… framework

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

…em and name fields, refactored _to_human_readable_json to _to_human_readable_struct and updated tests accordingly

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

…ests, nits

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

JasonLi1909 and others added 2 commits March 5, 2026 17:16
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@matthewdeng matthewdeng left a comment

This is awesome, thanks for the multiple iterations on this!

Comment on lines +146 to +147
# Fallback: string representation
return str(value)
This is probably okay for now but be aware that if there is no __str__/__repr__ defined it might end up looking something like <__main__.MyClass object at 0x7f3c2a1b4d30>. We can follow up here in the future if we want to clean it up.
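
To make the concern concrete, the sketch below shows what str() returns for a class with no __str__/__repr__, together with one possible cleanup that falls back to the class name. The helper fallback_repr and both example classes are hypothetical; this is not the PR's implementation, just one way a follow-up could look.

```python
class PlainObj:
    """Defines neither __str__ nor __repr__, so str() uses object.__repr__."""


class NamedObj:
    """Defines __str__, so str() already produces a readable value."""

    def __str__(self) -> str:
        return "NamedObj"


def fallback_repr(value: object) -> str:
    """Use the class name when str() would only give the default object repr."""
    cls = type(value)
    if cls.__str__ is object.__str__ and cls.__repr__ is object.__repr__:
        return cls.__name__
    return str(value)


default_text = str(PlainObj())          # looks like "<...PlainObj object at 0x...>"
cleaned_text = fallback_repr(PlainObj())  # just the class name
```

Types that define their own __repr__ (for example int) still go through str(), so only objects with the memory-address style representation are affected.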

Comment on lines +140 to +142
    if depth - 1 <= 0:
        # Collapse the list/tuple/set to "..."
        return ["..."]
Why special casing for this? I was looking at the unit test and typically I would expect the items in the list to have the same behavior as the keys of a dict.

    # max_depth=2
    assert json.loads(
        MessageToJson(_dict_to_human_readable_struct(obj, max_depth=2))
    ) == {
        "native": 42,
        "nested": {"inner": "..."},
        "obj": "CustomObj",
        "sequence": ["..."],
    }

@JasonLi1909 JasonLi1909 Mar 14, 2026

Just to avoid long lists of ellipses, ["..."] looks better than ["...", "...", "..."].
The latter also just seems a bit redundant. And in the front-end, this will be easier to parse from ["..."] to, say, [...]

Updated this behavior to only account for dict depth
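
A minimal sketch of what "only account for dict depth" could mean, assuming sequences are sanitized item by item without consuming depth. The function to_human_readable and the CustomObj class are hypothetical and return plain Python values rather than a protobuf Struct; the PR's actual _dict_to_human_readable_struct may differ.

```python
from typing import Any


class CustomObj:
    """Example non-native value with a readable string form."""

    def __str__(self) -> str:
        return "CustomObj"


def to_human_readable(value: Any, max_depth: int) -> Any:
    """Sanitize a value; only dict nesting counts toward max_depth."""
    if isinstance(value, dict):
        if max_depth <= 0:
            return "..."  # collapse dicts nested deeper than max_depth
        return {str(k): to_human_readable(v, max_depth - 1) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        # Sequences do not consume depth; each item is sanitized at the same level.
        return [to_human_readable(item, max_depth) for item in value]
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return str(value)  # fallback: string representation


obj = {
    "native": 42,
    "nested": {"inner": {"deep": 99}},
    "obj": CustomObj(),
    "sequence": [1, CustomObj()],
}
shallow = to_human_readable(obj, max_depth=2)
deep = to_human_readable(obj, max_depth=3)
```

Under these assumptions, max_depth=2 collapses only the innermost dict to "..." while the list keeps its items, and max_depth=3 preserves the whole structure.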

@edoakes edoakes left a comment

stamp proto changes

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

// Either "max" or "min". If "max"/"min", then checkpoints with highest/lowest values
// of the checkpoint_score_attribute will be kept.
CheckpointScoreOrder checkpoint_score_order = 3;
}
PR modifies .proto files — review RPC standards

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Triggered by project rule: Bugbot Rules

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

package ray.rpc;

// Metadata for a Ray Train run, including its details and status
import "google/protobuf/struct.proto";
Proto file modified — RPC fault-tolerance review required

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Triggered by project rule: Bugbot Rules

"nested": {"inner": {"deep": 99}},
"obj": "CustomObj",
"sequence": [1, "CustomObj"],
}
Test missing inf_float key in expected output

Medium Severity

The test_dict_to_human_readable_struct_max_depth test includes "inf_float": float("inf") in the input dict, but the expected outputs for both max_depth=2 and max_depth=3 are missing the corresponding "inf_float": "inf" entry. The _dict_to_human_readable_struct function correctly converts non-finite floats to their string representation, so the Struct will contain this key. The == assertion will fail because the actual output includes a key the expected dict does not.
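
The non-finite float behavior described here can be sketched in isolation. The helper sanitize_float is hypothetical and only illustrates why a key such as "inf_float" would still appear in the output with the string value "inf"; strict JSON has no representation for infinity or NaN, so a raw value cannot be encoded as a number.

```python
import json
import math
from typing import Any


def sanitize_float(value: Any) -> Any:
    """Return non-finite floats as strings; pass everything else through."""
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # "inf", "-inf", or "nan"
    return value


payload = {"inf_float": sanitize_float(float("inf")), "lr": sanitize_float(0.1)}
# allow_nan=False enforces strict JSON and would raise ValueError on a raw inf.
encoded = json.dumps(payload, allow_nan=False)
```

The key is kept and only its value changes type, which is why an expected dict that omits the key would not compare equal to the actual output.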

Labels

observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling train Ray Train Related Issue

5 participants