[Train] Richer Train Run Metadata#59186
[Train] Richer Train Run Metadata#59186JasonLi1909 wants to merge 43 commits intoray-project:masterfrom
Conversation
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces new Pydantic models to enrich the metadata for Ray Train runs, covering dataset details, runtime configuration, and execution configuration. These changes are confined to the schema definition in python/ray/train/v2/_internal/state/schema.py. My review focuses on improving the maintainability of the TrainRun model by suggesting more concise descriptions for the newly added fields. The current descriptions are redundant as they repeat details from the nested models' docstrings, which could become a maintenance issue. Overall, the changes are a good step towards better observability.
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
python/ray/data/_internal/execution/interfaces/execution_options.py
Outdated
Show resolved
Hide resolved
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
| class TrainingExecutionConfiguration(BaseModel): | ||
| """Configuration parameters for executing the training loop, | ||
| including details about the training configs, scaling configs, and backend settings.""" |
There was a problem hiding this comment.
What goes in here vs. as a direct attribute of the TrainRun?
There was a problem hiding this comment.
Yeah I agree that the naming here is vague due to the loose relationship between the training loop config, scaling config, and backend config. It was somewhat of a forced grouping in an attempt to categorize the remaining fields. To resolve this, created a new "Run Configuration" schema that captures the new metadata and places these three (train loop, scaling, backend config) at the top level along with Dataset Details and Runtime Configuration. Notice this matches the nesting of the TorchTrainer args where these fields are defined. We can have a further discussion about this if it is the best approach.
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
…for default behavior Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
…ults to schema Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
python/ray/data/_internal/execution/interfaces/execution_options.py
Outdated
Show resolved
Hide resolved
…s in schemas Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
python/ray/data/_internal/execution/interfaces/execution_options.py
Outdated
Show resolved
Hide resolved
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
… framework Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
…em and name fields, refacted _to_human_readable_json to _to_human_readable_struct and updated tests accordingly Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
…ests, nits Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
matthewdeng
left a comment
There was a problem hiding this comment.
This is awesome, thanks for the multiple iterations on this!
| # Fallback: string representation | ||
| return str(value) |
There was a problem hiding this comment.
This is probably okay for now but be aware that if there is no __str__/__repr__ defined it might end up looking something like <__main__.MyClass object at 0x7f3c2a1b4d30>. We can follow up here in the future if we want to clean it up.
| if depth - 1 <= 0: | ||
| # Collapse the list/tuple/set to "..." | ||
| return ["..."] |
There was a problem hiding this comment.
Why special casing for this? I was looking at the unit test and typically I would expect the items in the list to have the same behavior as the keys of a dict.
# max_depth=2
assert json.loads(
MessageToJson(_dict_to_human_readable_struct(obj, max_depth=2))
) == {
"native": 42,
"nested": {"inner": "..."},
"obj": "CustomObj",
"sequence": ["..."],
}There was a problem hiding this comment.
Just to avoid long lists of ellipses, ["..."] looks better than ["...", "...", "..."].
The latter also just seems a bit redundant. And in the front-end, this will be easier to parse from ["..."] to, say, [...]
There was a problem hiding this comment.
Updated this behavior to only account for dict depth
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
| // Either “max” or “min”. If “max”/”min”, then checkpoints with highest/lowest values | ||
| // of the checkpoint_score_attribute will be kept. | ||
| CheckpointScoreOrder checkpoint_score_order = 3; | ||
| } |
There was a problem hiding this comment.
PR modifies .proto files — review RPC standards
Low Severity
.proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
Triggered by project rule: Bugbot Rules
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
| package ray.rpc; | ||
|
|
||
| // Metadata for a Ray Train run, including its details and status | ||
| import "google/protobuf/struct.proto"; |
There was a problem hiding this comment.
Proto file modified — RPC fault-tolerance review required
Low Severity
.proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
Triggered by project rule: Bugbot Rules
| "nested": {"inner": {"deep": 99}}, | ||
| "obj": "CustomObj", | ||
| "sequence": [1, "CustomObj"], | ||
| } |
There was a problem hiding this comment.
Test missing inf_float key in expected output
Medium Severity
The test_dict_to_human_readable_struct_max_depth test includes "inf_float": float("inf") in the input dict, but the expected outputs for both max_depth=2 and max_depth=3 are missing the corresponding "inf_float": "inf" entry. The _dict_to_human_readable_struct function correctly converts non-finite floats to their string representation, so the Struct will contain this key. The == assertion will fail because the actual output includes a key the expected dict does not.


Overview
This PR adds richer metadata to Ray Train runs, enabling improved train run observability and ease of reproducibility between train runs. The new metadata consists of the following:
Implementation
Diagram: New metadata collection and export flow
This PR:
TrainStateActor's export API for converting pydantic TrainRuns into Protobuf, while also sanitizing fields to be human-readableto_dictabstract method and implements a DefaultBackendConfig classTRAIN_SCHEMA_VERSIONfrom 2 to 3Pydantic
TrainRunto ProtobufExportTrainRunEventDataConversionThe below table details the type mapping from the pydantic TrainRun model to the protobuf
ExportTrainRunEventDataschema. The protobuf types were chosen to allow for non-lossy conversion while maintaining the original pydantic model's clear type contract.Pydantic TrainRun to Protobuf ExportTrainRunEventData Type Mappings
framework_versions: Dict[str, str]map<string, string> framework_versionsRun_settings: RunSettingsRunSettings run_settingstrain_loop_config: Optional[Dict]optional google.protobuf.Struct train_loop_config = 1;backend_config: BackendConfigBackendConfig backend_configframework: Optional[TrainingFramework](enum) TrainingFramework frameworkconfig: Dict[str, Any]google.protobuf.Struct configscaling_config: ScalingConfigScalingConfig scaling_confignum_workers: Union[int, Tuple[int, int]]message IntRange {int32 min = 1;int32 max = 2;}// The number of workers for the Train run, can be a range with elastic training enabledoneof num_workers {int32 num_workers_fixed = 1;IntRange num_workers_range = 2;}use_gpu: boolbool use_gpuresources_per_worker: Optional[Dict[str, float]]message StringFloatMap {map<string, double> values = 1;}StringFloatMap resources_per_worker = 3;placement_strategy: strstring placement_strategyaccelerator_type: Optional[str]optional string accelerator_typeuse_tpu: boolbool use_tputopology: Optional[str]optional string topologybundle_label_selector: Optional[Union[Dict[str, str], List[Dict[str, str]]]]message StringMap {map<string, string> values = 1;}repeated StringMap bundle_label_selector;datasets: List[str]repeated string datasetsdata_config: DataConfigDataConfig data_configdatasets_to_split: Union[Literal["all"], List[str]]message All {}message StringList {repeated string values = 1;}oneof datasets_to_split {All all = 1;StringList datasets = 2;}execution_options: Optional[Dict]google.protobuf.Struct execution_optionsenable_shard_locality: boolbool enable_shard_localityrun_config: RunConfigRunConfig run_configname: strstring namefailure_config: FailureConfigFailureConfig failure_configworker_runtime_env: Dict[str, Any]google.protobuf.Struct worker_runtime_envcheckpoint_config: CheckpointConfigCheckpointConfig checkpoint_confignum_to_keep: intoptional uint64 num_to_keepcheckpoint_score_attribute: Optional[str]optional string checkpoint_score_attributecheckpoint_score_order: Literal["max", "min"](enum) CheckpointScoreOrder checkpoint_score_orderstorage_path: strstring storage_pathstorage_filesystem: Optional[str]optional string storage_filesystemmax_failures: intint max_failurescontroller_failure_limit: intint controller_failure_limitThe below table outlines the underlying type usage patterns that guided the mappings above.
Protobuf Type Usage Patterns and Reasoning Table
google.protobuf.StructstringoneofUnion[...]types in pydantic where the field can be one of several mutually exclusive types that cannot be mapped to a single protobuf type.optionalOptional[T]/ nullable fields). The proto field can be unset and will not implicitly default. Provides a 1-1 mapping from pydantic to proto so the UI can differentiate values that have and have not been set.repeatedmapEnumNote: These patterns should be maintained in any future schema changes