Description
Is your feature request related to a problem? Please describe.
The current logging system for training metrics has two critical shortcomings:
- Lack of granular logging: It is currently impossible to log loss values for individual channels within each data stream, which limits visibility into model performance at the per-channel level and hinders detailed analysis.
- Fragile, hard-to-parse format: Metrics are stored in unstructured plain-text (txt) files that are difficult to read and lack a consistent schema. Any change to the order or content of the logged metrics breaks backward compatibility, making post-processing, visualization, and comparison across runs error-prone and unmaintainable.
Describe the solution you'd like
Replace the current plain-text logging format with CSV files using multi-level column headers to organize metrics hierarchically.
This structure would:
- Separate global training statistics (e.g., epoch, total loss) from per-stream and per-channel metrics.
- Ensure human readability while maintaining machine readability.
- Prevent compatibility breaks through explicit column naming and hierarchical organization.
Example:
```
global ,global ,global ,global ,global ,global ,global ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM ,FESOM
step ,time ,samples ,perf_gpu ,perf_mem ,learning_rate ,loss_mean ,mse ,a_ice ,evap ,fh ,fw ,prec ,snow ,ssh ,sss ,sst ,swr ,tx_sur ,ty_sur ,std
1 ,2025-02-27 14:38:31.542443 ,320 ,98.5 ,48.75 ,2.9802549459833396e-06 ,0.9765826463699341 ,0.9765826463699341 ,0.9043928980827332 ,0.9301251173019409 ,0.972670316696167 ,0.936567485332489 ,0.9416562914848328 ,1.0318517684936523 ,1.0079926252365112 ,1.0319256782531738 ,0.941299319267273 ,0.9138585329055786 ,1.0519397258758545 ,1.054713249206543 ,
```
This solution was implemented in my fork: kacpnowak#3
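For illustration, here is a minimal sketch (not the fork's actual code) of how such a file could be written and read with pandas, using a MultiIndex for the two header rows; the file name and the reduced column set are assumptions for brevity:

```python
# Sketch: multi-level CSV logging with pandas (illustrative, not the fork's code).
import pandas as pd

# Two header rows: the top level groups columns by scope ("global") or data
# stream ("FESOM"), the bottom level names the metric or channel.
columns = pd.MultiIndex.from_tuples(
    [("global", "step"), ("global", "loss_mean"),
     ("FESOM", "mse"), ("FESOM", "sst")]
)

df = pd.DataFrame(
    [[1, 0.9765826463699341, 0.9765826463699341, 0.941299319267273]],
    columns=columns,
)

# Writing produces the two header rows shown in the example above.
df.to_csv("metrics.csv", index=False)

# Reading back: header=[0, 1] restores the hierarchy, so metrics can be
# selected by name instead of by column position.
logged = pd.read_csv("metrics.csv", header=[0, 1])
fesom_metrics = logged["FESOM"]       # all per-channel columns for one stream
sst_loss = logged[("FESOM", "sst")]   # a single channel's loss
```

Because lookups go through named columns rather than positions, appending new channels or streams does not break existing post-processing scripts.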
Describe alternatives you've considered
An alternative solution is to use an embedded database (e.g., DuckDB or SQLite) to store metrics in structured tables. Benefits include:
- Support for complex queries across multiple training runs.
- Native schema enforcement, eliminating fragility caused by format changes.
- Efficient storage and retrieval of large-scale experiments.
However, CSV files provide a simpler, more accessible intermediate solution that meets immediate needs without introducing database dependencies. Either approach would be preferable to the current system, which is unmaintainable and error-prone due to its reliance on unstructured text.
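For comparison, a rough sketch of the database alternative, assuming DuckDB's Python package; the table layout and names below are illustrative, not a proposed schema:

```python
# Sketch: logging metrics to an embedded DuckDB database (illustrative only).
import duckdb

con = duckdb.connect("metrics.duckdb")

# Explicit schema: one row per (run, step, stream, channel) keeps the layout
# stable even when metrics are added or reordered between runs.
con.execute("""
    CREATE TABLE IF NOT EXISTS channel_loss (
        run_id  VARCHAR,
        step    INTEGER,
        stream  VARCHAR,
        channel VARCHAR,
        loss    DOUBLE
    )
""")

con.execute(
    "INSERT INTO channel_loss VALUES (?, ?, ?, ?, ?)",
    ["run_001", 1, "FESOM", "sst", 0.941299319267273],
)

# Cross-run comparisons become a single SQL query.
rows = con.execute("""
    SELECT run_id, step, loss
    FROM channel_loss
    WHERE stream = 'FESOM' AND channel = 'sst'
    ORDER BY run_id, step
""").fetchall()
```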
Additional context
The current logging implementation’s lack of structure and flexibility actively impedes debugging, analysis, and iterative improvements. A structured format (CSV or database) is critical for scaling experimentation and ensuring reproducibility.
Organisation
AWI