
Refactor TrainLogger #29

Open
@kacpnowak

Description


Is your feature request related to a problem? Please describe.

The current logging system for training metrics has two critical shortcomings:

  • Lack of granular logging: It is currently impossible to log loss function values for individual channels within each data stream. This limits visibility into model performance at a per-channel level, hindering detailed analysis.

  • Fragile and unreadable format: Metrics are stored in unstructured plain-text (txt) files that follow no consistent schema, making them hard to read and parse. Any change to the order or content of the logged metrics breaks backward compatibility, so post-processing, visualization, and comparison across runs are error-prone and unmaintainable.

Describe the solution you'd like

Replace the current plain-text logging format with CSV files using multi-level column headers to organize metrics hierarchically.

This structure would:

  • Separate global training statistics (e.g., epoch, total loss) from per-stream and per-channel metrics.

  • Ensure human readability while maintaining machine-readability.

  • Prevent compatibility breaks through explicit column naming and hierarchical organization.

Example:

```
global ,global                     ,global  ,global   ,global   ,global                 ,global              ,FESOM               ,FESOM                ,FESOM               ,FESOM               ,FESOM               ,FESOM              ,FESOM              ,FESOM                 ,FESOM                 ,FESOM                 ,FESOM               ,FESOM               ,FESOM              ,FESOM
step   ,time                       ,samples ,perf_gpu ,perf_mem ,learning_rate          ,loss_mean           ,mse                 ,a_ice                ,evap                ,fh                  ,fw                  ,prec               ,snow               ,ssh                   ,sss                   ,sst                   ,swr                 ,tx_sur              ,ty_sur             ,std
1      ,2025-02-27 14:38:31.542443 ,320     ,98.5     ,48.75    ,2.9802549459833396e-06 ,0.9765826463699341  ,0.9765826463699341  ,0.9043928980827332   ,0.9301251173019409  ,0.972670316696167   ,0.936567485332489   ,0.9416562914848328 ,1.0318517684936523 ,1.0079926252365112    ,1.0319256782531738    ,0.941299319267273     ,0.9138585329055786  ,1.0519397258758545  ,1.054713249206543  ,
```

This solution was implemented in my fork: kacpnowak#3
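As a rough sketch of the proposed layout (names are illustrative, not taken from the fork's implementation), the two-row header can be produced with Python's standard `csv` module: row 1 groups columns by stream (`global`, `FESOM`, ...) and row 2 names the metric.

```python
import csv
import io

# Hypothetical helper sketching the proposed two-row header layout.
# `columns` is a list of (group, metric) pairs; `rows` holds the values.
def write_metrics(fh, columns, rows):
    writer = csv.writer(fh)
    writer.writerow([group for group, _ in columns])  # row 1: stream/group
    writer.writerow([name for _, name in columns])    # row 2: metric name
    writer.writerows(rows)                            # data rows

columns = [("global", "step"), ("global", "loss_mean"),
           ("FESOM", "sst"), ("FESOM", "sss")]
buf = io.StringIO()
write_metrics(buf, columns, [[1, 0.9766, 0.9413, 1.0319]])
print(buf.getvalue())
```

Such a file can be read back into a hierarchical structure with `pandas.read_csv(path, header=[0, 1])`, which yields a `MultiIndex` over (stream, metric) columns, so per-channel series stay addressable even as columns are added.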

Describe alternatives you've considered

An alternative solution is to use an embedded database (e.g., DuckDB or SQLite) to store metrics in structured tables. Benefits include:

  • Support for complex queries across multiple training runs.
  • Native schema enforcement, eliminating fragility caused by format changes.
  • Efficient storage and retrieval of large-scale experiments.

However, CSV files provide a simpler, more accessible intermediate solution that meets immediate needs without introducing database dependencies. Either approach would be preferable to the current system, which is unmaintainable and error-prone due to its reliance on unstructured text.
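For comparison, a minimal sketch of the embedded-database alternative using Python's built-in `sqlite3`; the table and column names here are assumptions for illustration, not a concrete schema proposal:

```python
import sqlite3

# Illustrative long-format schema: one row per (run, step, stream, channel).
# All identifiers below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        run_id  TEXT    NOT NULL,
        step    INTEGER NOT NULL,
        stream  TEXT    NOT NULL,  -- e.g. 'global' or 'FESOM'
        channel TEXT    NOT NULL,  -- e.g. 'loss_mean' or 'sst'
        value   REAL    NOT NULL
    )
""")
conn.execute("INSERT INTO metrics VALUES ('run-a', 1, 'FESOM', 'sst', 0.9413)")

# Schema enforcement plus SQL makes cross-run comparison a single query:
rows = conn.execute(
    "SELECT step, value FROM metrics WHERE stream='FESOM' AND channel='sst'"
).fetchall()
print(rows)  # [(1, 0.9413)]
```

The long format sidesteps the column-ordering fragility entirely, since new channels become new rows rather than new columns.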

Additional context

The current logging implementation’s lack of structure and flexibility actively impedes debugging, analysis, and iterative improvements. A structured format (CSV or database) is critical for scaling experimentation and ensuring reproducibility.

Organisation

AWI


Metadata


    Labels

    enhancement (New feature or request)

    Projects

    • Status

      Concept phase