More comprehensive integration tests in training #484

@VeraChristina

Description

Is your feature request related to a problem? Please describe.

We have added infrastructure for integration tests in training, with tests covering most use cases plus a couple of additional ones (restart, restart from existing checkpoint, use existing graph, etc.). However, some aspects of training are not tested, or are tested for only a couple of use cases, and some problems are missed because the datasets used for these tests have fewer parameters.

Describe the solution you'd like

A couple of things we could think about adding:

  • more comprehensive tests for checkpoints / checkpoint migrations -- currently only testing gnn global
  • tests for rollout -- currently not tested
  • add multi-gpu tests to test sharding for different models (probably partially covered in benchmark tests)
  • review datasets used for testing, can we keep them small and still catch more of the potential problems?
  • tests for forking runs (potentially better placed in system-level tests)
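As a rough illustration of the first bullet, the sketch below runs one checkpoint-migration check over several model variants instead of only one. Everything here is an assumption made for illustration: the checkpoint layout, the `migrate_checkpoint` helper, the version numbers, and the variant names other than "gnn global" are hypothetical and do not reflect the project's actual checkpoint format or migration API.

```python
import copy

# Only "gnn global" is named in this issue; the other labels are placeholders.
MODEL_VARIANTS = ["gnn global", "variant-b", "variant-c"]

CURRENT_VERSION = 2  # hypothetical checkpoint schema version


def make_old_checkpoint(variant: str) -> dict:
    """Build a toy version-1 checkpoint for the given model variant."""
    return {
        "version": 1,
        "model": variant,
        "weights": [0.1, 0.2],
        "hparams": {"lr": 1e-3},
    }


def migrate_checkpoint(ckpt: dict) -> dict:
    """Toy migration: v1 -> v2 renames 'hparams' to 'hyper_parameters'."""
    migrated = copy.deepcopy(ckpt)  # never mutate the caller's checkpoint
    if migrated["version"] == 1:
        migrated["hyper_parameters"] = migrated.pop("hparams")
        migrated["version"] = 2
    return migrated


def check_migration(variant: str) -> None:
    """Assert that migrating an old checkpoint keeps weights and updates the schema."""
    old = make_old_checkpoint(variant)
    new = migrate_checkpoint(old)
    assert new["version"] == CURRENT_VERSION
    assert new["model"] == variant
    assert new["weights"] == old["weights"]
    assert "hparams" not in new
    assert "hparams" in old  # the input was left untouched


# Run the same check for every variant, not just one.
for variant in MODEL_VARIANTS:
    check_migration(variant)
```

In a real test suite the loop would more likely be a parametrized test, so that each variant is reported as its own test case.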

Depending on how comprehensively we want to test these, it might be enough to add a few tests, or it might be better to revisit the existing structure of fixtures to make them more reusable.
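One possible shape for more reusable fixtures is sketched below: a single base test case from which variants (rollout, multi-GPU, ...) are derived by overriding only the fields that differ. This is a minimal sketch, not the existing fixture structure; `TrainingCase`, its fields, and `derive_case` are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingCase:
    """Hypothetical description of one integration-test scenario."""

    name: str
    model: str
    max_epochs: int = 1
    rollout_steps: int = 1
    num_gpus: int = 1


# A single small base case; variants change only what they need to.
BASE_CASE = TrainingCase(name="base", model="gnn global")


def derive_case(name: str, **changes) -> TrainingCase:
    """Return a copy of BASE_CASE with a new name and the given field overrides."""
    return replace(BASE_CASE, name=name, **changes)


CASES = [
    BASE_CASE,
    derive_case("rollout", rollout_steps=3),
    derive_case("multi_gpu", num_gpus=2),
]


def case_ids(cases) -> list:
    """Readable identifiers, e.g. for use as parametrized test ids."""
    return [case.name for case in cases]
```

With pytest, such a list could be fed to `@pytest.mark.parametrize("case", CASES, ids=case_ids(CASES))`, so that adding a scenario means adding one line rather than a new fixture.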

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Type

    No type

    Projects

    Status

    To be triaged

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
