
Should we have a "train_native.py"? #1947

@6DammK9

Description

This issue mainly concerns code structure.

Currently I'm porting two features to sdxl_train.py: resuming from an assigned epoch / iteration, and a bundled validation loss, to enable large-scale "native full finetune" of my SDXL base model. I'm working on the sd3 "WIP" branch.
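
For context, here is a minimal sketch of the two features, assuming a generic training loop; `initial_epoch`, the loss-returning model, and the validation dataloader are illustrative names, not the sd-scripts API:

```python
# A minimal sketch, not the sd-scripts implementation. The two features being
# ported are (1) resuming from an assigned epoch and (2) a validation loss
# computed every epoch with the same loss function as training.
import torch
from accelerate import Accelerator


def train(model, optimizer, train_dl, val_dl, max_epochs, initial_epoch=0):
    accelerator = Accelerator()
    model, optimizer, train_dl, val_dl = accelerator.prepare(
        model, optimizer, train_dl, val_dl
    )
    for epoch in range(initial_epoch, max_epochs):  # resume from assigned epoch
        model.train()
        for batch in train_dl:
            loss = model(batch)  # assumed: the model returns its loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

        # bundled validation loss: same objective, held-out data, every epoch
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(b).item() for b in val_dl) / len(val_dl)
        accelerator.print(f"epoch {epoch}: val_loss={val_loss:.4f}")
```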

I have found that many basic / fundamental / general features are implemented in class NetworkTrainer, which is not accessible from *_train.py.

Meanwhile, regarding the ARB / latent-cache-related configuration (and the implementation itself): I made my own scalable version of prepare_buckets_latents.py and built a huge latent dataset, only to realize that my configuration was nearly invalidated by the inconsistent magic numbers passed to verify_bucket_reso_steps.
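
To illustrate the failure mode (a hedged reconstruction, not the actual verify_bucket_reso_steps code): if two scripts verify the same cached dataset against different divisors, latents bucketed under one rule are rejected by the other.

```python
# Illustration only; not the actual sd-scripts verify_bucket_reso_steps.
# `divisor` stands for the arch-specific magic number that differs between
# scripts (e.g. 32 in one, 64 in another).
def verify_bucket_reso_steps(reso_steps: int, divisor: int) -> None:
    assert reso_steps % divisor == 0, (
        f"bucket_reso_steps={reso_steps} is not divisible by {divisor}; "
        "latents cached with this bucketing are unusable here"
    )


verify_bucket_reso_steps(32, 32)  # one script accepts the cached dataset

try:
    verify_bucket_reso_steps(32, 64)  # another script rejects the same cache
except AssertionError as e:
    print(e)
```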
Moreover, super() calls amplify this issue when downstream applications / extensions are involved, such as LyCORIS's "full bypass", which can hide the stack trace and the actual code dependency.
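
A toy example of that concern (not LyCORIS code; names are made up): a wholesale method replacement silently skips whatever the super() chain did, and nothing in the base class's source reveals it.

```python
# Toy illustration of the super() concern; NetworkTrainer here is a stand-in,
# not the real class.
class NetworkTrainer:
    def process_batch(self, batch):
        return f"base({batch})"


class SdxlNetworkTrainer(NetworkTrainer):
    def process_batch(self, batch):
        # relies on the base behavior via super()
        return f"sdxl({super().process_batch(batch)})"


def full_bypass(self, batch):
    # an extension replaces the method entirely; the super() chain above,
    # and anything it validated, is silently skipped
    return f"bypass({batch})"


trainer = SdxlNetworkTrainer()
print(trainer.process_batch("x"))              # sdxl(base(x))
SdxlNetworkTrainer.process_batch = full_bypass
print(trainer.process_batch("x"))              # bypass(x): base logic gone
```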

Examining the newer code structure shared across the train_*.py scripts, maybe we should have a train_native.py to unify the implementation differences spread across the arch-specific *_train.py scripts.
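
One possible shape for such a file, sketched after the NetworkTrainer pattern; all names below are hypothetical, not a proposal for the final API:

```python
# Hypothetical train_native.py skeleton: shared logic lives in one base class,
# and each arch-specific *_train.py only fills in the hooks.
class NativeTrainer:
    """Shared: resume from epoch/step, validation loss, ARB / latent cache."""

    def assert_extra_args(self, args):
        raise NotImplementedError  # arch-specific magic numbers live here

    def load_target_model(self, args):
        raise NotImplementedError

    def train(self, args):
        self.assert_extra_args(args)  # one place to verify bucket reso steps
        model = self.load_target_model(args)
        ...  # shared loop: resume, train, bundled validation loss


class SdxlNativeTrainer(NativeTrainer):
    def assert_extra_args(self, args):
        pass  # e.g. verify bucket reso steps with the SDXL divisor

    def load_target_model(self, args):
        pass  # load the SDXL UNet / text encoders
```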

Any action that can mitigate this risk would be greatly appreciated.

PS: accelerator.skip_first_batches support in sdxl_train.py is coming "soon".
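
For reference, a hedged sketch of how that could look; `steps_already_done` is an illustrative value restored from checkpoint metadata, not an existing argument:

```python
# Resuming mid-epoch with accelerate's skip_first_batches; sketch only.
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = accelerator.prepare(DataLoader(list(range(1000)), batch_size=4))

steps_already_done = 123  # illustrative: restored from checkpoint metadata
resumed = accelerator.skip_first_batches(dataloader, steps_already_done)

for batch in resumed:  # use only for the first epoch after resuming
    pass
```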
