Conversation

@ianbulovic ianbulovic commented May 5, 2025

This is an attempt at refactoring the messier parts of the codebase. Quite a few changes, summary below. All the pre-refactoring code is still available in the cnlpt.legacy package.

Refactored train system

The refactored train system lives in the cnlpt.train_system package. With the new setup, you can initialize the train system by creating a CnlpTrainSystem instance.

To run the new train system, use cnlpt train [ARGS]

Initialization and training

CnlpTrainSystem is created from model arguments, data arguments, and training arguments. Classmethods are also available to initialize a CnlpTrainSystem from argv, a config dictionary, or a json file.
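As a rough illustration of that constructor pattern, here is a minimal, self-contained sketch. The field names and argument-group classes below are hypothetical stand-ins, not the actual cnlpt signatures:

```python
import json
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for the three argument groups;
# the real cnlpt argument classes have many more fields.
@dataclass
class ModelArgs:
    model_name: str


@dataclass
class DataArgs:
    data_dir: str


@dataclass
class TrainingArgs:
    output_dir: str


class CnlpTrainSystem:
    def __init__(self, model_args, data_args, training_args):
        # The real __init__ also configures logging, validates the args,
        # and sets up the tokenizer, dataset, and model.
        self.model_args = model_args
        self.data_args = data_args
        self.training_args = training_args

    @classmethod
    def from_dict(cls, config: dict) -> "CnlpTrainSystem":
        # Build each argument group from its section of the config dict.
        return cls(
            ModelArgs(**config["model"]),
            DataArgs(**config["data"]),
            TrainingArgs(**config["training"]),
        )

    @classmethod
    def from_json(cls, path: str) -> "CnlpTrainSystem":
        # The JSON classmethod just delegates to the dict one.
        with open(path) as f:
            return cls.from_dict(json.load(f))
```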

The __init__ method of CnlpTrainSystem configures logging (more info below) and validates the provided args, then sets up the tokenizer, dataset, and model for training. Training won't actually start until the train() method is called.

Metrics and model saving

The model_selection_score and model_selection_label training arguments have been removed in favor of Trainer's built-in system to save the best model. Use the training argument --metric_for_best_model to choose your selection metric. It will default to average accuracy across all tasks, but other options are available:

| Metric Name | Description |
| --- | --- |
| `loss` | Evaluation loss |
| `avg_acc` (default) | Average accuracy over all tasks |
| `avg_macro_f1` | Average macro-F1 over all tasks |
| `avg_micro_f1` | Average micro-F1 over all tasks |
| `TASKNAME.acc` | Accuracy for task `TASKNAME` |
| `TASKNAME.macro_f1` | Macro-F1 for task `TASKNAME` |
| `TASKNAME.micro_f1` | Micro-F1 for task `TASKNAME` |
| `TASKNAME.LABELNAME.f1` | F1 for label `LABELNAME` of task `TASKNAME` |
| `METRIC_1, METRIC_2, ... METRIC_N` | Average of multiple metrics (any of the above) |
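The comma-separated form in the last row can be read as "average the listed metrics". A small sketch of how such a spec might be resolved against computed metric values; `resolve_selection_metric` is a hypothetical helper, not the actual cnlpt implementation:

```python
def resolve_selection_metric(spec: str, metrics: dict[str, float]) -> float:
    """Resolve a metric spec against a dict of computed metric values.

    A single name returns that metric; a comma-separated list of names
    returns the average of the listed metrics.
    """
    names = [name.strip() for name in spec.split(",")]
    return sum(metrics[name] for name in names) / len(names)
```

For example, `resolve_selection_metric("negation.acc, negation.macro_f1", metrics)` would select on the mean of those two scores (task name hypothetical).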

Reworked predictions and analysis system

This PR introduces a CnlpPredictions dataclass (in the data package) that stores information related to predictions made by the model on test data (with or without labels). These predictions can be generated with the predict() method of CnlpTrainSystem, and the dataclass has methods for JSON serialization. Using the --do_predict flag when training will automatically run predictions on the test set when the training is complete, and save them to a predictions.json file in your output directory.

There is also a new cnlpt.data.analysis module with a function that can convert a CnlpPredictions instance to a polars dataframe for analysis.

Logging and live display

Rather than relying on stdout/stderr to document the training process, all relevant information is now logged in train_system.log in the configured output directory.

By moving everything to the logfile, we can reclaim console real estate for a much more interpretable live training progress display using rich.
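The file-only logging setup can be sketched with the standard library; the handler and formatter details below are assumptions, not taken from the cnlpt source:

```python
import logging


def configure_logging(output_dir: str) -> logging.Logger:
    """Route log records to train_system.log instead of the console."""
    logger = logging.getLogger("train_system")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(f"{output_dir}/train_system.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    # Don't also emit to the root logger's console handlers: the console
    # is reserved for the rich live progress display.
    logger.propagate = False
    return logger
```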

Refactored data processing

Most of the data processing code (cnlp_processors and cnlp_data) has also been refactored. The new code, which is used by the new train system, lives in the cnlpt.data package.

The main goal of the data refactoring was to simplify the code by packaging all the info related to each task into a new dataclass, TaskInfo. Previously, nearly all of our data processing required passing around a bunch of dicts mapping task names to different properties (task type, number of labels, label set, task index). Repackaging that data on a per-task basis simplifies quite a lot.
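A minimal sketch of the per-task bundling idea; the real `TaskInfo` in `cnlpt.data` may use different field names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    name: str
    task_type: str           # e.g. "classification" or "tagging"
    labels: tuple[str, ...]  # the task's label set
    index: int               # position of this task among the model's tasks

    @property
    def num_labels(self) -> int:
        return len(self.labels)


# Instead of parallel dicts keyed by task name
# (task_types[name], task_labels[name], task_index[name], ...),
# each task carries its own metadata:
task = TaskInfo(name="negation", task_type="classification",
                labels=("-1", "1"), index=0)
```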

Other stuff

  • Changed the module structure a bit and moved examples out of src
  • Added a bunch of unit tests
  • Added support for python 3.12
    • We can revert this if it's not something we want right now, but all the tests are passing so I figured might as well 🤷‍♂️
  • A few minor CI updates with some new linting rules
  • Simplified release workflow (previous compatibility issues between uv and setuptools-scm have been resolved in newer uv versions)
  • Reworked the documentation system a bit, and switched to more human-readable Google-style docstrings.

TODO

When reworking the train system, I noticed that the old code only successfully sets the class weights for the CNN model; for the Hierarchical and CNLP models the class weights are taken from dataset.class_weights which (as far as I can tell) is always None. This is most likely a bug in the original train system, but since I didn't write that code I'll wait for review before fixing it in case I'm missing something.

One chunk of code that's still missing in this refactor is the error and disagreement analysis stuff in cnlp_predict.py. I don't have a good sense of how much of that code is still needed now that we can export a dataframe with much of the same information from the new CnlpPredictions dataclass via the make_preds_df function in cnlpt.data.analysis.

Training seems to run the same with the arguments I've tried, but it's possible I accidentally broke something for someone else's use case. I'm opening this PR early as a draft so that people can test it on their own tasks/data and make sure everything still works. As a reminder, run the new train system with `cnlpt train [ARGS]`.

Inline review comment on this snippet from the tokenization code:

```python
    return [
        tokenized_input.word_ids(i) for i in range(len(tokenized_input.input_ids))
    ]
elif character_level:
```
I wrote all the character level code here and elsewhere for some experiments using Google's CANINE model on one of Guergana's projects. If we want to keep the code there's some cleaning up I can do, although I haven't used CANINE in a while personally

@etgld etgld left a comment


Lots of great clarity and efficiency improvements! Really liked the refactoring of some of the functionality from train_system into callbacks; those examples are really helpful.

I'm not sure if/when I'll get the chance to test any of this out, but from what I can tell, all of the functionality I typically use should still work.

@ianbulovic ianbulovic changed the title Train system and QOL refactoring Train system and general refactoring Jun 27, 2025
@ianbulovic ianbulovic marked this pull request as ready for review June 27, 2025 21:03