Wrap evaluation benchmark using HF-trainer #61

Open
sbmaruf opened this issue Aug 25, 2021 · 2 comments
sbmaruf commented Aug 25, 2021

This might sound like a bit of restructuring, but for the sake of future compatibility I propose the following:

  1. Move to the Hugging Face Trainer: this will let the repo automatically pick up DeepSpeed support and the other features of the transformers library.
  2. We don't have to re-invent the wheel. Given that we are using the Hugging Face Trainer, we only need to implement the following pieces per task (see the sketch after this list):
    -- data_loader
    -- DataCollator
    -- compute_metrics
    -- predictions (if needed)
  3. If we later want to fine-tune our full model, we won't have to change much at the surface level.
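An illustrative sketch of point 2, assuming a hypothetical `evaluate_task` helper (not this repo's actual API): each task only contributes the dataset, the collator, and the metric, while `Trainer` handles devices, DeepSpeed, and logging.

```python
# Sketch only: `evaluate_task` and its arguments are hypothetical, not this repo's API.
from transformers import Trainer, TrainingArguments


def evaluate_task(model, tokenizer, eval_dataset, data_collator, compute_metrics):
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="outputs", per_device_eval_batch_size=8),
        eval_dataset=eval_dataset,        # from the task's data_loader
        data_collator=data_collator,      # task-specific DataCollator
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,  # task-specific scorer
    )
    return trainer.evaluate()
```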

I would love to take on some of this work if needed. Let me know. @jaketae @tianjianjiang @wilsonyhlee


jaketae commented Aug 25, 2021

Hey @sbmaruf, thanks for the input! We haven't thought much about parallelizing large models yet (focusing on the baselines for now), but I totally agree this is something we should have in mind moving forward.

Do you imagine the process would require a lot of modification to the codebase? Just wondering what a proof of concept implementation of this would look like.


sbmaruf commented Aug 26, 2021

For a proof-of-concept implementation, these are some of the places that might be affected.

(I'm pasting example code from one of my larger codebases, so there may be some redundant bits.)

  1. Initialize a Trainer object here. Basic initialization:
```python
def training_utils(data_args, model_args, training_args, train_dataset, validation_dataset, data_collator, logger):
    logger.info("Loading model.")
    # load_config_tokenizer_model, log_parameter_stat, get_trainer and get_compute_metric
    # are project-specific helpers from my codebase.
    config, tokenizer, model = load_config_tokenizer_model(model_args, training_args)
    model.config.max_length = data_args.val_max_target_length

    # Log parameters
    log_parameter_stat(model, logger, verbose=model_args.model_verbose)

    Trainer = get_trainer(training_args.trainer_class_name)
    compute_metrics_wrapper = get_compute_metric(training_args.compute_metrics)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics_wrapper(
            tokenizer,
            data_args.class_seperator,
            data_args.token_seperator,
        ),
    )
    return config, tokenizer, model, trainer
```
  2. We need to write a data collator for the Trainer object. We can follow/use this one; a minimal sketch follows.
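A minimal collator sketch for illustration; the feature names ("input_ids", "labels") are assumptions, and in practice the built-in transformers DataCollatorForSeq2Seq would likely cover this.

```python
# Sketch only: assumes each feature dict carries "input_ids" and a list of label ids.
from dataclasses import dataclass

import torch
from transformers import PreTrainedTokenizerBase


@dataclass
class SimpleSeq2SeqCollator:
    tokenizer: PreTrainedTokenizerBase
    label_pad_token_id: int = -100  # ignored index for the LM loss

    def __call__(self, features):
        # Separate the labels so tokenizer.pad() only sees the encoder inputs.
        labels = [feature.pop("labels") for feature in features]
        batch = self.tokenizer.pad(features, padding=True, return_tensors="pt")
        # Pad labels to the longest sequence in the batch with the ignore index.
        max_len = max(len(label) for label in labels)
        batch["labels"] = torch.tensor(
            [label + [self.label_pad_token_id] * (max_len - len(label)) for label in labels]
        )
        return batch
```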

  3. For every task, a scorer function needs to be declared: compute_metrics. We just pass this function to the Trainer object for evaluation. Usually we would expect the contributor taking responsibility for a task to implement this, but we can also pre-implement some of the scorer functions for them. An example for a seq2seq model:

```python
import numpy as np


def seq2seq_EM(tokenizer, class_seperator=None, token_seperator=None):
    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        decoded_preds = [tokenizer.decode(pred, skip_special_tokens=True, clean_up_tokenization_spaces=True) for pred in preds]
        for idx, decoded_pred in enumerate(decoded_preds):
            # Keep only the text after the sentinel token, if one is present.
            decoded_pred = decoded_pred.split("<extra_id_0>")
            assert len(decoded_pred) <= 2
            decoded_pred = decoded_pred[0] if len(decoded_pred) == 1 else decoded_pred[1]
            decoded_preds[idx] = decoded_pred
        # Replace the -100 padding used for the loss with the pad token before decoding.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = [tokenizer.decode(label, skip_special_tokens=True, clean_up_tokenization_spaces=True) for label in labels]
        assert len(decoded_preds) == len(decoded_labels)
        cnt = sum(1 for pred, label in zip(decoded_preds, decoded_labels) if pred == label)
        return {"accuracy": cnt / float(len(preds))}
    return compute_metrics
```
  4. Save the trainer object in the AutoTask class here (rough sketch below).
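A rough sketch of what holding the trainer on the task class could look like; the constructor and interface of AutoTask in this repo will differ, so treat everything here as a placeholder.

```python
# Sketch only: `AutoTask` here is a stand-in for the repo's task abstraction,
# and the constructor arguments are assumptions, not its actual interface.
from transformers import Trainer


class AutoTask:
    def __init__(self, model, tokenizer, training_args, eval_dataset,
                 data_collator, compute_metrics):
        # Keep a ready-to-use Trainer on the task so evaluation is one call away.
        self.trainer = Trainer(
            model=model,
            args=training_args,
            eval_dataset=eval_dataset,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics,
        )

    def evaluate(self):
        return self.trainer.evaluate()
```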

  5. Perform evaluation as shown below. Possible place in this repo.

```python
import json
import os

import datasets


def evaluate_model(trainer, data_args, training_args, data_collator, logger):
    for eval_dir in data_args.eval_dirs:
        eval_dataset = datasets.load_from_disk(eval_dir)
        eval_dataset = data_collator.set_format(eval_dataset)
        eval_metrics = trainer.evaluate(
            eval_dataset=eval_dataset,
            max_length=data_args.val_max_target_length,
            num_beams=data_args.num_beams,
            metric_key_prefix="eval",
        )
        if trainer.is_world_process_zero():
            eval_metrics["data_path"] = eval_dir
            metric_str = json.dumps(eval_metrics, indent=4)
            logger.info("{}".format(metric_str))
            score_file = os.path.join(training_args.output_dir, "{}.eval_score".format(os.path.basename(eval_dir)))
            with open(score_file, "w") as f:
                f.write(metric_str)
```
  6. Prediction generation. An example for a seq2seq model is below; the decoder-only auto-regressive case should be even easier.
```python
# (uses the same json, os and datasets imports as evaluate_model above)
def predict_dataset(trainer, data_args, training_args, data_collator, tokenizer, logger):
    for eval_dir in data_args.eval_dirs:
        eval_dataset = datasets.load_from_disk(eval_dir)
        eval_dataset = data_collator.set_format(eval_dataset)
        predict_results = trainer.predict(
            eval_dataset,
            metric_key_prefix="predict",
            max_length=data_args.val_max_target_length,
            num_beams=data_args.num_beams,
        )
        metrics = predict_results.metrics
        metrics["predict_samples"] = len(eval_dataset)
        metrics["data_path"] = eval_dir
        trainer.log_metrics("predict", metrics)
        trainer.save_metrics("predict", metrics)

        if trainer.is_world_process_zero():
            if training_args.predict_with_generate:
                predictions = tokenizer.batch_decode(
                    predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
                )
                predictions = [pred.strip() for pred in predictions]
                source_text = [tokenizer.decode(sample['source_ids'], skip_special_tokens=True, clean_up_tokenization_spaces=True) for sample in eval_dataset]
                target_text = [tokenizer.decode(sample['target_ids'], skip_special_tokens=True, clean_up_tokenization_spaces=True) for sample in eval_dataset]

                assert len(predictions) == len(source_text)
                assert len(target_text) == len(source_text)

                pred_outputs = [
                    {"input": s_t, "y_true": y_true, "y_pred": y_pred}
                    for s_t, y_true, y_pred in zip(source_text, target_text, predictions)
                ]
                output_prediction_file = os.path.join(
                    training_args.output_dir,
                    "{}.pred.json".format(os.path.basename(eval_dir)),
                )
                with open(output_prediction_file, "w") as f:
                    json.dump(pred_outputs, f, indent=4)
```

@jaketae

I can work on or review a related pull request if you want.
