-
Notifications
You must be signed in to change notification settings - Fork 206
Open
Description
Following evaluation instructions to evaluate the pre-trained LCM_MSE on cnn_dailymail parquet data.
Run command
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes=1 --nproc-per-node=4 -m lcm.evaluation \
--predictor base_lcm --sample_latent_variable False \
--model_card checkpoints/mse_lcm/checkpoints/step_2000/model_card.yaml \
--launcher standalone \
--dataset.parquet_path examples/evaluation/parquet_dataset/cnn_dailymail/0_ae89e535f2a41f33_0_0.parquet \
--tasks lcm_generation \
--task_args '{"max_gen_len": 200}' \
--data_loading.batch_size 4 --generator_batch_size 4 \
--dump_dir /mnt/large_concept_model/output
Error:
'Key _source_column not found in batch.'
Investigation
Adding a print(batch.keys) to data_utilis.py reveals keys the iterate_batches is looking for _source_column key
dict_keys(['split', '__batch_index', '__fragment_index', '__filename', '__row_groups_ids', '__index_in_fragement'])
The cnn-dailymail generated parquet columns using the prepare script are:
Index(['prompt', 'split', 'category', 'answer', 'answer_sentences',
'prompt_sentences', 'answer_sentences_sonar_emb',
'prompt_sentences_sonar_emb'],
dtype='object')
and cnn_dailymail.py also must be modified with the following:
if form != "inverse_":
source_text_column = "prompt"
target_text_column = "answer"
dataset.source_prefix_text = "[INST] Summarize the following article: "
dataset.source_suffix_text = " [/INST]"
else:
source_text_column = "answer"
target_text_column = "prompt"
dataset.source_prefix_text = ("[INST] Write an article from the following summary: ") # fmt: skip
dataset.source_suffix_text = " [/INST]"
Metadata
Metadata
Assignees
Labels
No labels