LCM_MSE eval fails with cnn_dailymail prepared parquet due to missing keys

I am following the evaluation instructions to evaluate the pre-trained LCM_MSE on the cnn_dailymail parquet data.

Run command

```
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes=1 --nproc-per-node=4 -m lcm.evaluation  \
  --predictor base_lcm --sample_latent_variable False \
  --model_card checkpoints/mse_lcm/checkpoints/step_2000/model_card.yaml \
  --launcher standalone \
  --dataset.parquet_path examples/evaluation/parquet_dataset/cnn_dailymail/0_ae89e535f2a41f33_0_0.parquet \
  --tasks lcm_generation \
  --task_args '{"max_gen_len": 200}' \
  --data_loading.batch_size 4  --generator_batch_size 4 \
  --dump_dir /mnt/large_concept_model/output
```

Error:

```
'Key _source_column not found in batch.'
```

Investigation

Adding `print(batch.keys())` to data_utilis.py shows the keys that are actually present in the batch. `iterate_batches` is looking for a `_source_column` key, which is not among them:

```
dict_keys(['split', '__batch_index', '__fragment_index', '__filename', '__row_groups_ids', '__index_in_fragement'])
```

The columns of the cnn_dailymail parquet file generated by the prepare script are:

```
Index(['prompt', 'split', 'category', 'answer', 'answer_sentences',
       'prompt_sentences', 'answer_sentences_sonar_emb',
       'prompt_sentences_sonar_emb'],
      dtype='object')
```

In addition, cnn_dailymail.py must be modified as follows so that the source and target columns are set:

```
if form != "inverse_":
    source_text_column = "prompt"
    target_text_column = "answer"
    dataset.source_prefix_text = "[INST] Summarize the following article: "
    dataset.source_suffix_text = " [/INST]"
else:
    source_text_column = "answer"
    target_text_column = "prompt"
    dataset.source_prefix_text = ("[INST] Write an article from the following summary: ")  # fmt: skip
    dataset.source_suffix_text = " [/INST]"
```
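To make the intent of that change easier to check in isolation, here is a self-contained sketch of the same branching. `PromptColumns` and `select_columns` are hypothetical names used only for illustration. The real cnn_dailymail.py sets these values on its own `dataset` object, whose type is not shown in this issue.

```python
from dataclasses import dataclass


@dataclass
class PromptColumns:
    """Hypothetical container mirroring the fields set in the snippet above."""
    source_text_column: str
    target_text_column: str
    source_prefix_text: str
    source_suffix_text: str


def select_columns(form: str) -> PromptColumns:
    """Pick source/target columns and prompt wrappers for the given task form."""
    if form != "inverse_":
        # Forward task: article in, summary out.
        return PromptColumns(
            source_text_column="prompt",
            target_text_column="answer",
            source_prefix_text="[INST] Summarize the following article: ",
            source_suffix_text=" [/INST]",
        )
    # Inverse task: summary in, article out.
    return PromptColumns(
        source_text_column="answer",
        target_text_column="prompt",
        source_prefix_text="[INST] Write an article from the following summary: ",
        source_suffix_text=" [/INST]",
    )


forward = select_columns("")
inverse = select_columns("inverse_")
print(forward.source_text_column, forward.target_text_column)
print(inverse.source_text_column, inverse.target_text_column)
```

Both column names used here, `prompt` and `answer`, exist in the prepared parquet file listed above.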
