We use YAML files to define tasks, which allows us to evaluate multiple tasks in a single run and configure each of them independently. Specifically, you can pass multiple task files or folders at a time for evaluation, and the script will automatically collect all YAML files under those folders recursively.
```bash
# Single node
bash scripts/evaluate.sh task1.yaml task2.yaml dir1 dir2 ...
# Multi node
bash scripts/evaluate_multiple_node.sh task1.yaml task2.yaml dir1 dir2 ...
```
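For intuition, here is a minimal sketch of how such recursive collection can be done with Python's standard `glob` module. This is not the project's actual implementation, and the helper name `collect_task_configs` is made up for illustration.

```python
import glob
import os

def collect_task_configs(args):
    """Expand a mix of YAML files and directories into a flat list of task YAML paths."""
    configs = []
    for arg in args:
        if os.path.isdir(arg):
            # Recursively pick up every *.yaml under the given directory
            configs.extend(glob.glob(os.path.join(arg, "**", "*.yaml"), recursive=True))
        else:
            configs.append(arg)
    return sorted(configs)

# e.g. collect_task_configs(["task1.yaml", "dir1"]) -> ["dir1/sub/task2.yaml", "task1.yaml", ...]
```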
We support two types of evaluation tasks: multi-choice and generation. The YAML config options for both task types are defined in `evaluation/configs.py`. All tasks share a set of common configs that describe the task information:
```yaml
name: 'glue_cola'  # Task name
type: 'mul'  # Task type: 'gen' (generation) or 'mul' (multiple choice)
path: 'bloom/glue_cola'  # Task data path, relative to DATA_PATH in 'evaluate.sh'
use_task_mask: False  # Whether to use [gMASK] for evaluation
unidirectional: False  # Whether to use unidirectional attention
max_seq_length: 2048  # Maximum sequence length
file-pattern:  # Organize jsonl files into groups
  validation: "**/validation.jsonl"  # Matches all files named 'validation.jsonl' under `DATA_PATH/bloom/glue_cola` using glob.glob()
micro-batch-size: 30  # 'gen' tasks only support mbs = 1 for now
```
See `evaluation/configs.py` for the configuration details specific to multi-choice and generation tasks.
We recommend organizing the task data in the following structure and setting up two groups named "validation" and "test" in the `file-pattern` config, so that it becomes easy to evaluate different prompts on the validation and test sets independently.
```
DATA_PATH
└── task_name
    ├── prompt_1
    │   ├── test.jsonl
    │   └── val.jsonl
    ├── prompt_2
    │   ├── test.jsonl
    │   └── val.jsonl
    └── prompt_3
        ├── test.jsonl
        └── val.jsonl
```
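With the layout above, the corresponding `file-pattern` config could look like the following. This is a hedged example: the group names are free-form, and the glob patterns only need to match your own file names.

```yaml
file-pattern:
  validation: "**/val.jsonl"  # matches prompt_*/val.jsonl
  test: "**/test.jsonl"       # matches prompt_*/test.jsonl
```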
The evaluation data for each prompt are organized in JSON Lines (jsonl) format. For multi-choice tasks, each line should be a JSON object of the form
```json
{
    "inputs_pretokenized": "Context and question here",
    "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"],
    "label": int
}
```
The default metric for the multi-choice task is Accuracy.
For generation tasks, each line should be a JSON object of the form
```json
{
    "inputs_pretokenized": "Context and question here",
    "targets_pretokenized": ["Target 1", "Target 2", "Target 3"],
    "label": int
}
```
The default metrics for generation tasks are EM (Exact Match) and F1. Given the inputs, the sequence generated by the model is scored against each target separately, and the highest score is taken.
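A minimal sketch of this max-over-targets scoring, assuming a hypothetical `metric_fn(prediction, target)` helper such as an exact-match or F1 function (the project's own metric functions live in `evaluation/metrics.py` and may be organized differently):

```python
def score_generation(prediction, targets, metric_fn):
    """Score one generated sequence against every reference target and keep the best value."""
    return max(metric_fn(prediction, target) for target in targets)

# e.g. score_generation(generated_text, ["Target 1", "Target 2", "Target 3"], exact_match)
```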
You can customize your evaluation metric function by adding it to `DEFAULT_METRICS` in `evaluation/metrics.py`, and then specify `metric: ['Your metric name']` in the task YAML file.
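As an illustration, registering a custom metric might look like the sketch below. The metric name and function are hypothetical, and the `(prediction, ground_truth)` argument convention is an assumption; check `evaluation/metrics.py` for the actual signature expected by `DEFAULT_METRICS`.

```python
# In evaluation/metrics.py (sketch; the metric-function signature is assumed)
def length_ratio(prediction, ground_truth):
    """Hypothetical metric: length of the prediction relative to the reference answer."""
    return len(prediction) / max(len(ground_truth), 1)

DEFAULT_METRICS["LengthRatio"] = length_ratio
```

You could then set `metric: ['LengthRatio']` in the task YAML file.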
By default, we implement classes named `MultiChoiceTask` and `GenerationTask` in `evaluation/tasks.py` for multi-choice tasks and generation tasks, respectively.
You can implement a new task class that inherits from one of these two classes and override the `process_single_batch` function to define how to process a batch of inputs and obtain the predictions. Following Big-Bench, we implement two methods you can use in your evaluation (see the sketch after this list):

- `model.cond_log_prob()`: compute the probabilities of provided model outputs for given inputs.
- `model.generate_text()`: generate text for given inputs.
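As a rough illustration, a custom task might look like the following. The class `GenerationTask`, the `process_single_batch` hook, and `model.generate_text()` come from the project, but the `self.model` attribute, the batch structure, and the return format below are assumptions made for the sketch.

```python
from evaluation.tasks import GenerationTask


class MyCustomTask(GenerationTask):
    """Hypothetical generation task that post-processes the model's outputs."""

    def process_single_batch(self, batch):
        # `self.model`, the batch layout, and the return format are assumed here;
        # see evaluation/tasks.py for the real interface.
        outputs = self.model.generate_text(batch)
        return [text.strip() for text in outputs]
```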
Once you have created the new task class, specify the relative import path of the class in the `module` field of the task YAML file. See `tasks/lambada/tasks.py` and `tasks/lambada/lambada.yaml` for an example of how we customize the beam-search generation strategy for the LAMBADA task and configure the YAML file.
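For example, a hypothetical custom task placed at `tasks/my_task/tasks.py` could be wired in like this; the path, class name, and field values are placeholders rather than entries taken from the repository.

```yaml
name: 'my_task'
type: 'gen'
path: 'my_task'
module: 'tasks.my_task.tasks.MyCustomTask'  # placeholder: relative import path of your task class
```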