
Commit 8fd8c8e

Add multi-label text classification support to pytorch example (#24770)
* Add text classification example
* set the problem type and finetuning task
* ruff reformated
* fix bug for unseting label_to_id for regression
* update README.md
* fixed finetuning task
* update comment
* check if label exists in feature before removing
* add useful logging
1 parent 7381987 commit 8fd8c8e

File tree

2 files changed: +780 -0 lines changed


examples/pytorch/text-classification/README.md

Lines changed: 49 additions & 0 deletions
@@ -81,6 +81,55 @@ python run_glue.py \

> If your model classification head dimensions do not fit the number of labels in the dataset, you can specify `--ignore_mismatched_sizes` to adapt it.

## Text classification

As an alternative, we can use the script [`run_classification.py`](./run_classification.py) to fine-tune models on a single/multi-label classification task.

The following example fine-tunes BERT on the `en` subset of the [`amazon_reviews_multi`](https://huggingface.co/datasets/amazon_reviews_multi) dataset. We can specify the metric and the label column, and also choose which text columns to use jointly for classification.
```bash
dataset="amazon_reviews_multi"
subset="en"
python run_classification.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name ${dataset} \
    --dataset_config_name ${subset} \
    --shuffle_train_dataset \
    --metric_name accuracy \
    --text_column_name "review_title,review_body,product_category" \
    --text_column_delimiter "\n" \
    --label_column_name stars \
    --do_train \
    --do_eval \
    --max_seq_length 512 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --output_dir /tmp/${dataset}_${subset}/
```
Training for 1 epoch results in an accuracy of around 0.5958 when using only `review_body`, and around 0.659 when combining title, body, and category.
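When several text columns are passed, the script concatenates them with the chosen delimiter before tokenization. The sketch below illustrates that joining step on a toy row; the `join_text_columns` helper and the sample values are illustrative, not the script's actual code:

```python
# Hypothetical sketch of how multiple --text_column_name columns are combined:
# the selected columns are joined with --text_column_delimiter into a single
# string, which is then tokenized as the model input.
def join_text_columns(example, columns, delimiter="\n"):
    """Concatenate the selected text columns into one input string."""
    return delimiter.join(str(example[c]) for c in columns)

# Toy row with the column names used in the amazon_reviews_multi example above.
row = {
    "review_title": "Great blender",
    "review_body": "Crushes ice easily.",
    "product_category": "kitchen",
}
text = join_text_columns(row, ["review_title", "review_body", "product_category"])
print(text)
```

This is why the accuracy improves when title, body, and category are combined: the model sees all three fields in one sequence.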
The following is a multi-label classification example. It fine-tunes BERT on the `reuters21578` dataset hosted on our [hub](https://huggingface.co/datasets/reuters21578):
```bash
dataset="reuters21578"
subset="ModApte"
python run_classification.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name ${dataset} \
    --dataset_config_name ${subset} \
    --shuffle_train_dataset \
    --remove_splits "unused" \
    --metric_name f1 \
    --text_column_name text \
    --label_column_name topics \
    --do_train \
    --do_eval \
    --max_seq_length 512 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 15 \
    --output_dir /tmp/${dataset}_${subset}/
```
It results in a micro F1 score of around 0.82 without any text and label filtering. Note that you have to explicitly remove the "unused" split from the dataset, since it is not used for classification.

### Mixed precision training
