
Commit 4bb8fa5

hepengfe, hunterhector, and feipenghe authored
Datasets reader for Classification Tasks (#516)
* test to branch
* rm test
* two classification datasets
* bank77 test script
* updates
* Delete forte/spacy directory
* bank77 example file
* delete root folder unrelated file
* implemented generic reader for classification dataset
* implemented a generic dataset and write two test cases
* fix some pylint style
* restore .gitignore to upstream
* add more docs and some minor fixes on classification reader
* add a line of code to import ClassificationDatasetReader
* fixed some grammar errors, added more comments and replaced assertion with raising error
* fixed the two classification examples based on the new wrapper class and changed some comments about dataset reader
* fixed some comments and suppressed mypy errors since the reader arguments type is not consistent with its parent class
* added link to the classification example
* reformat using black
* fixed long line
* fixed default configs docstring based on the review feedback.
* fixed ontology paths
* fixed redundant config initialization
* add example documentation
* rewrite docstring and some variable names
* pylint
* black
* black
* mypy
* black
* black and mypy
* add test cases for classification dataset reader
* edits based on code review
* pylint
* removed sys.path.insert
* remove unused external forte packages
* remove test case causing type error
* remove readme
* Delete README.md
* edits based on code review
* remove unused pipeline components
* pylint
* add more instructions and minor edits on example scripts
* fixed issues based on code review
* add sphinx build scripts to data.rst
* corrected test case based on the new data entry, Body
* merge from master
* test import
* test if adding body to base_ontology.json solves importing issue
* remove test ontology
* fixed spelling errors
* add banking77 sample data
* changed banking77 sample data name
* fixed issues based on code review
* add more error checks
* pylint
* setting relative data path to make running example more convenient
* fixed function annotation
* black
* add copyright headers

Co-authored-by: Hector <hunterhector@gmail.com>
Co-authored-by: feipenghe <kern1996@outlook.com>
1 parent de96171 commit 4bb8fa5

12 files changed, with 681 additions and 1 deletion
@@ -0,0 +1,10 @@
+"2","Great CD","My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
+"2","One of the best game music soundtracks - for a game I didn't really play","Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it."
+"1","Batteries died within a year ...","I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power."
+"2","works fine, but Maha Energy is better","Check out Maha Energy's website. Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries). And they have 2200 mAh batteries."
+"2","Great for the non-audiophile","Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote."
+"1","DVD Player crapped out after one year","I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot."
+"1","Incorrect Disc","I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd player gave me problems probably after a year of having it."
+"1","DVD menu select problems","I cannot scroll through a DVD menu that is set up vertically. The triangle keys will only select horizontally. So I cannot select anything on most DVD's besides play. No special features, no language select, nothing, just play."
+"2","Unique Weird Orientalia from the 1930's","Exotic tales of the Orient from the 1930's. ""Dr Shen Fu"", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price. If you're tired of modern authors who all sound alike, this is the antidote for you. Owen's palette is loaded with splashes of Chinese and Japanese colours. Marvelous."
+"1","Not an ""ultimate guide""","Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). However, I did not feel that she imparted any insider secrets that the book promised to reveal. If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. Yet, for those new to the entire affair, this book can definitely clarify the requirements for you."
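
The sample rows above use standard CSV quoting: every field is wrapped in double quotes, and a literal quote inside a field is written as `""`. As a minimal sketch of how such rows can be parsed with Python's standard `csv` module (this is not the reader added in this commit; the shortened texts and the names `SAMPLE`, `LABEL_NAMES`, and `parse_reviews` are made up for illustration):

```python
import csv
import io

# Two rows in the same three-column layout as the sample above
# (label, title, content). The texts are shortened for illustration.
SAMPLE = (
    '"2","Great CD","Everybody says one thing ""Who was that singing ?"""\n'
    '"1","Incorrect Disc","The DVD is giving me problems."\n'
)

# Label meaning as stated in the example README: 1 = negative, 2 = positive.
LABEL_NAMES = {1: "negative", 2: "positive"}


def parse_reviews(text):
    """Yield (class_name, title, content) tuples from ARS-style CSV text."""
    for label, title, content in csv.reader(io.StringIO(text)):
        yield LABEL_NAMES[int(label)], title, content


rows = list(parse_reviews(SAMPLE))
for name, title, content in rows:
    print(name, "|", title, "|", content)
```

The `csv` module collapses each doubled quote into a single quote character, so the content field keeps its embedded quotation intact.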

data_samples/banking77/sample.csv

+20
@@ -0,0 +1,20 @@
+text,category
+How do I locate my card?,card_arrival
+"I still have not received my new card, I ordered over a week ago.",card_arrival
+I ordered a card but it has not arrived. Help please!,card_arrival
+Is there a way to know when my card will arrive?,card_arrival
+My card has not arrived yet.,card_arrival
+When will I get my card?,card_arrival
+Do you know if there is a tracking number for the new card you sent me?,card_arrival
+i have not received my card,card_arrival
+still waiting on that card,card_arrival
+Is it normal to have to wait over a week for my new card?,card_arrival
+How do I track my card?,card_arrival
+How long does a card delivery take?,card_arrival
+I still don't have my card after 2 weeks. What should I do?,card_arrival
+still waiting on my new card,card_arrival
+I am still waiting for my card after 1 week. Is this ok?,card_arrival
+"I have been waiting longer than expected for my bank card, could you provide information on when it will arrive?",card_arrival
+I've been waiting longer than expected for my card.,card_arrival
+Why hasn't my card been delivered?,card_arrival
+Can the card be mailed and used in Europe?,country_support
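
Unlike the Amazon sample, this file starts with a `text,category` header row, uses text labels, and quotes only the fields that contain commas. A small sketch of reading that layout with the standard library follows; it is independent of the Forte reader, and `SAMPLE` and `read_intents` are illustrative names, not part of the commit:

```python
import csv
import io
from collections import Counter

# A few rows copied from the sample above, including the header row.
SAMPLE = (
    "text,category\n"
    "How do I locate my card?,card_arrival\n"
    '"I still have not received my new card, I ordered over a week ago.",card_arrival\n'
    "Can the card be mailed and used in Europe?,country_support\n"
)


def read_intents(text):
    """Return (utterance, intent) pairs; DictReader consumes the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["text"], row["category"]) for row in reader]


examples = read_intents(SAMPLE)
counts = Counter(category for _, category in examples)
print(counts)
```

Counting the categories this way is a quick sanity check that the label column was picked up correctly before configuring a reader.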

docs/code/data.rst

+6
@@ -191,6 +191,12 @@ Readers
 .. autoclass:: forte.datasets.mrc.squad_reader.SquadReader
     :members:
 
+:hidden:`ClassificationDatasetReader`
+--------------------------------------
+.. autoclass:: forte.data.readers.classification_reader.ClassificationDatasetReader
+    :members:
+
+
 DataPack Dataset
 =================

examples/chatbot/chatbot_example.py

-1
@@ -12,7 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import yaml
-
 from termcolor import colored
 import torch
 from fortex.nltk import NLTKSentenceSegmenter, NLTKWordTokenizer, NLTKPOSTagger

examples/classification/README.md

+41
@@ -0,0 +1,41 @@
+## Prepare dataset
+### Amazon Review Sentiment
+The Amazon Review Sentiment (**ARS**) dataset is a binary classification dataset with digit labels: `1` is
+the negative class and `2` is the positive class. Each class has 1,800,000 training samples and 200,000 testing samples.
+The dataset can be downloaded from [link](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz).
+
+
+### Banking77
+Banking77 is a multi-class dataset. It has 77 classes, which are fine-grained intents in the banking domain.
+The training data can be downloaded from [link](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv) and the test data can be downloaded from [link](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv).
+
+## Run classifier
+To quickly try the scripts, run the commands below from the project root folder; they use the bundled sample data.
+```
+python examples/classification/amazon_review_sentiment.py
+```
+
+```
+python examples/classification/bank_customer_intent.py
+```
+
+
+To run a script on the full dataset, download the dataset and set the dataset path in the script accordingly.
+Then, from the `forte/examples/classification/` folder, run the following commands to run the classifier.
+```bash
+python amazon_review_sentiment.py
+```
+
+```bash
+python bank_customer_intent.py
+```
+
+
+## Reader Configuration
+`ClassificationDatasetReader` is designed to read table-like classification datasets. It currently supports only `csv` files, a common file format. To use the reader correctly, check the dataset and configure the reader accordingly. The ARS dataset serves as the running example below.
+* Check the column names of the dataset. The ARS dataset has the columns [label, title, content]: the first column holds the labels, and the second and third columns can serve as input text. Therefore, set `forte_data_fields` to `['label', 'ft.onto.base_ontology.Title', 'ft.onto.base_ontology.Body']`, with one element per dataset column, in column order. `label` is a keyword that tells the reader which column holds the label. `'ft.onto.base_ontology.Title'` and `'ft.onto.base_ontology.Body'` are two Forte data entries that store the input text in the proper wrappers. If the dataset contains columns that should not be used at all, set the corresponding elements of `forte_data_fields` to `None` so that the reader skips them.
+* Check how many classes the dataset has in order to configure `index2class`, a dictionary mapping zero-based indices to class names. For the ARS dataset, simply set it to
+`{0: "negative", 1: "positive"}`. For a dataset with many classes, such as banking77, initialize `class_names` with a list of class names and then set
+`index2class` to `dict(enumerate(class_names))`.
+* Check whether the first line of the dataset contains column names rather than input data. If so, set `skip_k_starting_lines` to `1` to skip the first line. Otherwise, `skip_k_starting_lines` defaults to `0`, which means no line is skipped. To skip multiple lines, set `skip_k_starting_lines` to the number of lines to skip.
+* In some datasets, labels are digits rather than text; in that case, set `digit_label` to `True`. Then check whether the labels start at `1`; if so, also set `one_based_index_label` to `True`.
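
The configuration keys described in the README above can be collected into one dictionary for the ARS dataset. The sketch below uses only the key names mentioned in the README. The two helper functions are hypothetical illustrations of what the settings mean, written for this note; they are not the reader's actual implementation.

```python
# Configuration for the ARS dataset, using the keys named in the README.
class_names = ["negative", "positive"]
ars_reader_config = {
    "forte_data_fields": [
        "label",
        "ft.onto.base_ontology.Title",
        "ft.onto.base_ontology.Body",
    ],
    "index2class": dict(enumerate(class_names)),
    "skip_k_starting_lines": 0,     # the sample rows start with data, not column names
    "digit_label": True,            # labels are the digits 1 and 2
    "one_based_index_label": True,  # labels start at 1, not 0
}


def label_to_class_name(raw_label, config):
    """Hypothetical helper: map a raw CSV label to a class name."""
    if not config["digit_label"]:
        return raw_label  # text labels are assumed to be class names already
    index = int(raw_label)
    if config["one_based_index_label"]:
        index -= 1  # shift 1-based labels onto the zero-based index2class keys
    return config["index2class"][index]


def select_columns(row, forte_data_fields):
    """Hypothetical helper: pair cells with fields, dropping columns set to None."""
    if len(row) != len(forte_data_fields):
        raise ValueError("row length does not match forte_data_fields")
    return {
        field: cell
        for field, cell in zip(forte_data_fields, row)
        if field is not None
    }


print(label_to_class_name("2", ars_reader_config))
print(select_columns(["1", "Incorrect Disc", "..."], ["label", None, "ft.onto.base_ontology.Body"]))
```

In the example scripts of this commit, a dictionary like `ars_reader_config` is passed as `config=` to `pl.set_reader(ClassificationDatasetReader(), ...)`.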
@@ -0,0 +1,48 @@
+# Copyright 2022 The Forte Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+from termcolor import colored
+from forte.data.readers import ClassificationDatasetReader
+from fortex.huggingface import ZeroShotClassifier
+from forte.pipeline import Pipeline
+from fortex.nltk import NLTKSentenceSegmenter
+from ft.onto.base_ontology import Sentence
+
+
+csv_path = "data_samples/amazon_review_polarity_csv/sample.csv"
+pl = Pipeline()
+
+# initialize labels
+class_names = ["negative", "positive"]
+index2class = dict(enumerate(class_names))
+pl.set_reader(
+    ClassificationDatasetReader(), config={"index2class": index2class}
+)
+pl.add(NLTKSentenceSegmenter())
+pl.add(ZeroShotClassifier(), config={"candidate_labels": class_names})
+pl.initialize()
+
+
+for pack in pl.process_dataset(csv_path):
+    for sent in pack.get(Sentence):
+        if (
+            input("Type n for the next sentence and its prediction: ").lower()
+            == "n"
+        ):
+            sent_text = sent.text
+            print(colored("Sentence:", "red"), sent_text, "\n")
+            print(colored("Prediction:", "blue"), sent.classification)
+        else:
+            print("Exit the program due to unrecognized input")
+            sys.exit()
@@ -0,0 +1,139 @@
+# Copyright 2022 The Forte Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+from importlib_metadata import csv
+from termcolor import colored
+
+from forte import Pipeline
+from forte.data.readers import ClassificationDatasetReader
+from fortex.nltk import NLTKSentenceSegmenter
+from fortex.huggingface import ZeroShotClassifier
+from ft.onto.base_ontology import Sentence
+
+
+csv_path = "data_samples/banking77/sample.csv"
+pl = Pipeline()
+# initialize labels
+class_names = [
+    "activate_my_card",
+    "age_limit",
+    "apple_pay_or_google_pay",
+    "atm_support",
+    "automatic_top_up",
+    "balance_not_updated_after_bank_transfer",
+    "balance_not_updated_after_cheque_or_cash_deposit",
+    "beneficiary_not_allowed",
+    "cancel_transfer",
+    "card_about_to_expire",
+    "card_acceptance",
+    "card_arrival",
+    "card_delivery_estimate",
+    "card_linking",
+    "card_not_working",
+    "card_payment_fee_charged",
+    "card_payment_not_recognised",
+    "card_payment_wrong_exchange_rate",
+    "card_swallowed",
+    "cash_withdrawal_charge",
+    "cash_withdrawal_not_recognised",
+    "change_pin",
+    "compromised_card",
+    "contactless_not_working",
+    "country_support",
+    "declined_card_payment",
+    "declined_cash_withdrawal",
+    "declined_transfer",
+    "direct_debit_payment_not_recognised",
+    "disposable_card_limits",
+    "edit_personal_details",
+    "exchange_charge",
+    "exchange_rate",
+    "exchange_via_app",
+    "extra_charge_on_statement",
+    "failed_transfer",
+    "fiat_currency_support",
+    "get_disposable_virtual_card",
+    "get_physical_card",
+    "getting_spare_card",
+    "getting_virtual_card",
+    "lost_or_stolen_card",
+    "lost_or_stolen_phone",
+    "order_physical_card",
+    "passcode_forgotten",
+    "pending_card_payment",
+    "pending_cash_withdrawal",
+    "pending_top_up",
+    "pending_transfer",
+    "pin_blocked",
+    "receiving_money",
+    "Refund_not_showing_up",
+    "request_refund",
+    "reverted_card_payment?",
+    "supported_cards_and_currencies",
+    "terminate_account",
+    "top_up_by_bank_transfer_charge",
+    "top_up_by_card_charge",
+    "top_up_by_cash_or_cheque",
+    "top_up_failed",
+    "top_up_limits",
+    "top_up_reverted",
+    "topping_up_by_card",
+    "transaction_charged_twice",
+    "transfer_fee_charged",
+    "transfer_into_account",
+    "transfer_not_received_by_recipient",
+    "transfer_timing",
+    "unable_to_verify_identity",
+    "verify_my_identity",
+    "verify_source_of_funds",
+    "verify_top_up",
+    "virtual_card_not_working",
+    "visa_or_mastercard",
+    "why_verify_identity",
+    "wrong_amount_of_cash_received",
+    "wrong_exchange_rate_for_cash_withdrawal",
+]
+index2class = dict(enumerate(class_names))
+
+# initialize reader config
+this_reader_config = {
+    "forte_data_fields": [
+        "ft.onto.base_ontology.Body",
+        "label",
+    ],
+    "index2class": index2class,
+    "text_fields": [
+        "ft.onto.base_ontology.Body"
+    ],
+    "digit_label": False,
+    "one_based_index_label": False,
+}
+
+pl.set_reader(ClassificationDatasetReader(), config=this_reader_config)
+pl.add(NLTKSentenceSegmenter())
+pl.add(ZeroShotClassifier(), config={"candidate_labels": class_names})
+pl.initialize()
+
+for pack in pl.process_dataset(csv_path):
+    for sentence in pack.get(Sentence):
+        if (
+            input("Type n for the next sentence and its prediction: ").lower()
+            == "n"
+        ):
+            sent_text = sentence.text
+            print(colored("Sentence:", "red"), sent_text, "\n")
+            print(colored("Prediction:", "blue"), sentence.classification)
+        else:
+            print("Exit the program due to unrecognized input")
+            sys.exit()
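
The banking77 sample file in this commit begins with a `text,category` header row, and the example README describes `skip_k_starting_lines` for exactly that case. Whether the reader's actual default already copes with the header is not visible in this diff. The sketch below only shows, under that stated uncertainty, how the key would be added to a configuration like the one in the script above. The shortened class list, `reader_config_with_header_skip`, and `class2index` are illustrative names, not part of the commit.

```python
# Shortened class list for illustration; the script above lists all 77 intents.
class_names = ["card_arrival", "country_support"]

# Same keys as this_reader_config in the script, plus skip_k_starting_lines
# (a key described in the README) set to 1 to skip a single header row.
reader_config_with_header_skip = {
    "forte_data_fields": ["ft.onto.base_ontology.Body", "label"],
    "index2class": dict(enumerate(class_names)),
    "text_fields": ["ft.onto.base_ontology.Body"],
    "digit_label": False,
    "one_based_index_label": False,
    "skip_k_starting_lines": 1,
}

# Hypothetical inverse lookup from a text label to its zero-based index.
class2index = {
    name: index
    for index, name in reader_config_with_header_skip["index2class"].items()
}
print(class2index["country_support"])
```

Because `index2class` is built with `dict(enumerate(...))`, the index of each class is simply its position in `class_names`, so the order of that list matters.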

forte/data/readers/__init__.py

+1
@@ -30,4 +30,5 @@
 from forte.data.readers.ag_news_reader import *
 from forte.data.readers.largemovie_reader import *
 from forte.data.readers.misc_readers import *
+from forte.data.readers.classification_reader import *
 from forte.data.readers.audio_reader import *
