Datasets reader for Classification Tasks (#516)

hepengfe · hunterhector · feipenghe · web-flow · commit 4bb8fa5bd0be · 2022-02-01T12:44:42.000-05:00
* test to branch

* rm test

* two classification datasets

* bank77 test script

* updates

* Delete forte/spacy directory

* bank77 example file

* delete root folder unrelated file

* implemented generic reader for classification dataset

* implemented a generic dataset and write two test cases

* fix some pylint style

* restore .gitignore to upstream

* add more docs and some minor fixes on classification reader

* add a line of code to import ClassificationDatasetReader

* fixed some grammar errors, added more comments and replaced assertion with raising error

* fixed the two classification examples based on the new wrapper class and changed some comments about dataset reader

* fixed some comments and suppressed mypy errors since the reader arguments type is not consistent with its parent class

* added link to the classification example

* reformat using black

* fixed long line

* fixed default configs docstring based on the review feedback.

* fixed ontology paths

* fixed redundant config initialization

* add example documentation

* rewrite docstring and some variable names

* pylint

* black

* black

* mypy

* black

* black and mypy

* add test cases for classification dataset reader

* edits based on code review

* pylint

* removed sys.path.insert

* remove unused external forte packages

* remove test case causing type error

* remove readme

* Delete README.md

* edits based on code review

* remove unused pipeline components

* pylint

* add more instructions and minor edits on example scripts

* fixed issues based on code review

* add sphinx build scripts to data.rst

* corrected test case based on the new data entry, Body

* merge from master

* test import

* test if adding body to base_ontology.json solves importing issue

* remove test ontology

* fixed spelling errors

* add banking77 sample data

* changed banking77 sample data name

* fixed issues based on code review

* add more error checkings

* pylint

* setting relative data path to make running example more convenient

* fixed function annotation

* black

* add copyright headers

Co-authored-by: Hector <hunterhector@gmail.com>
Co-authored-by: feipenghe <kern1996@outlook.com>
diff --git a/data_samples/amazon_review_polarity_csv/sample.csv b/data_samples/amazon_review_polarity_csv/sample.csv
@@ -0,0 +1,10 @@
+"2","Great CD","My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
+"2","One of the best game music soundtracks - for a game I didn't really play","Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it."
+"1","Batteries died within a year ...","I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power."
+"2","works fine, but Maha Energy is better","Check out Maha Energy's website. Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries). And they have 2200 mAh batteries."
+"2","Great for the non-audiophile","Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines. I am weaning off my VHS collection, but don't want to replace them with DVD's. This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote."
+"1","DVD Player crapped out after one year","I also began having the incorrect disc problems that I've read about on here. The VCR still works, but hte DVD side is useless. I understand that DVD players sometimes just quit on you, but after not even one year? To me that's a sign on bad quality. I'm giving up JVC after this as well. I'm sticking to Sony or giving another brand a shot."
+"1","Incorrect Disc","I love the style of this, but after a couple years, the DVD is giving me problems. It doesn't even work anymore and I use my broken PS2 Now. I wouldn't recommend this, I'm just going to upgrade to a recorder now. I wish it would work but I guess i'm giving up on JVC. I really did like this one... before it stopped working. The dvd player gave me problems probably after a year of having it."
+"1","DVD menu select problems","I cannot scroll through a DVD menu that is set up vertically. The triangle keys will only select horizontally. So I cannot select anything on most DVD's besides play. No special features, no language select, nothing, just play."
+"2","Unique Weird Orientalia from the 1930's","Exotic tales of the Orient from the 1930's. ""Dr Shen Fu"", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price. If you're tired of modern authors who all sound alike, this is the antidote for you. Owen's palette is loaded with splashes of Chinese and Japanese colours. Marvelous."
+"1","Not an ""ultimate guide""","Firstly,I enjoyed the format and tone of the book (how the author addressed the reader). However, I did not feel that she imparted any insider secrets that the book promised to reveal. If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help. If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books. For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation. Yet, for those new to the entire affair, this book can definitely clarify the requirements for you."
diff --git a/data_samples/banking77/sample.csv b/data_samples/banking77/sample.csv
@@ -0,0 +1,20 @@
+text,category
+How do I locate my card?,card_arrival
+"I still have not received my new card, I ordered over a week ago.",card_arrival
+I ordered a card but it has not arrived. Help please!,card_arrival
+Is there a way to know when my card will arrive?,card_arrival
+My card has not arrived yet.,card_arrival
+When will I get my card?,card_arrival
+Do you know if there is a tracking number for the new card you sent me?,card_arrival
+i have not received my card,card_arrival
+still waiting on that card,card_arrival
+Is it normal to have to wait over a week for my new card?,card_arrival
+How do I track my card?,card_arrival
+How long does a card delivery take?,card_arrival
+I still don't have my card after 2 weeks.  What should I do?,card_arrival
+still waiting on my new card,card_arrival
+I am still waiting for my card after 1 week.  Is this ok?,card_arrival
+"I have been waiting longer than expected for my bank card, could you provide information on when it will arrive?",card_arrival
+I've been waiting longer than expected for my card.,card_arrival
+Why hasn't my card been delivered?,card_arrival
+Can the card be mailed and used in Europe?,country_support
diff --git a/docs/code/data.rst b/docs/code/data.rst
@@ -191,6 +191,12 @@ Readers
 .. autoclass:: forte.datasets.mrc.squad_reader.SquadReader
     :members:
 
+:hidden:`ClassificationDatasetReader`
+--------------------------------------
+.. autoclass:: forte.data.readers.classification_reader.ClassificationDatasetReader
+    :members:
+
+
 DataPack Dataset
 =================
 
diff --git a/examples/chatbot/chatbot_example.py b/examples/chatbot/chatbot_example.py
@@ -12,7 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import yaml
-
 from termcolor import colored
 import torch
 from fortex.nltk import NLTKSentenceSegmenter, NLTKWordTokenizer, NLTKPOSTagger
diff --git a/examples/classification/README.md b/examples/classification/README.md
@@ -0,0 +1,41 @@
+## Prepare dataset
+### Amazon Review Sentiment
+The Amazon Review Sentiment (**ARS**) dataset is a binary classification dataset with digit labels: `1` is
+negative and `2` is positive. Each class has 1,800,000 training samples and 200,000 testing samples.
+The dataset can be downloaded from [link](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz).
+
+
+### banking77
+Banking77 is a multi-class dataset. It has 77 classes, which are fine-grained intents in the banking domain.
+The training data can be downloaded from [link](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv) and the test data from [link](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv).
+
+## Run classifier
+To quickly test the scripts, run the commands below from the project root folder; they use the bundled sample data.
+```
+python examples/classification/amazon_review_sentiment.py
+```
+
+```
+python examples/classification/bank_customer_intent.py
+```
+
+
+To run a script on the full dataset, download the dataset and set the dataset path in the script accordingly.
+Then, from the `forte/examples/classification/` folder, run the corresponding command below.
+```bash
+python amazon_review_sentiment.py
+```
+
+```bash
+python bank_customer_intent.py
+```
+
+
+## Reader Configuration
+`ClassificationDatasetReader` is designed to read table-like classification datasets. It currently supports only `csv` files, a common format for such datasets. To use the reader correctly, inspect the dataset and configure the reader accordingly. The ARS dataset serves as the running example in the steps below.
+* Check the columns of the dataset. The example dataset has the columns [label, title, content]: the first column holds the data labels, and the second and third columns can serve as input text. Therefore, set `forte_data_fields` to `['label', 'ft.onto.base_ontology.Title', 'ft.onto.base_ontology.Body']`, so that each element matches the dataset column at the same position. `label` is a keyword the reader uses to identify the label column. `'ft.onto.base_ontology.Title'` and `'ft.onto.base_ontology.Body'` are two Forte data entries that store the input text in proper wrappers. If the dataset contains columns that are not needed, set the corresponding elements of `forte_data_fields` to `None` so that the reader skips them.
+* Check how many classes the dataset has in order to configure `index2class`, a dictionary mapping zero-based indices to class names. For the ARS dataset, simply set it to
+    `{0: "negative", 1: "positive"}`. For a dataset with many classes, such as banking77, initialize `class_names` as a list of class names and then set
+    `index2class` to `dict(enumerate(class_names))`.
+* Check whether the first line of the dataset contains column names rather than input data. If so, set `skip_k_starting_lines` to `1` to skip it. Otherwise, leave `skip_k_starting_lines` at its default of `0`, which skips nothing. To skip several lines, set `skip_k_starting_lines` to the number of lines to skip.
+* If the dataset labels are digits rather than text, set `digit_label` to `True`. Then check whether the labels start at `1`; if so, also set `one_based_index_label` to `True`.
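
The configuration steps above can be sketched for the ARS sample as follows. This is a minimal illustration, not the reader's actual implementation: the config keys are the ones named in the README, while `resolve_class_name` is a hypothetical helper that only mimics how `digit_label` and `one_based_index_label` are described to interact with `index2class`. It passes text labels through unchanged, which may differ from what the reader does internally.

```python
# Config keys come from the README above; the values describe the ARS sample.
class_names = ["negative", "positive"]
index2class = dict(enumerate(class_names))  # {0: "negative", 1: "positive"}

ars_reader_config = {
    "forte_data_fields": [
        "label",                        # first column: the label
        "ft.onto.base_ontology.Title",  # second column: review title
        "ft.onto.base_ontology.Body",   # third column: review text
    ],
    "index2class": index2class,
    "digit_label": True,            # labels are the digits "1" and "2"
    "one_based_index_label": True,  # labels start at 1, not 0
    "skip_k_starting_lines": 0,     # the ARS sample has no header row
}


def resolve_class_name(raw_label, config):
    """Hypothetical helper: map a raw CSV label to a class name."""
    if not config["digit_label"]:
        # Text labels are assumed to already be class names.
        return raw_label
    index = int(raw_label)
    if config["one_based_index_label"]:
        index -= 1  # shift to the zero-based keys of index2class
    return config["index2class"][index]


print(resolve_class_name("1", ars_reader_config))  # negative
print(resolve_class_name("2", ars_reader_config))  # positive
```

A config of this shape would be passed as `config=` to `pl.set_reader(ClassificationDatasetReader(), ...)`, as the example scripts in this commit do.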
diff --git a/examples/classification/amazon_review_sentiment.py b/examples/classification/amazon_review_sentiment.py
@@ -0,0 +1,48 @@
+# Copyright 2022 The Forte Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+from termcolor import colored
+from forte.data.readers import ClassificationDatasetReader
+from fortex.huggingface import ZeroShotClassifier
+from forte.pipeline import Pipeline
+from fortex.nltk import NLTKSentenceSegmenter
+from ft.onto.base_ontology import Sentence
+
+
+csv_path = "data_samples/amazon_review_polarity_csv/sample.csv"
+pl = Pipeline()
+
+# initialize labels
+class_names = ["negative", "positive"]
+index2class = dict(enumerate(class_names))
+pl.set_reader(
+    ClassificationDatasetReader(), config={"index2class": index2class}
+)
+pl.add(NLTKSentenceSegmenter())
+pl.add(ZeroShotClassifier(), config={"candidate_labels": class_names})
+pl.initialize()
+
+
+for pack in pl.process_dataset(csv_path):
+    for sent in pack.get(Sentence):
+        if (
+            input("Type n for the next sentence and its prediction: ").lower()
+            == "n"
+        ):
+            sent_text = sent.text
+            print(colored("Sentence:", "red"), sent_text, "\n")
+            print(colored("Prediction:", "blue"), sent.classification)
+        else:
+            print("Exit the program due to unrecognized input")
+            sys.exit()
diff --git a/examples/classification/bank_customer_intent.py b/examples/classification/bank_customer_intent.py
@@ -0,0 +1,139 @@
+# Copyright 2022 The Forte Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+
+from termcolor import colored
+
+from forte import Pipeline
+from forte.data.readers import ClassificationDatasetReader
+from fortex.nltk import NLTKSentenceSegmenter
+from fortex.huggingface import ZeroShotClassifier
+from ft.onto.base_ontology import Sentence
+
+
+csv_path = "data_samples/banking77/sample.csv"
+pl = Pipeline()
+# initialize labels
+class_names = [
+    "activate_my_card",
+    "age_limit",
+    "apple_pay_or_google_pay",
+    "atm_support",
+    "automatic_top_up",
+    "balance_not_updated_after_bank_transfer",
+    "balance_not_updated_after_cheque_or_cash_deposit",
+    "beneficiary_not_allowed",
+    "cancel_transfer",
+    "card_about_to_expire",
+    "card_acceptance",
+    "card_arrival",
+    "card_delivery_estimate",
+    "card_linking",
+    "card_not_working",
+    "card_payment_fee_charged",
+    "card_payment_not_recognised",
+    "card_payment_wrong_exchange_rate",
+    "card_swallowed",
+    "cash_withdrawal_charge",
+    "cash_withdrawal_not_recognised",
+    "change_pin",
+    "compromised_card",
+    "contactless_not_working",
+    "country_support",
+    "declined_card_payment",
+    "declined_cash_withdrawal",
+    "declined_transfer",
+    "direct_debit_payment_not_recognised",
+    "disposable_card_limits",
+    "edit_personal_details",
+    "exchange_charge",
+    "exchange_rate",
+    "exchange_via_app",
+    "extra_charge_on_statement",
+    "failed_transfer",
+    "fiat_currency_support",
+    "get_disposable_virtual_card",
+    "get_physical_card",
+    "getting_spare_card",
+    "getting_virtual_card",
+    "lost_or_stolen_card",
+    "lost_or_stolen_phone",
+    "order_physical_card",
+    "passcode_forgotten",
+    "pending_card_payment",
+    "pending_cash_withdrawal",
+    "pending_top_up",
+    "pending_transfer",
+    "pin_blocked",
+    "receiving_money",
+    "Refund_not_showing_up",
+    "request_refund",
+    "reverted_card_payment?",
+    "supported_cards_and_currencies",
+    "terminate_account",
+    "top_up_by_bank_transfer_charge",
+    "top_up_by_card_charge",
+    "top_up_by_cash_or_cheque",
+    "top_up_failed",
+    "top_up_limits",
+    "top_up_reverted",
+    "topping_up_by_card",
+    "transaction_charged_twice",
+    "transfer_fee_charged",
+    "transfer_into_account",
+    "transfer_not_received_by_recipient",
+    "transfer_timing",
+    "unable_to_verify_identity",
+    "verify_my_identity",
+    "verify_source_of_funds",
+    "verify_top_up",
+    "virtual_card_not_working",
+    "visa_or_mastercard",
+    "why_verify_identity",
+    "wrong_amount_of_cash_received",
+    "wrong_exchange_rate_for_cash_withdrawal",
+]
+index2class = dict(enumerate(class_names))
+
+# initialize reader config
+this_reader_config = {
+    "forte_data_fields": [
+        "ft.onto.base_ontology.Body",
+        "label",
+    ],
+    "index2class": index2class,
+    "text_fields": [
+        "ft.onto.base_ontology.Body"
+    ],
+    "digit_label": False,
+    "one_based_index_label": False,
+}
+
+pl.set_reader(ClassificationDatasetReader(), config=this_reader_config)
+pl.add(NLTKSentenceSegmenter())
+pl.add(ZeroShotClassifier(), config={"candidate_labels": class_names})
+pl.initialize()
+
+for pack in pl.process_dataset(csv_path):
+    for sentence in pack.get(Sentence):
+        if (
+            input("Type n for the next sentence and its prediction: ").lower()
+            == "n"
+        ):
+            sent_text = sentence.text
+            print(colored("Sentence:", "red"), sent_text, "\n")
+            print(colored("Prediction:", "blue"), sentence.classification)
+        else:
+            print("Exit the program due to unrecognized input")
+            sys.exit()
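
The hard-coded `class_names` list in the script above could also be derived from the data. The sketch below assumes a CSV with a `text,category` header, as in `data_samples/banking77/sample.csv`. `collect_class_names` is a hypothetical helper, not part of Forte, and the inline CSV string stands in for a real file so that the snippet is self-contained.

```python
import csv
import io

# A few rows copied from data_samples/banking77/sample.csv.
SAMPLE_CSV = """text,category
How do I locate my card?,card_arrival
"I still have not received my new card, I ordered over a week ago.",card_arrival
Can the card be mailed and used in Europe?,country_support
"""


def collect_class_names(csv_file):
    """Return the distinct `category` values in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for row in csv.DictReader(csv_file):
        seen.setdefault(row["category"], None)
    return list(seen)


class_names = collect_class_names(io.StringIO(SAMPLE_CSV))
index2class = dict(enumerate(class_names))
print(index2class)  # {0: 'card_arrival', 1: 'country_support'}
```

The bundled sample covers only two of the 77 intents, so the full training file would be needed to recover the complete label set. Because the script sets `digit_label` to `False`, the order of `class_names` should matter less there than it does for digit-labelled data.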
diff --git a/forte/data/readers/__init__.py b/forte/data/readers/__init__.py
@@ -30,4 +30,5 @@
 from forte.data.readers.ag_news_reader import *
 from forte.data.readers.largemovie_reader import *
 from forte.data.readers.misc_readers import *
+from forte.data.readers.classification_reader import *
 from forte.data.readers.audio_reader import *
diff --git a/forte/data/readers/classification_reader.py b/forte/data/readers/classification_reader.py
diff --git a/forte/ontology_specs/base_ontology.json b/forte/ontology_specs/base_ontology.json
diff --git a/ft/onto/base_ontology.py b/ft/onto/base_ontology.py
diff --git a/tests/forte/data/readers/classification_reader_test.py b/tests/forte/data/readers/classification_reader_test.py