
Jupyter Notebook Tutorial for zero shot text classification #799


Merged: 33 commits merged into master on Jun 8, 2022

Conversation

@bhaskar2443053 (Collaborator) commented May 23, 2022

This issue is part of #691
This PR fixes #734

Description of changes

This tutorial is an introduction to the Zero-Shot Learning technique for text classification.

Possible effects of this PR:

Removed text_classification_pipeline.ipynb, as it has been replaced by the new tutorial.

Tests Conducted

Describe what test cases are included for the PR.

@codecov bot commented May 23, 2022

Codecov Report

Merging #799 (280b574) into master (6ddbcab) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #799   +/-   ##
=======================================
  Coverage   80.50%   80.50%           
=======================================
  Files         252      252           
  Lines       19363    19363           
=======================================
  Hits        15588    15588           
  Misses       3775     3775           


@hunterhector (Member) left a comment

Since the tutorial is quite large, let me comment here instead of inline:

  1. The rendering format is a bit off; you can check it here: https://asyml-forte--799.org.readthedocs.build/en/799/notebook_tutorial/zero_shot_text_classification.html. Notice that the bullet list format is not working.
  2. The explanation of the pipeline is a bit too rigid. You introduce the whole configuration list, but that's a bit too lengthy and we should probably talk about the code, not what the configuration parameters are.
  3. There are grammar errors from time to time. For example, "from hugging face model" should be "from the HuggingFace model", and instead of "Zero shot is a powerful tool" it should be "the Zero Shot model is a powerful tool".
  4. The cell output is really long; I don't think we want to include all of it.

Overall, there is an improvement to the writing, but it still does not feel like a tutorial to me.

@@ -0,0 +1,5 @@
label,Title,description
Member

The 10k sample is a bit too large for the repo (4MB), which would increase the git history size by quite a bit. Let's consider hosting it at a different place?

Collaborator

Or just keeping a few samples (~20) is good enough for the tutorial.

Collaborator (Author)

Removed the 10k sample and added a 20-row sample.
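
For reference, a minimal pandas sketch of how such a 20-row sample could be produced (the file names here are hypothetical; the column header label,Title,description follows the diff above):

```python
import pandas as pd

# Load the original 10k ARS sample (hypothetical file name); the columns
# follow the CSV header shown in the diff: label, Title, description.
df = pd.read_csv("amazon_review_polarity_10k.csv")

# Keep a small, reproducible 20-row sample for the tutorial and the repo.
sample = df.sample(n=20, random_state=42)
sample.to_csv("amazon_review_polarity_sample_20.csv", index=False)
```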

"metadata": {},
"source": [
"## Introduction\n",
"Zero shot text classification is an extreme example of transfer learning, where the model tries to predict without any fine tuning. This kind of model predict classification result based on similarity between the input text and label name, so they don't require any labeled data. Well defined Label names are sufficient to achieve a reasonably performing model. We will be using [valhalla/distilbart-mnli-12-1](https://huggingface.co/valhalla/distilbart-mnli-12-1) as our text classification model, which was trained on [Multi-nli](https://huggingface.co/datasets/multi_nli) dataset to make predictions on a subset of amazon review sentiment [ARS](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz) dataset and [Banking77](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv) dataset. we will also see how the forte pipeline works seamlessly with different third party tools like [nltk](https://www.nltk.org/) and [huggingface](https://huggingface.co/) to make our life easier."
Collaborator

Could you check for grammar errors? I have seen some apparent ones.

Collaborator (Author)

I fixed them using Grammarly.
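
For readers who want to try the zero-shot step described in the introduction cell quoted above outside of forte, here is a minimal sketch that calls the HuggingFace transformers pipeline directly (the model name comes from the notebook; the example text and candidate labels are made up):

```python
from transformers import pipeline

# Zero-shot classifier backed by the same model the tutorial uses.
classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-1",
)

# The model scores each candidate label by how well the input text entails
# a hypothesis built from that label, so no labeled data or fine-tuning is
# needed; well-defined label names are enough.
result = classifier(
    "The battery died after two days and the seller never replied.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"][0], result["scores"][0])
```

In the tutorial itself, this step runs inside a forte pipeline together with a reader and other components such as NLTKSentenceSegmenter().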

@@ -156,3 +156,5 @@ GroupType
PathLike
ElementType
customizable
sklearn
@hepengfe (Collaborator) commented May 24, 2022

For package names or suffixes, maybe you can mention them using double backquotes, like ``sklearn`` and ``csv``, rather than adding them to the spelling wordlist.

Collaborator (Author)

Used double backquotes to fix the spellings not in the wordlist.

@bhaskar2443053 (Collaborator, Author)

I am unable to fix the bullet points. I tried '- ', '* ', and '1. ', but none of them render correctly in the docs build.
Any suggestions?

@hepengfe (Collaborator) commented Jun 7, 2022

> I am unable to fix the bullet points. I tried '- ', '* ', and '1. ', but none of them render correctly in the docs build. Any suggestions?

According to https://buildmedia.readthedocs.org/media/pdf/nbsphinx/latest/nbsphinx.pdf,

> In case that your Bulletpoints do render in the notebook and do not render with nbsphinx, please add one blank line before the bulletpoints.

You can Google the notebook issue with the keyword nbsphinx, which is the package we are using to convert the notebook to HTML.

@hepengfe (Collaborator) commented Jun 7, 2022

I have a few suggestions.

  1. You could remove the square brackets around the mentioned issues in the PR description.
  2. Some descriptions, such as the one below, are inaccurate or don't relate very well to forte. I would suggest referring to some official definition and rephrasing it to show how forte can handle data and organize NLP components for this particular task.

> Text classification is the process of grouping text data into various categories based on some algorithm. These algorithms are at the heart of a variety of software systems that process text data at scale

  3. Maybe backtick the pipeline components in the markdown cells, such as NLTKSentenceSegmenter().

In general, I would suggest reviewing both the notebook and the rendered HTML page after you are done editing.

@bhaskar2443053 (Collaborator, Author)

> In case that your Bulletpoints do render in the notebook and do not render with nbsphinx, please add one blank line before the bulletpoints.

Adding a blank line above the bullets worked.

@bhaskar2443053 (Collaborator, Author)

> 3. backtick

Mostly fixed all the above comments.

@hepengfe (Collaborator) commented Jun 8, 2022

Some minor issues. After these are fixed I think it's ready to merge.

> These algorithms are at the heart of various software systems that process text data at scale.

I think we probably don't want to keep this sentence, as it doesn't relate very well to either the context or forte.

> body is the complete description of the review.

body -> Body

> where the model tries to classify without any fine-tuning

without any fine-tuning on a single data.

> let’s also check the class balance of the data by using groupby and count functions in pandas.

let -> Let

> The first entry to our pipeline will be the data reading step.

Rather than entry, we want to refer to them as component or pipeline component, as entry refers to something else in forte.

> Accuracy of 95% on unseen data is an exceptional performance from the Hugging-Face model. Zero-Shot learning is a powerful tool for low volume of label data.

I think we don't need to backtick HuggingFace and zero-shot learning, as they are already well-known names. By backticking package names, I mean that if, say, transformers appears as a Python package name in a markdown cell, we definitely need to backtick it.
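
As an aside on the class-balance point quoted above, a minimal pandas sketch of that check (the file name is hypothetical; the label column follows the sample CSV header):

```python
import pandas as pd

df = pd.read_csv("amazon_review_polarity_sample_20.csv")

# Count how many reviews fall under each label to check class balance.
print(df.groupby("label")["label"].count())

# An equivalent, more idiomatic one-liner:
print(df["label"].value_counts())
```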

@bhaskar2443053 (Collaborator, Author)

> Some minor issues. After these are fixed I think it's ready to merge. […]

Modified according to the above feedback.

Successfully merging this pull request may close these issues: jupyter notebook classification example with banking77 data