
Jupyter Notebook Tutorial for zero shot text classification #799


Merged: 33 commits merged into master on Jun 8, 2022

Conversation

@bhaskar2443053 (Collaborator) commented May 23, 2022

This issue is part of #691
This PR fixes #734

Description of changes

This tutorial is an introduction to the Zero-Shot Learning technique for text classification.

Possible effects of this PR:

Removed text_classification_pipeline.ipynb, as it has been replaced by the new tutorial.

Tests Conducted

Describe what test cases are included for the PR.

@codecov bot commented May 23, 2022

Codecov Report

Merging #799 (280b574) into master (6ddbcab) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #799   +/-   ##
=======================================
  Coverage   80.50%   80.50%           
=======================================
  Files         252      252           
  Lines       19363    19363           
=======================================
  Hits        15588    15588           
  Misses       3775     3775           


@hunterhector (Member) left a comment

Since the tutorial is quite large, let me comment here instead of inline:

  1. The rendering format is a bit off; you can check it here: https://asyml-forte--799.org.readthedocs.build/en/799/notebook_tutorial/zero_shot_text_classification.html. Notice that the bullet list format is not working.
  2. The explanation of the pipeline is a bit too rigid. You introduce the whole configuration list, but that's a bit too lengthy and we should probably talk about the code, not what the configuration parameters are.
  3. There are grammar errors from time to time. For example, "from hugging face model" should be "from the HuggingFace model", and instead of "Zero shot is a powerful tool" it should be "the Zero Shot model is a powerful tool".
  4. The cell output is really long; I don't think we want to include all of it.

Overall, there is an improvement to the writing, but it still does not feel like a tutorial to me.

@@ -0,0 +1,5 @@
label,Title,description
Member

The 10k sample is a bit too large for the repo (4MB), which would increase the git history size by quite a bit. Let's consider hosting it at a different place?

Collaborator

Or just keeping a few samples (~20) is good enough for the tutorial.

Collaborator (Author)

Removed the 10k sample and added a 20-row sample.
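
For reference, a minimal pandas sketch of how such a 20-row sample could be produced (the file names here are hypothetical; the column header label,Title,description follows the diff above):

```python
import pandas as pd

# Load the original 10k ARS sample (hypothetical file name); the columns
# follow the CSV header shown in the diff: label, Title, description.
df = pd.read_csv("amazon_review_polarity_10k.csv")

# Keep a small, reproducible 20-row sample for the tutorial and the repo.
sample = df.sample(n=20, random_state=42)
sample.to_csv("amazon_review_polarity_sample_20.csv", index=False)
```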

"metadata": {},
"source": [
"## Introduction\n",
"Zero shot text classification is an extreme example of transfer learning, where the model tries to predict without any fine tuning. This kind of model predict classification result based on similarity between the input text and label name, so they don't require any labeled data. Well defined Label names are sufficient to achieve a reasonably performing model. We will be using [valhalla/distilbart-mnli-12-1](https://huggingface.co/valhalla/distilbart-mnli-12-1) as our text classification model, which was trained on [Multi-nli](https://huggingface.co/datasets/multi_nli) dataset to make predictions on a subset of amazon review sentiment [ARS](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz) dataset and [Banking77](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv) dataset. we will also see how the forte pipeline works seamlessly with different third party tools like [nltk](https://www.nltk.org/) and [huggingface](https://huggingface.co/) to make our life easier."
Collaborator

Could you check for grammar errors? I have seen some apparent ones.

Collaborator (Author)

I fixed them using Grammarly.
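
For readers who want to try the zero-shot step described in the introduction cell quoted above outside of forte, here is a minimal sketch that calls the HuggingFace transformers pipeline directly (the model name comes from the notebook; the example text and candidate labels are made up):

```python
from transformers import pipeline

# Zero-shot classifier backed by the same model the tutorial uses.
classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-1",
)

# The model scores each candidate label by how well the input text entails
# a hypothesis built from that label, so no labeled data or fine-tuning is
# needed; well-defined label names are enough.
result = classifier(
    "The battery died after two days and the seller never replied.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"][0], result["scores"][0])
```

In the tutorial itself, this step runs inside a forte pipeline together with a reader and other components such as NLTKSentenceSegmenter().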

@@ -156,3 +156,5 @@ GroupType
PathLike
ElementType
customizable
sklearn
@hepengfe (Collaborator) commented May 24, 2022

For package names or suffixes, maybe you can mention them using double backquotes, like ``sklearn`` and ``csv``, rather than adding them to the spelling wordlist.

Collaborator (Author)

Used double backquotes to fix the spellings not in the wordlist.

@bhaskar2443053 (Collaborator, Author)

I am unable to fix the bullet points. I tried '- ', '* ', and '1. ', but none of them render correctly in the docs build.
Any suggestions?

@hepengfe (Collaborator) commented Jun 7, 2022

> I am unable to fix the bullet points. I tried '- ', '* ', and '1. ', but none of them render correctly in the docs build. Any suggestions?

According to https://buildmedia.readthedocs.org/media/pdf/nbsphinx/latest/nbsphinx.pdf,

> In case that your Bulletpoints do render in the notebook and do not render with nbsphinx, please add one blank line before the bulletpoints.

You can Google the notebook issue with the keyword nbsphinx, which is the package we are using to convert the notebook to HTML.

@hepengfe (Collaborator) commented Jun 7, 2022

I have a few suggestions.

  1. You could remove the square brackets around the mentioned issues in the PR description.
  2. Some descriptions, such as the one below, are inaccurate or don't relate very well to forte. I would suggest referring to some official definition and rephrasing it to show how forte can handle data and organize NLP components for this particular task.

> Text classification is the process of grouping text data into various categories based on some algorithm. These algorithms are at the heart of a variety of software systems that process text data at scale

  3. Maybe backtick the pipeline components in the markdown cells, such as NLTKSentenceSegmenter().

In general, I would suggest reviewing both the notebook and the rendered HTML page after you are done editing.

@bhaskar2443053 (Collaborator, Author)

> In case that your Bulletpoints do render in the notebook and do not render with nbsphinx, please add one blank line before the bulletpoints.

Adding a blank line above the bullets worked.

@bhaskar2443053 (Collaborator, Author)

> 3. backtick

Mostly fixed all the above comments.

@hepengfe (Collaborator) commented Jun 8, 2022

Some minor issues. After these are fixed I think it's ready to merge.

> These algorithms are at the heart of various software systems that process text data at scale.

I think we probably don't want to keep this sentence, as it doesn't relate very well to either the context or forte.

> body is the complete description of the review.

body -> Body

> where the model tries to classify without any fine-tuning

without any fine-tuning on a single data.

> let’s also check the class balance of the data by using groupby and count functions in pandas.

let -> Let

> The first entry to our pipeline will be the data reading step.

Rather than entry, we want to refer to them as component or pipeline component, as entry refers to something else in forte.

> Accuracy of 95% on unseen data is an exceptional performance from the Hugging-Face model. Zero-Shot learning is a powerful tool for low volume of label data.

I think we don't need to backtick HuggingFace and zero-shot learning, as they are already well-known names. By backticking package names, I mean that if, say, transformers appears as a Python package name in a markdown cell, we definitely need to backtick it.
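
As an aside on the class-balance point quoted above, a minimal pandas sketch of that check (the file name is hypothetical; the label column follows the sample CSV header):

```python
import pandas as pd

df = pd.read_csv("amazon_review_polarity_sample_20.csv")

# Count how many reviews fall under each label to check class balance.
print(df.groupby("label")["label"].count())

# An equivalent, more idiomatic one-liner:
print(df["label"].value_counts())
```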

@bhaskar2443053 (Collaborator, Author)

> Some minor issues. After these are fixed I think it's ready to merge. […]

Modified according to the above feedback.

Successfully merging this pull request may close these issues: jupyter notebook classification example with banking77 data