Jupyter Notebook Tutorial for zero shot text classification #799
Conversation
…skar2443053/forte into tutorial_text_classification
Codecov Report
@@ Coverage Diff @@
## master #799 +/- ##
=======================================
Coverage 80.50% 80.50%
=======================================
Files 252 252
Lines 19363 19363
=======================================
Hits 15588 15588
Misses 3775 3775

Continue to review full report at Codecov.
Since the tutorial is quite large let me comment them here instead of inline:
- The rendering format is a bit off; you can check it here: https://asyml-forte--799.org.readthedocs.build/en/799/notebook_tutorial/zero_shot_text_classification.html. Notice that the bullet list format is not working.
- The explanation of the pipeline is a bit too rigid. You introduce the whole configuration list, but that's a bit too lengthy and we should probably talk about the code, not what the configuration parameters are.
- There are grammar errors from time to time, like `from hugging face model`; it should be `from the HuggingFace model`. Instead of `Zero shot is a powerful tool`, it should be `Zero Shot model is a powerful tool`.
- The cell output is really long; I don't think we want to include all of it.
Overall, there is an improvement to the writing, but it still does not feel like a tutorial to me.
@@ -0,0 +1,5 @@
label,Title,description
The sample 10k is a bit too large for the repo (4MB), which would increase the git history by quite a bit. Let's consider hosting it at a different place?
Or just keeping a few samples (~20) is good enough for the tutorial.
Removed the 10k sample and added a 20-row sample.
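For reference, a minimal sketch of how such a 20-row sample could be produced from the larger CSV with pandas (the file names here are assumptions, not the actual paths in the PR):

```python
import pandas as pd

# Load the full sample shipped with the draft tutorial (assumed file name).
df = pd.read_csv("amazon_review_polarity_sample_10k.csv")

# Keep a small, reproducible 20-row subset for the repo's data folder.
sample = df.sample(n=20, random_state=42)
sample.to_csv("amazon_sample_20.csv", index=False)
```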
"metadata": {}, | ||
"source": [ | ||
"## Introduction\n", | ||
"Zero shot text classification is an extreme example of transfer learning, where the model tries to predict without any fine tuning. This kind of model predict classification result based on similarity between the input text and label name, so they don't require any labeled data. Well defined Label names are sufficient to achieve a reasonably performing model. We will be using [valhalla/distilbart-mnli-12-1](https://huggingface.co/valhalla/distilbart-mnli-12-1) as our text classification model, which was trained on [Multi-nli](https://huggingface.co/datasets/multi_nli) dataset to make predictions on a subset of amazon review sentiment [ARS](https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz) dataset and [Banking77](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv) dataset. we will also see how the forte pipeline works seamlessly with different third party tools like [nltk](https://www.nltk.org/) and [huggingface](https://huggingface.co/) to make our life easier." |
Could you check for grammar errors? I have seen some apparent ones.
I fixed them using Grammarly.
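For context, here is a minimal sketch of the kind of zero-shot prediction the introduction cell describes, using the HuggingFace `transformers` zero-shot-classification pipeline with the same `valhalla/distilbart-mnli-12-1` checkpoint. This is only an illustration, not the tutorial's actual Forte pipeline code, and the example text and labels are made up:

```python
from transformers import pipeline

# Load the MNLI-trained model named in the introduction as a zero-shot classifier.
classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-1",
)

# Classify an Amazon-style review against sentiment labels without any fine-tuning;
# the model scores each label by how well it is entailed by the input text.
result = classifier(
    "The battery died after two days and support never replied.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label first
```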
docs/spelling_wordlist.txt
@@ -156,3 +156,5 @@ GroupType
PathLike
ElementType
customizable
sklearn
For package names or suffixes, maybe you can mention them by using double backquotes, like ``sklearn`` and ``csv``, rather than adding them to the spelling wordlist.
Used double backquotes to fix spellings not in the wordlist.
…to tutorial_text_classification
I am unable to fix the bullet points. I tried '- ', '* ', and '1. ', but none of them render correctly in the docs build.
According to https://buildmedia.readthedocs.org/media/pdf/nbsphinx/latest/nbsphinx.pdf,
you can Google the notebook issue with the keyword
I have a few suggestions.
In general, I would suggest reviewing both the notebook and the rendered HTML page after you are done with editing.
Adding a blank line above the bullets worked.
Mostly fixed all the above comments.
Some minor issues. After these are fixed, I think it's ready to merge.
I think we probably don't want to keep this sentence, as it doesn't relate very well to either the context or Forte: `without any fine-tuning on a single data.`
Rather than that, I think we don't need to backtick HuggingFace and zero-shot learning, as they are already known names. By backticking package names, I mean, suppose
…to tutorial_text_classification
Modified according to the above feedback.
This issue is part of #691
This PR fixes #734
Description of changes
This tutorial is an introduction to the Zero-Shot Learning technique for text classification.
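For readers skimming the PR, a rough sketch of the kind of pipeline the tutorial builds, combining Forte with its NLTK and HuggingFace wrappers; the wrapper import paths, config keys, and label set below are assumptions and may not match the notebook exactly:

```python
from forte.data.data_pack import DataPack
from forte.data.readers import StringReader
from forte.pipeline import Pipeline
from ft.onto.base_ontology import Sentence

# Assumed wrapper imports from the forte-wrappers project; names may differ.
from fortex.nltk import NLTKSentenceSegmenter
from fortex.huggingface import ZeroShotClassifier

pl = Pipeline[DataPack]()
pl.set_reader(StringReader())
pl.add(NLTKSentenceSegmenter())
pl.add(
    ZeroShotClassifier(),
    config={
        "candidate_labels": ["positive", "negative"],  # assumed label set
        "model_name": "valhalla/distilbart-mnli-12-1",
    },
)
pl.initialize()

pack = pl.process("The product arrived broken and the seller never answered.")
for sentence in pack.get(Sentence):
    # The classifier is assumed to store per-label scores on each Sentence.
    print(sentence.text, sentence.classification)
```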
Possible influences of this PR.
Removed text_classification_pipeline.ipynb, as it is replaced by the new tutorial.
Test Conducted
Describe what test cases are included for the PR.