Add Flash Question Answering tutorial #125
Closed
karthikrangasai wants to merge 11 commits into Lightning-AI:main from karthikrangasai:add_flash_tutorial_question_answering
Changes from all commits

Commits (11):

eecd680 Add Flash Question Answering tutorial - Initial Commit. (karthikrangasai)
c95130b Update dataset link to get from kaggle in meta file. (karthikrangasai)
5ff366a Merge branch 'main' into add_flash_tutorial_question_answering (karthikrangasai)
4392f78 Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
712526b Merge branch 'main' into add_flash_tutorial_question_answering (karthikrangasai)
7a305f5 Merge branch 'add_flash_tutorial_question_answering' of https://githu… (karthikrangasai)
77cc890 Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
9a9b644 Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
f08cfd7 Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
f6b8b31 Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
5157d01 Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
21 changes: 21 additions & 0 deletions
21
flash_tutorials/dravidian_languages_question_answering/.meta.yml
@@ -0,0 +1,21 @@
title: Question Answering for Dravidian Languages
author: Karthik Rangasai Sivaraman (karthikrangasai@gmail.com)
created: 2021-12-15
updated: 2021-12-15
license: CC BY-SA
build: 3
tags:
  - Text
  - Question Answering
description: |
  This tutorial covers using Lightning Flash and its integration with Hugging Face Transformers to train a Transformer
  model (XLM-RoBERTa) on a SQuAD-style dataset for Dravidian languages. We show how easy it is to use a Hugging Face
  Transformers model with all the goodness provided by PyTorch Lightning, via Flash.
requirements:
  - lightning-flash[text]>=0.5.2
accelerator:
  - GPU
  - CPU
datasets:
  kaggle:
    - chaii-hindi-and-tamil-question-answering
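
For readers reproducing the tutorial: the chaii dataset listed under `datasets: kaggle:` above has to be downloaded first. A minimal sketch using the official `kaggle` Python package (assumes Kaggle API credentials are configured in `~/.kaggle/kaggle.json`; paths mirror the tutorial's defaults):

import zipfile

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
api.competition_download_files("chaii-hindi-and-tamil-question-answering", path="_datasets")
with zipfile.ZipFile("_datasets/chaii-hindi-and-tamil-question-answering.zip") as zf:
    zf.extractall("_datasets/chaii-hindi-and-tamil-question-answering")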
123 changes: 123 additions & 0 deletions
flash_tutorials/dravidian_languages_question_answering/multilingual_question_answering.py
@@ -0,0 +1,123 @@
# %% [markdown]
# In this tutorial we'll look at using [Lightning Flash](https://github.com/PyTorchLightning/lightning-flash) and its
# integration with [Hugging Face Transformers](https://github.com/huggingface/transformers) for question answering on a
# Dravidian-language corpus using [the XLM-RoBERTa model](https://arxiv.org/pdf/1911.02116.pdf).

# %%

import os

import pandas as pd
import torch
from flash import Trainer
from flash.text import QuestionAnsweringData, QuestionAnsweringTask

# %% [markdown]
# ## Loading the Data and generating splits
#
# To load the data, we start by splitting the provided training data into train and validation splits:

# %%
DATASET_PATH = os.environ.get("PATH_DATASETS", "_datasets")
CHAII_DATASET_PATH = os.path.join(DATASET_PATH, "chaii-hindi-and-tamil-question-answering")
INPUT_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "train.csv")
TRAIN_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "_train.csv")
VAL_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "_val.csv")
PREDICT_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "test.csv")

df = pd.read_csv(INPUT_DATA_PATH)
fraction = 0.9

# Sample 90% of each language separately so both languages appear in both splits.
tamil_examples = df[df["language"] == "tamil"]
train_split_tamil = tamil_examples.sample(frac=fraction, random_state=200)
val_split_tamil = tamil_examples.drop(train_split_tamil.index)

hindi_examples = df[df["language"] == "hindi"]
train_split_hindi = hindi_examples.sample(frac=fraction, random_state=200)
val_split_hindi = hindi_examples.drop(train_split_hindi.index)

train_split = pd.concat([train_split_tamil, train_split_hindi]).reset_index(drop=True)
val_split = pd.concat([val_split_tamil, val_split_hindi]).reset_index(drop=True)

train_split.to_csv(TRAIN_DATA_PATH, index=False)
val_split.to_csv(VAL_DATA_PATH, index=False)

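# %% [markdown]
# As a quick optional check (a sketch, not required for the rest of the tutorial), we can
# confirm that both languages are represented in each split at roughly the 90/10 ratio:

# %%
print(train_split["language"].value_counts())
print(val_split["language"].value_counts())
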
# %% [markdown]
# ## Creating the Flash DataModule
#
# Now, we can create a `QuestionAnsweringData` DataModule.
# Flash supports a wide variety of input formats, each with its own `from_xxxx` constructor.
# Our datasets are available as CSV files (the same format in which we saved the splits), so we use the
# `from_csv` method to build the DataModule. The simplest form of the API only requires the data files, the Hugging
# Face backbone of your choice, and the batch size. Flash takes care of preprocessing the data, i.e., tokenizing with
# the Hugging Face tokenizer and creating the datasets:

# %%

datamodule = QuestionAnsweringData.from_csv(
    train_file=TRAIN_DATA_PATH,
    val_file=VAL_DATA_PATH,
    batch_size=4,
    backbone="xlm-roberta-base",
)

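# %% [markdown]
# The same DataModule could be built from other formats; for instance, Flash also provides a
# `from_squad_v2` constructor for SQuAD 2.0-style JSON files. A commented-out sketch, with a
# hypothetical file path:

# %%
# datamodule = QuestionAnsweringData.from_squad_v2(
#     train_file="path/to/squad_v2_train.json",
#     batch_size=4,
#     backbone="xlm-roberta-base",
# )
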
# %% [markdown]
# ## Creating the Flash Task
#
# The API for building an NLP task is just as simple. All Flash models follow the `XYZTask` naming pattern, so
# we use the `QuestionAnsweringTask` here. The power of Flash's simplicity comes into play:
# we pass the required backbone, the optimizer of choice, and the preferred learning rate, and Flash
# takes care of the rest, i.e., downloading and instantiating the model, configuring the optimizer, and even
# logging the losses.

# %%
model = QuestionAnsweringTask(
    backbone="xlm-roberta-base",
    learning_rate=1e-5,
    optimizer="adamw",
)

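# %% [markdown]
# Since Flash loads the backbone through Hugging Face's auto classes, swapping in another
# checkpoint is a one-line change, e.g. the larger `xlm-roberta-large` (a commented-out
# sketch; the larger model needs considerably more memory and compute):

# %%
# model = QuestionAnsweringTask(backbone="xlm-roberta-large", learning_rate=1e-5, optimizer="adamw")
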
# %% [markdown]
# ## Setting up the Trainer and fine-tuning the model
#
# Flash's `Trainer` inherits from Lightning's `Trainer` and adds a `finetune` method, which takes an
# extra `strategy` argument for specifying how the backbone should be fine-tuned. We will use the
# `freeze_unfreeze` strategy: it freezes the backbone transformer containing the pre-trained weights and trains
# just the new model head for a given number of epochs, after which the backbone is unfrozen and the complete
# model (backbone + head) is trained for the remaining epochs.
#
# Check out the documentation to learn about the other strategies provided by Flash, and feel free to reach out and
# contribute any new fine-tuning methods to the project.

# %%
trainer = Trainer(
    max_epochs=5,
    accumulate_grad_batches=2,
    gpus=int(torch.cuda.is_available()),
)

trainer.finetune(model, datamodule, strategy=("freeze_unfreeze", 2))

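# %% [markdown]
# Other built-in strategies are passed the same way; for example, `no_freeze` trains the
# backbone and head together from the first epoch. A commented-out sketch, since running it
# would start another training run:

# %%
# trainer.finetune(model, datamodule, strategy="no_freeze")
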
# %% [markdown]
# ## Making predictions
#
# We load the provided test file into a pandas DataFrame, convert the columns the model needs to a Python
# dictionary, and pass that to the model to generate predictions.

# %%
predict_data = pd.read_csv(PREDICT_DATA_PATH)
predict_data = predict_data[predict_data.columns[:3]].to_dict(orient="list")  # keep the first three columns (id, context, question)

predictions = model.predict(predict_data)
print(predictions)

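# %% [markdown]
# Assuming the predictions come back in the same order as the rows of the test file, they could be
# written out as a Kaggle-style submission (a commented-out sketch; the `PredictionString` column
# name follows the chaii competition's submission format):

# %%
# submission = pd.DataFrame({"id": predict_data["id"], "PredictionString": predictions})
# submission.to_csv("submission.csv", index=False)
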
# %% [markdown]
# ## Closing thoughts and next steps!
#
# This tutorial has shown how Flash and Hugging Face Transformers can be used to train a state-of-the-art language
# model (such as XLM-RoBERTa) for question answering.
#
# If you want to be a bit more adventurous, you could look at
# [some of the other problems that can be solved with Lightning Flash](https://lightning-flash.readthedocs.io/en/stable/?badge=stable).