Add Flash Question Answering tutorial #125
          
Status: Closed

karthikrangasai wants to merge 11 commits into Lightning-AI:main from karthikrangasai:add_flash_tutorial_question_answering
Changes from all commits (11 commits):

- eecd680: Add Flash Question Answering tutorial - Initial Commit. (karthikrangasai)
- c95130b: Update dataset link to get from kaggle in meta file. (karthikrangasai)
- 5ff366a: Merge branch 'main' into add_flash_tutorial_question_answering (karthikrangasai)
- 4392f78: Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
- 712526b: Merge branch 'main' into add_flash_tutorial_question_answering (karthikrangasai)
- 7a305f5: Merge branch 'add_flash_tutorial_question_answering' of https://githu… (karthikrangasai)
- 77cc890: Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
- 9a9b644: Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
- f08cfd7: Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
- f6b8b31: Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
- 5157d01: Merge branch 'main' into add_flash_tutorial_question_answering (Borda)
      
          
flash_tutorials/dravidian_languages_question_answering/.meta.yml (21 additions, 0 deletions)
```yaml
title: Question Answering for Dravidian Languages
author: Karthik Rangasai Sivaraman (karthikrangasai@gmail.com)
created: 2021-12-15
updated: 2021-12-15
license: CC BY-SA
build: 3
tags:
  - Text
  - Question Answering
description: |
  This tutorial covers using Lightning Flash and its integration with Hugging Face Transformers to train a Transformer
  model (XLM-RoBERTa) on a SQuAD-style dataset for Dravidian languages. We show how easy it is to use a Hugging Face
  Transformers model with all the goodness provided by PyTorch Lightning, using Flash.
requirements:
  - lightning-flash[text]>=0.5.2
accelerator:
  - GPU
  - CPU
datasets:
  kaggle:
    - chaii-hindi-and-tamil-question-answering
```
        
          
  
    
      
          
flash_tutorials/dravidian_languages_question_answering/multilingual_question_answering.py (123 additions, 0 deletions)
```python
# %% [markdown]
# In this tutorial we'll look at using [Lightning Flash](https://github.com/PyTorchLightning/lightning-flash) and its
# integration with [Hugging Face Transformers](https://github.com/huggingface/transformers) for question answering on a
# Dravidian-language corpus using [the XLM-RoBERTa model](https://arxiv.org/pdf/1911.02116.pdf).

# %%
import os

import pandas as pd
import torch
from flash import Trainer
from flash.text import QuestionAnsweringData, QuestionAnsweringTask

# %% [markdown]
# ## Loading the Data and generating splits
#
# To load the data, we start by creating the train, validation, and test splits:

# %%
DATASET_PATH = os.environ.get("PATH_DATASETS", "_datasets")
CHAII_DATASET_PATH = os.path.join(DATASET_PATH, "chaii-hindi-and-tamil-question-answering")
INPUT_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "train.csv")
TRAIN_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "_train.csv")
VAL_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "_val.csv")
PREDICT_DATA_PATH = os.path.join(CHAII_DATASET_PATH, "test.csv")

df = pd.read_csv(INPUT_DATA_PATH)
fraction = 0.9

tamil_examples = df[df["language"] == "tamil"]
train_split_tamil = tamil_examples.sample(frac=fraction, random_state=200)
val_split_tamil = tamil_examples.drop(train_split_tamil.index)

hindi_examples = df[df["language"] == "hindi"]
train_split_hindi = hindi_examples.sample(frac=fraction, random_state=200)
val_split_hindi = hindi_examples.drop(train_split_hindi.index)

train_split = pd.concat([train_split_tamil, train_split_hindi]).reset_index(drop=True)
val_split = pd.concat([val_split_tamil, val_split_hindi]).reset_index(drop=True)

train_split.to_csv(TRAIN_DATA_PATH, index=False)
val_split.to_csv(VAL_DATA_PATH, index=False)
```
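Note (not part of the diff): before building the DataModule it can help to sanity-check the splits. The sketch below only uses objects defined above; the column names it mentions (`id`, `context`, `question`, `answer_text`, `answer_start`, `language`) are an assumption based on the chaii Kaggle dataset, not something the tutorial asserts.

```python
# Optional sanity check (sketch): split sizes, carried-over columns, and language balance.
print(f"train examples: {len(train_split)}, validation examples: {len(val_split)}")
print(train_split.columns.tolist())  # assumed: id, context, question, answer_text, answer_start, language
print(train_split["language"].value_counts())
```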
```python
# %% [markdown]
# ## Creating the Flash DataModule
#
# Now, we can create a `QuestionAnsweringData`.
# Flash supports a wide variety of input formats, each with its own constructor following the `from_xxxx` naming
# pattern. Our datasets are available as CSV files, the same format in which we saved the splits, so we use the
# `from_csv` method to generate the DataModule. The simplest form of the API only requires the data files, the Hugging
# Face backbone of your choice, and the batch size. Flash takes care of preprocessing the data, i.e., tokenizing it
# with the Hugging Face tokenizer and creating the datasets.
#
# Here's the full DataModule creation:

# %%
datamodule = QuestionAnsweringData.from_csv(
    train_file=TRAIN_DATA_PATH,
    val_file=VAL_DATA_PATH,
    batch_size=4,
    backbone="xlm-roberta-base",
)
```
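Note (not part of the diff): to make the "Flash tokenizes with the Hugging Face tokenizer" point concrete, here is a rough sketch of that step done manually with `transformers`. The question and context strings are made up, and Flash's real preprocessing also handles answer spans and long-document striding, which are not shown here.

```python
from transformers import AutoTokenizer

# Roughly the tokenization Flash performs under the hood for each (question, context) pair (sketch only).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoded = tokenizer(
    "What is the capital of Tamil Nadu?",          # hypothetical question
    "Chennai is the capital city of Tamil Nadu.",  # hypothetical context
    truncation=True,
    max_length=384,
    return_offsets_mapping=True,
)
print(list(encoded.keys()))  # e.g. input_ids, attention_mask, offset_mapping
```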
```python
# %% [markdown]
# ## Creating the Flash Task
#
# The API for building the NLP task is just as simple. All Flash models follow the `XYZTask` naming pattern, so we use
# `QuestionAnsweringTask` in this case. Flash's simplicity comes into play here: we pass the required backbone, the
# optimizer of choice, and the preferred learning rate, and Flash takes care of the rest, i.e., downloading the model,
# instantiating it, configuring the optimizer, and even logging the losses.

# %%
model = QuestionAnsweringTask(
    backbone="xlm-roberta-base",
    learning_rate=1e-5,
    optimizer="adamw",
)
```
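Note (not part of the diff): for intuition about what `QuestionAnsweringTask` wraps, the sketch below loads a comparable raw Hugging Face model directly. It assumes the Flash backbone corresponds to a standard `AutoModelForQuestionAnswering` head on top of XLM-RoBERTa; with Flash you never need to do this yourself.

```python
from transformers import AutoModelForQuestionAnswering

# Illustration (sketch): the kind of model Flash downloads and instantiates for the "xlm-roberta-base" backbone.
hf_model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")
print(f"{sum(p.numel() for p in hf_model.parameters()) / 1e6:.1f}M parameters")
```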
```python
# %% [markdown]
# ## Setting up the Trainer and Fine-Tuning the model
#
# Flash's Trainer inherits from Lightning's Trainer and provides an additional method, `finetune`, which takes an
# extra argument, `strategy`, that lets us choose how the backbone is fine-tuned. We use the `freeze_unfreeze`
# strategy: it freezes the backbone transformer containing the pre-trained weights and trains only the new model head
# for a given number of epochs, then unfreezes the backbone so that the complete model (backbone + head) is trained
# for the remaining epochs.
#
# Check out the documentation to learn about the other strategies provided by Flash, and feel free to reach out and
# contribute any new fine-tuning methods to the project.

# %%
trainer = Trainer(
    max_epochs=5,
    accumulate_grad_batches=2,
    gpus=int(torch.cuda.is_available()),
)

trainer.finetune(model, datamodule, strategy=("freeze_unfreeze", 2))
```
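Note (not part of the diff): two follow-ups that may be useful here, shown as a sketch. Other strategy names such as `"freeze"` and `"no_freeze"` are assumed to be available in this lightning-flash version, and the checkpoint filename below is made up.

```python
# Alternative fine-tuning strategies are passed the same way (assumed available in this Flash version):
# trainer.finetune(model, datamodule, strategy="freeze")     # train only the new head
# trainer.finetune(model, datamodule, strategy="no_freeze")  # train backbone and head from the start

# Persist the fine-tuned weights with the standard Lightning checkpoint API (hypothetical filename).
trainer.save_checkpoint("multilingual_question_answering_xlmr.ckpt")
```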
```python
# %% [markdown]
# ## Making predictions
#
# We load the provided prediction file into a pandas DataFrame, convert it to a Python dictionary, and pass it to the
# model to generate predictions.

# %%
predict_data = pd.read_csv(PREDICT_DATA_PATH)
predict_data = predict_data[predict_data.columns[:3]].to_dict(orient="list")

predictions = model.predict(predict_data)
print(predictions)
```
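Note (not part of the diff): a natural next step for the chaii competition is to pair each predicted answer with its example id and write a submission-style CSV. The sketch assumes `predictions` is a flat list aligned one-to-one with the rows of `test.csv`, that `id` is among the three columns kept above, and that `PredictionString` is the expected submission column; adjust to the actual output format of your Flash version.

```python
# Sketch only: pair predictions with example ids and save a submission-style CSV.
submission = pd.DataFrame(
    {
        "id": predict_data["id"],                           # assumes "id" survived the column slice above
        "PredictionString": [str(p) for p in predictions],  # assumes one prediction per test row
    }
)
submission.to_csv(os.path.join(CHAII_DATASET_PATH, "submission.csv"), index=False)
print(submission.head())
```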
```python
# %% [markdown]
# ## Closing thoughts and next steps!
#
# This tutorial has shown how Flash and Hugging Face Transformers can be used to train a state-of-the-art language
# model (such as XLM-RoBERTa).
#
# If you want to be a bit more adventurous, you could look at
# [some of the other problems that can be solved with Lightning Flash](https://lightning-flash.readthedocs.io/en/stable/?badge=stable).
```