This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Using Google 1-billion benchmark data on PyTorch #644

@h56cho

Description


Hello,

I am new to NLP and I have some questions.

I downloaded the Google 1-billion benchmark dataset, and I am trying to use the dataset on PyTorch:

# Import packages 
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torchtext.data import Field, BucketIterator, TabularDataset
from transformers import OpenAIGPTConfig, OpenAIGPTTokenizer, OpenAIGPTLMHeadModel
from transformers import AdamW, WarmupLinearSchedule
from scipy.spatial import distance
import spacy
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.data import Field, BPTTIterator
import tensorflow as tf
#import lineflow as lf
#import lineflow.datasets as lfds
import math
import random
import numpy as np
import pandas as pd 
import time

# set hyperparameters for this experiment
bptt = 30
batch_size = 64
lr = 0.01 # learning rate
#criterion = nn.CrossEntropyLoss() # loss criterion
log_interval = 200
nlayer = 6

# define tokenizer
en = spacy.load('en')

def Sp_Tokenizer(text): 
    return [tok.text for tok in en.tokenizer(text)]

# define the English text field
TEXT = Field(tokenize = Sp_Tokenizer,
             init_token = '<sos>',
             eos_token = '<eos>',
             unk_token = '<unk>',
             pad_token = '<pad>',
             tokenizer_language = 'en',
             lower = True)

# load the Google 1 Billion Benchmark dataset (one sentence per line)
with open('/Users/dev/billion_google', encoding='utf-8') as f:
    billion_google = f.read()
billion_google_dict = {'English': billion_google.splitlines()}
# convert billion_google into a pandas dataframe
billion_google_df = pd.DataFrame(billion_google_dict, columns=["English"])

# drop very long sentences (length approximated by counting spaces)
billion_google_df['eng_len'] = billion_google_df['English'].str.count(' ')
billion_google_df = billion_google_df.query('eng_len < 1025')

# create train and test set 
train_billion_google, test_billion_google = train_test_split(billion_google_df, test_size=0.2)
# keep only the English column so the CSVs match data_fields below
train_billion_google[['English']].to_csv("train_billion_google.csv", index=False)
test_billion_google[['English']].to_csv("test_billion_google.csv", index=False)

data_fields = [('English', TEXT)]
train_billion_google, test_billion_google = TabularDataset.splits(path='./',
                                                                  train='train_billion_google.csv',
                                                                  validation='test_billion_google.csv',
                                                                  format='csv',
                                                                  skip_header=True,  # skip the header row written by to_csv
                                                                  fields=data_fields)
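With this setup, each row of the CSV becomes its own Example, so the tokens end up spread over many short examples (the attribute is named after the field defined above):

print(len(train_billion_google.examples))        # one Example per sentence
print(train_billion_google.examples[0].English)  # token list for the first sentence only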

I also want to make use of the WikiText2 dataset that is built into torchtext:

train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)
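In contrast, each WikiText2 split holds the entire corpus as a single Example whose .text attribute is one flat list of tokens:

print(len(train_Wiki2.examples))          # 1 -- the whole split is a single Example
print(train_Wiki2.examples[0].text[:10])  # first ten tokens of the corpus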

What I want is for my train_billion_google to have the same structure as my train_Wiki2; more specifically, train_billion_google should store the whole corpus as one list of individual tokens under train_billion_google.examples[0].text.

How can I do this?
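Is something along these lines the right direction? This is only a rough, untested sketch on my part; LanguageModelingDataset is the class that WikiText2 itself subclasses, and the .txt file name here is just a placeholder:

from torchtext.datasets import LanguageModelingDataset

# write the training sentences back out as plain text, one per line
with open('train_billion_google.txt', 'w', encoding='utf-8') as f:
    for ex in train_billion_google.examples:
        f.write(' '.join(ex.English) + '\n')

# load the file as one continuous token stream, the way WikiText2 is loaded;
# the result should expose a single example with a flat .text token list
train_billion_google_lm = LanguageModelingDataset('train_billion_google.txt',
                                                  TEXT, newline_eos=True)
print(train_billion_google_lm.examples[0].text[:10])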

Thank you,
