Using Google 1-billion benchmark data on PyTorch #644
Description
Hello,
I am new to NLP and I have some questions.
I downloaded the Google 1-billion benchmark dataset, and I am trying to use the dataset on PyTorch:
# Import packages
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torchtext.data import Field, BucketIterator, BPTTIterator, TabularDataset
from transformers import OpenAIGPTConfig, OpenAIGPTTokenizer, OpenAIGPTLMHeadModel
from transformers import AdamW, WarmupLinearSchedule
from scipy.spatial import distance
import spacy
import torchtext
from torchtext.data.utils import get_tokenizer
import tensorflow as tf
#import lineflow as lf
#import lineflow.datasets as lfds
import math
import random
import numpy as np
import pandas as pd
import time
# set hyperparameters for this experiment
bptt = 30
batch_size = 64
lr = 0.01 # learning rate
#criterion = nn.CrossEntropyLoss() # loss criterion
log_interval = 200
nlayer = 6
# define tokenizer
en = spacy.load('en')
def Sp_Tokenizer(text):
    return [tok.text for tok in en.tokenizer(text)]
# define the English text field
TEXT = Field(tokenize = Sp_Tokenizer,
             init_token = '<sos>',
             eos_token = '<eos>',
             unk_token = '<unk>',
             pad_token = '<pad>',
             tokenizer_language = 'en',
             lower = True)
# loading Google 1 Billion Benchmark dataset
billion_google = open('/Users/dev/billion_google', encoding='utf-8').read()
# split on newlines so each entry is one sentence (iterating the raw string would yield single characters)
billion_google_dict = {'English': billion_google.splitlines()}
# convert billion_google into a pandas dataframe
billion_google_df = pd.DataFrame(billion_google_dict, columns=["English"])
# remove very long sentences
billion_google_df['eng_len'] = billion_google_df['English'].str.count(' ')
billion_google_df = billion_google_df.query('eng_len < 1025')
# create train and test set
train_billion_google, test_billion_google = train_test_split(billion_google_df, test_size=0.2)
train_billion_google.to_csv("train_billion_google.csv", index=False)
test_billion_google.to_csv("test_billion_google.csv", index=False)
data_fields = [('English', TEXT)]
train_billion_google, test_billion_google = TabularDataset.splits(path='./',
                                                                  train='train_billion_google.csv',
                                                                  validation='test_billion_google.csv',
                                                                  format='csv',
                                                                  skip_header=True,  # the CSVs were written with a header row
                                                                  fields=data_fields)

I also want to make use of WikiText2, which is built into PyTorch:
train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)

...and I want my train_billion_google to have the same structure as my train_Wiki2. More specifically, I want train_billion_google to store a list of individual tokens under train_billion_google.examples[0].text.
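For reference, this is how I understand the structure of train_Wiki2 (from inspecting it, so please correct me if I am wrong):

# WikiText2 seems to hold the whole split as a single example whose .text is one long list of tokens
print(len(train_Wiki2.examples))            # 1
print(type(train_Wiki2.examples[0].text))   # <class 'list'>
print(train_Wiki2.examples[0].text[:10])    # first 10 tokens as plain strings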
How can I do this?
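One idea I was considering (I am not sure whether this is the intended way) is to write the cleaned sentences back out as a plain text file, one sentence per line, and load it with torchtext.datasets.LanguageModelingDataset, since that is the class WikiText2 itself extends. The file name train_billion_google.txt below is just something I made up for this sketch:

from torchtext.datasets import LanguageModelingDataset

# dump just the sentence column of the CSV I saved earlier to a plain text file
with open('train_billion_google.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(pd.read_csv('train_billion_google.csv')['English'].astype(str)))

# hoping this gives the same shape as WikiText2: one example,
# with all tokens in a single list under .text
train_billion_google_lm = LanguageModelingDataset('train_billion_google.txt', TEXT)
print(train_billion_google_lm.examples[0].text[:10])

Would that be the right approach, or is there a better way?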
Thank you,