This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Using Google 1-billion benchmark data on PyTorch #644

@h56cho

Description


Hello,

I am new to NLP and I have some questions.

I downloaded the Google 1-billion benchmark dataset, and I am trying to use the dataset on PyTorch:

# Import packages 
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torchtext.data import Field, BucketIterator, TabularDataset
from transformers import OpenAIGPTConfig, OpenAIGPTTokenizer, OpenAIGPTLMHeadModel
from transformers import AdamW, WarmupLinearSchedule
from scipy.spatial import distance
import spacy
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.data import Field, BPTTIterator
import tensorflow as tf
#import lineflow as lf
#import lineflow.datasets as lfds
import math
import random
import numpy as np
import pandas as pd 
import time

# set hyperparameters for this experiment
bptt = 30
batch_size = 64
lr = 0.01 # learning rate
#criterion = nn.CrossEntropyLoss() # loss criterion
log_interval = 200
nlayer = 6

# define tokenizer
en = spacy.load('en')

def Sp_Tokenizer(text): 
    return [tok.text for tok in en.tokenizer(text)]

# define the English text field
TEXT = Field(tokenize = Sp_Tokenizer,
             init_token = '<sos>',
             eos_token = '<eos>',
             unk_token = '<unk>',
             pad_token = '<pad>',
             tokenizer_language = 'en',
             lower = True)

# load the Google 1 Billion Benchmark dataset (one sentence per line)
with open('/Users/dev/billion_google', encoding='utf-8') as f:
    billion_google = f.read()
billion_google_dict = {'English': billion_google.splitlines()}
# convert billion_google into a pandas dataframe
billion_google_df = pd.DataFrame(billion_google_dict, columns=["English"])

# drop very long sentences (length approximated by counting spaces)
billion_google_df['eng_len'] = billion_google_df['English'].str.count(' ')
billion_google_df = billion_google_df.query('eng_len < 1025')

# create train and test set 
train_billion_google, test_billion_google = train_test_split(billion_google_df, test_size=0.2)
# keep only the English column so the CSVs match data_fields below
train_billion_google[['English']].to_csv("train_billion_google.csv", index=False)
test_billion_google[['English']].to_csv("test_billion_google.csv", index=False)

data_fields = [('English', TEXT)]
train_billion_google, test_billion_google = TabularDataset.splits(path='./',
                                                                  train='train_billion_google.csv',
                                                                  validation='test_billion_google.csv',
                                                                  format='csv',
                                                                  skip_header=True,  # skip the header row written by to_csv
                                                                  fields=data_fields)
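With this setup, each row of the CSV becomes its own Example, so the tokens end up spread over many short examples (the attribute is named after the field defined above):

print(len(train_billion_google.examples))        # one Example per sentence
print(train_billion_google.examples[0].English)  # token list for the first sentence only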

I also want to make use of the WikiText2 dataset that is built into torchtext:

train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)
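In contrast, each WikiText2 split holds the entire corpus as a single Example whose .text attribute is one flat list of tokens:

print(len(train_Wiki2.examples))          # 1 -- the whole split is a single Example
print(train_Wiki2.examples[0].text[:10])  # first ten tokens of the corpus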

What I want is for my train_billion_google to have the same structure as my train_Wiki2; more specifically, train_billion_google should store the whole corpus as one list of individual tokens under train_billion_google.examples[0].text.

How can I do this?
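Is something along these lines the right direction? This is only a rough, untested sketch on my part; LanguageModelingDataset is the class that WikiText2 itself subclasses, and the .txt file name here is just a placeholder:

from torchtext.datasets import LanguageModelingDataset

# write the training sentences back out as plain text, one per line
with open('train_billion_google.txt', 'w', encoding='utf-8') as f:
    for ex in train_billion_google.examples:
        f.write(' '.join(ex.English) + '\n')

# load the file as one continuous token stream, the way WikiText2 is loaded;
# the result should expose a single example with a flat .text token list
train_billion_google_lm = LanguageModelingDataset('train_billion_google.txt',
                                                  TEXT, newline_eos=True)
print(train_billion_google_lm.examples[0].text[:10])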

Thank you,
