This repository was archived by the owner on Jul 7, 2023. It is now read-only.
MRPC dev data is being used for training #1280
Closed
Description
I expected the dev dataset to be disjoint from the training dataset. However, every MRPC dev example also appears in the training dataset.
Environment information
OS: macOS 10.13.4
$ pip freeze | grep tensor
mesh-tensorflow==0.0.4
-e git+git@github.com:tensorflow/tensor2tensor.git@7de63449a98375011e2a8715482dfeea946e6de7#egg=tensor2tensor
tensorboard==1.12.0
tensorflow==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0
$ python -V
Python 3.6.4
For bugs: reproduction and error logs
import tensorflow as tf
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators.mrpc import MSRParaphraseCorpus

data_dir = "/tmp/t2t_mrpc"
mrpc = MSRParaphraseCorpus()
tf.gfile.MakeDirs(data_dir)
# Generate the MRPC data files in data_dir, using /tmp as the temporary directory.
mrpc.generate_data(data_dir, "/tmp")
encoder = mrpc.feature_encoders(data_dir).get("inputs")

tfe = tf.contrib.eager
tfe.enable_eager_execution()

# Collect the decoded input text of every example in each split into a set.
train_dataset = set(
    encoder.decode(example["inputs"])
    for example in tfe.Iterator(mrpc.dataset(problem.DatasetSplit.TRAIN, data_dir)))
eval_dataset = set(
    encoder.decode(example["inputs"])
    for example in tfe.Iterator(mrpc.dataset(problem.DatasetSplit.EVAL, data_dir)))

print("TRAIN Dataset: {}".format(len(train_dataset)))
print("EVAL Dataset: {}".format(len(eval_dataset)))
# The set intersection counts the examples that appear in both splits.
print("Duplication: {}".format(len(train_dataset & eval_dataset)))
Output:
TRAIN Dataset: 8152
EVAL Dataset: 816
Duplication: 816
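The same overlap check can be reproduced without tensor2tensor once the decoded examples are available as plain strings. The following is only a minimal sketch: the helper name `split_overlap` and the toy sentences are hypothetical and are not taken from MRPC or from the library.

```python
from typing import Iterable, Set, Tuple


def split_overlap(train: Iterable[str], dev: Iterable[str]) -> Tuple[int, int, Set[str]]:
    """Return (unique train count, unique dev count, examples found in both)."""
    train_set = set(train)
    dev_set = set(dev)
    return len(train_set), len(dev_set), train_set & dev_set


# Hypothetical toy data standing in for decoded sentence pairs.
train_examples = ["a b", "c d", "e f"]
dev_examples = ["c d", "e f"]

n_train, n_dev, shared = split_overlap(train_examples, dev_examples)
print("TRAIN Dataset: {}".format(n_train))
print("EVAL Dataset: {}".format(n_dev))
print("Duplication: {}".format(len(shared)))
```

For properly separated splits the duplication count would be 0. In the output reported above it equals the size of the EVAL set (816), which is what indicates that the whole dev split is contained in the training split.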