This repository was archived by the owner on Jul 7, 2023. It is now read-only.

MRPC dev data is being used for training #1280

Closed
@ywkim

Description

I expected the dev dataset to be disjoint from the training dataset. However, every MRPC dev example is also included in the training dataset.

Environment information

OS: macOS 10.13.4

$ pip freeze | grep tensor
mesh-tensorflow==0.0.4
-e git+git@github.com:tensorflow/tensor2tensor.git@7de63449a98375011e2a8715482dfeea946e6de7#egg=tensor2tensor
tensorboard==1.12.0
tensorflow==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0

$ python -V
Python 3.6.4

For bugs: reproduction and error logs

import tensorflow as tf
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators.mrpc import MSRParaphraseCorpus

# Generate the MRPC data files and load the encoder for the "inputs" feature.
data_dir = "/tmp/t2t_mrpc"
mrpc = MSRParaphraseCorpus()
tf.gfile.MakeDirs(data_dir)
mrpc.generate_data(data_dir, "/tmp")
encoder = mrpc.feature_encoders(data_dir).get("inputs")

# Decode every example back to text and collect the unique inputs per split.
tfe = tf.contrib.eager
tfe.enable_eager_execution()
train_dataset = set(
    encoder.decode(example["inputs"])
    for example in tfe.Iterator(mrpc.dataset(problem.DatasetSplit.TRAIN, data_dir)))
eval_dataset = set(
    encoder.decode(example["inputs"])
    for example in tfe.Iterator(mrpc.dataset(problem.DatasetSplit.EVAL, data_dir)))

print("TRAIN Dataset: {}".format(len(train_dataset)))
print("EVAL Dataset: {}".format(len(eval_dataset)))
print("Duplication: {}".format(len(train_dataset & eval_dataset)))

Output:

TRAIN Dataset: 8152
EVAL Dataset: 816
Duplication: 816
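
The same check can be written without any TensorFlow dependency, which may be handy for testing other splits once the examples are decoded to strings. This is only a minimal sketch: `split_overlap` is a hypothetical helper name (not part of tensor2tensor), and the toy sentences stand in for the decoded MRPC inputs.

```python
def split_overlap(train_examples, eval_examples):
    """Return (num_train, num_eval, num_shared), counted over unique examples."""
    train_set = set(train_examples)
    eval_set = set(eval_examples)
    shared = train_set & eval_set
    return len(train_set), len(eval_set), len(shared)


# Toy data standing in for decoded MRPC inputs (hypothetical sentences).
train = ["a b", "c d", "e f", "g h"]
dev = ["e f", "g h"]

n_train, n_eval, n_shared = split_overlap(train, dev)
print("TRAIN Dataset: {}".format(n_train))
print("EVAL Dataset: {}".format(n_eval))
print("Duplication: {}".format(n_shared))
# Fraction of eval examples that also occur in train; 1.0 means the
# whole eval split is contained in the training split.
print("Leak fraction: {:.2f}".format(n_shared / n_eval))
```

A leak fraction of 1.0 corresponds to the situation reported above, where the duplication count equals the size of the EVAL set.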
