Shuffle examples in pipelines shown in the intro notebook.
This is good practice: it avoids always showing examples in the same order and breaks sequential correlations that may be present in the original dataset.

PiperOrigin-RevId: 340745971
lamblin committed Nov 5, 2020
1 parent 3512a82 commit c3f62a1
Showing 1 changed file with 9 additions and 4 deletions.
13 changes: 9 additions & 4 deletions Intro_to_Metadataset.ipynb
@@ -233,7 +233,8 @@
"- **use_bilevel_ontology_list**: This is a list of booleans indicating whether the corresponding dataset in `ALL_DATASETS` should use a bilevel ontology. Omniglot is set up with a two-level hierarchy: the alphabet (Latin, Inuktitut...) and the character (with 20 examples per character).\n",
"The flag means that each episode will contain classes from a single alphabet. \n",
"- **use_dag_ontology_list**: This is a list of booleans indicating whether the corresponding dataset in `ALL_DATASETS` should use a DAG ontology. The same idea applies to ImageNet, except it uses the hierarchical sampling procedure described in the article.\n",
"- **image_size**: All images from the various datasets are down- or upsampled to the same size. This flag controls the edge length of the square image."
"- **image_size**: All images from the various datasets are down- or upsampled to the same size. This flag controls the edge length of the square image.\n",
"- **shuffle_buffer_size**: Controls the amount of shuffling among examples from any given class."
]
},
{
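The `shuffle_buffer_size` flag has the usual `tf.data.Dataset.shuffle` semantics: examples flow through a fixed-size buffer, and each output element is drawn uniformly at random from the items currently buffered, so a larger buffer gives stronger shuffling. A minimal pure-Python sketch of this behavior (the function `buffered_shuffle` is illustrative, not part of the library):

```python
import random


def buffered_shuffle(stream, buffer_size, rng=None):
    """Yield items from `stream` in a randomized order using a fixed-size
    buffer: once the buffer is full, each output is drawn uniformly at
    random from the buffered items, then the buffer is refilled."""
    rng = rng or random.Random()
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain whatever is left in the buffer at end of stream.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


items = list(range(10))
# A buffer of size 1 leaves the order unchanged.
assert list(buffered_shuffle(iter(items), 1)) == items
# A buffer larger than the stream yields a full random permutation.
shuffled = list(buffered_shuffle(iter(items), 300, random.Random(0)))
assert sorted(shuffled) == items
```

This is why a small `shuffle_buffer_size` only breaks local ordering, while a buffer at least as large as a class's example count shuffles that class fully.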
@@ -259,7 +260,9 @@
" use_dag_ontology_list=use_dag_ontology_list,\n",
" use_bilevel_ontology_list=use_bilevel_ontology_list,\n",
" episode_descr_config=variable_ways_shots,\n",
" split=SPLIT, image_size=84)"
" split=SPLIT,\n",
" image_size=84,\n",
" shuffle_buffer_size=300)"
]
},
{
@@ -358,7 +361,8 @@
"- `ADD_DATASET_OFFSET` controls whether the class_ids returned by the iterator overlap across different datasets. A dataset-specific offset is added in order to make the returned ids unique.\n",
"- `make_multisource_batch_pipeline()` creates a `tf.data.Dataset` object that returns elements of the form (Batch, data source ID) where, similarly to the\n",
"episodic case, the data source ID is an integer Tensor that identifies which\n",
"dataset the given batch originates from."
"dataset the given batch originates from.\n",
"- `shuffle_buffer_size` controls the amount of shuffling done among examples from a given dataset, rather than within each class as in the episodic pipeline."
]
},
{
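One natural way to realize the dataset-specific offset described above is to give each data source an offset equal to the total number of classes in all preceding sources, so `offset[source_id] + local_class_id` is globally unique. The sketch below is illustrative only; the helper name and the per-dataset class counts are hypothetical, not taken from the library:

```python
def class_id_offsets(num_classes_per_dataset):
    """Illustrative: offset for each dataset = cumulative number of
    classes in all preceding datasets, making global ids unique."""
    offsets = []
    total = 0
    for n in num_classes_per_dataset:
        offsets.append(total)
        total += n
    return offsets


# Hypothetical example: three sources with 5, 3, and 4 classes.
offsets = class_id_offsets([5, 3, 4])
assert offsets == [0, 5, 8]
# Local class 2 of the third source maps to global id 8 + 2 = 10.
assert offsets[2] + 2 == 10
```

With the offset disabled, local class 2 of every source would share the same id, which is why overlapping ids are only acceptable when batches are not mixed across sources.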
@@ -387,7 +391,8 @@
"source": [
"dataset_batch = pipeline.make_multisource_batch_pipeline(\n",
" dataset_spec_list=all_dataset_specs, batch_size=BATCH_SIZE, split=SPLIT,\n",
" image_size=84, add_dataset_offset=ADD_DATASET_OFFSET)\n",
" image_size=84, add_dataset_offset=ADD_DATASET_OFFSET,\n",
" shuffle_buffer_size=1000)\n",
"\n",
"for idx, ((images, labels), source_id) in iterate_dataset(dataset_batch, 1):\n",
" print(images.shape, labels.shape)"
