
Commit 4e9b9f9

amogkam and angelinalg authored
[Docs] [Data] Fix broken references (ray-project#36232)
Fixes a bunch of broken Ray Data link references.

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
1 parent c7651c4 commit 4e9b9f9

18 files changed: +66 −39 lines changed

doc/source/data/batch_inference.rst

+2 −2

@@ -29,7 +29,7 @@ Using Ray Data for offline inference involves four basic steps:

  - **Step 1:** Load your data into a Ray Dataset. Ray Data supports many different data sources and formats. For more details, see :ref:`Loading Data <loading_data>`.
  - **Step 2:** Define a Python class to load the pre-trained model.
  - **Step 3:** Transform your dataset using the pre-trained model by calling :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`. For more details, see :ref:`Transforming Data <transforming-data>`.
- - **Step 4:** Get the final predictions by either iterating through the output or saving the results. For more details, see :ref:`Consuming data <consuming_data>`.
+ - **Step 4:** Get the final predictions by either iterating through the output or saving the results. For more details, see the :ref:`Iterating over data <iterating-over-data>` and :ref:`Saving data <saving-data>` user guides.

  For more in-depth examples for your use case, see :ref:`batch_inference_examples`_. For how to configure batch inference, see :ref:`batch_inference_configuration`_.

@@ -365,7 +365,7 @@ Increasing batch size results in faster execution because inference is a vectori

  Handling GPU out-of-memory failures
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If you run into CUDA out-of-memory issues, your batch size is likely too large. Decrease the batch size by following :ref:`these steps <_batch_inference_batch_size>`.
+ If you run into CUDA out-of-memory issues, your batch size is likely too large. Decrease the batch size by following :ref:`these steps <batch_inference_batch_size>`.

  If your batch size is already set to 1, then use either a smaller model or GPU devices with more memory.

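As background for the corrected step 4 links, a minimal sketch of the four-step batch inference pattern described above might look like the following. The Predictor class and its stand-in model are hypothetical, and the actor-pool argument assumes a Ray version whose map_batches accepts compute=ray.data.ActorPoolStrategy.

    import ray

    # Step 1: Load data into a Dataset (a toy in-memory dataset stands in for real files).
    ds = ray.data.range(32)

    # Step 2: Define a class that loads the (pretend) pre-trained model once per worker.
    class Predictor:
        def __init__(self):
            # A real pipeline would load model weights here.
            self.model = lambda arr: arr * 2

        # Step 3: Transform each batch (a dict of NumPy arrays by default) with the model.
        def __call__(self, batch):
            batch["prediction"] = self.model(batch["id"])
            return batch

    predictions = ds.map_batches(
        Predictor,
        batch_size=8,
        compute=ray.data.ActorPoolStrategy(size=2),  # long-lived actors hold the model
    )

    # Step 4: Consume the predictions by iterating over them, or save them instead.
    print(predictions.take(4))
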
doc/source/data/examples/batch_training.ipynb

+3 −1

@@ -453,6 +453,7 @@

  ]
  },
  {
+ "attachments": {},
  "cell_type": "markdown",
  "metadata": {},
  "source": [

@@ -726,13 +727,14 @@

  ]
  },
  {
+ "attachments": {},
  "cell_type": "markdown",
  "metadata": {},
  "source": [
  "Recall how we wrote a data transform `transform_batch` UDF? It was called with pattern:\n",
  "- `Dataset.map_batches(transform_batch, batch_format=\"pandas\")`\n",
  "\n",
- "Similarly, we can write a custom groupy-aggregate function `agg_func` which will run for each [Dataset *group-by*](data-groupbys) group in parallel. The usage pattern is:\n",
+ "Similarly, we can write a custom groupy-aggregate function `agg_func` which will run for each [Dataset *group-by*](transform_groupby) group in parallel. The usage pattern is:\n",
  "- `Dataset.groupby(column).map_groups(agg_func, batch_format=\"pandas\")`.\n",
  "\n",
  "In the cell below, we define our custom `agg_func`."

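The corrected cell points at the groupby transform docs; as a quick illustration of the Dataset.groupby(column).map_groups(agg_func, batch_format="pandas") pattern it mentions, a self-contained sketch follows (the store/sales columns are invented for the example).

    import pandas as pd
    import ray

    ds = ray.data.from_items(
        [{"store": "a", "sales": 1}, {"store": "a", "sales": 3}, {"store": "b", "sales": 5}]
    )

    # Custom groupby-aggregate UDF: each call receives one group as a pandas DataFrame.
    def agg_func(group: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(
            {"store": [group["store"].iloc[0]], "total_sales": [group["sales"].sum()]}
        )

    result = ds.groupby("store").map_groups(agg_func, batch_format="pandas")
    print(result.take_all())  # one aggregated row per store
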
doc/source/data/examples/custom-datasource.rst

+6 −6

@@ -7,7 +7,7 @@ Implementing a Custom Datasource

  .. note::

    This MongoDatasource guide below is for education only. For production use of MongoDB
-   in Ray Data, see :ref:`Creating Dataset from MongoDB <dataset_mongo_db>`.
+   in Ray Data, see :ref:`Creating Dataset from MongoDB <reading_mongodb>`.

  Ray Data supports multiple ways to :ref:`create a dataset <loading_data>`,
  allowing you to easily ingest data of common formats from popular sources. However, if the

@@ -101,7 +101,7 @@ First, let's handle a single MongoDB pipeline, which is the unit of execution in

  and then convert results into Arrow format. We use ``PyMongo`` and ``PyMongoArrow``
  to achieve this.

- .. literalinclude:: ./doc_code/custom_datasource.py
+ .. literalinclude:: ../doc_code/custom_datasource.py
    :language: python
    :start-after: __read_single_partition_start__
    :end-before: __read_single_partition_end__

@@ -121,7 +121,7 @@ a wrapper of ``_read_single_partition``.

  A list of :class:`~ray.data.ReadTask` objects are returned by ``get_read_tasks``, and these
  tasks are executed on remote workers. You can find more details about `Dataset read execution here <https://docs.ray.io/en/master/data/key-concepts.html#reading-data>`__.

- .. literalinclude:: ./doc_code/custom_datasource.py
+ .. literalinclude:: ../doc_code/custom_datasource.py
    :language: python
    :start-after: __mongo_datasource_reader_start__
    :end-before: __mongo_datasource_reader_end__

@@ -136,7 +136,7 @@ Write support

  Similar to read support, we start with handling a single block. Again
  the ``PyMongo`` and ``PyMongoArrow`` are used for MongoDB interactions.

- .. literalinclude:: ./doc_code/custom_datasource.py
+ .. literalinclude:: ../doc_code/custom_datasource.py
    :language: python
    :start-after: __write_single_block_start__
    :end-before: __write_single_block_end__

@@ -150,7 +150,7 @@ will later be used in the implementation of :meth:`~ray.data.Datasource.do_write

  In short, the below function spawns multiple :ref:`Ray remote tasks <ray-remote-functions>`
  and returns :ref:`their futures (object refs) <objects-in-ray>`.

- .. literalinclude:: ./doc_code/custom_datasource.py
+ .. literalinclude:: ../doc_code/custom_datasource.py
    :language: python
    :start-after: __write_multiple_blocks_start__
    :end-before: __write_multiple_blocks_end__

@@ -164,7 +164,7 @@ ready to implement :meth:`create_reader() <ray.data.Datasource.create_reader>`

  and :meth:`do_write() <ray.data.Datasource.do_write>`, and put together
  a ``MongoDatasource``.

- .. literalinclude:: ./doc_code/custom_datasource.py
+ .. literalinclude:: ../doc_code/custom_datasource.py
    :language: python
    :start-after: __mongo_datasource_start__
    :end-before: __mongo_datasource_end__

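For readers following the MongoDatasource walkthrough, here is a rough sketch of the single-partition read that the corrected literalinclude paths point to. It is not the doc_code file itself, only an illustration, and it assumes PyMongoArrow's aggregate_arrow_all helper is available.

    import pymongo
    from pymongoarrow.api import aggregate_arrow_all

    def _read_single_partition(uri, database, collection, pipeline):
        # One MongoDB aggregation pipeline is the unit of execution for one ReadTask.
        client = pymongo.MongoClient(uri)
        coll = client[database][collection]
        # Return the pipeline results as a single Arrow table (one Ray Data block).
        return [aggregate_arrow_all(coll, pipeline)]
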
doc/source/data/examples/nyc_taxi_basic_processing.ipynb

+2 −1

@@ -595,6 +595,7 @@

  ]
  },
  {
+ "attachments": {},
  "cell_type": "markdown",
  "id": "0d1e2106",
  "metadata": {},

@@ -604,7 +605,7 @@

  "Note that Ray Data' Parquet reader supports projection (column selection) and row filter pushdown, where we can push the above column selection and the row-based filter to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won't even be read from disk!\n",
  "\n",
  "The row-based filter is specified via\n",
- "[Arrow's dataset field expressions](https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression). See the {ref}`feature guide for reading Parquet data <dataset_supported_file_formats>` for more information."
+ "[Arrow's dataset field expressions](https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression). See the {ref}`Parquet row pruning tips <parquet_row_pruning>` for more information."
  ]
  },
  {

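The reference now points at the Parquet row pruning tips; a minimal example of the pushdown behavior that cell describes could look like this, reusing the iris sample file and column names from the performance-tips.rst snippet further down this diff.

    import pyarrow.dataset as pds
    import ray

    # Projection pushdown (columns=) and row filter pushdown (filter=) are applied at
    # read time, so unselected columns and pruned rows are never materialized.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
        filter=pds.field("sepal.length") > 5.0,
    )
    print(ds.count())
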
doc/source/data/examples/ocr_example.ipynb

+8 −1

@@ -21,6 +21,7 @@

  ]
  },
  {
+ "attachments": {},
  "cell_type": "markdown",
  "id": "2a344178",
  "metadata": {},

@@ -78,7 +79,7 @@

  "\n",
  "### Running the OCR software on the data\n",
  "\n",
- "We can now use the {meth}`ray.data.read_binary_files <ray.data.read_binary_files>` function to read all the images from S3. We set the `include_paths=True` option to create a dataset of the S3 paths and image contents. We then run the {meth}`ds.map <ray.data.Dataset.map>` function on this dataset to execute the actual OCR process on each file and convert the screen shots into text. This will create a tabular dataset with columns `path` and `text`, see also [](transforming_data).\n",
+ "We can now use the {meth}`ray.data.read_binary_files <ray.data.read_binary_files>` function to read all the images from S3. We set the `include_paths=True` option to create a dataset of the S3 paths and image contents. We then run the {meth}`ds.map <ray.data.Dataset.map>` function on this dataset to execute the actual OCR process on each file and convert the screen shots into text. This creates a tabular dataset with columns `path` and `text`.\n",
  "\n",
  "````{note}\n",
  "If you want to load the data from a private bucket, you have to run\n",

@@ -317,6 +318,12 @@

  "\n",
  "Contributions that extend the example in this direction with a PR are welcome!"
  ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "582546c8",
+ "metadata": {},
+ "source": []
  }
  ],
  "metadata": {

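For context, a compressed sketch of the read-then-map flow described in the edited cell is shown below; the bucket path is a placeholder and the OCR step is stubbed out rather than calling a real OCR library.

    import ray

    # include_paths=True adds a "path" column alongside the raw "bytes" of each file.
    ds = ray.data.read_binary_files(
        "s3://your-bucket/ocr-images/",  # placeholder, not the example's actual bucket
        include_paths=True,
    )

    def run_ocr(row):
        # A real pipeline would run an OCR library (e.g. pytesseract) on row["bytes"].
        return {"path": row["path"], "text": f"<{len(row['bytes'])} bytes scanned>"}

    # Produces a tabular dataset with columns "path" and "text", as the cell says.
    results = ds.map(run_ocr)
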
doc/source/data/faq.rst

+2 −2

@@ -80,10 +80,10 @@ What should I not use Ray Data for?

  Ray Data is not meant to be used for generic ETL pipelines (like Spark) or
  scalable data science (like Dask, Modin, or Mars). However, each of these frameworks
- are :ref:`runnable on Ray <data_integrations>`, and Datasets integrates tightly with
+ are runnable on Ray, and Datasets integrates tightly with
  these frameworks, allowing for efficient exchange of distributed data partitions often
  with zero-copy. Check out the
- :ref:`dataset creation feature guide <dataset_from_in_memory_data_distributed>` to learn
+ :ref:`dataset creation feature guide <loading_datasets_from_distributed_df>` to learn
  more about these integrations.

  Datasets is specifically targeting

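The retargeted link covers loading datasets from distributed DataFrame libraries; one such integration (Dask) might be exercised roughly as follows, assuming Dask is installed alongside Ray.

    import dask.dataframe as dd
    import pandas as pd
    import ray

    df = pd.DataFrame({"x": list(range(8)), "y": list(range(8))})
    ddf = dd.from_pandas(df, npartitions=2)

    # Exchange distributed partitions between Dask and Ray Data, often with zero copy.
    ds = ray.data.from_dask(ddf)
    print(ds.schema())
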
doc/source/data/getting-started.rst

+2 −2

@@ -45,7 +45,7 @@ To learn more about creating datasets, read

  Transform the dataset
  ------------------------

- Apply :ref:`user-defined functions <transform_datasets_writing_udfs>` (UDFs) to
+ Apply user-defined functions (UDFs) to
  transform datasets. Ray executes transformations in parallel for performance.

  .. testcode::

@@ -135,7 +135,7 @@ Pass datasets to Ray tasks or actors, and access records with methods like

  To learn more about consuming datasets, read
- :ref:`Consuming data <consuming_data>`.
+ :ref:`Iterating over Data <iterating-over-data>` and :ref:`Saving Data <saving-data>`.

  Save the dataset
  -------------------

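Because the corrected links split "consuming data" into iterating and saving, a small sketch of both consumption paths follows; the output directory is a placeholder.

    import ray

    ds = ray.data.range(8)

    # Iterate over the dataset in batches (dicts of NumPy arrays by default)...
    for batch in ds.iter_batches(batch_size=4):
        print(batch)

    # ...or save it to files; both are ways to consume a Dataset.
    ds.write_parquet("/tmp/ray_data_getting_started")  # placeholder output path
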
doc/source/data/inspecting-data.rst

+2 −2

@@ -86,7 +86,7 @@ a dictionary.

  For more information on working with rows, see
- :ref:`Transforming rows <transforming-rows>` and
+ :ref:`Transforming rows <transforming_rows>` and
  :ref:`Iterating over rows <iterating-over-rows>`.

  .. _inspecting-batches:

@@ -141,5 +141,5 @@ of the returned batch, set ``batch_format``.

  [2 rows x 5 columns]

  For more information on working with batches, see
- :ref:`Transforming batches <transforming-batches>` and
+ :ref:`Transforming batches <transforming_batches>` and
  :ref:`Iterating over batches <iterating-over-batches>`.

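As a reminder of the row/batch distinction these links draw, a tiny inspection snippet might look like this.

    import ray

    ds = ray.data.range(4)

    # Inspect individual rows...
    print(ds.take(2))  # -> [{'id': 0}, {'id': 1}]

    # ...or a whole batch, materialized here as a pandas DataFrame.
    print(ds.take_batch(2, batch_format="pandas"))
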
doc/source/data/iterating-over-data.rst

+2 −2

@@ -40,7 +40,7 @@ as a dictionary.

  For more information on working with rows, see
- :ref:`Transforming rows <transforming-rows>` and
+ :ref:`Transforming rows <transforming_rows>` and
  :ref:`Inspecting rows <inspecting-rows>`.

  .. _iterating-over-batches:

@@ -142,7 +142,7 @@ formats by calling one of the following methods:

  tf.Tensor([6.2 5.9], shape=(2,), dtype=float64) tf.Tensor([2 2], shape=(2,), dtype=int64)

  For more information on working with batches, see
- :ref:`Transforming batches <transforming-batches>` and
+ :ref:`Transforming batches <transforming_batches>` and
  :ref:`Inspecting batches <inspecting-batches>`.

  .. _iterating-over-batches-with-shuffling:

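Since this guide is about iteration, here is a short sketch of framework-native batch iteration with a local shuffle buffer, assuming PyTorch is installed.

    import ray

    ds = ray.data.range(8)

    # Yield batches as Torch tensors, lightly shuffled within a small local buffer.
    for batch in ds.iter_torch_batches(batch_size=4, local_shuffle_buffer_size=8):
        print(batch["id"])
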
doc/source/data/loading-data.rst

+6 −0

@@ -444,6 +444,8 @@ Ray Data interoperates with libraries like pandas, NumPy, and Arrow.

  schema={food: string, price: double}
  )

+ .. _loading_datasets_from_distributed_df:
+
  Loading data from distributed DataFrame libraries
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -633,6 +635,8 @@ Reading databases

  Ray Data reads from databases like MySQL, Postgres, and MongoDB.

+ .. _reading_sql:
+
  Reading SQL databases
  ~~~~~~~~~~~~~~~~~~~~~

@@ -828,6 +832,8 @@ Call :func:`~ray.data.read_sql` to read data from a database that provides a

  "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection
  )

+ .. _reading_mongodb:
+
  Reading MongoDB
  ~~~~~~~~~~~~~~~

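The new reading_sql label anchors the SQL section; the aggregated query visible in the hunk above could be exercised with a sketch like this, where the SQLite database and its movie table are assumed to exist.

    import sqlite3
    import ray

    def create_connection():
        # Any DB-API 2.0 connection factory works; sqlite3 keeps the sketch self-contained.
        return sqlite3.connect("example.db")  # assumed local database with a "movie" table

    ds = ray.data.read_sql(
        "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection
    )
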
doc/source/data/performance-tips.rst

+2 −0

@@ -108,6 +108,8 @@ avoid loading unnecessary data (projection pushdown).

  For example, use ``ray.data.read_parquet("example://iris.parquet", columns=["sepal.length", "variety"])`` to read
  just two of the five columns of Iris dataset.

+ .. _parquet_row_pruning:
+
  Parquet Row Pruning
  ~~~~~~~~~~~~~~~~~~~

doc/source/data/transforming-data.rst

+13 −3

@@ -1,4 +1,4 @@

- .. _transforming-data:
+ .. _transforming_data:

  =================
  Transforming Data

@@ -15,6 +15,8 @@ This guide shows you how to:

  * `Shuffle rows <#shuffling-rows>`_
  * `Repartition data <#repartitioning-data>`_

+ .. _transforming_rows:
+
  Transforming rows
  =================

@@ -71,6 +73,8 @@ If your transformation returns multiple rows for each input row, call

  [{'id': 0}, {'id': 0}, {'id': 1}, {'id': 1}, {'id': 2}, {'id': 2}]

+ .. _transforming_batches:
+
  Transforming batches
  ====================

@@ -108,6 +112,8 @@ uses tasks by default.

  .map_batches(increase_brightness)
  )

+ .. _transforming_data_actors:
+
  Transforming batches with actors
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -191,8 +197,10 @@ To transform batches with actors, complete these steps:

  ds.materialize()

- Configuring batch type
- ~~~~~~~~~~~~~~~~~~~~~~
+ .. _configure_batch_format:
+
+ Configuring batch format
+ ~~~~~~~~~~~~~~~~~~~~~~~~

  Ray Data represents batches as dicts of NumPy ndarrays or pandas DataFrames. By
  default, Ray Data represents batches as dicts of NumPy ndarrays.

@@ -248,6 +256,8 @@ program might run out of memory. If you encounter an out-of-memory error, decrea

  the default batch size is 4096. If you're using GPUs, you must specify an explicit
  batch size.

+ .. _transforming_groupby:
+
  Groupby and transforming groups
  ===============================

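One of the renamed sections is "Configuring batch format"; a brief sketch of what that option changes in a map_batches UDF is given below (the added column is arbitrary).

    import pandas as pd
    import ray

    ds = ray.data.range(4)

    # batch_format="pandas" hands the UDF pandas DataFrames instead of the default
    # dict-of-NumPy-ndarrays batches.
    def add_one(batch: pd.DataFrame) -> pd.DataFrame:
        batch["plus_one"] = batch["id"] + 1
        return batch

    print(ds.map_batches(add_one, batch_format="pandas").take_all())
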
doc/source/ray-air/computer-vision.rst

+2 −2

@@ -38,7 +38,7 @@ Reading image data

  :end-before: __read_images1_stop__
  :dedent:

- Then, apply a :ref:`user-defined function <transform_datasets_writing_udfs>` to
+ Then, apply a :ref:`user-defined function <transforming_data>` to
  encode the class names as integer targets.

  .. literalinclude:: ./doc_code/computer_vision.py

@@ -98,7 +98,7 @@ Reading image data

  :end-before: __read_tfrecords1_stop__
  :dedent:

- Then, apply a :ref:`user-defined function <transform_datasets_writing_udfs>` to
+ Then, apply a :ref:`user-defined function <transforming_data>` to
  decode the raw image bytes.

  .. literalinclude:: ./doc_code/computer_vision.py

doc/source/ray-core/patterns/pipelining.rst

+1 −1

@@ -7,7 +7,7 @@ you can use the `pipelining <https://en.wikipedia.org/wiki/Pipeline_(computing)>

  .. note::

  Pipelining is an important technique to improve the performance and is heavily used by Ray libraries.
- See :ref:`DatasetPipelines <pipelining_datasets>` as an example.
+ See :ref:`Ray Data <data>` as an example.

  .. figure:: ../images/pipelining.svg

0 commit comments
