
Commit 5ccd1c0

[MINOR] Page renames (Dataset API) cntd.
1 parent 31107a4 commit 5ccd1c0


86 files changed (+238 -260 lines)


docs/Analyzer.md (+1 -1)

@@ -170,7 +170,7 @@ scala> :type spark.sessionState.analyzer
 org.apache.spark.sql.catalyst.analysis.Analyzer
 ```

-You can access the analyzed logical plan of a structured query using [Dataset.explain](dataset-operators.md#explain) basic action (with `extended` flag enabled) or SQL's `EXPLAIN EXTENDED` SQL command.
+You can access the analyzed logical plan of a structured query using [Dataset.explain](dataset/index.md#explain) basic action (with `extended` flag enabled) or SQL's `EXPLAIN EXTENDED` SQL command.

 ```text
 // sample structured query
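For reference, a minimal sketch of the two ways mentioned above to inspect the analyzed logical plan (assuming a running `SparkSession` available as `spark`):

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import spark.implicits._

val q = spark.range(5).filter($"id" % 2 === 0)

// Dataset.explain basic action with the extended flag enabled
q.explain(extended = true)

// SQL's EXPLAIN EXTENDED command over the same query expressed in SQL
spark.range(5).createOrReplaceTempView("nums")
spark.sql("EXPLAIN EXTENDED SELECT * FROM nums WHERE id % 2 = 0").show(truncate = false)
```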

docs/CacheManager.md (+6 -6)

@@ -12,7 +12,7 @@ spark.sharedState.cacheManager

 ## Dataset.cache and persist Operators

-A structured query (as [Dataset](Dataset.md)) can be [cached](#cacheQuery) and registered with `CacheManager` using [Dataset.cache](caching-and-persistence.md#cache) or [Dataset.persist](caching-and-persistence.md#persist) high-level operators.
+A structured query (as [Dataset](dataset/index.md)) can be [cached](#cacheQuery) and registered with `CacheManager` using [Dataset.cache](caching-and-persistence.md#cache) or [Dataset.persist](caching-and-persistence.md#persist) high-level operators.

 ## <span id="CachedData"> Cached Queries { #cachedData }

@@ -92,7 +92,7 @@ lookupCachedData(

 `lookupCachedData` is used when:

-* [Dataset.storageLevel](dataset-operators.md#storageLevel) action is used
+* [Dataset.storageLevel](dataset/index.md#storageLevel) action is used
 * `CatalogImpl` is requested to [isCached](CatalogImpl.md#isCached)
 * `CacheManager` is requested to [cacheQuery](#cacheQuery) and [useCachedData](#useCachedData)

@@ -116,7 +116,7 @@ uncacheQuery(

 `uncacheQuery` is used when:

-* [Dataset.unpersist](dataset-operators.md#unpersist) basic action is used
+* [Dataset.unpersist](dataset/index.md#unpersist) basic action is used
 * `DropTableCommand` and [TruncateTableCommand](logical-operators/TruncateTableCommand.md) logical commands are executed
 * `CatalogImpl` is requested to [uncache](CatalogImpl.md#uncacheTable) and [refresh](CatalogImpl.md#refreshTable) a table or view, [dropTempView](CatalogImpl.md#dropTempView) and [dropGlobalTempView](CatalogImpl.md#dropGlobalTempView)

@@ -129,9 +129,9 @@ cacheQuery(
   storageLevel: StorageLevel = MEMORY_AND_DISK): Unit
 ```

-`cacheQuery` adds the [analyzed logical plan](Dataset.md#logicalPlan) of the input [Dataset](Dataset.md) to the [cachedData](#cachedData) internal registry of cached queries.
+`cacheQuery` adds the [analyzed logical plan](dataset/index.md#logicalPlan) of the input [Dataset](dataset/index.md) to the [cachedData](#cachedData) internal registry of cached queries.

-Internally, `cacheQuery` requests the `Dataset` for the [analyzed logical plan](Dataset.md#logicalPlan) and creates a [InMemoryRelation](logical-operators/InMemoryRelation.md) with the following:
+Internally, `cacheQuery` requests the `Dataset` for the [analyzed logical plan](dataset/index.md#logicalPlan) and creates a [InMemoryRelation](logical-operators/InMemoryRelation.md) with the following:

 * [spark.sql.inMemoryColumnarStorage.compressed](configuration-properties.md#spark.sql.inMemoryColumnarStorage.compressed) configuration property
 * [spark.sql.inMemoryColumnarStorage.batchSize](configuration-properties.md#spark.sql.inMemoryColumnarStorage.batchSize) configuration property
@@ -152,7 +152,7 @@ Asked to cache already cached data.

 `cacheQuery` is used when:

-* [Dataset.persist](dataset-operators.md#persist) basic action is used
+* [Dataset.persist](dataset/index.md#persist) basic action is used
 * `CatalogImpl` is requested to [cache](CatalogImpl.md#cacheTable) and [refresh](CatalogImpl.md#refreshTable) a table or view in-memory

 ## Clearing Cache { #clearCache }
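As a quick illustration of the caching operators described above, a minimal sketch (assuming a `SparkSession` available as `spark`):

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import org.apache.spark.storage.StorageLevel

val nums = spark.range(10)

// cache() registers the analyzed plan with CacheManager (materialized lazily, on the first action)
nums.cache()
assert(nums.storageLevel == StorageLevel.MEMORY_AND_DISK)

// persist() with an explicit storage level
nums.unpersist()
nums.persist(StorageLevel.MEMORY_ONLY)

// unpersist() asks CacheManager to drop the cached entry
nums.unpersist(blocking = true)
```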

docs/CatalystSerde.md (+1 -1)

@@ -17,7 +17,7 @@ Internally, `deserialize` creates an `UnresolvedDeserializer` for the deserializ

 `deserialize` is used when:

-* `Dataset` is requested for a [QueryExecution](Dataset.md#rddQueryExecution)
+* `Dataset` is requested for a [QueryExecution](dataset/index.md#rddQueryExecution)
 * `ExpressionEncoder` is requested to [resolveAndBind](ExpressionEncoder.md#resolveAndBind)
 * `MapPartitions` utility is used to [apply](logical-operators/MapPartitions.md#apply)
 * `MapElements` utility is used to `apply`
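To see the serde planning in action, a minimal sketch (assuming a `SparkSession` available as `spark`) that makes the planner wrap a typed lambda in deserialize/serialize nodes:

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import spark.implicits._

// A typed transformation (map) deserializes rows to objects and serializes the
// results back; explain shows DeserializeToObject / SerializeFromObject in the plan
Seq(1, 2, 3).toDS().map(_ + 1).explain(extended = true)
```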

docs/Column.md (+1 -1)

@@ -335,6 +335,6 @@ named: NamedExpression

 `named` is used when the following operators are used:

-* [Dataset.select](dataset-operators.md#select)
+* [Dataset.select](dataset/index.md#select)
 * [KeyValueGroupedDataset.agg](KeyValueGroupedDataset.md#agg)
 -->
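For context, a minimal sketch of `Dataset.select` (assuming a `SparkSession` available as `spark`), where every `Column` argument is converted to a `NamedExpression` internally:

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Each Column passed to select is turned into a NamedExpression under the covers
df.select($"key", ($"value" * 2).as("doubled")).show()
```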

docs/DataFrameNaFunctions.md (+1 -1)

@@ -7,7 +7,7 @@ subtitle: Working With Missing Data

 `DataFrameNaFunctions` is used to work with missing data in a [DataFrame](DataFrame.md).

-`DataFrameNaFunctions` is available using [na](dataset-untyped-transformations.md#na) untyped transformation.
+`DataFrameNaFunctions` is available using [na](dataset/untyped-transformations.md#na) untyped transformation.

 ```scala
 val q: DataFrame = ...
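A minimal sketch of the `na` untyped transformation (assuming a `SparkSession` available as `spark`):

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import spark.implicits._

val people = Seq(("alice", Some(30)), ("bob", None)).toDF("name", "age")

people.na.drop().show()                 // drop rows with any null values
people.na.fill(Map("age" -> 0)).show()  // replace nulls in the age column
```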

docs/DataFrameStatFunctions.md (+1 -1)

@@ -6,7 +6,7 @@ title: DataFrameStatFunctions

 `DataFrameStatFunctions` API gives the statistic functions to be used in a structured query.

-`DataFrameStatFunctions` is available using [stat](dataset-untyped-transformations.md#stat) untyped transformation.
+`DataFrameStatFunctions` is available using [stat](dataset/untyped-transformations.md#stat) untyped transformation.

 ```scala
 val q: DataFrame = ...
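A minimal sketch of the `stat` untyped transformation (assuming a `SparkSession` available as `spark`):

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import spark.implicits._

val scores = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("score")

// approximate median with no allowed relative error
val Array(median) = scores.stat.approxQuantile("score", Array(0.5), 0.0)
println(s"median = $median")
```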

docs/DataFrameWriter.md (+9 -9)

@@ -8,13 +8,13 @@

 `DataFrameWriter` ends description of a write specification and does trigger a Spark job (unlike [DataFrameWriter](DataFrameWriter.md)).

-`DataFrameWriter` is available using [Dataset.write](Dataset.md#write) operator.
+`DataFrameWriter` is available using [Dataset.write](dataset/index.md#write) operator.

 ## Creating Instance

 `DataFrameWriter` takes the following to be created:

-* <span id="ds"> [Dataset](Dataset.md)
+* <span id="ds"> [Dataset](dataset/index.md)

 ### Demo

@@ -29,7 +29,7 @@ assert(writer.isInstanceOf[DataFrameWriter])

 ## DataFrame { #df }

-When [created](#creating-instance), `DataFrameWriter` converts the [Dataset](#ds) to a [DataFrame](Dataset.md#toDF).
+When [created](#creating-instance), `DataFrameWriter` converts the [Dataset](#ds) to a [DataFrame](dataset/index.md#toDF).

 ## <span id="format"> Name of Data Source { #source }

@@ -55,7 +55,7 @@ insertInto(
   tableName: String): Unit
 ```

-`insertInto` requests the [DataFrame](#df) for the [SparkSession](Dataset.md#sparkSession).
+`insertInto` requests the [DataFrame](#df) for the [SparkSession](dataset/index.md#sparkSession).

 `insertInto` tries to [look up the TableProvider](#lookupV2Provider) for the [data source](#source).

@@ -106,7 +106,7 @@ saveAsTable(
   tableName: String): Unit
 ```

-`saveAsTable` requests the [DataFrame](#df) for the [SparkSession](Dataset.md#sparkSession).
+`saveAsTable` requests the [DataFrame](#df) for the [SparkSession](dataset/index.md#sparkSession).

 `saveAsTable` tries to [look up the TableProvider](#lookupV2Provider) for the [data source](#source).

@@ -174,7 +174,7 @@ Saves a `DataFrame` (the result of executing a structured query) to a data sourc
 Internally, `save` uses `DataSource` to [look up the class of the requested data source](DataSource.md#lookupDataSource) (for the [source](#source) option and the [SQLConf](SessionState.md#conf)).

 !!! note
-    `save` uses [SparkSession](Dataset.md#sparkSession) to access the [SessionState](SparkSession.md#sessionState) and in turn the [SQLConf](SessionState.md#conf).
+    `save` uses [SparkSession](dataset/index.md#sparkSession) to access the [SessionState](SparkSession.md#sessionState) and in turn the [SQLConf](SessionState.md#conf).

 ```text
 val df: DataFrame = ???
@@ -279,10 +279,10 @@ partitioningAsV2: Seq[Transform]
 saveToV1Source(): Unit
 ```

-`saveToV1Source` creates a [DataSource](DataSource.md#apply) (for the [source](#source) class name, the [partitioningColumns](#partitioningColumns) and the [extraOptions](#extraOptions)) and requests it for the [logical command for writing](DataSource.md#planForWriting) (with the [mode](#mode) and the [analyzed logical plan](Dataset.md#logicalPlan) of the structured query).
+`saveToV1Source` creates a [DataSource](DataSource.md#apply) (for the [source](#source) class name, the [partitioningColumns](#partitioningColumns) and the [extraOptions](#extraOptions)) and requests it for the [logical command for writing](DataSource.md#planForWriting) (with the [mode](#mode) and the [analyzed logical plan](dataset/index.md#logicalPlan) of the structured query).

 !!! note
-    While requesting the [analyzed logical plan](Dataset.md#logicalPlan) of the structured query, `saveToV1Source` triggers execution of logical commands.
+    While requesting the [analyzed logical plan](dataset/index.md#logicalPlan) of the structured query, `saveToV1Source` triggers execution of logical commands.

 In the end, `saveToV1Source` [runs the logical command for writing](#runCommand).

@@ -336,7 +336,7 @@ createTable(

 `createTable` creates a [CatalogTable](CatalogTable.md) (with the [bucketSpec](CatalogTable.md#bucketSpec) per [getBucketSpec](#getBucketSpec)).

-In the end, `createTable` creates a [CreateTable](logical-operators/CreateTable.md) logical command (with the `CatalogTable`, [mode](#mode) and the [logical query plan](Dataset.md#planWithBarrier) of the [dataset](#df)) and [runs](#runCommand) it.
+In the end, `createTable` creates a [CreateTable](logical-operators/CreateTable.md) logical command (with the `CatalogTable`, [mode](#mode) and the [logical query plan](dataset/index.md#planWithBarrier) of the [dataset](#df)) and [runs](#runCommand) it.

 ---
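A minimal sketch of the `Dataset.write` entry point (assuming a `SparkSession` available as `spark` and a writable `/tmp` directory):

```scala
// Minimal sketch (assumes a SparkSession available as `spark` and a writable /tmp)
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Dataset.write gives a DataFrameWriter; save() triggers the actual write job
df.write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/dataframewriter-demo")
```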

docs/DataFrameWriterV2.md (+3 -3)

@@ -1,6 +1,6 @@
 # DataFrameWriterV2

-`DataFrameWriterV2` is an API for Spark SQL developers to describe how to write a [Dataset](Dataset.md) to an external storage using the DataSource V2.
+`DataFrameWriterV2` is an API for Spark SQL developers to describe how to write a [Dataset](dataset/index.md) to an external storage using the DataSource V2.

 `DataFrameWriterV2` is a [CreateTableWriter](CreateTableWriter.md) (and thus a [WriteConfigMethods](WriteConfigMethods.md)).

@@ -21,11 +21,11 @@ org.apache.spark.sql.DataFrameWriterV2[Long]
 `DataFrameWriterV2` takes the following to be created:

 * Name of the target table (_multi-part table identifier_)
-* [Dataset](Dataset.md)
+* [Dataset](dataset/index.md)

 `DataFrameWriterV2` is created when:

-* [Dataset.writeTo](Dataset.md#writeTo) operator is used
+* [Dataset.writeTo](dataset/index.md#writeTo) operator is used

 ## <span id="append"> append
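A minimal sketch of `Dataset.writeTo` (assuming a `SparkSession` named `spark` whose catalog supports DataSource V2 tables; the table name is illustrative):

```scala
// Minimal sketch (assumes a SparkSession `spark` with a DataSource V2-capable catalog)
val writerV2 = spark.range(5).writeTo("demo_table")

// CreateTableWriter methods describe the write; nothing runs until e.g. createOrReplace()
writerV2.using("parquet").createOrReplace()
```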

docs/Dataset.md (+10 -31)

@@ -1,14 +1,10 @@
----
-title: Dataset
----
-
 # Dataset

 `Dataset[T]` is a strongly-typed data structure that represents a structured query over rows of `T` type.

-`Dataset` is created using [SQL](sql/index.md) or [Dataset](dataset-operators.md) high-level declarative "languages".
+`Dataset` is created using [SQL](sql/index.md) or [Dataset](dataset/index.md) high-level declarative "languages".

-![Dataset's Internals](images/spark-sql-Dataset.png)
+![Dataset's Internals](images/Dataset.png)

 It is fair to say that `Dataset` is a Spark SQL developer-friendly layer over the following two low-level entities:

@@ -29,23 +25,6 @@ It is fair to say that `Dataset` is a Spark SQL developer-friendly layer over th

 When created, `Dataset` requests [QueryExecution](#queryExecution) to [assert analyzed phase is successful](QueryExecution.md#assertAnalyzed).

-`Dataset` is created when:
-
-* [Dataset.apply](#apply) (for a [LogicalPlan](logical-operators/LogicalPlan.md) and a [SparkSession](SparkSession.md) with the [Encoder](Encoder.md) in a Scala implicit scope)
-
-* [Dataset.ofRows](#ofRows) (for a [LogicalPlan](logical-operators/LogicalPlan.md) and a [SparkSession](SparkSession.md))
-
-* [Dataset.toDF](dataset-untyped-transformations.md#toDF) untyped transformation is used
-
-* [Dataset.select](dataset-typed-transformations.md#select), [Dataset.randomSplit](dataset-typed-transformations.md#randomSplit) and [Dataset.mapPartitions](dataset-typed-transformations.md#mapPartitions) typed transformations are used
-
-* [KeyValueGroupedDataset.agg](KeyValueGroupedDataset.md#agg) operator is used (that requests `KeyValueGroupedDataset` to [aggUntyped](KeyValueGroupedDataset.md#aggUntyped))
-
-* [SparkSession.emptyDataset](SparkSession.md#emptyDataset) and [SparkSession.range](SparkSession.md#range) operators are used
-
-* `CatalogImpl` is requested to
-[makeDataset](CatalogImpl.md#makeDataset) (when requested to [list databases](CatalogImpl.md#listDatabases), [tables](CatalogImpl.md#listTables), [functions](CatalogImpl.md#listFunctions) and [columns](CatalogImpl.md#listColumns))
-
 ## observe

 ```scala
@@ -452,7 +431,7 @@ dataset.filter('value % 2 === 0).count
 dataset.filter("value % 2 = 0").count
 ```

-The <<dataset-operators.md#, Dataset API>> offers declarative and type-safe operators that makes for an improved experience for data processing (comparing to [DataFrames](DataFrame.md) that were a set of index- or column name-based [Row](Row.md)s).
+The <<dataset/index.md#, Dataset API>> offers declarative and type-safe operators that makes for an improved experience for data processing (comparing to [DataFrames](DataFrame.md) that were a set of index- or column name-based [Row](Row.md)s).

 `Dataset` offers convenience of RDDs with the performance optimizations of DataFrames and the strong static type-safety of Scala. The last feature of bringing the strong type-safety to [DataFrame](DataFrame.md) makes Dataset so appealing. All the features together give you a more functional programming interface to work with structured data.

@@ -504,13 +483,13 @@ A `Dataset` is <<Queryable, Queryable>> and `Serializable`, i.e. can be saved to

 NOTE: SparkSession.md[SparkSession] and [QueryExecution](QueryExecution.md) are transient attributes of a `Dataset` and therefore do not participate in Dataset serialization. The only _firmly-tied_ feature of a `Dataset` is the [Encoder](Encoder.md).

-You can request the ["untyped" view](dataset-operators.md#toDF) of a Dataset or access the dataset-operators.md#rdd[RDD] that is generated after executing the query. It is supposed to give you a more pleasant experience while transitioning from the legacy RDD-based or DataFrame-based APIs you may have used in the earlier versions of Spark SQL or encourage migrating from Spark Core's RDD API to Spark SQL's Dataset API.
+You can request the ["untyped" view](dataset/index.md#toDF) of a Dataset or access the dataset/index.md#rdd[RDD] that is generated after executing the query. It is supposed to give you a more pleasant experience while transitioning from the legacy RDD-based or DataFrame-based APIs you may have used in the earlier versions of Spark SQL or encourage migrating from Spark Core's RDD API to Spark SQL's Dataset API.

 The default storage level for `Datasets` is spark-rdd-caching.md[MEMORY_AND_DISK] because recomputing the in-memory columnar representation of the underlying table is expensive. You can however [persist a `Dataset`](caching-and-persistence.md#persist).

 NOTE: Spark 2.0 has introduced a new query model called spark-structured-streaming.md[Structured Streaming] for continuous incremental execution of structured queries. That made possible to consider Datasets a static and bounded as well as streaming and unbounded data sets with a single unified API for different execution models.

-A `Dataset` is dataset-operators.md#isLocal[local] if it was created from local collections using SparkSession.md#emptyDataset[SparkSession.emptyDataset] or SparkSession.md#createDataset[SparkSession.createDataset] methods and their derivatives like <<toDF,toDF>>. If so, the queries on the Dataset can be optimized and run locally, i.e. without using Spark executors.
+A `Dataset` is dataset/index.md#isLocal[local] if it was created from local collections using SparkSession.md#emptyDataset[SparkSession.emptyDataset] or SparkSession.md#createDataset[SparkSession.createDataset] methods and their derivatives like <<toDF,toDF>>. If so, the queries on the Dataset can be optimized and run locally, i.e. without using Spark executors.

 NOTE: `Dataset` makes sure that the underlying `QueryExecution` is [analyzed](QueryExecution.md#analyzed) and CheckAnalysis.md#checkAnalysis[checked].

@@ -531,7 +510,7 @@ Used when:

 * `Dataset` is <<apply, created>> (for a logical plan in a given `SparkSession`)

-* dataset-operators.md#dataset-operators.md[Dataset.toLocalIterator] operator is used (to create a Java `Iterator` of objects of type `T`)
+* dataset/index.md#dataset/index.md[Dataset.toLocalIterator] operator is used (to create a Java `Iterator` of objects of type `T`)

 * `Dataset` is requested to <<collectFromPlan, collect all rows from a spark plan>>

@@ -712,7 +691,7 @@ collectFromPlan(plan: SparkPlan): Array[T]

 `collectFromPlan`...FIXME

-NOTE: `collectFromPlan` is used for dataset-operators.md#head[Dataset.head], dataset-operators.md#collect[Dataset.collect] and dataset-operators.md#collectAsList[Dataset.collectAsList] operators.
+NOTE: `collectFromPlan` is used for dataset/index.md#head[Dataset.head], dataset/index.md#collect[Dataset.collect] and dataset/index.md#collectAsList[Dataset.collectAsList] operators.

 === [[selectUntyped]] `selectUntyped` Internal Method

@@ -723,7 +702,7 @@ selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

 `selectUntyped`...FIXME

-NOTE: `selectUntyped` is used exclusively when <<dataset-typed-transformations.md#select, Dataset.select>> typed transformation is used.
+NOTE: `selectUntyped` is used exclusively when <<typed-transformations.md#select, Dataset.select>> typed transformation is used.

 === [[sortInternal]] `sortInternal` Internal Method

@@ -752,7 +731,7 @@ Internally, `sortInternal` firstly builds ordering expressions for the given `so

 In the end, `sortInternal` <<withTypedPlan, creates a Dataset>> with <<Sort.md#, Sort>> unary logical operator (with the ordering expressions, the given `global` flag, and the <<logicalPlan, logicalPlan>> as the <<Sort.md#child, child logical plan>>).

-NOTE: `sortInternal` is used for the <<dataset-operators.md#sort, sort>> and <<dataset-operators.md#sortWithinPartitions, sortWithinPartitions>> typed transformations in the Dataset API (with the only change of the `global` flag being enabled and disabled, respectively).
+NOTE: `sortInternal` is used for the <<dataset/index.md#sort, sort>> and <<dataset/index.md#sortWithinPartitions, sortWithinPartitions>> typed transformations in the Dataset API (with the only change of the `global` flag being enabled and disabled, respectively).

 === [[withPlan]] Helper Method for Untyped Transformations and Basic Actions -- `withPlan` Internal Method

@@ -765,7 +744,7 @@ withPlan(logicalPlan: LogicalPlan): DataFrame

 NOTE: `withPlan` is annotated with Scala's https://www.scala-lang.org/api/current/scala/inline.html[@inline] annotation that requests the Scala compiler to try especially hard to inline it.

-`withPlan` is used in [untyped transformations](dataset-untyped-transformations.md)
+`withPlan` is used in [untyped transformations](dataset/untyped-transformations.md)

 === [[i-want-more]] Further Reading and Watching
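A minimal sketch of the two declarative "languages" mentioned in the Dataset page, SQL and the Dataset API (assuming a `SparkSession` available as `spark`):

```scala
// Minimal sketch (assumes a SparkSession available as `spark`)
import spark.implicits._

// Dataset API: strongly-typed, declarative operators
val evens = spark.range(10).filter($"id" % 2 === 0)

// SQL: the same query as an untyped Dataset[Row] (a DataFrame)
spark.range(10).createOrReplaceTempView("ids")
val evensSql = spark.sql("SELECT * FROM ids WHERE id % 2 = 0")

// Back to a typed view with an Encoder in implicit scope
val typed = evensSql.as[Long]
```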

docs/Encoder.md (+1 -1)

@@ -11,7 +11,7 @@

 `Encoder` is also called _"a container of serde expressions in Dataset"_.

-`Encoder` is a part of [Dataset](Dataset.md)s (to serialize and deserialize the records of this dataset).
+`Encoder` is a part of [Dataset](dataset/index.md)s (to serialize and deserialize the records of this dataset).

 `Encoder` knows the [schema](#schema) of the records and that is how they offer significantly faster serialization and deserialization (comparing to the default Java or Kryo serializers).
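A minimal sketch of obtaining an `Encoder` and inspecting the schema it carries (assuming only spark-sql on the classpath):

```scala
// Minimal sketch (assumes spark-sql on the classpath)
import org.apache.spark.sql.Encoders

case class Person(name: String, age: Int)

// The Encoder knows the schema used to (de)serialize Person records
val enc = Encoders.product[Person]
println(enc.schema.treeString)
```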
