-You can access the analyzed logical plan of a structured query using [Dataset.explain](dataset-operators.md#explain) basic action (with `extended` flag enabled) or SQL's `EXPLAIN EXTENDED` SQL command.
+You can access the analyzed logical plan of a structured query using [Dataset.explain](dataset/index.md#explain) basic action (with `extended` flag enabled) or SQL's `EXPLAIN EXTENDED` SQL command.
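For illustration, a minimal sketch of both routes, assuming a `SparkSession` in scope as `spark` (as in `spark-shell`); the query is made up:

```scala
// Dataset.explain with the extended flag prints the parsed, analyzed and
// optimized logical plans together with the physical plan.
val q = spark.range(10).where("id > 5")
q.explain(extended = true)

// The SQL route prints the same set of plans.
spark.sql("EXPLAIN EXTENDED SELECT * FROM range(10) WHERE id > 5").show(truncate = false)
```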
docs/CacheManager.md (+6 -6)
@@ -12,7 +12,7 @@ spark.sharedState.cacheManager
## Dataset.cache and persist Operators

-A structured query (as [Dataset](Dataset.md)) can be [cached](#cacheQuery) and registered with `CacheManager` using [Dataset.cache](caching-and-persistence.md#cache) or [Dataset.persist](caching-and-persistence.md#persist) high-level operators.
+A structured query (as [Dataset](dataset/index.md)) can be [cached](#cacheQuery) and registered with `CacheManager` using [Dataset.cache](caching-and-persistence.md#cache) or [Dataset.persist](caching-and-persistence.md#persist) high-level operators.
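A minimal sketch of the two operators (assuming a `spark` session in scope):

```scala
import org.apache.spark.storage.StorageLevel

// cache() registers the query with CacheManager using the default
// MEMORY_AND_DISK storage level; persist() lets you pick the level explicitly.
val cached    = spark.range(100).cache()
val persisted = spark.range(100).persist(StorageLevel.MEMORY_ONLY)
```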
-* [Dataset.storageLevel](dataset-operators.md#storageLevel) action is used
+* [Dataset.storageLevel](dataset/index.md#storageLevel) action is used

* `CatalogImpl` is requested to [isCached](CatalogImpl.md#isCached)

* `CacheManager` is requested to [cacheQuery](#cacheQuery) and [useCachedData](#useCachedData)
@@ -116,7 +116,7 @@ uncacheQuery(
`uncacheQuery` is used when:

-* [Dataset.unpersist](dataset-operators.md#unpersist) basic action is used
+* [Dataset.unpersist](dataset/index.md#unpersist) basic action is used

* `DropTableCommand` and [TruncateTableCommand](logical-operators/TruncateTableCommand.md) logical commands are executed

* `CatalogImpl` is requested to [uncache](CatalogImpl.md#uncacheTable) and [refresh](CatalogImpl.md#refreshTable) a table or view, [dropTempView](CatalogImpl.md#dropTempView) and [dropGlobalTempView](CatalogImpl.md#dropGlobalTempView)
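The first trigger in the list can be sketched as follows (assuming a `spark` session in scope):

```scala
// Dataset.unpersist reaches CacheManager.uncacheQuery and drops the
// cached entry for the query.
val q = spark.range(100).cache()
q.count()      // materializes the cache
q.unpersist()  // removes the entry from the cachedData registry
```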
@@ -129,9 +129,9 @@ cacheQuery(
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit
```

-`cacheQuery` adds the [analyzed logical plan](Dataset.md#logicalPlan) of the input [Dataset](Dataset.md) to the [cachedData](#cachedData) internal registry of cached queries.
+`cacheQuery` adds the [analyzed logical plan](dataset/index.md#logicalPlan) of the input [Dataset](dataset/index.md) to the [cachedData](#cachedData) internal registry of cached queries.

-Internally, `cacheQuery` requests the `Dataset` for the [analyzed logical plan](Dataset.md#logicalPlan) and creates a [InMemoryRelation](logical-operators/InMemoryRelation.md) with the following:
+Internally, `cacheQuery` requests the `Dataset` for the [analyzed logical plan](dataset/index.md#logicalPlan) and creates an [InMemoryRelation](logical-operators/InMemoryRelation.md) with the following:
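A sketch of driving `cacheQuery` directly through the shared `CacheManager` (an internal API; `Dataset.cache` and `Dataset.persist` are the usual entry points). The query and table name are made up:

```scala
import org.apache.spark.storage.StorageLevel

val q = spark.range(10).selectExpr("id % 2 AS gid", "id")
spark.sharedState.cacheManager.cacheQuery(q, Some("q"), StorageLevel.MEMORY_AND_DISK)

// The analyzed logical plan of q is now registered in cachedData.
assert(spark.sharedState.cacheManager.lookupCachedData(q).isDefined)
```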
-When [created](#creating-instance), `DataFrameWriter` converts the [Dataset](#ds) to a [DataFrame](Dataset.md#toDF).
+When [created](#creating-instance), `DataFrameWriter` converts the [Dataset](#ds) to a [DataFrame](dataset/index.md#toDF).

## <span id="format"> Name of Data Source { #source }
@@ -55,7 +55,7 @@ insertInto(
  tableName: String): Unit
```

-`insertInto` requests the [DataFrame](#df) for the [SparkSession](Dataset.md#sparkSession).
+`insertInto` requests the [DataFrame](#df) for the [SparkSession](dataset/index.md#sparkSession).

`insertInto` tries to [look up the TableProvider](#lookupV2Provider) for the [data source](#source).
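A usage sketch (the table name is made up; `insertInto` expects the table to exist already and resolves columns by position):

```scala
spark.sql("CREATE TABLE IF NOT EXISTS demo_numbers (id BIGINT, doubled BIGINT) USING parquet")

spark.range(5)
  .selectExpr("id", "id * 2 AS doubled")
  .write
  .insertInto("demo_numbers")
```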
@@ -106,7 +106,7 @@ saveAsTable(
  tableName: String): Unit
```

-`saveAsTable` requests the [DataFrame](#df) for the [SparkSession](Dataset.md#sparkSession).
+`saveAsTable` requests the [DataFrame](#df) for the [SparkSession](dataset/index.md#sparkSession).

`saveAsTable` tries to [look up the TableProvider](#lookupV2Provider) for the [data source](#source).
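Unlike `insertInto`, `saveAsTable` can create the table itself; a sketch with a made-up table name:

```scala
spark.range(5)
  .withColumnRenamed("id", "n")
  .write
  .mode("overwrite")
  .saveAsTable("demo_saved")
```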
@@ -174,7 +174,7 @@ Saves a `DataFrame` (the result of executing a structured query) to a data sourc
Internally, `save` uses `DataSource` to [look up the class of the requested data source](DataSource.md#lookupDataSource) (for the [source](#source) option and the [SQLConf](SessionState.md#conf)).

!!! note
-    `save` uses [SparkSession](Dataset.md#sparkSession) to access the [SessionState](SparkSession.md#sessionState) and in turn the [SQLConf](SessionState.md#conf).
+    `save` uses [SparkSession](dataset/index.md#sparkSession) to access the [SessionState](SparkSession.md#sessionState) and in turn the [SQLConf](SessionState.md#conf).
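A sketch of the `save` path; the format name is what gets resolved via `DataSource.lookupDataSource`, and the output path is made up:

```scala
spark.range(5)
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/demo_parquet")
```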
-`saveToV1Source` creates a [DataSource](DataSource.md#apply) (for the [source](#source) class name, the [partitioningColumns](#partitioningColumns) and the [extraOptions](#extraOptions)) and requests it for the [logical command for writing](DataSource.md#planForWriting) (with the [mode](#mode) and the [analyzed logical plan](Dataset.md#logicalPlan) of the structured query).
+`saveToV1Source` creates a [DataSource](DataSource.md#apply) (for the [source](#source) class name, the [partitioningColumns](#partitioningColumns) and the [extraOptions](#extraOptions)) and requests it for the [logical command for writing](DataSource.md#planForWriting) (with the [mode](#mode) and the [analyzed logical plan](dataset/index.md#logicalPlan) of the structured query).

!!! note
-    While requesting the [analyzed logical plan](Dataset.md#logicalPlan) of the structured query, `saveToV1Source` triggers execution of logical commands.
+    While requesting the [analyzed logical plan](dataset/index.md#logicalPlan) of the structured query, `saveToV1Source` triggers execution of logical commands.

In the end, `saveToV1Source` [runs the logical command for writing](#runCommand).
@@ -336,7 +336,7 @@ createTable(
`createTable` creates a [CatalogTable](CatalogTable.md) (with the [bucketSpec](CatalogTable.md#bucketSpec) per [getBucketSpec](#getBucketSpec)).

-In the end, `createTable` creates a [CreateTable](logical-operators/CreateTable.md) logical command (with the `CatalogTable`, [mode](#mode) and the [logical query plan](Dataset.md#planWithBarrier) of the [dataset](#df)) and [runs](#runCommand) it.
+In the end, `createTable` creates a [CreateTable](logical-operators/CreateTable.md) logical command (with the `CatalogTable`, [mode](#mode) and the [logical query plan](dataset/index.md#planWithBarrier) of the [dataset](#df)) and [runs](#runCommand) it.
docs/DataFrameWriterV2.md (+3 -3)
@@ -1,6 +1,6 @@
# DataFrameWriterV2

-`DataFrameWriterV2` is an API for Spark SQL developers to describe how to write a [Dataset](Dataset.md) to an external storage using the DataSource V2.
+`DataFrameWriterV2` is an API for Spark SQL developers to describe how to write a [Dataset](dataset/index.md) to an external storage using the DataSource V2.

`DataFrameWriterV2` is a [CreateTableWriter](CreateTableWriter.md) (and thus a [WriteConfigMethods](WriteConfigMethods.md)).
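A sketch of the API as reached through `Dataset.writeTo`; the table identifier is hypothetical and must resolve to a configured V2 catalog:

```scala
import org.apache.spark.sql.functions.col

spark.range(5)
  .selectExpr("id", "id % 2 AS bucket")
  .writeTo("catalog.db.numbers")
  .using("parquet")
  .partitionedBy(col("bucket"))
  .create()
```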
It is fair to say that `Dataset` is a Spark SQL developer-friendly layer over the following two low-level entities:
@@ -29,23 +25,6 @@ It is fair to say that `Dataset` is a Spark SQL developer-friendly layer over th
When created, `Dataset` requests [QueryExecution](#queryExecution) to [assert analyzed phase is successful](QueryExecution.md#assertAnalyzed).

-`Dataset` is created when:
-
-* [Dataset.apply](#apply) (for a [LogicalPlan](logical-operators/LogicalPlan.md) and a [SparkSession](SparkSession.md) with the [Encoder](Encoder.md) in a Scala implicit scope)
-
-* [Dataset.ofRows](#ofRows) (for a [LogicalPlan](logical-operators/LogicalPlan.md) and a [SparkSession](SparkSession.md))
-
-* [Dataset.toDF](dataset-untyped-transformations.md#toDF) untyped transformation is used
-
-* [Dataset.select](dataset-typed-transformations.md#select), [Dataset.randomSplit](dataset-typed-transformations.md#randomSplit) and [Dataset.mapPartitions](dataset-typed-transformations.md#mapPartitions) typed transformations are used
-
-* [KeyValueGroupedDataset.agg](KeyValueGroupedDataset.md#agg) operator is used (that requests `KeyValueGroupedDataset` to [aggUntyped](KeyValueGroupedDataset.md#aggUntyped))
-
-* [SparkSession.emptyDataset](SparkSession.md#emptyDataset) and [SparkSession.range](SparkSession.md#range) operators are used
-
-* `CatalogImpl` is requested to
-  [makeDataset](CatalogImpl.md#makeDataset) (when requested to [list databases](CatalogImpl.md#listDatabases), [tables](CatalogImpl.md#listTables), [functions](CatalogImpl.md#listFunctions) and [columns](CatalogImpl.md#listColumns))
-The <<dataset-operators.md#, Dataset API>> offers declarative and type-safe operators that makes for an improved experience for data processing (comparing to [DataFrames](DataFrame.md) that were a set of index- or column name-based [Row](Row.md)s).
+The <<dataset/index.md#, Dataset API>> offers declarative and type-safe operators that make for an improved experience for data processing (compared to [DataFrames](DataFrame.md) that were a set of index- or column name-based [Row](Row.md)s).

`Dataset` offers convenience of RDDs with the performance optimizations of DataFrames and the strong static type-safety of Scala. The last feature of bringing the strong type-safety to [DataFrame](DataFrame.md) makes Dataset so appealing. All the features together give you a more functional programming interface to work with structured data.
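A small sketch of the type safety referred to above (the case class and data are made up; assumes `spark.implicits._` as in `spark-shell`):

```scala
import spark.implicits._

final case class Person(name: String, age: Long)

val people = Seq(Person("Ann", 30), Person("Bo", 17)).toDS()
val adults = people.filter(_.age >= 18)   // field access checked at compile time
val names  = adults.map(_.name)           // Dataset[String], not a DataFrame of Rows
```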
@@ -504,13 +483,13 @@ A `Dataset` is <<Queryable, Queryable>> and `Serializable`, i.e. can be saved to
NOTE: SparkSession.md[SparkSession] and [QueryExecution](QueryExecution.md) are transient attributes of a `Dataset` and therefore do not participate in Dataset serialization. The only _firmly-tied_ feature of a `Dataset` is the [Encoder](Encoder.md).

-You can request the ["untyped" view](dataset-operators.md#toDF) of a Dataset or access the dataset-operators.md#rdd[RDD] that is generated after executing the query. It is supposed to give you a more pleasant experience while transitioning from the legacy RDD-based or DataFrame-based APIs you may have used in the earlier versions of Spark SQL or encourage migrating from Spark Core's RDD API to Spark SQL's Dataset API.
+You can request the ["untyped" view](dataset/index.md#toDF) of a Dataset or access the dataset/index.md#rdd[RDD] that is generated after executing the query. It is supposed to give you a more pleasant experience while transitioning from the legacy RDD-based or DataFrame-based APIs you may have used in the earlier versions of Spark SQL or encourage migrating from Spark Core's RDD API to Spark SQL's Dataset API.

The default storage level for `Datasets` is spark-rdd-caching.md[MEMORY_AND_DISK] because recomputing the in-memory columnar representation of the underlying table is expensive. You can however [persist a `Dataset`](caching-and-persistence.md#persist).

NOTE: Spark 2.0 has introduced a new query model called spark-structured-streaming.md[Structured Streaming] for continuous incremental execution of structured queries. That made possible to consider Datasets a static and bounded as well as streaming and unbounded data sets with a single unified API for different execution models.

-A `Dataset` is dataset-operators.md#isLocal[local] if it was created from local collections using SparkSession.md#emptyDataset[SparkSession.emptyDataset] or SparkSession.md#createDataset[SparkSession.createDataset] methods and their derivatives like <<toDF,toDF>>. If so, the queries on the Dataset can be optimized and run locally, i.e. without using Spark executors.
+A `Dataset` is dataset/index.md#isLocal[local] if it was created from local collections using SparkSession.md#emptyDataset[SparkSession.emptyDataset] or SparkSession.md#createDataset[SparkSession.createDataset] methods and their derivatives like <<toDF,toDF>>. If so, the queries on the Dataset can be optimized and run locally, i.e. without using Spark executors.

NOTE: `Dataset` makes sure that the underlying `QueryExecution` is [analyzed](QueryExecution.md#analyzed) and CheckAnalysis.md#checkAnalysis[checked].
@@ -531,7 +510,7 @@ Used when:
* `Dataset` is <<apply, created>> (for a logical plan in a given `SparkSession`)

-* dataset-operators.md#dataset-operators.md[Dataset.toLocalIterator] operator is used (to create a Java `Iterator` of objects of type `T`)
+* dataset/index.md#dataset/index.md[Dataset.toLocalIterator] operator is used (to create a Java `Iterator` of objects of type `T`)

* `Dataset` is requested to <<collectFromPlan, collect all rows from a spark plan>>
-NOTE: `collectFromPlan` is used for dataset-operators.md#head[Dataset.head], dataset-operators.md#collect[Dataset.collect] and dataset-operators.md#collectAsList[Dataset.collectAsList] operators.
+NOTE: `collectFromPlan` is used for dataset/index.md#head[Dataset.head], dataset/index.md#collect[Dataset.collect] and dataset/index.md#collectAsList[Dataset.collectAsList] operators.
@@ -752,7 +731,7 @@ Internally, `sortInternal` firstly builds ordering expressions for the given `so
In the end, `sortInternal` <<withTypedPlan, creates a Dataset>> with <<Sort.md#, Sort>> unary logical operator (with the ordering expressions, the given `global` flag, and the <<logicalPlan, logicalPlan>> as the <<Sort.md#child, child logical plan>>).

-NOTE: `sortInternal` is used for the <<dataset-operators.md#sort, sort>> and <<dataset-operators.md#sortWithinPartitions, sortWithinPartitions>> typed transformations in the Dataset API (with the only change of the `global` flag being enabled and disabled, respectively).
+NOTE: `sortInternal` is used for the <<dataset/index.md#sort, sort>> and <<dataset/index.md#sortWithinPartitions, sortWithinPartitions>> typed transformations in the Dataset API (with the only change of the `global` flag being enabled and disabled, respectively).
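Both operators land in `sortInternal` and differ only in the `global` flag; a small sketch (assuming a `spark` session in scope):

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(10).repartition(3)
val globalSort = df.sort(col("id").desc)                  // global = true
val localSort  = df.sortWithinPartitions(col("id").desc)  // global = false, per-partition only
```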
=== [[withPlan]] Helper Method for Untyped Transformations and Basic Actions -- `withPlan` Internal Method
NOTE: `withPlan` is annotated with Scala's https://www.scala-lang.org/api/current/scala/inline.html[@inline] annotation that requests the Scala compiler to try especially hard to inline it.

-`withPlan` is used in [untyped transformations](dataset-untyped-transformations.md)
+`withPlan` is used in [untyped transformations](dataset/untyped-transformations.md)
docs/Encoder.md (+1 -1)
@@ -11,7 +11,7 @@
`Encoder` is also called _"a container of serde expressions in Dataset"_.

-`Encoder` is a part of [Dataset](Dataset.md)s (to serialize and deserialize the records of this dataset).
+`Encoder` is a part of [Dataset](dataset/index.md)s (to serialize and deserialize the records of this dataset).

`Encoder` knows the [schema](#schema) of the records and that is how they offer significantly faster serialization and deserialization (comparing to the default Java or Kryo serializers).
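A sketch of obtaining an `Encoder` explicitly (they are usually derived implicitly via `spark.implicits._`); the case class is made up:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

final case class Order(id: Long, amount: Double)

val orderEncoder: Encoder[Order] = Encoders.product[Order]
println(orderEncoder.schema)   // the schema of the records this Encoder handles
```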