
Tags: khwj/spark

v3.2.0: Automatically build docker images
v3.2.0-rc7: Preparing Spark release v3.2.0-rc7
v3.2.0-rc6: Preparing Spark release v3.2.0-rc6
v3.2.0-rc5: Preparing Spark release v3.2.0-rc5
v3.2.0-rc4: Preparing Spark release v3.2.0-rc4
v3.2.0-rc3: Preparing Spark release v3.2.0-rc3
v3.2.0-rc2: Preparing Spark release v3.2.0-rc2
v3.2.0-rc1: Preparing Spark release v3.2.0-rc1
v3.1.2-ci: Add GitHub workflow for building runnable distributions

v3.1.2-glue-1.10.0-SNAPSHOT
[SPARK-36089][SQL][DOCS] Update the SQL migration guide about encoding auto-detection of CSV files

### What changes were proposed in this pull request?
In this PR, I propose updating the SQL migration guide, in particular the section about migrating from Spark 2.4 to 3.0. The new item informs users about the following issue:

**What**: Spark does not correctly detect the encoding (charset) of CSV files that start with a BOM. Such files can be read only in `multiLine` mode, and only when the CSV option `encoding` matches the actual encoding of the files. For example, Spark cannot read UTF-16BE CSV files when `encoding` is set to UTF-8, which is the default. This is the case in the current ES ticket.
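To illustrate the mismatch described above without a Spark cluster, here is a minimal Python sketch: decoding UTF-16BE bytes as UTF-8 (the default charset) garbles the content, while decoding with the matching charset recovers it. The sample CSV payload is hypothetical.

```python
import codecs

# A tiny CSV payload encoded as UTF-16BE, preceded by its BOM,
# like the files described above (sample content is hypothetical).
raw = codecs.BOM_UTF16_BE + "id,name\n1,a".encode("utf-16-be")

# Decoding with the wrong (default) charset yields NUL bytes and
# replacement characters instead of readable CSV rows:
garbled = raw.decode("utf-8", errors="replace")
assert "\x00" in garbled

# Decoding with the matching charset recovers the original rows
# (the standard 'utf-16' codec consumes the BOM itself):
assert raw.decode("utf-16") == "id,name\n1,a"
```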

**Why**: In previous Spark versions, the encoding was not propagated to the underlying parsing library, which therefore tried to detect the file encoding automatically. This could succeed for encodings whose BOM is present at the beginning of the file. Starting from version 3.0, users can specify the file encoding via the CSV option `encoding`, which defaults to UTF-8. Spark propagates this default to the underlying library (uniVocity), and as a consequence encoding auto-detection is turned off.

**When**: Since Spark 3.0. In particular, the commit apache@2df34db introduced the issue.

**Workaround**: Re-enable uniVocity's encoding auto-detection by passing `null` as the value of the CSV option `encoding`. The recommended approach, however, is to set the `encoding` option explicitly.
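For context, a minimal Python sketch of what BOM-based encoding auto-detection looks like. The function below is illustrative only; it is not Spark's or uniVocity's actual API, and the sample bytes are hypothetical.

```python
import codecs

def detect_bom_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess a file's encoding from its byte-order mark (BOM), falling
    back to a default. This mimics the kind of auto-detection the
    underlying library performs when no explicit encoding is given."""
    # Order matters: the UTF-32 BOMs must be checked before the
    # UTF-16 BOMs that are prefixes of them.
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return default

# A UTF-16BE CSV file with a BOM, as in the scenario described above:
raw = codecs.BOM_UTF16_BE + "id,name\n1,a".encode("utf-16-be")
print(detect_bom_encoding(raw))  # utf-16-be
```

In Spark itself, the explicit fix corresponds to setting the two CSV options named above on the reader, e.g. `.option("encoding", "UTF-16BE").option("multiLine", True)`.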

### Why are the changes needed?
To improve the user experience with Spark SQL. This should help users migrating from Spark 2.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Should be checked by building the docs in GA/Jenkins.

Closes apache#33300 from MaxGekk/csv-encoding-migration-guide.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>