BUG: fix encoding issues on windows for some formats #361

theroggy · 2024-02-23T18:35:39Z

I noticed in some geofileops tests that using pyogrio to write dataframes to ".csv" files and read them back gives encoding issues.

This PR fixes those.

Odd detail: I only saw this behaviour on the GitHub CI Windows systems; locally I couldn't reproduce it. Apparently:

- when running the tests on my local Windows machine, locale.getpreferredencoding() returns "UTF-8", even though running locale.getpreferredencoding() in a separate script returns the typical one for Windows: "cp1252".
- in the tests on the GitHub Windows machines, locale.getpreferredencoding() returns "cp1252".
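
To see which default applies in a given environment, and why it matters, here is a minimal sketch (not code from this PR; the sample text is an arbitrary assumption). It prints the locale's preferred encoding and shows that the same string encodes to different bytes under "cp1252" and "UTF-8", so data written with one and read with the other is either garbled or fails to decode.

```python
import locale

# The locale's preferred text encoding. This is the value that differed
# between the local runs and the GitHub CI runs described above.
preferred = locale.getpreferredencoding()
print(f"preferred encoding: {preferred}")

# Arbitrary sample text with a non-ASCII character (assumption for illustration).
text = "é"

as_cp1252 = text.encode("cp1252")  # a single byte
as_utf8 = text.encode("utf-8")     # two bytes
print(as_cp1252, as_utf8)

# Reading UTF-8 bytes as cp1252 silently produces mojibake...
mojibake = as_utf8.decode("cp1252")
print(mojibake)

# ...while reading cp1252 bytes as UTF-8 raises an error.
try:
    as_cp1252.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(f"decoding cp1252 bytes as UTF-8 failed: {decode_failed}")
```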

theroggy · 2024-02-24T08:02:53Z

XLSX gave UTF-8 decoding errors when reading a file written after the change. GDAL does say that XLSX needs UTF-8 (via the OLCStringsAsUTF8 capability), but this only worked for existing files, not for new files/layers being created.

After reporting this via OSGeo/gdal#9295 and some more debugging and searching, I found that this has already been fixed in GDAL in OSGeo/gdal#9301. So this will be fixed for GDAL >= 3.8.5.
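
A version gate for this could look like the sketch below. The helper name `xlsx_needs_utf8_workaround` is hypothetical and only the 3.8.5 threshold comes from the comment above; it assumes the GDAL version is available as a tuple of integers, such as the one pyogrio exposes as `pyogrio.__gdal_version__`.

```python
from typing import Tuple


def xlsx_needs_utf8_workaround(gdal_version: Tuple[int, ...]) -> bool:
    """Return True if this GDAL version predates the XLSX fix (hypothetical helper).

    The threshold (3.8.5) is taken from the discussion above; everything else
    in this function is an assumption made for illustration.
    """
    # Tuples compare element by element, so (3, 8, 4) < (3, 8, 5) < (3, 9, 0).
    return tuple(gdal_version) < (3, 8, 5)


print(xlsx_needs_utf8_workaround((3, 8, 4)))
print(xlsx_needs_utf8_workaround((3, 8, 5)))
```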

brendan-ward

Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when encoding is not provided by the user, but otherwise this looks reasonable.

theroggy · 2024-02-29T06:30:47Z

> Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when encoding is not provided by the user, but otherwise this looks reasonable.

I'm not sure about the best way forward either. I have now implemented it the same way as the behaviour in fiona and the default of GDAL, but e.g. just always using UTF-8 sounds reasonable to me as well.

Not sure why fiona and GDAL default to the 1990s encoding, but possibly some applications don't use the ".cpg" file properly when reading, leading to more risk of compatibility issues?

jorisvandenbossche · 2024-02-29T08:15:01Z

From an ESRI page (https://support.esri.com/en-us/knowledge-base/read-and-write-shapefile-and-dbase-files-encoded-in-var-000013192):

> Shapefiles can now be stored in UTF-8. However, shapefiles encoded in UTF-8 are only recognized in ArcMap, ArcCatalog and ArcGIS Pro.

Now, I don't know what this "only" means (i.e. what other ESRI products are not included in that list).

Anyway, given we were already using UTF-8 before and didn't yet get any complaints about that, I would personally keep that. That seems like a better default nowadays (while the default in fiona and GDAL probably stems from many years ago)

jorisvandenbossche · 2024-02-29T08:17:24Z

However, I would then maybe do that for all platforms, including Windows?

theroggy · 2024-02-29T16:48:36Z

> However, I would then maybe do that for all platforms, including Windows?

I agree. If we go for "UTF-8", I would also vote to do it for all platforms.

EDIT: interesting detail: on the same ESRI page, in the "Summary", they state this:

> The default code page in a shapefile (.DBF) is set to UTF-8 (UNICODE). This is the default for current internationalization practices.

brendan-ward · 2024-02-29T18:09:56Z

All right - let's go with UTF-8 as the default for shapefiles on all platforms and revisit (by setting to ISO-8859-1) if we get errors reported by users.
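
The rule agreed on here can be sketched as follows. This is not the logic in `pyogrio/_io.pyx`; the function name `pick_write_encoding` and the constant `UTF8_DRIVERS` are hypothetical, and the driver list is an assumption based only on the formats mentioned in this PR (Shapefile, XLSX, GeoJSONSeq). An encoding passed by the user always wins, the listed drivers default to UTF-8 on every platform, and anything else falls back to the locale's preferred encoding.

```python
import locale
from typing import Optional

# Drivers assumed, for this sketch only, to default to UTF-8 on all platforms.
UTF8_DRIVERS = {"ESRI Shapefile", "XLSX", "GeoJSONSeq"}


def pick_write_encoding(driver: str, user_encoding: Optional[str] = None) -> str:
    """Choose the encoding used to write string data (hypothetical helper)."""
    # An explicit encoding from the user always takes precedence.
    if user_encoding is not None:
        return user_encoding
    # Platform-independent default for the drivers listed above.
    if driver in UTF8_DRIVERS:
        return "UTF-8"
    # Otherwise fall back to the platform default, e.g. "cp1252" on many
    # Windows systems.
    return locale.getpreferredencoding()


print(pick_write_encoding("ESRI Shapefile"))
print(pick_write_encoding("ESRI Shapefile", user_encoding="ISO-8859-1"))
print(pick_write_encoding("CSV"))
```

Keeping the user-supplied value first means the ISO-8859-1 fallback mentioned above stays available per call without changing the default.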

theroggy · 2024-03-01T15:54:03Z

UTF-8 is now the default for Shapefile on all platforms...

brendan-ward

Thanks for the updates!

Can you please add a test that Shapefile is always written as UTF-8 by default (since it wasn't necessarily set that way on Windows before) unless an encoding is passed by the user?
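
One way such a test could check the result is to inspect the ".cpg" sidecar file next to the written ".shp". The sketch below is only an illustration: the helper name `declared_shapefile_encoding` is hypothetical, and it assumes the writer emits a ".cpg" file containing the encoding name. It is not the test that was added in this PR.

```python
from pathlib import Path
from typing import Optional


def declared_shapefile_encoding(shp_path) -> Optional[str]:
    """Return the encoding named in the .cpg sidecar, or None if there is none.

    Hypothetical helper: assumes the .cpg file holds just the encoding name.
    """
    cpg_path = Path(shp_path).with_suffix(".cpg")
    if not cpg_path.exists():
        return None
    # The sidecar is a tiny text file; strip the trailing newline if present.
    return cpg_path.read_text(encoding="ascii").strip()
```

In an actual test, the shapefile would first be written with pyogrio without passing an encoding, followed by an assertion such as `declared_shapefile_encoding(path) == "UTF-8"`.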

brendan-ward

Thanks @theroggy ! Apologies for the very slow final review!

Planning to merge once CI is green so that we can have this in place before changes needed for #380

ENH: improve handling of encoding on windows

37f5964

theroggy marked this pull request as draft February 23, 2024 18:35

theroggy added 12 commits February 23, 2024 19:55

Add test to check encoding used to write csv

e6d4169

Fix writing string data in correct encoding

a5b6891

Encoding detection when writing based on driver

18e65f7

encode string values to the detected encoding

b6b3f6a

Rollback unneeded change

698ad27

replace filecmp with manual check

28b30d9

encoding can only be determined after creating the output layer

b55a4cd

Fix xlsx encoding

c22d6ad

Always encode field names in UTF8

fcfdd31

Add GeoJSONSeq to be utf8 for old gdal versions

3ced6e7

Update CHANGES.md

4cc390d

Update CHANGES.md

4f6fe84

theroggy changed the title ~~ENH: improve handling of encoding on windows~~ BUF: fix encoding issues on windows for some formats. Feb 24, 2024

theroggy changed the title ~~BUF: fix encoding issues on windows for some formats.~~ BUF: fix encoding issues on windows for some formats Feb 24, 2024

theroggy changed the title ~~BUF: fix encoding issues on windows for some formats~~ BUG: fix encoding issues on windows for some formats Feb 24, 2024

theroggy marked this pull request as ready for review February 24, 2024 01:42

theroggy mentioned this pull request Feb 24, 2024

XLSX: Add OLCStringsAsUTF8 capability OSGeo/gdal#9295

Closed

Try disabling the XLSX utf8 hardcoding use as it should be redundant

70c71c1

theroggy marked this pull request as draft February 24, 2024 12:06

theroggy added 4 commits February 24, 2024 14:54

Add logging regarding locale.setpreferredencoding

4814c00

Move encoding detection to where the layer has already been further defined

9d8680d

Move detection to after where transation is started

d0fcd0b

Rollback changes to debug XLSX issue in gdal

b0911b9

theroggy marked this pull request as ready for review February 24, 2024 20:23

theroggy marked this pull request as draft February 24, 2024 20:44

theroggy added 2 commits February 24, 2024 21:47

Add column name with special characters in csv tests

2e7f138

Update test_geopandas_io.py

9723475

jorisvandenbossche reviewed Feb 26, 2024

pyogrio/_io.pyx

pyogrio/tests/test_geopandas_io.py

pyogrio/_io.pyx

jorisvandenbossche reviewed Feb 26, 2024

pyogrio/_io.pyx

Specify different encodings in csv tests

78a30b4

theroggy mentioned this pull request Feb 26, 2024

TST: avoid/reduce unimportant warnings in tests #363

Merged

brendan-ward added this to the 0.8.0 milestone Feb 28, 2024

brendan-ward reviewed Feb 29, 2024

pyogrio/_io.pyx

pyogrio/_io.pyx

theroggy added 2 commits March 1, 2024 15:35

Merge remote-tracking branch 'upstream/main' into ENH-improve-handling-of-encoding-on-windows

85cbbe8

Set default encoding for shapefile to UTF-8 for all platforms.

e7141c5

brendan-ward reviewed Mar 5, 2024

CHANGES.md (outdated)

pyogrio/_io.pyx (outdated)

pyogrio/_io.pyx (outdated)

theroggy added 8 commits March 5, 2024 08:40

Merge remote-tracking branch 'upstream/main' into ENH-improve-handling-of-encoding-on-windows

5c3004a

Improve changelog entry for PR 366 (arrow metadata)

b3d31d8

Add UTF-8 shapefiles to changelog

ea061e0

Centralize getprefferedencoding + improve inline doc

9255602

Add check that shp is written in UTF-8 by default on all platforms

f836bd2

Simplify detect_encoding again + improve doc

350fe0d

Add encoding test for shp writing

ba9b808

Add more documentation

1cc34d5

theroggy requested a review from brendan-ward March 5, 2024 21:59

theroggy and others added 2 commits March 5, 2024 23:30

Update _io.pyx

23ec6f7

Merge branch 'main' into ENH-improve-handling-of-encoding-on-windows

c6a9625

brendan-ward approved these changes Apr 4, 2024

brendan-ward merged commit 04a71e9 into geopandas:main Apr 4, 2024
19 checks passed

theroggy deleted the ENH-improve-handling-of-encoding-on-windows branch April 4, 2024 16:22

brendan-ward mentioned this pull request Apr 10, 2024

Write Arrow Table/RecordBatchReader to GDAL #346

Merged
