-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix encoding issues on windows for some formats #361
BUG: fix encoding issues on windows for some formats #361
Conversation
XLSX gave UTF-8 decoding errors when reading a file written after the change. GDAL does say that XLSX needs UTF-8 (via the OLCStringsAsUTF8 capability), but this only worked for existing files, not for new files/layers being created. After reporting this via OSGeo/gdal#9295 and some more debugging and searching, this has already been fixed in GDAL in OSGeo/gdal#9301. So, for GDAL >= 3.8.5 this will be fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Still a bit unsure of the correct behavior for shapefiles when encoding
is not provided by the user, but otherwise this looks reasonable.
I'm not sure about the best way forward either. I now implemented it the same as the behaviour in fiona and the default of gdal, but e.g. just always using UTF-8 sound reasonable to me as well. Not sure why fiona and gdal default to the 1990's encoding, but possibly some applications e.g. don't use the ".cpg" properly when reading leading to more risk of compatibility issues? |
From an ESRI page (https://support.esri.com/en-us/knowledge-base/read-and-write-shapefile-and-dbase-files-encoded-in-var-000013192):
Now, I don't know what this "only" means (i.e. what other ESRI products are not included in that list). Anyway, given we were already using UTF-8 before and didn't yet get any complaints about that, I would personally keep that. That seems like a better default nowadays (while the default in fiona and GDAL probably stems from many years ago) |
However, I would then maybe do that for all platforms, including Windows? |
I agree. If we would go for "UTF-8", I would also vote to do it for all platforms. EDIT: interesting detail: on the same ESRI page, in the "Summary", they state this:
|
All right - let's go with |
UTF-8 is now the default for Shapefile on all platforms... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
Can you please add a test that Shapefile is always written to UTF-8
by default (since it wasn't necessarily set that way on Windows before) unless encoding is passed by user.
…g-of-encoding-on-windows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed in some geofileops tests that using pyogrio to write/read dataframes to ".csv" files gives encoding issues.
This PR fixes those.
Odd detail: I only saw this behaviour on the github CI windows systems: locally I couldn't reproduce. Apparently:
locale.getprefferedencoding()
returns "UTF-8", even though when I runlocale.getprefferedencoding()
in a seperate script it returns the typical one for windows: "cp1252".locale.getprefferedencoding()
returns "cp1252".