-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: correctly use GDAL auto-decoding of shapefiles when encoding is set #384
Conversation
The only disadvantage I see from using the config option vs the dataset open option is that it can lead to inter-thread side effects. So if someone in one thread uses the A final alternative is to pass the |
Fiona used to use My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead. |
Thank you so much for your help. When I noticed garbled text in the feedback, I realized that I forgot to mention that the |
Unfortunately I'm finding that using I'm also finding that it is hard to use non-UTF-8 encodings with We can either disable the @jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API. |
So it turns out that - at least for shapefiles - the The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (e.g., the base layer underneath the SQL query layer). I'm not sure the degree to which this might be present for other drivers; I think most of those already return true when testing for |
Yep, that would be great.
Hmm... didn't considered this, also don't have any experience with using remote files. Not really relevant anymore because of the threadsafe solution above, but if the overhead would be in the driver detection, it can be solved by specifying in GDALOpenEx already that it is an "ESRI Shapefile", so the driver detection is avoided/minimized. If this would be a significant overhead it might be a good idea in general to offer the option to specify the driver(s) to take in consideration to open the file. |
Sounds like a bug in GDAL? The problem occurs for both sql dialects ('OGRSQL' and 'SQLITE')?
|
Not sure, it could also be a bug in how we are trying to use it on our end in this case. The GDAL Python bindings also return from osgeo import ogr
drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open("/tmp/test.shp", 0)
lyr = ds.GetLayerByIndex(0)
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# True
print(lyr.schema[0].name)
# 中文
lyr = ds.ExecuteSQL("select * from test where \"中文\" = '中文' ", None, "")
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# False
print(lyr.schema[0].name)
# 中文 The Python bindings are a bit hard to trace through, I'm still trying to find where it determines the schema / field name in this case to see if it is doing something different than we are here. |
pyogrio/_io.pyx
Outdated
encoding_b = encoding.encode("UTF-8") | ||
encoding_c = encoding_b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most likely there is a good reason, but I wonder... why isn't this moved inside the function as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wanted to keep override_threadlocal_config_option
so that it only worked with C types, in case we ever wanted to pass in C values equivalent to what you'd get back from the function. Otherwise, I think we'd want to have it encode both the key and value, even if the key passed in is a string defined in code (already a const char*
I think?) rather than runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even if the key passed in is a string defined in code (already a const char* I think?)
Hmm, not entirely sure that is correct, cython docs say those are str
(unicode) instances but yet show examples like this: cdef char* hello_world = 'hello world'
which implies that we at least get the conversion without extra overhead when defining string literals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear and informative comments and tests!
What I don't fully understand is how this works for non-Shapefiles. From the recent discussions/PRs I had thought that eventually this encoding
keyword actually only makes sense for shapefiles. But you document and test that you can use it (in theory) also for other formats. But since for other formats there is no such thing as SHAPE_ENCODING
, GDAL will still assume it is reading those files as UTF-8 and is returning UTF-8 values to us, right? How does that not give errors within GDAL?
Something else I am wondering: do we actually test a format that does not have OLCStringsAsUTF8
(or where we override GDAL like OSM and xlsx and GeoSJONseq)? i.e. a format where we actually end up using locale.getpreferredencoding()
CSV is an example for that?
CHANGES.md
Outdated
@@ -19,6 +18,9 @@ | |||
- Raise exception in `read_arrow` or `read_dataframe(..., use_arrow=True)` if | |||
a boolean column is detected due to error in GDAL reading boolean values (#335) | |||
this has been fixed in GDAL >= 3.8.3. | |||
- Properly handle decoding of ESRI Shapefiles with user-provided `encoding` | |||
option for `read`, `read_dataframe`, and `open_arrow`, and correctly encode | |||
Shapefile field names and text values to the user-provided `encoding` (#384). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this last sentence mean exactly? Is that for writing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was for writing. Will make more explicit.
By definition Arrow strings are UTF-8, using anything else would be considered as invalid Arrow data. The only solution one has it to store it as a column with variable size binary data type (and keep track of the actual encoding as a user). But that only works for column values, I don't think there is a workaround for column names. Given that you already solved this for Shapefiles, and that for other file formats this seems a dubious use case anyway(?), it's probably fine to not support this for reading with Arrow? |
Or GDAL just reads the data and passes them on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user provided encoding instead of UTF-8, this just happens to "work"? (but so eg looking at that file with |
This is my assumption, based on the testing here. GDAL and other readers of those other drivers (GPKG, FGB, etc) will show miscoded field names / values. So it isn't terribly useful - seems like it enables breaking other tools in the ecosystem - but we don't (yet) specifically prevent it. I don't know if other tools are able to validly write non-UTF-8 encodings to some of these, but it appears to be possible in practice by converting shapefiles to some of them without correctly accounting for alternative encodings (e.g., GDAL #7458). It would take a bit of research to determine the list, but it would in theory be possible to define a list of drivers that we allow to be non-UTF-8 on write (e.g., shapefile, maybe CSV, XLSX?), and raise an error on others. We can leave it a bit more open-ended for reading, in part to allow users to specifically force a recoding to cleanup data. But we have to restrict that to the non-Arrow API. |
My current general understanding is as follows. There are 2 general types of format:
To conclude, I think the "encoding" parameter is strictly speaking necessary for the first type of files (not sure if there are any others than CSV)... for the other formats everything could be controlled via open options, but due to the naming clash with the one for shapefile... |
BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through |
Indeed, but only for |
Ah, that's something we should probably add then. For this PR, shall we merge it? (it just needs some fixing merge conflicts) |
Am I right that if a user passes |
For read (non-Arrow API) of other drivers than shapefile with the For write of other drivers than shapefile (where we force UTF-8), we use the I'm not sure if we should equate support for If we know that a given driver must be in UTF-8, we could raise an Exception when attempting to write an alternative encoding. Are you saying we should raise a warning if OLCStringsAsUTF8 == True that is not "ESRI Shapefile" during read? i.e., they know the data source is in a different encoding regardless of GDAL's capabilities to convert to UTF-8. Since `encoding' is opt-in, it seems like that is a deliberate choice and the warning just nags them about that, right? Maybe I don't follow what you are getting at here. |
Yes, sorry, not sure why I wrote that, but it being ignored is indeed not at all the case... It is used, but I think the cases where it leads to something useful are very limited. Not 100% sure, but I personally think indeed that
I think this is the case... but as there are that many drivers, I'm not 100% sure, which is why I was thinking about a warning.
True I suppose about the nagging. I'm afraid that few users will really know what they are doing when specifying "encoding= Possibly/probably it is best to just "let it go" and solve any issues if they would surface... |
I've tried to address some of the discussion in the latest changes: For the Arrow API (open / read arrow) it will raise an exception if an alternative encoding is provided and driver is not shapefile We now specifically raise exception for shapefiles if both This does not disable writing to an alternative encoding if provided by the user (hopefully they know the right thing to do for the target driver); I think we can leave this as is until we see issues. |
Added a specific test to verify that |
Added more tests for the Arrow write API, slimmed down the comments and repeated code around encoding in |
Didn't notice it before (and haven't yet look in detail what might be causing it), but merging this caused a bunch failures when testing the wheels for ubuntu and older macos: https://github.com/geopandas/pyogrio/actions/runs/8848749467/job/24299655017 |
As described in #380, GDAL attempts to auto-detect the native encoding of a shapefile and will automatically use that to decode to UTF-8 before returning strings to us. This runs into conflicts where the user provides
encoding
because we then apply the user'sencoding
to UTF-8 text, which produces incorrect text.This sets
SHAPE_ENCODING=""
(GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open optionENCODING=""
also works, but is specific to Shapefiles, and we don't know the path resolves to a Shapefile until after opening it - which is too late to set the option.This now correctly handles
encoding="cp936"
for the shapefiles described in #380. I'm still trying to create some test cases here, but running into issues saving to the correct native encoding to prove that.