
FIX: correctly use GDAL auto-decoding of shapefiles when encoding is set #384

Merged
jorisvandenbossche merged 25 commits into main from issue380 on Apr 26, 2024

Conversation

brendan-ward
Member

As described in #380, GDAL attempts to auto-detect the native encoding of a shapefile and will automatically use that to decode to UTF-8 before returning strings to us. This conflicts with a user-provided encoding: we then apply the user's encoding to text that is already UTF-8, which produces incorrect text.

This sets SHAPE_ENCODING="" (GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open option ENCODING="" also works, but is specific to Shapefiles, and we don't know the path resolves to a Shapefile until after opening it - which is too late to set the option.

This now correctly handles encoding="cp936" for the shapefiles described in #380. I'm still trying to create some test cases here, but I'm running into issues saving files in the correct native encoding in order to prove the fix.

@brendan-ward brendan-ward added this to the 0.8.0 milestone Apr 5, 2024
@theroggy
Member

theroggy commented Apr 5, 2024

This sets SHAPE_ENCODING="" (GDAL config option) to disable auto-decoding before opening the shapefile. Setting the dataset open option ENCODING="" also works, but is specific to Shapefiles, and we don't know the path resolves to a Shapefile until after opening it - which is too late to set the option.

The only disadvantage I see of using the config option vs the dataset open option is that it can lead to inter-thread side effects. If someone in one thread uses the encoding parameter, all other threads in the same process that are reading/writing data at the same time can be influenced by this and hit encoding issues.
This can be avoided by first opening the file to check the driver when encoding is specified, and only then setting the dataset open option... but I do agree it sounds like a far-fetched scenario, even for a multithreaded race condition.

A final alternative is to pass the user-specified encoding to GDAL as a dataset open option, but this also needs a driver check to be sure it is a shapefile. I'm not sure whether GDAL does anything special/different with encodings compared to what we do in pyogrio... but if it does, it might be "safer" to just delegate this to GDAL.

@brendan-ward
Member Author

Fiona used to use CPLSetThreadLocalConfigOption to set this option specifically within a thread; we can do that here too.

My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., a shapefile in a zip on S3), but I don't have remote examples at hand to measure / prove that increase in overhead; it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.
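The difference between a process-global config option and CPLSetThreadLocalConfigOption can be sketched with Python's threading.local; the names below are illustrative, not pyogrio's actual internals:

```python
import threading

_global_config = {}                 # behaves like CPLSetConfigOption
_thread_config = threading.local()  # behaves like CPLSetThreadLocalConfigOption

def set_option(key, value, thread_local=True):
    if thread_local:
        setattr(_thread_config, key, value)
    else:
        _global_config[key] = value

def get_option(key):
    # A thread-local value wins; otherwise fall back to the process-global one.
    return getattr(_thread_config, key, _global_config.get(key))

results = {}

def reader(name, encoding):
    if encoding is not None:
        set_option("SHAPE_ENCODING", encoding)  # visible only in this thread
    results[name] = get_option("SHAPE_ENCODING")

t1 = threading.Thread(target=reader, args=("a", ""))    # disables auto-decoding
t2 = threading.Thread(target=reader, args=("b", None))  # untouched thread
t1.start(); t2.start(); t1.join(); t2.join()

# No cross-thread clobbering: thread "b" never sees thread "a"'s setting.
assert results["a"] == "" and results["b"] is None
```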

@stonereese

Fiona used to use CPLSetThreadLocalConfigOption to set this option specifically within a thread; we can do that here too.

My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., a shapefile in a zip on S3), but I don't have remote examples at hand to measure / prove that increase in overhead; it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.

Thank you so much for your help. When I noticed the garbled text in the feedback, I realized that I had forgotten to mention that the shp file also displayed garbled text when using the sql parameter in read_dataframe(), regardless of the encoding parameter specified. I believe your fix has also resolved this issue. I am on vacation today, but I will verify it tomorrow.

@brendan-ward
Member Author

Unfortunately I'm finding that using sql with a non-UTF-8 encoded file is returning incorrect column names / values when encoding is not also specified, even if GDAL detects the correct native encoding and should be decoding that for us. We might be missing a decoding step somewhere; investigating now.

I'm also finding that it is hard to use non-UTF-8 encodings with use_arrow=True because we don't decode the individual values; the raw values go directly into an Arrow table and immediately break certain things because they can't be decoded nicely. I was able to work around this for shapefiles by letting GDAL do the decoding for us, but I'm not sure how to do this for other drivers. Writing non-UTF values for some of those drivers is likely invalid, so it might not be an issue for those, but it seems theoretically possible for some of them (e.g., GeoJSON).
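The "can't be decoded nicely" failure is easy to see without Arrow: bytes in a non-UTF-8 codepage are generally not valid UTF-8, so anything that insists on UTF-8 (as Arrow string columns do) rejects them outright. A pure-Python sketch:

```python
# Raw cp936 bytes, as a shapefile might store them on disk.
raw = "中文".encode("cp936")  # b'\xd6\xd0\xce\xc4'

# Arrow string columns require valid UTF-8; these bytes are not valid UTF-8,
# so handing them through unchanged breaks as soon as anything decodes them.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(f"rejected: {exc}")

assert not decoded_ok
```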

We could disable the encoding option altogether for ogr_open_arrow - though it does work for shapefiles, because GDAL decodes those to UTF-8 for us (per e738f7b).

@jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API.

@brendan-ward
Member Author

So it turns out that - at least for shapefiles - the OGRLayer instance returned when using an SQL query always fails the test for the UTF-8 capability - even if there is a valid .cpg file present.

The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (i.e., the base layer underneath the SQL query layer).

I'm not sure to what degree this might be present for other drivers; I think most of those already return true when testing for the UTF-8 capability, so we didn't run into the same issue.

@theroggy
Member

theroggy commented Apr 6, 2024

Fiona used to use CPLSetThreadLocalConfigOption to set this option specifically within a thread; we can do that here too.

Yep, that would be great.

My concern with opening the file twice is that it might incur a lot of overhead for remote sources (e.g., shapefile in a zip on S3), but I don't have remote examples easy at hand to measure / prove that increase in overhead, it's just a guess. In contrast, setting the config option (so long as it doesn't negatively clobber other threads) should be low overhead.

Hmm... I hadn't considered this; I also don't have any experience with using remote files.

Not really relevant anymore because of the threadsafe solution above, but if the overhead would be in the driver detection, it can be solved by already specifying in GDALOpenEx that it is an "ESRI Shapefile", so the driver detection is avoided/minimized. If this overhead turns out to be significant, it might be a good idea in general to offer an option to specify the driver(s) to take into consideration when opening the file.

@theroggy
Member

theroggy commented Apr 6, 2024

So it turns out that - at least for shapefiles - the OGRLayer instance returned when using an SQL query always fails the test for the UTF-8 capability - even if there is a valid .cpg file present.

Sounds like a bug in GDAL? Does the problem occur for both SQL dialects ('OGRSQL' and 'SQLITE')?

The workaround was to always test capabilities for shapefiles on the only layer that is present in the shapefile (i.e., the base layer underneath the SQL query layer).

I'm not sure to what degree this might be present for other drivers; I think most of those already return true when testing for the UTF-8 capability, so we didn't run into the same issue.

@brendan-ward
Member Author

Sounds like a bug in GDAL?

Not sure; it could also be a bug in how we are trying to use it on our end in this case. The GDAL Python bindings also return False for the capability but still return the correctly decoded value:

from osgeo import ogr

drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open("/tmp/test.shp", 0)

lyr = ds.GetLayerByIndex(0)
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# True
print(lyr.schema[0].name)
# 中文

lyr = ds.ExecuteSQL("select * from test where \"中文\" = '中文' ", None, "")
print(f"Supports UTF-8: {lyr.TestCapability(ogr.OLCStringsAsUTF8)}")
# False
print(lyr.schema[0].name)
# 中文

The Python bindings are a bit hard to trace through; I'm still trying to find where the schema / field name is determined in this case to see if it is doing something different than we are here.

@brendan-ward brendan-ward changed the title FIX: disable GDAL auto-decoding of shapefiles to UTF-8 FIX: correctly use GDAL auto-decoding of shapefiles when encoding is set Apr 6, 2024
pyogrio/_io.pyx: four resolved review threads (outdated)
pyogrio/_io.pyx Outdated
Comment on lines 1197 to 1198
encoding_b = encoding.encode("UTF-8")
encoding_c = encoding_b
Member

Most likely there is a good reason, but I wonder... why isn't this moved inside the function as well?

Member Author

I originally wanted to keep override_threadlocal_config_option so that it only worked with C types, in case we ever wanted to pass in C values equivalent to what you'd get back from the function. Otherwise, I think we'd want it to encode both the key and value, even if the key passed in is a string defined in code (already a const char* I think?) rather than at runtime.

Member Author

even if the key passed in is a string defined in code (already a const char* I think?)

Hmm, I'm not entirely sure that is correct; the Cython docs say those are str (unicode) instances, but they show examples like this: cdef char* hello_world = 'hello world', which implies that we at least get the conversion without extra overhead when defining string literals.

pyogrio/_io.pyx: resolved review thread (outdated)
@jorisvandenbossche left a comment
Member

Very clear and informative comments and tests!

What I don't fully understand is how this works for non-Shapefiles. From the recent discussions/PRs I had thought that eventually this encoding keyword actually only makes sense for shapefiles. But you document and test that you can use it (in theory) also for other formats. But since for other formats there is no such thing as SHAPE_ENCODING, GDAL will still assume it is reading those files as UTF-8 and is returning UTF-8 values to us, right? How does that not give errors within GDAL?

Something else I am wondering: do we actually test a format that does not have OLCStringsAsUTF8 (or where we override GDAL, like OSM, XLSX, and GeoJSONSeq)? I.e., a format where we actually end up using locale.getpreferredencoding(). Is CSV an example of that?

CHANGES.md Outdated
@@ -19,6 +18,9 @@
- Raise exception in `read_arrow` or `read_dataframe(..., use_arrow=True)` if
a boolean column is detected due to error in GDAL reading boolean values (#335)
this has been fixed in GDAL >= 3.8.3.
- Properly handle decoding of ESRI Shapefiles with user-provided `encoding`
option for `read`, `read_dataframe`, and `open_arrow`, and correctly encode
Shapefile field names and text values to the user-provided `encoding` (#384).
Member

What does this last sentence mean exactly? Is that for writing?

Member Author

Yes, that was for writing. Will make more explicit.

@jorisvandenbossche
Member

@jorisvandenbossche I'm unfamiliar with handling of non-UTF-8 column names / text values in pyarrow - should we even be trying to allow non-UTF-8 content through the Arrow API? I'm not seeing how we would intercept and decode them on our end like we do for the non-Arrow API.

By definition Arrow strings are UTF-8; using anything else would be considered invalid Arrow data. The only solution is to store it as a column with a variable-size binary data type (and keep track of the actual encoding as a user). But that only works for column values; I don't think there is a workaround for column names.

Given that you already solved this for Shapefiles, and that for other file formats this seems a dubious use case anyway(?), it's probably fine to not support this for reading with Arrow?
I assume the only option to get this working is to have an upstream change in GDAL, for example by having an encoding option in OGR_L_GetArrowStream to allow the user to override this and let GDAL decode the read values to UTF-8 before putting that in the Arrow data.

@jorisvandenbossche
Member

But since for other formats there is no such thing as SHAPE_ENCODING, GDAL will still assume it is reading those files as UTF-8 and is returning UTF-8 values to us, right? How does that not give errors within GDAL?

Or GDAL just reads the data and passes it on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user-provided encoding instead of UTF-8, this just happens to "work"? (But so e.g. looking at that file with ogrinfo would still give you garbage, because at that point GDAL will assume UTF-8?)

@brendan-ward
Member Author

Or GDAL just reads the data and passes them on, assuming it is UTF-8 but not actually doing anything with it? So as long as we then decode those bytes with the user provided encoding instead of UTF-8, this just happens to "work"

This is my assumption, based on the testing here. GDAL and other readers of those other drivers (GPKG, FGB, etc) will show miscoded field names / values. So it isn't terribly useful - seems like it enables breaking other tools in the ecosystem - but we don't (yet) specifically prevent it. I don't know if other tools are able to validly write non-UTF-8 encodings to some of these, but it appears to be possible in practice by converting shapefiles to some of them without correctly accounting for alternative encodings (e.g., GDAL #7458).

It would take a bit of research to determine the list, but it would in theory be possible to define a list of drivers that we allow to be non-UTF-8 on write (e.g., shapefile, maybe CSV, XLSX?), and raise an error on others.

We can leave it a bit more open-ended for reading, in part to allow users to specifically force a recoding to cleanup data. But we have to restrict that to the non-Arrow API.
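A guard like the one suggested could look roughly as follows; the driver allow-list here is a placeholder guess, not the researched set, and the function name is illustrative:

```python
import codecs

# Hypothetical allow-list: drivers believed to tolerate non-UTF-8 text on
# write. The actual set would need the research mentioned above.
NON_UTF8_WRITE_OK = {"ESRI Shapefile", "CSV", "XLSX"}

def check_write_encoding(driver, encoding):
    """Raise if a non-UTF-8 encoding is requested for a UTF-8-only driver."""
    if encoding is None or codecs.lookup(encoding).name == "utf-8":
        return  # UTF-8 (or the default) is always fine
    if driver not in NON_UTF8_WRITE_OK:
        raise ValueError(
            f"driver {driver!r} expects UTF-8; cannot write encoding {encoding!r}"
        )

check_write_encoding("ESRI Shapefile", "cp936")  # allowed
check_write_encoding("GPKG", "UTF-8")            # allowed
# check_write_encoding("GPKG", "cp936")          # would raise ValueError
```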

@theroggy
Member

theroggy commented Apr 13, 2024

My current general understanding is as follows. There are 2 general types of format:

  1. "rude" formats that don't have a fixed encoding and don't have metadata with them to indicate the encoding.
    • OLCStringsAsUTF8 is False
    • e.g. "CSV"
    • for writing: GDAL doesn't do any conversions, it just writes the string/column data as it is.
    • for reading: GDAL doesn't do any conversions, it just returns the string/column data as it is.
  2. formats that support one or more encodings and have metadata about it, or the encoding can be deduced by GDAL.
    • OLCStringsAsUTF8 is True
    • e.g. "ESRI Shapefile", most xml or json formats,...
    • for writing: GDAL expects to receive UTF-8 and will write it in the default encoding of the format. As far as I know/found, "ESRI Shapefile" is the only one that supports specifying the encoding to use for writing (via the layer creation option "ENCODING"), and then GDAL will do the conversion to the specified encoding.
    • for reading: GDAL will by default (try to) convert the data to UTF-8 based on the metadata found in the file. Depending on the file format, there are dataset open options to specify the encoding to be used for reading, and then GDAL will do the conversion to UTF-8.
      • For "ESRI Shapefile", this option is called ENCODING, and you can specify ENCODING="" to avoid any recoding. It is perhaps a pity this name is the same as the encoding parameter in pyogrio; if it weren't, we could have just said: please specify the SHP_ENCODING open option.
      • For "DXF", this option is called DXF_ENCODING, and you can specify DXF_ENCODING="UTF-8" to avoid any recoding.
      • For most XML-based files (e.g. "GPX", "KML",...), the encoding should be correct in the XML file; you cannot overrule it in software.
      • For json-based formats, I suppose GDAL autodetects the encoding based on the content.

To conclude, I think the "encoding" parameter is strictly speaking necessary for the first type of files (not sure if there are any others than CSV)... for the other formats everything could be controlled via open options, but due to the naming clash with the one for shapefile...
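The resolution order implied by these two categories can be sketched as a small helper; treat this as an illustration of the taxonomy above, not pyogrio's exact implementation:

```python
import locale

def resolve_read_encoding(user_encoding, strings_as_utf8):
    """Pick the codec used to decode field names/values read via OGR.

    user_encoding:   the pyogrio `encoding` parameter, if given
    strings_as_utf8: the layer's OLCStringsAsUTF8 capability
    """
    if user_encoding:
        return user_encoding                   # explicit override always wins
    if strings_as_utf8:
        return "utf-8"                         # type 2: GDAL already recoded
    return locale.getpreferredencoding(False)  # type 1 "rude" formats, e.g. CSV

assert resolve_read_encoding("cp936", True) == "cp936"
assert resolve_read_encoding(None, True) == "utf-8"
assert isinstance(resolve_read_encoding(None, False), str)
```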

@jorisvandenbossche
Member

  • For "ESRI Shapefile", this option is called ENCODING, and you can specify ENCODING="" to avoid any recoding. It is perhaps a pity this name is the same as the encoding parameter in pyogrio; if it weren't, we could have just said: please specify the SHP_ENCODING open option.

BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through layer_options=dict(encoding="..")

@theroggy
Member

BTW if there is a clash with one of our own keywords, you can always explicitly pass the GDAL layer/dataset open option through layer_options=dict(encoding="..")

Indeed, but only for write_dataframe at the moment, not (yet) for read_dataframe.

@jorisvandenbossche
Member

Indeed, but only for write_dataframe at the moment, not (yet) for read_dataframe.

Ah, that's something we should probably add then.


For this PR, shall we merge it? (it just needs the merge conflicts fixed)

@theroggy
Member

Am I right that if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored?
If so, maybe it is useful to show a warning in that case?

@brendan-ward
Member Author

if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored

For read (non-Arrow API) of other drivers than shapefile with the encoding parameter, we will attempt to use that encoding to decode the field names and string values regardless of OLCStringsAsUTF8 support. This breaks using the Arrow API because pyarrow expects everything in UTF-8 (per above).

For write of other drivers than shapefile (where we force UTF-8), we use the encoding parameter passed by the user to encode the field names and string values. This then renders them as potentially unreadable in other tools that expect UTF-8.

I'm not sure we should treat support for OLCStringsAsUTF8 as an indication that the driver expects everything in UTF-8, rather than only an indication that GDAL knows how to convert between a different encoding and UTF-8 (isn't that what you were getting at above?).

If we know that a given driver must be in UTF-8, we could raise an Exception when attempting to write an alternative encoding.

Are you saying we should raise a warning during read if OLCStringsAsUTF8 == True and the driver is not "ESRI Shapefile"? I.e., the user knows the data source is in a different encoding regardless of GDAL's capability to convert to UTF-8. Since `encoding` is opt-in, it seems like that is a deliberate choice and the warning would just nag them about it, right? Maybe I don't follow what you are getting at here.

@theroggy
Member

if a user passes encoding= for a file format having OLCStringsAsUTF8 == True that is not "ESRI Shapefile", it will be ignored

For read (non-Arrow API) of other drivers than shapefile with the encoding parameter, we will attempt to use that encoding to decode the field names and string values regardless of OLCStringsAsUTF8 support. This breaks using the Arrow API because pyarrow expects everything in UTF-8 (per above).

For write of other drivers than shapefile (where we force UTF-8), we use the encoding parameter passed by the user to encode the field names and string values. This then renders them as potentially unreadable in other tools that expect UTF-8.
I'm not sure we should treat support for OLCStringsAsUTF8 as an indication that the driver expects everything in UTF-8, rather than only an indication that GDAL knows how to convert between a different encoding and UTF-8 (isn't that what you were getting at above?).

Yes, sorry, not sure why I wrote that; it being ignored is indeed not at all the case... It is used, but I think the cases where it leads to something useful are very limited. Not 100% sure, but I personally do think OLCStringsAsUTF8 is an indication that the driver almost always expects to receive UTF-8 data. The only possible exceptions I see are cases where the user specifies specific GDAL options instructing GDAL not to do any encoding changes (e.g. https://gdal.org/drivers/vector/dxf.html#character-encodings).
If data is provided in another encoding and the GDAL driver writes "UTF-8" by default anyway, it won't recode and won't crash on it, as is I think the case in your tests. But if a recoding is needed and the data is not provided to GDAL in UTF-8, I guess this can lead to errors.

If we know that a given driver must be in UTF-8, we could raise an Exception when attempting to write an alternative encoding.

I think this is the case... but as there are so many drivers, I'm not 100% sure, which is why I was thinking about a warning.

Are you saying we should raise a warning during read if OLCStringsAsUTF8 == True and the driver is not "ESRI Shapefile"? I.e., the user knows the data source is in a different encoding regardless of GDAL's capability to convert to UTF-8. Since `encoding` is opt-in, it seems like that is a deliberate choice and the warning would just nag them about it, right? Maybe I don't follow what you are getting at here.

True, I suppose, about the nagging. I'm afraid that few users will really know what they are doing when specifying `encoding=` for an `OLCStringsAsUTF8 == True` file, but for those who do know, it is probably annoying that the warning is shown.

Possibly/probably it is best to just "let it go" and solve any issues if they would surface...

@brendan-ward
Member Author

I've tried to address some of the discussion in the latest changes:

For the Arrow API (open / read arrow), it will now raise an exception if an alternative encoding is provided and the driver is not Shapefile.

We now specifically raise an exception for shapefiles if both the encoding parameter and an "ENCODING"="<whatever>" open / layer creation option are combined, because this leads to order-of-operations issues: we set SHAPE_ENCODING or ENCODING on the user's behalf. Since SHAPE_ENCODING may be set in advance as a configuration option by the user (possibly re-used across many data sources), we don't raise an exception in that case; we just silently override it (so there isn't a case where they can set SHAPE_ENCODING="" to disable recoding by GDAL and then pass in an alternative encoding parameter). This seems like advanced territory, so I didn't think it necessary to raise a warning until we start seeing issues.

This does not disable writing to an alternative encoding if provided by the user (hopefully they know the right thing to do for the target driver); I think we can leave this as is until we see issues.
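The new conflict check can be sketched like this; the function and parameter names are illustrative, not pyogrio's actual internals:

```python
def check_shapefile_encoding_options(encoding, open_options):
    """Reject combining pyogrio's `encoding` parameter with the Shapefile
    ENCODING open option, since pyogrio sets ENCODING / SHAPE_ENCODING
    itself on the user's behalf."""
    if encoding is not None and "ENCODING" in (open_options or {}):
        raise ValueError(
            'cannot provide both the encoding parameter and the "ENCODING" '
            "open option; use only the encoding parameter"
        )

check_shapefile_encoding_options("cp936", None)                # fine
check_shapefile_encoding_options(None, {"ENCODING": "CP936"})  # fine
# check_shapefile_encoding_options("cp936", {"ENCODING": "CP936"})  # raises
```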

@brendan-ward
Member Author

Added a specific test to verify that locale.getpreferredencoding() is used for CSV files when the encoding parameter is not provided (it only runs on Windows, though).

@brendan-ward
Member Author

Added more tests for the Arrow write API, slimmed down the comments and repeated code around encoding in _io.pyx, and fixed an issue that would have bitten us immediately with the forthcoming GDAL 3.9: FlatGeobuf no longer allows writing non-UTF-8 values. We're already likely in unsupported territory with the tests that prove we can write non-UTF-8 values to drivers that likely expect UTF-8, but I left the other drivers in place for now because it proves the encoding works. Alternatively, we could simply block writing non-UTF-8 values for non-shapefile drivers where OLCStringsAsUTF8 is True.

@jorisvandenbossche jorisvandenbossche merged commit db58724 into main Apr 26, 2024
20 checks passed
@jorisvandenbossche jorisvandenbossche deleted the issue380 branch April 26, 2024 13:05
@jorisvandenbossche
Member

Didn't notice it before (and haven't yet looked in detail at what might be causing it), but merging this caused a bunch of failures when testing the wheels for Ubuntu and older macOS: https://github.com/geopandas/pyogrio/actions/runs/8848749467/job/24299655017
