-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RFC 86: Column-oriented read API for vector layers #5830
Conversation
@jorisvandenbossche pinged me on this. I can just say that usage of the C stream interface looks basically sane (I'm not competent on geo topics otherwise :-)). |
Co-authored-by: Alan D. Snow <alansnow21@gmail.com>
What are the other (not Parquet, not Arrow) drivers that might produce this? I guess the API seems so specific to Arrow, and if I want to consume just Arrow, why would I come to a format abstraction library? |
AFAIU all drivers can produce this, they only might not have a specialized implementation (so falling back to the generic implementation using the existing You are correct that if you just want to consume an Arrow file, it might still be faster / easier to use an arrow library to read that file. But so personally for me (and for geopandas), this proposal is mostly exciting for all other formats (geopandas will probably keep a specialized
|
Not much to add to what @jorisvandenbossche said : the main purpose is not that much to have something convenient for column oriented formats, but more to have some convenient for column oriented consumers. The default implementation of GetArrowStream() will work with any OGR driver, using their row-oriented GetNextFeature() implementation. There are a couple other drivers that could potentially benefit from a column-oriented consumption API (but definitely not in the scope of implementation I foresee for the RFC)
|
Another benefit that does not depend on the driver (or whether or not the underlying data start as columnar or rowwise) is that the |
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 15.2 s down to 8.1 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 8.1 s (previous commit) to 7.5 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 15.2 s down to 8.1 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 8.1 s (previous commit) to 7.5 s
@rouault, unfortunately there is no optimized API in HANA for column oriented retrieval. |
Updated to reflect the fact that the implementation has now a GeoPackage specialized implementation, which is extremely fast ! |
// Do something useful | ||
array.release(&array); | ||
} | ||
stream.release(&stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this method require the explicit object pointer argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The release callbacks may need to use the private_data member of the ArrowArray and ArrowArrayStream structures (those are C structures, not C++ objects)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok. Do you think it's worth having a gdal specific wrapper around those structures so that we can provide a nicer public api which hides this kind of detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth having a gdal specific wrapper around those structures so that we can provide a nicer public api which hides this kind of detail?
The Arrow C stream/array interfaces defines a standard, and I'd want to avoid defining something GDAL specific on top of it. Anything that would make it more friendly to consume it would apply to any software that would produce Arrow C streams/arrays, so that doesn't really belong to GDAL specificially. @paleolimbot raised a discussion about that in https://lists.apache.org/thread/6vt7btpbrkm554otxqz7oxzdlpfwkhwc . He has prototyped
https://github.com/paleolimbot/gpkg/tree/master/src/arrow-hpp for the writing side. Not sure if there's an equivalent on the reading side (except using the arrow-cpp library itself to ingest Arrow C arrays and expose corresponding C++ objects, but that's an heavy dependency)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's general agreement from the Arrow end of things that there should be a library to make what you are doing here easier (both writing and reading). What you are doing here has been very helpful to push that forward in the discussions that we're having about this...I'm hoping to convert the mailing list discussion into a PR in the next few days although it will be some time before it's ready to use. It will basically be the arrow-hpp thing that Even linked to but using C idioms.
I separated out some C++ helpers here: https://github.com/paleolimbot/geoarrow-cpp , which could eventually help on the Geo-specific reading side (on pause at the moment while the lower-level Arrow library gets sorted).
All of these are some months away from being useful here, and reading Even's implementations has been incredibly helpful both in writing code and in demonstrating that people actually want to do this kind of thing.
by default other formats (particularly the Arrow driver that can return geometries | ||
encoded according to the GeoArrow specification (using a list of coordinates). | ||
The GEOMETRY_ENCODING=WKB option can be passed to force the use of WKB (through | ||
the default implementation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm not super-clear on here -- if the requested geometry format matches the underlying dataset's geometry format, does that mean there'll be absolutely no parsing/re-interpretation of the geometry on GDAL's side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does that mean there'll be absolutely no parsing/re-interpretation of the geometry on GDAL's side?
yes, this is typically the case for GeoParquet that uses WKB as the storage format. And also for GeoPackage where only the GeoPackage geometry header is stripped from the SQlite blob to get a WKB geometry. No OGRGeometry object is instanciated then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should precise that when the default implementation of OGRLayer is used, as GetNextFeature() is used underneath and returns a OGRGeometry, if the underlying format would happen to use WKB for storage, you'd have: WKB --> OGRGeometry (in GetNextFeature() -> WKB (in OGRLayer::GetNextArrowArray()). That would for example be the case for the PostGIS driver currently (with EWKB as the source encoding)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I realise this is off-topic, sorry!)
yes, this is typically the case for GeoParquet that uses WKB as the storage format. And also for GeoPackage where only the GeoPackage geometry header is stripped from the SQlite blob to get a WKB geometry.
What's your thoughts regarding having a similar short-cut approach for the non-column oriented apis? (and for the scenario you've described above) I.e. being able to read features from a gpkg where the geometry is left as just the wkb binary content, and no interpretation is made on gdal's side. In the gpkg case this could be a good optimisation opportunity for some clients, in that they could request the wkb content directly if they are going to be constructing their own geometry representation of the features (ie gpkg -> wkb -> client geometry classes ) and want to avoid the extra representation (gpkg -> wkb -> ogr -> client geometry )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea, but I'm not sure how to implement this in a generic option. I guess this could be an open option implemented in the drivers where that would make sense that would turn the geometry column as a regular Binary field. I'm not sure how to inform client of which column(s) has the geometry. maybe some new field in the OGRGeomFieldDefn class.
Another question -- it's not mentioned anywhere how this would interact with any filters applied to a layer, eg via OGR_L_SetAttributeFilter/OGR_L_SetSpatialFilter. I'm assuming that OGR level filters will be incompatible with the columnar api? |
" The method may take into account ignored fields set with SetIgnoredFields() (the |
``ARROW:extension:name`` set to ``WKB``. Specialized implementations may output | ||
by default other formats (particularly the Arrow driver that can return geometries | ||
encoded according to the GeoArrow specification (using a list of coordinates). | ||
The GEOMETRY_ENCODING=WKB option can be passed to force the use of WKB (through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it in theory (later) also be possible to enable that you can specify GEOMETRY_ENCODING=GEOARROW
to force the use of this encoding, even if the file itself is not using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes one could imagine such an extension
Gosh, how did I miss that?! 🤦 |
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 15.2 s down to 8.1 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 8.1 s (previous commit) to 7.5 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 15.2 s down to 8.1 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 8.1 s (previous commit) to 7.5 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 15.2 s down to 8.1 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 8.1 s (previous commit) to 7.5 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 15.2 s down to 8.1 s
With this improvement, the ogr_bench program on the nz-building-outlines.gpkg file mentioned in RFC 86 (OSGeo#5830) runs from 8.1 s (previous commit) to 7.5 s
Note: the oddities with the bench_ogr_batch.cpp benchmarker were GeoPackage was faster than GeoParquet were just an implementation issue fixed per #6558 . The corrected situation is that GeoParquet has the lead and perf of GeoPackage and FlatGeobuf is pretty close, which is expected |
This RFC describes the addition of new methods to the
OGRLayer
class to retrieve batches of features with a column-oriented memory layout, that suits formats that have that organization or downstream consumers that expect data to be presentedin such a way, in particular the Apache Arrow, Pandas / GeoPandas ecosystem, R spatial packages, and many modern (data analytics focused) databases / engines which are column oriented (eg Snowflake, Google BigQuery, ..)
(Semi-)rendered version of the RFC at: https://github.com/rouault/gdal/blob/rfc_86/doc/source/development/rfc/rfc86_column_oriented_api.rst