-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add read-only OGR ADBC (Arrow Database Connectivity) driver #11003
Conversation
7e44d72
to
7dd286b
Compare
d634619
to
a3bc6b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing!
With respect to geometry support, we're in the process of thinking through how to mark Geometry while pushing the minimum geometry awareness into the general-purpose drivers (that's GDAL's job, after all!): apache/arrow-adbc#2098 . I think we will be able to mark Snowflake geometry as some extension type denoting GeoJSON, PostGIS as something denoting EWKB, and BigQuery as geoarrow.wkt
(with spherical edges). Getting DuckDB to mark their output has been a feature request for a while but I think it's generally thought of as a good idea: duckdb/duckdb_spatial#153 .
I am not sure at what point I'll be able to get there, but I have wanted to do the reverse as well (expose OGR as an ADBC driver). I am in the process of finishing up the documentation for a framework to make it a bit easier: apache/arrow-adbc#2186
ogr/ogrsf_frmts/adbc/CMakeLists.txt
Outdated
|
||
gdal_standard_includes(ogr_ADBC) | ||
target_include_directories(ogr_ADBC PRIVATE $<TARGET_PROPERTY:ogrsf_generic,SOURCE_DIR>) | ||
gdal_target_link_libraries(ogr_ADBC PRIVATE AdbcDriverManager::adbc_driver_manager_shared) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK, but I think you can also vendor c/driver_manager/adbc_driver_manager.cc/h
if that is easier for packaging/makes it more likely to get this into a default build (this is what the R package does since we can't use CMake).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit hesitant on that:
- that would mean extra code to maintain in GDAL
- When we vendor things, we make an effort to rename their symbols, which is an extra layer of maintainance on GDAL side. If adbc_driver_manager code itself has some provision for a #define ADBC_DRIVER_MANAGER_NAMESPACE that could allow to prefix its symbols with a user specified prefix, that would save that effort on GDAL side.
- what about the Windows side of things? Initially, I tried to use Conda to test on Windows on our CI, but it failed, and then I realized that
https://github.com/conda-forge/arrow-adbc-split-feedstock/blob/6f2af5a9b0de9cbdf895ec0ab1e516c306abac67/recipe/meta.yaml#L31 disabled Windows packaging. Is it a Conda thing, or is there's something more fundamental in adbc_driver_manager itself that prevents it from working on Windows? - if we vendor, what about the stability of the interface between adbc_driver_manager and ADBC drivers themselves? I mean if we forget to update the vendored adbc_driver_manager, will that affect its working with ADBC drivers that would have been built against a more recent external ADBC interface?
- with respect to my above symbol renaming, do ADBC drivers link with adbc_driver_manager itself? If so, then symbol renaming is not possible, and vendoring would be a bit risky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TL;DR is that this is something we need to make possible, test, and document from our side: we want people to federate to data sources using ADBC exactly like you're doing here and there won't be widely available linux packages anytime soon.
Perhaps a safer alternative, if slightly more verbose, would be to use just the driver loader part of the driver manager:
...and call the driver callbacks directly:
One of the things we may do soon is make a header-only C++ wrapper that supports that process.
that would mean extra code to maintain in GDAL
If it worked using FetchContent
to an official ADBC release would that be sufficient? I don't think there will be a widely available "system" ADBC for quite some time that will let R users in particular benefit from this (since most of them choose not to use conda!).
with respect to my above symbol renaming, do ADBC drivers link with adbc_driver_manager itself? If so, then symbol renaming is not possible, and vendoring would be a bit risky.
ADBC drivers definitely do not use the AdbcXXX
symbols, but some (or all?) of them export them (I think this is confusing and I wish they didn't). Loading a driver using the init function and calling the driver callbacks definitely doesn't depend on the AdbcXXX
symbols at all.
what about the Windows side of things?
The driver manager definitely works on Windows (it's bundled into the Python package that builds on MSVC and vendored into the R package that builds using mingw). I'm not sure of the linking details of Python + conda or why it is not packaged for conda windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there won't be widely available linux packages anytime soon.
IMHO, that's something to consider in the top of the priority list for success of ADBC rather than investing in all other workarounds.
If it worked using
FetchContent
to an official ADBC release would that be sufficient?
not really, at least for GDAL as packaged by Linux distributions. I believe some build systems disable network after they have downloaded the tarball and installed all dependencies to make sure it is fully reproducible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! It's not a workaround in the sense that it's documented behaviour:
...it's just documented behaviour that we haven't made as easy to use as the driver manager's AdbcXX
symbols yet. With the ability to inject AdbcLoadDriver
, this would mean you don't necessarily need the driver manager at all (i.e., you would only need adbc.h
, which is intended to be vendored in this way and defines an ABI with very intentional forward/backward compatibility guarantees).
const char *pszADBCDriverName = | ||
CSLFetchNameValue(poOpenInfo->papszOpenOptions, "ADBC_DRIVER"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If at all possible, in R it would help to be able to initialize via the init function address:
...because in R this is how we pass around driver information (including DuckDB and Snowflake). In Python I believe the driver packages we ship always have a shared library + entrypoint (although I would have to look up if this is easily accessible to help something like GDAL load the driver). This can, of course, be handled some other time, but if there's any way to squeeze in supporting a driver installed using pip install adbc_driver_snowflake
or install.packages("adbcsnowflake")
, this is how our documentation encourages users of those environments to get the driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If at all possible, in R it would help to be able to initialize via the init function address:
Hum, would you suggest passing a function pointer as an hexadecimal address as a string like "0xDEADBEEF" ? I believe that's technically undefined behaviour in C/C++ for function pointers, also that will work in most cases. However I remember some time ago to have experimenting with the CHERI architecture and pointers are kind of "fat pointers" and can't be "instanciated" from their address. And there's a security issue in allowing to pass this. I could do that, but protected with a configuration option to enable it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great point! I was thinking a base 10 integer address string but an opt-in configuration option won't help R or Rust users here. Would a C or C++ API hook to inject AdbcLoadDriver
or similar be secure? What I'm hoping for here is the ability for GDAL to use a driver installed via install.packages()
, cargo
, or pip
since the drivers are seeing quite a bit of action at the moment and getting the latest version via system is difficult or impossible to do in some environments. With the ability to inject AdbcLoadDriver
, something like sf could provide that by hooking up to the adbcdrivermanager
package (or pyogrio could do it by hooking up to adbc_driver_manager
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a C or C++ API hook to inject
AdbcLoadDriver
or similar be secure?
definitely. The slightly annoying part is that setting the hook should be done in GDAL core (as the ADBC driver may be a plugin, that cannot be directly in it), but that's not much a big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a new public header gdal_adbc.h per 8d86043 that can be used to call a void GDALSetAdbcDriverInitFunc(AdbcDriverInitFunc init_func)
to define the entry point.
This is only compile tested on my side. Actual functional testing would be appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look today/this evening and report back!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No pressure on any of this (all ADBC is great!), but I put together an example of what I had in mind. This turned out to be more like "inject AdbcLoadDriver
" than just one init function. Basically, if the embedding application can have some flexibility about how to load the drivers it can use ones already statically linked in, or updated versions of them installed via pip
or some other mechanism. Using the "load" mechanism frees GDAL from the driver manager completely since then it can just use the callbacks.
Experiment in GDAL: rouault/gdal@adbc...paleolimbot:gdal:adbc-suggestions
Changes to sf: r-spatial/sf@main...paleolimbot:adbc-experiments
Changes to the R driver manager: apache/arrow-adbc@main...paleolimbot:r-adbcdrivermanager-init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paleolimbot Cool! I've merged your changes with minor adaptations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
m_apoLayers.emplace_back(std::move(poLayer)); | ||
} | ||
} | ||
else if (bIsDuckDB || bIsSQLite3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would AdbcConnectionGetObjects()
help here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I've used in 637405b
However I've noticed what I believe is a discrepency between the API documentation and driver implementations regarding the depth. I would expect ADBC_OBJECT_DEPTH_DB_SCHEMAS to be enough to get the table list since its doc mentions "Return metadata on catalogs, schemas, and table", but I actually need to go one level deeper (ADBC_OBJECT_DEPTH_TABLES) to get the table list. At least with SQLite, DuckDB and PostgreSQL. With ADBC_OBJECT_DEPTH_DB_SCHEMAS, db_schema_tables is null or an empty list.
11b42af
to
66bc9f2
Compare
{ | ||
m_poLayerDefn = new OGRFeatureDefn(pszName); | ||
m_poLayerDefn->SetGeomType(wkbNone); | ||
m_poLayerDefn->Reference(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of scope for this PR: I was wondering if it will makes sense to eventually switch to a std::shared_ptr
for feature definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that we discussed with @dbaston for the scope of his future works, but besides the enormous effort to change that within OSGeo/GDAL, there's an obvious backward compatiiblity issue for external C++ users. Could very well be a 4.0 topic
const char *pszTableName = poFeature->GetFieldAsString(1); | ||
const std::string osStatement = CPLSPrintf( | ||
"SELECT * FROM \"%s\".\"%s\"", | ||
CPLString(pszNamespace).replaceAll("\"", "\"\"").c_str(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have an SQL quoting function available somewhere this is when it could be of use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added :
/** Replace all occurences of ch by it repeated twice.
* Typically used for SQL string literal or identifier escaping.
*/
std::string CPL_DLL OGRDuplicateCharacter(const std::string &osStr, char ch);
e81a7ba
to
ecf243d
Compare
const char *pszADBCDriverName = | ||
CSLFetchNameValue(poOpenInfo->papszOpenOptions, "ADBC_DRIVER"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
c8df459
to
c89d482
Compare
982c5aa
to
80c0fa6
Compare
Cf https://arrow.apache.org/adbc/current/index.html for what ADBC is. Depends on the adbc-driver-manager library. The driver is read-only, and there is no support for spatial data currently. Beyond official ADBC drivers (adbc-driver-sqlite, adbc-driver-postgresql, adbc-driver-snowflake, adbc-driver-bigquery, etc.), it can also be used to read Parquet or DuckDB datasets using libduckdb, if libduckdb is installed and can be loaded through dynamic shared library opening.
Also no longer make AdbcDriverManager a requirement
Merging |
Cf https://arrow.apache.org/adbc/current/index.html for what ADBC is.
Depends on the adbc-driver-manager library.
The driver is read-only, and there is no support for spatial data currently.
Beyond official ADBC drivers (adbc-driver-sqlite, adbc-driver-postgresql, adbc-driver-snowflake, adbc-driver-bigquery,
etc.), it can also be used to read Parquet or DuckDB datasets using libduckdb, if libduckdb is installed and can be loaded through dynamic shared library opening.