
Commit cd5b464

Add option to read_dataframe to use sql statement (#70)
1 parent f0c2597 commit cd5b464

8 files changed: +394 −77 lines


CHANGES.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
 - generalize check for VSI files from `/vsizip` to `/vsi` (#29)
 - add dtype for each field to `read_info` (#30)
 - support writing empty GeoDataFrames (#38)
+- support use of a sql statement in read_dataframe (#70)
 
 ### Breaking changes
 
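For orientation, a minimal sketch of what the new option enables; the file name, table name, and filter column below are placeholders for illustration and are not taken from the commit:

```python
from pyogrio import read_dataframe

# Hypothetical GeoPackage and table, used only to illustrate the new 'sql'
# keyword added in this commit.
df = read_dataframe(
    "countries.gpkg",
    sql="SELECT * FROM countries WHERE continent = 'Europe'",
)
```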

docs/source/introduction.md

Lines changed: 59 additions & 4 deletions
@@ -41,10 +41,6 @@ configuration options.
 You can certainly try to read or write using unsupported drivers that are
 available in your installation, but you may encounter errors.
 
-Note: different drivers have different tolerance for mixed geometry types, e.g.,
-MultiPolygon and Polygon in the same dataset. You will get exceptions if you
-attempt to write mixed geometries to a driver that does not support them.
-
 ## List available layers
 
 To list layers available in a data source:
@@ -181,6 +177,65 @@ with the bbox.
 
 Note: the `bbox` values must be in the same CRS as the dataset.
 
+## Execute a sql query
+
+You can use the `sql` parameter to execute a sql query on a dataset.
+
+Depending on the dataset, you can use different sql dialects. By default, if
+the dataset natively supports sql, the sql statement will be passed through
+as such. Hence, the sql query should be written in the relevant native sql
+dialect (e.g. [GeoPackage](https://gdal.org/drivers/vector/gpkg.html)/
+[Sqlite](https://gdal.org/drivers/vector/sqlite.html),
+[PostgreSQL](https://gdal.org/drivers/vector/pg.html)). If the datasource
+doesn't natively support sql (e.g.
+[ESRI Shapefile](https://gdal.org/drivers/vector/shapefile.html),
+[FlatGeobuf](https://gdal.org/drivers/vector/flatgeobuf.html)), you can choose
+between '[OGRSQL](https://gdal.org/user/ogr_sql_dialect.html#ogr-sql-dialect)'
+(the default) and
+'[SQLITE](https://gdal.org/user/sql_sqlite_dialect.html#sql-sqlite-dialect)'.
+For SELECT statements the 'SQLITE' dialect tends to provide more spatial
+features as all
+[spatialite](https://www.gaia-gis.it/gaia-sins/spatialite-sql-latest.html)
+functions can be used. If gdal is not built with spatialite support in SQLite,
+you can use ``sql_dialect="INDIRECT_SQLITE"`` to be able to use spatialite
+functions on native SQLite files like Geopackage.
+
+You can combine a sql query with other parameters that will filter the
+dataset. When using ``columns``, ``skip_features``, ``max_features``, and/or
+``where`` it is important to note that they will be applied AFTER the sql
+statement, so these are some things you need to be aware of:
+- if you specify an alias for a column in the sql statement, you need to
+  specify this alias when using the ``columns`` keyword.
+- ``skip_features`` and ``max_features`` will be applied on the rows returned
+  by the sql query, not on the original dataset.
+
+For the ``bbox`` parameter, depending on the combination of the dialect of the
+sql query and the dataset, a spatial index will be used or not, e.g.:
+- ESRI Shapefile: spatial index is used with 'OGRSQL', not with 'SQLITE'.
+- Geopackage: spatial index is always used.
+
+The following sql query returns the 5 Western European countries with the most
+neighbours:
+
+```python
+>>> sql = """
+        SELECT geometry, name,
+               (SELECT count(*)
+                  FROM ne_10m_admin_0_countries layer_sub
+                 WHERE ST_Intersects(layer.geometry, layer_sub.geometry)) AS nb_neighbours
+          FROM ne_10m_admin_0_countries layer
+         WHERE subregion = 'Western Europe'
+         ORDER BY nb_neighbours DESC
+         LIMIT 5"""
+>>> read_dataframe('ne_10m_admin_0_countries.shp', sql=sql, sql_dialect='SQLITE')
+          NAME  nb_neighbours                             geometry
+0       France             11  MULTIPOLYGON (((-54.11153 2.114...
+1      Germany             10  MULTIPOLYGON (((13.81572 48.766...
+2      Austria              9  POLYGON ((16.94504 48.60417, 16...
+3  Switzerland              6  POLYGON ((10.45381 46.86443, 10...
+4      Belgium              5  POLYGON ((2.52180 51.08754, 2.5...
+```
+
 ## Force geometries to be read as 2D geometries
 
 You can force a 3D dataset to 2D using `force_2d`:
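To make the notes above about column aliases and row limits concrete, here is a minimal sketch assuming the same Natural Earth shapefile and its `pop_est` field; the alias and keyword combination are illustrative and not part of the committed docs:

```python
from pyogrio import read_dataframe

# 'population' is an alias defined in the sql statement, so that alias (not
# 'pop_est') is what must be passed to 'columns'. 'max_features' is applied
# to the rows returned by the query, not to the original dataset.
sql = """
    SELECT geometry, name, pop_est AS population
      FROM ne_10m_admin_0_countries
     WHERE subregion = 'Western Europe'"""

df = read_dataframe(
    "ne_10m_admin_0_countries.shp",
    sql=sql,
    sql_dialect="SQLITE",
    columns=["population"],
    max_features=3,
)
```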

pyogrio/_io.pyx

Lines changed: 110 additions & 66 deletions
@@ -171,6 +171,40 @@ cdef OGRLayerH get_ogr_layer(GDALDatasetH ogr_dataset, layer) except NULL:
     return ogr_layer
 
 
+cdef OGRLayerH execute_sql(GDALDatasetH ogr_dataset, str sql, str sql_dialect=None) except NULL:
+    """Execute an SQL statement on a dataset.
+
+    Parameters
+    ----------
+    ogr_dataset : pointer to open OGR dataset
+    sql : str
+        The sql statement to execute
+    sql_dialect : str, optional (default: None)
+        The sql dialect the sql statement is written in
+
+    Returns
+    -------
+    pointer to OGR layer
+    """
+
+    try:
+        sql_b = sql.encode('utf-8')
+        sql_c = sql_b
+        if sql_dialect is None:
+            return exc_wrap_pointer(GDALDatasetExecuteSQL(ogr_dataset, sql_c, NULL, NULL))
+
+        sql_dialect_b = sql_dialect.encode('utf-8')
+        sql_dialect_c = sql_dialect_b
+        return exc_wrap_pointer(GDALDatasetExecuteSQL(ogr_dataset, sql_c, NULL, sql_dialect_c))
+
+    # GDAL does not always raise exception messages in this case
+    except NullPointerError:
+        raise DataLayerError(f"Error executing sql '{sql}'") from None
+
+    except CPLE_BaseError as exc:
+        raise DataLayerError(str(exc))
+
+
 cdef str get_crs(OGRLayerH ogr_layer):
     """Read CRS from layer as EPSG:<code> if available or WKT.
 
@@ -586,7 +620,6 @@ cdef get_features(
         except CPLE_BaseError as exc:
             raise FeatureError(str(exc))
 
-
         if return_fids:
             fid_view[i] = OGR_F_GetFID(ogr_feature)
 
@@ -738,6 +771,8 @@ def ogr_read(
     object where=None,
     tuple bbox=None,
    object fids=None,
+    str sql=None,
+    str sql_dialect=None,
     int return_fids=False,
     **kwargs):
 
@@ -752,90 +787,100 @@ def ogr_read(
     path_b = path.encode('utf-8')
     path_c = path_b
 
-    # layer defaults to index 0
-    if layer is None:
-        layer = 0
-
     if fids is not None:
-        if where is not None or bbox is not None or skip_features or max_features:
+        if where is not None or bbox is not None or sql is not None or skip_features or max_features:
             raise ValueError(
-                "cannot set both 'fids' and any of 'where', 'bbox', "
+                "cannot set both 'fids' and any of 'where', 'bbox', 'sql', "
                 "'skip_features' or 'max_features'"
             )
         fids = np.asarray(fids, dtype=np.intc)
 
+    if sql is not None and layer is not None:
+        raise ValueError("'sql' paramater cannot be combined with 'layer'")
+
     ogr_dataset = ogr_open(path_c, 0, kwargs)
-    ogr_layer = get_ogr_layer(ogr_dataset, layer)
+    try:
+        if sql is None:
+            # layer defaults to index 0
+            if layer is None:
+                layer = 0
+            ogr_layer = get_ogr_layer(ogr_dataset, layer)
+        else:
+            ogr_layer = execute_sql(ogr_dataset, sql, sql_dialect)
 
-    crs = get_crs(ogr_layer)
+        crs = get_crs(ogr_layer)
 
-    # Encoding is derived from the dataset, from the user, or from the system locale
-    encoding = (
-        detect_encoding(ogr_dataset, ogr_layer)
-        or encoding
-        or locale.getpreferredencoding()
-    )
+        # Encoding is derived from the dataset, from the user, or from the system locale
+        encoding = (
+            detect_encoding(ogr_dataset, ogr_layer)
+            or encoding
+            or locale.getpreferredencoding()
+        )
 
-    fields = get_fields(ogr_layer, encoding)
+        fields = get_fields(ogr_layer, encoding)
 
-    if columns is not None:
-        # Fields are matched exactly by name, duplicates are dropped.
-        # Find index of each field into fields
-        idx = np.intersect1d(fields[:,2], columns, return_indices=True)[1]
-        fields = fields[idx, :]
+        if columns is not None:
+            # Fields are matched exactly by name, duplicates are dropped.
+            # Find index of each field into fields
+            idx = np.intersect1d(fields[:,2], columns, return_indices=True)[1]
+            fields = fields[idx, :]
 
-    geometry_type = get_geometry_type(ogr_layer)
+        geometry_type = get_geometry_type(ogr_layer)
 
-    if fids is not None:
-        geometries, field_data = get_features_by_fid(
-            ogr_layer,
-            fids,
-            fields,
-            encoding,
-            read_geometry=read_geometry and geometry_type is not None,
-            force_2d=force_2d,
-        )
+        if fids is not None:
+            geometries, field_data = get_features_by_fid(
+                ogr_layer,
+                fids,
+                fields,
+                encoding,
+                read_geometry=read_geometry and geometry_type is not None,
+                force_2d=force_2d,
+            )
 
-        # bypass reading fids since these should match fids used for read
-        if return_fids:
-            fid_data = fids.astype(np.int64)
+            # bypass reading fids since these should match fids used for read
+            if return_fids:
+                fid_data = fids.astype(np.int64)
+            else:
+                fid_data = None
         else:
-            fid_data = None
-    else:
-        # Apply the attribute filter
-        if where is not None and where != "":
-            apply_where_filter(ogr_layer, where)
+            # Apply the attribute filter
+            if where is not None and where != "":
+                apply_where_filter(ogr_layer, where)
 
-        # Apply the spatial filter
-        if bbox is not None:
-            apply_spatial_filter(ogr_layer, bbox)
+            # Apply the spatial filter
+            if bbox is not None:
+                apply_spatial_filter(ogr_layer, bbox)
 
-        # Limit feature range to available range
-        skip_features, max_features = validate_feature_range(
-            ogr_layer, skip_features, max_features
-        )
+            # Limit feature range to available range
+            skip_features, max_features = validate_feature_range(
+                ogr_layer, skip_features, max_features
+            )
 
-        fid_data, geometries, field_data = get_features(
-            ogr_layer,
-            fields,
-            encoding,
-            read_geometry=read_geometry and geometry_type is not None,
-            force_2d=force_2d,
-            skip_features=skip_features,
-            max_features=max_features,
-            return_fids=return_fids
-        )
+            fid_data, geometries, field_data = get_features(
+                ogr_layer,
+                fields,
+                encoding,
+                read_geometry=read_geometry and geometry_type is not None,
+                force_2d=force_2d,
+                skip_features=skip_features,
+                max_features=max_features,
+                return_fids=return_fids
+            )
 
-    meta = {
-        'crs': crs,
-        'encoding': encoding,
-        'fields': fields[:,2],  # return only names
-        'geometry_type': geometry_type
-    }
+        meta = {
+            'crs': crs,
+            'encoding': encoding,
+            'fields': fields[:,2],  # return only names
+            'geometry_type': geometry_type
+        }
 
-    if ogr_dataset != NULL:
-        GDALClose(ogr_dataset)
-        ogr_dataset = NULL
+    finally:
+        if ogr_dataset != NULL:
+            if sql is not None:
+                GDALDatasetReleaseResultSet(ogr_dataset, ogr_layer)
+
+            GDALClose(ogr_dataset)
+            ogr_dataset = NULL
 
     return (
         meta,
@@ -1015,7 +1060,6 @@ cdef void * create_crs(str crs) except NULL:
     return ogr_crs
 
 
-
 cdef infer_field_types(list dtypes):
     cdef int field_type = 0
     cdef int field_subtype = 0
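Seen from the Python API, the new checks in `ogr_read` make `sql` mutually exclusive with `layer` and with `fids`. A hedged sketch of the resulting behaviour, assuming a hypothetical existing dataset `data.gpkg` with a layer named `my_layer`:

```python
from pyogrio import read_dataframe

sql = "SELECT * FROM my_layer LIMIT 10"  # placeholder table name

# Passing 'sql' together with 'layer' is rejected by ogr_read (ValueError above).
try:
    read_dataframe("data.gpkg", layer="my_layer", sql=sql)
except ValueError as exc:
    print(exc)

# 'fids' is likewise incompatible with 'sql' (as with 'where', 'bbox',
# 'skip_features' and 'max_features').
try:
    read_dataframe("data.gpkg", fids=[1, 2, 3], sql=sql)
except ValueError as exc:
    print(exc)
```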

pyogrio/_ogr.pxd

Lines changed: 6 additions & 0 deletions
@@ -324,6 +324,12 @@ cdef extern from "gdal.h":
     int GDALDatasetGetLayerCount(GDALDatasetH ds)
     OGRLayerH GDALDatasetGetLayer(GDALDatasetH ds, int iLayer)
     OGRLayerH GDALDatasetGetLayerByName(GDALDatasetH ds, char * pszName)
+    OGRLayerH GDALDatasetExecuteSQL(
+        GDALDatasetH ds,
+        const char* pszStatement,
+        OGRGeometryH hSpatialFilter,
+        const char* pszDialect)
+    void GDALDatasetReleaseResultSet(GDALDatasetH, OGRLayerH)
     OGRErr GDALDatasetStartTransaction(GDALDatasetH ds, int bForce)
     OGRErr GDALDatasetCommitTransaction(GDALDatasetH ds)
     OGRErr GDALDatasetRollbackTransaction(GDALDatasetH ds)
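These declarations expose the GDAL functions behind `execute_sql`: a layer returned by `GDALDatasetExecuteSQL` is owned by the dataset and must be freed with `GDALDatasetReleaseResultSet` before the dataset is closed, which is what the `finally` block in `ogr_read` does. A rough analogue of that lifecycle in the GDAL Python bindings (`osgeo.ogr`), on a hypothetical GeoPackage, looks like this:

```python
from osgeo import ogr

ds = ogr.Open("data.gpkg")  # hypothetical dataset
result = ds.ExecuteSQL("SELECT * FROM my_table LIMIT 5", dialect="SQLITE")
try:
    for feature in result:
        print(feature.GetFID())
finally:
    # The result layer belongs to the dataset: release it before closing,
    # mirroring GDALDatasetReleaseResultSet followed by GDALClose above.
    ds.ReleaseResultSet(result)
    ds = None
```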
