
ENH: Add append support to write functions #197

Merged 5 commits on Jan 13, 2023
1 change: 1 addition & 0 deletions .github/workflows/docker-gdal.yml
@@ -17,6 +17,7 @@ jobs:
name: GDAL ${{ matrix.container }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
container:
- "osgeo/gdal:ubuntu-small-latest" # >= python 3.8.10
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -9,6 +9,8 @@
This can be enabled by passing `use_arrow=True` to `pyogrio.read_dataframe`
(or by using `pyogrio.raw.read_arrow` directly), and provides a further
speed-up (#155, #191).
- Support for appending to an existing data source when supported by GDAL by
passing `append=True` to `pyogrio.write_dataframe` (#197).

## 0.4.2

37 changes: 32 additions & 5 deletions docs/source/introduction.md
@@ -24,10 +24,15 @@ Not all geometry or field types may be supported for all drivers.
{...'GeoJSON': 'rw', 'GeoJSONSeq': 'rw',...}
```

Drivers that are not known to be supported are listed with `"?"` for capabilities.
Drivers that are known to support write capability end in `"w"`.
Drivers that support write capability in your version of GDAL end in `"w"`.
Certain drivers that are known to be unsupported in Pyogrio are disabled for
write capabilities.

To find subsets of drivers that have known support:
NOTE: not all drivers support writing the contents of a GeoDataFrame; you may
encounter errors due to unsupported data types, unsupported geometry types,
or other driver-related errors when writing to a data source.

To find subsets of drivers that support read or write capabilities:

```python
>>> list_drivers(read=True)
@@ -38,8 +43,13 @@ See the full list of [drivers](https://gdal.org/drivers/vector/index.html) for
more information about specific drivers, including their write support and
configuration options.

You can certainly try to read or write using unsupported drivers that are
available in your installation, but you may encounter errors.
The following drivers are known to be well-supported and tested in Pyogrio:

- `ESRI Shapefile`
- `FlatGeobuf`
- `GeoJSON`
- `GeoJSONSeq`
- `GPKG`
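For illustration, the `{driver: capabilities}` mapping returned by `list_drivers()` can be filtered in plain Python. The dict below is a hypothetical stand-in for real output, which depends on your GDAL build:

```python
# Hypothetical stand-in for the {driver: capabilities} dict that
# list_drivers() returns; real output depends on your GDAL build.
drivers = {
    "ESRI Shapefile": "rw",
    "GPKG": "rw",
    "OAPIF": "r",
    "TopoJSON": "r",
}

# keep only drivers whose capability string includes write support
writable = sorted(name for name, modes in drivers.items() if "w" in modes)
print(writable)  # ['ESRI Shapefile', 'GPKG']
```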

## List available layers

@@ -328,6 +338,23 @@ If you want to write another file format supported by GDAL or if you want to
overrule the default driver for an extension, you can specify the driver with the
`driver` keyword, e.g. `driver="GPKG"`.

## Appending to an existing data source

Certain drivers may support the ability to append records to an existing data
source. See the
[GDAL driver listing](https://gdal.org/drivers/vector/index.html)
for details about the capabilities of a driver for your version of GDAL.

```python
>>> write_dataframe(df, "/tmp/existing_file.gpkg", append=True)
```

NOTE: the schema of the data frame being appended (column names, order, and
data types) must exactly match the schema of the existing data source.

NOTE: not all drivers that support write capabilities also support append,
and append support may vary by GDAL version.
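The matching requirement can be pre-checked before calling `write_dataframe(..., append=True)`. This is an illustrative plain-Python sketch (`schemas_match` is not part of the pyogrio API), comparing column names and dtype strings:

```python
# Illustrative pre-check (not a pyogrio function): appending requires that
# the appended frame's schema exactly matches the existing data source.
def schemas_match(existing_schema, new_schema):
    """Each schema is a sequence of (column_name, dtype_string) pairs;
    an exact match, including column order, is required."""
    return list(existing_schema) == list(new_schema)

existing = [("name", "object"), ("population", "int64")]
ok = [("name", "object"), ("population", "int64")]
reordered = [("population", "int64"), ("name", "object")]

print(schemas_match(existing, ok))         # True
print(schemas_match(existing, reordered))  # False
```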

## Reading from compressed files / archives

GDAL supports reading directly from an archive, such as a zipped folder, without
170 changes: 93 additions & 77 deletions pyogrio/_io.pyx
@@ -125,7 +125,6 @@ cdef void* ogr_open(const char* path_c, int mode, options) except NULL:
else:
flags |= GDAL_OF_READONLY

# TODO: other open opts from fiona
open_opts = CSLAddNameValue(open_opts, "VALIDATE_OPEN_OPTIONS", "NO")

try:
@@ -1329,12 +1328,12 @@ cdef infer_field_types(list dtypes):
return field_types


# TODO: handle updateable data sources, like GPKG
# TODO: set geometry and field data as memory views?
def ogr_write(
str path, str layer, str driver, geometry, field_data, fields,
str crs, str geometry_type, str encoding, object dataset_kwargs,
object layer_kwargs, bint promote_to_multi=False, bint nan_as_null=True,
bint append=False
):
cdef const char *path_c = NULL
cdef const char *layer_c = NULL
@@ -1377,33 +1376,37 @@ def ogr_write(
if not layer:
layer = os.path.splitext(os.path.split(path)[1])[0]

layer_b = layer.encode('UTF-8')
layer_c = layer_b

# if shapefile, GeoJSON, or FlatGeobuf, always delete first
# for other types, check if we can create layers
# GPKG might be the only multi-layer writeable type. TODO: check this
if driver in ('ESRI Shapefile', 'GeoJSON', 'GeoJSONSeq', 'FlatGeobuf') and os.path.exists(path):
os.unlink(path)
if not append:
os.unlink(path)

# TODO: invert this: if exists then try to update it, if that doesn't work then always create
layer_exists = False
if os.path.exists(path):
try:
ogr_dataset = ogr_open(path_c, 1, None)

# If layer exists, delete it.
for i in range(GDALDatasetGetLayerCount(ogr_dataset)):
name = OGR_L_GetName(GDALDatasetGetLayer(ogr_dataset, i))
if layer == name.decode('UTF-8'):
layer_idx = i
break

if layer_idx >= 0:
GDALDatasetDeleteLayer(ogr_dataset, layer_idx)
layer_exists = True

if not append:
GDALDatasetDeleteLayer(ogr_dataset, layer_idx)

except DataSourceError as exc:
# open failed
if append:
raise exc

except DataSourceError:
# open failed, so create from scratch
# force delete it first
# otherwise create from scratch
os.unlink(path)
ogr_dataset = NULL

@@ -1416,48 +1419,58 @@

ogr_dataset = ogr_create(path_c, driver_c, dataset_options)

### Create the CRS
if crs is not None:
try:
ogr_crs = create_crs(crs)
# if we are not appending to an existing layer, we need to create
# the layer and all associated properties (CRS, field defs, etc)
create_layer = not (append and layer_exists)

except Exception as exc:
OGRReleaseDataSource(ogr_dataset)
ogr_dataset = NULL
if dataset_options != NULL:
CSLDestroy(<char**>dataset_options)
dataset_options = NULL
raise exc
### Create the layer
if create_layer:
# Create the CRS
if crs is not None:
try:
ogr_crs = create_crs(crs)

except Exception as exc:
OGRReleaseDataSource(ogr_dataset)
ogr_dataset = NULL
if dataset_options != NULL:
CSLDestroy(<char**>dataset_options)
dataset_options = NULL
raise exc

# Setup layer creation options
if not encoding:
encoding = locale.getpreferredencoding()

if driver == 'ESRI Shapefile':
# Fiona only sets encoding for shapefiles; other drivers do not support
# encoding as an option.
encoding_b = encoding.upper().encode('UTF-8')
encoding_c = encoding_b
layer_options = CSLSetNameValue(layer_options, "ENCODING", encoding_c)

# Setup other layer creation options
for k, v in layer_kwargs.items():
k = k.encode('UTF-8')
v = v.encode('UTF-8')
layer_options = CSLAddNameValue(layer_options, <const char *>k, <const char *>v)

### Get geometry type
# TODO: this is brittle for 3D / ZM / M types
# TODO: fail on M / ZM types
geometry_code = get_geometry_type_code(geometry_type or "Unknown")

### Create options
if not encoding:
encoding = locale.getpreferredencoding()
try:
if create_layer:
layer_b = layer.encode('UTF-8')
layer_c = layer_b

if driver == 'ESRI Shapefile':
# Fiona only sets encoding for shapefiles; other drivers do not support
# encoding as an option.
encoding_b = encoding.upper().encode('UTF-8')
encoding_c = encoding_b
layer_options = CSLSetNameValue(layer_options, "ENCODING", encoding_c)

# Setup other layer creation options
for k, v in layer_kwargs.items():
k = k.encode('UTF-8')
v = v.encode('UTF-8')
layer_options = CSLAddNameValue(layer_options, <const char *>k, <const char *>v)

### Get geometry type
# TODO: this is brittle for 3D / ZM / M types
# TODO: fail on M / ZM types
geometry_code = get_geometry_type_code(geometry_type or "Unknown")
ogr_layer = exc_wrap_pointer(
GDALDatasetCreateLayer(ogr_dataset, layer_c, ogr_crs,
geometry_code, layer_options))

### Create the layer
try:
ogr_layer = exc_wrap_pointer(
GDALDatasetCreateLayer(ogr_dataset, layer_c, ogr_crs,
<OGRwkbGeometryType>geometry_code,
layer_options))
else:
ogr_layer = exc_wrap_pointer(get_ogr_layer(ogr_dataset, layer))

except Exception as exc:
OGRReleaseDataSource(ogr_dataset)
Expand All @@ -1470,51 +1483,54 @@ def ogr_write(
ogr_crs = NULL

if dataset_options != NULL:
CSLDestroy(<char**>dataset_options)
CSLDestroy(dataset_options)
dataset_options = NULL

if layer_options != NULL:
CSLDestroy(<char**>layer_options)
CSLDestroy(layer_options)
layer_options = NULL

### Create the fields
field_types = infer_field_types([field.dtype for field in field_data])
for i in range(num_fields):
field_type, field_subtype, width, precision = field_types[i]

name_b = fields[i].encode(encoding)
try:
ogr_fielddef = exc_wrap_pointer(OGR_Fld_Create(name_b, field_type))
### Create the fields
if create_layer:
for i in range(num_fields):
field_type, field_subtype, width, precision = field_types[i]

# subtypes, see: https://gdal.org/development/rfc/rfc50_ogr_field_subtype.html
if field_subtype != OFSTNone:
OGR_Fld_SetSubType(ogr_fielddef, field_subtype)
name_b = fields[i].encode(encoding)
try:
ogr_fielddef = exc_wrap_pointer(OGR_Fld_Create(name_b, field_type))

if width:
OGR_Fld_SetWidth(ogr_fielddef, width)
# subtypes, see: https://gdal.org/development/rfc/rfc50_ogr_field_subtype.html
if field_subtype != OFSTNone:
OGR_Fld_SetSubType(ogr_fielddef, field_subtype)

# TODO: set precision
if width:
OGR_Fld_SetWidth(ogr_fielddef, width)

except:
if ogr_fielddef != NULL:
OGR_Fld_Destroy(ogr_fielddef)
ogr_fielddef = NULL
# TODO: set precision

OGRReleaseDataSource(ogr_dataset)
ogr_dataset = NULL
raise FieldError(f"Error creating field '{fields[i]}' from field_data") from None
except:
if ogr_fielddef != NULL:
OGR_Fld_Destroy(ogr_fielddef)
ogr_fielddef = NULL

try:
exc_wrap_int(OGR_L_CreateField(ogr_layer, ogr_fielddef, 1))
OGRReleaseDataSource(ogr_dataset)
ogr_dataset = NULL
raise FieldError(f"Error creating field '{fields[i]}' from field_data") from None

except:
OGRReleaseDataSource(ogr_dataset)
ogr_dataset = NULL
raise FieldError(f"Error adding field '{fields[i]}' to layer") from None
try:
exc_wrap_int(OGR_L_CreateField(ogr_layer, ogr_fielddef, 1))

finally:
if ogr_fielddef != NULL:
OGR_Fld_Destroy(ogr_fielddef)
except:
OGRReleaseDataSource(ogr_dataset)
ogr_dataset = NULL
raise FieldError(f"Error adding field '{fields[i]}' to layer") from None

finally:
if ogr_fielddef != NULL:
OGR_Fld_Destroy(ogr_fielddef)


### Create the features
41 changes: 19 additions & 22 deletions pyogrio/_ogr.pyx
@@ -100,24 +100,18 @@ def get_gdal_config_option(str name):
return str_value


### Drivers
# mapping of driver:mode
# see full list at https://gdal.org/drivers/vector/index.html
# only for drivers specifically known to operate correctly with pyogrio
DRIVERS = {
# "CSV": "rw", # TODO: needs geometry conversion method
"ESRI Shapefile": "rw",
"FlatGeobuf": "rw",
"GeoJSON": "rw",
"GeoJSONSeq": "rw",
"GML": "rw",
# "GPX": "rw", # TODO: supports limited geometry types
"GPKG": "rw",
"OAPIF": "r",
"OpenFileGDB": "r",
"TopoJSON": "r",
# "XLSX": "rw", # TODO: needs geometry conversion method
}
def ogr_driver_supports_write(driver):
# exclude drivers known to be unsupported by pyogrio even though they are
# supported for write by GDAL
if driver in {"CSV", "XLSX"}:
(review thread on this line)

Member: CSV actually does work if specify GEOMETRY='AS_WKT'

Member: Also wondering (but for a different issue/PR), do we want to support writing only the attributes without geometry (like we support reading with ignoring the geometry)

Member Author: Ah - didn't know that; probably means that XLSX would work with that as well. In which case, maybe we should just pass through those drivers that are supported for write, and let the user figure out what additional options may be required if GDAL returns an error.

Member Author: Yes: #105 addresses writing without geometry; was not trying to fast-track that for the upcoming release though.

Member Author: XLSX does not support geometry, so we need to keep this exclusion for XLSX until we do #105 and allow writing to it without geometry (though probably cleaner for user to do this through Pandas to_excel(...))

return False


# check metadata for driver to see if it supports write
if _get_driver_metadata_item(driver, "DCAP_CREATE") == 'YES':
return True

return False
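The logic of `ogr_driver_supports_write` above can be sketched in plain Python; `METADATA` below is a hypothetical stand-in for GDAL's per-driver metadata (queried via `_get_driver_metadata_item` in the real code):

```python
# Plain-Python sketch of the capability check; METADATA is a hypothetical
# stand-in for GDAL's per-driver metadata lookup.
METADATA = {
    ("GPKG", "DCAP_CREATE"): "YES",
    ("ESRI Shapefile", "DCAP_CREATE"): "YES",
    ("OAPIF", "DCAP_CREATE"): None,  # read-only driver: no create capability
}

# drivers GDAL can write but pyogrio excludes (geometry handling not wired up)
UNSUPPORTED_FOR_WRITE = {"CSV", "XLSX"}

def driver_supports_write(driver):
    # excluded drivers are rejected before consulting driver metadata
    if driver in UNSUPPORTED_FOR_WRITE:
        return False
    # a driver supports write only if it advertises create capability
    return METADATA.get((driver, "DCAP_CREATE")) == "YES"

print(driver_supports_write("GPKG"))   # True
print(driver_supports_write("CSV"))    # False
print(driver_supports_write("OAPIF"))  # False
```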


def ogr_list_drivers():
@@ -131,9 +125,12 @@ def ogr_list_drivers():
name_c = <char *>OGR_Dr_GetName(driver)

name = get_string(name_c)
# drivers that are not specifically listed have unknown support
# this omits any drivers from supported list that are not installed
drivers[name] = DRIVERS.get(name, '?')

if ogr_driver_supports_write(name):
drivers[name] = "rw"

else:
drivers[name] = "r"

return drivers

@@ -186,7 +183,7 @@

def get_gdal_data_path():
"""
Get the path to the directory GDAL uses to read data files.
Get the path to the directory GDAL uses to read data files.
"""
cdef const char *path_c = CPLFindFile("gdal", "header.dxf")
if path_c != NULL: