
Implementation of RFC 86: Column-oriented read API for vector layers #5922

Merged 18 commits on Jun 18, 2022

Conversation

@rouault (Member) commented Jun 13, 2022

Implements #5830.

Impacts:

  • OGR core
  • Python bindings
  • Parquet driver
  • GPKG driver
  • FlatGeoBuf driver

Documentation and tests updated.
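RFC 86 exposes features through the Arrow C stream interface: the caller obtains a stream, reads its schema once, then calls get_next() until the stream is exhausted. A schematic pure-Python mock of that consumption pattern (class and field names here are illustrative, not the actual GDAL or Arrow C API):

```python
class MockArrowArrayStream:
    """Toy stand-in for an Arrow array stream: get_schema() describes
    the columns, get_next() yields record batches until exhausted."""

    def __init__(self, schema, batches):
        self._schema = schema
        self._batches = iter(batches)

    def get_schema(self):
        return self._schema

    def get_next(self):
        # A real ArrowArrayStream signals end-of-stream by returning a
        # released array; None plays that role in this mock.
        return next(self._batches, None)


def read_all(stream):
    """Drain the stream batch by batch, as an RFC 86 consumer would."""
    batches = []
    while (batch := stream.get_next()) is not None:
        batches.append(batch)
    return batches


stream = MockArrowArrayStream(
    schema=["fid", "geometry"],
    batches=[
        {"fid": [1, 2], "geometry": [b"...", b"..."]},
        {"fid": [3], "geometry": [b"..."]},
    ],
)
print(len(read_all(stream)))  # prints 2
```

The key property mirrored here is that the consumer never asks how many batches exist up front; it simply pulls until the stream reports the end.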

Review comments on doc/source/tutorials/vector_api_tut.rst (outdated, resolved)
@nyalldawson (Collaborator)

Aside from the above trivial typos, I've dug through this PR and can't spot any issues. (Admittedly there's a LOT of code here and it would be very easy to miss something 🙃 )

@rouault (Member, Author) commented Jun 16, 2022

@paleolimbot @jorisvandenbossche Want a chance to test this before I merge?

@paleolimbot (Contributor)

I will take a look today! The only thing I noticed the last time I looked was that struct ArrowArray.null_count should be set to 0 if the validity buffer is nullptr (the wording is "Arrays having a 0 null count may choose to not allocate the validity bitmap; how this is represented depends on the implementation (for example, a C++ implementation may represent such an “absent” validity bitmap using a NULL pointer)").
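The convention quoted above can be made concrete with a small pure-Python sketch (illustrative only, not GDAL code): validity bits are LSB-ordered per the Arrow spec, a set bit means "valid", and when no bitmap was allocated the array has no nulls, so null_count must be reported as 0 rather than left unknown.

```python
def arrow_null_count(validity, length):
    """Compute ArrowArray.null_count for a validity bitmap.

    validity: bytes of the LSB-ordered validity bitmap, or None when no
    bitmap was allocated. Per the Arrow C data interface, an absent
    bitmap means zero nulls, so null_count must be 0 (not -1/unknown).
    """
    if validity is None:
        return 0
    valid = sum((validity[i >> 3] >> (i & 7)) & 1 for i in range(length))
    return length - valid


print(arrow_null_count(None, 5))                 # prints 0: no bitmap allocated
print(arrow_null_count(bytes([0b00010101]), 5))  # prints 2: elements 1 and 3 are null
```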

@paleolimbot (Contributor)

I didn't get to this today but will take a look next week... That shouldn't keep you from merging this... I tested a number of files earlier, and the things I plan to check are corner cases that are unlikely to affect most users.

@rouault rouault merged commit 588a200 into OSGeo:master Jun 18, 2022
@paleolimbot (Contributor)

The only oddity I found when looking at the C arrays was that the extension metadata appears to be doubled when reading through the Parquet driver (but only for some files). See the 'details' bit below for how I was able to trigger that. (Doubling the metadata isn't strictly a problem and it might be Arrow C++ that's doing it. These files might be the only ones in existence that use a nested extension type anyway).

I ran Arrow's ValidateFull() on arrays generated from gpkg, fgb, and Parquet reads and there didn't seem to be any problems.

I did get a crash when trying to read a bunch of large Parquet files sequentially (they all pass ValidateFull individually)... I have a feeling that's a problem on my end (not releasing something that I should be, and running out of memory). The "bunch of large files" were all the Parquet and gpkg files listed here: https://github.com/paleolimbot/geoarrow-data/#geoarrow-data .

# remotes::install_github("paleolimbot/rfc86")
library(rfc86)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(narrow)
library(geoarrow)

if (!file.exists("nshn_water_line.parquet")) {
  curl::curl_download(
    "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.parquet",
    "nshn_water_line.parquet"
  )
}

# gdal seems to double the extension name/metadata?
schema_arrow <- open_dataset("nshn_water_line.parquet")$schema
narrow::as_narrow_schema(schema_arrow$geometry)
#> <narrow_schema '+l' at 0x556af0ef79f0>
#> - format: +l
#> - name: geometry
#> - flags: nullable
#> - metadata: List of 2
#>   $ ARROW:extension:name    : chr "geoarrow.multilinestring"
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> - dictionary: NULL
#> - children[1]:
#>   <narrow_schema '+l' at 0x7fb66060cf28>
#>   - format: +l
#>   - name: linestrings
#>   - flags: nullable
#>   - metadata: List of 2
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:name    : chr "geoarrow.linestring"
#>   - dictionary: NULL
#>   - children[1]:
#>     <narrow_schema '+w:3' at 0x7fb66060d068>
#>     - format: +w:3
#>     - name: vertices
#>     - flags: nullable
#>     - metadata: List of 2
#>   $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#>   $ ARROW:extension:name    : chr "geoarrow.point"
#>     - dictionary: NULL
#>     - children[1]:
#>       <narrow_schema 'g' at 0x7fb66060d1a8>
#>       - format: g
#>       - name: xyz
#>       - flags: nullable
#>       - metadata:  list()
#>       - dictionary: NULL
#>       - children[0]:

stream <- narrow::narrow_allocate_array_stream()
rfc86:::cpp_read_ogr_stream("nshn_water_line.parquet", stream)
#> $stream
#> <pointer: 0x556af036d340>
#> attr(,"class")
#> [1] "narrow_array_stream"
#> 
#> $crs
#> [1] "PROJCS[\"NAD_1983_CSRS_2010_UTM_20_Nova_Scotia\",GEOGCS[\"GCS_North_American_1983_CSRS_2010\",DATUM[\"NAD83_Canadian_Spatial_Reference_System\",SPHEROID[\"GRS 1980\",6378137,298.257222101],AUTHORITY[\"EPSG\",\"6140\"]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",-63],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]"

schema_gdal <- narrow_array_stream_get_schema(stream)
schema_gdal$children$geometry
#> <narrow_schema '+l' at 0x556af177afe8>
#> - format: +l
#> - name: geometry
#> - flags: nullable
#> - metadata: List of 2
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:name    : chr "geoarrow.multilinestring"
#> - dictionary: NULL
#> - children[1]:
#>   <narrow_schema '+l' at 0x7fb661c0ab28>
#>   - format: +l
#>   - name: linestrings
#>   - flags: nullable
#>   - metadata: List of 4
#>   $ ARROW:extension:name    : chr "geoarrow.linestring"
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:name    : chr "geoarrow.linestring"
#>   - dictionary: NULL
#>   - children[1]:
#>     <narrow_schema '+w:3' at 0x7fb661c0ac68>
#>     - format: +w:3
#>     - name: vertices
#>     - flags: nullable
#>     - metadata: List of 4
#>   $ ARROW:extension:name    : chr "geoarrow.point"
#>   $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#>   $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#>   $ ARROW:extension:name    : chr "geoarrow.point"
#>     - dictionary: NULL
#>     - children[1]:
#>       <narrow_schema 'g' at 0x7fb661c0ada8>
#>       - format: g
#>       - name: xyz
#>       - flags: nullable
#>       - metadata:  list()
#>       - dictionary: NULL
#>       - children[0]:

Created on 2022-06-24 by the reprex package (v2.0.1)
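For reference when reading the raw metadata values above: ArrowSchema.metadata (and, from the output in this thread, apparently the geoarrow extension metadata values as well) is serialized as an int32 pair count followed by int32-length-prefixed keys and values. A sketch decoder (assuming little-endian, where the spec actually says native endianness) shows why raw [1:4] 00 00 00 00 is simply "zero pairs", and why duplicated keys can survive: the format is a flat list, not a map.

```python
import struct

def decode_c_metadata(buf):
    """Decode an Arrow C data interface metadata blob into (key, value)
    pairs: int32 pair count, then for each pair an int32-length-prefixed
    key and value. Duplicate keys are preserved, since the format is a
    flat list rather than a map."""
    pairs, pos = [], 4
    (n,) = struct.unpack_from("<i", buf, 0)
    for _ in range(n):
        (klen,) = struct.unpack_from("<i", buf, pos); pos += 4
        key = buf[pos:pos + klen]; pos += klen
        (vlen,) = struct.unpack_from("<i", buf, pos); pos += 4
        val = buf[pos:pos + vlen]; pos += vlen
        pairs.append((key, val))
    return pairs


print(decode_c_metadata(b"\x00\x00\x00\x00"))  # prints [] (the 4-byte 00 00 00 00 value)
```

The point metadata beginning 01 00 00 00 in the same output is consistent with one pair (key "crs", value the PROJCRS string shown in the pyarrow dump below).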

@rouault (Member, Author) commented Jun 25, 2022

> (Doubling the metadata isn't strictly a problem and it might be Arrow C++ that's doing it.)

Either Arrow C++ or the file itself; this can be reproduced with pyarrow alone. For Parquet, when not forcing WKB encoding, OGR just delegates to libarrow/libparquet to return the ArrowArray and ArrowSchema objects.

import pyarrow.parquet as pq
t = pq.read_table('nshn_water_line.parquet')
t.schema
[...]
geometry: list<linestrings: list<vertices: fixed_size_list<xyz: double>[3]>>
  child 0, linestrings: list<vertices: fixed_size_list<xyz: double>[3]>
      child 0, vertices: fixed_size_list<xyz: double>[3]
          child 0, xyz: double
      -- field metadata --
      ARROW:extension:name: 'geoarrow.point'
      ARROW:extension:metadata: 'crs�PROJCRS["NAD_1983_CSRS_20' + 924
      ARROW:extension:metadata: 'crs�PROJCRS["NAD_1983_CSRS_20' + 924
      ARROW:extension:name: 'geoarrow.point'
    -- field metadata --
    ARROW:extension:name: 'geoarrow.linestring'
    ARROW:extension:metadata: ''
    ARROW:extension:metadata: ''
    ARROW:extension:name: 'geoarrow.linestring'
  -- field metadata --
  ARROW:extension:metadata: ''
  ARROW:extension:name: 'geoarrow.multilinestring'

> I did get a crash when trying to read a bunch of large Parquet files sequentially

I tried looping over and over on the above file, and it's stable (RAM usage too):

from osgeo import ogr
ds = ogr.Open('nshn_water_line.parquet')
lyr = ds.GetLayer(0)
while True:
    stream = lyr.GetArrowStreamAsPyArrow()
    for record_batch in stream: pass
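Since the duplicated entries aren't harmful, nothing needs to change; but if a downstream consumer wanted to normalize them before re-exporting, a hypothetical helper (not part of GDAL, Arrow, or geoarrow) could be as simple as:

```python
def dedupe_metadata(pairs):
    """Drop exact duplicate (key, value) pairs, keeping first-seen order."""
    seen, out = set(), []
    for kv in pairs:
        if kv not in seen:
            seen.add(kv)
            out.append(kv)
    return out


pairs = [
    ("ARROW:extension:name", "geoarrow.linestring"),
    ("ARROW:extension:metadata", ""),
    ("ARROW:extension:metadata", ""),
    ("ARROW:extension:name", "geoarrow.linestring"),
]
print(dedupe_metadata(pairs))  # the two unique pairs, in original order
```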
