
Implementation of RFC 86: Column-oriented read API for vector layers #5922

Merged 18 commits on Jun 18, 2022

Conversation

@rouault (Member) commented Jun 13, 2022

Implements #5830.

Impacts:

  • OGR core
  • Python bindings
  • Parquet driver
  • GPKG driver
  • FlatGeoBuf driver

Documentation and tests updated.
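RFC 86 exposes features through the Arrow C stream interface: the caller obtains a stream, reads its schema once, then calls get_next() until the stream is exhausted. A schematic pure-Python mock of that consumption pattern (class and field names here are illustrative, not the actual GDAL or Arrow C API):

```python
class MockArrowArrayStream:
    """Toy stand-in for an Arrow array stream: get_schema() describes
    the columns, get_next() yields record batches until exhausted."""

    def __init__(self, schema, batches):
        self._schema = schema
        self._batches = iter(batches)

    def get_schema(self):
        return self._schema

    def get_next(self):
        # A real ArrowArrayStream signals end-of-stream by returning a
        # released array; None plays that role in this mock.
        return next(self._batches, None)


def read_all(stream):
    """Drain the stream batch by batch, as an RFC 86 consumer would."""
    batches = []
    while (batch := stream.get_next()) is not None:
        batches.append(batch)
    return batches


stream = MockArrowArrayStream(
    schema=["fid", "geometry"],
    batches=[
        {"fid": [1, 2], "geometry": [b"...", b"..."]},
        {"fid": [3], "geometry": [b"..."]},
    ],
)
print(len(read_all(stream)))  # prints 2
```

The key property mirrored here is that the consumer never asks how many batches exist up front; it simply pulls until the stream reports the end.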

Review comments on doc/source/tutorials/vector_api_tut.rst (outdated, resolved)
@nyalldawson (Collaborator)

Aside from the above trivial typos, I've dug through this PR and can't spot any issues. (Admittedly there's a LOT of code here and it would be very easy to miss something 🙃 )

@rouault (Member, Author) commented Jun 16, 2022

@paleolimbot @jorisvandenbossche Want a chance to test this before I merge?

@paleolimbot (Contributor)

I will take a look today! The only thing I noticed the last time I looked was that struct ArrowArray.null_count should be set to 0 if the validity buffer is nullptr (the wording is "Arrays having a 0 null count may choose to not allocate the validity bitmap; how this is represented depends on the implementation (for example, a C++ implementation may represent such an “absent” validity bitmap using a NULL pointer)").
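The convention quoted above can be made concrete with a small pure-Python sketch (illustrative only, not GDAL code): validity bits are LSB-ordered per the Arrow spec, a set bit means "valid", and when no bitmap was allocated the array has no nulls, so null_count must be reported as 0 rather than left unknown.

```python
def arrow_null_count(validity, length):
    """Compute ArrowArray.null_count for a validity bitmap.

    validity: bytes of the LSB-ordered validity bitmap, or None when no
    bitmap was allocated. Per the Arrow C data interface, an absent
    bitmap means zero nulls, so null_count must be 0 (not -1/unknown).
    """
    if validity is None:
        return 0
    valid = sum((validity[i >> 3] >> (i & 7)) & 1 for i in range(length))
    return length - valid


print(arrow_null_count(None, 5))                 # prints 0: no bitmap allocated
print(arrow_null_count(bytes([0b00010101]), 5))  # prints 2: elements 1 and 3 are null
```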

@paleolimbot (Contributor)

I didn't get to this today but will take a look next week... That shouldn't keep you from merging this... I tested a number of files earlier, and the things I plan to check are corner cases that are unlikely to affect most users.

@rouault rouault merged commit 588a200 into OSGeo:master Jun 18, 2022
@paleolimbot (Contributor)

The only oddity I found when looking at the C arrays was that the extension metadata appears to be doubled when reading through the Parquet driver (but only for some files). See the 'details' bit below for how I was able to trigger that. (Doubling the metadata isn't strictly a problem and it might be Arrow C++ that's doing it. These files might be the only ones in existence that use a nested extension type anyway).

I ran Arrow's ValidateFull() on arrays generated from gpkg, fgb, and Parquet reads and there didn't seem to be any problems.

I did get a crash when trying to read a bunch of large Parquet files sequentially (they all pass ValidateFull individually)... I have a feeling that's a problem on my end (not releasing something that I should be, and running out of memory). The "bunch of large files" were all the Parquet and gpkg files listed here: https://github.com/paleolimbot/geoarrow-data/#geoarrow-data .

# remotes::install_github("paleolimbot/rfc86")
library(rfc86)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(narrow)
library(geoarrow)

if (!file.exists("nshn_water_line.parquet")) {
  curl::curl_download(
    "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.parquet",
    "nshn_water_line.parquet"
  )
}

# gdal seems to double the extension name/metadata?
schema_arrow <- open_dataset("nshn_water_line.parquet")$schema
narrow::as_narrow_schema(schema_arrow$geometry)
#> <narrow_schema '+l' at 0x556af0ef79f0>
#> - format: +l
#> - name: geometry
#> - flags: nullable
#> - metadata: List of 2
#>   $ ARROW:extension:name    : chr "geoarrow.multilinestring"
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> - dictionary: NULL
#> - children[1]:
#>   <narrow_schema '+l' at 0x7fb66060cf28>
#>   - format: +l
#>   - name: linestrings
#>   - flags: nullable
#>   - metadata: List of 2
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:name    : chr "geoarrow.linestring"
#>   - dictionary: NULL
#>   - children[1]:
#>     <narrow_schema '+w:3' at 0x7fb66060d068>
#>     - format: +w:3
#>     - name: vertices
#>     - flags: nullable
#>     - metadata: List of 2
#>   $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#>   $ ARROW:extension:name    : chr "geoarrow.point"
#>     - dictionary: NULL
#>     - children[1]:
#>       <narrow_schema 'g' at 0x7fb66060d1a8>
#>       - format: g
#>       - name: xyz
#>       - flags: nullable
#>       - metadata:  list()
#>       - dictionary: NULL
#>       - children[0]:

stream <- narrow::narrow_allocate_array_stream()
rfc86:::cpp_read_ogr_stream("nshn_water_line.parquet", stream)
#> $stream
#> <pointer: 0x556af036d340>
#> attr(,"class")
#> [1] "narrow_array_stream"
#> 
#> $crs
#> [1] "PROJCS[\"NAD_1983_CSRS_2010_UTM_20_Nova_Scotia\",GEOGCS[\"GCS_North_American_1983_CSRS_2010\",DATUM[\"NAD83_Canadian_Spatial_Reference_System\",SPHEROID[\"GRS 1980\",6378137,298.257222101],AUTHORITY[\"EPSG\",\"6140\"]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",-63],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]"

schema_gdal <- narrow_array_stream_get_schema(stream)
schema_gdal$children$geometry
#> <narrow_schema '+l' at 0x556af177afe8>
#> - format: +l
#> - name: geometry
#> - flags: nullable
#> - metadata: List of 2
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:name    : chr "geoarrow.multilinestring"
#> - dictionary: NULL
#> - children[1]:
#>   <narrow_schema '+l' at 0x7fb661c0ab28>
#>   - format: +l
#>   - name: linestrings
#>   - flags: nullable
#>   - metadata: List of 4
#>   $ ARROW:extension:name    : chr "geoarrow.linestring"
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#>   $ ARROW:extension:name    : chr "geoarrow.linestring"
#>   - dictionary: NULL
#>   - children[1]:
#>     <narrow_schema '+w:3' at 0x7fb661c0ac68>
#>     - format: +w:3
#>     - name: vertices
#>     - flags: nullable
#>     - metadata: List of 4
#>   $ ARROW:extension:name    : chr "geoarrow.point"
#>   $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#>   $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#>   $ ARROW:extension:name    : chr "geoarrow.point"
#>     - dictionary: NULL
#>     - children[1]:
#>       <narrow_schema 'g' at 0x7fb661c0ada8>
#>       - format: g
#>       - name: xyz
#>       - flags: nullable
#>       - metadata:  list()
#>       - dictionary: NULL
#>       - children[0]:

Created on 2022-06-24 by the reprex package (v2.0.1)
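For reference when reading the raw metadata values above: ArrowSchema.metadata (and, from the output in this thread, apparently the geoarrow extension metadata values as well) is serialized as an int32 pair count followed by int32-length-prefixed keys and values. A sketch decoder (assuming little-endian, where the spec actually says native endianness) shows why raw [1:4] 00 00 00 00 is simply "zero pairs", and why duplicated keys can survive: the format is a flat list, not a map.

```python
import struct

def decode_c_metadata(buf):
    """Decode an Arrow C data interface metadata blob into (key, value)
    pairs: int32 pair count, then for each pair an int32-length-prefixed
    key and value. Duplicate keys are preserved, since the format is a
    flat list rather than a map."""
    pairs, pos = [], 4
    (n,) = struct.unpack_from("<i", buf, 0)
    for _ in range(n):
        (klen,) = struct.unpack_from("<i", buf, pos); pos += 4
        key = buf[pos:pos + klen]; pos += klen
        (vlen,) = struct.unpack_from("<i", buf, pos); pos += 4
        val = buf[pos:pos + vlen]; pos += vlen
        pairs.append((key, val))
    return pairs


print(decode_c_metadata(b"\x00\x00\x00\x00"))  # prints [] (the 4-byte 00 00 00 00 value)
```

The point metadata beginning 01 00 00 00 in the same output is consistent with one pair (key "crs", value the PROJCRS string shown in the pyarrow dump below).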

@rouault (Member, Author) commented Jun 25, 2022

> (Doubling the metadata isn't strictly a problem and it might be Arrow C++ that's doing it.)

Either Arrow C++ or the file itself; this can be reproduced with pyarrow alone. For Parquet, when not forcing WKB encoding, OGR just delegates to libarrow/libparquet to return the ArrowArray and ArrowSchema objects.

import pyarrow.parquet as pq
t = pq.read_table('nshn_water_line.parquet')
t.schema
[...]
geometry: list<linestrings: list<vertices: fixed_size_list<xyz: double>[3]>>
  child 0, linestrings: list<vertices: fixed_size_list<xyz: double>[3]>
      child 0, vertices: fixed_size_list<xyz: double>[3]
          child 0, xyz: double
      -- field metadata --
      ARROW:extension:name: 'geoarrow.point'
      ARROW:extension:metadata: 'crs�PROJCRS["NAD_1983_CSRS_20' + 924
      ARROW:extension:metadata: 'crs�PROJCRS["NAD_1983_CSRS_20' + 924
      ARROW:extension:name: 'geoarrow.point'
    -- field metadata --
    ARROW:extension:name: 'geoarrow.linestring'
    ARROW:extension:metadata: ''
    ARROW:extension:metadata: ''
    ARROW:extension:name: 'geoarrow.linestring'
  -- field metadata --
  ARROW:extension:metadata: ''
  ARROW:extension:name: 'geoarrow.multilinestring'

> I did get a crash when trying to read a bunch of large Parquet files sequentially

I tried looping over and over on the above file, and it's stable (RAM usage too):

from osgeo import ogr
ds = ogr.Open('nshn_water_line.parquet')
lyr = ds.GetLayer(0)
while True:
    stream = lyr.GetArrowStreamAsPyArrow()
    for record_batch in stream: pass
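Since the duplicated entries aren't harmful, nothing needs to change; but if a downstream consumer wanted to normalize them before re-exporting, a hypothetical helper (not part of GDAL, Arrow, or geoarrow) could be as simple as:

```python
def dedupe_metadata(pairs):
    """Drop exact duplicate (key, value) pairs, keeping first-seen order."""
    seen, out = set(), []
    for kv in pairs:
        if kv not in seen:
            seen.add(kv)
            out.append(kv)
    return out


pairs = [
    ("ARROW:extension:name", "geoarrow.linestring"),
    ("ARROW:extension:metadata", ""),
    ("ARROW:extension:metadata", ""),
    ("ARROW:extension:name", "geoarrow.linestring"),
]
print(dedupe_metadata(pairs))  # the two unique pairs, in original order
```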
