-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of RFC 86: Column-oriented read API for vector layers #5922
Conversation
Aside from the above trivial typos, I've dug through this PR and can't spot any issues. (Admittedly there's a LOT of code here and it would be very easy to miss something 🙃 ) |
…ROW:extension:name
@paleolimbot @jorisvandenbossche Want to have a chance testing that before I merge? |
I will take a look today! The only thing I noticed the last time I looked was that |
I didn't get to this today but will take a look next week... That shouldn't keep your from merging this...I tested a number of files earlier and the things I plan to check are corner cases that are unlikely to affect most users. |
The only oddity I found when looking at the C arrays was that the extension metadata appears to be doubled when reading through the Parquet driver (but only for some files). See the 'details' bit below for how I was able to trigger that. (Doubling the metadata isn't strictly a problem and it might be Arrow C++ that's doing it. These files might be the only ones in existence that use a nested extension type anyway). I ran Arrow's ValidateFull() on arrays generated from gpkg, fgb, and Parquet reads and there didn't seem to be any problems. I did get a crash when trying to read a bunch of large Parquet files sequentially (they all pass ValidateFull individually)...I have a feeling that's a problem on my end not deleting something that I should be and running out of memory. The "bunch of large files" were all the Parquet and gpkg files listed here: https://github.com/paleolimbot/geoarrow-data/#geoarrow-data . # remotes::install_github("paleolimbot/rfc86")
library(rfc86)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(narrow)
library(geoarrow)
if (!file.exists("nshn_water_line.parquet")) {
curl::curl_download(
"https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.parquet",
"nshn_water_line.parquet"
)
}
# gdal seems to double the extension name/metadata?
schema_arrow <- open_dataset("nshn_water_line.parquet")$schema
narrow::as_narrow_schema(schema_arrow$geometry)
#> <narrow_schema '+l' at 0x556af0ef79f0>
#> - format: +l
#> - name: geometry
#> - flags: nullable
#> - metadata: List of 2
#> $ ARROW:extension:name : chr "geoarrow.multilinestring"
#> $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> - dictionary: NULL
#> - children[1]:
#> <narrow_schema '+l' at 0x7fb66060cf28>
#> - format: +l
#> - name: linestrings
#> - flags: nullable
#> - metadata: List of 2
#> $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> $ ARROW:extension:name : chr "geoarrow.linestring"
#> - dictionary: NULL
#> - children[1]:
#> <narrow_schema '+w:3' at 0x7fb66060d068>
#> - format: +w:3
#> - name: vertices
#> - flags: nullable
#> - metadata: List of 2
#> $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#> $ ARROW:extension:name : chr "geoarrow.point"
#> - dictionary: NULL
#> - children[1]:
#> <narrow_schema 'g' at 0x7fb66060d1a8>
#> - format: g
#> - name: xyz
#> - flags: nullable
#> - metadata: list()
#> - dictionary: NULL
#> - children[0]:
stream <- narrow::narrow_allocate_array_stream()
rfc86:::cpp_read_ogr_stream("nshn_water_line.parquet", stream)
#> $stream
#> <pointer: 0x556af036d340>
#> attr(,"class")
#> [1] "narrow_array_stream"
#>
#> $crs
#> [1] "PROJCS[\"NAD_1983_CSRS_2010_UTM_20_Nova_Scotia\",GEOGCS[\"GCS_North_American_1983_CSRS_2010\",DATUM[\"NAD83_Canadian_Spatial_Reference_System\",SPHEROID[\"GRS 1980\",6378137,298.257222101],AUTHORITY[\"EPSG\",\"6140\"]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",-63],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]"
schema_gdal <- narrow_array_stream_get_schema(stream)
schema_gdal$children$geometry
#> <narrow_schema '+l' at 0x556af177afe8>
#> - format: +l
#> - name: geometry
#> - flags: nullable
#> - metadata: List of 2
#> $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> $ ARROW:extension:name : chr "geoarrow.multilinestring"
#> - dictionary: NULL
#> - children[1]:
#> <narrow_schema '+l' at 0x7fb661c0ab28>
#> - format: +l
#> - name: linestrings
#> - flags: nullable
#> - metadata: List of 4
#> $ ARROW:extension:name : chr "geoarrow.linestring"
#> $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> $ ARROW:extension:metadata: raw [1:4] 00 00 00 00
#> $ ARROW:extension:name : chr "geoarrow.linestring"
#> - dictionary: NULL
#> - children[1]:
#> <narrow_schema '+w:3' at 0x7fb661c0ac68>
#> - format: +w:3
#> - name: vertices
#> - flags: nullable
#> - metadata: List of 4
#> $ ARROW:extension:name : chr "geoarrow.point"
#> $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#> $ ARROW:extension:metadata: raw [1:964] 01 00 00 00 ...
#> $ ARROW:extension:name : chr "geoarrow.point"
#> - dictionary: NULL
#> - children[1]:
#> <narrow_schema 'g' at 0x7fb661c0ada8>
#> - format: g
#> - name: xyz
#> - flags: nullable
#> - metadata: list()
#> - dictionary: NULL
#> - children[0]: Created on 2022-06-24 by the reprex package (v2.0.1) |
either Arrow C++ or the file itself, but that can be reproduced with pyarrow alone. For Parquet, if not forcing WKB encoding, OGR just delegates to libarrow/libparquet to return ArrowArray and ArrowSchema objects.
I tried looping over and over on the above file, and it's stable (RAM usage too)
|
Implements #5830.
Impacts:
Doc and tests updated