-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271
Conversation
Ideally there is also a test to cover this code path? But possibly there is no write support for a driver behaving this way? |
The fix for #272 can be verified manually using a sufficiently large (probably > 100 MB) OSM file from Geofabrik, e.g., Greece; it will return a count of 0 without the fix added in 31d7df3 from pyogrio import read_info
assert read_info('greece-latest.osm.pbf', layer='lines')['features'] > 0 It fixes both reading with |
We recommend the following to sidestep performance issues: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also mention use_arrow as an option. Based on my 1 test this is still not fast, but half as slow as without use_arrow=True. Or is it still too exprimental?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing different results for Alabama (https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf) reading the lines
layer:
local, with arrow: 4.49s
local, without arrow: 3.46s
remote, with arrow: 23.55s
remote, without arrow: 20.26s
times are greater for multipolygons
layer but follow similar trend locally but not for remote (likely because of much more network I/O, but it is remote enough from me that there could be huge variability in network I/O):
local, with arrow: 6.82s
local, without arrow: 5.50s
remote, with arrow: 33.85s
remote, without arrow: 39.06s
whereas for points, the first and fastest layer:
local, with arrow 1.12s
local, without arrow: 0.68s
remote, with arrow: 40.36s
remote, without arrow: 20.09s
I had expected the arrow option to be somewhat faster, but that doesn't seem to be the case for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... my timings are indeed different. They are consistently faster for use_arrow, and most of the time twice as fast.
Then again, the timings are also spectacularly longer for me for local files, most likely the difference between Windows and Linux... because CI tests also run a lot slower on Windows vs Linux :-(... To conclude, arrow seems less dramatically slow on Windows ;-).
from datetime import datetime
import itertools
import pyogrio
url_de_bw = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
path_de_bw = "C:/temp/baden-wuerttemberg-latest.osm.pbf"
url_us_al = "https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf"
path_us_al = "C:/temp/alabama-latest.osm.pbf"
urls = [path_de_bw, path_us_al]
layers = ["multipolygons", "points"]
for url, layer, use_arrow in itertools.product(urls, layers, [True, False]):
start = datetime.now()
pgdf = pyogrio.read_dataframe(
url, use_arrow=use_arrow, layer=layer, encoding="utf8"
)
print(f"url={url}, layer={layer}, use_arrow={use_arrow}: {datetime.now()-start}")
Results:
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:46.250570
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:40.812138
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:18.412509
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:24.491693
url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:16.943871
url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:30.859146
url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:07.560858
url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:08.725899
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS: I had to specify encoding="utf8"
to be able to run it with use_arrow=False
, probably because it is German data and they use quite some accents and stuff. Based on the following post and others, it seems osm data is always in "UTF-8":
https://help.openstreetmap.org/questions/5308/character-encoding
So I think it might be better to default the encoding
to "UTF-8" for .osm and .pbf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a fix to always set UTF-8
.
It looks like using the interleaved encoding option has an impact on performance:
With default (no interleaved reading):
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:12.460372
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:11.110123
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:08.757978
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:15.286818
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.517222
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:15.456438
url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:01.148906
url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:02.335114
with pyogrio.set_gdal_config_options({"OGR_INTERLEAVED_READING": True})
:
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:52.878035
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:01:05.556191
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:06.217804
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:04.641645
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.525850
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:05.703674
url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:00.852143
url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:00.724238
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... interleaved reading is consitently quite a bit faster... and often use_arrow=False
becomes faster than use_arrow=True
because of it. Kind of weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On windows, a similar pattern:
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:01:49.743053
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:01:51.280849
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:01:28.155260
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:03:56.258161
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:20.615849
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:15.530900
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:10.094106
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:28.546472
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:00:19.117897
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:00:19.194775
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:00:17.296830
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:00:35.047402
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:07.781301
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:07.374248
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:06.801308
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:09.139016
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a reason why interleaved isn't the default... maybe it is also useful to mention that it can improve performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing the same behaviour. It's indeed a bit strange that with the interleaved reading turned on, reading with arrow is slower than without. Might be worth reporting to GDAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice investigations!
# if the driver is OSM, we need to execute SQL to set the layer to read in | ||
# order to read it properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something to report to GDAL? Or is it expected we need to set this differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ogr2ogr
does this (which is how I figured out how to do this), so I think it is fair to say that GDAL knows about this and that it is intentional it is left for the user to call instead of being embedded in the driver. It is something we could suggest be added to the documentation for the driver, however.
Resolves #269, #272
GDAL will return -1 for some drivers if it has specifically disabled the ability to get a feature count (e.g., OSM driver).
Since we need to know this number in order to allocate arrays before reading features, we need to get this count in order to be able to read from certain drivers.