
BUG: Memory usage increases with subsequent reads to same data #207

Closed
joostmeulenbeld opened this issue Jan 19, 2023 · 4 comments · Fixed by #209
Labels
bug Something isn't working

Comments

@joostmeulenbeld

RAM usage keeps going up when loading the same geospatial file in a loop.

The example script below creates a geopackage of about 10MB and reads it many times into a GeoDataFrame using pyogrio. RAM usage goes up every iteration, even though the loaded GeoDataFrame goes out of scope at the end of each iteration. After about 500 reads, memory usage is ~10GB and keeps rising.

import geopandas as gpd
from shapely import points

# Write 100k points to geopackage; file is about 10MB
gpd.GeoSeries(points(range(100_000), 0), crs="EPSG:4326").to_file("points.gpkg", engine="pyogrio")

print("Read file")
for _ in range(1_000):
    gpd.read_file("points.gpkg", engine="pyogrio")

The following snippets did not show increasing RAM usage over time, so I conclude the leak is not in shapely or geopandas itself; fiona does not have the same problem:

# No RAM increase with these lines:
for _ in range(1_000):
    points(range(100_000), 0)
for _ in range(1_000):
    gpd.GeoSeries(points(range(100_000), 0), crs="EPSG:4326")
for _ in range(1_000):
    gpd.read_file("points.gpkg", engine="fiona")

Adding gc.collect() inside the loop also makes no difference.
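For anyone reproducing this, a small stdlib-only helper (my own sketch, not part of the original report; Unix-only, since it relies on the `resource` module) makes the growth visible by sampling peak RSS after each iteration:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size in MiB (ru_maxrss is KiB on Linux, bytes on macOS)."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    bytes_used = raw if sys.platform == "darwin" else raw * 1024
    return bytes_used / (1024 * 1024)

def sample_loop(fn, iterations):
    """Call fn() repeatedly, recording peak RSS (MiB) after each call."""
    samples = []
    for _ in range(iterations):
        fn()
        samples.append(peak_rss_mib())
    return samples

# e.g. sample_loop(lambda: gpd.read_file("points.gpkg", engine="pyogrio"), 50)
# On an affected pyogrio version the samples climb steadily; with a fixed
# version (or engine="fiona") they should plateau after the first iteration.
```

Note that peak RSS is monotonic, so a leak shows up as samples that keep climbing rather than leveling off.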

My actual use case reads a file too large to fit in memory in a loop, using bounding-box filters, and it shows the same problem.

Environment

  • Pop!_OS 22.04 LTS (Linux kernel 6.0.12-76060006-generic)
  • Python 3.10.8
  • pyogrio 0.5.0 (from PyPI)
  • shapely 2.0.0 (from PyPI)
  • geopandas 0.12.2 (from PyPI)
@brendan-ward brendan-ward added the bug Something isn't working label Jan 19, 2023
@brendan-ward brendan-ward changed the title Memory usage increases with subsequent reads to same data BUG: Memory usage increases with subsequent reads to same data Jan 19, 2023
@brendan-ward
Member

Thanks for the report and the good example, @joostmeulenbeld!

I can reproduce this on macOS 12.5 / M1, and also when using read_dataframe(...) directly. It is also reproducible for FlatGeobuf.

What is even more fun, I can reproduce this without reading any geometries or columns:

from pyogrio import read_dataframe

for _ in range(1_000):
    tmp = read_dataframe("points.gpkg", columns=[], read_geometry=False)

Likewise, I can reproduce it with reading only bounds:

from pyogrio import read_bounds

for _ in range(1_000):
    tmp = read_bounds("points.gpkg")

However, with Arrow I/O it runs without an apparent memory increase, though there is no columnar data other than geometry here:

for _ in range(1_000):
    tmp = read_dataframe("points.gpkg", read_geometry=False, use_arrow=True)

(reading the geometry fails with a WKB error; that needs to be investigated separately)

This at least narrows down a little bit where the issue may be coming from.

@jorisvandenbossche
Member

(it fails reading geometry with WKB error, need to investigate that separately)

I get the same error, and that's because we get all empty bytes (array([b'', b'', b'', ..., b'', b'', b''], dtype=object)) for that column. Since it reads correctly without Arrow, that seems to be a bug in GDAL.

@jorisvandenbossche
Member

I ran the read_dataframe("points.gpkg", columns=[], read_geometry=False) example under memray (https://github.com/bloomberg/memray) and got this report: https://gist.github.com/jorisvandenbossche/f379b7deb51984ed6bb6e1784918de5e#file-memray-flamegraph-test_pyogrio_memory-py-2962573-html (the gist doesn't render the HTML, so you need to download it)

That points to get_features: we indeed don't seem to destroy the features we get from OGR_L_GetNextFeature, while the docs explicitly state those need to be destroyed: https://gdal.org/api/vector_c_api.html#_CPPv420OGR_L_GetNextFeature9OGRLayerH (we only destroy features in the writing path)
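For reference, the ownership contract the GDAL docs describe looks roughly like this in the plain C API (a minimal sketch of the pattern, not pyogrio's actual Cython code; it assumes an already-opened layer handle and requires linking against GDAL):

```c
#include "ogr_api.h"  /* GDAL/OGR C API; build with `-lgdal` */

/* Iterate over a layer's features, destroying each one after use.
 * OGR_L_GetNextFeature transfers ownership of the feature to the
 * caller; skipping OGR_F_Destroy leaks one feature per iteration,
 * which matches the steady memory growth observed in this issue. */
static void consume_layer(OGRLayerH layer)
{
    OGRFeatureH feature;
    OGR_L_ResetReading(layer);
    while ((feature = OGR_L_GetNextFeature(layer)) != NULL) {
        /* ... read fields / geometry from the feature here ... */
        OGR_F_Destroy(feature);  /* caller owns the feature and must free it */
    }
}
```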

@jorisvandenbossche
Member

I have a fix at #209
