
BUG: Memory usage increases with subsequent reads to same data #207

Closed
joostmeulenbeld opened this issue Jan 19, 2023 · 4 comments · Fixed by #209
Labels
bug Something isn't working

Comments

@joostmeulenbeld

RAM usage keeps going up when loading the same geospatial file in a loop.

The example script below creates a geopackage of about 10MB and reads it many times into a GeoDataFrame using pyogrio. RAM usage goes up every iteration, even though the loaded GeoDataFrame goes out of scope at the end of each iteration. After about 500 reads, memory usage is ~10GB and keeps rising.

import geopandas as gpd
from shapely import points

# Write 100k points to geopackage; file is about 10MB
gpd.GeoSeries(points(range(100_000), 0), crs="EPSG:4326").to_file("points.gpkg", engine="pyogrio")

print("Read file")
for _ in range(1_000):
    gpd.read_file("points.gpkg", engine="pyogrio")

The following snippets did not show increasing RAM usage over time, so I conclude the leak is not in shapely or geopandas itself; fiona does not have the same problem:

# No RAM increase with these lines:
for _ in range(1_000):
    points(range(100_000), 0)
for _ in range(1_000):
    gpd.GeoSeries(points(range(100_000), 0), crs="EPSG:4326")
for _ in range(1_000):
    gpd.read_file("points.gpkg", engine="fiona")

Adding gc.collect() inside the loop also makes no difference.
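For anyone reproducing this, a small stdlib-only helper (my own sketch, not part of the original report; Unix-only, since it relies on the `resource` module) makes the growth visible by sampling peak RSS after each iteration:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size in MiB (ru_maxrss is KiB on Linux, bytes on macOS)."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    bytes_used = raw if sys.platform == "darwin" else raw * 1024
    return bytes_used / (1024 * 1024)

def sample_loop(fn, iterations):
    """Call fn() repeatedly, recording peak RSS (MiB) after each call."""
    samples = []
    for _ in range(iterations):
        fn()
        samples.append(peak_rss_mib())
    return samples

# e.g. sample_loop(lambda: gpd.read_file("points.gpkg", engine="pyogrio"), 50)
# On an affected pyogrio version the samples climb steadily; with a fixed
# version (or engine="fiona") they should plateau after the first iteration.
```

Note that peak RSS is monotonic, so a leak shows up as samples that keep climbing rather than leveling off.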

My actual use case reads a file too large to fit in memory in a loop, using bounding-box filters, and it shows the same problem.

Environment

  • Pop!_OS 22.04 LTS (Linux kernel 6.0.12-76060006-generic)
  • Python 3.10.8
  • pyogrio 0.5.0 (from PyPI)
  • shapely 2.0.0 (from PyPI)
  • geopandas 0.12.2 (from PyPI)
@brendan-ward brendan-ward added the bug Something isn't working label Jan 19, 2023
@brendan-ward brendan-ward changed the title Memory usage increases with subsequent reads to same data BUG: Memory usage increases with subsequent reads to same data Jan 19, 2023
@brendan-ward
Member

Thanks for the report and the good example, @joostmeulenbeld!

I can reproduce this on macOS 12.5 / M1, and also when using read_dataframe(...) directly. It is also reproducible for FlatGeobuf.

What is even more fun, I can reproduce this without reading any geometries or columns:

from pyogrio import read_dataframe

for _ in range(1_000):
    tmp = read_dataframe("points.gpkg", columns=[], read_geometry=False)

Likewise, I can reproduce it with reading only bounds:

from pyogrio import read_bounds

for _ in range(1_000):
    tmp = read_bounds("points.gpkg")

However, with Arrow I/O it runs without an apparent memory increase, though there is no columnar data other than geometry here:

for _ in range(1_000):
    tmp = read_dataframe("points.gpkg", read_geometry=False, use_arrow=True)

(reading the geometry fails with a WKB error; that needs to be investigated separately)

This at least narrows down a little bit where the issue may be coming from.

@jorisvandenbossche
Member

(it fails reading geometry with WKB error, need to investigate that separately)

I get the same error, and that's because we get all empty bytes (array([b'', b'', b'', ..., b'', b'', b''], dtype=object)) for that column. Since it reads correctly without Arrow, that seems to be a bug in GDAL.

@jorisvandenbossche
Member

I ran the read_dataframe("points.gpkg", columns=[], read_geometry=False) example under memray (https://github.com/bloomberg/memray) and got this report: https://gist.github.com/jorisvandenbossche/f379b7deb51984ed6bb6e1784918de5e#file-memray-flamegraph-test_pyogrio_memory-py-2962573-html (the gist doesn't render the HTML, so you need to download it)

That points to get_features: we indeed don't seem to destroy the features we get from OGR_L_GetNextFeature, while the docs explicitly state those need to be destroyed: https://gdal.org/api/vector_c_api.html#_CPPv420OGR_L_GetNextFeature9OGRLayerH (we only destroy features in the writing path)
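For reference, the ownership contract the GDAL docs describe looks roughly like this in the plain C API (a minimal sketch of the pattern, not pyogrio's actual Cython code; it assumes an already-opened layer handle and requires linking against GDAL):

```c
#include "ogr_api.h"  /* GDAL/OGR C API; build with `-lgdal` */

/* Iterate over a layer's features, destroying each one after use.
 * OGR_L_GetNextFeature transfers ownership of the feature to the
 * caller; skipping OGR_F_Destroy leaks one feature per iteration,
 * which matches the steady memory growth observed in this issue. */
static void consume_layer(OGRLayerH layer)
{
    OGRFeatureH feature;
    OGR_L_ResetReading(layer);
    while ((feature = OGR_L_GetNextFeature(layer)) != NULL) {
        /* ... read fields / geometry from the feature here ... */
        OGR_F_Destroy(feature);  /* caller owns the feature and must free it */
    }
}
```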

@jorisvandenbossche
Member

I have a fix at #209
