Reduce copies when reading files in pyio, match behavior of _io #129005

Open
@cmaloney

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from test_largefile run

$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all

Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
 ... process data size: 4.3G
 ... process data size: 4.7G
ok

----------------------------------------------------------------------
Ran 2 tests in 3.711s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS

Plan:

  1. Switch to os.readv() or os.readinto() so that _pyio reads into a caller-allocated buffer, as the C _Py_read used by _io does. os.read() cannot take a buffer to fill. This aligns the behavior of _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer, so it does not require adding a new os API. readv(2) mirrors the behavior and errors of read(2), so the end behavior should stay the same.
  2. Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer then adds the result of _pyio.FileIO.readall to it.
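A minimal sketch of plan item 1, assuming a platform where os.readv is available. The helper name readall_into and the grow_by parameter are hypothetical and are not part of _pyio; the point is only to show a read-until-EOF loop that fills one caller-allocated bytearray instead of joining chunks returned by os.read().

```python
import os


def readall_into(fd, grow_by=64 * 1024):
    """Read from fd until EOF into a single bytearray (illustrative only)."""
    try:
        size = os.fstat(fd).st_size
    except OSError:
        size = 0
    # Allocate one byte more than the reported size so the read that hits
    # EOF usually finds free space and the buffer does not need to grow.
    buf = bytearray(size + 1 if size > 0 else grow_by)
    filled = 0
    while True:
        if filled == len(buf):
            # The size estimate was too small (or unknown): grow the buffer.
            buf += bytearray(grow_by)
        # The view is released before the next resize, because a bytearray
        # cannot change size while a memoryview of it is still exported.
        with memoryview(buf) as view:
            n = os.readv(fd, [view[filled:]])
        if n == 0:  # EOF
            break
        filled += n
    del buf[filled:]  # trim the unused tail in place
    return buf
```

The sketch returns the bytearray itself. Converting it with bytes(...) would add back one copy of the data, which is the cost this plan is trying to avoid.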

For iterating, I'm using a small tracemalloc script to find where copies are:

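from _pyio import open

import tracemalloc

with open("README.rst", 'rb') as file:
    # Start tracing only after the file is open so that the snapshot
    # covers just the allocations made by read().
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()

# Group the traced allocations by source line, largest first.
stats = snap.statistics('lineno')
for stat in stats:
    print(stat)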
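To put a number on the difference rather than only locate the copies, the same approach can be extended to compare peak traced memory between the two implementations. This is a rough sketch: the helper names peak_read_bytes and compare are hypothetical, the 8 MiB temporary file is an arbitrary choice, and the ratio it reports will depend on the Python version being run.

```python
import io
import os
import tempfile
import tracemalloc

import _pyio


def peak_read_bytes(opener, path):
    """Return (bytes read, peak traced bytes) for one full read of path."""
    with opener(path, "rb") as f:
        tracemalloc.start()
        try:
            data = f.read()
            # get_traced_memory() returns (current, peak) since start().
            _, peak = tracemalloc.get_traced_memory()
        finally:
            tracemalloc.stop()
    return len(data), peak


def compare(path):
    """Measure the same read through the C and pure-Python implementations."""
    return {
        "_io": peak_read_bytes(io.open, path),
        "_pyio": peak_read_bytes(_pyio.open, path),
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for name, (size, peak) in compare(tmp.name).items():
            print(f"{name}: read {size} bytes, peak traced {peak} bytes")
    finally:
        os.unlink(tmp.name)
```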

Loose Ends

  • os.readv seems to be well supported but is currently guarded by a configure check. I'd like to make _pyio simply require readv, but can write conditional code if needed. If making readv non-optional in general is feasible, I'm happy to work on that.
    • os.readv is not supported on WASI, so conditional code is needed there.
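The conditional code mentioned above could look roughly like the following. This is a sketch under stated assumptions: read_into is a hypothetical helper name, os.readinto is taken from the plan text and only exists on newer Python versions, and the hasattr checks stand in for whatever guard the real implementation would use.

```python
import os


def read_into(fd, view):
    """Fill the writable buffer `view` from fd; return bytes read (0 at EOF)."""
    if hasattr(os, "readinto"):
        # Newer Pythons only: read straight into the caller's buffer.
        return os.readinto(fd, view)
    if hasattr(os, "readv"):
        # Guarded by a configure check; reported as unavailable on WASI.
        return os.readv(fd, [view])
    # Fallback: os.read() allocates a new bytes object, so this path
    # pays for one extra copy of the data.
    chunk = os.read(fd, len(view))
    view[:len(chunk)] = chunk
    return len(chunk)
```

Only the last branch allocates a temporary bytes object, so platforms without readv would keep today's memory behavior while everything else gains the single-buffer read.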

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)
