[python] Backtick quoting for identifiers, exists_batch optimization#7347
JingsongLi merged 12 commits into apache:master
- Add backtick quoting to Identifier for SQL-safe formatting
- Add ChangelogProducer enum to core_options
- Add exists_batch() for bulk file existence checks
- Add LRU caching to ManifestFileManager and ManifestListManager
- Add snapshot caching and traversal helpers to SnapshotManager
- Add cachetools dependency
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
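As a rough illustration of what backtick quoting for SQL-safe formatting involves, here is a minimal sketch. The helper names `quote_identifier` and `quote_full_name` are hypothetical and are not taken from this PR; doubling an embedded backtick is an assumption borrowed from common SQL dialects, not a confirmed detail of the Identifier change.

```python
def quote_identifier(name: str) -> str:
    """Wrap one identifier part in backticks.

    Assumption: an embedded backtick is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def quote_full_name(database: str, table: str) -> str:
    """Quote each part separately so a dot inside a name is not read as a separator."""
    return quote_identifier(database) + "." + quote_identifier(table)


print(quote_full_name("my_db", "order.items"))  # `my_db`.`order.items`
```

Quoting the parts independently is what keeps a table name containing a dot from being parsed as database plus table.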
…tEquals Extract shared base class for ManifestFileCacheTest and ManifestListCacheTest, add _make_snapshot() helper, and fix deprecated assertEquals (removed in 3.12). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rim docs, remove ChangelogProducer
- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True) decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io, manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efault-size tests
- Move shared cache-behaviour tests (second_read, disabled_when_zero) into _CacheBehaviourMixin so they run for both manager types without duplication
- Extract _EMPTY_ROW / _EMPTY_STATS module constants to reduce DataFileMeta boilerplate
- Remove test_default_cache_size tests (just assert constructor defaults)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cachetools 7.x requires Python >=3.10 but the project supports 3.6+. Drop info=True and explicit key= from @cachedmethod (both 7.x-only features) while keeping the decorator itself (available since 4.x). Replace cache_info()-based test assertions with unittest.mock spies on file_io.new_input_stream, testing the actual caching effect without any production code counters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
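The spy-based testing approach described in this commit can be sketched as follows. This is only an illustration: `FakeFileIO` and `CachingReader` are hypothetical stand-ins, and a plain dict replaces the `@cachedmethod` decorator so the example runs with the standard library alone. Only the `new_input_stream` method name comes from the commit message.

```python
import io
from unittest import mock


class FakeFileIO:
    """Hypothetical stand-in for a file IO object; serves bytes from memory."""

    def __init__(self, files):
        self._files = files

    def new_input_stream(self, path):
        return io.BytesIO(self._files[path])


class CachingReader:
    """Hypothetical reader that memoizes file contents by path (dict, not cachetools)."""

    def __init__(self, file_io):
        self.file_io = file_io
        self._cache = {}

    def read(self, path):
        if path not in self._cache:
            with self.file_io.new_input_stream(path) as stream:
                self._cache[path] = stream.read()
        return self._cache[path]


file_io = FakeFileIO({"manifest-1": b"entries"})
reader = CachingReader(file_io)

# wraps= keeps the real behaviour while recording every call.
with mock.patch.object(
    file_io, "new_input_stream", wraps=file_io.new_input_stream
) as spy:
    first = reader.read("manifest-1")
    second = reader.read("manifest-1")

# Two reads but one underlying open: the second read came from the cache.
print(spy.call_count)  # 1
```

Asserting on the number of underlying opens tests the observable effect of caching, so the production class needs no hit or miss counters of its own.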
return results

def find_next_scannable(
Do we really need this? It only reads the snapshot file to decide; does that really need to be optimized?
Can you explain what the cache is specifically for? Streaming reads do not seem to read the same files repeatedly.
paimon-python/dev/requirements.txt (outdated)
pyarrow>=6,<7; python_version < "3.8"
pyarrow>=16,<20; python_version >= "3.8"
pylance>=0.20,<1; python_version>="3.9"
pylance>=0.10,<1; python_version>="3.8" and python_version<"3.9"
I did a small commit to remove these deps. Please rebase master.
dfb750d to 6af3313
Remove LRU caches from ManifestFileManager, ManifestListManager, and SnapshotManager — they have near-zero hit rates in practice (batch reads create new manager instances; streaming reads see unique manifest names per snapshot). Caching will be re-added in PR apache#7350 where streaming actually benefits. Remove find_next_scannable and get_snapshots_batch from SnapshotManager as they have zero callers on this branch. They will be added where needed in downstream PRs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
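The near-zero hit rate argument can be illustrated with a toy model. `CountingManager` below is a hypothetical class with a per-instance cache, not one of the managers touched by this PR, and the two loops only imitate the access patterns described in the commit message.

```python
class CountingManager:
    """Hypothetical manager with a per-instance cache and hit/miss counters."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[name] = "contents of " + name
        return self._cache[name]


# Batch pattern: each read builds a fresh manager, so its cache starts empty.
batch_hits = 0
for _ in range(3):
    manager = CountingManager()
    manager.read("manifest-a")
    batch_hits += manager.hits

# Streaming pattern: one long-lived manager, but every snapshot brings new names.
streaming = CountingManager()
for snapshot_id in range(3):
    streaming.read("manifest-%d" % snapshot_id)

print(batch_hits, streaming.hits)  # 0 0
```

In both patterns every lookup is a miss, which is why a cache scoped to the manager instance adds memory and code without saving any reads.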
- Revert manifest_file_manager.py to upstream/master (caching split no longer needed)
- Restore original docstring for get_snapshot_by_id
- Revert assertEquals -> assertEqual drive-by change
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dynamic bucket options were accidentally removed during rebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I introduced the caches while working on the catch-up streaming scenario, and only one of them ended up being used. I've removed them from here and will add them back if and when they're needed.

@JingsongLi thanks a lot for the review, and sorry about some of the useless changes; they crept in from a bunch of performance testing I was doing at the end of the CLI work.
Summary
- Add backtick quoting to Identifier for SQL-safe formatting
- Add exists_batch() for bulk file existence checks

Stacked PR series
This is PR 1a/5 in the Python streaming read series:
[python] Backtick quoting for identifiers, exists_batch optimization #7347 — caching infrastructure + utilities
[python] Add scanners, sharding, and row kind support #7348 — scanners, sharding, row kind
[python] Add consumer management for streaming progress #7349 — consumer management
[python] Add StreamReadBuilder and AsyncStreamingTableScan #7350 — StreamReadBuilder + AsyncStreamingTableScan
[python] Add paimon tail CLI for streaming table reads #7351 — paimon tail CLI

Test plan
- flake8 passes on all changed files
- python -m pytest passes (630/630, 9 pre-existing lance skips)
- identifier_test.py, manifest_cache_test.py
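For readers unfamiliar with the idea behind a bulk existence check, here is a minimal sketch under stated assumptions. It is not the exists_batch() implementation from this PR: the signature, the `max_workers` parameter, the dict return type, and the use of a local-filesystem thread pool are all hypothetical choices made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def exists_batch(paths, max_workers=8):
    """Return {path: bool} for every path, checking them concurrently.

    Assumption: local filesystem and a thread pool; a real file IO layer
    might instead use one bulk metadata call to its storage backend.
    """
    paths = list(paths)
    if not paths:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip pairs each path with its result.
        results = dict(zip(paths, pool.map(os.path.exists, paths)))
    return results


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "data-0.parquet")
    missing = os.path.join(tmp, "data-1.parquet")
    open(present, "w").close()
    found = exists_batch([present, missing])
    print(found[present], found[missing])  # True False
```

The benefit of batching grows with per-call latency, so it matters most on remote object stores where each individual existence check is a network round trip.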