Skip to content

[python] Backtick quoting for identifiers, exists_batch optimization#7347

Merged
JingsongLi merged 12 commits intoapache:masterfrom
tub:python-streaming-1a-caching
Mar 10, 2026
Merged

[python] Backtick quoting for identifiers, exists_batch optimization#7347
JingsongLi merged 12 commits intoapache:masterfrom
tub:python-streaming-1a-caching

Conversation

@tub
Copy link
Contributor

@tub tub commented Mar 5, 2026

Summary

  • Add backtick quoting to Identifier for SQL-safe formatting
  • Add exists_batch() for bulk file existence checks
  • Expose read_changelog in manifest list manager for future streaming reads

Stacked PR series

This is PR 1a/5 in the Python streaming read series:

Test plan

  • flake8 passes on all changed files
  • python -m pytest passes (630/630, 9 pre-existing lance skips)
  • New tests: identifier_test.py, manifest_cache_test.py

- Add backtick quoting to Identifier for SQL-safe formatting
- Add ChangelogProducer enum to core_options
- Add exists_batch() for bulk file existence checks
- Add LRU caching to ManifestFileManager and ManifestListManager
- Add snapshot caching and traversal helpers to SnapshotManager
- Add cachetools dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub and others added 2 commits March 5, 2026 18:07
…tEquals

Extract shared base class for ManifestFileCacheTest and ManifestListCacheTest,
add _make_snapshot() helper, and fix deprecated assertEquals (removed in 3.12).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tub tub marked this pull request as draft March 6, 2026 11:25
tub and others added 2 commits March 6, 2026 12:21
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…efault-size tests

- Move shared cache-behaviour tests (second_read, disabled_when_zero)
  into _CacheBehaviourMixin so they run for both manager types without
  duplication
- Extract _EMPTY_ROW / _EMPTY_STATS module constants to reduce
  DataFileMeta boilerplate
- Remove test_default_cache_size tests (just assert constructor defaults)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 6, 2026
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 6, 2026
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 6, 2026
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub added a commit to tub/paimon that referenced this pull request Mar 6, 2026
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tub tub marked this pull request as ready for review March 6, 2026 12:44
tub and others added 2 commits March 6, 2026 14:00
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cachetools 7.x requires Python >=3.10 but the project supports 3.6+.
Drop info=True and explicit key= from @cachedmethod (both 7.x-only
features) while keeping the decorator itself (available since 4.x).

Replace cache_info()-based test assertions with unittest.mock spies on
file_io.new_input_stream, testing the actual caching effect without any
production code counters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

return results

def find_next_scannable(
Copy link
Contributor

@JingsongLi JingsongLi Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need it? Just reading the snapshot file to determine, does this really need to be optimized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped

@JingsongLi
Copy link
Contributor

Can you explain the specific function of Cache? It seems that streaming reading does not repeat reading files?

pyarrow>=6,<7; python_version < "3.8"
pyarrow>=16,<20; python_version >= "3.8"
pylance>=0.20,<1; python_version>="3.9"
pylance>=0.10,<1; python_version>="3.8" and python_version<"3.9"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a small commit to remove these deps. Please rebase master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

@tub tub force-pushed the python-streaming-1a-caching branch from dfb750d to 6af3313 Compare March 9, 2026 10:06
tub and others added 3 commits March 9, 2026 12:27
Remove LRU caches from ManifestFileManager, ManifestListManager, and
SnapshotManager — they have near-zero hit rates in practice (batch reads
create new manager instances; streaming reads see unique manifest names
per snapshot). Caching will be re-added in PR apache#7350 where streaming
actually benefits.

Remove find_next_scannable and get_snapshots_batch from SnapshotManager
as they have zero callers on this branch. They will be added where
needed in downstream PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert manifest_file_manager.py to upstream/master (caching split no longer needed)
- Restore original docstring for get_snapshot_by_id
- Revert assertEquals -> assertEqual drive-by change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tub tub marked this pull request as draft March 9, 2026 13:59
Dynamic bucket options were accidentally removed during rebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tub
Copy link
Contributor Author

tub commented Mar 9, 2026

Can you explain the specific function of Cache? It seems that streaming reading does not repeat reading files?

I introduced the cache when I was working on the catch-up streaming scenario - and only one of them ended up being used. I've removed them from here and will add them back if/when they're needed.

@tub tub marked this pull request as ready for review March 9, 2026 15:42
@tub
Copy link
Contributor Author

tub commented Mar 9, 2026

@JingsongLi thanks a lot for the review, sorry about some of the useless changes - they crept in from a bunch of performance testing I was doing at the end of the CLI work.

@tub tub marked this pull request as draft March 9, 2026 15:44
@tub tub changed the title [python] Add caching infrastructure and utilities [python] Backtick quoting for identifiers, exists_batch optimization Mar 9, 2026
@tub tub marked this pull request as ready for review March 9, 2026 16:19
tub added a commit to tub/paimon that referenced this pull request Mar 9, 2026
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks @tub

@JingsongLi JingsongLi merged commit 87a4c91 into apache:master Mar 10, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants