@PavithranRick
Collaborator

Describe the issue this Pull Request addresses

This PR adds large-scale testing coverage for the metadata table, focusing on performance and correctness when tables are partitioned by datestr. The motivation is to validate lookup behavior and scalability when using metadata partitions—specifically FILES and Column Stats—under realistic data distributions and query patterns.

Summary and Changelog

Summary
Introduce a large-scale metadata table test framework that:

  • Generates controlled data distributions across file groups
  • Populates column statistics deterministically
  • Benchmarks lookup performance using the Column Stats partition over date ranges

Changelog / TODO

  • Added large-scale metadata table tests with datestr partitioning
  • Enabled and validated FILES metadata partition at scale
  • Enabled and validated Column Stats metadata partition at scale
  • Implemented configurable data generation to control:
    • Column-to-file-group spread
    • Record distribution per partition
    • Column statistics characteristics (e.g., min/max, value skew)
  • Added lookup benchmarks for column-value predicates combined with date range filters
  • Collected and reported lookup latency metrics for Column Stats–based pruning
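The configurable data generation listed above could be sketched roughly as follows. This is a minimal illustration, not the framework added in this PR: the class `SyntheticColumnStats`, its `FileStats` type, and every parameter name are hypothetical, and it assumes datestr values use the ISO yyyy-MM-dd form. It produces per-file min/max statistics directly, without writing data files, and a fixed seed keeps runs repeatable.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: builds synthetic min/max stats per file group. */
final class SyntheticColumnStats {

    /** One stats row: a file in a datestr partition with min/max for one column. */
    static final class FileStats {
        final String datestr;
        final String fileName;
        final long min;
        final long max;

        FileStats(String datestr, String fileName, long min, long max) {
            this.datestr = datestr;
            this.fileName = fileName;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Generates stats for numPartitions consecutive days starting at startDate,
     * with fileGroupsPerPartition files per day. valueRange must be positive.
     * The same seed always yields the same stats.
     */
    static List<FileStats> generate(LocalDate startDate, int numPartitions,
                                    int fileGroupsPerPartition, int valueRange, long seed) {
        Random random = new Random(seed);
        List<FileStats> stats = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            String datestr = startDate.plusDays(p).toString(); // ISO yyyy-MM-dd
            for (int f = 0; f < fileGroupsPerPartition; f++) {
                // Each file covers a pseudo-random sub-range of [0, valueRange).
                int a = random.nextInt(valueRange);
                int b = random.nextInt(valueRange);
                stats.add(new FileStats(datestr, "file-" + p + "-" + f,
                        Math.min(a, b), Math.max(a, b)));
            }
        }
        return stats;
    }
}
```

Narrowing `valueRange` relative to the number of file groups makes ranges overlap more, which is one simple way to vary how many file groups a given column value can fall into.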

Impact

  • Improves confidence in metadata table scalability and performance characteristics
  • Provides measurable performance insights for Column Stats–based lookups over large date ranges
  • No user-facing API changes
  • Test-only impact; no change to production behavior

Risk Level

low

Changes are limited to test infrastructure and benchmarking.
Verification includes:

  • Running tests at large scale with multiple partitions and file groups
  • Validating correctness of lookup results against expected matches
  • Comparing lookup performance with and without Column Stats partition enabled
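One way to picture the correctness check is to compare the candidate files selected from min/max statistics against a brute-force scan of the same files. The sketch below is an assumption-laden illustration rather than the PR's test code: `PruningCheck`, `FileEntry`, and all method names are hypothetical, files are small in-memory arrays, and datestr values are assumed to be ISO yyyy-MM-dd strings so that string comparison matches date order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical check that stats-based pruning never drops a true match. */
final class PruningCheck {

    /** In-memory stand-in for one file: partition, values, and derived min/max. */
    static final class FileEntry {
        final String datestr;
        final String name;
        final long[] values;
        final long min;
        final long max;

        FileEntry(String datestr, String name, long... values) {
            this.datestr = datestr;
            this.name = name;
            this.values = values.clone();
            this.min = Arrays.stream(values).min().getAsLong();
            this.max = Arrays.stream(values).max().getAsLong();
        }
    }

    /** Inclusive date-range test; relies on ISO date strings sorting chronologically. */
    static boolean inDateRange(FileEntry f, String from, String to) {
        return f.datestr.compareTo(from) >= 0 && f.datestr.compareTo(to) <= 0;
    }

    /** Files that may contain the value, judged only by min/max. */
    static List<String> candidatesFromStats(List<FileEntry> files, String from, String to, long value) {
        List<String> out = new ArrayList<>();
        for (FileEntry f : files) {
            if (inDateRange(f, from, to) && f.min <= value && value <= f.max) {
                out.add(f.name);
            }
        }
        return out;
    }

    /** Files that really contain the value, found by reading every value. */
    static List<String> matchesFromFullScan(List<FileEntry> files, String from, String to, long value) {
        List<String> out = new ArrayList<>();
        for (FileEntry f : files) {
            if (inDateRange(f, from, to) && Arrays.stream(f.values).anyMatch(v -> v == value)) {
                out.add(f.name);
            }
        }
        return out;
    }

    /** Pruning is safe when every true match is still among the candidates. */
    static boolean pruningIsSafe(List<String> candidates, List<String> matches) {
        return candidates.containsAll(matches);
    }

    /** Wall-clock time of a task in nanoseconds, for rough with/without comparisons. */
    static long elapsedNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }
}
```

Min/max pruning can return false positives (a file whose range covers the value without containing it), so the check is containment of the true matches, not equality of the two lists.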

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@PavithranRick changed the title from "MDT Test framework without writing data files" to "feat: MDT Test framework without writing data files" on Dec 23, 2025
@github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Dec 23, 2025
// Query for column stats metadata for col0 (int) and col1 (long)
// Note: Use basePath as the table identifier for hudi_metadata function
// For Long values, they are stored in member2, not member1
String metadataSql = String.format(
Contributor

The lookup here is not really doing the pruning.
We might need to mimic what we do within ColumnStatsIndexSupport.
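For context on what "doing the pruning" generally means in min/max data skipping: each query predicate on a column is translated into a test on the file's min and max, and files that fail the test are skipped. The sketch below shows that translation for a few comparison operators. It is a generic illustration under stated assumptions, not the logic of ColumnStatsIndexSupport; `StatsPredicate`, `Op`, and `mayContain` are hypothetical names, and nulls are ignored.

```java
/** Hypothetical translation of a column predicate into a min/max test. */
final class StatsPredicate {

    /** Comparison operators covered by this sketch. */
    enum Op { EQ, GT, GTE, LT, LTE }

    /**
     * Returns true when a file whose column values lie in [min, max]
     * may contain a row satisfying "col op literal". A false result
     * means the file can be skipped safely.
     */
    static boolean mayContain(Op op, long literal, long min, long max) {
        switch (op) {
            case EQ:
                return min <= literal && literal <= max;
            case GT:
                return max > literal;
            case GTE:
                return max >= literal;
            case LT:
                return min < literal;
            case LTE:
                return min <= literal;
            default:
                throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }
}
```

A benchmark that only reads the stats rows without applying a test like this measures metadata read latency, not the effect of pruning on the set of files scanned.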

@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build
