Conversation

@nuno-faria (Contributor)
Which issue does this PR close?

Rationale for this change

With large Parquet files, a non-negligible amount of time is spent reading the footer and page metadata. This overhead becomes more noticeable when the queries are relatively simple. With repeated scans over the same file, we can avoid it by caching the metadata so it is read only once.

In a benchmark using a large file (100M rows, 2 cols) and simple reads (select where k = ...), caching the Parquet metadata makes queries more than 10x faster.

Simple benchmark
use datafusion::common::Result;
use datafusion::prelude::*;
use tokio::time::Instant;

async fn create_file() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.sql(
        "
        COPY (
            SELECT 'k-' || i as k, i as v
            FROM generate_series(1, 100000000) t(i)
            ORDER BY k
        )
        TO 't.parquet'
        OPTIONS (MAX_ROW_GROUP_SIZE 131072, DATA_PAGE_ROW_COUNT_LIMIT 8192, DICTIONARY_ENABLED false);
    ",
    )
    .await?
    .collect()
    .await?;
    Ok(())
}

async fn bench(cache_metadata: bool) -> Result<f64> {
    let config = SessionConfig::new().with_target_partitions(1);
    let ctx = SessionContext::new_with_config(config);
    let options = ParquetReadOptions::new().cache_metadata(cache_metadata);
    ctx.register_parquet("t", "t.parquet", options).await?;

    let t = Instant::now();
    for _ in 0..1000 {
        ctx.sql("SELECT v FROM t where k = 'k-12345'")
            .await?
            .collect()
            .await?;
    }
    Ok(t.elapsed().as_secs_f64())
}

#[tokio::main]
async fn main() -> Result<()> {
    create_file().await?;

    let time_not_cached = bench(false).await?;
    println!("time_not_cached: {time_not_cached:.3}");

    let time_cached = bench(true).await?;
    println!("time_cached: {time_cached:.3}");

    println!("diff: {:.3}x faster", time_not_cached / time_cached);

    Ok(())
}
time_not_cached: 18.507
time_cached: 1.485
diff: 12.459x faster

The metadata cache is disabled by default. It can be turned on for a specific Parquet file using ParquetReadOptions:

let options = ParquetReadOptions::new().cache_metadata(true);
ctx.register_parquet("t", "t.parquet", options).await?;

It can also be enabled for all Parquet files, using the SQL API:

set datafusion.execution.parquet.cache_metadata = true;

The cache is automatically invalidated when the file changes.
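The invalidation check can be sketched in plain Rust. This is an illustrative model, not DataFusion's implementation: `ObjectMeta` here is a simplified stand-in for the object_store type of the same name, and the parsed Parquet metadata is modeled as a `String`. The idea is that the file's current metadata (size, modification time) is logically part of the cache key, so a stale entry behaves like a miss.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Simplified stand-in for object_store's ObjectMeta (illustrative only).
#[derive(Clone, PartialEq)]
struct ObjectMeta {
    path: String,
    last_modified: SystemTime,
    size: u64,
}

/// Toy metadata cache: path -> (ObjectMeta at insert time, parsed metadata).
struct MetadataCache {
    entries: HashMap<String, (ObjectMeta, String)>,
}

impl MetadataCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached metadata only if the file has not changed since it
    /// was cached; an entry with a stale ObjectMeta is treated as a miss.
    fn get(&self, meta: &ObjectMeta) -> Option<&String> {
        match self.entries.get(&meta.path) {
            Some((cached_meta, parsed)) if cached_meta == meta => Some(parsed),
            _ => None,
        }
    }

    fn put(&mut self, meta: ObjectMeta, parsed: String) {
        self.entries.insert(meta.path.clone(), (meta, parsed));
    }
}
```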

When the cache is enabled, the entire metadata will be read, including the page index, unless encryption is used:

// For now, page index does not work with encrypted files. See:
// https://github.com/apache/arrow-rs/issues/7629

This means it may not be worth enabling for single-file scans whose queries do not need the page index.

What changes are included in this PR?

  • Added cache_metadata to ParquetOptions (default = false).
  • Added cache_metadata to ParquetReadOptions (default = None).
  • Added CachedParquetFileReaderFactory and CachedParquetFileReader.
  • Added file_metadata_cache to CacheManager (default = DefaultFilesMetadataCache).
  • Added DefaultFilesMetadataCache.
  • Updated the ParquetFormat to call the CachedParquetFileReaderFactory when caching is enabled.
  • Updated the proto::ParquetOptions.
  • Added Parquet sqllogictests.
  • Added a unit test to cache/cache_unit.rs.

Are these changes tested?

Yes.

Are there any user-facing changes?

Added a new external configuration, but it is disabled by default.

@github-actions bot added labels: documentation (Improvements or additions to documentation), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), execution (Related to the execution crate), proto (Related to proto crate), datasource (Changes to the datasource crate) — Jul 29, 2025
@nuno-faria (Contributor Author) left a comment:

cc: @alamb

if let Some(lc) = &config.list_files_cache {
manager.list_files_cache = Some(Arc::clone(lc))
}
if let Some(mc) = &config.file_metadata_cache {
@nuno-faria (Contributor Author):

Here the file_metadata_cache is assigned to DefaultFilesMetadataCache if not provided. This makes it easier to enable metadata caching using just ParquetReadOptions or set datafusion.execution.parquet.cache_metadata = true.

async move {
let object_meta = &file_meta.object_meta;

// lookup if the metadata is already cached
@nuno-faria (Contributor Author):

If two workers call this at the same time for the same file, they will both independently read the metadata and then update the cache. There should be no problem with this, but pointing it out just in case.
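The lookup-read-insert pattern described above can be sketched in plain stdlib Rust (hypothetical simplified types; `read_metadata` stands in for parsing the Parquet footer). The slow read happens outside the lock, so two threads racing on the same file may both perform it, but they insert the same value, so the cache stays consistent:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Look up `path` in the cache; on a miss, read the metadata without holding
/// the lock (possibly duplicating work across threads) and insert it.
fn get_or_load(
    cache: &Arc<Mutex<HashMap<String, String>>>,
    path: &str,
    read_metadata: impl Fn(&str) -> String,
) -> String {
    if let Some(v) = cache.lock().unwrap().get(path) {
        return v.clone();
    }
    // Cache miss: the (slow) read happens outside the lock, so another
    // thread may be doing the same read concurrently. That is harmless:
    // both compute the same value and the last insert wins.
    let v = read_metadata(path);
    cache.lock().unwrap().insert(path.to_string(), v.clone());
    v
}
```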

| datafusion.execution.parquet.binary_as_string | false | (reading) If true, parquet reader will read columns of `Binary/LargeBinary` with `Utf8`, and `BinaryView` with `Utf8View`. Parquet files generated by some legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. |
| datafusion.execution.parquet.coerce_int96 | NULL | (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution. |
| datafusion.execution.parquet.bloom_filter_on_read | true | (reading) Use any available bloom filters when reading parquet files |
| datafusion.execution.parquet.cache_metadata | false | (reading) Whether or not to enable the caching of embedded metadata of Parquet files (footer and page metadata). Enabling it can offer substantial performance improvements for repeated queries over large files. By default, the cache is automatically invalidated when the underlying file is modified. |
@nuno-faria (Contributor Author):

The metadata cache is automatically invalidated when the file changes, but is never cleared. Is there a problem with this?

Contributor:

I do think it is a problem if we wanted to turn this feature on by default, which I do. However, I don't think we need to make any changes in this particular PR.

Contributor:

Filed an issue to add a limit to the cache here:

@alamb (Contributor) commented Jul 29, 2025

Thank you @nuno-faria - I plan to review this tomorrow

@alamb (Contributor) commented Jul 29, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing cache_parquet_metadata (1452333) to aab44fd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb (Contributor) commented Jul 29, 2025

🤖: Benchmark completed

Details

Comparing HEAD and cache_parquet_metadata
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ cache_parquet_metadata ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1919.02 ms │             1981.25 ms │     no change │
│ QQuery 1     │   742.67 ms │              697.90 ms │ +1.06x faster │
│ QQuery 2     │  1447.35 ms │             1409.45 ms │     no change │
│ QQuery 3     │   683.30 ms │              685.63 ms │     no change │
│ QQuery 4     │  1345.17 ms │             1400.96 ms │     no change │
│ QQuery 5     │ 14388.59 ms │            14681.57 ms │     no change │
│ QQuery 6     │  2032.70 ms │             2056.19 ms │     no change │
│ QQuery 7     │  1937.28 ms │             1881.23 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 24496.10ms │
│ Total Time (cache_parquet_metadata)   │ 24794.18ms │
│ Average Time (HEAD)                   │  3062.01ms │
│ Average Time (cache_parquet_metadata) │  3099.27ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          7 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ cache_parquet_metadata ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.32 ms │                2.35 ms │    no change │
│ QQuery 1     │    33.46 ms │               34.92 ms │    no change │
│ QQuery 2     │    80.99 ms │               81.22 ms │    no change │
│ QQuery 3     │   100.46 ms │              102.22 ms │    no change │
│ QQuery 4     │   581.35 ms │              583.97 ms │    no change │
│ QQuery 5     │   854.32 ms │              868.39 ms │    no change │
│ QQuery 6     │     2.28 ms │                2.26 ms │    no change │
│ QQuery 7     │    38.32 ms │               38.49 ms │    no change │
│ QQuery 8     │   874.65 ms │              860.28 ms │    no change │
│ QQuery 9     │  1190.75 ms │             1186.00 ms │    no change │
│ QQuery 10    │   268.25 ms │              257.62 ms │    no change │
│ QQuery 11    │   290.92 ms │              288.12 ms │    no change │
│ QQuery 12    │   894.98 ms │              875.13 ms │    no change │
│ QQuery 13    │  1226.26 ms │             1262.92 ms │    no change │
│ QQuery 14    │   804.08 ms │              800.86 ms │    no change │
│ QQuery 15    │   799.21 ms │              783.38 ms │    no change │
│ QQuery 16    │  1602.74 ms │             1613.03 ms │    no change │
│ QQuery 17    │  1604.27 ms │             1603.44 ms │    no change │
│ QQuery 18    │  2857.45 ms │             2868.90 ms │    no change │
│ QQuery 19    │    88.67 ms │               85.99 ms │    no change │
│ QQuery 20    │  1130.48 ms │             1167.02 ms │    no change │
│ QQuery 21    │  1279.87 ms │             1310.12 ms │    no change │
│ QQuery 22    │  2108.84 ms │             2188.13 ms │    no change │
│ QQuery 23    │  7472.56 ms │             7498.07 ms │    no change │
│ QQuery 24    │   447.29 ms │              441.37 ms │    no change │
│ QQuery 25    │   300.64 ms │              299.02 ms │    no change │
│ QQuery 26    │   448.18 ms │              443.82 ms │    no change │
│ QQuery 27    │  1538.50 ms │             1565.77 ms │    no change │
│ QQuery 28    │ 11865.32 ms │            11857.36 ms │    no change │
│ QQuery 29    │   540.12 ms │              518.18 ms │    no change │
│ QQuery 30    │   770.68 ms │              786.14 ms │    no change │
│ QQuery 31    │   809.41 ms │              816.07 ms │    no change │
│ QQuery 32    │  2403.90 ms │             2437.48 ms │    no change │
│ QQuery 33    │  3155.74 ms │             3181.40 ms │    no change │
│ QQuery 34    │  3305.36 ms │             3219.76 ms │    no change │
│ QQuery 35    │  1270.84 ms │             1225.82 ms │    no change │
│ QQuery 36    │   116.53 ms │              122.25 ms │    no change │
│ QQuery 37    │    51.59 ms │               51.78 ms │    no change │
│ QQuery 38    │   117.19 ms │              119.47 ms │    no change │
│ QQuery 39    │   195.50 ms │              196.23 ms │    no change │
│ QQuery 40    │    40.63 ms │               40.39 ms │    no change │
│ QQuery 41    │    37.28 ms │               39.29 ms │ 1.05x slower │
│ QQuery 42    │    32.78 ms │               32.98 ms │    no change │
└──────────────┴─────────────┴────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 53634.97ms │
│ Total Time (cache_parquet_metadata)   │ 53757.36ms │
│ Average Time (HEAD)                   │  1247.32ms │
│ Average Time (cache_parquet_metadata) │  1250.17ms │
│ Queries Faster                        │          0 │
│ Queries Slower                        │          1 │
│ Queries with No Change                │         42 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ cache_parquet_metadata ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │  98.34 ms │               98.10 ms │ no change │
│ QQuery 2     │  20.71 ms │               20.67 ms │ no change │
│ QQuery 3     │  33.51 ms │               32.46 ms │ no change │
│ QQuery 4     │  18.55 ms │               18.63 ms │ no change │
│ QQuery 5     │  50.93 ms │               49.12 ms │ no change │
│ QQuery 6     │  12.03 ms │               11.88 ms │ no change │
│ QQuery 7     │  86.58 ms │               87.56 ms │ no change │
│ QQuery 8     │  24.32 ms │               25.24 ms │ no change │
│ QQuery 9     │  55.10 ms │               54.33 ms │ no change │
│ QQuery 10    │  43.04 ms │               42.66 ms │ no change │
│ QQuery 11    │  11.05 ms │               11.10 ms │ no change │
│ QQuery 12    │  35.55 ms │               34.55 ms │ no change │
│ QQuery 13    │  25.88 ms │               26.07 ms │ no change │
│ QQuery 14    │   9.76 ms │                9.71 ms │ no change │
│ QQuery 15    │  18.81 ms │               19.07 ms │ no change │
│ QQuery 16    │  18.21 ms │               18.11 ms │ no change │
│ QQuery 17    │  96.93 ms │               96.61 ms │ no change │
│ QQuery 18    │ 189.35 ms │              193.90 ms │ no change │
│ QQuery 19    │  25.19 ms │               24.63 ms │ no change │
│ QQuery 20    │  31.73 ms │               31.87 ms │ no change │
│ QQuery 21    │ 143.68 ms │              144.51 ms │ no change │
│ QQuery 22    │  14.61 ms │               14.60 ms │ no change │
└──────────────┴───────────┴────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1063.85ms │
│ Total Time (cache_parquet_metadata)   │ 1065.38ms │
│ Average Time (HEAD)                   │   48.36ms │
│ Average Time (cache_parquet_metadata) │   48.43ms │
│ Queries Faster                        │         0 │
│ Queries Slower                        │         0 │
│ Queries with No Change                │        22 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

@Dandandan (Contributor):

🤖: Benchmark completed

Details

I think this doesn't show anything as it's not enabled by default? Should we enable it?

@alamb alamb changed the title feat: Cache Parquet metadata feat: Cache Parquet metadata in built in parquet reader Jul 30, 2025
@alamb (Contributor) commented Jul 30, 2025

🤖: Benchmark completed
Details

I think this doesn't show anything as it's not enabled by default? Should we enable it?

I made a PR to test here:

I don't think we can enable this cache by default unless it is memory limited. I will write more in my review (in progress)

@alamb (Contributor) left a comment:

Thank you so much @nuno-faria

I found this PR to be well designed, well implemented and well tested. 🏆

I tried this out, but I couldn't see it make a difference with datafusion-cli locally.

I expected the queries after the first run to complete almost instantaneously; however, I see them fetching a non-trivial amount of data over the network.

I think this is related to the fact that there is a separate path to retrieve statistics for ListingTable, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974

> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 3.409 seconds.

Since this cache is disabled by default and doesn't affect default performance, I think it would be fine to merge into main and then iterate on it there.

However, I think we should aim to enable this cache by default as part of the "great performance out of the box" philosophy.

To do so I think we would need:

  1. Some sort of upper memory limit on the parquet cache. We can get the in memory size for parquet metadata from https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaData.html#method.memory_size
  2. Use the cache for statistics as well

I will file a follow on ticket to explain these items too

cc @Ted-Jiang who I think contributed the first version of this cache and the cache infrastructure quite a while ago.

}

fn put(&self, _key: &Path, _value: Arc<FileMetadata>) -> Option<Arc<FileMetadata>> {
panic!("put in DefaultFilesMetadataCache is not supported, please use put_with_extra")
Contributor:

🤔 a panic like that is unfortunate -- maybe we should change the API so this function can return an error (in a follow on PR)

Contributor:

Preface: I'm a user looking forward to this work being implemented, so don't let anything said here impact the timeline for this PR.

As a user who has recently been looking into (seemingly) IO related performance using remote parquet backing listing tables, I came across the cache_unit.rs source the other day and similarly felt a bit concerned that methods in the default cache implementation code could panic. Obviously an error that can be handled is better than a panic; is this a situation that could be reasonably handled at compile time? This implementation makes use of only _with_extra implementations, but the DefaultListFilesCache makes use of only the normal get and put. Returning a Result would already be API breaking, perhaps the CacheAccessor trait could be broken into several parts indicating whether the implementer intends to support just get and put or the _with_extra variants.

Contributor:

I studied this code more carefully, and here is an alternate proposal:

What if we made the key, from the cache's perspective, a tuple of (&Path, &ObjectMetadata) and didn't use extra

I think that is more in the spirit of this cache where the ObjectMetadata is logically part of the key.

Contributor:

I tried it locally and it seemed possible

What do you think @nuno-faria ?

@nuno-faria (Contributor Author):

Yeah I think it makes sense, since the Extra is mandatory to ensure correct results in case the file changes.

@nuno-faria (Contributor Author):

@alamb It appears to work, but I had to introduce lifetimes to FileMetadataCache in order to pass &(&Path, &ObjectMeta). What about using just ObjectMeta as the key, since it already has the path embedded?

pub trait FileMetadataCache:
    CacheAccessor<ObjectMeta, Arc<dyn FileMetadata>, Extra = ObjectMeta>
{
}

...

fn get(&self, k: &ObjectMeta)

And what about the [get|put]_with_extra methods in this new version, should they also panic! or call get/put and ignore the Extra?

Contributor:

What about using just ObjectMeta as the key, since it already has the path embedded?

That is a brilliant idea!

And what about the [get|put]_with_extra methods in this new version, should they also panic! or call get/put and ignore the Extra?

I think they should ignore the Extra and call get/put instead
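The shape agreed on above can be sketched with default trait methods (using hypothetical simplified types, not DataFusion's actual CacheAccessor signature): the `_with_extra` variants simply ignore the extra argument and delegate to the plain `get`/`put`, instead of panicking.

```rust
use std::collections::HashMap;

/// Simplified sketch of a cache accessor trait where the `_with_extra`
/// variants delegate to `get`/`put` by default, ignoring the extra argument.
trait CacheAccessor<K, V> {
    type Extra;

    fn get(&self, k: &K) -> Option<V>;
    fn put(&mut self, k: K, v: V) -> Option<V>;

    // Default implementations: ignore `extra` and delegate, never panic.
    fn get_with_extra(&self, k: &K, _extra: &Self::Extra) -> Option<V> {
        self.get(k)
    }
    fn put_with_extra(&mut self, k: K, v: V, _extra: &Self::Extra) -> Option<V> {
        self.put(k, v)
    }
}

/// Minimal HashMap-backed implementation for illustration.
struct MapCache(HashMap<String, i32>);

impl CacheAccessor<String, i32> for MapCache {
    type Extra = ();

    fn get(&self, k: &String) -> Option<i32> {
        self.0.get(k).copied()
    }
    fn put(&mut self, k: String, v: i32) -> Option<i32> {
        self.0.insert(k, v)
    }
}
```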

Contributor:

The DefaultFileStatistics cache offers a caller the option of either checking for object modification or a presumably more performant option where there's no check (ostensibly this is a situation where the caller can reasonably assume the underlying files will not change). Is this something that could reasonably be implemented here, either now or in the future? If yes, it seems like it's important to ensure the key for the lookup can be reasonably built with little IO/latency overhead. Presumably this is why the other CacheAccessor implementations rely on Path since it's readily available.

I don't necessarily think this needs to be in the initial implementation of this metadata cache (again, I don't want to slow this PR down if possible), but it seems like having the option to always rely on the cached metadata when a caller can reasonably assume underlying objects will not change could be beneficial. Metadata requests for local files on a SSD/NVMe will be very fast, however when using remote storage of any sort even the head request to check modification incurs a relatively significant latency penalty compared to a blind cache lookup.

Contributor:

Is this something that could reasonably be implemented here, either now or in the future? If yes, it seems like it's important to ensure the key for the lookup can be reasonably built with little IO/latency overhead.

I think the way the API is structured now, via traits, it is possible to replace the default metadata file cache with one that ignores the last-updated time.

Also, I should point out that all this discussion only applies to people using the built-in ListingTable catalog in DataFusion -- if you provide your own TableProvider you can implement many more sophisticated caching techniques (I am working on a blog to explain this more: apache/datafusion-site#98)

/// (footer and page metadata). Enabling it can offer substantial performance improvements
/// for repeated queries over large files. By default, the cache is automatically
/// invalidated when the underlying file is modified.
pub cache_metadata: bool, default = false
Contributor:

Eventually, I think it would be better to have this be a size setting metadata_cache_size as then that can represent both disabled (0 size) and a memory cap.

We can do this in a follow on PR
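The size-capped design suggested above can be sketched in plain Rust (illustrative types only; in the real parquet crate the per-entry size would come from ParquetMetaData::memory_size). Oldest entries are evicted once the cap is exceeded, and a cap of 0 naturally means "disabled":

```rust
use std::collections::VecDeque;

/// Toy size-capped cache: entries are (path, reported memory size),
/// oldest first, evicted FIFO once the total exceeds the cap.
struct SizeCappedCache {
    cap_bytes: usize,
    used_bytes: usize,
    entries: VecDeque<(String, usize)>,
}

impl SizeCappedCache {
    fn new(cap_bytes: usize) -> Self {
        Self { cap_bytes, used_bytes: 0, entries: VecDeque::new() }
    }

    fn put(&mut self, path: String, memory_size: usize) {
        if memory_size > self.cap_bytes {
            return; // entry alone exceeds the cap (or cap is 0): never cached
        }
        self.entries.push_back((path, memory_size));
        self.used_bytes += memory_size;
        // Evict oldest entries until we are back under the cap.
        while self.used_bytes > self.cap_bytes {
            if let Some((_, evicted)) = self.entries.pop_front() {
                self.used_bytes -= evicted;
            }
        }
    }
}
```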

/// Implementation of [`ParquetFileReaderFactory`] supporting the caching of footer and page
/// metadata. Reads and updates the [`FileMetadataCache`] with the [`ParquetMetaData`] data.
/// This reader always loads the entire metadata (including page index, unless the file is
Contributor:

👍

let file_meta = self.file_meta.clone();
let metadata_cache = Arc::clone(&self.metadata_cache);

async move {
Contributor:

it is impressive that you worked out this API dance -- it is something I really don't like about the current API of the parquet reader.

BTW I am working on improving it (no changes needed or suggested here, I am just self-promoting):

/// Cache of file-embedded metadata, used to avoid reading it multiple times when processing a
/// data file (e.g., Parquet footer and page metadata).
/// If not provided, the [`CacheManager`] will create a [`DefaultFilesMetadataCache`].
pub file_metadata_cache: Option<FileMetadataCache>,
Contributor:

if it is already an option, why do we need a DefaultFilesMetadataCache? 🤔

Couldn't we just leave it as None?

@nuno-faria (Contributor Author):

My initial idea here was to make it easy to enable the metadata cache without having to provide a custom FileMetadataCache when setting up the runtime (default). This way, the user can simply call set datafusion.execution.parquet.cache_metadata = true; or enable it for a file with the ParquetReadOptions. But I don't know if there is a better approach (maybe removing the Option for the file_metadata_cache altogether?).

Contributor:

Ah, I understand this now and it makes sense

);
}

#[test]
Contributor:

😍


# Updating the file should invalidate the cache. Otherwise, the following queries would fail
# (e.g., with "Arrow: Parquet argument error: External: incomplete frame").
query I
Contributor:

amazing


@alamb (Contributor) commented Jul 31, 2025

🤖: Benchmark completed
Details

I think this doesn't show anything as it's not enabled by default? Should we enable it?

I made a PR to test here:

I don't think we can enable this cache by default unless it is memory limited. I will write more in my review (in progress)

Update -- clickbench benchmarks look quite promising for the already shorter queries:

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_default_to_on ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    34.32 ms │            28.63 ms │ +1.20x faster │
│ QQuery 2     │    82.60 ms │            73.41 ms │ +1.13x faster │
│ QQuery 3     │    98.83 ms │            89.38 ms │ +1.11x faster │
│ QQuery 4     │   632.89 ms │           584.87 ms │ +1.08x faster │
│ QQuery 7     │    38.56 ms │            33.38 ms │ +1.16x faster │
│ QQuery 10    │   264.93 ms │           245.34 ms │ +1.08x faster │
│ QQuery 11    │   293.89 ms │           276.05 ms │ +1.06x faster │
│ QQuery 13    │  1245.28 ms │          1087.92 ms │ +1.14x faster │
│ QQuery 19    │    86.36 ms │            80.57 ms │ +1.07x faster │
│ QQuery 40    │    44.35 ms │            40.65 ms │ +1.09x faster │
└──────────────┴─────────────┴─────────────────────┴───────────────┘

@shehabgamin (Contributor):

This is super exciting! 🚀

@alamb (Contributor) commented Jul 31, 2025

I also filed a ticket to track making count(*) queries faster: #17001

@alamb (Contributor) commented Jul 31, 2025

I have gathered follow on tasks in an epic:

@alamb (Contributor) left a comment:

Ok, I filed the follow on tasks I could see and thus I think this PR is ready to merge from my perspective

Thank you again @nuno-faria

I think we should wait another day or two in case anyone else has comments they would like to provide

I do think it would be good to sort out the panic

}

fn put(&self, _key: &Path, _value: Arc<FileMetadata>) -> Option<Arc<FileMetadata>> {
panic!("put in DefaultFilesMetadataCache is not supported, please use put_with_extra")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I studied this code more carefully, and here is an alternate proposal:

What if we made the key, from the cache's perspective, a tuple of (&Path, &ObjectMetadata) and didn't use extra

I think that is more in the spirit of this cache where the ObjectMetadata is logically part of the key.

}

fn put(&self, _key: &Path, _value: Arc<FileMetadata>) -> Option<Arc<FileMetadata>> {
    panic!("put in DefaultFilesMetadataCache is not supported, please use put_with_extra")
Contributor


I tried it locally and it seemed possible

What do you think @nuno-faria ?

@alamb
Contributor

alamb commented Aug 1, 2025

I thought some more about this PR last night and I wanted to suggest another idea: once we have added the memory limit to the cache, ALWAYS have the built-in parquet reader try to save items to the cache (and remove the current parquet config; only use the file metadata cache)

The more I think about this series of PRs the better I am feeling

@BlakeOrth
Contributor

BlakeOrth commented Aug 1, 2025

EDIT: I realized my testing methodology is flawed for these results, as using CREATE TABLE enabled other caching that isn't present when trying to access the remote URL directly. I'm leaving what I had here to avoid confusion, but implementing a get with no object meta requests does not seem to meaningfully alter performance.

I figured I'd put my money where my mouth was with regards to my comment here: #16971 (comment), specifically regarding the latency penalty.

I've done a very quick modification to this branch to implement the standard `get` method that omits any HEAD requests for object metadata. Results on a remote dataset can be seen below:

DataFusion CLI v49.0.0
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> CREATE EXTERNAL TABLE athena_partitioned
STORED AS PARQUET LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.319 seconds.

> select count(*) from athena_partitioned;
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.141 seconds.

> select count(*) from athena_partitioned;
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.159 seconds.

> select count(*) from athena_partitioned;
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.161 seconds.

@alamb are these results more in line with what you were expecting to see regarding your comment here?
#16971 (review)

@alamb
Contributor

alamb commented Aug 2, 2025

Thanks for checking @BlakeOrth

@nuno-faria -- how do you want to proceed with this PR? It would be great to avoid the panic in the initial PR, but since it is turned off by default I think it is ok to merge this PR as-is and then refactor the code in follow-on PRs

Update: i see the new code. I will review and merge

Contributor

@alamb alamb left a comment


Thanks again @nuno-faria and @BlakeOrth -- I think this looks great. Let's merge this one in and make additional changes as a follow on

@alamb alamb merged commit c37dd5e into apache:main Aug 2, 2025
28 checks passed
@nuno-faria nuno-faria deleted the cache_parquet_metadata branch August 2, 2025 11:30
Standing-Man pushed a commit to Standing-Man/datafusion that referenced this pull request Aug 4, 2025
* feat: Cache Parquet metadata

* Convert FileMetadata and FileMetadataCache to traits

* Use as_any to respect MSRV

* Use ObjectMeta as the key of FileMetadataCache
@alamb
Contributor

alamb commented Aug 4, 2025

Thanks @BlakeOrth


Labels

common: Related to common crate
core: Core DataFusion crate
datasource: Changes to the datasource crate
documentation: Improvements or additions to documentation
execution: Related to the execution crate
proto: Related to proto crate
sqllogictest: SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cache Parquet Metadata

5 participants