feat: Cache Parquet metadata in built in parquet reader #16971
Conversation
nuno-faria
left a comment
cc: @alamb
if let Some(lc) = &config.list_files_cache {
    manager.list_files_cache = Some(Arc::clone(lc))
}
if let Some(mc) = &config.file_metadata_cache {
Here the file_metadata_cache is assigned to DefaultFilesMetadataCache if not provided. This makes it easier to enable metadata caching using just ParquetReadOptions or set datafusion.execution.parquet.cache_metadata = true.
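A minimal sketch of the fallback described here, using simplified stand-in types (the real CacheManager/CacheManagerConfig are more involved; only the Some/None fallback mirrors the behavior being discussed):

```rust
use std::sync::Arc;

// Simplified stand-ins for the real types; names are assumptions for illustration.
#[derive(Default)]
struct DefaultFilesMetadataCache;

struct CacheManagerConfig {
    file_metadata_cache: Option<Arc<DefaultFilesMetadataCache>>,
}

struct CacheManager {
    file_metadata_cache: Arc<DefaultFilesMetadataCache>,
}

fn build_manager(config: &CacheManagerConfig) -> CacheManager {
    // Use the provided cache if any, otherwise fall back to the default implementation.
    let file_metadata_cache = match &config.file_metadata_cache {
        Some(mc) => Arc::clone(mc),
        None => Arc::new(DefaultFilesMetadataCache::default()),
    };
    CacheManager { file_metadata_cache }
}

fn main() {
    let manager = build_manager(&CacheManagerConfig {
        file_metadata_cache: None,
    });
    let _default_cache = &manager.file_metadata_cache; // the default cache is in place
}
```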
async move {
    let object_meta = &file_meta.object_meta;

    // lookup if the metadata is already cached
If two workers call this at the same time for the same file, they will both independently read the metadata and then update the cache. There should be no problem with this, but pointing it out just in case.
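A generic sketch of why that race is benign (not the PR's code): both tasks can miss, load, and insert, and the later insert simply overwrites an equivalent value.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// If two tasks miss the cache at the same time, both will load the value and both
// will insert it; the second insert overwrites the first with an equivalent value,
// so correctness is preserved and the only cost is the duplicated load.
fn get_or_load(
    cache: &Mutex<HashMap<String, Arc<String>>>,
    key: &str,
    load: impl Fn() -> String,
) -> Arc<String> {
    if let Some(hit) = cache.lock().unwrap().get(key) {
        return Arc::clone(hit);
    }
    // The lock is released here, so another task may load the same key concurrently.
    let value = Arc::new(load());
    cache
        .lock()
        .unwrap()
        .insert(key.to_string(), Arc::clone(&value));
    value
}

fn main() {
    let cache = Mutex::new(HashMap::new());
    let metadata = get_or_load(&cache, "file.parquet", || "parsed footer".to_string());
    assert_eq!(metadata.as_str(), "parsed footer");
}
```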
| datafusion.execution.parquet.binary_as_string | false | (reading) If true, parquet reader will read columns of `Binary/LargeBinary` with `Utf8`, and `BinaryView` with `Utf8View`. Parquet files generated by some legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. |
| datafusion.execution.parquet.coerce_int96 | NULL | (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution. |
| datafusion.execution.parquet.bloom_filter_on_read | true | (reading) Use any available bloom filters when reading parquet files |
| datafusion.execution.parquet.cache_metadata | false | (reading) Whether or not to enable the caching of embedded metadata of Parquet files (footer and page metadata). Enabling it can offer substantial performance improvements for repeated queries over large files. By default, the cache is automatically invalidated when the underlying file is modified. |
The metadata cache is automatically invalidated when the file changes, but is never cleared. Is there a problem with this?
I do think it is a problem if we wanted to turn this feature on by default, which I do. However, I don't think we need to make any changes in this particular PR.
Filed an issue to add a limit to the cache here:
Thank you @nuno-faria - I plan to review this tomorrow
🤖: Benchmark completed
I think this doesn't show anything, as it's not enabled by default? Should we enable it?
I made a PR to test here: I don't think we can enable this cache by default unless it is memory limited. I will write more in my review (in progress)
alamb
left a comment
Thank you so much @nuno-faria
I found this PR to be well designed, well implemented and well tested. 🏆
I tried this out and I couldn't seem to see it make a difference with datafusion-cli locally for me.
I expected the queries after the first run to complete almost instantaneously; however, I see them fetching a non-trivial amount of data over the network.
I think this is related to the fact that there is a separate path to retrieve statistics for ListingTable, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 3.409 seconds.

Since this cache is disabled by default and doesn't affect performance, I think it would be fine to merge into main and then iterate from there.
However, I think we should aim to enable this cache by default as part of the "great performance out of the box" philosophy.
To do so I think we would need:
- Some sort of upper memory limit on the parquet cache. We can get the in-memory size of Parquet metadata from https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaData.html#method.memory_size (see the sketch after this list)
- Use the cache for statistics as well
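A rough sketch of the memory-limit idea from the first bullet, assuming a simple path-keyed map: `ParquetMetaData::memory_size()` is the parquet-crate method linked above, but the cache type, its names, and the reject-instead-of-evict policy are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;

/// Hypothetical size-bounded metadata cache: entries are only admitted while the
/// total reported memory stays under `limit_bytes`.
struct BoundedMetadataCache {
    entries: HashMap<String, Arc<ParquetMetaData>>,
    used_bytes: usize,
    limit_bytes: usize,
}

impl BoundedMetadataCache {
    fn new(limit_bytes: usize) -> Self {
        Self {
            entries: HashMap::new(),
            used_bytes: 0,
            limit_bytes,
        }
    }

    /// Returns true if the entry was cached, false if it would exceed the limit.
    fn try_put(&mut self, path: String, metadata: Arc<ParquetMetaData>) -> bool {
        let entry_size = metadata.memory_size();
        if self.used_bytes + entry_size > self.limit_bytes {
            return false; // a real implementation would likely evict older entries instead
        }
        self.used_bytes += entry_size;
        self.entries.insert(path, metadata);
        true
    }

    fn get(&self, path: &str) -> Option<Arc<ParquetMetaData>> {
        self.entries.get(path).cloned()
    }
}
```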
I will file a follow on ticket to explain these items too
cc @Ted-Jiang who I think contributed the first version of this cache and the cache infrastructure quite a while ago.
}

fn put(&self, _key: &Path, _value: Arc<FileMetadata>) -> Option<Arc<FileMetadata>> {
    panic!("put in DefaultFilesMetadataCache is not supported, please use put_with_extra")
🤔 a panic like that is unfortunate -- maybe we should change the API so this function can return an error (in a follow on PR)
Preface: I'm a user looking forward to this work being implemented, so don't let anything said here impact the timeline for this PR.
As a user who has recently been looking into (seemingly) IO-related performance using remote parquet backing listing tables, I came across the cache_unit.rs source the other day and similarly felt a bit concerned that methods in the default cache implementation could panic. Obviously an error that can be handled is better than a panic; is this a situation that could be reasonably handled at compile time? This implementation makes use of only _with_extra implementations, but the DefaultListFilesCache makes use of only the normal get and put. Returning a Result would already be API-breaking; perhaps the CacheAccessor trait could be broken into several parts indicating whether the implementer intends to support just get and put or the _with_extra variants.
I studied this code more carefully, and here is an alternate proposal:
What if we made the key, from the cache's perspective, a tuple of (&Path, &ObjectMetadata) and didn't use extra
I think that is more in the spirit of this cache where the ObjectMetadata is logically part of the key.
I tried it locally and it seemed possible
What do you think @nuno-faria ?
Yeah I think it makes sense, since the Extra is mandatory to ensure correct results in case the file changes.
@alamb It appears to work, but I had to introduce lifetimes to FileMetadataCache in order to pass &(&Path, &ObjectMeta). What about using just ObjectMeta as the key, since it already has the path embedded?
pub trait FileMetadataCache:
    CacheAccessor<ObjectMeta, Arc<dyn FileMetadata>, Extra = ObjectMeta>
{
}
...
fn get(&self, k: &ObjectMeta)

And what about the [get|put]_with_extra methods in this new version, should they also panic! or call get/put and ignore the Extra?
What about using just ObjectMeta as the key, since it already has the path embedded?
That is a brilliant idea!
And what about the [get|put]_with_extra methods in this new version, should they also panic! or call get/put and ignore the Extra?
I think they should ignore the Extra and call get/put instead
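A small sketch of that approach, using a simplified stand-in for the CacheAccessor trait (the real trait has a larger surface; names and types here are illustrative only): the _with_extra variants get default implementations that ignore the extra and delegate.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Simplified stand-in for the real CacheAccessor trait discussed above.
trait SimpleCacheAccessor<K, V> {
    type Extra;
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V) -> Option<V>;

    // The *_with_extra variants ignore the extra and delegate, instead of panicking.
    fn get_with_extra(&self, key: &K, _extra: &Self::Extra) -> Option<V> {
        self.get(key)
    }
    fn put_with_extra(&self, key: K, value: V, _extra: &Self::Extra) -> Option<V> {
        self.put(key, value)
    }
}

// Toy cache keyed by path, storing opaque bytes in place of real metadata.
struct MetadataCache {
    inner: Mutex<HashMap<String, Arc<Vec<u8>>>>,
}

impl SimpleCacheAccessor<String, Arc<Vec<u8>>> for MetadataCache {
    type Extra = String;
    fn get(&self, key: &String) -> Option<Arc<Vec<u8>>> {
        self.inner.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: String, value: Arc<Vec<u8>>) -> Option<Arc<Vec<u8>>> {
        self.inner.lock().unwrap().insert(key, value)
    }
}

fn main() {
    let cache = MetadataCache {
        inner: Mutex::new(HashMap::new()),
    };
    let extra = "ignored".to_string();
    cache.put_with_extra("file.parquet".to_string(), Arc::new(vec![1, 2, 3]), &extra);
    assert!(cache.get_with_extra(&"file.parquet".to_string(), &extra).is_some());
}
```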
The DefaultFileStatistics cache offers a caller the option of either checking for object modification or a presumably more performant option where there's no check (ostensibly this is a situation where the caller can reasonably assume the underlying files will not change). Is this something that could reasonably be implemented here, either now or in the future? If yes, it seems like it's important to ensure the key for the lookup can be reasonably built with little IO/latency overhead. Presumably this is why the other CacheAccessor implementations rely on Path since it's readily available.
I don't necessarily think this needs to be in the initial implementation of this metadata cache (again, I don't want to slow this PR down if possible), but it seems like having the option to always rely on the cached metadata when a caller can reasonably assume underlying objects will not change could be beneficial. Metadata requests for local files on a SSD/NVMe will be very fast, however when using remote storage of any sort even the head request to check modification incurs a relatively significant latency penalty compared to a blind cache lookup.
Is this something that could reasonably be implemented here, either now or in the future? If yes, it seems like it's important to ensure the key for the lookup can be reasonably built with little IO/latency overhead.
I think, with the way the API is structured now via traits, it is possible to replace the default metadata file cache with one that ignores the last-updated time.
Also, I should point out that all this discussion only applies to people using the built-in ListingTable catalog in DataFusion -- if you provide your own TableProvider you can implement many more sophisticated caching techniques (I am working on a blog to explain this more: apache/datafusion-site#98)
/// (footer and page metadata). Enabling it can offer substantial performance improvements
/// for repeated queries over large files. By default, the cache is automatically
/// invalidated when the underlying file is modified.
pub cache_metadata: bool, default = false
Eventually, I think it would be better to have this be a size setting metadata_cache_size as then that can represent both disabled (0 size) and a memory cap.
We can do this in a follow on PR
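For example, something along these lines in the same config style as the option quoted above (purely hypothetical, not part of this PR; the name metadata_cache_size and the 0-disables semantics are assumptions):

```rust
/// (reading) Maximum amount of memory used to cache Parquet metadata, in bytes.
/// 0 disables the cache; any positive value enables it with that memory cap.
pub metadata_cache_size: usize, default = 0
```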
/// Implementation of [`ParquetFileReaderFactory`] supporting the caching of footer and page
/// metadata. Reads and updates the [`FileMetadataCache`] with the [`ParquetMetaData`] data.
/// This reader always loads the entire metadata (including page index, unless the file is
👍
let file_meta = self.file_meta.clone();
let metadata_cache = Arc::clone(&self.metadata_cache);

async move {
it is impressive that you worked out this API dance -- it is something I really don't like about the current API of the parquet reader.
BTW I am working on improving it (no changes needed or suggested here, I am just self-promoting):
/// Cache of file-embedded metadata, used to avoid reading it multiple times when processing a
/// data file (e.g., Parquet footer and page metadata).
/// If not provided, the [`CacheManager`] will create a [`DefaultFilesMetadataCache`].
pub file_metadata_cache: Option<FileMetadataCache>,
if it is already an option, why do we need a DefaultFilesMetadataCache? 🤔
Couldn't we just leave it as None?
My initial idea here was to make it easy to enable the metadata cache without having to provide a custom FileMetadataCache when setting up the runtime (default). This way, the user can simply call set datafusion.execution.parquet.cache_metadata = true; or enable for a file with the ParquetReadOptions. But I don't know if there is a better approach (maybe removing the Option for the file_metadata_cache altogether?).
Ah, I understand this now and it makes sense
);
}

#[test]
😍
# Updating the file should invalidate the cache. Otherwise, the following queries would fail
# (e.g., with "Arrow: Parquet argument error: External: incomplete frame").
query I
amazing
Update -- clickbench benchmarks look quite promising for the already shorter queries:

This is super exciting! 🚀

I also filed a ticket to track making

I have gathered follow on tasks in an epic:
alamb
left a comment
Ok, I filed the follow on tasks I could see and thus I think this PR is ready to merge from my perspective
Thank you again @nuno-faria
I think we should wait another day or two in case anyone else has comments they would like to provide
I do think it would be good to sort out the panic
I thought some more about this PR last night and I wanted to suggest another idea: once we have added the memory limit to the cache, ALWAYS have the built-in parquet reader try to save items to the cache (and remove the current parquet config, only use the file metadata cache). The more I think about this series of PRs, the better I am feeling.
EDIT: I realized my testing methodology is flawed for these results, as using CREATE TABLE enabled other caching that isn't present when trying to access the remote URL directly. I'm leaving what I had here to avoid confusion, but implementing a
DataFusion CLI v49.0.0
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.
> CREATE EXTERNAL TABLE athena_partitioned
STORED AS PARQUET LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
0 row(s) fetched.
Elapsed 3.319 seconds.
> select count(*) from athena_partitioned;
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.141 seconds.
> select count(*) from athena_partitioned;
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.159 seconds.
> select count(*) from athena_partitioned;
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.161 seconds.
Thanks for checking @BlakeOrth
Update: I see the new code. I will review and merge.
alamb
left a comment
Thanks again @nuno-faria and @BlakeOrth -- I think this looks great. Let's merge this one in and make additional changes as a follow on
* feat: Cache Parquet metadata
* Convert FileMetadata and FileMetadataCache to traits
* Use as_any to respect MSRV
* Use ObjectMeta as the key of FileMetadataCache
Thanks @BlakeOrth
Which issue does this PR close?
Closes Cache Parquet Metadata #15582.
Related issues:
- ParquetOpener #15179
- datafusion-cli when reading from remote storage #16365

Rationale for this change
With large Parquet files, a non-negligible amount of time is spent reading the footer and page metadata. This overhead becomes more noticeable when the queries are relatively simple. With repeated scans over the same file, we can improve this by caching the metadata, so it is only read once.
With a benchmark using a large file (100M rows, 2 cols) and simple reads (select where k = ...), caching the Parquet metadata makes it more than 10x faster.

Simple benchmark
The metadata cache is disabled by default. It can be turned on for a specific Parquet file using ParquetReadOptions, or enabled for all Parquet files using the SQL API.
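As a rough sketch of both approaches (the ParquetReadOptions cache_metadata setter shown here is an assumption based on the change list below, not verified against the code; the SQL setting matches the one used elsewhere in this conversation):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Per-file: enable metadata caching when registering one Parquet file.
    // `cache_metadata` is the option added by this PR; the exact builder/field
    // syntax here is an assumption for illustration.
    let options = ParquetReadOptions::default().cache_metadata(true);
    ctx.register_parquet("hits", "data/hits.parquet", options)
        .await?;

    // Global: enable it for all Parquet files via the SQL API.
    ctx.sql("SET datafusion.execution.parquet.cache_metadata = true")
        .await?
        .collect()
        .await?;

    Ok(())
}
```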
The cache is automatically invalidated when the file changes.
When the cache is enabled, the entire metadata will be read, including the page index, unless using encryption:
datafusion/datafusion/datasource-parquet/src/opener.rs
Lines 146 to 147 in 94e8548
This means that it is not worth enabling it for single file scans whose query does not need the page index.
What changes are included in this PR?
- Added cache_metadata to ParquetOptions (default = false).
- Added cache_metadata to ParquetReadOptions (default = None).
- Added CachedParquetFileReaderFactory and CachedParquetFileReader.
- Added file_metadata_cache to CacheManager (default = DefaultFilesMetadataCache).
- Added DefaultFilesMetadataCache.
- Updated ParquetFormat to call the CachedParquetFileReaderFactory when caching is enabled.
- Updated proto::ParquetOptions.
- Updated cache/cache_unit.rs.

Are these changes tested?
Yes.
Are there any user-facing changes?
Added a new external configuration, but it is disabled by default.