feat: support `system.parquet_files` table #25002

hiltontj · 2024-05-14T21:05:56Z

This PR is intended to follow #25000.

Extend system table support with system.parquet_files, a new system table that lists the parquet files associated with a given database and table.

Currently, the query only looks back over the last 1,000 segments to find associated parquet files. We may want to adjust that limit, or make it more flexible.

Two unit tests were added to query_executor for this, instead of E2E integration tests, as more control was needed over file persistence than what the E2E test harness provides.

Example

The system.parquet_files table can be queried like so:

SELECT * FROM system.parquet_files WHERE table_name = 'cpu'

And responds accordingly:

+------------+-------------------------------------------------+------------+-----------+---------------------+---------------------+
| table_name | path                                            | size_bytes | row_count | min_time            | max_time            |
+------------+-------------------------------------------------+------------+-----------+---------------------+---------------------+
| cpu        | dbs/foo/cpu/2024-05-14T20-53/4294967294.parquet | 1775       | 18        | 1715720025774406000 | 1715720035952490000 |
+------------+-------------------------------------------------+------------+-----------+---------------------+---------------------+

Other Changes

Another change in this PR was to go back to using dynamic dispatch for passing around the Persister. The associated types and generics made using the Persister a pain in the ParquetFilesTable implementation, so I removed those in favour of using Arc<dyn Persister>, as we had been doing previously (see 42d36d5).
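To illustrate the trade-off described above, here is a minimal sketch of holding the persister as a trait object. The `Persister` trait shown is a hypothetical stand-in with a single made-up method (`list_parquet_paths`), and `InMemoryPersister` and the paths are invented for the example; none of this is the real influxdb3_write API. The point is only that `ParquetFilesTable` needs no type parameters when it stores an `Arc<dyn Persister>`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real `Persister` trait. The method is
// illustrative only and is NOT the actual influxdb3_write API.
trait Persister: Send + Sync {
    fn list_parquet_paths(&self, table_name: &str) -> Vec<String>;
}

// Made-up implementation that keeps (table_name, path) pairs in memory.
struct InMemoryPersister {
    paths: Vec<(String, String)>,
}

impl Persister for InMemoryPersister {
    fn list_parquet_paths(&self, table_name: &str) -> Vec<String> {
        self.paths
            .iter()
            .filter(|(table, _)| table.as_str() == table_name)
            .map(|(_, path)| path.clone())
            .collect()
    }
}

// With dynamic dispatch the table type carries no generic parameters,
// so code that stores or passes it around does not need them either.
struct ParquetFilesTable {
    persister: Arc<dyn Persister>,
}

impl ParquetFilesTable {
    fn new(persister: Arc<dyn Persister>) -> Self {
        Self { persister }
    }

    fn paths_for(&self, table_name: &str) -> Vec<String> {
        self.persister.list_parquet_paths(table_name)
    }
}

fn main() {
    let persister: Arc<dyn Persister> = Arc::new(InMemoryPersister {
        paths: vec![
            ("cpu".to_string(), "dbs/foo/cpu/example.parquet".to_string()),
            ("mem".to_string(), "dbs/foo/mem/example.parquet".to_string()),
        ],
    });
    let table = ParquetFilesTable::new(persister);
    println!("{:?}", table.paths_for("cpu"));
}
```

The cost is a virtual call per method invocation, which is unlikely to matter for a system table that is queried occasionally.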

Closes #24988

A shell for the `system` table provider was added to the QueryExecutorImpl. It currently does not do anything, but will enable us to tie the different system table providers into it. The QueryLog was elevated from the `Database`, i.e., namespace provider, to the QueryExecutorImpl, so that it lives across queries.

The system.queries table is now accessible when queries are initiated in debug mode. Debug mode is not currently enabled via the HTTP API, so for now the table is only accessible via the gRPC interface. The system.queries table lists all queries in the QueryLog on the QueryExecutorImpl.

pauldix

The SegmentState in the write buffer actually has the information for the last 1,000 persisted segments and their parquet files: https://github.com/influxdata/influxdb/blob/main/influxdb3_write/src/write_buffer/segment_state.rs#L45. This is populated on startup.

It may be a good idea to pull the parquet files information from there, rather than the persister/object store. When the buffer gets the ability to persist some table data ahead of the segment rolling over, that state will be kept there in memory and won't show up in a segment info file until after the segment rolls over.

I think I'll be refactoring the structure of SegmentState when doing that. So maybe wait until after that to change this?

hiltontj · 2024-05-17T15:20:27Z

It may be a good idea to pull the parquet files information from [the SegmentState]

That makes sense and seems simpler. Giving the system table provider read access to the SegmentState shouldn't be too hard, and reading the PersistedSegments off of that should be much cheaper than how I'm doing it here. In fact, by making this API public, I should be able to get them directly:

influxdb/influxdb3_write/src/write_buffer/segment_state.rs

Lines 216 to 232 in 2381cc6

    
    pub(crate) fn get_parquet_files(
        &self,
        database_name: &str,
        table_name: &str,
    ) -> Vec<ParquetFile> {
        let mut parquet_files = vec![];
        for segment in self.persisted_segments.values() {
            segment.databases.get(database_name).map(|db| {
                db.tables.get(table_name).map(|table| {
                    parquet_files.extend(table.parquet_files.clone());
                })
            });
        }
        parquet_files
    }

I can at least try that out and flag if there is an issue with doing so.
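For reference, a self-contained sketch of the same lookup, written with iterator adapters instead of nested `Option::map` calls. The struct definitions below are simplified, hypothetical stand-ins that only mirror the field names used in the snippet above (the segment key type, `row_count`, and the sample paths are assumptions); they are not the real types from segment_state.rs.

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical, simplified stand-ins for the real types. Only the field
// names used by `get_parquet_files` are taken from the quoted snippet.
#[derive(Clone, Debug, PartialEq)]
struct ParquetFile {
    path: String,
    row_count: u64,
}

#[derive(Default)]
struct TableFiles {
    parquet_files: Vec<ParquetFile>,
}

#[derive(Default)]
struct DatabaseTables {
    tables: HashMap<String, TableFiles>,
}

#[derive(Default)]
struct PersistedSegment {
    databases: HashMap<String, DatabaseTables>,
}

#[derive(Default)]
struct SegmentState {
    // Assumption: segments keyed by a numeric id, iterated in key order.
    persisted_segments: BTreeMap<u32, PersistedSegment>,
}

impl SegmentState {
    // Collect every parquet file for one table across all persisted segments.
    // Segments that lack the database or the table are skipped.
    fn get_parquet_files(&self, database_name: &str, table_name: &str) -> Vec<ParquetFile> {
        self.persisted_segments
            .values()
            .filter_map(|segment| segment.databases.get(database_name))
            .filter_map(|db| db.tables.get(table_name))
            .flat_map(|table| table.parquet_files.iter().cloned())
            .collect()
    }
}

// Build a small state with two segments, each holding one file for foo.cpu.
// The paths are made up for the example.
fn sample_state() -> SegmentState {
    let mut state = SegmentState::default();
    for (segment_id, path) in [
        (1u32, "dbs/foo/cpu/seg-1.parquet"),
        (2u32, "dbs/foo/cpu/seg-2.parquet"),
    ] {
        let segment = state.persisted_segments.entry(segment_id).or_default();
        let db = segment.databases.entry("foo".to_string()).or_default();
        let table = db.tables.entry("cpu".to_string()).or_default();
        table.parquet_files.push(ParquetFile {
            path: path.to_string(),
            row_count: 18,
        });
    }
    state
}

fn main() {
    let state = sample_state();
    for file in state.get_parquet_files("foo", "cpu") {
        println!("{} ({} rows)", file.path, file.row_count);
    }
}
```

The behaviour is intended to match the quoted function: a missing database or table contributes no files rather than producing an error.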

hiltontj · 2024-05-17T16:37:40Z

@pauldix - after taking a quick look at using the SegmentState directly, I found it gets a bit messy with the generics on SegmentState. Its (truncated) definition is:

// T -> TimeProvider
// W -> Wal
struct SegmentState<T, W> {
    /* ... */
}

It gets messy because whatever I need from the write buffer has to be accessed via the Bufferer trait, so returning a type with generics from that trait becomes a nuisance. Getting the persister was easy because I could just put a persister method on Bufferer:

trait Bufferer {
    /* ... */
    fn persister(&self) -> Arc<dyn Persister>;
}

If you are re-thinking the segment state, it may be helpful to have a trait, e.g.,

trait SegmentStateProvider {
    fn get_parquet_files(&self) -> Vec<ParquetFile>;

    /* Other methods you are considering for this API */
}

Then that would be more easily extracted via the Bufferer trait and passed to the SystemTableProvider as Arc<dyn SegmentStateProvider>.
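A rough sketch of how that could fit together, assuming a `segment_state` accessor on `Bufferer` and placeholder fields on `SegmentState`, `WriteBuffer`, and `SystemTableProvider`. All of these shapes are hypothetical and exist only to show the generics being erased behind `Arc<dyn SegmentStateProvider>`:

```rust
use std::sync::Arc;

#[derive(Clone, Debug, PartialEq)]
struct ParquetFile {
    path: String, // placeholder field for the example
}

// The proposed object-safe view over the segment state.
trait SegmentStateProvider: Send + Sync {
    fn get_parquet_files(&self) -> Vec<ParquetFile>;
}

// T -> TimeProvider, W -> Wal. The fields are placeholders.
#[allow(dead_code)]
struct SegmentState<T, W> {
    time_provider: T,
    wal: W,
    files: Vec<ParquetFile>,
}

impl<T: Send + Sync, W: Send + Sync> SegmentStateProvider for SegmentState<T, W> {
    fn get_parquet_files(&self) -> Vec<ParquetFile> {
        self.files.clone()
    }
}

// Hypothetical accessor: the trait returns a trait object, so `Bufferer`
// itself never mentions T or W.
trait Bufferer {
    fn segment_state(&self) -> Arc<dyn SegmentStateProvider>;
}

struct WriteBuffer<T, W> {
    state: Arc<SegmentState<T, W>>,
}

impl<T, W> Bufferer for WriteBuffer<T, W>
where
    T: Send + Sync + 'static,
    W: Send + Sync + 'static,
{
    fn segment_state(&self) -> Arc<dyn SegmentStateProvider> {
        // The concrete Arc coerces to the trait object at the return site.
        self.state.clone()
    }
}

// The consumer only ever sees the trait object.
struct SystemTableProvider {
    segment_state: Arc<dyn SegmentStateProvider>,
}

impl SystemTableProvider {
    fn parquet_file_count(&self) -> usize {
        self.segment_state.get_parquet_files().len()
    }
}

fn main() {
    // Unit types stand in for a real time provider and WAL.
    let buffer = WriteBuffer {
        state: Arc::new(SegmentState {
            time_provider: (),
            wal: (),
            files: vec![ParquetFile { path: "example.parquet".to_string() }],
        }),
    };
    let provider = SystemTableProvider {
        segment_state: buffer.segment_state(),
    };
    println!("{} parquet file(s)", provider.parquet_file_count());
}
```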

I can leave this as is for now. Since you're planning to refactor, I don't want to make that more difficult with changes here.

mgattozzi

This LGTM code wise. I'll defer to @pauldix though if there's anything else we need to get into this PR, but from a code perspective it looks great.

hiltontj · 2024-07-24T15:04:28Z

Planning to wait on #25144 and any related refactoring to the influxdb3_write crate before revisiting this.

hiltontj · 2024-08-07T20:01:47Z

Closing in favour of #25225

hiltontj added 15 commits May 9, 2024 15:56

chore: initial sync changes

be11052

fix: correct field type for retention policies

f637f8c

fix: test in wal

72d825d

refactor: use TokioDatafusionConfig for server setup

0fd3f31

chore: clippy

c5b7fba

fix: gRPC test broken after addition of system.queries table

5c1b2d4

Merge branch 'main' into hiltontj/system-tables

afdd5ba

Merge branch 'hiltontj/system-tables' into hiltontj/system-queries-table

df41e99

test: test system.queries table via gRPC

fa63a8e

refactor: clean up test for system queries table

e3aa631

refactor: naming on const in query executor

0b87130

refactor: expose system tables by default in edge/pro

8fc3537

feat: support system.parquet_files table

17d5bec

hiltontj added the v3 label May 14, 2024

hiltontj self-assigned this May 14, 2024

hiltontj commented May 15, 2024

Review comment on influxdb3_write/src/lib.rs (outdated, resolved)

test: added test for system.parquet_files table

a990259

hiltontj force-pushed the hiltontj/sys-tbl-parquet-files branch from 8bcfeec to 457e7d1 Compare May 15, 2024 18:25

test: add test for missing table_name to system.parquet_files queries

f85fa78

hiltontj force-pushed the hiltontj/sys-tbl-parquet-files branch from 457e7d1 to f85fa78 Compare May 15, 2024 18:41

hiltontj added 8 commits May 15, 2024 14:49

chore: switch to core rev instead of branch

8998e59

Merge branch 'main' into hiltontj/system-tables

d101824

Merge branch 'hiltontj/system-tables' into hiltontj/system-queries-table

b9044cd

Merge branch 'hiltontj/system-queries-table' into hiltontj/system-tables-no-debug

f94c342

Merge branch 'hiltontj/system-tables-no-debug' into hiltontj/sys-tbl-parquet-files

d2b1cd5

Merge branch 'hiltontj/system-tables' into hiltontj/system-queries-table

c4430cf

Merge branch 'hiltontj/system-queries-table' into hiltontj/system-tables-no-debug

788b73d

Merge branch 'hiltontj/system-tables-no-debug' into hiltontj/sys-tbl-parquet-files

77965ec

hiltontj mentioned this pull request May 16, 2024

Support system tables #24972

Open

hiltontj requested review from pauldix and mgattozzi May 16, 2024 13:29

hiltontj marked this pull request as ready for review May 16, 2024 13:29

pauldix reviewed May 17, 2024

Base automatically changed from hiltontj/system-tables-no-debug to main May 17, 2024 16:39

Merge branch 'main' into hiltontj/sys-tbl-parquet-files

9171bbf

mgattozzi approved these changes Jun 5, 2024

hiltontj closed this Aug 7, 2024

hiltontj deleted the hiltontj/sys-tbl-parquet-files branch August 7, 2024 20:01

hiltontj mentioned this pull request Aug 7, 2024

feat: add system.parquet_files table #25225

Merged
