turbot · johnsmyth · Jun 20, 2025 · Jun 18, 2025 · Jun 18, 2025 · Jun 18, 2025
diff --git a/docs/collect/configure.md b/docs/collect/configure.md
@@ -53,14 +53,14 @@ Tailpipe uses [hive partitioning](https://duckdb.org/docs/data/partitioning/hive
 
   - The data is written to Parquet files in the workspace directory, with a prescribed directory and filename structure.  Each partition is written to a separate directory.
 
-  - For [custom tables](/docs/collect/custom-tables), you can define a `tp_index` column on which to index.  For tables implemented by plugins, the index is not *user*-definable. Be aware that defining a `tp_index` does not always increase performance and may, in fact, decrease it as it can result in many small parquet files.   
+  - The `tp_index` is used to partition the data and defaults to `"default"` if not specified. You can set `tp_index` in your [partition config](/docs/reference/config-files/partition) to name a column whose value should be used as the index. Be aware that defining a `tp_index` does not always improve performance and may, in fact, degrade it, because it can result in many small Parquet files.
 
-The standard partitioning/hive structure enables efficient queries that only need to read subsets of the hive filtered by index or date.  Because the data is laid out into partitions,  performance is optimized when the partition appears in a `where` or `join` clause.  The index provides a way to segment the data to optimize lookup performance in a way that is *optimal for the specific plugin*.  For example, AWS tables index on account ID, Azure tables on subscription, and GCP on project ID. 
+The standard partitioning/hive structure enables efficient queries that only need to read subsets of the hive filtered by index or date.  Because the data is laid out into partitions,  performance is optimized when the partition appears in a `where` or `join` clause.  The index provides a way to segment the data to optimize lookup performance in a way that is *optimal for your specific use case*.  For example, you might index on account ID for AWS tables, subscription for Azure tables, or project ID for GCP tables. 
 
 ```bash
 tp_table=aws_cloudtrail_log
 └── tp_partition=prod
-    └── tp_index=605491513981
+    └── tp_index=default
         ├── tp_date=2024-12-31
         │   └── data_20250106140713_740378_0.parquet
         ├── tp_date=2025-01-01

diff --git a/docs/collect/custom-tables.md b/docs/collect/custom-tables.md
@@ -65,6 +65,8 @@ You may also use one or more [`column` definitions](/docs/reference/config-files
 
 In our example, the source format does not define a field named `tp_timestamp`.  Since ***`tp_timestamp` is a required column***,  we will add a `tp_timestamp` column and map the `timestamp` from the source.  Also, the source includes a `plugin_timestamp`, but it is parsed as a number because it is epoch milliseconds.  We will transform it to a timestamp data type.
 
+> [!NOTE]
+> You cannot set the `tp_index` mapping in the table definition. The `tp_index` can only be configured through the [partition config](/docs/reference/config-files/partition), where it defaults to `"default"` if not specified.
 
 ```hcl
 table "steampipe_plugin" {

diff --git a/docs/develop/plugin-release-checklist.md b/docs/develop/plugin-release-checklist.md
@@ -89,7 +89,6 @@ Every table and column has a description. These are consistent across tables. Th
 The table enriches the row with the following required [common columns](/docs/reference/config-files/table#common-columns):
 - `tp_date` - The date the event was originally generated.
 - `tp_id` - A unique identifier for the row. In Turbot plugins, it is typically set to an [xid](https://github.com/rs/xid).
-- `tp_index` - The index used to partition the data, e.g., AWS account ID, GitHub organization, hostname.
 - `tp_ingest_timestamp` - The timestamp when the event was ingested into the system.
 - `tp_timestamp` -  The timestamp when the event was originally generated.
 

diff --git a/docs/faq/index.md b/docs/faq/index.md
@@ -93,4 +93,4 @@ partition "aws_cloudtrail_log" "cloudtrail_all" {
 
 ## What partition indexes are available for a table?
 
-That depends on how the plugin author has defined the common `tp_index` field. For AWS tables, it's the `account_id`. In the dual-partition case above, you could carve the logs by `account_id` using the common `tp_partition` field (but `tp_index` will always be the same). In the single-partition case above, you could carve the logs by `account_id` using `tp_index` (but `tp_partition` will always be the same). 
+The `tp_index` value depends on how you have configured it in your [partition config](/docs/reference/config-files/partition). By default, `tp_index` is set to `"default"`, but you can configure it to name a column whose value should be used as the partition index, choosing whichever column makes sense for your data. For AWS tables, you might set it to `account_id`.
diff --git a/docs/reference/cli/compact.md b/docs/reference/cli/compact.md
@@ -16,6 +16,7 @@ Compact multiple Parquet files per day to one per day.
 | Flag | Description
 |-|-
 |  `--help`          |  Help for compact
+|  `--reindex`       |  Reorganize data using the currently configured `tp_index` structure. Any data collected using a different `tp_index` value will be rewritten to new files [partitioned](/docs/collect/configure#hive-partitioning) using the current `tp_index`.
 
 
 ## Examples

diff --git a/docs/reference/config-files/partition.md b/docs/reference/config-files/partition.md
@@ -32,6 +32,7 @@ The partition has two labels:
 |----------|--------|-----------|-----------------
 | `source` | Block  | Required  | a [source](#source) from which to collect data.
 | `filter` | String | Optional  | A SQL `where` clause condition to filter log entries. Supports expressions using table columns.
+| `tp_index` | String | Optional  | The column whose value should be used as `tp_index`. Defaults to `"default"` if not specified. This is used in the [hive partitioning](/docs/collect/configure#hive-partitioning) scheme.
 
 
 
@@ -178,6 +179,20 @@ partition "aws_cloudtrail_log" "s3_bucket_us_east_1" {
 }
 ```
 
+You can configure the `tp_index` to use a specific column as the partition index:
+
+```hcl
+partition "aws_cloudtrail_log" "account_specific" {
+  tp_index = "account_id"
+
+  source "aws_s3_bucket" {
+    connection  = connection.aws.account_a
+    bucket      = "aws-cloudtrail-logs-account-a"
+    file_layout = `AWSLogs/%{NUMBER:account_id}/CloudTrail/%{DATA}.json.gz`  
+  }
+}
+```
+
 Another `source` type, `file`, enables you to collect from local log files that you've downloaded. This partition collects the [flaws.cloud](https://flaws.cloud) files.
 
 ```hcl

diff --git a/docs/reference/config-files/table.md b/docs/reference/config-files/table.md
@@ -111,7 +111,7 @@ Tailpipe supports most of the [DuckDB general-purpose data types](https://duckdb
 
 Tailpipe tables include a set of common columns.  These mappings enable queries that correlate values across different logs. If you have collected both Cloudtrail and ALB logs, for example, you could query for `tp_ips` to find IP addresses in the `aws_cloudtrail_log` and `aws_alb_access_log` tables using the same syntax.
 
-When creating a custom table, `tp_timestamp` is the only required column; ***you must define a `tp_timestamp` column***.  This is because Tailpipe uses the timestamp to [organize the data files](/docs/collect/configure#hive-partitioning).  The `tp_index` is also used in the hive partitioning scheme.  You may set it if you want, but it will default to `default` if not set.
+When creating a custom table, `tp_timestamp` is the only required column; ***you must define a `tp_timestamp` column***.  This is because Tailpipe uses the timestamp to [organize the data files](/docs/collect/configure#hive-partitioning).  The `tp_index` is also used in the hive partitioning scheme.  By default, `tp_index` is set to `"default"`, but you can configure it in your [partition config](/docs/reference/config-files/partition) to specify a column whose value should be used as the partition index.
 
 Some of the common columns (`tp_date`,`tp_id`,`tp_ingest_timestamp`,`tp_partition`,`tp_table`) are automatically set by the plugins - You do not need to create them.  Others are optional (but encouraged).  If you do not set an optional common column, all values will be `null`.
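
To make the behavior described in this change concrete, here is a minimal sketch of two partitions on the same table, one relying on the default index and one naming a column. The partition names `default_index` and `per_account` are hypothetical, and the connection, bucket, and `file_layout` values are copied from the `partition.md` example in this diff rather than from a real environment. The directory names in the comments are what the documented hive layout suggests, not verified output.

```hcl
# Hypothetical partition with no tp_index argument.
# Per the docs in this change, tp_index should fall back to "default", e.g.
#   tp_table=aws_cloudtrail_log/tp_partition=default_index/tp_index=default/tp_date=...
partition "aws_cloudtrail_log" "default_index" {
  source "aws_s3_bucket" {
    connection  = connection.aws.account_a
    bucket      = "aws-cloudtrail-logs-account-a"
    file_layout = `AWSLogs/%{NUMBER:account_id}/CloudTrail/%{DATA}.json.gz`
  }
}

# Hypothetical partition that names a column as the index.
# Each distinct account_id value would presumably get its own directory, e.g.
#   tp_table=aws_cloudtrail_log/tp_partition=per_account/tp_index=<account_id>/tp_date=...
partition "aws_cloudtrail_log" "per_account" {
  tp_index = "account_id"

  source "aws_s3_bucket" {
    connection  = connection.aws.account_a
    bucket      = "aws-cloudtrail-logs-account-a"
    file_layout = `AWSLogs/%{NUMBER:account_id}/CloudTrail/%{DATA}.json.gz`
  }
}
```

If data for a partition was already collected under a different `tp_index` value, the `--reindex` flag added to `compact` in this change is the documented way to rewrite the existing files under the current setting.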