A specialized tool designed to analyze and monitor the storage size of individual columns in BigQuery tables. By leveraging BigQuery's dry-run capabilities, it provides accurate estimates of data volume per column and partition without incurring query costs.
- Cost-Free Analysis: Uses BigQuery dry runs to calculate column sizes, ensuring no processing costs are incurred during analysis.
- Granular Monitoring: Tracks column sizes at the partition level, allowing for detailed observation of data growth over time.
- Flexible Scoping: Analyze entire projects, specific datasets, or individual tables.
- Multiple Output Formats: Export results to local CSV files or directly into a BigQuery table for further analysis.
- Execution Project Support: Separate the project being analyzed from the project used for query execution (useful for permission or billing management).
The tool iterates through the specified GCP resources (projects, datasets, and tables). For each table, it identifies the partitions and executes a SELECT dry run for every column. The resulting "total bytes processed" statistic is then recorded as the column's size for that specific partition.
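Conceptually, each per-column measurement is equivalent to the dry run below. This is a hand-written sketch using the `bq` CLI rather than the tool's own code; the project, table, column, and partition filter are illustrative, and the filter assumes an ingestion-time partitioned table:

```bash
# A dry run makes BigQuery report the bytes the query *would* scan,
# without executing it or incurring any cost.
bq query --use_legacy_sql=false --dry_run \
  'SELECT my_column
   FROM `my-project.my_dataset.my_table`
   WHERE _PARTITIONTIME = TIMESTAMP("2021-05-01")'
```

Because the query selects a single column of a single partition, the reported byte count is exactly what gets recorded as that column's size for that partition.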
- GCP Credentials: Ensure you have active Google Cloud credentials. The tool uses Application Default Credentials (ADC).
```bash
gcloud auth application-default login
```
- Permissions: The user/service account must have the following (a sketch of matching IAM bindings follows this list):
  - `bigquery.jobs.create` on the execution project.
  - `bigquery.tables.get` and `bigquery.tables.list` on the target project/datasets.
  - `bigquery.tables.getData` (or metadata access) to perform dry runs.
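As one possible setup, the predefined roles below cover these permissions; the principal and project IDs are placeholders, so adapt them to your own IAM policy:

```bash
# roles/bigquery.jobUser grants bigquery.jobs.create;
# roles/bigquery.dataViewer grants bigquery.tables.get, .list and .getData.
gcloud projects add-iam-policy-binding execution-project-id \
  --member="user:analyst@example.com" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding target-project-id \
  --member="user:analyst@example.com" --role="roles/bigquery.dataViewer"
```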
The easiest way to run the tool is using Coursier:
- Fetch the dependency from JitPack:
```bash
coursier fetch -r jitpack com.github.Kayrnt:bq_column_sizes:master-SNAPSHOT
```
- Launch the tool:
```bash
coursier launch -r jitpack com.github.Kayrnt:bq_column_sizes:master-SNAPSHOT -- --help
```
- Clone the repository:
```bash
git clone https://github.com/Kayrnt/bq_column_sizes.git
cd bq_column_sizes
```
- Build and run using SBT:
```bash
sbt "run --help"
```
| Option | Short | Description | Default |
|---|---|---|---|
| `--project-id` | `-p` | (Required) GCP project ID to analyze. | - |
| `--execution-project-id` | `-e` | GCP project ID used for query execution. | Same as `-p` |
| `--dataset` | `-d` | Comma-separated list of datasets to analyze. | All datasets |
| `--table` | `-t` | Comma-separated list of tables to analyze. | All tables |
| `--output-writer` | `--ow` | Output format: `csv` or `bq`. | `csv` |
| `--output-file-path` | `-o` | Path for the CSV output file. | `size.csv` |
| `--max-concurrent-queries` | `--mcq` | Maximum number of concurrent BigQuery requests. | 4 |
| `--partition` | `--pt` | Specific partition to analyze (e.g., `20210101`). | All partitions |
| `--job-frequency` | `--jf` | Partitioning frequency: `year`, `month`, `day`, `hour`. | `day` |
| `--offset` | - | Partition offset as a duration (e.g., `1d`, `1hour`). | - |
| `--write-disposition` | - | BQ write disposition: `write_empty`, `truncate`, `append`. | `truncate` |
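The partition-selection flags compose naturally. For instance, a hypothetical invocation (project and dataset names are assumptions) that analyzes yesterday's daily partition across two datasets with a higher concurrency limit:

```bash
# --jf day + --offset 1d selects the partition one day back;
# --mcq 8 allows up to 8 concurrent dry runs.
bq-column-sizes -p my-project -d sales,events --jf day --offset 1d --mcq 8
```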
When using `--ow bq`, the following options are available:

| Option | Short | Description |
|---|---|---|
| `--output-project` | `--op` | Target GCP project for the output table. |
| `--output-dataset` | `--od` | Target dataset for the output table. |
| `--output-table` | `--ot` | Target table name for the output. |
Analyze a single partition of one table and write the results to a CSV file:

```bash
bq-column-sizes -p bigquery-public-data -d crypto_bitcoin -t blocks --pt 20210501 -o bitcoin_sizes.csv
```

Analyze a whole dataset and write the results to a BigQuery table:

```bash
bq-column-sizes \
  -p my-source-project \
  -d my_dataset \
  --ow bq \
  --op my-billing-project \
  --od monitoring_dataset \
  --ot column_size_history
```

If you have read access to a project but want to use a different project for query execution (billing):

```bash
bq-column-sizes -p target-project-id -e billing-project-id -d production_dataset
```

The tool generates data with the following columns:
| Column | Type | Description |
|---|---|---|
| `job_partition` | TIMESTAMP | The timestamp of the analysis run. |
| `project_id` | STRING | The GCP project ID of the analyzed table. |
| `dataset` | STRING | The dataset name. |
| `table` | STRING | The table name. |
| `field` | STRING | The specific column/field name. |
| `table_partition` | TIMESTAMP | The specific partition of the table being analyzed. |
| `size_in_bytes` | INT64 | The calculated size of the column in bytes. |
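Once results land in BigQuery, this schema supports straightforward trend queries. A minimal sketch, assuming the `column_size_history` output table from the earlier example and illustrative table/field names:

```bash
# Track the growth of a single column across partitions.
bq query --use_legacy_sql=false '
  SELECT table_partition, size_in_bytes
  FROM `my-billing-project.monitoring_dataset.column_size_history`
  WHERE `table` = "blocks" AND field = "hash"
  ORDER BY table_partition'
```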
- Range Partitioning: Currently, tables using integer range partitioning are treated as non-partitioned tables.
- Legacy SQL: Some internal queries may use Legacy SQL syntax for partition discovery.
This project is licensed under the terms described in the LICENSE file included in the repository.