BigQuery GCS Utils

A minimal template repository for working with Google Cloud Storage (GCS) and BigQuery, specifically designed to handle large files (5-6 GB each) split across multiple GCS buckets.

Features

  • Authentication: Service account JSON key file authentication
  • Dataset Management: Create and manage BigQuery datasets
  • Table Creation: Create external tables or load data from GCS files
  • Large File Support: Handle hundreds of large files (5-6 GB each)
  • Batch Processing: Efficient file discovery and table creation for large file lists
  • Multiple Paths: Combine files from multiple GCS paths/buckets

Quick Start

1. Install Dependencies

macOS/Linux:

pip3 install -r requirements.txt

Windows:

pip install -r requirements.txt

Note: On macOS/Linux, you may need to use pip3 instead of pip depending on your Python installation. If you encounter permission errors, use pip3 install --user -r requirements.txt or install in a virtual environment.
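
If you prefer an isolated environment, the commands below sketch a typical virtual environment setup for macOS/Linux (the .venv directory name is only a convention):

# Create and activate a virtual environment, then install dependencies into it
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt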

2. Configure Environment Variables

Copy .env.example to .env and fill in your values:

macOS/Linux:

cp .env.example .env

Windows PowerShell:

Copy-Item .env.example .env

Windows Command Prompt:

copy .env.example .env

Required environment variables:

  • GOOGLE_APPLICATION_CREDENTIALS: Path to your service account JSON key file
    • macOS/Linux example: /Users/username/.gcp/service-account-key.json
    • Windows example: C:\Users\YourUsername\.gcp\service-account-key.json
  • GCP_PROJECT_ID: Your Google Cloud Project ID
  • GCS_BUCKET_NAME: Name of your GCS bucket
  • BIGQUERY_DATASET: Name of your BigQuery dataset

Optional:

  • BIGQUERY_DATASET_LOCATION: Dataset location (default: US)
  • GCS_PATH_PREFIX: GCS path prefix for file discovery
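
Putting it together, a sample .env might look like this (all values are placeholders; substitute your own):

# Example .env (placeholder values)
GOOGLE_APPLICATION_CREDENTIALS=/Users/username/.gcp/service-account-key.json
GCP_PROJECT_ID=my-project-id
GCS_BUCKET_NAME=my-bucket
BIGQUERY_DATASET=my_dataset
BIGQUERY_DATASET_LOCATION=US
GCS_PATH_PREFIX=path/to/data/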

3. Set Up Service Account

  1. Create a service account in Google Cloud Console
  2. Grant the following roles:
    • Storage Object Viewer (for reading GCS objects)
    • Storage Legacy Bucket Reader (for bucket metadata access)
    • BigQuery Data Editor (for creating datasets and tables)
    • BigQuery Job User (for running BigQuery jobs)
  3. Download the JSON key file
  4. Set GOOGLE_APPLICATION_CREDENTIALS to the path of the key file
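
If you prefer the command line, the following sketch performs steps 1-3 with the gcloud CLI (my-project and my-service-account are placeholders):

# Create the service account
gcloud iam service-accounts create my-service-account --project=my-project

# Grant one of the roles listed above (repeat for roles/bigquery.jobUser,
# roles/storage.objectViewer, and roles/storage.legacyBucketReader)
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-service-account@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Download a JSON key file
gcloud iam service-accounts keys create key.json \
  --iam-account=my-service-account@my-project.iam.gserviceaccount.com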

Setting the path:

macOS/Linux:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

# Option 2: Set as environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"

Windows PowerShell:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\key.json"

Windows Command Prompt:

# Option 1: Set in .env file (recommended)
# Edit .env and set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

# Option 2: Set as environment variable
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\key.json

See GCP Setup Guide for detailed instructions.

4. Run Examples

macOS/Linux:

# Create a dataset
python3 examples/create_dataset_example.py

# Create tables from GCS files
python3 examples/create_table_from_gcs.py

Windows:

# Create a dataset
python examples/create_dataset_example.py

# Create tables from GCS files
python examples/create_table_from_gcs.py

Note: On macOS/Linux, you may need to use python3 instead of python depending on your Python installation.

Usage

Creating a Dataset

from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

# Initialize
gcp_config = GCPConfig()
dataset_manager = DatasetManager(gcp_config)

# Create dataset
dataset_manager.create_dataset(
    dataset_name="my_dataset",
    location="US",
    description="My dataset description",
    if_exists="ignore"  # or "error" or "overwrite"
)

Creating Tables from GCS Files

External Table (References GCS files, no data movement)

from config.gcp_config import GCPConfig
from utils.table_manager import TableManager

# Initialize
gcp_config = GCPConfig()
table_manager = TableManager(gcp_config)

# Discover files
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet']
)

# Create external table
table_manager.create_table_from_gcs_files(
    table_name="my_external_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="external"
)

Loaded Table (Copies data to BigQuery)

# Create loaded table (data is copied to BigQuery)
table_manager.create_table_from_gcs_files(
    table_name="my_loaded_table",
    gcs_paths=files,
    source_format="PARQUET",
    table_type="loaded",
    write_disposition="WRITE_TRUNCATE"
)

Combining Files from Multiple Paths

# Combine files from multiple GCS paths
table_manager.combine_files_from_multiple_paths(
    table_name="combined_table",
    gcs_paths=[
        "path/to/data1/",
        "path/to/data2/",
        "another/path/"
    ],
    file_extensions=['.parquet'],
    source_format="PARQUET",
    table_type="external"
)

Handling Large File Lists

For very large file lists (hundreds of files), use batch methods:

# Discover files (handles large lists efficiently)
files = table_manager.discover_gcs_files(
    prefixes=["path/to/data/"],
    file_extensions=['.parquet'],
    max_files=None  # None for all files
)

# Create GCS URIs
gcs_uris = gcp_config.create_gcs_uris(files)

# Create external table (BigQuery handles large URI lists well)
table_manager.create_external_table_batch(
    table_name="large_table",
    gcs_uris=gcs_uris,
    source_format="PARQUET"
)

# Or load in batches (for loaded tables)
table_manager.load_files_batch(
    table_name="large_loaded_table",
    gcs_uris=gcs_uris,
    batch_size=50,
    source_format="PARQUET"
)

API Reference

GCPConfig

Main configuration class for GCP authentication and basic operations.

Methods

  • list_gcs_files(prefix, max_results, file_extensions): List files in GCS bucket
  • discover_gcs_files_batch(prefixes, file_extensions, max_files_per_prefix): Discover files across multiple prefixes
  • create_gcs_uris(blob_paths): Convert blob paths to GCS URIs
  • check_bigquery_table_exists(table_name): Check if table exists
  • create_external_table_from_gcs(table_name, gcs_uris, schema, source_format): Create external table
  • load_files_to_bigquery(table_name, gcs_uris, source_format, write_disposition): Load files into BigQuery
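
A minimal sketch combining these helpers, assuming the keyword arguments match the signatures above and that schema=None triggers schema auto-detection:

from config.gcp_config import GCPConfig

gcp_config = GCPConfig()

# List Parquet files under a prefix, then convert them to gs:// URIs
files = gcp_config.list_gcs_files(
    prefix="path/to/data/",
    max_results=None,
    file_extensions=['.parquet']
)
gcs_uris = gcp_config.create_gcs_uris(files)

# Only create the external table if it does not already exist
if not gcp_config.check_bigquery_table_exists("my_table"):
    gcp_config.create_external_table_from_gcs(
        table_name="my_table",
        gcs_uris=gcs_uris,
        schema=None,  # assumption: None means auto-detect
        source_format="PARQUET"
    )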

DatasetManager

Manages BigQuery dataset operations.

Methods

  • create_dataset(dataset_name, location, description, if_exists): Create a dataset
  • dataset_exists(dataset_name): Check if dataset exists
  • delete_dataset(dataset_name, delete_contents): Delete a dataset
  • list_datasets(): List all datasets in project
  • get_dataset_info(dataset_name): Get dataset information
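
A short sketch, assuming these methods behave as their names suggest:

from config.gcp_config import GCPConfig
from utils.dataset_manager import DatasetManager

gcp_config = GCPConfig()
dataset_manager = DatasetManager(gcp_config)

# Create the dataset only if it is missing, then inspect it
if not dataset_manager.dataset_exists("my_dataset"):
    dataset_manager.create_dataset(dataset_name="my_dataset", location="US")
print(dataset_manager.list_datasets())
print(dataset_manager.get_dataset_info("my_dataset"))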

TableManager

Manages BigQuery table creation from GCS files.

Methods

  • discover_gcs_files(prefixes, file_extensions, max_files): Discover files matching criteria
  • create_table_from_gcs_files(table_name, gcs_paths, gcs_uris, source_format, table_type, schema, write_disposition, if_exists): Create table from GCS files
  • create_external_table_batch(table_name, gcs_uris, batch_size, schema, source_format): Create external table from large file list
  • load_files_batch(table_name, gcs_uris, batch_size, source_format, write_disposition): Load files in batches
  • combine_files_from_multiple_paths(table_name, gcs_paths, file_pattern, file_extensions, source_format, table_type): Combine files from multiple paths
  • delete_table(table_name): Delete a table

Table Types

External Tables

  • Pros: No data movement, fast setup, always reflects latest GCS data
  • Cons: Query performance may be slower, requires GCS access permissions
  • Use when: Data changes frequently, want to avoid data duplication, query performance is acceptable

Loaded Tables

  • Pros: Better query performance, data is stored in BigQuery
  • Cons: Data is copied (storage costs), setup takes longer for large files
  • Use when: Query performance is critical, data is relatively static

Large File Handling

This template is designed to handle large files (5-6 GB each) efficiently:

  1. File Discovery: Uses efficient GCS listing without downloading files
  2. Batch Processing: Supports batch operations for large file lists
  3. Progress Tracking: Provides progress updates for long-running operations
  4. Error Handling: Includes error recovery and retry logic

Best Practices for Large Files

  • Use external tables when possible (no data movement)
  • For loaded tables, use batch loading for better control
  • Monitor BigQuery job status for long-running operations (see the sketch after this list)
  • Consider splitting very large datasets into multiple tables
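
For job monitoring, a minimal sketch using the official google-cloud-bigquery client directly (this is not part of the template; my_job_id is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("my_job_id")  # placeholder job ID
print(job.state)                   # PENDING, RUNNING, or DONE
job.result()                       # block until the job completes; raises on failure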

Platform-Specific Notes

This repository works on Windows, macOS, and Linux. The code uses Python's cross-platform libraries (os, pathlib) for file path handling, so no platform-specific code changes are needed.

Path Formatting

  • macOS/Linux: Use forward slashes (/) in paths
    • Example: /Users/username/.gcp/service-account-key.json
  • Windows: Use backslashes (\) or forward slashes (/) in paths
    • Example: C:\Users\username\.gcp\service-account-key.json or C:/Users/username/.gcp/service-account-key.json
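
Because path handling goes through pathlib, either separator style resolves correctly. A minimal sketch of the cross-platform pattern (not tied to this repo's internals):

import os
from pathlib import Path

# pathlib normalizes separators on every platform
key_path = Path(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]).expanduser().resolve()
if not key_path.is_file():
    raise FileNotFoundError(f"Service account key not found: {key_path}")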

Python Installation

macOS:

# Using Homebrew (recommended)
brew install python3

# Or download from python.org
# Then install dependencies
pip3 install -r requirements.txt

Windows:

# Download Python from python.org
# Then install dependencies
pip install -r requirements.txt

Linux:

# Using package manager
sudo apt-get install python3 python3-pip  # Debian/Ubuntu
# or
sudo yum install python3 python3-pip      # CentOS/RHEL

# Then install dependencies
pip3 install -r requirements.txt

Troubleshooting

Authentication Errors

Error: "Could not automatically determine credentials"

Solution:

  • Verify GOOGLE_APPLICATION_CREDENTIALS is set correctly
  • Check that the JSON key file path is valid
    • macOS/Linux: Ensure the path uses forward slashes and is absolute (starts with /)
    • Windows: Ensure the path uses backslashes or forward slashes and includes the drive letter
  • Ensure the service account has required permissions
  • On macOS/Linux, verify file permissions allow reading: chmod 600 /path/to/key.json
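
To confirm the key file itself is readable and well-formed, a quick diagnostic sketch using the google-auth library (independent of this repo):

import os
from google.oauth2 import service_account

key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
credentials = service_account.Credentials.from_service_account_file(key_path)
print(f"Loaded credentials for: {credentials.service_account_email}")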

Permission Errors

Error: "Permission denied" or "Access denied"

Solution:

  • Verify service account has required roles:
    • Storage Object Viewer
    • Storage Legacy Bucket Reader
    • BigQuery Data Editor
    • BigQuery Job User
  • Check bucket and dataset permissions

Dataset Creation Errors

Error: "Dataset already exists"

Solution: Use if_exists="ignore" or if_exists="overwrite" when creating datasets (matching the options shown in the dataset example above)

Table Creation Errors

Error: "Table already exists"

Solution: Use if_exists="replace" or if_exists="ignore" when creating tables

Large File List Errors

Error: "Too many URIs" or timeout errors

Solution:

  • Use load_files_batch() for loaded tables
  • Consider splitting into multiple tables
  • Use external tables which handle large URI lists better

Project Structure

bigquery-gcs-utils/
├── README.md                    # This file
├── .env.example                 # Environment variable template
├── requirements.txt             # Python dependencies
├── config/
│   ├── __init__.py
│   └── gcp_config.py           # GCP configuration class
├── utils/
│   ├── __init__.py
│   ├── dataset_manager.py      # Dataset management utilities
│   └── table_manager.py        # Table creation utilities
└── examples/
    ├── __init__.py
    ├── create_dataset_example.py
    └── create_table_from_gcs.py

License

This is a template repository. Modify as needed for your use case.

Contributing

This is a minimal template. Feel free to extend it with additional features as needed.
