Lakeflow Tapworks

Automated Load Balancer and DAB (Databricks Asset Bundle) generation toolkit for Databricks Managed Lakeflow Connectors .

Problem

DAB is the recommended way for deploying Lakeflow connectors, however, manually creating and maintaining DAB templates for Lakeflow connectors doesn't scale. Common challenges include:

Manual table/object management - Adding hundreds or thousands of tables to DAB templates by hand is error-prone and time-consuming
Load balancing - Distributing tables across pipelines based on size, SLAs, or performance metrics is impossible to do manually at scale
Naming conventions - Table mapping for sources with unsupported characters (e.g., SAP tables with "/") or enforcing naming standards can be automated
DAB syntax errors - Minor syntax mistakes (e.g., missing spaces) cause errors and can be difficult to troubleshoot
Config re-use - Existing table configurations from other tools can't be reused as input for migration

Solution

Tapworks reads from a simple configuration (e.g., CSV, YAML, JSON, Delta table, or any DataFrame source) and automatically generates complete DAB packages with load balancing, validation, and proper syntax while splitting the specified tables across pipelines for performance (load balancing).

┌─────────────────┐     ┌─────────────────────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     INPUT       │     │           TAPWORKS              │     │     OUTPUT      │     │     DEPLOY      │
│                 │     │                                 │     │   DAB Package   │     │                 │
│  • Table config │────▶│  1. Validate config             │────▶│  databricks.yml │────▶│  bundle deploy  │
│    (CSV/Delta)  │     │  2. Apply defaults/overrides    │     │  resources/     │     │                 │
│  • Targets      │     │  3. Load balance (split tables) │     │   pipelines.yml │     │                 │
│  • Defaults     │     │  4. Generate YAML               │     │   jobs.yml      │     │                 │
│  • Overrides    │     │                                 │     │   gateways.yml  │     │                 │
└─────────────────┘     └─────────────────────────────────┘     └─────────────────┘     └─────────────────┘

How It Works

Define your config - Specify at least source/target mappings or other extra configuration (e.g., schedule, SCD type, gateway driver type, ...), and target environements. Using target, it is possible to specify different workspaces for deployment (e.g., dev, staging, prod)

Example of a basic CSV config

source_schema,source_table,target_catalog,target_schema,target_table,connection_name
dbo,customers,bronze,sales,customers,sql_server_conn

Example target environments

{
    'dev': {'workspace_host': 'https://dev.cloud.databricks.com'},
    'prod': {'workspace_host': 'https://prod.cloud.databricks.com', "root_path": "/Shared/pipelines/prod"},
}

Run the generator - From CLI or notebook. This will write the DAB templates into the specified output dir.

Output Structure

output/<project_name>/
  databricks.yml
  resources/
    gateways.yml    # database connectors only
    pipelines.yml
    jobs.yml

CLI:

# Install the package first
pip install -e .

# List available connectors
tapworks --list

# Show connector requirements
tapworks sql_server --info

# Generate DAB files
tapworks sql_server --input-config tables.csv --output-dir output \
  --targets '{"dev": {"workspace_host": "https://your-workspace.databricks.com"}}'

Notebook / Python:

from tapworks.core import run_pipeline_generation

result = run_pipeline_generation(
    connector_name='sql_server',
    input_source='tables.csv',  # or Delta table or DataFrame
    output_dir='output',
    targets={'dev': {'workspace_host': 'https://your-workspace.databricks.com'}},
)

Deploy - Use the generated DAB files with databricks bundle deploy

Load Balancing

Tapworks automatically distributes tables across pipelines and gateways. This is done according to limitations of the pipelines (e.g., maximum recommended number of tables per pipeline), and can be adjusted via user config. Users can group pipelines together using project, prefix, or subgroups.

Example CSV with grouping columns

source_table,target_catalog,target_schema,target_table,connection_name,prefix,subgroup
customers,main,bronze,customers,sql_conn,sales,1
orders,main,bronze,orders,sql_conn,sales,1
products,main,bronze,products,sql_conn,sales,2
employees,main,bronze,employees,sql_conn,hr,1

Hierarchy

SaaS connectors (e.g., Salesforce, GA4, ServiceNow, Workday):

Project (DAB Package)
└── Prefix + Subgroup (logical grouping)
    └── Pipeline(s) - auto-split if > 250 tables

Database connectors (e.g., SQL Server, PostgreSQL):

Project (DAB Package)
└── Prefix + Subgroup (logical grouping)
    └── Gateway(s) - auto-split if > 250 tables
        └── Pipeline(s) - auto-split if > 250 tables per gateway

Auto-Distribution

Tables are automatically split based on configurable limits (default: 250 tables per pipeline/gateway):

SaaS connector example (600 tables):

              Input: 600 tables, prefix="sales", subgroup="01"
                                      │
                    ┌─────────────────┼─────────────────┐
                    ▼                 ▼                 ▼
            ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
            │   Pipeline    │ │   Pipeline    │ │   Pipeline    │
            │ sales_01_g01  │ │ sales_01_g02  │ │ sales_01_g03  │
            │ (250 tables)  │ │ (250 tables)  │ │ (100 tables)  │
            └───────────────┘ └───────────────┘ └───────────────┘

Database connector example (600 tables):

              Input: 600 tables, prefix="sales", subgroup="01"
                                      │
                    ┌─────────────────┴─────────────────┐
                    ▼                                   ▼
            ┌───────────────┐                   ┌───────────────┐
            │    Gateway    │                   │    Gateway    │
            │ sales_01_g01  │                   │ sales_01_g02  │
            │ (500 tables)  │                   │ (100 tables)  │
            └───────┬───────┘                   └───────┬───────┘
                    │                                   │
          ┌─────────┴─────────┐                         │
          ▼                   ▼                         ▼
   ┌───────────────┐   ┌───────────────┐        ┌───────────────┐
   │   Pipeline    │   │   Pipeline    │        │   Pipeline    │
   │  sales_01_    │   │  sales_01_    │        │  sales_01_    │
   │  g01_p01      │   │  g01_p02      │        │  g02_p01      │
   │ (250 tables)  │   │ (250 tables)  │        │ (100 tables)  │
   └───────────────┘   └───────────────┘        └───────────────┘

Manual Subgroups

Use subgroups to isolate specific tables (e.g., critical or high-volume tables). Note: When using subgroups, all tables in a prefix must have explicit subgroups.

                    prefix="sales"
                          │
          ┌───────────────┴───────────────┐
          ▼                               ▼
      subgroup="01"                  subgroup="02"
    (5 critical tables)          (595 remaining tables)
          │                               │
          ▼                               ▼
  ┌───────────────┐       ┌───────────────┬───────────────┬───────────────┐
  │   Pipeline    │       │   Pipeline    │   Pipeline    │   Pipeline    │
  │ sales_01_p01  │       │ sales_02_p01  │ sales_02_p02  │ sales_02_p03  │
  │  (5 tables)   │       │ (250 tables)  │ (250 tables)  │  (95 tables)  │
  └───────────────┘       └───────────────┴───────────────┴───────────────┘

Resource Naming

Tapworks generates DAB resource names from project_name, prefix, and subgroup. Suffixes are always present for stable naming — adding more tables never renames existing resources.

project_name  →  required, no default
prefix        →  falls back to project_name if not specified
subgroup      →  defaults to "01" if not specified
base_group    =  {prefix}_{subgroup}

Database connector (with gateways):

Resource	Pattern	Example
Gateway (resource key)	`gateway_{base_group}_g{NN}`	`gateway_sales_01_g01`
Gateway (display name)	`{base_group}_g{NN}`	`sales_01_g01`
Pipeline (resource name)	`pipeline_{base_group}_g{NN}_p{NN}`	`pipeline_sales_01_g01_p01`
Pipeline (display name)	`{base_group}_g{NN}_p{NN}`	`sales_01_g01_p01`
Job (resource name)	`job_{base_group}_g{NN}_p{NN}`	`job_sales_01_g01_p01`
Job (display name)	`{base_group}_g{NN}_p{NN}`	`sales_01_g01_p01`

SaaS connector (no gateways):

Resource	Pattern	Example
Pipeline (resource name)	`pipeline_{base_group}_p{NN}`	`pipeline_sales_01_p01`
Pipeline (display name)	`{base_group}_p{NN}`	`sales_01_p01`
Job (resource name)	`job_{base_group}_p{NN}`	`job_sales_01_p01`
Job (display name)	`{base_group}_p{NN}`	`sales_01_p01`

Important: Prefixes must be unique per workspace. Using the same prefix across different projects deployed to the same workspace will cause resource name collisions. Use distinct prefixes (or distinct project_name values if relying on the prefix fallback) for each project.

Defaults and Overrides

Users can define configs individually for objects or pipelines in the config, or they can specify config for a group of pipelines when calling generator, using default_values and override_configs.

default_values - Fill missing/empty columns with defaults (e.g., set schedule for rows that don't have one)
override_config - Force values for ALL rows, ignoring what's in the input (e.g., pause all jobs)

Simple Configuration

Apply the same default values to all rows using a flat dictionary:

default_values = {
    'schedule': '0 */6 * * *',
    'pause_status': 'UNPAUSED',
}

Group-Based Configuration

Apply different values per pipeline group using nested dictionaries:

default_values = {
    '*': {'schedule': '0 */6 * * *'},        # Global fallback
    'sales': {'schedule': '*/15 * * * *'},   # All sales pipelines
    'hr': {'schedule': '0 0 * * *'},         # HR pipelines
}

override_config = {
    '*': {'pause_status': 'UNPAUSED'},
    'finance': {'pause_status': 'PAUSED'},   # Pause finance for audit
}

Matching precedence (most specific wins):

pipeline_group (prefix_subgroup) - e.g., 'sales_2'
prefix - e.g., 'sales'
project_name - e.g., 'my_project'
'*' (global)

See examples/features/group_based_config (Databricks) for detailed examples.

CLI Examples

Inline JSON:

tapworks salesforce --input-config tables.csv --output-dir output \
  --targets '{"dev": {"workspace_host": "https://dev.cloud.databricks.com"}}' \
  --default-values '{"project_name": "sfdc_prod", "schedule": "0 */6 * * *"}' \
  --override '{"pause_status": "PAUSED"}'

Using settings file:

tapworks salesforce --input-config tables.csv --output-dir output --settings settings.json

Settings file (settings.json):

{
  "targets": {
    "dev": {
      "workspace_host": "https://dev.cloud.databricks.com",
      "root_path": "/Shared/pipelines/dev"
    },
    "prod": {
      "workspace_host": "https://prod.cloud.databricks.com",
      "root_path": "/Shared/pipelines/prod"
    }
  },
  "default_values": {
    "project_name": "sfdc_prod",
    "schedule": "0 */6 * * *"
  },
  "override_input_config": {
    "pause_status": "PAUSED"
  }
}

Notebook Example

from tapworks.core import run_pipeline_generation

result = run_pipeline_generation(
    connector_name='salesforce',
    input_source='config.csv',
    output_dir='output',
    targets={
        'dev': {'workspace_host': 'https://dev.cloud.databricks.com'},
        'prod': {'workspace_host': 'https://prod.cloud.databricks.com'},
    },
    # Fill missing values
    default_values={
        'project_name': 'sfdc_prod',
        'schedule': '0 */6 * * *',
    },
    # Override ALL rows (e.g., pause all jobs during maintenance)
    override_config={
        'pause_status': 'PAUSED',
    },
)

Documentation

USAGE.md (Databricks) - CLI and notebook usage examples for all connectors
ARCHITECTURE.md (Databricks) - Technical architecture and class hierarchy

Name		Name	Last commit message	Last commit date
Latest commit History 356 Commits
docs		docs
examples		examples
helpers		helpers
prompts		prompts
src/tapworks		src/tapworks
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CODEOWNERS.txt		CODEOWNERS.txt
LICENSE.md		LICENSE.md
NOTICE.md		NOTICE.md
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lakeflow Tapworks

Problem

Solution

How It Works

Load Balancing

Hierarchy

Auto-Distribution

Manual Subgroups

Resource Naming

Defaults and Overrides

Simple Configuration

Group-Based Configuration

CLI Examples

Notebook Example

Documentation

License

About

Uh oh!

Releases

Packages

Contributors 7

Uh oh!

Languages

License

databricks-solutions/lakeflow-tapworks

Folders and files

Latest commit

History

Repository files navigation

Lakeflow Tapworks

Problem

Solution

How It Works

Load Balancing

Hierarchy

Auto-Distribution

Manual Subgroups

Resource Naming

Defaults and Overrides

Simple Configuration

Group-Based Configuration

CLI Examples

Notebook Example

Documentation

License

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 7

Uh oh!

Languages

Packages