Open-source Python ETL tool for seamless data movement across PostgreSQL, MySQL, Redshift, BigQuery, S3, GCS, and CSV files, with YAML/JSON-based configuration.

petaly-labs/petaly

Overview

Petaly is an open-source ETL/ELT (Extract, Transform, Load) tool, created by and for data professionals! Our mission is to simplify data movement across different platforms with a tool that truly understands the needs of the data community.

Key Features

  • Multiple Data Sources and Targets: Support for the following endpoints:

    • PostgreSQL
    • MySQL
    • BigQuery
    • Redshift
    • Google Cloud Storage (GCS Bucket)
    • S3 Bucket
    • Local CSV files
  • Features:

    • Source-to-target schema evaluation and mapping
    • CSV file load with column-type recognition
    • Target table structure generation
    • Configurable type mapping between different databases
    • Full table unload/load in CSV format
  • User-Friendly: No programming knowledge required

  • YAML/JSON Configuration: Easy pipeline setup

  • Cloud Ready: Full support for AWS and GCP

[EXPERIMENTAL]:

Petaly went agentic!
The AI agent can create and run pipelines using natural-language prompts.
If you're interested in exploring, check out the experimental branch: petaly-ai-agent

Feedback is welcome!

Quick Start

  1. Installation
  2. Configuration
  3. Create Pipeline
  4. Run Pipeline

Requirements

System Requirements

  • Python 3.10 - 3.12
  • Operating System:
    • Linux
    • macOS

Note: Petaly may work on other operating systems and Python versions, but these haven't been tested yet.

Installation

Basic Installation

# Create and activate virtual environment
mkdir petaly
cd petaly
python3 -m venv .venv
source .venv/bin/activate

# Install Petaly
python3 -m pip install petaly

Cloud Provider Support

GCP Support

# Install with GCP support (quote the extras to avoid shell globbing, e.g. in zsh)
python3 -m pip install 'petaly[gcp]'

Prerequisites:

  1. Install Google Cloud SDK
  2. Configure access to your Google Project
  3. Set up service account authentication

AWS Support

# Install with AWS support (quote the extras to avoid shell globbing, e.g. in zsh)
python3 -m pip install 'petaly[aws]'

Prerequisites:

  1. Install AWS CLI
  2. Configure AWS credentials

Full Installation

# Install all features, including AWS and GCP support
python3 -m pip install 'petaly[all]'

From Source

# Clone the repository
git clone https://github.com/petaly-labs/petaly.git
cd petaly

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install development dependencies
python3 -m pip install -r requirements.txt

# Install in editable mode (recommended)
python3 -m pip install -e .

# Alternative: Add src to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)/src

Configuration

1. Initialize Configuration

# Create petaly.ini in default location (~/.petaly/petaly.ini)
python3 -m petaly init

# Or specify custom location
python3 -m petaly -c /absolute-path-to-your-config-dir/petaly.ini init

2. Set Environment Variable (Optional)

# Set the environment variable if the folder differs from the default location
export PETALY_CONFIG_DIR=/absolute-path-to-your-config-dir

# Alternatively, pass the config path explicitly on every run with -c
python3 -m petaly -c /absolute-path-to-your-config-dir/petaly.ini [command]
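For illustration only (not petaly's actual implementation), the lookup order described above — environment variable first, then the default `~/.petaly` location — can be sketched in Python like this:

```python
import os

# Sketch (not petaly's own code): prefer PETALY_CONFIG_DIR when set,
# otherwise fall back to the default ~/.petaly config directory.
config_dir = os.environ.get("PETALY_CONFIG_DIR") or os.path.expanduser("~/.petaly")
config_path = os.path.join(config_dir, "petaly.ini")
print(config_path)
```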

3. Initialize Workspace

  1. Configure petaly.ini:
[workspace_config]
pipeline_dir_path=/home/user/petaly/pipelines
logs_dir_path=/home/user/petaly/logs
output_dir_path=/home/user/petaly/output

[global_settings]
logging_mode=INFO
pipeline_format=yaml
  2. Create workspace:
python3 -m petaly init --workspace
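The petaly.ini file above is standard INI syntax, so you can sanity-check your edits with Python's stdlib configparser (illustration only):

```python
import configparser

# Parse the workspace config shown above; values are returned as strings.
ini_text = """
[workspace_config]
pipeline_dir_path=/home/user/petaly/pipelines
logs_dir_path=/home/user/petaly/logs
output_dir_path=/home/user/petaly/output

[global_settings]
logging_mode=INFO
pipeline_format=yaml
"""

config = configparser.ConfigParser()
config.read_string(ini_text)
print(config["workspace_config"]["pipeline_dir_path"])  # /home/user/petaly/pipelines
print(config["global_settings"]["pipeline_format"])     # yaml
```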

Create Pipeline

Initialize a new pipeline:

python3 -m petaly init -p my_pipeline

Follow the wizard to configure your pipeline. For detailed configuration options, see Pipeline Configuration Guide.

Run Pipeline

Execute your pipeline:

python3 -m petaly run -p my_pipeline

Run Specific Operations

# Extract data from source only
python3 -m petaly run -p my_pipeline --source_only

# Load data to target only
python3 -m petaly run -p my_pipeline --target_only

# Run specific objects
python3 -m petaly run -p my_pipeline -o object1,object2

Tutorial: CSV to PostgreSQL

Prerequisites

  • Petaly installed and workspace initialized
  • PostgreSQL server running

Steps

  1. Initialize Pipeline
python3 -m petaly init -p csv_to_postgres
  2. Download Test Data
# Extract the downloaded test files
gunzip options.csv.gz
gunzip stocks.csv.gz
  3. Configure Pipeline
  • Use csv as source
  • Use postgres as target
  • Configure database connection details
  4. Run Pipeline
python3 -m petaly run -p csv_to_postgres

Example Configuration

pipeline:
  pipeline_attributes:
    pipeline_name: csv_to_postgres
    is_enabled: true
  source_attributes:
    connector_type: csv
  target_attributes:
    connector_type: postgres
    database_user: root
    database_password: db-password
    database_host: localhost
    database_port: 5432
    database_name: petalydb
    database_schema: petaly_tutorial
  data_attributes:
    use_data_objects_spec: only
    object_default_settings:
      header: true
      columns_delimiter: ","
      columns_quote: none

Documentation

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

Petaly is licensed under the Apache License 2.0. See the LICENSE file for details.
