dataDisk

dataDisk is a Python package designed to simplify the creation and execution of data processing pipelines. It provides a flexible framework for defining sequential tasks, applying transformations, and validating data. Additionally, it includes a ParallelProcessor for efficient parallel execution.

Key Features

  • DataPipeline: Define a sequence of data processing tasks in a straightforward manner.
  • Transformation: Apply custom transformations to your data easily.
  • Validator: Ensure your data meets specific conditions.
  • ParallelProcessor: Execute pipeline tasks in parallel for improved performance.
  • Data Sinks: Save processed data to various formats like CSV, Excel, and SQLite.

Installation

Install the package using pip:

pip install dataDisk

For a development installation with the optional development, Excel, and SQL extras:

pip install -e ".[dev,excel,sql]"

Quick Start

from dataDisk import DataPipeline, Transformation, Validator
from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVDataSink

# Load data
source = CSVDataSource('input.csv')
data = source.load()

# Create transformations
def normalize(data):
    return (data - data.mean()) / data.std()

transformation = Transformation(normalize)

# Create validator
def check_valid(data):
    return data.notnull().all().all()

validator = Validator(check_valid)

# Create pipeline
pipeline = DataPipeline()
pipeline.add_step(transformation)
pipeline.add_step(validator)

# Process data
result = pipeline.run(data)

# Save results
sink = CSVDataSink('output.csv')
sink.save(result)
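
For parallel execution, dataDisk also ships a ParallelProcessor, which is not demonstrated in this README. The sketch below only illustrates the general pattern that feature automates: running the same pipeline over independent chunks of the data at once. It uses the standard library rather than dataDisk's own API, and the chunk count is arbitrary.

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

# Split the frame into row chunks and run the pipeline on each chunk.
# Steps that rely on whole-frame statistics (like normalize above) will
# see per-chunk statistics here; this is only a sketch of the pattern.
chunks = np.array_split(data, 4)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(pipeline.run, chunks))

combined = pd.concat(results)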

Transformations

Transformations allow you to apply various operations to your data. Here's a brief overview of available transformations:

  • Standardize: Scale features to have zero mean and unit variance.
  • Normalize: Scale features to a fixed range (typically 0 to 1).
  • Label Encode: Convert categorical labels to numeric values.
  • OneHot Encode: Convert categorical labels to one-hot encoded vectors.
  • Data Cleaning: Perform data cleaning operations like filling missing values and encoding categories.

Example of a custom transformation:

from dataDisk.transformation import Transformation

def double(x):
    return x * 2

transformation = Transformation(double)
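
A custom step like this slots directly into a pipeline alongside hand-written versions of the built-in operations. The sketch below chains the double step with a standardization step, assuming (as in the Quick Start) that the pipeline operates on a pandas DataFrame; the small frame here is made up for illustration.

from dataDisk import DataPipeline
from dataDisk.transformation import Transformation
import pandas as pd

def standardize(data):
    # Zero mean and unit variance per column, mirroring the built-in Standardize.
    return (data - data.mean()) / data.std()

pipeline = DataPipeline()
pipeline.add_step(Transformation(double))
pipeline.add_step(Transformation(standardize))

data = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
result = pipeline.run(data)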

Data Sinks

Data sinks allow you to save processed data to various formats:

  • CSVDataSink: Save data to a CSV file.
  • ExcelDataSink: Save data to an Excel file.
  • SQLiteDataSink: Save data to an SQLite database.

Example of using a data sink:

from dataDisk.data_sinks import CSVDataSink

csv_data_sink = CSVDataSink('output.csv')
csv_data_sink.save(data)
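
The Excel and SQLite sinks presumably expose the same save interface, but their constructor arguments are not documented in this README, so the arguments below (an output path, plus possibly a table name for SQLite) are assumptions made for illustration. The excel and sql extras from the Installation section are likely needed for these sinks.

from dataDisk.data_sinks import ExcelDataSink, SQLiteDataSink

# Constructor arguments are assumptions; only the class names and the
# save() method are taken from this README.
excel_sink = ExcelDataSink('output.xlsx')
excel_sink.save(data)

sqlite_sink = SQLiteDataSink('output.db')  # a table name may also be required
sqlite_sink.save(data)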

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies (pip install -e ".[dev]")
  4. Set up pre-commit hooks (pre-commit install)
  5. Make your changes
  6. Commit your changes (git commit -m 'Add some amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the License.md file for details.
