dataDisk is a Python package designed to simplify the creation and execution of data processing pipelines. It provides a flexible framework for defining sequential tasks, applying transformations, and validating data. Additionally, it includes features for efficient parallel execution.
- DataPipeline: Define a sequence of data processing tasks in a straightforward manner.
- Transformation: Apply custom transformations to your data easily.
- Validator: Ensure your data meets specific conditions.
- ParallelProcessor: Execute pipeline tasks in parallel for improved performance.
- Data Sinks: Save processed data to various formats like CSV, Excel, and SQLite.
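ParallelProcessor is not shown in use elsewhere in this README, so the following is only a standard-library sketch of the general idea, not dataDisk's API: one task is applied to several independent chunks of data concurrently. The helper `run_in_parallel` and the sample task `double` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def double(chunk):
    """Example task: double every value in a chunk."""
    return [value * 2 for value in chunk]


def run_in_parallel(task, chunks, max_workers=4):
    """Apply `task` to each chunk concurrently, preserving input order.

    Hypothetical helper for illustration only; dataDisk's
    ParallelProcessor may expose a different interface.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `chunks`
        return list(executor.map(task, chunks))


chunks = [[1, 2], [3, 4], [5, 6]]
results = run_in_parallel(double, chunks)
print(results)
```

Threads mainly help when tasks wait on I/O. For CPU-bound work in CPython, a process pool (`concurrent.futures.ProcessPoolExecutor`) is usually the better fit.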
Install the package using pip:
pip install dataDisk

For development installation with extra dependencies:
pip install -e ".[dev,excel,sql]"

from dataDisk import DataPipeline, Transformation, Validator
from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVDataSink
# Load data
source = CSVDataSource('input.csv')
data = source.load()
# Create transformations
def normalize(data):
    return (data - data.mean()) / data.std()
transformation = Transformation(normalize)
# Create validator
def check_valid(data):
    return data.notnull().all().all()
validator = Validator(check_valid)
# Create pipeline
pipeline = DataPipeline()
pipeline.add_step(transformation)
pipeline.add_step(validator)
# Process data
result = pipeline.run(data)
# Save results
sink = CSVDataSink('output.csv')
sink.save(result)

Transformations allow you to apply various operations to your data. Here's a brief overview of available transformations:
- Standardize: Scale features to have zero mean and unit variance.
- Normalize: Rescale features to a common range, conventionally [0, 1] (min-max scaling).
- Label Encode: Convert categorical labels to numeric values.
- OneHot Encode: Convert categorical labels to one-hot encoded vectors.
- Data Cleaning: Perform data cleaning operations like filling missing values and encoding categories.
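To make the scaling and encoding operations above concrete, here is a minimal pure-Python sketch of what each one computes. It does not use dataDisk's own transformation classes; the function names are hypothetical, and min-max scaling is shown as one common meaning of "normalization" rather than as the library's confirmed behavior.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; zero if all values are equal
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, assigned in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def one_hot_encode(labels):
    """Turn each label into a 0/1 vector containing a single 1."""
    codes, mapping = label_encode(labels)
    width = len(mapping)
    return [[1 if position == code else 0 for position in range(width)]
            for code in codes]


heights = [2.0, 4.0, 6.0]
colors = ["red", "green", "red"]

print(standardize(heights))
print(min_max_scale(heights))
print(label_encode(colors)[0])
print(one_hot_encode(colors))
```

Both scaling functions divide by zero on a constant column, so real code should guard that case. Note also that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so its z-scores differ slightly from the population version used here.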
from dataDisk.transformation import Transformation
def double(x):
    return x * 2

transformation = Transformation(double)

Data sinks allow you to save processed data to various formats:
- CSVDataSink: Save data to a CSV file.
- ExcelDataSink: Save data to an Excel file.
- SQLiteDataSink: Save data to an SQLite database.
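As a rough picture of what a sink does, the sketch below writes the same rows to a CSV file and to an SQLite table using only the standard library. It is an illustration under stated assumptions, not dataDisk's implementation: `save_csv`, `save_sqlite`, the `results` table, and its two columns are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]


def save_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def save_sqlite(rows, path, table="results"):
    """Write a list of dicts to an SQLite table, creating it if needed.

    The table name is interpolated into the SQL, so it must come from
    trusted code, never from user input.
    """
    connection = sqlite3.connect(path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                f"CREATE TABLE IF NOT EXISTS {table} (name TEXT, value INTEGER)"
            )
            connection.executemany(
                f"INSERT INTO {table} (name, value) VALUES (:name, :value)", rows
            )
    finally:
        connection.close()


with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "output.csv"
    db_path = Path(folder) / "output.db"
    save_csv(rows, csv_path)
    save_sqlite(rows, db_path)

    # Read both outputs back to confirm what was written
    csv_text = csv_path.read_text()
    connection = sqlite3.connect(db_path)
    stored = connection.execute(
        "SELECT name, value FROM results ORDER BY name"
    ).fetchall()
    connection.close()

print(csv_text)
print(stored)
```

Giving every sink the same small "take data, write it somewhere" shape is what lets the output format change without touching the rest of a pipeline.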
from dataDisk.data_sinks import CSVDataSink
csv_data_sink = CSVDataSink('output.csv')
csv_data_sink.save(data)

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Install development dependencies (pip install -e ".[dev]")
- Set up pre-commit hooks (pre-commit install)
- Make your changes
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the License.md file for details.