dataDisk

dataDisk is a Python package designed to simplify the creation and execution of data processing pipelines. It provides a flexible framework for defining sequential tasks, applying transformations, and validating data. Additionally, it includes a ParallelProcessor for efficient parallel execution.

Key Features

  • DataPipeline: Define a sequence of data processing tasks in a straightforward manner.
  • Transformation: Apply custom transformations to your data easily.
  • Validator: Ensure your data meets specific conditions.
  • ParallelProcessor: Execute pipeline tasks in parallel for improved performance.
  • Data Sinks: Save processed data to various formats like CSV, Excel, and SQLite.

Installation

Install the package using pip:

pip install dataDisk

For a development installation with the optional development, Excel, and SQL extras:

pip install -e ".[dev,excel,sql]"

Quick Start

from dataDisk import DataPipeline, Transformation, Validator
from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVDataSink

# Load data
source = CSVDataSource('input.csv')
data = source.load()

# Create transformations
def normalize(data):
    return (data - data.mean()) / data.std()

transformation = Transformation(normalize)

# Create validator
def check_valid(data):
    return data.notnull().all().all()

validator = Validator(check_valid)

# Create pipeline
pipeline = DataPipeline()
pipeline.add_step(transformation)
pipeline.add_step(validator)

# Process data
result = pipeline.run(data)

# Save results
sink = CSVDataSink('output.csv')
sink.save(result)
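
For parallel execution, dataDisk also ships a ParallelProcessor, which is not demonstrated in this README. The sketch below only illustrates the general pattern that feature automates: running the same pipeline over independent chunks of the data at once. It uses the standard library rather than dataDisk's own API, and the chunk count is arbitrary.

from concurrent.futures import ThreadPoolExecutor
import numpy as np
import pandas as pd

# Split the frame into row chunks and run the pipeline on each chunk.
# Steps that rely on whole-frame statistics (like normalize above) will
# see per-chunk statistics here; this is only a sketch of the pattern.
chunks = np.array_split(data, 4)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(pipeline.run, chunks))

combined = pd.concat(results)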

Transformations

Transformations allow you to apply various operations to your data. Here's a brief overview of available transformations:

  • Standardize: Scale features to have zero mean and unit variance.
  • Normalize: Scale features to a fixed range (typically 0 to 1).
  • Label Encode: Convert categorical labels to numeric values.
  • OneHot Encode: Convert categorical labels to one-hot encoded vectors.
  • Data Cleaning: Perform data cleaning operations like filling missing values and encoding categories.

Example of a custom transformation:

from dataDisk.transformation import Transformation

def double(x):
    return x * 2

transformation = Transformation(double)
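
A custom step like this slots directly into a pipeline alongside hand-written versions of the built-in operations. The sketch below chains the double step with a standardization step, assuming (as in the Quick Start) that the pipeline operates on a pandas DataFrame; the small frame here is made up for illustration.

from dataDisk import DataPipeline
from dataDisk.transformation import Transformation
import pandas as pd

def standardize(data):
    # Zero mean and unit variance per column, mirroring the built-in Standardize.
    return (data - data.mean()) / data.std()

pipeline = DataPipeline()
pipeline.add_step(Transformation(double))
pipeline.add_step(Transformation(standardize))

data = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
result = pipeline.run(data)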

Data Sinks

Data sinks allow you to save processed data to various formats:

  • CSVDataSink: Save data to a CSV file.
  • ExcelDataSink: Save data to an Excel file.
  • SQLiteDataSink: Save data to an SQLite database.

Example of using a data sink:

from dataDisk.data_sinks import CSVDataSink

csv_data_sink = CSVDataSink('output.csv')
csv_data_sink.save(data)
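
The Excel and SQLite sinks presumably expose the same save interface, but their constructor arguments are not documented in this README, so the arguments below (an output path, plus possibly a table name for SQLite) are assumptions made for illustration. The excel and sql extras from the Installation section are likely needed for these sinks.

from dataDisk.data_sinks import ExcelDataSink, SQLiteDataSink

# Constructor arguments are assumptions; only the class names and the
# save() method are taken from this README.
excel_sink = ExcelDataSink('output.xlsx')
excel_sink.save(data)

sqlite_sink = SQLiteDataSink('output.db')  # a table name may also be required
sqlite_sink.save(data)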

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies (pip install -e ".[dev]")
  4. Set up pre-commit hooks (pre-commit install)
  5. Make your changes
  6. Commit your changes (git commit -m 'Add some amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the License.md file for details.
