AlphaClean v0.1

In many data science applications, data cleaning is effectively manual with ad-hoc, single-use scripts and programs. This practice is highly problematic for reproducibility (sharing analysis between researchers), interpretability (explaining the findings of a particular analysis), and scalability (replicating the analyses on larger corpora). Most existing relational data cleaning solutions are designed as stand-alone systems -- often coupled with a DBMS -- with poor support for languages widely in data science, such as Python and R.

In order to address this problem, we designed a Python 2 library that declaratively synthesizes data cleaning programs. AlphaClean is given a specification of quality (e.g. integrity constraints or a statistical model the data must conform to) and a language of allowed data transformations, then it searches to find a sequence of transformations that best satisfies the quality specification. The discovered sequence of transformations defines an intermediate representation, which can be easily transferred between languages or optimized with a compiler.

Requirements

As an initial research prototype, we have not yet designed AlphaClean for deployment. It is a package that researchers can locally develop and test to evaluate different data cleaning approaches and techniques. The dependencies are:

dateparser==0.6.0
Distance==0.1.3
numpy==1.12.1
pandas==0.20.1
pyparsing==2.2.0
python-dateutil==2.6.0
pytz==2017.2
regex==2017.7.11
ruamel.ordereddict==0.4.9
ruamel.yaml==0.15.18
scikit-learn==0.18.1
scipy==0.19.0
six==1.10.0
gensim==2.2.0

These can be installed by running the following code from the command line:

pip install -r requirements.txt

Using the Package

The docs folder contains a number of tutorials on how to use AlphaClean.

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
alphaclean		alphaclean
datasets		datasets
docs		docs
examples		examples
resources		resources
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AlphaClean v0.1

Requirements

Using the Package

About

Releases

Packages

Languages

sjyk/alphaclean

Folders and files

Latest commit

History

Repository files navigation

AlphaClean v0.1

Requirements

Using the Package

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages