A logical, reasonably standardized, but flexible project structure for doing and sharing data science work at Farmers Edge.
- Python 2.7 or 3.5
- Cookiecutter Python package >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
To start a new project, run:
$ cookiecutter https://github.com/jacobwbengtson/jake_cookiecutter
The directory structure of your new project looks like this:
├── README.md          <- The top-level README for developers using this project.
├── .gitignore         <- A boilerplate version will be provided.
├── config.txt         <- Contains passwords and tokens that should not be version controlled.
├── environment.yml    <- The .yml file used to create the environment for the project.
│                         Generate the .yml file using `conda env export > environment.yml`.
│                         Recreate the environment from the .yml using `conda env create -f environment.yml`.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   ├── interim        <- Data that has been cleansed or altered, but is not in its final state.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries.
│
├── exploration        <- Jupyter notebooks or Python scripts for EDA. Naming convention is a number
│                         (for ordering), the creator's initials, and a short `_`-delimited description,
│                         e.g. `1.0_jwb_initial_data_exploration.ipynb`.
│
├── experiments        <- Jupyter notebooks or Python scripts for model experimentation. Naming convention
│                         is a number (for ordering), the creator's initials, and a short `_`-delimited
│                         description, e.g. `1.0_jwb_random_forest.py`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── main.py            <- Script that runs everything required to generate the best working model for
│                         the project, from data ingestion to model training.
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python package.
    │
    ├── data           <- Functions to download, generate, combine, clean, or featurize data.
    │   ├── pull.py      <- Outputs to data/raw.
    │   ├── clean.py     <- Outputs to data/interim.
    │   └── featurize.py <- Outputs to either data/interim or data/processed.
    │
    ├── models         <- Functions to train/test models, or use trained models for predictions.
    │   ├── train.py
    │   ├── test.py
    │   └── predict.py
    │
    └── visualization  <- Functions to create exploratory and results-oriented visualizations.
        └── visualize.py
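The layout above implies a linear pipeline driven by `main.py`: pull raw data, clean it into `data/interim`, featurize it into `data/processed`, then train. A minimal sketch of that orchestration is below; the function bodies are hypothetical stand-ins for the real `src/data` and `src/models` code, defined inline so the example is self-contained.

```python
# Hypothetical sketch of main.py's orchestration. In a real project these
# functions would live in src/data/pull.py, src/data/clean.py, and
# src/data/featurize.py; they are stubbed here for illustration.
from pathlib import Path


def pull(raw_dir: Path) -> Path:
    """Stand-in for src/data/pull.py: writes the immutable raw dump."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    out = raw_dir / "yields.csv"
    out.write_text("field_id,yield\n1,52.3\n2,48.1\n")
    return out


def clean(raw_file: Path, interim_dir: Path) -> Path:
    """Stand-in for src/data/clean.py: cleansed data goes to data/interim."""
    interim_dir.mkdir(parents=True, exist_ok=True)
    out = interim_dir / "yields_clean.csv"
    out.write_text(raw_file.read_text().strip() + "\n")
    return out


def featurize(interim_file: Path, processed_dir: Path) -> Path:
    """Stand-in for src/data/featurize.py: final model-ready data set."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    out = processed_dir / "features.csv"
    out.write_text(interim_file.read_text())
    return out


def main(project_root: Path) -> Path:
    data = project_root / "data"
    raw = pull(data / "raw")
    interim = clean(raw, data / "interim")
    processed = featurize(interim, data / "processed")
    # src/models/train.py would pick up the processed file from here.
    return processed


if __name__ == "__main__":
    print(main(Path(".")))
```

Each stage writes only to its own `data/` subfolder, so `data/raw` stays immutable and any intermediate file can be regenerated by rerunning the stage that produced it.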
We welcome contributions! See the docs for guidelines.
To recreate the project's environment from the exported file:
$ conda env create -f environment.yml
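For reference, an exported environment.yml looks something like the following; the name and packages here are illustrative, not this project's actual dependencies:

```yaml
name: my_project
channels:
  - conda-forge
dependencies:
  - python=3.5
  - pandas
  - scikit-learn
  - jupyter
```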