This boilerplate is designed to kickstart data science projects by providing a basic setup for database connections, data processing, and machine learning model development. It includes a structured folder organization for your datasets and a set of pre-defined Python packages necessary for most data science tasks.
The project is organized as follows:
app.py
- The main Python script that you run for your project.
explore.py
- A notebook to explore data, play around, visualize, clean, etc. Ideally, the notebook code should be migrated to app.py when moving to production.
utils.py
- This file contains utility code for operations like database connections.
requirements.txt
- This file contains the list of necessary Python packages.
models/
- This directory should contain your SQLAlchemy model classes.
data/
- This directory contains the following subdirectories:
  interim/
  - For intermediate data that has been transformed.
  processed/
  - For the final data to be used for modeling.
  raw/
  - For raw data without any processing.
Prerequisites
Make sure you have Python 3.11+ installed on your machine. You will also need pip to install the Python packages.
Installation
Clone the project repository to your local machine.
Navigate to the project directory and install the required Python packages:
pip install -r requirements.txt
Create a database (if needed)
Create a new database within the Postgres engine by customizing and executing the following command:
createdb -h localhost -U <username> <db_name>
Connect to the Postgres engine to use your database and manipulate tables and data:
psql -h localhost -U <username> <db_name>
NOTE: Check the ./.env file for the username and db_name.
Once you are inside psql, you can create tables, run queries, and insert, update, or delete data.
Environment Variables
Create a .env file in the project root directory to store your environment variables, such as your database connection string:
DATABASE_URL="your_database_connection_url_here"
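As a rough illustration of how the script could pick up this value, here is a minimal sketch that uses only the standard library. The helper names read_env_file and get_database_url are hypothetical and not part of this boilerplate. Many projects use the python-dotenv package for this instead; whether it is listed in requirements.txt is not stated here, so this sketch parses the file by hand.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in DATABASE_URL="...".
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_database_url(path=".env"):
    """Return DATABASE_URL from the process environment, else from the .env file."""
    url = os.environ.get("DATABASE_URL") or read_env_file(path).get("DATABASE_URL")
    if not url:
        raise RuntimeError("DATABASE_URL is not set; add it to .env or the environment")
    return url
```

The returned string can then be passed to SQLAlchemy's create_engine. This parser only handles plain KEY=VALUE lines; it does not support multi-line values or variable expansion.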
To run the application, execute the app.py script from the root of the project directory:
python app.py
To add SQLAlchemy model classes, create new Python script files inside the models/ directory. These classes should be defined according to your database schema.
Example model definition (models/example_model.py):
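from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+; replaces the deprecated sqlalchemy.ext.declarative import

# Shared base class that all model classes inherit from
Base = declarative_base()

class ExampleModel(Base):
    # Name of the database table this class maps to
    __tablename__ = 'example_table'

    id = Column(Integer, primary_key=True)
    name = Column(String)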
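To show how a model class like this might be used, the sketch below creates the table, inserts one row, and reads it back. It assumes SQLAlchemy 1.4 or newer and uses an in-memory SQLite database purely so the example is self-contained; in this project you would presumably build the engine from your DATABASE_URL instead. The row content is made up for illustration.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class ExampleModel(Base):
    __tablename__ = "example_table"

    id = Column(Integer, primary_key=True)
    name = Column(String)


# Assumption: in-memory SQLite stands in for the project's Postgres database.
engine = create_engine("sqlite:///:memory:")

# Create every table registered on Base (here, only example_table).
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Insert one illustrative row and persist it.
    session.add(ExampleModel(name="first row"))
    session.commit()

    # Read back the name column of all rows as a plain list of strings.
    names = session.execute(select(ExampleModel.name)).scalars().all()
```

Calling create_all is convenient for experiments; it only creates missing tables and does not alter tables that already exist.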
You can place your raw datasets in the data/raw directory, intermediate datasets in data/interim, and the processed datasets ready for analysis in data/processed.
To process data, you can modify the app.py script to include your data processing steps, utilizing pandas for data manipulation and analysis.
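One possible shape for such a processing step is sketched below: it reads a CSV from data/raw, writes a lightly cleaned copy to data/interim, and writes the final table to data/processed. The function name process_raw_csv, the file name example.csv, and the specific cleaning rules are assumptions chosen for illustration, not something this template prescribes.

```python
from pathlib import Path

import pandas as pd


def process_raw_csv(raw_path, interim_path, processed_path):
    """Clean one raw CSV in two stages, saving each stage to its own folder."""
    df = pd.read_csv(raw_path)

    # Stage 1 (interim): tidy the column names and drop exact duplicate rows.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    interim = df.drop_duplicates().reset_index(drop=True)
    Path(interim_path).parent.mkdir(parents=True, exist_ok=True)
    interim.to_csv(interim_path, index=False)

    # Stage 2 (processed): keep only rows with no missing values.
    processed = interim.dropna().reset_index(drop=True)
    Path(processed_path).parent.mkdir(parents=True, exist_ok=True)
    processed.to_csv(processed_path, index=False)
    return processed
```

A call from app.py might look like process_raw_csv("data/raw/example.csv", "data/interim/example.csv", "data/processed/example.csv"). Keeping the interim file on disk makes it easier to inspect what each stage changed.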
This template was built as part of the 4Geeks Academy Data Science and Machine Learning Bootcamp by Alejandro Sanchez and many other contributors. Find out more about 4Geeks Academy's BootCamp programs here.
Other templates and resources like this can be found on the school's GitHub page.