Skip to content

Latest commit

 

History

History
131 lines (117 loc) · 8.77 KB

README.md

File metadata and controls

131 lines (117 loc) · 8.77 KB

Zero to Hero: DATAQUEST's Become a Data Engineer

Here's how to get DataQuest's Data Engineering Track missions' content to work on your localhost. Using data from my Valenbisi ARIMA modeling project, I will walk through steps using PostgreSQL, Postico, and the Command Line to get our DataQuest exercises running out of a Jupyter Notebook.

This will not be a repeat of the many resources I used, so be sure to look out for any links I include if it seems I've skipped a few steps.

Getting started with PostgreSQL and Postico (01_intro_postgres):


PostgreSQL download
Postico download

1. Configure Postgres
I found this source incredibly helpful.
It walks through installation, creating users, and connecting to a local database.

This Repository will be covering how to do almost all the exercises out of a Jupyter Notebook. However, examples of how to complete some exercises in the CL will also be covered. To access the CLI, where you can create users, manage permissions, and create your first table: click the database created, "valenbisi2018", for this example.


In the CLI, each line will start with whatever you named your database, so for me it's valenbisi2018#=

Here are the points I found challenging, so they are documented below.
1a. How to fill a database with a csv file:
First create the database:

valenbisi2018#= CREATE TABLE vbstatic (id BIGSERIAL PRIMARY KEY, update VARCHAR(255), available INT, 
                                       free INT, total INT, name VARCHAR(255), long NUMERIC, lat NUMERIC);

Notice I made column update into data type VARCHAR. This is because when working with CSVs, DateTime Objects sometimes get converted to strings. Postgres cannot handle data type misgivings, so it was simplest to do this. Here is a guide to all the different Postgres data types you can encounter.

Then fill the database with data from a csv file containing only the columns you created in your table.

valenbisi2018#= \copy vbstatic(id,update,available,free,name,long,lat,total) 
FROM '~/Documents/Repos/data_quest_data_eng/postgres_mission/vb_table.csv' 
DELIMITER ',' 
CSV HEADER

Note that I use \copy, not COPY

The syntax for \COPY is slightly different: (a) being a psql command, it is not terminated by a semicolon (b) file paths are relative the current working directory.

Source: One of the answers to this StackOverflow Question, which linked to here.

1b. How to give permissions to your user [vbuser]

valenbisi2018#= GRANT SELECT
ON ALL TABLES IN SCHEMA public
TO vbuser;

Source: How to Grant Permissions in PostgreSQL

Alright, now you're ready to follow along in my first Jupyter Notebook, 01_intro_postgres
Some additional notes to keep in mind:

  1. Make sure when you are loading in your data using a csv, that all the columns in the csv are in the same order as in your CREATE TABLE statement
  2. If you need to delete a table, enter your Postgres CLI and type:
valenbisi2018#= DROP TABLE table_name;
  1. To see a description of tables, type into the CLI: \dt, short for 'describe tables'

Optimizing Your Postgres Database (02_opt_tables)


In this mission we review making tables, datatype selection, and I use SQLAlchemy to write a table from a pandas DataFrame object. This solves the porblem I ran into during the first mission: I no longer need to store my date column update as VARCHAR. It's now a proper TIMESTAMP object.

Datatypes from the PostGres Documentation:

Numeric Types

Name Storage Size Description Range
smallint 2 bytes small-range integer -32768 to +32767
integer 4 bytes typical choice for integer -2147483648 to +2147483647
bigint 8 bytes large-range integer -9223372036854775808 to 9223372036854775807
decimal variable user-specified precision, exact up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
numeric variable user-specified precision, exact up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
real 4 bytes variable-precision, inexact 6 decimal digits precision
double precision 8 bytes variable-precision, inexact 15 decimal digits precision
serial 4 bytes autoincrementing integer 1 to 2147483647
bigserial 8 bytes large autoincrementing integer 1 to 9223372036854775807

Postgres Documentation: Numeric Types

Character Types

Name Description
character varying(n), varchar(n) variable-length with limit
character(n), char(n) fixed-length, blank padded
text variable unlimited length

Postgres Documentation: Character Types

Datetime Types

Name Storage Size Description Low Value High Value Resolution
timestamp [ (p) ] [ without time zone ] 8 bytes both date and time (no time zone) 4713 BC 294276 AD 1 microsecond / 14 digits
timestamp [ (p) ] with time zone 8 bytes both date and time, with time zone 4713 BC 294276 AD 1 microsecond / 14 digits
date 4 bytes date (no time of day) 4713 BC 5874897 AD 1 day
time [ (p) ] [ without time zone ] 8 bytes time of day (no date) 00:00:00 24:00:00 1 microsecond / 14 digits
time [ (p) ] with time zone 12 bytes times of day only, with time zone 00:00:00+1459 24:00:00-1459 1 microsecond / 14 digits
interval [ fields ] [ (p) ] 16 bytes time interval -178000000 years 178000000 years 1 microsecond / 14 digits

Postgres Documentation: DateTime Types

You will need SQLAlchemy to create an SQL database from a pandas dataframe. The final code, for our example, will look as follows:

from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://nmolivo:MYPASSWORD@localhost/valenbisi2018')
data.to_sql('vbstatic', engine, dtype = {'id': sqlalchemy.types.BIGINT, \
                                         'update':sqlalchemy.types.TIMESTAMP(timezone=False), \
                                         'available':sqlalchemy.types.INT, \
                                         'free':sqlalchemy.types.INT, \
                                         'total':sqlalchemy.types.INT, \
                                         'name':sqlalchemy.types.CHAR(length=55), \
                                         'long': sqlalchemy.types.Float(precision=15), \
                                         'lat': sqlalchemy.types.Float(precision=15)})

To get this code to compile, I used the following sources:

Managing Tables (03_manage_tables):


For Non-Commercial Use Only

I highly reccommend participating in this course as a member of DATAQUEST. A summary of the curriculum is outlined below.

  • Postgres for Data Engineers
  • Optimizing Postgres Databases
  • Processing Large Datasets in Pandas
  • Optimizing Code performance on Large Datasets
  • Algorithms and Data Structures
  • Recursion Trees
  • Building a Data Pipeline