Zero to Hero: DATAQUEST's Become a Data Engineer
Here's how to get DataQuest's Data Engineering Track missions' content to work on your localhost. Using data from my Valenbisi ARIMA modeling project, I will walk through steps using PostgreSQL, Postico, and the Command Line to get our DataQuest exercises running out of a Jupyter Notebook.
This will not be a complete repetition of the many resources I used, so be sure to look out for any links I include if it seems I've skipped a few steps.
Important note: In DataQuest, each exercise re-initiates the psycopg2 connection and cursor when interacting with the Postgres DB, with no deliberate closing of the connection. When we productionize our scripts, it will be cleaner and more correct to manage the connection and cursor with with statements and to close the connection explicitly once the operations are complete. For the sake of the exercises, I will follow DataQuest's format and switch to the with statement as we approach production.
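To make the difference concrete, here is a minimal sketch of both styles, assuming the valenbisi2018 database, the vbuser role, and the vbstatic table created later in this walkthrough (credentials are placeholders):

```python
import psycopg2
from contextlib import closing

# DataQuest-style: a fresh connection and cursor per exercise, never closed explicitly.
conn = psycopg2.connect(dbname="valenbisi2018", user="vbuser", password="MYPASSWORD")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM vbstatic;")
print(cur.fetchone())

# Production-style: `with conn` commits (or rolls back) the transaction,
# and closing() guarantees the connection itself is closed afterwards.
with closing(psycopg2.connect(dbname="valenbisi2018", user="vbuser", password="MYPASSWORD")) as conn:
    with conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM vbstatic;")
        print(cur.fetchone())
```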
There will be three directories in this repository, each aligning with a step in DataQuest's Data Engineer Track. This markdown file contains all of the content from each, but in case it is a bit unwieldy, each directory also contains a README.md with only the information covered in it.
- Production Databases
  - Postgres For Data Engineers
    - Intro to Postgres
    - Creating Tables
    - Managing Created Tables
    - Loading and Extracting Data with Tables
    - User and Database Management
    - Project: Storing Tropical Storm Data
  - Optimizing Postgres Databases
    - Exploring Postgres Internals
    - Debugging Postgres Queries
    - Using an Index
    - Advanced Indexing
    - Vacuuming Postgres Databases
- Handling Large Data Sets In Python
  - Processing Large Datasets in Pandas
    - Optimizing Dataframe Memory Footprint
    - Processing Dataframes in Chunks
    - Guided Project: Practice Optimizing Dataframes and Processing in Chunks
    - Augmenting Pandas with SQLite
    - Guided Project: Analyzing Startup Fundraising Deals from Crunchbase
  - Optimizing Code Performance on Large Datasets
    - CPU Bound Programs
    - I/O Bound Programs
    - Overcoming the Limitations of Threads
    - Quickly Analyzing Data with Parallel Processing
    - Guided Project: Analyzing Wikipedia Pages
  - Algorithms and Data Structures
    - Processing Tasks with Stacks and Queues
    - Effectively Using Arrays and Lists
    - Sorting Arrays and Lists
    - Searching Arrays and Lists
    - Hash Tables
    - Guided Project: Analyzing Stock Prices
  - Recursion and Trees
    - Overview of Recursion
    - Introduction to Binary Trees
    - Implementing a Binary Heap
    - Working with Binary Search Trees
    - Performance Boosts of Using a B-Tree
    - Performance Boosts of Using a B-Tree II
    - Guided Project: Implementing a Key-Value Database
- Data Pipelines
  - Building a Data Pipeline
    - Functional Programming
    - Pipeline Tasks
    - Building a Pipeline Class
    - Multiple Dependency Pipeline
    - Guided Project: Hackernews Pipeline
Getting started with PostgreSQL and Postico (01_intro_postgres):
- PostgreSQL download
- Postico download
- Configure Postgres
I found this source incredibly helpful.
It walks through installation, creating users, and connecting to a local database.
This repository covers how to complete almost all of the exercises out of a Jupyter Notebook; however, examples of how to complete some exercises in the command line are also covered. To access the CLI, where you can create users, manage permissions, and create your first table, click the database you created ("valenbisi2018" for this example).
In the CLI, each line will start with whatever you named your database, so for me it's valenbisi2018=#
How to fill a table with data from a CSV file:
First, create the table:
valenbisi2018=# CREATE TABLE vbstatic (id BIGSERIAL PRIMARY KEY, update VARCHAR(255), available INT,
free INT, total INT, name VARCHAR(255), long NUMERIC, lat NUMERIC);
Notice I made the column update data type VARCHAR. This is because when working with CSVs, DateTime objects sometimes get converted to strings, and Postgres will reject values that don't match a column's declared type, so it was simplest to do this. Here is a guide to all the different Postgres data types you can encounter.
Then fill the table with data from a CSV file containing only the columns you created in your table:
valenbisi2018=# \copy vbstatic(id,update,available,free,name,long,lat,total) FROM '~/Documents/Repos/data_quest_data_eng/postgres_mission/vb_table.csv' DELIMITER ',' CSV HEADER
Note that I use \copy, not COPY, and that the whole \copy meta-command is entered on a single line. The syntax for \copy is slightly different: (a) being a psql command, it is not terminated by a semicolon; (b) file paths are relative to the current working directory.
Source: One of the answers to this StackOverflow Question, which linked to here.
How to give permissions to your user [vbuser]
valenbisi2018=# GRANT SELECT
ON ALL TABLES IN SCHEMA public
TO vbuser;
Source: How to Grant Permissions in PostgreSQL
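If you would rather run these same steps from the notebook instead of the psql CLI, a rough psycopg2 equivalent might look like the sketch below (credentials and the local CSV path are placeholders):

```python
import psycopg2

# Placeholder credentials -- adjust for your own local setup.
conn = psycopg2.connect(dbname="valenbisi2018", user="vbuser", password="MYPASSWORD")
cur = conn.cursor()

# Same table definition as the CREATE TABLE statement above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS vbstatic (
        id BIGSERIAL PRIMARY KEY, update VARCHAR(255), available INT,
        free INT, total INT, name VARCHAR(255), long NUMERIC, lat NUMERIC);
""")

# copy_expert streams a local file through COPY ... FROM STDIN,
# which is the client-side equivalent of the \copy command above.
with open("vb_table.csv") as f:
    cur.copy_expert(
        "COPY vbstatic(id,update,available,free,name,long,lat,total) "
        "FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')", f)

conn.commit()
cur.close()
conn.close()
```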
Alright, now you're ready to follow along in my first Jupyter Notebook, 01_intro_postgres
Some additional notes to keep in mind:
- Make sure, when you are loading in your data using a CSV, that all the columns in the CSV are in the same order as in your CREATE TABLE statement (see the pandas sketch after these notes).
- If you need to delete a table, enter your Postgres CLI and type:
valenbisi2018=# DROP TABLE table_name;
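One way to keep that column order consistent is to reorder the DataFrame before writing the CSV that \copy will load. A small pandas sketch, where data stands in for the Valenbisi DataFrame and the source file name is hypothetical:

```python
import pandas as pd

# `data` stands in for the Valenbisi DataFrame used throughout this walkthrough.
data = pd.read_csv("vb_table_raw.csv")  # hypothetical source file

# Reorder the columns to match the column list used in \copy above,
# then write out the CSV (header row kept, pandas index dropped).
cols = ["id", "update", "available", "free", "name", "long", "lat", "total"]
data[cols].to_csv("vb_table.csv", index=False)
```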
Optimizing Your Postgres Database (02_opt_tables)
In this mission we review making tables and selecting data types, and I use SQLAlchemy to write a table from a pandas DataFrame object. This solves the problem I ran into during the first mission: I no longer need to store my date column update as VARCHAR. It's now a proper TIMESTAMP column.
Data types from the Postgres documentation:
| Name | Storage Size | Description | Range |
|---|---|---|---|
| smallint | 2 bytes | small-range integer | -32768 to +32767 |
| integer | 4 bytes | typical choice for integer | -2147483648 to +2147483647 |
| bigint | 8 bytes | large-range integer | -9223372036854775808 to 9223372036854775807 |
| decimal | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
| numeric | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
| real | 4 bytes | variable-precision, inexact | 6 decimal digits precision |
| double precision | 8 bytes | variable-precision, inexact | 15 decimal digits precision |
| serial | 4 bytes | autoincrementing integer | 1 to 2147483647 |
| bigserial | 8 bytes | large autoincrementing integer | 1 to 9223372036854775807 |

Postgres Documentation: Numeric Types

| Name | Description |
|---|---|
| character varying(n), varchar(n) | variable-length with limit |
| character(n), char(n) | fixed-length, blank padded |
| text | variable unlimited length |

Postgres Documentation: Character Types

| Name | Storage Size | Description | Low Value | High Value | Resolution |
|---|---|---|---|---|---|
| timestamp [ (p) ] [ without time zone ] | 8 bytes | both date and time (no time zone) | 4713 BC | 294276 AD | 1 microsecond / 14 digits |
| timestamp [ (p) ] with time zone | 8 bytes | both date and time, with time zone | 4713 BC | 294276 AD | 1 microsecond / 14 digits |
| date | 4 bytes | date (no time of day) | 4713 BC | 5874897 AD | 1 day |
| time [ (p) ] [ without time zone ] | 8 bytes | time of day (no date) | 00:00:00 | 24:00:00 | 1 microsecond / 14 digits |
| time [ (p) ] with time zone | 12 bytes | times of day only, with time zone | 00:00:00+1459 | 24:00:00-1459 | 1 microsecond / 14 digits |
| interval [ fields ] [ (p) ] | 16 bytes | time interval | -178000000 years | 178000000 years | 1 microsecond / 14 digits |
You will need SQLAlchemy to create a SQL table from a pandas DataFrame. The final code, for our example, will look as follows:
import sqlalchemy
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://nmolivo:MYPASSWORD@localhost/valenbisi2018')
data.to_sql('vbstatic', engine,
            dtype={'id': sqlalchemy.types.BIGINT,
                   'update': sqlalchemy.types.TIMESTAMP(timezone=False),
                   'available': sqlalchemy.types.INT,
                   'free': sqlalchemy.types.INT,
                   'total': sqlalchemy.types.INT,
                   'name': sqlalchemy.types.CHAR(length=55),
                   'long': sqlalchemy.types.Float(precision=15),
                   'lat': sqlalchemy.types.Float(precision=15)})
To get this code to run, I used the following sources:
- To configure the engine: dialect+driver://username:password@host:port/database
- To create the to_sql(dtype) dictionary
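To check that the column types landed as intended, you can read the schema back from the notebook. A quick sketch, reusing the engine created above:

```python
import pandas as pd

# Query information_schema for the column names and data types of vbstatic.
# The update column should now appear as 'timestamp without time zone'.
schema = pd.read_sql(
    """SELECT column_name, data_type
       FROM information_schema.columns
       WHERE table_name = 'vbstatic'
       ORDER BY ordinal_position;""",
    engine)
print(schema)
```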
Managing Tables (03_manage_tables):
How to inspect your tables:
For this, I am heading back to the command line rather than the Jupyter Notebook. Remember, because my database name is valenbisi2018, all commands I run in the CL will start with valenbisi2018=#. This will help differentiate code snippets I share from the CL vs. from my Jupyter Notebook.
valenbisi2018=# SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'vbstatic'
ORDER BY ordinal_position;
In this mission, the following concepts are covered (a short notebook sketch follows the list):
- Changing a table name with ALTER TABLE current_name RENAME TO new_name
- Deleting variables with ALTER TABLE table_name DROP COLUMN col_name
- Renaming variables with ALTER TABLE table_name RENAME COLUMN current_col_name TO new_col_name
- Changing a variable's datatype with ALTER TABLE table_name ALTER COLUMN bigint_col_name TYPE BIGINT
- Creating new variables and, optionally, setting a default value with ALTER TABLE table_name ADD COLUMN new_date_col DATE DEFAULT '01-01-1991'
- Populating a new variable with information from other variables using UPDATE table_name SET new_date_col = to_date(col_day || '-' || col_month || '-' || col_year, 'DD-MM-YYYY')
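As a rough illustration, a few of these commands run from the notebook might look like this (the obs_date column is made up purely for this example, and credentials are placeholders):

```python
import psycopg2

conn = psycopg2.connect(dbname="valenbisi2018", user="vbuser", password="MYPASSWORD")
cur = conn.cursor()

# Add a new DATE column with a default value (hypothetical column, for illustration only).
cur.execute("ALTER TABLE vbstatic ADD COLUMN obs_date DATE DEFAULT '01-01-1991';")

# Rename the new column.
cur.execute("ALTER TABLE vbstatic RENAME COLUMN obs_date TO observation_date;")

# Drop it again so the table ends up unchanged.
cur.execute("ALTER TABLE vbstatic DROP COLUMN observation_date;")

conn.commit()
cur.close()
conn.close()
```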
Loading and Extracting Data with Tables (04_extract_data):
I highly recommend participating in this course as a member of DATAQUEST. A summary of the curriculum is outlined below.
- Postgres for Data Engineers
- Optimizing Postgres Databases
- Processing Large Datasets in Pandas
- Optimizing Code Performance on Large Datasets
- Algorithms and Data Structures
- Recursion and Trees
- Building a Data Pipeline