The `etl` folder holds the solution to this task. The `ingest_data.py` script is responsible for reading the DDL scripts and creating the Hive tables. I used the TPC-DS toolkit to generate the data.
The `datamodels` folder holds the solution to this task. It contains an image (.png) file that shows the final data model. The model employs a star schema design in which the `store_sales` table is the fact table and the dimension tables are `customer`, `store`, `item`, and `date_dim`, as required for this task. I have added two additional tables, `address` and `company`, to normalize the data in the data warehouse. The design could be further normalized by creating additional tables to reduce data redundancy.
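To illustrate the relationships, below is a minimal sketch of the fact table DDL. The column names follow TPC-DS conventions and are assumptions rather than the contents of the actual scripts; the surrogate-key columns are what join `store_sales` to each dimension table.

```sql
-- Minimal sketch of the star-schema fact table (column names are
-- illustrative, based on TPC-DS conventions, not the actual DDL scripts).
CREATE TABLE IF NOT EXISTS store_sales (
    ss_sold_date_sk BIGINT,        -- joins to date_dim
    ss_item_sk      BIGINT,        -- joins to item
    ss_customer_sk  BIGINT,        -- joins to customer
    ss_store_sk     BIGINT,        -- joins to store
    ss_quantity     INT,
    ss_sales_price  DECIMAL(7,2),
    ss_net_paid     DECIMAL(7,2)
)
STORED AS PARQUET;
```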
The `sql_ddl_codes` folder holds the solution to this task. I have included five (5) DDL SQL scripts that create the five mandatory tables. The process to create the Hive tables, as specified in the requirements, is:
- Create the database as specified by `tpcdsDatabaseName`.
- Create all five (5) tables.
- Load the data into the tables in Parquet format. Since the data generated by the TPC-DS toolkit is in CSV format, the loading is done in multiple steps (see the sketch after this list):
  - Step 1: Create text-based (CSV) tables whose `LOCATION` points to the generated data.
  - Step 2: Create Parquet tables by using CTAS to convert the text data into Parquet format, loading them into the Alluxio cluster using `LOCATION 'alluxio://master_hostname:port/{tablename}'`, assuming that a cluster exists and the table name is specified after the final `/`.
  - Step 3: Drop the text-based tables, as they are no longer needed.
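Below is a minimal sketch of the three loading steps for a single table (`store_sales`), assuming a hypothetical HDFS path for the generated CSV data, an abbreviated column list, and the Alluxio location described above; the path, delimiter, and columns would need to match the actual generated files and DDL scripts.

```sql
-- Step 1: text-based staging table pointing at the generated CSV data
-- (the HDFS path, delimiter, and column list are assumptions).
CREATE EXTERNAL TABLE IF NOT EXISTS store_sales_text (
    ss_sold_date_sk BIGINT,
    ss_item_sk      BIGINT,
    ss_customer_sk  BIGINT,
    ss_store_sk     BIGINT,
    ss_quantity     INT,
    ss_net_paid     DECIMAL(7,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/tpcds/store_sales';

-- Step 2: CTAS into Parquet, written to the Alluxio cluster.
CREATE TABLE store_sales
STORED AS PARQUET
LOCATION 'alluxio://master_hostname:port/store_sales'
AS SELECT * FROM store_sales_text;

-- Step 3: drop the text-based staging table, which is no longer needed.
DROP TABLE store_sales_text;
```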
I have also included a `sql_queries` folder inside the `sql_ddl_codes` folder, which holds simple SQL DML scripts that provide the metrics the BI team needs.
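As an example of the kind of metric such a script might compute, here is a minimal sketch of a daily net-sales-per-store query; the metric and column names follow TPC-DS conventions and are assumptions, not the actual queries in the folder.

```sql
-- Hypothetical BI metric: daily net sales per store
-- (column names follow TPC-DS conventions and are assumptions).
SELECT
    s.s_store_name,
    d.d_date,
    SUM(ss.ss_net_paid) AS total_net_paid
FROM store_sales ss
JOIN store    s ON ss.ss_store_sk     = s.s_store_sk
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY s.s_store_name, d.d_date;
```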
`task4.md` contains the answers to the questions.