TCP-DWH

Python ETL code using the PySpark API

The etl folder holds the solution to this task. The ingest_data.py script is responsible for reading the DDL scripts and creating the Hive tables. I used the TPC-DS toolkit to generate the data.
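
For illustration, here is a minimal sketch of that flow, assuming a SparkSession with Hive support and one CREATE statement per DDL file; the paths and app name are placeholders, not the exact contents of ingest_data.py:

```python
from pathlib import Path
from pyspark.sql import SparkSession

# Hive support lets spark.sql() create and manage Hive tables directly.
spark = (
    SparkSession.builder
    .appName("tpcds-ingest")
    .enableHiveSupport()
    .getOrCreate()
)

# Run every DDL script in the folder; each file is assumed to hold a
# single CREATE TABLE statement, since spark.sql() takes one statement.
for ddl_file in sorted(Path("sql_ddl_codes").glob("*.sql")):
    spark.sql(ddl_file.read_text())
```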

A simple data model of the final tables

The datamodels folder holds the solution to this task. It contains an image (.png) file that shows the final tables. The data model employs a star schema design in which the store_sales table is the fact table and the other tables are dimensions, namely (customer, store, item, date_dim), as required by this task. I have added two additional tables, (address, company), to normalize the data in the data warehouse. The design could be normalized further by creating additional tables in order to reduce data redundancy.
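
To make the star layout concrete, here is a hypothetical, abbreviated DDL for the fact table, reusing the `spark` session from the sketch above; the surrogate-key column names follow the standard TPC-DS schema and are an assumption about the exact scripts in this repository:

```python
# Each *_sk column is a surrogate key pointing at one dimension table,
# which is what makes store_sales the centre of the star.
spark.sql("""
    CREATE TABLE IF NOT EXISTS store_sales (
        ss_sold_date_sk INT,            -- joins to date_dim.d_date_sk
        ss_item_sk      INT,            -- joins to item.i_item_sk
        ss_customer_sk  INT,            -- joins to customer.c_customer_sk
        ss_store_sk     INT,            -- joins to store.s_store_sk
        ss_net_paid     DECIMAL(7,2)    -- one of the sales measures
        -- remaining TPC-DS measure columns elided
    )
    STORED AS PARQUET
""")
```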

Hive tables and views DDL statements

The sql_ddl_codes folder holds the solution to this task. I have included five (5) DDL SQL scripts that create all five mandatory tables. The process to create the Hive tables as specified in the requirements is:

  • Create the database as specified by tpcdsDatabaseName

  • Create all five (5) tables

  • Load the data into the tables in Parquet format. Since the data generated by the TPC-DS toolkit is in CSV format, the loading is done in multiple steps (see the sketch after this list):

    • Step 1: I created text-format tables by pointing their LOCATION at the generated data
    • Step 2: I created Parquet tables by using CTAS to convert the text data into Parquet format and load it into an Alluxio cluster using LOCATION 'alluxio://master_hostname:port/{tablename}', assuming that a cluster exists and that the table name is specified after the /
    • Step 3: We drop the text-based tables as we no longer need them
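
A minimal sketch of the three steps for one table, again reusing the `spark` session from above; it assumes the TPC-DS toolkit's pipe-delimited output, a placeholder staging path, a placeholder Alluxio address, and an abbreviated column list:

```python
table = "store_sales"

# Step 1: external text table over the delimited files that the
# TPC-DS toolkit generated (the staging path is a placeholder).
spark.sql(f"""
    CREATE EXTERNAL TABLE {table}_text (
        ss_sold_date_sk INT,
        ss_customer_sk  INT,
        ss_net_paid     DECIMAL(7,2)
        -- remaining columns elided
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/tmp/tpcds-data/{table}'
""")

# Step 2: CTAS converts the text data into a Parquet table whose files
# live in the Alluxio cluster (replace master_hostname:port).
spark.sql(f"""
    CREATE TABLE {table}
    STORED AS PARQUET
    LOCATION 'alluxio://master_hostname:port/{table}'
    AS SELECT * FROM {table}_text
""")

# Step 3: the text-based staging table is no longer needed.
spark.sql(f"DROP TABLE {table}_text")
```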

I have also included a sql_queries folder inside the sql_ddl_codes folder, which holds simple SQL DML scripts that provide the metrics the BI team needs.
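
As an illustration of the kind of metric these scripts compute (a hypothetical query, not necessarily one of the included scripts): total net sales per store for one year, using the `spark` session from above.

```python
# Net amount paid per store in 2002, largest first; column names
# assume the standard TPC-DS schema.
spark.sql("""
    SELECT s.s_store_name, SUM(ss.ss_net_paid) AS total_net_paid
    FROM store_sales ss
    JOIN store s    ON ss.ss_store_sk = s.s_store_sk
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    WHERE d.d_year = 2002
    GROUP BY s.s_store_name
    ORDER BY total_net_paid DESC
""").show()
```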

Answers to questions in Task no. 4 (Data size scaling)

task4.md contains the answers to the questions.
