The `etl` folder holds the solution to this task. The `ingest_data.py` script is responsible for reading the DDL scripts and creating the Hive tables. I used the TPC-DS toolkit to generate the data.
The `datamodels` folder holds the solution to this task. It contains an image (.png) file that shows the final data model. The model employs a star schema design in which the `store_sales` table is the fact table and the dimension tables are `customer`, `store`, `item`, and `date_dim`, as required for this task. I have added two additional tables, `address` and `company`, to normalize the data in the data warehouse. The design could be further normalized by creating additional tables to reduce data redundancy.
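To illustrate the relationships, below is a minimal sketch of the fact table DDL. The column names follow TPC-DS conventions and are assumptions rather than the contents of the actual scripts; the surrogate-key columns are what join `store_sales` to each dimension table.

```sql
-- Minimal sketch of the star-schema fact table (column names are
-- illustrative, based on TPC-DS conventions, not the actual DDL scripts).
CREATE TABLE IF NOT EXISTS store_sales (
    ss_sold_date_sk BIGINT,        -- joins to date_dim
    ss_item_sk      BIGINT,        -- joins to item
    ss_customer_sk  BIGINT,        -- joins to customer
    ss_store_sk     BIGINT,        -- joins to store
    ss_quantity     INT,
    ss_sales_price  DECIMAL(7,2),
    ss_net_paid     DECIMAL(7,2)
)
STORED AS PARQUET;
```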
The `sql_ddl_codes` folder holds the solution to this task. I have included five (5) DDL SQL scripts that create the five mandatory tables. The process to create the Hive tables, as specified in the requirements, is:
- Create the database as specified by `tpcdsDatabaseName`.
- Create all five (5) tables.
- Load the data into the tables in Parquet format. Since the data generated by the TPC-DS toolkit is in CSV format, the loading is done in multiple steps (see the sketch after this list):
  - Step 1: Create text-based (CSV) tables whose `LOCATION` points to the generated data.
  - Step 2: Create Parquet tables by using CTAS to convert the text data into Parquet format, loading them into the Alluxio cluster using `LOCATION 'alluxio://master_hostname:port/{tablename}'`, assuming that a cluster exists and the table name is specified after the final `/`.
  - Step 3: Drop the text-based tables, as they are no longer needed.
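Below is a minimal sketch of the three loading steps for a single table (`store_sales`), assuming a hypothetical HDFS path for the generated CSV data, an abbreviated column list, and the Alluxio location described above; the path, delimiter, and columns would need to match the actual generated files and DDL scripts.

```sql
-- Step 1: text-based staging table pointing at the generated CSV data
-- (the HDFS path, delimiter, and column list are assumptions).
CREATE EXTERNAL TABLE IF NOT EXISTS store_sales_text (
    ss_sold_date_sk BIGINT,
    ss_item_sk      BIGINT,
    ss_customer_sk  BIGINT,
    ss_store_sk     BIGINT,
    ss_quantity     INT,
    ss_net_paid     DECIMAL(7,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/tpcds/store_sales';

-- Step 2: CTAS into Parquet, written to the Alluxio cluster.
CREATE TABLE store_sales
STORED AS PARQUET
LOCATION 'alluxio://master_hostname:port/store_sales'
AS SELECT * FROM store_sales_text;

-- Step 3: drop the text-based staging table, which is no longer needed.
DROP TABLE store_sales_text;
```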
I have also included a `sql_queries` folder inside the `sql_ddl_codes` folder, which holds simple SQL DML scripts that provide the metrics the BI team needs.
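As an example of the kind of metric such a script might compute, here is a minimal sketch of a daily net-sales-per-store query; the metric and column names follow TPC-DS conventions and are assumptions, not the actual queries in the folder.

```sql
-- Hypothetical BI metric: daily net sales per store
-- (column names follow TPC-DS conventions and are assumptions).
SELECT
    s.s_store_name,
    d.d_date,
    SUM(ss.ss_net_paid) AS total_net_paid
FROM store_sales ss
JOIN store    s ON ss.ss_store_sk     = s.s_store_sk
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY s.s_store_name, d.d_date;
```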
`task4.md` contains the answers to the questions.