
Add KKTIX attendee Info ETL to BigQuery pycontw-225217.dwd.kktix_ticket_xxxx_attendees tables #123


Merged
merged 8 commits into master from kktix_etl_3tables on Aug 12, 2023

Conversation

GeekAngus
Collaborator

Types of changes

  • New feature

Description

Background: Starting from 2022, we extract the KKTIX data via the KKTIX API and load it into "pycontw-225217.ods.ods_kktix_attendeeId_datetime". However, most of the data is stored in the ATTENDEE_INFO column in JSON format. To use Metabase with SQL, users have to extract fields with json_extract and know the KKTIX payload format, instead of querying a flat table, and we would also need to rewrite all the SQL queries built for the current tables.
Solution: Transform the tables in the backend so that we can keep the same user experience in Metabase.
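
For illustration, the core of the transform is flattening each ATTENDEE_INFO JSON payload into ordinary columns. Below is a minimal sketch with pandas; the field names and payload shape are illustrative assumptions, not the exact KKTIX schema:

    import json
    import pandas as pd

    # Hypothetical rows as stored in ods_kktix_attendeeId_datetime: the useful
    # fields are packed inside the ATTENDEE_INFO JSON string.
    rows = [
        {"attendee_id": 1, "ATTENDEE_INFO": json.dumps({"ticket_type": "individual", "email": "a@example.com"})},
        {"attendee_id": 2, "ATTENDEE_INFO": json.dumps({"ticket_type": "corporate", "email": "b@example.com"})},
    ]

    df = pd.DataFrame(rows)
    # Parse the JSON column and flatten it into top-level columns.
    info = pd.json_normalize(df["ATTENDEE_INFO"].map(json.loads).tolist())
    flat = pd.concat([df.drop(columns=["ATTENDEE_INFO"]), info], axis=1)
    print(flat)  # columns: attendee_id, ticket_type, email — no json_extract needed

Once rows are flattened like this and loaded into the dwd tables, existing Metabase queries can reference the columns directly.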

Checklist:

  • Add test cases to all the changes you introduce
  • Run `poetry run pytest` locally to ensure all tests pass
  • Update the documentation if necessary

Steps to Test This Pull Request

cd contrib
./kktix_bq_etl.sh 2023

Expected behavior

The data has been loaded into dwd.kktix_ticket_${ticket_type}_attendees_test on BigQuery, where ticket_type is one of corporate, individual, or reserved.
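
To spot-check the result, one option is to count rows per test table. This is a sketch under the assumption that GOOGLE_APPLICATION_CREDENTIALS is configured and the tables above exist; it is not part of this PR:

    from google.cloud import bigquery

    client = bigquery.Client(project="pycontw-225217")
    for ticket_type in ("corporate", "individual", "reserved"):
        table = f"pycontw-225217.dwd.kktix_ticket_{ticket_type}_attendees_test"
        # Each table should contain the flattened attendee rows for its ticket type.
        row = next(iter(client.query(f"SELECT COUNT(*) AS n FROM `{table}`").result()))
        print(table, row.n)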

Comment on lines +20 to +26

## KKTIX BigQuery Transform
1. Background: Starting from 2022, we extract the KKTIX data via the KKTIX API and load it into "pycontw-225217.ods.ods_kktix_attendeeId_datetime". However, most of the data is stored in the ATTENDEE_INFO column in JSON format. To use Metabase with SQL, users have to extract fields with json_extract and know the KKTIX payload format, instead of querying a flat table, and we would also need to rewrite all the SQL queries built for the current tables.
2. Solution: Transform the tables in the backend so that we can keep the same user experience in Metabase.
3. Run:
- for 3 tables in a single bash script: `./kktix_bq_etl.sh 2023`
Collaborator

thx for documenting these up 🙏

@@ -28,6 +30,7 @@ def load(event_raw_data_array: List):
sanitized_event_raw_data = _sanitize_payload(event_raw_data)
payload.append(sanitized_event_raw_data)
_load_to_bigquery(payload)
_load_to_bigquery_dwd(payload)
Collaborator

👍

Comment on lines +643 to +646
# print(sanitized_df.columns)
# print(sanitized_df.head())
# df_null = sanitized_df.isnull()
# print(sanitized_df.iloc[:, :5])
Collaborator

should we remove these comments?

Comment on lines +490 to +493
def load_to_df_from_list(
results, source="dag", update_after_ts=0
) -> Tuple[DataFrame, DataFrame]:
# Use DataFrame for table transform operations
Collaborator

seems to me that you should rename this function to transform_xxx()?

JOB_CONFIG = bigquery.LoadJobConfig(schema=SCHEMA)


def _load_row_df_from_dict(json_dict, update_after_ts) -> DataFrame:
Collaborator

ditto

Comment on lines +55 to +56
CANONICAL_COLUMN_NAMES_2020_EXTRA_CORPORATE = {
"invoice_policy",
Collaborator

unused constants?

Comment on lines +1 to +15
#!/bin/bash
#
# export GOOGLE_APPLICATION_CREDENTIALS="<where to access service-account.json>"
#
project_id="pycontw-225217"
cmd=${PWD}/../dags/ods/kktix_ticket_orders/udfs/kktix_bq_dwd_etl.py


for ticket_type in corporate individual reserved
do
suffix=${ticket_type}_attendees$2
cmd_args="-p ${project_id} -d dwd -t kktix_ticket_${suffix} -k ${ticket_type} -y $1 --upload"
echo ${cmd_args}
${cmd} ${cmd_args}
done
Collaborator

👍

@@ -43,6 +46,30 @@ def _load_to_bigquery(payload: List[Dict]) -> None:
job.result()


def _load_to_bigquery_dwd(payload: List[Dict]) -> None:
Collaborator

sorry for the nit picking, but seems to me that this function is more like transform than load

@david30907d (Collaborator) left a comment

Generally, it looks good to me. I just have some suggestions regarding the design aspect.

So feel free to ship it, and we can discuss the design pattern afterwards.

About the Design Pattern

Here are my 2 cents:

  1. Follow the ETL or ELT paradigm (for example: https://github.com/pycontw/pycon-etl/blob/master/dags/ods/kktix_ticket_orders/udfs/kktix_api.py#L30-L34)
  2. Follow the single responsibility principle: since _load_to_bigquery_dwd() is actually doing transformation, load() below is no longer following the single responsibility principle. I would suggest using an independent operator for this dwd operation.

        def load(event_raw_data_array: List):
            """
            load data into bigquery!
            """
            # data quality check
            if len(event_raw_data_array) == 0:
                print("Nothing to load, skip!")
                return
            payload = []
            for event_raw_data in event_raw_data_array:
                sanitized_event_raw_data = _sanitize_payload(event_raw_data)
                payload.append(sanitized_event_raw_data)
            _load_to_bigquery(payload)
            _load_to_bigquery_dwd(payload)

  3. Make the pipeline idempotent if possible. Since you're using WriteDisposition.WRITE_APPEND right now, we might end up with duplicate data if someone re-runs a partially failed job (see the sketch below).
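
To make the idempotency point concrete, here is a minimal sketch (the table name and schema are illustrative assumptions, not this PR's code): using WRITE_TRUNCATE instead of WRITE_APPEND makes a re-run replace previously loaded rows rather than duplicate them.

    from google.cloud import bigquery

    client = bigquery.Client(project="pycontw-225217")
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField("attendee_id", "INTEGER")],  # illustrative schema
        # WRITE_TRUNCATE replaces the table contents on every run, so a re-run
        # of a partially failed job cannot leave duplicate rows behind.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    payload = [{"attendee_id": 1}]  # rows produced by the transform step
    job = client.load_table_from_json(
        payload,
        "pycontw-225217.dwd.kktix_ticket_individual_attendees_test",
        job_config=job_config,
    )
    job.result()  # wait for the load job to finish

For incremental loads, a partition-scoped delete-then-insert achieves the same safety without rewriting the whole table.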

@GeekAngus GeekAngus merged commit dc751a5 into master Aug 12, 2023
@GeekAngus GeekAngus deleted the kktix_etl_3tables branch August 12, 2023 16:00