Add scripts extract and transform the #120

GeekAngus · 2023-06-18T12:55:40Z

Types of changes

New feature

Description

Background: Since 2022, we have extracted the KKTIX data via the KKTIX API and loaded it into "pycontw-225217.ods.ods_kktix_attendeeId_datetime". However, most of the data is stored in the ATTENDEE_INFO column in JSON format. To query it from Metabase with SQL, users have to extract fields with json_extract and know the KKTIX format, instead of working with a flat table. We would also need to rewrite all the SQL built for the current tables.
Solution: Transform the tables in the backend so that users keep the same experience in Metabase.
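
To make the transform concrete, here is a minimal sketch of flattening one row whose attendee details sit in a JSON string. The row layout, the key names inside the JSON, and the `flatten_attendee` helper are all assumptions for illustration; the real ATTENDEE_INFO format is not shown in this PR.

```python
import json

# Hypothetical raw row: the real ATTENDEE_INFO layout is not shown in this PR.
RAW_ROW = {
    "id": 1,
    "attendee_info": json.dumps(
        {
            "reg_no": 101,
            "ticket_name": "Individual",
            "is_paid": True,
            "data": [["Nickname", "Ann"], ["Gender", "Female"]],
        }
    ),
}


def flatten_attendee(row):
    """Return a flat dict built from a row that holds a JSON string."""
    info = json.loads(row["attendee_info"])
    flat = {"id": row["id"]}
    for key, value in info.items():
        if key == "data":
            # Assumed: free-form answers arrive as [question, answer] pairs.
            for question, answer in value:
                flat[question.lower()] = answer
        else:
            flat[key] = value
    return flat


FLAT_ROW = flatten_attendee(RAW_ROW)
```

Once rows look like `FLAT_ROW`, existing SQL written against flat tables can keep selecting plain columns without json_extract.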

Checklist:

Add test cases to all the changes you introduce
Run poetry run pytest locally to ensure all linter checks pass
Update the documentation if necessary

Steps to Test This Pull Request

./kktix_bq_etl.py 2023

Expected behavior

The data is loaded into ods_kktix_ticket_${ticket_type}_attendees_test on BigQuery,
where ticket_type is one of: corporate, individual, reserved.

"ods_kktix_attendeeId_datetime" table and load to the legacy tables: ods_kktix_ticket_(corporate, individual, reserved)_attendees

david30907d · 2023-06-20T14:29:50Z

contrib/README.md

+## KKTIX BigQuery Transform
+1. Background: Start from 2022, we extract the KKTIX data via KKTIX API and load to "pycontw-225217.ods.ods_kktix_attendeeId_datetime". However most of the data are store in the ATTENDEE_INFO column with json format. To use metabase with SQL, users need to extract the data by json_extract with the knowledge kktix format instead of flat database. And we also need to rewrite all the SQLs build for current databases.
+2. Solution: Transform the tables in backend that we could keep the same user experience by using Metabase.
+3. Run: 
+ - `./kktix_bq_etl.py -t ods_kktix_ticket_reserved_attendees_test -k reserved -y 2023 --upload`
+ - for 3 tables: `./kktix_bq_etl.py 2023`


thx for documenting these 🙏

david30907d · 2023-06-20T14:38:44Z

contrib/kktix_bq_etl.py

+CANONICAL_COLUMN_NAMES_2020_EXTRA_CORPORATE = {
+    "invoice_policy",
+    "invoiced_company_name",
+    "unified_business_no",
+    "pynight_attendee_numbers",
+    "know_financial_aid",
+    "have_you_ever_attended_pycon_tw",
+    "pynight_attending_or_not",
+    "how_did_you_know_pycon_tw",
+}
+
+CANONICAL_COLUMN_NAMES_2020_EXTRA_INDIVIDUAL = {
+    "pynight_attendee_numbers",
+    "know_financial_aid",
+    "have_you_ever_attended_pycon_tw",
+    "pynight_attending_or_not",
+    "how_did_you_know_pycon_tw",
+}
+
+CANONICAL_COLUMN_NAMES_2020_EXTRA_RESERVED: Set = set()
+
+
+CANONICAL_COLUMN_NAMES_2019_CORE = {


seems to me that these canonical_xxx sets are unused?

Yes, the main code is from upload-kktix-ticket-csv-to-bigquery.py and I just kept them; the sets were used for unit tests.

david30907d · 2023-06-20T14:39:03Z

contrib/kktix_bq_etl.py

+
+
+HEURISTIC_COMPATIBLE_MAPPING_TABLE = {
+    # from 2020 reformatted column names


thx for writing these comments 🙏

david30907d · 2023-06-20T14:39:07Z

contrib/kktix_bq_etl.py

+    "ive_already_read_and_i_accept_the_epidemic_prevention_of_pycon_tw_2020_pycon_tw_2020_covid19": "ive_already_read_and_i_accept_the_epidemic_prevention_of_pycon_tw",
+    "do_you_know_we_have_financial_aid_this_year": "know_financial_aid",
+    "contact_email": "email",
+    # from 2020 reformatted column names which made it duplicate
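
As a rough illustration of how a mapping table like this can be applied, the sketch below renames columns and reports names that collide after mapping, which is the duplicate case the last comment in the hunk mentions. Only the two mapping entries are taken from the diff; `to_canonical` and `find_collisions` are hypothetical helpers, not functions from kktix_bq_etl.py.

```python
# Two entries copied from the diff above; the real table is much longer.
MAPPING = {
    "do_you_know_we_have_financial_aid_this_year": "know_financial_aid",
    "contact_email": "email",
}


def to_canonical(columns, mapping):
    """Map each column name to its canonical form; unknown names pass through."""
    return [mapping.get(name, name) for name in columns]


def find_collisions(columns):
    """Return the names that occur more than once, sorted."""
    seen, dupes = set(), set()
    for name in columns:
        if name in seen:
            dupes.add(name)
        else:
            seen.add(name)
    return sorted(dupes)


renamed = to_canonical(["contact_email", "email", "gender"], MAPPING)
collisions = find_collisions(renamed)
```

Detecting collisions before the rename is applied to a DataFrame lets the script decide explicitly how to merge or drop the duplicated columns.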


david30907d · 2023-06-20T15:01:55Z

contrib/kktix_bq_etl.py

+    # print(df.columns)
+    sanitized_df = sanitize_column_names(df)
+    hash_privacy_info(sanitized_df)    


thx for remembering to hash these sensitive columns. However, we've already encrypted the data on the extract side, so no need to do it again~

pycon-etl/dags/ods/kktix_ticket_orders/udfs/kktix_api.py

Line 33 in 810096e

transformed_event_raw_data_array = kktix_transformer.transform(

Yes, it could be removed once confirmed.
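
A small sketch of why a second pass is worth avoiding: applying a one-way hash to a value that was already transformed produces a different value, so it would no longer match data processed only once. This assumes SHA-256 and the column names `email` and `phone` purely for illustration; the PR does not show what `hash_privacy_info` or the extract-side step actually use.

```python
import hashlib

# Hypothetical set of sensitive columns, for illustration only.
SENSITIVE_KEYS = {"email", "phone"}


def hash_value(value):
    """Return the SHA-256 hex digest of the value's string form."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def hash_sensitive(record, keys=SENSITIVE_KEYS):
    """Return a copy of the record with sensitive, non-null values hashed."""
    return {
        key: hash_value(value) if key in keys and value is not None else value
        for key, value in record.items()
    }


once = hash_sensitive({"email": "ann@example.com", "name": "Ann"})
twice = hash_sensitive(once)
```

Here `once["email"]` and `twice["email"]` differ, while the non-sensitive `name` field is unchanged in both.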

david30907d · 2023-06-20T15:07:27Z

contrib/kktix_bq_etl.py

+        df_dict = df_dict.drop(columns = useless_columns) 
+        df_dict = df_dict.rename(columns = {"reg_no": "registration_no", "ticket_name": "ticket_type", "is_paid": "payment_status"})


what do you think about keeping all of the original data in the warehouse, and only deserializing/extracting the "useful" columns from JSON into another DB layer?

Yes, we should keep all of the original data; it may be useful someday.
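
One way to picture the layering suggested here: leave the raw JSON string untouched and build the derived record as a separate projection. The rename map is copied from the diff above; the `project` helper and the sample payload are assumptions made for this sketch.

```python
import json

# Renames copied from the diff above.
RENAMES = {
    "reg_no": "registration_no",
    "ticket_name": "ticket_type",
    "is_paid": "payment_status",
}


def project(raw_json, wanted):
    """Build a derived record from selected keys; the raw string is not modified."""
    info = json.loads(raw_json)
    return {RENAMES.get(key, key): info[key] for key in wanted if key in info}


# Hypothetical raw payload; "extra" stands for any field not needed downstream.
raw = json.dumps(
    {"reg_no": 7, "ticket_name": "Reserved", "is_paid": True, "extra": "raw only"}
)
derived = project(raw, ["reg_no", "ticket_name", "is_paid"])
```

Because `derived` is computed from `raw` rather than replacing it, columns that look useless today can still be extracted later without re-fetching from the API.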

GeekAngus · 2023-08-12T10:46:28Z

Use #123 instead

Add scripts extract and transform the

3a5c90b

"ods_kktix_attendeeId_datetime" table and load to the legecy tables: : ods_kktix_ticket_(corporate, individual, reserved)_attendees

david30907d reviewed Jun 20, 2023

GeekAngus added 7 commits June 24, 2023 17:57

style: lint the code and fix potential SQL inject by parameterized query

79506a8

fix (dataframe): Aggregate the columns with the same purposes or names

a3b954c

[refactor] (dataframe): Group the columns with the same purposes or names in better way

c3064ba

[refactor] (dataframe): Clean the columns for 2022

3a2eccb

Merge branch 'pycontw:master' into master

209b5b9

[feature] (dag): integrate the script to kktix_loader.py in dag

d18249f

merged

cd68ec1

GeekAngus closed this Aug 12, 2023
