Port bhr collection from databricks #356
Conversation
acmiyaguchi left a comment
It would be useful to include a README at the bhr_collection level explaining how to invoke the job manually.
r+, but I strongly suggest making `start_date` a parameter for the Airflow job.
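The parameterization suggested above could be done by exposing `start_date` on the command line, so the Airflow DAG can pass a date into the job. A minimal sketch, assuming a hypothetical `parse_args` entry point (not part of this PR); the three-day default mirrors the hard-coded value in the job config:

```python
import argparse
from datetime import datetime, timedelta


def parse_args(argv=None):
    """Parse CLI arguments for the bhr-collection job (hypothetical helper).

    --start-date defaults to three days ago, matching the current
    hard-coded behavior when no value is supplied.
    """
    parser = argparse.ArgumentParser(description="bhr-collection")
    parser.add_argument(
        "--start-date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d"),
        default=datetime.today() - timedelta(days=3),
        help="first date to process, as YYYY-MM-DD",
    )
    return parser.parse_args(argv)
```

Airflow could then pass the execution date via a templated operator argument such as `--start-date {{ ds }}`.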
```python
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
```
Suggested change:

```diff
-sc = SparkContext.getOrCreate()
-spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
+spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
+sc = spark.sparkContext
```
```python
pings_df = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shared-prod.telemetry_stable.bhr_v4")
```
Is this using the new structured format introduced in mozilla-services/mozilla-pipeline-schemas#636 on 2020-12-07? From what I'm reading it looks like it does, given how the stacks are being parsed, but I just want to make sure.
Yes, this is what broke the job last week; it needed to be fixed before moving the job over.
```python
        print_progress(job_start, iterations, x, iteration_start, date_str)


def etl_job_incremental_finalize(_, __, config=None):
```
This is a very strange function signature. Also, this function seems to be unused?
Yes, that looks like an artifact of working in a notebook. This and a few other `etl_job` functions are unused, so I'll take them out. My code inspector missed these for some reason.
```python
    sc,
    spark,
    {
        "start_date": datetime.today() - timedelta(days=3),
```
This command should take `start_date` as an argument so the Airflow DAG can control which date ranges are run.
Source notebook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/211711/command/216952
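One way the call site could honor such a parameter, sketched with a hypothetical `resolve_start_date` helper (the config-dict shape follows the snippet above; everything else here is an assumption, not code from this PR):

```python
from datetime import datetime, timedelta


def resolve_start_date(config):
    """Return the start date from the job config (hypothetical helper).

    Accepts either a datetime or a YYYY-MM-DD string (as Airflow would
    pass it); falls back to three days ago when no value is supplied.
    """
    value = config.get("start_date")
    if value is None:
        return datetime.today() - timedelta(days=3)
    if isinstance(value, str):
        return datetime.strptime(value, "%Y-%m-%d")
    return value
```

With this in place, the Airflow task can supply a string date while ad-hoc manual runs keep the current rolling three-day default.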