Port bhr collection from databricks #356
Conversation
acmiyaguchi left a comment
It would be useful to include a README at the bhr_collection level explaining how to invoke the job manually.
r+, but I strongly suggest making `start_date` a parameter for the Airflow job.
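The parameterization suggested above could be done by exposing `start_date` on the command line, so the Airflow DAG can pass a date into the job. A minimal sketch, assuming a hypothetical `parse_args` entry point (not part of this PR); the three-day default mirrors the hard-coded value in the job config:

```python
import argparse
from datetime import datetime, timedelta


def parse_args(argv=None):
    """Parse CLI arguments for the bhr-collection job (hypothetical helper).

    --start-date defaults to three days ago, matching the current
    hard-coded behavior when no value is supplied.
    """
    parser = argparse.ArgumentParser(description="bhr-collection")
    parser.add_argument(
        "--start-date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d"),
        default=datetime.today() - timedelta(days=3),
        help="first date to process, as YYYY-MM-DD",
    )
    return parser.parse_args(argv)
```

Airflow could then pass the execution date via a templated operator argument such as `--start-date {{ ds }}`.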
```python
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
```
Suggested change:

```diff
-sc = SparkContext.getOrCreate()
-spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
+spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
+sc = spark.sparkContext
```
```python
pings_df = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shared-prod.telemetry_stable.bhr_v4")
```
Is this using the new structured format introduced in mozilla-services/mozilla-pipeline-schemas#636 on 2020-12-07? From what I'm reading it looks like it does, given how the stacks are being parsed, but I just want to make sure.
Yes, this is what broke the job last week; it needed to be fixed before moving the job over.
```python
        print_progress(job_start, iterations, x, iteration_start, date_str)


def etl_job_incremental_finalize(_, __, config=None):
```
This is a very strange function signature. Also, this function seems to be unused?
Yes, that looks like an artifact of working in a notebook. This and a few other `etl_job` functions are unused, so I'll take them out. My code inspector missed these for some reason.
```python
    sc,
    spark,
    {
        "start_date": datetime.today() - timedelta(days=3),
```
This command should take `start_date` as an argument so the Airflow DAG can control which date ranges are run.
Source notebook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/211711/command/216952
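One way the call site could honor such a parameter, sketched with a hypothetical `resolve_start_date` helper (the config-dict shape follows the snippet above; everything else here is an assumption, not code from this PR):

```python
from datetime import datetime, timedelta


def resolve_start_date(config):
    """Return the start date from the job config (hypothetical helper).

    Accepts either a datetime or a YYYY-MM-DD string (as Airflow would
    pass it); falls back to three days ago when no value is supplied.
    """
    value = config.get("start_date")
    if value is None:
        return datetime.today() - timedelta(days=3)
    if isinstance(value, str):
        return datetime.strptime(value, "%Y-%m-%d")
    return value
```

With this in place, the Airflow task can supply a string date while ad-hoc manual runs keep the current rolling three-day default.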