
Conversation

@BenWu commented Dec 17, 2020

@acmiyaguchi (Contributor) left a comment


It would be useful to include a README at the bhr_collection level explaining how to invoke the job manually.

r+, but I strongly suggest making start_date a parameter of the Airflow job.

Comment on lines 31 to 32
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
Contributor
Suggested change:

```diff
-sc = SparkContext.getOrCreate()
-spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
+spark = SparkSession.builder.appName("bhr-collection").getOrCreate()
+sc = spark.sparkContext
```


pings_df = (
spark.read.format("bigquery")
.option("table", "moz-fx-data-shared-prod.telemetry_stable.bhr_v4")
Contributor

Is this using the new structured format introduced in mozilla-services/mozilla-pipeline-schemas#636 on 2020-12-07? From what I'm reading it looks like it does, given how the stacks are parsed, but I just want to make sure.

Contributor Author

Yes, this is what broke the job last week and needed to be fixed before it could be moved over.
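For reference, the BigQuery read in the diff above can be sketched end-to-end. This is a hypothetical, minimal version (the table name comes from the diff; the `filter` option is the spark-bigquery connector's mechanism for pushing a partition-pruning predicate down to BigQuery, and the helper name is illustrative):

```python
from datetime import datetime, timedelta


def submission_filter(start_date, days=1):
    # Build a predicate on submission_timestamp so BigQuery only scans
    # the partitions for the requested date range (names illustrative).
    end_date = start_date + timedelta(days=days)
    return (
        f"submission_timestamp >= '{start_date:%Y-%m-%d}' "
        f"AND submission_timestamp < '{end_date:%Y-%m-%d}'"
    )


def read_bhr_pings(spark, start_date):
    # Requires the spark-bigquery connector on the Spark classpath.
    return (
        spark.read.format("bigquery")
        .option("table", "moz-fx-data-shared-prod.telemetry_stable.bhr_v4")
        .option("filter", submission_filter(start_date))
        .load()
    )
```

Passing the predicate via `filter` keeps the job from scanning the full `bhr_v4` table on every run.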

print_progress(job_start, iterations, x, iteration_start, date_str)


def etl_job_incremental_finalize(_, __, config=None):
Contributor

This is a very strange function signature. Also, this function seems to be unused?

Contributor Author

Yes, that looks like an artifact of working in a notebook. This and a few other etl_job functions are unused, so I'll take them out. My code inspector missed these for some reason.
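For context on the signature discussed above: leading underscores (`_`, `__`) are a common Python convention for positional parameters that a caller supplies but the function never uses, e.g. when a notebook or job runner invokes every step with one fixed signature. A tiny illustration (names hypothetical):

```python
def finalize(_, __, config=None):
    # The first two positional arguments (e.g. sc and spark in the
    # notebook's fixed step signature) are accepted but ignored;
    # only the keyword argument is actually consulted.
    return (config or {}).get("status", "done")
```

So the signature isn't wrong per se, but once the function is unused there is no reason to keep it.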

sc,
spark,
{
"start_date": datetime.today() - timedelta(days=3),
Contributor

This command should take the start_date parameter as an argument so that the Airflow DAG can control which date ranges are run.
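One way to do this, sketched with argparse (the flag name and default are assumptions, not the PR's actual interface): the hard-coded `datetime.today() - timedelta(days=3)` becomes a CLI default, and Airflow passes an explicit `--start-date`, e.g. derived from the execution date, for backfills and reruns.

```python
import argparse
from datetime import datetime, timedelta


def parse_args(argv=None):
    # Hypothetical CLI for the bhr-collection job. Airflow would pass
    # --start-date explicitly; running by hand falls back to the
    # current behaviour (today minus three days).
    parser = argparse.ArgumentParser(description="bhr-collection")
    parser.add_argument(
        "--start-date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d"),
        default=datetime.today() - timedelta(days=3),
        help="first date of data to process (YYYY-MM-DD)",
    )
    return parser.parse_args(argv)
```

With this in place, the DAG owns the date range and manual invocations stay simple.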

@BenWu merged commit c183d90 into main Dec 17, 2020
@BenWu deleted the bhr-collection branch December 17, 2020 23:01