Task crash_report_parquet job is failing in socorro_import
#1198
Comments
Two things:
|
@jklukas walked me through looking at Airflow DAG, logs, and related things. The output above is from the 3rd try. The log for the first try looks like this:
The second and third retries fail with the conflict error:
The first attempt fails at some point because of a timeout, but it's not clear what's timing out. Jeff suggested I reach out and ask for some SRE help, so I'm going to do that next. |
The Dataproc initialization steps have a default timeout of ten minutes (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions). This should be configurable through the Airflow operator, but we may need to add some parameters to the dataproc utils (https://github.com/mozilla/telemetry-airflow/blob/master/dags/utils/dataproc.py). I took a look at the logs in the GCS bucket and they just show a long pip install, so my guess is that there's no error here; the install just happened to pass the 10-minute threshold. As for the 409: from what I understand, the cluster is still created if the initialization times out, so it stays around until the end of its TTL, or maybe until the subdag fails after the retries; I'm not sure how that's set up. That's my understanding, and it looks to me like all that needs to be done is to increase the initialization timeout. |
I did take a look at the logs out of curiosity. It looks to me like pip goes into infinite dependency resolution, similar to the bug in pypa/pip#9011. For example, the line It seems to be using pip 20.3.1:
Using 20.2.4, which gets uninstalled for some reason, might solve the issue. |
That's interesting. So changing timeout definitely will not solve it. I saw that some other dataproc jobs are affected as well but not all. The ones that aren't affected set |
Thank you! This makes so much more sense now. pip 20.3.0 and later use the new package dependency resolver. Anna pointed to issue 9011 in the pip issue tracker, but that was fixed before 20.3.0 came out. That got me looking at resolver issues. I think we're hitting this issue which has similar symptoms but is different:
telemetry-airflow/dataproc_bootstrap/dataproc_init.sh Lines 28 to 29 in 60f7dd1
With I think the short term fix would be to fix Then we can spin off a new issue to fix our requirements specification which is causing the excessive backtracking during dependency resolution. |
Currently, `dataproc_init.sh` updates pip to 20.3.1, which tries to install the dependencies but takes more than 10 minutes to find a set of versions that satisfies the dependency constraints. This change goes back to the previous pip, whose old resolver works, while we figure out how to straighten out the dependency requirements.
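To make the version cutoff concrete, here is a minimal sketch that checks whether a given pip version defaults to the new resolver (20.3 or later, per the discussion above). The function name `uses_new_resolver` is hypothetical and not part of telemetry-airflow, and the sketch assumes plain dotted numeric version strings.

```python
def uses_new_resolver(pip_version: str) -> bool:
    """Return True if this pip version defaults to the new dependency resolver.

    Hypothetical helper for illustration only. Assumes a plain dotted
    numeric version such as "20.3.1"; pre-release suffixes like "20.3b1"
    are not handled.
    """
    # Only the major and minor components matter for the 20.3 cutoff.
    parts = tuple(int(p) for p in pip_version.split(".")[:2])
    # Pad with a zero minor so a bare major version such as "21" still compares.
    major, minor = (parts + (0,))[:2]
    return (major, minor) >= (20, 3)


if __name__ == "__main__":
    for version in ("20.2.4", "20.3.1"):
        print(version, uses_new_resolver(version))
```

With the two versions mentioned in this thread, 20.2.4 falls on the old-resolver side and 20.3.1 on the new-resolver side, which matches the behaviour described above.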
The failing job has been resolved (for quite some time now), but it looks like #1200 is still relevant. |
The `create_dataproc_cluster` task is failing at some point after `dataproc_init.sh` is called. You can see the requirements getting installed in the logs and then nothing after that: https://workflow.telemetry.mozilla.org/log?task_id=crash_report_parquet&dag_id=socorro_import&execution_date=2020-12-06T00%3A00%3A00%2B00%3A00
It's been failing for 4 days now.