[CT-2960] [Bug] Compile performance hit (DBT 1.4.6 -> 1.6.0) #8360

Closed
2 tasks done
nicklubbers opened this issue Aug 11, 2023 · 3 comments · Fixed by #8413
Labels: bug (Something isn't working), performance, regression

Comments

@nicklubbers

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

After upgrading dbt, our full compile time increased from 2m52s to 18min!

python: 3.11.4
dbt_core: 1.4.6 -> 1.6.0
dbt-bigquery: 1.4.3 -> 1.6.1

Expected Behavior

That the compile time would decrease or stay the same.

Steps To Reproduce

  1. Set up an environment with the "old" versions described above.
  2. Run dbt compile

output:

14:36:08  Running with dbt=1.4.6
14:36:08  Unable to do partial parsing because saved manifest not found. Starting full parse.
14:37:07  Found 603 models, 8357 tests, 271 snapshots, 54 analyses, 855 macros, 1 operation, 8 seed files, 700 sources, 276 exposures, 0 metrics
14:37:08  
14:37:12  Concurrency: 99 threads (target='dev')
14:37:12  
14:38:58  Done.
  3. Set up an environment with the "new" versions described above.
  4. Run dbt compile

output:

08:57:19  Running with dbt=1.6.0
08:57:19  Registered adapter: bigquery=1.6.1
08:57:19  Unable to do partial parsing because saved manifest not found. Starting full parse.
08:58:06  Found 603 models, 268 snapshots, 54 analyses, 8 seeds, 1 operation, 8357 tests, 697 sources, 276 exposures, 0 metrics, 924 macros, 0 groups, 0 semantic models
08:58:06  
08:58:10  Concurrency: 99 threads (target='dev')
08:58:10

Relevant log output

No response

Environment

- os: Linux-5.15.0-1042-azure-x86_64-with-glibc2.35
- python: 3.11.4
- dbt_core: 1.4.6 -> 1.6.0
- dbt-bigquery: 1.4.3 -> 1.6.1

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

@nicklubbers nicklubbers added bug Something isn't working triage labels Aug 11, 2023
@github-actions github-actions bot changed the title [Bug] Compile performance hit (DBT 1.4.6 -> 1.6.0) [CT-2960] [Bug] Compile performance hit (DBT 1.4.6 -> 1.6.0) Aug 11, 2023
@jtcohen6
Contributor

jtcohen6 commented Aug 11, 2023

@nicklubbers Thanks for opening the issue!

To confirm: are you seeing the slowness between the Concurrency log line and when the compile invocation completes?

Just a hunch: I'm wondering if this could be due to the change we made to write run_results.json more frequently, after each node completes compiling/running.

I heard a similar report the other day about a slowdown in docs generate using dbt-databricks, and I'm wondering if ^this could be the root cause of both.

I'm going to label this a regression for now while we investigate.

@jtcohen6 jtcohen6 added this to the v1.6.x milestone Aug 11, 2023
@nicklubbers
Author

nicklubbers commented Aug 11, 2023

@jtcohen6 - Yes, the slowness is between the Concurrency log line and when it completes. It also no longer shows Done in 1.6.

Additional info:

dbt-core = "1.5.4"
dbt-bigquery = "1.5.5"

is not as slow as 1.6.0 and is comparable to 1.4.6, so it looks like the regression was introduced in 1.6.0.

@peterallenwebb
Contributor

peterallenwebb commented Aug 11, 2023

This looks like a potentially serious issue for large projects, introduced by changes to the way run_results.json is written in 1.6.

We use Python's multiprocessing.pool.Pool to run tasks. Each task result is processed individually by a callback function, which the pool fires sequentially on a single thread it maintains for that purpose. By including this file write in the callback, we are violating the Python documentation's recommendation that callbacks complete immediately, since otherwise the thread which handles the results will get blocked.
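To illustrate the shape of the problem, here is a minimal standalone sketch (not dbt's actual code) of a worker pool whose callback does slow work: callbacks fire one at a time on the pool's single result-handler thread, so their cost adds up serially no matter how many worker processes are running.

```python
# Minimal sketch (assumed, illustrative code; not dbt's implementation).
# multiprocessing.pool.Pool runs all callbacks sequentially on one internal
# result-handler thread, so a slow callback serializes result processing
# even though the worker processes themselves run in parallel.
import time
from multiprocessing import Pool

def compile_node(node_id):
    # Stand-in for compiling a single node in a worker process.
    return f"compiled {node_id}"

results = []

def on_node_done(result):
    # Runs on the pool's single result-handler thread.
    results.append(result)
    time.sleep(0.1)  # Stand-in for rewriting the whole run_results.json file.

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        for node_id in range(100):
            pool.apply_async(compile_node, (node_id,), callback=on_node_done)
        pool.close()
        pool.join()
    # Even with 8 workers, total time is dominated by 100 * 0.1s spent in the
    # callback, because those calls cannot overlap.
    print(len(results), "nodes processed")
```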

Writing the run results is not fast for large projects. The size of the run results file increases as the graph is executed and more results are added, and the entire file is rewritten on every node completion. As a result, the total work done writing run_results.json will be O(n²), where n is the number of nodes. Profiling showed that this doubled the total time needed to run dbt compile on the Q-Team-Pristine-2k-Models project, and the slowdown factor will likely be larger for larger projects.
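A back-of-the-envelope sketch of that quadratic cost (illustrative arithmetic, not profiled figures): if the full file is rewritten after every node and the i-th write serializes roughly i results, the total serialization work is the sum 1 + 2 + ... + n.

```python
# Illustrative arithmetic only: total result serializations when the whole
# run_results.json file is rewritten after each of n node completions.
def total_results_serialized(n_nodes: int) -> int:
    # 1 + 2 + ... + n = n * (n + 1) / 2, i.e. O(n^2) growth.
    return n_nodes * (n_nodes + 1) // 2

for n in (600, 2000, 10000):
    print(n, "nodes ->", total_results_serialized(n), "result serializations")
```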

Potential fixes:

  1. Do jerico's "naive" suggestion: Skip this node-level writing of run_results.json for compile + docs generate. This is likely to leave considerable slowdown in the other commands, but it might be masked by query wait times.
  2. Do jerico's "clever" suggestion: Solve the problem using more threads. This is certainly possible, but more threads bring more problems, and for it to be efficient we would need a design that avoids repeatedly rewriting run_results.json while new write requests stack up behind it (a rough sketch of one such coalescing design follows this list).
  3. Do Peter's suggestion: Find a way to always write run_results.json at the end of processing, even under typical error conditions. I assume we considered this originally and discarded the idea, but if so, perhaps we should reconsider.
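As a rough sketch of what option 2 might look like, under assumptions: a single background thread owns the file write, pool callbacks only mark the results as dirty, and back-to-back requests coalesce into one write. The class name and the write_run_results parameter are hypothetical names for illustration, not dbt APIs.

```python
# Hypothetical coalescing-writer sketch (names and design are assumptions,
# not dbt's code). Callbacks call request_write(), which is cheap and never
# blocks on I/O; a single background thread performs the actual write, and
# requests arriving while a write is in progress collapse into one more write.
import threading

class RunResultsWriter:
    def __init__(self, write_run_results):
        self._write = write_run_results      # callable that serializes current results to disk
        self._dirty = threading.Event()      # set when a new write has been requested
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def request_write(self):
        # Called from the pool callback: just flips a flag.
        self._dirty.set()

    def _run(self):
        while not self._stop.is_set():
            self._dirty.wait(timeout=0.5)
            if self._dirty.is_set():
                self._dirty.clear()
                self._write()                # requests arriving mid-write trigger one more write

    def close(self):
        # Stop the thread, then flush once more so the file reflects the last completed node.
        self._stop.set()
        self._thread.join()
        if self._dirty.is_set():
            self._write()
```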
