Skip to content

bug: remove busy-wait while sort is ongoing #16322

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pepijnve
Copy link
Contributor

@pepijnve pepijnve commented Jun 7, 2025

Which issue does this PR close?

Rationale for this change

SortPreservingMergeStream works in two phases. It first waits for each input stream to be ready to emit. Once everybody's ready it proceeds to an emit phase.

During the waiting phase, it will poll each stream in a round-robin fashion. If any stream returns Pending the code self-wakes the current task and immediately returns Pending. This results in busy-waiting when waiting for, for instance, a SortExec that's sorting its data or any other pipeline breaker.

While this works, it wastes CPU cycles.

What changes are included in this PR?

Rather than returning immediately when one stream is pending, poll each stream once. Only return pending when there are still streams left that have not started emitting. This assumes that the pending streams are well behaved and will wake the task when they need to be polled again as required by the Stream contract. Note that this may surface bugs in other streams.

Rotation of uninitiated_partitions has been removed since that's no longer useful. There was a comment in the code about 'upstream buffer size increase', but I'm not sure what that was referring to.

Are these changes tested?

Only by existing test and manual testing

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Jun 7, 2025
@pepijnve pepijnve force-pushed the issue_16321 branch 3 times, most recently from 6334f47 to f7405ee Compare June 7, 2025 21:08
@pepijnve
Copy link
Contributor Author

pepijnve commented Jun 7, 2025

A sort preserving merge specific test case started failing. I’ll dig deeper to better understand what’s going on.

@pepijnve
Copy link
Contributor Author

pepijnve commented Jun 8, 2025

@berkaysynnada I hope it's ok that I ping you directly; reaching out because I believe you are the author of the test case in question. I believe this PR surfaced a mistake in the CongestedStream implementation of the test case. That Stream was returning Pending without ensuring the waker was set up to be notified when the congestion condition cleared. I'm fairly confident that's not correct since the trait contract explicitly states

Poll::Pending means that this stream’s next value is not ready yet. Implementations will ensure that the current task will be notified when the next value may be ready.

I've fixed the test implementation which makes the test pass again, but would like to get your opinion on this. Maybe I'm missing something.

@pepijnve pepijnve force-pushed the issue_16321 branch 3 times, most recently from ad50fbe to 8503407 Compare June 8, 2025 09:18
@Dandandan
Copy link
Contributor

Dandandan commented Jun 8, 2025

This seems really nice 🚀
On my machine I get roughly 10% improvement on queries with SPM - which I think makes sense on a 10 core machine (with less cores it probably has a larger effect).

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ issue_16321 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  809.46 ms │   733.99 ms │ +1.10x faster │
│ QQuery 2     │  165.80 ms │   148.89 ms │ +1.11x faster │
│ QQuery 3     │  432.56 ms │   386.04 ms │ +1.12x faster │
│ QQuery 4     │  459.88 ms │   467.66 ms │     no change │
│ QQuery 5     │  671.13 ms │   621.34 ms │ +1.08x faster │
│ QQuery 6     │  181.36 ms │   184.77 ms │     no change │
│ QQuery 7     │  959.96 ms │   871.05 ms │ +1.10x faster │
│ QQuery 8     │  672.39 ms │   627.96 ms │ +1.07x faster │
│ QQuery 9     │ 1101.98 ms │  1023.84 ms │ +1.08x faster │
│ QQuery 10    │  638.41 ms │   552.55 ms │ +1.16x faster │
│ QQuery 11    │  126.22 ms │   123.91 ms │     no change │
│ QQuery 12    │  358.25 ms │   329.10 ms │ +1.09x faster │
│ QQuery 13    │  720.99 ms │   656.95 ms │ +1.10x faster │
│ QQuery 14    │  247.23 ms │   251.52 ms │     no change │
│ QQuery 15    │  395.58 ms │   369.79 ms │ +1.07x faster │
│ QQuery 16    │  110.37 ms │   103.96 ms │ +1.06x faster │
│ QQuery 17    │ 1193.78 ms │  1206.00 ms │     no change │
│ QQuery 18    │ 1846.58 ms │  1668.23 ms │ +1.11x faster │
│ QQuery 19    │  412.76 ms │   400.33 ms │     no change │
│ QQuery 20    │  421.08 ms │   392.58 ms │ +1.07x faster │
│ QQuery 21    │ 1363.39 ms │  1275.33 ms │ +1.07x faster │
│ QQuery 22    │  149.67 ms │   132.31 ms │ +1.13x faster │
└──────────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)          │ 13438.84ms │
│ Total Time (issue_16321)   │ 12528.08ms │
│ Average Time (main)        │   610.86ms │
│ Average Time (issue_16321) │   569.46ms │
│ Queries Faster             │         16 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          6 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘

@pepijnve
Copy link
Contributor Author

pepijnve commented Jun 8, 2025

@Dandandan project newbie question, my daily practice at work is to handle code review comments using amend/force-push. Did so out of habit before thinking to as ask. Is that ok in this project or does the community prefer fixup commits and squash before merging?

@Dandandan
Copy link
Contributor

@Dandandan project newbie question, my daily practice at work is to handle code review comments using amend/force-push. Did so out of habit before thinking to as ask. Is that ok in this project or does the community prefer fixup commits and squash before merging?

IMO it doesn't matter too much (personally I like to create individual commits).

We'll squash-merge the commits.

@Dandandan
Copy link
Contributor

I profiled some queries to verify it's no longer busy on the sorting thread:

Main (TPC-H query 1):
image

PR (TPC-H query 1)
image

The thread above is the thread executing SPM, you can see it does close to nothing after this PR (after reading metadata).

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀 maybe @alamb or @berkaysynnada can have another look

@alamb
Copy link
Contributor

alamb commented Jun 9, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_16321 (c3d5ae9) to 1daa5ed diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pepijnve and @Dandandan -- this makes sense to me too. I took a careful look at the code and I found it to be easy to follow and clear ❤️

I think we should wait a few days before mergning this to see if others would like to review. I think @jayzhan211 and @ozankabak may also be interested

// If the polling result is Poll::Ready(Some(batch)) or Poll::Ready(None),
// we remove this partition from the queue so it is not polled again.
self.uninitiated_partitions.pop_front();
// The polled stream is ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for these comments 👍

@alamb
Copy link
Contributor

alamb commented Jun 9, 2025

🤖: Benchmark completed

Details

Comparing HEAD and issue_16321
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_16321 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1900.64 ms │  1941.03 ms │     no change │
│ QQuery 1     │   698.77 ms │   685.55 ms │     no change │
│ QQuery 2     │  1423.32 ms │  1336.60 ms │ +1.06x faster │
│ QQuery 3     │   708.54 ms │   677.63 ms │     no change │
│ QQuery 4     │  1450.18 ms │  1371.41 ms │ +1.06x faster │
│ QQuery 5     │ 15851.43 ms │ 14804.05 ms │ +1.07x faster │
│ QQuery 6     │  2016.34 ms │  2046.14 ms │     no change │
│ QQuery 7     │  2124.47 ms │  1948.45 ms │ +1.09x faster │
│ QQuery 8     │   845.96 ms │   828.21 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 27019.65ms │
│ Total Time (issue_16321)   │ 25639.07ms │
│ Average Time (HEAD)        │  3002.18ms │
│ Average Time (issue_16321) │  2848.79ms │
│ Queries Faster             │          4 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          5 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ issue_16321 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    14.98 ms │    15.25 ms │     no change │
│ QQuery 1     │    31.43 ms │    32.60 ms │     no change │
│ QQuery 2     │    81.15 ms │    80.01 ms │     no change │
│ QQuery 3     │    97.82 ms │    99.49 ms │     no change │
│ QQuery 4     │   576.70 ms │   584.19 ms │     no change │
│ QQuery 5     │   833.02 ms │   841.56 ms │     no change │
│ QQuery 6     │    23.21 ms │    23.25 ms │     no change │
│ QQuery 7     │    38.30 ms │    34.79 ms │ +1.10x faster │
│ QQuery 8     │   913.51 ms │   898.29 ms │     no change │
│ QQuery 9     │  1197.30 ms │  1126.00 ms │ +1.06x faster │
│ QQuery 10    │   262.45 ms │   248.43 ms │ +1.06x faster │
│ QQuery 11    │   297.77 ms │   282.01 ms │ +1.06x faster │
│ QQuery 12    │   891.35 ms │   868.87 ms │     no change │
│ QQuery 13    │  1321.37 ms │  1247.47 ms │ +1.06x faster │
│ QQuery 14    │   826.48 ms │   809.29 ms │     no change │
│ QQuery 15    │   808.37 ms │   754.75 ms │ +1.07x faster │
│ QQuery 16    │  1722.67 ms │  1620.69 ms │ +1.06x faster │
│ QQuery 17    │  1594.99 ms │  1593.08 ms │     no change │
│ QQuery 18    │  3101.36 ms │  2892.25 ms │ +1.07x faster │
│ QQuery 19    │    83.25 ms │    84.66 ms │     no change │
│ QQuery 20    │  1121.24 ms │  1146.38 ms │     no change │
│ QQuery 21    │  1313.87 ms │  1298.49 ms │     no change │
│ QQuery 22    │  2157.02 ms │  2129.27 ms │     no change │
│ QQuery 23    │  7965.30 ms │  7780.56 ms │     no change │
│ QQuery 24    │   465.54 ms │   454.83 ms │     no change │
│ QQuery 25    │   394.55 ms │   386.33 ms │     no change │
│ QQuery 26    │   537.44 ms │   523.72 ms │     no change │
│ QQuery 27    │  1593.11 ms │  1547.66 ms │     no change │
│ QQuery 28    │ 13504.42 ms │ 11770.24 ms │ +1.15x faster │
│ QQuery 29    │   535.75 ms │   535.55 ms │     no change │
│ QQuery 30    │   791.38 ms │   784.48 ms │     no change │
│ QQuery 31    │   837.24 ms │   809.86 ms │     no change │
│ QQuery 32    │  2631.49 ms │  2501.55 ms │     no change │
│ QQuery 33    │  3317.30 ms │  3148.53 ms │ +1.05x faster │
│ QQuery 34    │  3329.56 ms │  3233.21 ms │     no change │
│ QQuery 35    │  1265.38 ms │  1238.90 ms │     no change │
│ QQuery 36    │   120.28 ms │   120.88 ms │     no change │
│ QQuery 37    │    57.17 ms │    56.21 ms │     no change │
│ QQuery 38    │   122.92 ms │   122.28 ms │     no change │
│ QQuery 39    │   198.52 ms │   198.42 ms │     no change │
│ QQuery 40    │    44.65 ms │    46.41 ms │     no change │
│ QQuery 41    │    45.72 ms │    45.15 ms │     no change │
│ QQuery 42    │    37.71 ms │    37.05 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 57105.04ms │
│ Total Time (issue_16321)   │ 54052.88ms │
│ Average Time (HEAD)        │  1328.02ms │
│ Average Time (issue_16321) │  1257.04ms │
│ Queries Faster             │         10 │
│ Queries Slower             │          0 │
│ Queries with No Change     │         33 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ issue_16321 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 121.29 ms │   100.08 ms │ +1.21x faster │
│ QQuery 2     │  21.52 ms │    21.01 ms │     no change │
│ QQuery 3     │  33.52 ms │    32.30 ms │     no change │
│ QQuery 4     │  19.89 ms │    18.63 ms │ +1.07x faster │
│ QQuery 5     │  52.00 ms │    49.78 ms │     no change │
│ QQuery 6     │  11.93 ms │    12.03 ms │     no change │
│ QQuery 7     │  96.25 ms │    86.38 ms │ +1.11x faster │
│ QQuery 8     │  25.43 ms │    25.22 ms │     no change │
│ QQuery 9     │  60.11 ms │    54.46 ms │ +1.10x faster │
│ QQuery 10    │  48.60 ms │    43.98 ms │ +1.11x faster │
│ QQuery 11    │  11.51 ms │    11.60 ms │     no change │
│ QQuery 12    │  41.89 ms │    35.09 ms │ +1.19x faster │
│ QQuery 13    │  27.03 ms │    25.77 ms │     no change │
│ QQuery 14    │   9.75 ms │     9.82 ms │     no change │
│ QQuery 15    │  22.33 ms │    19.91 ms │ +1.12x faster │
│ QQuery 16    │  21.29 ms │    18.76 ms │ +1.13x faster │
│ QQuery 17    │  94.81 ms │    96.53 ms │     no change │
│ QQuery 18    │ 212.50 ms │   200.15 ms │ +1.06x faster │
│ QQuery 19    │  26.21 ms │    25.71 ms │     no change │
│ QQuery 20    │  34.31 ms │    31.81 ms │ +1.08x faster │
│ QQuery 21    │ 154.08 ms │   148.59 ms │     no change │
│ QQuery 22    │  16.22 ms │    14.97 ms │ +1.08x faster │
└──────────────┴───────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1162.48ms │
│ Total Time (issue_16321)   │ 1082.57ms │
│ Average Time (HEAD)        │   52.84ms │
│ Average Time (issue_16321) │   49.21ms │
│ Queries Faster             │        11 │
│ Queries Slower             │         0 │
│ Queries with No Change     │        11 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

@ozankabak
Copy link
Contributor

ozankabak commented Jun 9, 2025

Since this changes the congestion behavior test, which I'm not deeply familiar with, let's hear from @berkaysynnada on this to make sure we are not losing anything (he will be back tomorrow, so it shouldn't be long). If he is OK with the change, 10% performance improvement would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Busy-waiting in SortPreservingMergeStream
4 participants