
Try to speed up #[derive(ValidGrouping)] #3163

Closed
wants to merge 2 commits

Conversation


@lqd lqd commented May 3, 2022

While looking at rust-lang/rust#81262, I also tried looking at the master branch to see how things had evolved since this issue was opened, in particular since the use of proc_macros.

While testing 32/64/128 columns, I noticed that:

This proc_macro takes roughly one second to execute per invocation once a struct reaches 100 to 128 columns. I believe it could be interesting for compilation times to have a structure that is less expansion-heavy, especially the MixedAggregates chain, which causes a quadratic amount of expansion.
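To illustrate the quadratic shape, here is a simplified sketch: `MixedAggregates` and `No` are stand-ins for diesel's real items, and the bounds show the shape of what the derive emits, not its exact output.

```rust
// Simplified stand-ins for diesel's items; the bounds below illustrate the
// shape of what the derive emits, not its exact output.
trait MixedAggregates<Other> {
    type Output;
}

struct No; // stand-in for is_aggregate::No
impl MixedAggregates<No> for No {
    type Output = No;
}

// For three columns A, B, C the derive emits a chain like this. Note how the
// bound for each new column repeats the entire chain built so far, so the
// emitted tokens grow quadratically with the column count.
fn check<A, B, C>()
where
    A: MixedAggregates<B>,
    <A as MixedAggregates<B>>::Output: MixedAggregates<C>,
{
}

fn main() {
    check::<No, No, No>();
}
```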

Since I'm not familiar enough with diesel and its aggregates to successfully come up with such a design, I wanted to notify you about this finding, while trying to help in the meantime. This PR reduces the constant factors of this proc_macro (while still retaining its quadratic nature) so maybe it can help a bit, in that it could be "fast enough", and that new slowness and opportunities for improvement are to be found elsewhere.

It seems the mixed-aggregates chain, together with the use of quote for many small quoting/unquoting concatenations, builds up big token streams that get slower and slower to parse and re-parse at each step. In this PR, I've tried to build up the code in a string buffer, to limit the number of times it is parsed with quote. It's not ideal and probably a bit more error-prone, but I don't think quote was made to handle this use case quickly (even after the recent performance improvements).
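A sketch of the string-buffer strategy described above (the identifiers and generated text are illustrative, not diesel's actual output): instead of nesting quote! invocations that re-parse the growing TokenStream at every step, the bounds are accumulated in one String, which would then be parsed a single time at the end (e.g. via `str::parse::<proc_macro2::TokenStream>()`).

```rust
use std::fmt::Write;

// Hypothetical helper: builds the chained bounds for the given columns as
// plain text, with one cheap formatted append per column and no
// intermediate re-parsing.
fn build_bounds(columns: &[&str]) -> String {
    let mut buf = String::new();
    let mut acc = String::from("is_aggregate::No");
    for col in columns {
        writeln!(buf, "{acc}: MixedAggregates<{col}>,").unwrap();
        // The accumulator still grows linearly per column, so the total text
        // is still quadratic, but string appends have much smaller constant
        // factors than repeated TokenStream parsing.
        acc = format!("<{acc} as MixedAggregates<{col}>>::Output");
    }
    buf
}

fn main() {
    let bounds = build_bounds(&["A", "B"]);
    assert!(bounds.contains("is_aggregate::No: MixedAggregates<A>,"));
}
```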

In a benchmark with a single use of the derive per column-count, I see these results for a non-incremental cargo check of the benchmark, from 1 to 128 columns:

[Chart: ValidGrouping diesel benchmark, non-incremental cargo check, 1 to 128 columns]

Proc_macro expansion itself is 30x faster at 128 columns (from 1s to 35ms or so) for this one macro invocation, and as seen in the above chart, the total compilation time of this simple stress test is also cut by more than half (which then reveals other bottlenecks in rustc).

That also translates to diesel itself; proc_macro expansion time is reduced for a cargo check:

  • 3.5x for 32-column-tables: from 1s to 280ms
  • 10x for 64-column-tables: from 6s to 600ms
  • 22x for 128-column-tables: from 45s to 2s

End-to-end compile times are reduced a tiny bit for the heavier cases: the ~40s saved are noticeable, but the cargo check itself takes around 10 minutes, so that's only a 4-5% improvement there.

I'm opening this PR as a draft for feedback, to see if the approach is acceptable; and to see if CI likes it (as it seems likely that I forgot some use-cases with different forms of input).

lqd added 2 commits May 3, 2022 17:39
this results in a 30x improvement for a 128-column struct
this results in a few percent improvement
@weiznich
Member

weiznich commented May 3, 2022

Thanks for opening this PR and doing all this analysis ❤️

While trying to put together a simple example showing why this quadratic expansion of MixedAggregates is required, I got kind of nerd sniped into trying another approach. The historical reason for having this kind of expansion is to work around rust-lang/rust#75542 (#2480). I had kind of assumed that this happens for all traits with such a recursive structure, but it turns out that this is not true.

Seems like we can replace the corresponding derive + some other recursive impls as well. See weiznich@61351f7 for the corresponding change. This brings the time of a cargo check --features "sqlite 64-column-table" --no-default-features build for me down from ~1 min 20 s to ~1 min 05 s, so nearly a 20% improvement.

Given that success, I've tried to apply the same approach to the recursive SqlType impl

```rust
macro_rules! impl_sql_type {
    (
        @build
        start_ts = [$($ST: ident,)*],
        ts = [$T1: ident,],
        bounds = [$($bounds: tt)*],
        is_null = [$($is_null: tt)*],
    ) => {
        impl<$($ST,)*> SqlType for ($($ST,)*)
        where
            $($ST: SqlType,)*
            $($bounds)*
            $T1::IsNull: OneIsNullable<$($is_null)*>,
        {
            type IsNull = <$T1::IsNull as OneIsNullable<$($is_null)*>>::Out;
        }
    };
    (
        @build
        start_ts = [$($ST: ident,)*],
        ts = [$T1: ident, $($T: ident,)+],
        bounds = [$($bounds: tt)*],
        is_null = [$($is_null: tt)*],
    ) => {
        impl_sql_type! {
            @build
            start_ts = [$($ST,)*],
            ts = [$($T,)*],
            bounds = [$($bounds)* $T1::IsNull: OneIsNullable<$($is_null)*>,],
            is_null = [<$T1::IsNull as OneIsNullable<$($is_null)*>>::Out],
        }
    };
    ($T1: ident, $($T: ident,)+) => {
        impl_sql_type! {
            @build
            start_ts = [$T1, $($T,)*],
            ts = [$($T,)*],
            bounds = [],
            is_null = [$T1::IsNull],
        }
    };
    ($T1: ident,) => {
        impl<$T1> SqlType for ($T1,)
        where
            $T1: SqlType,
        {
            type IsNull = $T1::IsNull;
        }
    };
}
```
for tuples as well. As far as I know, this is one of the spots where rustc spends quite a lot of time. See weiznich@efa182f for the corresponding change.
Unfortunately, it is not possible to apply the same pattern there. It results in quite large build times (> 13 minutes for me for a full cargo check --features "sqlite" --no-default-features). It's not clear to me why this happens and what's even different there compared to the other traits. All of those tuple impls basically work by forwarding stuff to the next smaller tuple. In theory, rustc should be able to resolve that without quadratic behaviour, assuming the corresponding impls are cached.
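The "forward to the next smaller tuple" pattern can be sketched like this (a minimal, assumed reconstruction modeled on diesel::util::TupleSize, not diesel's actual code):

```rust
// Minimal sketch of the delegation pattern described above.
trait TupleSize {
    const SIZE: usize;
}

// Base case: a one-element tuple has size 1.
impl<T1> TupleSize for (T1,) {
    const SIZE: usize = 1;
}

// Each larger tuple delegates to the one-smaller tuple, so in principle
// rustc only needs a constant amount of work per impl, assuming resolved
// impls are cached.
impl<T1, T2> TupleSize for (T1, T2) {
    const SIZE: usize = <(T2,) as TupleSize>::SIZE + 1;
}

impl<T1, T2, T3> TupleSize for (T1, T2, T3) {
    const SIZE: usize = <(T2, T3) as TupleSize>::SIZE + 1;
}

fn main() {
    assert_eq!(<(u8, u16, u32) as TupleSize>::SIZE, 3);
}
```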

@lqd
Author

lqd commented May 4, 2022

Great news!

I was checking this as part of the initiative to improve compile times, in case it would be harder to remove these tuple wrappers and the associated derives, rather than optimizing the macro.

Your change will help diesel and downstream crates, but not users who use the derive directly; I assume that is rare, though, especially with a large number of generic parameters, right?

I was able to reproduce your results on the first commit: it indeed removes most of the proc-macro expansion step, as well as a chunk of regular macro expansion, and hits your other issue #81263 much less than before.

For the second commit, like you mention, it's very slow, but 99% of the compile time is due to another rustc issue, about the specialization graph. I've seen it in a few other crates: IIRC it's something quadratic in coherence checking that starts becoming noticeable as bigger Ns weigh heavily in an O(N²) algorithm, e.g. in bigger crates.

I wonder if your other MCVEs hit the same issue: it would be great to have a smaller-scale reproduction for this issue, which is hard to reduce from all the unrelated code that big crates have (along with the actual code triggering the slowness). If your other issues' reproduction examples do hit it, it would be super helpful for people to even begin analyzing the nonlinearity in coherence.

@lqd
Author

lqd commented May 4, 2022

In case it helps, here's a breakdown of the traits whose tuple impls trigger the specialization graph slowness (until I have more info on what exactly is slow there, as it's doing a bunch of things at the same time, like setting up specialization/coherence/overlap):

| trait | number of impls | time in seconds |
|---|---|---|
| diesel::deserialize::FromSqlRow | 17 | 94 |
| diesel::deserialize::SqlTypeOrSelectable | 34 | 144 |
| diesel::deserialize::FromStaticSqlRow | 33 | 48 |
| diesel::expression::TypedExpressionType | 36 | 47 |
| diesel::expression::QueryMetadata | 35 | 48 |
| diesel::query_dsl::load_dsl::private::CompatibleType | 35 | 143 |
| diesel::sql_types::IntoNullable | 18 | 126 |
| diesel::deserialize::Queryable | 49 | 95 |
| diesel::util::TupleSize | 33 | 47 |

For reproduction and minimization purposes, it looks like TupleSize and its impls are simple and self-contained enough that it may be worth trying to extract them, to see if that's sufficient to trigger the issue, or if it requires some other diesel structures inside the specialization graph (like the intricacies and other impls of SqlType, which unfortunately seems likely, since efa182f mostly changed those impls?).

@weiznich weiznich mentioned this pull request May 5, 2022
@weiznich
Member

weiznich commented May 6, 2022

Your change will help diesel and downstream crates, but not users who use the derive directly, but I assume this is rare, especially with a large number of generic parameters, right ?

Yes, I expect that no one outside of diesel is using this derive with more than a handful of type parameters. After all, that derive is designed to be used with hand-written query DSL extensions, and you don't want to maintain a list of dozens of type parameters there manually.

I wonder if your other MCVEs do hit the same issue: it would be great to have a smaller scale reproduction for this issue which is hard to reduce among all the unrelated code that big crates have (along with the actual code triggering the slowness). If your other issues' reproduction examples do hit it, it would be super helpful for people to even begin analyzing the nonlinearity in coherence.

How do I verify that the example given as part of rust-lang/rust#75542 hits the same issue? I would assume that it does, as it is a quite literal extraction of the relevant code from diesel. This example is fairly self-contained, as it only contains 5 traits + 3 structs. Additionally, I can add that removing the generic IntoNullable impl in line 26 makes the code compile almost instantly (0.15s for 16 columns), while with this impl added it takes > 1 minute.

```rust
impl<T> IntoNullable for T
where
    T: SqlType<IsNull = is_nullable::NotNull> + SingleValue,
{
    type Nullable = Nullable<T>;
}
```
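For context, a self-contained sketch of the surrounding trait machinery under which this blanket impl applies (simplified and reconstructed from the MCVE's names; `Integer` is an illustrative type, not the exact code from rust-lang/rust#75542):

```rust
mod is_nullable {
    pub struct NotNull;
    pub struct IsNullable;
}

trait SqlType {
    type IsNull;
}
trait SingleValue {}

// An illustrative concrete SQL type.
struct Integer;
impl SqlType for Integer {
    type IsNull = is_nullable::NotNull;
}
impl SingleValue for Integer {}

struct Nullable<T>(std::marker::PhantomData<T>);
impl<T: SqlType> SqlType for Nullable<T> {
    type IsNull = is_nullable::IsNullable;
}

trait IntoNullable {
    type Nullable;
}

// The blanket impl quoted above: every not-null, single-value SQL type
// becomes Nullable<T>. This is the impl whose removal reportedly makes the
// MCVE compile almost instantly.
impl<T> IntoNullable for T
where
    T: SqlType<IsNull = is_nullable::NotNull> + SingleValue,
{
    type Nullable = Nullable<T>;
}

fn assert_into_nullable<T: IntoNullable>() {}

fn main() {
    // Integer is NotNull + SingleValue, so the blanket impl applies.
    assert_into_nullable::<Integer>();
}
```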

For reproduction and minimization purposes, it looks like TupleSize and its impls are simple enough and self-contained, that it may be worth trying to extract it to see it it's sufficient to trigger the issue, or if it requires some other diesel structures inside the specialization graph for that (like the intricacies and other impls of SqlType, which seemed unfortunately likely since efa182f mostly changed those impls ?).

I've tried to put TupleSize into the template given by the minimal example in rust-lang/rust#75542, but that by itself does not seem to be enough to reproduce the issue. It compiles for me in 0.13s (16 columns) and 0.19s (32 columns).

All that written: I'm happy to help with whatever code/knowledge from diesel's side is required to fix this issue.

@lqd
Author

lqd commented May 6, 2022

Yes, I expect that no one outside of diesel is using this derive with more than a handful type parameters. After all that derive is designed to be used with hand written query DSL extensions and you don't want to maintain a list of dozens of type parameters there manually.

Great, so we can probably just close this PR then; it's not going to land. #3167 has landed instead, and it's better progress for what I was initially looking at: a cargo check build for 128-column tables is now down to 7 minutes from 10-11 minutes.

How do I verify that the example given as part of rust-lang/rust#75542 hits the same issues?

Unfortunately, I don't think one easily can. From the high-level self-profiling output, I was expecting this to be a coherence issue: both show up as heavy "specialization_graph_of" activity. I have since been looking into this part of rustc more, timing the different parts of specialization/coherence/trait selection, and I'm not sure it's exactly the same issue (the one I was thinking of looks older than this one). It's pretty close: it's still called in the same rustc query and in coherence, but it is then trying to prove predicates for negative impls or something, and then becomes slow in the trait system for some yet-unknown reason.

It seems to me like it could be related to the exponential blowup that was seen when some of the trait system cache was removed to fix incremental compilation issues. But that happened in 1.57 IIRC, and it looks like the regression we're seeing in rust-lang/rust#75542 was actually introduced in 1.56:

| version | self-profile activity | self time | % of total time | time | count |
|---|---|---|---|---|---|
| 1.55.0 | specialization_graph_of | 34.42ms | 12.915 | 36.64ms | 7 |
| 1.56.0 | specialization_graph_of | 1.39s | 85.087 | 1.39s | 7 |
| 1.57.0 | specialization_graph_of | 1.38s | 86.052 | 1.38s | 5 |
| 1.58.0 | specialization_graph_of | 1.26s | 85.275 | 1.26s | 5 |
| 1.59.0 | specialization_graph_of | 1.11s | 82.733 | 1.11s | 5 |
| 1.60.0 | specialization_graph_of | 1.01s | 83.529 | 1.01s | 5 |

I have not yet managed to find which nightly or PR introduced this issue, but will try to do so. It should help show the exact part of rustc that is affected by this example structure.

In the meantime, I have also been trying to talk to the experts in that area of the compiler, to see if we can have a discussion and try to understand what's going on in this example, and if/how it can be fixed.

I've also played with your example, and I believe it's good enough for that purpose: you have done the hard work minimizing already. Thanks for that. As you mentioned, I had also seen that its weird behavior is kind of fragile, and even small changes to the repro make it compile fast.

I will continue this investigation, and discuss with the trait system & caching and incremental query system experts.

@lqd lqd closed this May 6, 2022
@lqd lqd deleted the valid_grouping_proc_macro branch May 6, 2022 18:50
@weiznich
Member

Thanks for looking into this ❤️

It seems to me like it could be related to the exponential blowup that was seen when some of the trait system cache was removed to fix incremental compilation issues. But that happened in 1.57 IIRC, and it looks like the regression we're seeing in rust-lang/rust#75542 was actually introduced in 1.56:

I would like to add here that the original rustc issue predates 1.55; it is from nearly 2 years back. So be prepared for another underlying issue there. This likely does not show up immediately, as the minimal example only sets 16 columns as the default in line 861. Increasing that to 32 columns did lead to a compile-time "explosion" even before the recent incremental compilation issue fixes.

@lqd
Author

lqd commented May 10, 2022

Yes indeed, I've noticed this as well. There are multiple steps of regressions over time, unfortunately, as well as multiple issues. And they affect diesel, nalgebra, bitmaps differently.

For the 16-column case above, it was easy-ish: given the 30x increase, we can look for the version that makes this query go over 1s. I've bisected the 1.56 regression this way to rust-lang/rust#89125, for example.
