
Try to speed up #[derive(ValidGrouping)] #3163

Closed
wants to merge 2 commits

Conversation


@lqd lqd commented May 3, 2022

While looking at rust-lang/rust#81262, I also tried looking at the master branch to see how things had evolved since this issue was opened, in particular since the use of proc_macros.

While testing 32/64/128 columns, I noticed that:

This proc_macro takes roughly one second to execute per invocation once a struct reaches 100 to 128 columns. I believe it could be interesting for compilation times to have a structure that is less expansion-heavy, especially the MixedAggregates chain, which causes a quadratic amount of expansion.
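To illustrate the quadratic shape, here is a simplified sketch: `MixedAggregates` and `No` are stand-ins for diesel's real items, and the bounds show the shape of what the derive emits, not its exact output.

```rust
// Simplified stand-ins for diesel's items; the bounds below illustrate the
// shape of what the derive emits, not its exact output.
trait MixedAggregates<Other> {
    type Output;
}

struct No; // stand-in for is_aggregate::No
impl MixedAggregates<No> for No {
    type Output = No;
}

// For three columns A, B, C the derive emits a chain like this. Note how the
// bound for each new column repeats the entire chain built so far, so the
// emitted tokens grow quadratically with the column count.
fn check<A, B, C>()
where
    A: MixedAggregates<B>,
    <A as MixedAggregates<B>>::Output: MixedAggregates<C>,
{
}

fn main() {
    check::<No, No, No>();
}
```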

Since I'm not familiar enough with diesel and its aggregates to successfully come up with such a design, I wanted to notify you about this finding, while trying to help in the meantime. This PR reduces the constant factors of this proc_macro (while still retaining its quadratic nature) so maybe it can help a bit, in that it could be "fast enough", and that new slowness and opportunities for improvement are to be found elsewhere.

It seems the mixed-aggregates chain, together with the use of quote for many small quoting/unquoting concatenations, builds up big token streams that get slower and slower to parse and re-parse at each step. In this PR, I've tried to build up the code in a string buffer, to limit the number of times it is parsed with quote. It's not ideal and probably a bit more error-prone, but I don't think quote was made to handle this use case quickly (even after the recent performance improvements).
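A sketch of the string-buffer strategy described above (the identifiers and generated text are illustrative, not diesel's actual output): instead of nesting quote! invocations that re-parse the growing TokenStream at every step, the bounds are accumulated in one String, which would then be parsed a single time at the end (e.g. via `str::parse::<proc_macro2::TokenStream>()`).

```rust
use std::fmt::Write;

// Hypothetical helper: builds the chained bounds for the given columns as
// plain text, with one cheap formatted append per column and no
// intermediate re-parsing.
fn build_bounds(columns: &[&str]) -> String {
    let mut buf = String::new();
    let mut acc = String::from("is_aggregate::No");
    for col in columns {
        writeln!(buf, "{acc}: MixedAggregates<{col}>,").unwrap();
        // The accumulator still grows linearly per column, so the total text
        // is still quadratic, but string appends have much smaller constant
        // factors than repeated TokenStream parsing.
        acc = format!("<{acc} as MixedAggregates<{col}>>::Output");
    }
    buf
}

fn main() {
    let bounds = build_bounds(&["A", "B"]);
    assert!(bounds.contains("is_aggregate::No: MixedAggregates<A>,"));
}
```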

In a benchmark with a single use of the derive per column-count, I see these results for a non-incremental cargo check of the benchmark, from 1 to 128 columns:

[Chart: ValidGrouping diesel benchmark, non-incremental cargo check, 1 to 128 columns]

Proc_macro expansion itself is 30x faster at 128 columns (from 1s to 35ms or so) for this one macro invocation, and as seen in the above chart, the total compilation time of this simple stress test is also cut by more than half (which then reveals other bottlenecks in rustc).

That also translates to diesel itself; proc_macro expansion time is reduced for a cargo check:

  • 3.5x for 32-column-tables: from 1s to 280ms
  • 10x for 64-column-tables: from 6s to 600ms
  • 22x for 128-column-tables: from 45s to 2s

End-to-end compile times are reduced a tiny bit for the heavier cases: the ~40s saved are noticeable, but the cargo check itself takes around 10 minutes, so that's only a 4-5% improvement there.

I'm opening this PR as a draft for feedback, to see if the approach is acceptable; and to see if CI likes it (as it seems likely that I forgot some use-cases with different forms of input).

lqd added 2 commits May 3, 2022 17:39
this results in a 30x improvement for a 128-column struct
this results in a few percent improvement
@weiznich
Member

weiznich commented May 3, 2022

Thanks for opening this PR and doing all this analysis ❤️

While trying to put together a simple example showing why this quadratic expansion of MixedAggregates is required, I got kind of nerd sniped into trying another approach. The historical reason for having this kind of expansion is to work around rust-lang/rust#75542 (#2480). I had kind of assumed that this happens for all traits with such a recursive structure, but it turns out that this is not true.

Seems like we can replace the corresponding derive + some other recursive impls as well. See weiznich@61351f7 for the corresponding change. This brings the time of a cargo check --features "sqlite 64-column-table" --no-default-features build for me down from ~1 min 20 s to ~1 min 05 s, so nearly a 20% improvement.

Given that success, I've tried to apply the same approach to the recursive SqlType impl

```rust
macro_rules! impl_sql_type {
    (
        @build
        start_ts = [$($ST: ident,)*],
        ts = [$T1: ident,],
        bounds = [$($bounds: tt)*],
        is_null = [$($is_null: tt)*],
    ) => {
        impl<$($ST,)*> SqlType for ($($ST,)*)
        where
            $($ST: SqlType,)*
            $($bounds)*
            $T1::IsNull: OneIsNullable<$($is_null)*>,
        {
            type IsNull = <$T1::IsNull as OneIsNullable<$($is_null)*>>::Out;
        }
    };
    (
        @build
        start_ts = [$($ST: ident,)*],
        ts = [$T1: ident, $($T: ident,)+],
        bounds = [$($bounds: tt)*],
        is_null = [$($is_null: tt)*],
    ) => {
        impl_sql_type! {
            @build
            start_ts = [$($ST,)*],
            ts = [$($T,)*],
            bounds = [$($bounds)* $T1::IsNull: OneIsNullable<$($is_null)*>,],
            is_null = [<$T1::IsNull as OneIsNullable<$($is_null)*>>::Out],
        }
    };
    ($T1: ident, $($T: ident,)+) => {
        impl_sql_type! {
            @build
            start_ts = [$T1, $($T,)*],
            ts = [$($T,)*],
            bounds = [],
            is_null = [$T1::IsNull],
        }
    };
    ($T1: ident,) => {
        impl<$T1> SqlType for ($T1,)
        where
            $T1: SqlType,
        {
            type IsNull = $T1::IsNull;
        }
    };
}
```
for tuples as well. As far as I know, this is one of the spots where rustc spends quite a lot of time. See weiznich@efa182f for the corresponding change.
Unfortunately, it is not possible to apply the same pattern there. It results in quite large build times (> 13 minutes for me for a full cargo check --features "sqlite" --no-default-features). It's not clear to me why this happens and what's even different there compared to the other traits. All of those tuple impls basically work by forwarding stuff to the next smaller tuple. In theory, rustc should be able to resolve that without quadratic behaviour, assuming the corresponding impls are cached.
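The "forward to the next smaller tuple" pattern can be sketched like this (a minimal, assumed reconstruction modeled on diesel::util::TupleSize, not diesel's actual code):

```rust
// Minimal sketch of the delegation pattern described above.
trait TupleSize {
    const SIZE: usize;
}

// Base case: a one-element tuple has size 1.
impl<T1> TupleSize for (T1,) {
    const SIZE: usize = 1;
}

// Each larger tuple delegates to the one-smaller tuple, so in principle
// rustc only needs a constant amount of work per impl, assuming resolved
// impls are cached.
impl<T1, T2> TupleSize for (T1, T2) {
    const SIZE: usize = <(T2,) as TupleSize>::SIZE + 1;
}

impl<T1, T2, T3> TupleSize for (T1, T2, T3) {
    const SIZE: usize = <(T2, T3) as TupleSize>::SIZE + 1;
}

fn main() {
    assert_eq!(<(u8, u16, u32) as TupleSize>::SIZE, 3);
}
```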

@lqd
Author

lqd commented May 4, 2022

Great news!

I was checking this as part of the initiative to improve compile times, in case it would be harder to remove these tuple wrappers and the associated derives, rather than optimizing the macro.

Your change will help diesel and downstream crates, but not users who use the derive directly; I assume that is rare, though, especially with a large number of generic parameters, right?

I was able to reproduce your results on the first commit: it indeed removes most of the proc-macro expansion step, as well as a chunk of regular macro expansion, and hits your other issue #81263 much less than before.

For the second commit, like you mention, it's very slow, but 99% of the compile time is due to another rustc issue, about the specialization graph. I've seen it in a few other crates: IIRC it's something quadratic in coherence checking that starts becoming noticeable as bigger Ns weigh heavily in an O(N²) algorithm, e.g. in bigger crates.

I wonder if your other MCVEs hit the same issue: it would be great to have a smaller-scale reproduction for this issue, which is hard to reduce from all the unrelated code that big crates have (along with the actual code triggering the slowness). If your other issues' reproduction examples do hit it, it would be super helpful for people to even begin analyzing the nonlinearity in coherence.

@lqd
Author

lqd commented May 4, 2022

In case it helps, here's a breakdown of the traits whose tuple impls trigger the specialization graph slowness (until I have more info on what exactly is slow there, as it's doing a bunch of things at the same time, like setting up specialization/coherence/overlap):

| trait | number of impls | time in seconds |
|---|---|---|
| diesel::deserialize::FromSqlRow | 17 | 94 |
| diesel::deserialize::SqlTypeOrSelectable | 34 | 144 |
| diesel::deserialize::FromStaticSqlRow | 33 | 48 |
| diesel::expression::TypedExpressionType | 36 | 47 |
| diesel::expression::QueryMetadata | 35 | 48 |
| diesel::query_dsl::load_dsl::private::CompatibleType | 35 | 143 |
| diesel::sql_types::IntoNullable | 18 | 126 |
| diesel::deserialize::Queryable | 49 | 95 |
| diesel::util::TupleSize | 33 | 47 |

For reproduction and minimization purposes, it looks like TupleSize and its impls are simple and self-contained enough that it may be worth trying to extract them, to see if that's sufficient to trigger the issue, or if it requires some other diesel structures inside the specialization graph (like the intricacies and other impls of SqlType, which unfortunately seems likely, since efa182f mostly changed those impls?).

@weiznich weiznich mentioned this pull request May 5, 2022
@weiznich
Member

weiznich commented May 6, 2022

Your change will help diesel and downstream crates, but not users who use the derive directly, but I assume this is rare, especially with a large number of generic parameters, right ?

Yes, I expect that no one outside of diesel is using this derive with more than a handful of type parameters. After all, that derive is designed to be used with hand-written query DSL extensions, and you don't want to maintain a list of dozens of type parameters there manually.

I wonder if your other MCVEs do hit the same issue: it would be great to have a smaller scale reproduction for this issue which is hard to reduce among all the unrelated code that big crates have (along with the actual code triggering the slowness). If your other issues' reproduction examples do hit it, it would be super helpful for people to even begin analyzing the nonlinearity in coherence.

How do I verify that the example given as part of rust-lang/rust#75542 hits the same issue? I would assume that it does, as it is a quite literal extraction of the relevant code from diesel. This example is fairly self-contained, as it only contains 5 traits + 3 structs. Additionally, I can add that removing the generic IntoNullable impl in line 26 makes the code compile almost instantly (0.15s for 16 columns), while with this impl added it takes > 1 minute.

```rust
impl<T> IntoNullable for T
where
    T: SqlType<IsNull = is_nullable::NotNull> + SingleValue,
{
    type Nullable = Nullable<T>;
}
```
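For context, a self-contained sketch of the surrounding trait machinery under which this blanket impl applies (simplified and reconstructed from the MCVE's names; `Integer` is an illustrative type, not the exact code from rust-lang/rust#75542):

```rust
mod is_nullable {
    pub struct NotNull;
    pub struct IsNullable;
}

trait SqlType {
    type IsNull;
}
trait SingleValue {}

// An illustrative concrete SQL type.
struct Integer;
impl SqlType for Integer {
    type IsNull = is_nullable::NotNull;
}
impl SingleValue for Integer {}

struct Nullable<T>(std::marker::PhantomData<T>);
impl<T: SqlType> SqlType for Nullable<T> {
    type IsNull = is_nullable::IsNullable;
}

trait IntoNullable {
    type Nullable;
}

// The blanket impl quoted above: every not-null, single-value SQL type
// becomes Nullable<T>. This is the impl whose removal reportedly makes the
// MCVE compile almost instantly.
impl<T> IntoNullable for T
where
    T: SqlType<IsNull = is_nullable::NotNull> + SingleValue,
{
    type Nullable = Nullable<T>;
}

fn assert_into_nullable<T: IntoNullable>() {}

fn main() {
    // Integer is NotNull + SingleValue, so the blanket impl applies.
    assert_into_nullable::<Integer>();
}
```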

For reproduction and minimization purposes, it looks like TupleSize and its impls are simple enough and self-contained, that it may be worth trying to extract it to see it it's sufficient to trigger the issue, or if it requires some other diesel structures inside the specialization graph for that (like the intricacies and other impls of SqlType, which seemed unfortunately likely since efa182f mostly changed those impls ?).

I've tried to put TupleSize into the template given by the minimal example in rust-lang/rust#75542, but that by itself does not seem to be enough to reproduce the issue. It compiles for me in 0.13s (16 columns) and 0.19s (32 columns).

All that written: I'm happy to help with whatever code/knowledge from diesel's side is required to fix this issue.

@lqd
Author

lqd commented May 6, 2022

Yes, I expect that no one outside of diesel is using this derive with more than a handful type parameters. After all that derive is designed to be used with hand written query DSL extensions and you don't want to maintain a list of dozens of type parameters there manually.

Great, so we can probably just close this PR then; it's not going to land. #3167 has landed instead, and it's better progress for what I was initially looking at: a cargo check build for 128-column tables is now down to 7 minutes from 10-11 minutes.

How do I verify that the example given as part of rust-lang/rust#75542 hits the same issues?

Unfortunately, I don't think one easily can. From the high-level self-profiling output, I was expecting this to be a coherence issue: both show up as heavy "specialization_graph_of" activity. I have since been looking into this part of rustc more, timing the different parts of specialization/coherence/trait selection, and I'm not sure it's exactly the same issue (the one I was thinking of looks older than this one). It's pretty close: it's still called in the same rustc query and in coherence, but it is then trying to prove predicates for negative impls or something, and then becomes slow in the trait system for some yet-unknown reason.

It seems to me like it could be related to the exponential blowup that was seen when some of the trait system cache was removed to fix incremental compilation issues. But that happened in 1.57 IIRC, and it looks like the regression we're seeing in rust-lang/rust#75542 was actually introduced in 1.56:

| version | self-profile activity | self time | % of total time | time | count |
|---|---|---|---|---|---|
| 1.55.0 | specialization_graph_of | 34.42ms | 12.915 | 36.64ms | 7 |
| 1.56.0 | specialization_graph_of | 1.39s | 85.087 | 1.39s | 7 |
| 1.57.0 | specialization_graph_of | 1.38s | 86.052 | 1.38s | 5 |
| 1.58.0 | specialization_graph_of | 1.26s | 85.275 | 1.26s | 5 |
| 1.59.0 | specialization_graph_of | 1.11s | 82.733 | 1.11s | 5 |
| 1.60.0 | specialization_graph_of | 1.01s | 83.529 | 1.01s | 5 |

I have not yet managed to find which nightly or PR introduced this issue, but will try to do so. It should help show the exact part of rustc that is affected by this example structure.

In the meantime, I have also been trying to talk to the experts in that area of the compiler, to see if we can have a discussion and try to understand what's going on in this example, and if/how it can be fixed.

I've also played with your example, and I believe it's good enough for that purpose: you have done the hard work minimizing already. Thanks for that. As you mentioned, I had also seen that its weird behavior is kind of fragile, and even small changes to the repro make it compile fast.

I will continue this investigation, and discuss with the trait system & caching and incremental query system experts.

@lqd lqd closed this May 6, 2022
@lqd lqd deleted the valid_grouping_proc_macro branch May 6, 2022 18:50
@weiznich
Member

Thanks for looking into this ❤️

It seems to me like it could be related to the exponential blowup that was seen when some of the trait system cache was removed to fix incremental compilation issues. But that happened in 1.57 IIRC, and it looks like the regression we're seeing in rust-lang/rust#75542 was actually introduced in 1.56:

I would like to add here that the original rustc issue predates 1.55; it is from nearly 2 years back. So be prepared for another underlying issue there. This likely does not show up immediately, as the minimal example only sets 16 columns as the default in line 861. Increasing that to 32 columns did lead to a compile-time "explosion" even before the recent incremental compilation issue fixes.

@lqd
Author

lqd commented May 10, 2022

Yes indeed, I've noticed this as well. There are multiple steps of regressions over time, unfortunately, as well as multiple issues. And they affect diesel, nalgebra, bitmaps differently.

For the 16-column case above, it was easy-ish: given the 30x increase, we can look for the version that makes this query go over 1s. I've bisected the 1.56 regression this way to rust-lang/rust#89125, for example.
