rolling functions #9

Closed
wants to merge 49 commits into from
Conversation


@jangorecki jangorecki commented Apr 24, 2023

Draft version of the rolling functions benchmark requested in #6.

We can define the scope for those tests here. Some tests are already there; others are only mentioned in comments.


PR roadmap:

  • test scripts
    • data.table
    • dplyr
    • pandas
    • duckdb-latest
    • spark
  • validate results
    • data.table
    • dplyr
    • pandas
    • duckdb-latest
    • spark
  • add rollfun to report
  • fix misplaced datafusion code in history.Rmd #29
  • run rollfun benchmark and confirm rollfun benchplot
  • branch installation version logging to logs.csv

optionally: test scripts and validate results:

rolling functions not available in:


jangorecki commented Jun 21, 2023

As there was no feedback on the scope I proposed in April, I made another step forward and defined it more precisely. Please have a look at the proposed questions. It is not easy to cover all the possible features well within 10 questions, so I ended up focusing on:

  • funs: mean, median (moving holistic aggregate), regression, UDF
  • scalability: window small or big (length of the input is not part of the questions, it is parameter of the script, so this scalability is always tested)
  • features: vectorized input, weighted fun, uneven time series, by.column=F
q1: rolling mean
q2: window small
q3: window big
q4: multi vars+cols
q5: median
q6: weighted
q7: uneven dense
q8: uneven sparse
q9: regression (by.column)
q10: UDF (one that will generally not be optimized, to mimic arbitrary UDF)
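For illustration, the first few questions can be sketched in pandas (the column name `v1`, the data size, and the window derivation are assumptions here, not the benchmark's actual parameters):

```python
import numpy as np
import pandas as pd

N = 1_000
w = N // 100  # window size; the benchmark derives it from nrow(x)

rng = np.random.default_rng(42)
df = pd.DataFrame({"v1": rng.random(N)})

# q1: rolling mean over a fixed-size window (optimized, streaming aggregate)
q1 = df["v1"].rolling(w).mean()

# q5: rolling median (a holistic aggregate: needs all values in the window)
q5 = df["v1"].rolling(w).median()

# q6: weighted rolling mean; linear weights are one arbitrary choice
weights = np.arange(1, w + 1, dtype=float)
weights /= weights.sum()
q6 = df["v1"].rolling(w).apply(lambda x: np.dot(x, weights), raw=True)
```

The first `w - 1` entries are NaN, which mirrors the incomplete-window handling discussed later in the thread.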

Looking forward to feedback on the scope for that test, or implementations in other software.
Note that we may be forced to move from N {1e6, 1e7, 1e8} to N {1e5, 1e6, 1e7} depending on how well other tools scale.

@jangorecki jangorecki marked this pull request as ready for review June 22, 2023 10:51

jangorecki commented Jun 24, 2023

  • I pushed the implementation for dplyr (using slider) as well, so at the moment we have data.table and dplyr.
  • All the other scripts and control csv files are amended for the new task, so it is now nicely runnable just by calling ./run.sh.
  • I commented out q10 "UDF" for now. It is not at all comparable with the other questions, being 1,000-10,000× slower than the rest. We either have to figure out some very lightweight UDF, or just drop UDF from the scope.
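To illustrate the gap (this is not the benchmark's own code; the size and window here are illustrative only): in pandas, a Python-level UDF passed to `.rolling().apply` computes the same rolling mean far slower than the optimized built-in, because the UDF path re-invokes the callable once per window instead of streaming the aggregate.

```python
import time
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(0).random(20_000))
w = 100

t0 = time.perf_counter()
fast = s.rolling(w).mean()  # optimized built-in implementation
t_builtin = time.perf_counter() - t0

t0 = time.perf_counter()
slow = s.rolling(w).apply(np.mean, raw=True)  # generic per-window UDF path
t_udf = time.perf_counter() - t0

# Identical answers, very different cost.
assert np.allclose(fast.dropna(), slow.dropna())
```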


jangorecki commented Jun 25, 2023

Instead of q10 UDF I would propose either:

  • rolling min - it is still quite a different implementation than mean/sum
  • a second holistic aggregate - but I don't see much value added here. Looking at https://duckdb.org/2021/11/12/moving-holistic.html: mode is not useful for our float64 measure variable (we could add a low/medium cardinality int variable just for a mode question), median is already in this benchmark, and quantile is not much different from median (median is just a special case of quantile).
  • ??

Of the two options above, I am in favor of min.

Posting here before doing the change as I hope there may be some other ideas.

edit: amended in 045b7d5
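Rolling min indeed needs a different algorithm than mean/sum: it cannot be maintained as a running total and is typically computed with a monotonic deque. A minimal Python sketch (not the benchmark's code):

```python
from collections import deque

def rolling_min(x, w):
    """O(n) rolling minimum using a monotonic deque of candidate indices."""
    out = [None] * len(x)  # None where the window is incomplete
    dq = deque()           # indices of increasing values; front is current min
    for i, v in enumerate(x):
        while dq and x[dq[-1]] >= v:
            dq.pop()       # drop candidates dominated by the new value
        dq.append(i)
        if dq[0] <= i - w:
            dq.popleft()   # front slid out of the window
        if i >= w - 1:
            out[i] = x[dq[0]]
    return out
```

A rolling sum/mean, by contrast, only needs a constant-time update per step, which is why the two end up with different implementations.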


jangorecki commented Jun 25, 2023

Other topics that are subject to community review are:

  1. data sizes, either
  • 1e5, 1e6, 1e7 (could be desired to run slower tools)
  • 1e6, 1e7, 1e8 (currently implemented, feels fine)
  • 1e7, 1e8, 1e9 (if we drop UDF in q10 this should be doable)
  • 1e5, 1e7, 1e9 (mix of all)
  2. window size
w = nrow(x)/1e3       ## used in 8 out of 10 questions
wsmall = nrow(x)/1e4  ## used q2
wbig = nrow(x)/1e2    ## used q3

In case of 1e9 data size, window size would be 1e6, which feels unrealistically big. I feel we could improve window sizes.

  3. unevenly ordered series (q7, q8; or in recent HEAD q8, q9)
DT[["id2"]] = sort(sample(N*1.1, N))         ## index dense
DT[["id3"]] = sort(sample(N*2, N))           ## index sparse

The dense index is a 110% range of nrow; the sparse index is a 200% range of nrow.
I am not sure we are stressing the sparse scenario well enough. It could be even a 1000% range of nrow. The problem then is that we cannot easily use the 1e9 data size, because the index would be in the range 1 to 1e10 and many tools would be excluded (maybe that's fine?). Using a 200% range of nrow still fits into the int32 type.
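A Python mirror of the index generation above, together with one way to compute a rolling mean over such an uneven index by value range (via prefix sums and binary search; the names, sizes, and window width are assumptions for illustration):

```python
import numpy as np

N = 10_000
rng = np.random.default_rng(1)

# dense: 110% range of nrow; sparse: 200% range of nrow (as in the R snippet)
id2 = np.sort(rng.choice(int(N * 1.1), size=N, replace=False))
id3 = np.sort(rng.choice(N * 2, size=N, replace=False))

# rolling mean of v over windows of index width w ending at each id3 value,
# i.e. all rows whose index falls in [id3[i] - w + 1, id3[i]]
v = rng.random(N)
w = 100
lo = np.searchsorted(id3, id3 - w + 1, side="left")
csum = np.concatenate(([0.0], np.cumsum(v)))
roll_mean = (csum[1:] - csum[lo]) / (np.arange(1, N + 1) - lo)
```

With a sparser index, each value-range window contains fewer rows, which is exactly the axis that the uneven-series questions stress.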

@AdrianAntico

@jangorecki you have a solid set of measures, the only other type I'd consider would be differencing


era127 commented Jun 26, 2023

I think it would be helpful if the dplyr test were documented as a representation of the slider package, as opposed to using RcppRoll. The dplyr benchmark has been confusing to me because you can use dplyr with duckdb or data.table. For data.table, it is more obvious that you are benchmarking the rolling functions inside the data.table package.

@jangorecki

@AdrianAntico that is an interesting idea, but from a brief look at a potential implementation, it doesn't seem to stress windowing computation (use of diff in R). If you could provide an example of code, it would be clearer.

@rdavis120 maintaining new solutions is a relatively high cost; putting slider under dplyr was an easy way to avoid that. Rather than adding slider as a new solution, I would prefer to rename dplyr to tidyverse so it will fit well and there will be no need for adding another solution. Anyway, I would prefer to keep this issue discussion around the rolling task scope rather than naming details.

@AdrianAntico

@jangorecki the intent was more time series related (and cross-row related). I have a diff function in my github package "Rodeo" if you want a full example, starting at line 443... https://github.com/AdrianAntico/Rodeo/blob/main/R/FeatureEngineering_CrossRowOperations.R

@jangorecki

@Tmonster could we get CI workflow approval?

@jangorecki

@Tmonster Is there a way that I could approve CI runs? Does it run on duckdblabs private runners? If not, then I don't think there should be any concerns.

I added the pandas rollfun script, and (hopefully) fixed the failure in the previous GH Actions job.


jangorecki commented Jul 31, 2023

duckdb, spark, and pandas (q8 and q9 only) do not have an option for properly handling an incomplete rolling window. Timings for those solutions will not include the postprocessing required to match exactly the same result (NULL vs a value from an unexpected window size), as the overhead would be too big.
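The postprocessing in question could look like this sketch (all names and values here are hypothetical, not taken from the benchmark scripts): mask any output whose window could not have been complete, so it compares equal to solutions that return NULL there.

```python
import numpy as np

# Hypothetical uneven index, window width, and rolling output to normalize.
idx = np.array([1, 3, 4, 8, 9, 12])
roll = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])  # placeholder rolling values
w = 4

# A value-range window ending at idx[i] is complete only if it can extend
# w units back without running past the start of the series.
complete = idx - w + 1 >= idx[0]
roll_masked = np.where(complete, roll, np.nan)
```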


jangorecki commented Jul 31, 2023

Development is possibly finished on this branch. The 5 solutions added so far have been validated using https://github.com/jangorecki/db-benchmark/blob/rollfun/_utils/rollfun-ans-validation.txt

@bkamins if you would like to add Julia, you are welcome; please use the commands from the file linked above to validate answers against one of the solutions.

Once I confirm the report is produced correctly (after running the whole rollfun benchmark), the PR will be ready to merge.


jangorecki commented Aug 12, 2023

@Tmonster PR is ready to merge

To reproduce

# install R and python

git clone https://github.com/jangorecki/db-benchmark --branch rollfun --single-branch --depth 1
cd db-benchmark
# install solutions interactively
./dplyr/setup-dplyr.sh
./datatable/setup-datatable.sh
./pandas/setup-pandas.sh
./duckdb-latest/setup-duckdb-latest.sh
./spark/setup-spark.sh

# prepare data
Rscript _data/rollfun-datagen.R 1e6 NA 0 1
Rscript _data/rollfun-datagen.R 1e7 NA 0 1
Rscript _data/rollfun-datagen.R 1e8 NA 0 1
mkdir data
mv R1_1e*_NA_0_1.csv data

vim run.conf
# do_upgrade false
# force run true
# report false
# publish false
# run_task rollfun
# run_solution data.table dplyr pandas duckdb-latest spark

sudo swapoff -a

# workaround for #30: `=~` matching in run.sh causes "duckdb-latest" to match "duckdb"
vim run.sh
# comment out line 76 in run.sh: if [[ "$RUN_SOLUTIONS" =~ "duckdb" ]]; then ./duckdb/ver-duckdb.sh; fi; 

./run.sh > ./run.out

@jangorecki

@Tmonster any idea if this PR can make it to master before September's scheduled run?


jangorecki commented Oct 5, 2024

@Tmonster I recall getting an update on this but cannot find it here or in the related issue. Could you please provide the status?

I was about to suggest pushing this forward, considering that duckdb advertised window function performance improvements in the duckdb 1.1 release: https://duckdb.org/2024/09/09/announcing-duckdb-110.html
As far as I know, at least a few techniques were implemented to improve its performance, so now that they are in, it seems to be a good moment to include this new test. It will not only show the current state but also help to notice performance regressions.

Window functions see a lot of use in DuckDB, which is why we are continuously improving performance of executing Window functions over large datasets.


Tmonster commented Oct 9, 2024

Hi @jangorecki ,

I can’t seem to find our previous discussion either, for some reason. As for this PR, it has been our decision at DuckDB Labs to maintain the benchmark suite “as-is” when it comes to the tested functionality. The worry is that if we start expanding the questions, the benchmark will develop into a “DuckDB Labs” benchmark, which is something we would like to prevent. You are welcome to fork and extend it, however.

@Tmonster Tmonster closed this Oct 9, 2024
@AdrianAntico

@Tmonster is this because DuckDB doesn't do well on these operations? Shouldn't this be a reason to motivate your team to improve the performance of these query types, rather than hiding the performance issue?

[screenshot of benchmark results]

@jangorecki

@AdrianAntico I think there is no way for duckdb to be competitive in this kind of operation (which depends on the order of data), because it does not have a concept of physical order of data (a clustered index). I believe it can be best in class among such tools, although it won't really compete with tools that have a clustered index. Therefore it may not be an ideal solution for time series, unless a clustered index makes its way into duckdb, but that would probably require a change of data formats.

@AdrianAntico

@jangorecki it's such a common operation (windowing) that it's hard to consider DuckDB a feasible option for everyday use without it. How was this overlooked from inception?

@jangorecki

It is not really a question for me to answer, but I think duckdb aligned to existing SQL RDBMSes in that regard, which by default do not maintain order. Many of them do have a clustered index, but OLTP databases don't really need one; it is OLAP that has much more to gain from it. Considering duckdb is OLAP, this feels like a good FR for duckdb.

@Mytherin

@Tmonster is this because DuckDB doesn't do well on these operations? Shouldn't this be a reason to motivate your team to improve the performance of these query types, rather than hiding the performance issue?

As explained above - the original reason we decided to host the h2oai db-benchmark is because the original hosted version was no longer being updated, and had very outdated results using old versions of DuckDB and other tools. The goal of this project is to resurrect and maintain the db-benchmark as-is, but re-running it with new versions of different tools.

Extending and changing the benchmark means we are no longer running the h2oai db-benchmark, but rather a DuckDB Labs customized version of the db-benchmark. This then comes with a lot of added work and drama - which benchmarks do you add, which do you not - together with potential accusations. What if we add a benchmark for which we perform better than system X? What if we don't add a benchmark where we perform worse than system X?

As such, to avoid this drama, we have opted to only preserve the original benchmark rather than extend it.

We are not trying to hide any performance issues. You are free to run any benchmarks of your own choosing using DuckDB. Our response to benchmarks has always been to take them and use them to improve the system. We have actually recently greatly improved our windowing lag performance in DuckDB v1.1.

@AdrianAntico I think there is no way for duckdb to be competitive in this kind of operation (which depends on the order of data), because it does not have a concept of physical order of data (a clustered index). I believe it can be best in class among such tools, although it won't really compete with tools that have a clustered index. Therefore it may not be an ideal solution for time series, unless a clustered index makes its way into duckdb, but that would probably require a change of data formats.

This is not entirely true - and we are working on adding further optimizations that take better advantage of natural order in the data. The reason this hasn't been done in the past is that it is a lot harder to do so when working in a streaming engine that operates on larger-than-memory data, versus when working with in-memory arrays. In addition, the Window operator in SQL is a lot more extensive and complex than what is generally provided in DataFrame libraries.


jangorecki commented Oct 11, 2024

Very happy to hear about the incoming optimizations. It is actually great news to be able to run window functions on out-of-memory data!
Is there an issue for that, so we can subscribe and be notified when it is resolved?
