Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write DataFusion paper for (SIGMOD / VLDB / ICDE) #6782

Closed
18 tasks done
alamb opened this issue Jun 27, 2023 · 124 comments
Closed
18 tasks done

Write DataFusion paper for (SIGMOD / VLDB / ICDE) #6782

alamb opened this issue Jun 27, 2023 · 124 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 27, 2023

UPDATE: Final paper: https://dl.acm.org/doi/10.1145/3626246.3653368

Task List for SIGMOD Paper:

Per #6782 (comment), here is a list of TODO items:

  • Initial draft of Introduction
  • Initial draft of Background
  • Initial draft of Design @Dandandan adding info about join
  • Initial draft of Optimizations @Dandandan initial work on adding some contents
  • Initial draft of Extensibility
  • Initial draft of Related Work
  • Initial draft of Conclusion
  • Create Intro diagram ("teaser")
  • Create Architecture diagram
  • Take a first pass to add references
  • Create CMT account on https://cmt3.research.microsoft.com/SIGMODIndustry2024
  • Create Intro diagram ("teaser")
  • Iterate / Polish

Issues Blocking Full Performance Results

Issues that would make the results more compelling

  • (TO FILE or Find): Ensure performance doesn't degrade with larger core counts on some queries

Is your feature request related to a problem or challenge?

I would like to increase awareness of DataFusion in the broader technical community. One way to build mindshare is to get a paper / talk published in a prestigious conference like VLDB or SIGMOD

Writing a paper is a good way to show the strength of the arrow/datafusion.
Through the papers, more teachers, students and researcher may be involved, and contribute to the project.

Describe the solution you'd like

I would like to write a paper that explains DataFusion
Thesis: "You don't need a tightly integrated execution system to get good performance"

These blogs have some good material in the introduction
https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/
https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/

Then we would compare and contrast the approaches of other tightly integrated systems like pola.rs and duckdb to DataFusion

We would then describe the architecture of DataFusion and its many extension points (DataFrame, functions, aggregates, window functions, sinks, etc)

Performance:
Show DataFusion in the same ballpark as DuckDB for aggregation, grouping, etc (e.g. TPCH)
We already have this for querying parquet

Describe alternatives you've considered

VLDB: https://vldb.org/2024/?call-for-industrial-track

Submissions open December 6, 2023
Short abstracts deadline February 16, 2024
Full papers or extended abstracts deadline March 1, 2024
Notifications May 8, 2024
Camera-ready June 15, 2024

SIGMOD: https://2024.sigmod.org/calls_papers_important_dates.shtml

Industrial track: https://2024.sigmod.org/comingsoon.shtml (TBD)

Research paper submission round 4 (All Deadlines are 11:59 PM Pacific Time)

October 15, 2023: Paper submission
November 26-28, 2023: Author feedback phase
December 20, 2023: Notification of accept/reject/review again
January 20, 2024: Revised paper submission
February 23, 2024: Final notification of accept/reject

ICDE:

Industrial Track: https://icde2024.github.io/CFP_industry.html

All deadlines below are 5 PM Pacific Time.

Paper submission: Monday, November 20, 2023
Notification of accept/reject: Wednesday, January 31, 2024
Camera-ready deadline: Thursday, March 28, 2024

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 27, 2023
@alamb alamb self-assigned this Jun 27, 2023
@alamb
Copy link
Contributor Author

alamb commented Jul 11, 2023

Some ideas about the paper:

Thesis:

We demonstrate it is possible to get DuckDB like performance using standards like Parquet and Arrow as the internal interchange format, both inside of and outside of the engine. Previously the conventional wisdom has been that such performance levels require a tightly integrated engine where the disk format, in memory layout, and processing engine are engineered in tandem to work well together.

While the engineering effort required for such an engine is large, it is possible by leveraging the open source model and Apache governance model to poll resources amongst users. Given the availablity of fast, standards based, interoperable vectorized engines like DataFusion, we predict a Cambrian explosion of new analytic systems which would not have been possible before if they had to create their own engines

Background

(Section on Arrow) -- including in memory model, compute kernels, and Flight / FlightSQL

(Section on Parquet)

(Section on tokio -- aka not have to reinvent the scheduler)

(DataFusion usecases, examples of systems built on DataFusion)

Architecture

Internally DataFusion uses Arrow as the interchange between operators, though internally different, non standard formats are used (such as the Arrow Row Format)

Describe planning at a high level

List extension points + briefly describe how users can extend the engine (and how that is desiged for speed)

  • UDF (everyone does this)
  • UDAF (everyone does this)
  • UDWF
  • Optimizer Rules
  • Custom Operators
  • Own catalog
  • Own partition pruning / elimination rules
  • Provide information on sortedness

List briefly the features of the engine

  • parquet / predicate pushdown with late materialization
  • use of tokio scheduler and the benfits of that model (has good NUMA performance given how tasks are scheduled). Also note that use of morsel driven parallelism doesn't solve the typical bottneck on priorities -- memory
  • group hash aggregates -- templated code
  • Joins (TBD)
  • Sophisticated sort based optimizations (including pre-sorted data) streaming group by, merge-join,
  • "Partition Pruning" (Pruning predicate) for custom catalog support

Performance

(TPCH / ClickBench results)

Parquet / pushdown results

Talk about bespoke file format vs only using Parquet

Related work

Velox (focuses on the execution engine side)
Apache Calcite (focuses on sql and frontend)
DataFusion has all the pieces of the toolkit (sql frontend, logical plan, and execution plans)

@alamb
Copy link
Contributor Author

alamb commented Jul 11, 2023

Ideas for experiments I would like to run

Comparisons to existing systems:

SQL based

Clickbench (DataFusion + DuckDB on parquet)
TPCH (DataFusion + DuckDB on parquet)
h2o.ai (DataFusion + DuckDB in parquet)

DataFrame

Some sort of comparison with Pol.rs on aggregation
Some sort of comparison with pandas

Scalabilty performance

Also it would be awesome to have charts showing DataFusion scaling:

  1. Performance as number of cores increases (ideally this would be linear)
  2. Memory Usage vs input data size

Since I imagine there will be some tweaks given these numbers, having the run and chart generation automated would be very helpful

@alamb alamb added the devrel label Jul 24, 2023
@yjshen yjshen pinned this issue Aug 4, 2023
@yjshen yjshen unpinned this issue Aug 4, 2023
@alamb alamb changed the title Write DataFusion paper for (SIGMOD / VLDB) Write DataFusion paper for (SIGMOD / VLDB / ICDE) Aug 6, 2023
@alamb
Copy link
Contributor Author

alamb commented Aug 6, 2023

@alamb
Copy link
Contributor Author

alamb commented Aug 6, 2023

I have begun work on this paper. I am planning to submit it to the SIGMOD industrial track https://2024.sigmod.org/calls_industrial_track_papers.shtml

I am planning to use Overleaf (recommended by the ACM) rather than setup the latex stuff myself, as it worked well when I used it on another paper earlier this year,

Here is the overleaf https://www.overleaf.com/read/qjhrxqhgksvr -- right now not much more than the template

If you are interested in being an author and helping to write the paper, please let me know. I am especially looking for help help running benchmarks / generating performance numbers as described in #6782 (comment)

@yjshen
Copy link
Member

yjshen commented Aug 14, 2023

Some initial thoughts for the paper organization after checking the current version:

1. Introduction (1 page)
	- The deconstructed system(against conventional wisdom), the Apache way
	- technical methodology: Vectorized execution / Columnar input/output/intermediate / Pull mode with Morsel benefits

2. Background (1-1.5 pages)
	- The mature of ecosystem, columnar format parquet, in-mem format arrow, high performance thread scheduler tokio, and the surge of generalized relational algebra substrait
	- The needs for a deconstructured engine (high performance, and highly customizable)
		- an open engine for high performance (influxdb/ streamingdb/distrubuted olap)
		- an standard, extensible planner for resue (dask-sql,)
		- an general purpose parser sqlparser-rs (for TP, AP and streaming)

3. Design (3-4 pages)
	- Engine overview
	- execution model and scheduler
	- sort and the row model
	- aggregate and the aggregator model

4. Optimizations (1-1.5 pages)
	- Late materialization and predicate pushdown
	- Predicate prunning

5. Extensibilities (1 page)
	- UDF / UDAF
	- data source
	- catalog
	- parser?
	- ?

6. Benchmarks (2-3 pages)
	- micro-bench?
		- for sort?
		- for agg?
		- for scheduler?
	- e2e-bench
		- tpc-h?
		- clickbench?
		- ?

7. Related work (0.75 page)
	- Velox
	- DuckDB
	- Spark Photon

8. Conclusion (0.25 page)
	- the open source, the Apache way we believe
	- always pursing for performance start from real world problems

After some discussion with @alamb, we agreed that we could start with this above structure first and adapt it as we proceed.

@yjshen
Copy link
Member

yjshen commented Aug 14, 2023

I incorporated the above structure into the paper, but split the background part into two sections:

  • 2. Foundational Ecosystem
  • 3. Deconstructing the Database,

Section 2 mainly talks about DataFusion's foundation technologies/systems, and Section 3 keeps the material from the analytical database that @alamb has previously shared the philosophy and vision of DataFusion.

I also finished my first pass for section 2, Foundational Ecosystem.

@alamb
Copy link
Contributor Author

alamb commented Aug 14, 2023

Thank you @yjshen -- this is amazing ❤️ . I plan to take a pass through the architecture section later this morning, and see how far I get. I also hope to start working on some diagrams to emphasize some of the points in the paper.

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2023

Here is my summary of the state of this project so far. Please correct me if I got it wrong:

Current Author List:

Goal:

The goal is to submit an industrial track paper to SIGMOD 2024:

"The Industrial Track of SIGMOD 2024 covers all aspects of innovative commercial or industrial-strength data management systems and solutions. We also welcome novel applications of data management systems and experience in applying recent research advances to problems relevant to the database industry."

  • Paper submission: Monday, November 20, 2023
  • Notification of accept/reject: Wednesday, January 31, 2024
  • Camera-ready deadline: Thursday, March 28, 2024

Proposed Timeline:

End of August: Rough Draft complete
Sep / Oct: Improve results, hone language
Nov: Submit paper

Logistics

  1. We will use overleaf to write the paper. Anyone should be able to see the current state of the paper here https://www.overleaf.com/read/qjhrxqhgksvr and the authors have write access
  2. We will coordinate / communicate in the open on this ticket (per the Apache Way)
  3. We will use the google drawing / PDF export for diagrams: Google Drive folder

I will also add a todo / task list to the description of this ticket for us to coordinate on remaining open work

@Dandandan
Copy link
Contributor

Dandandan commented Aug 16, 2023

What is a good way to provide feedback? Can we open a chat somewhere for more quick/detailed discussions?

  • @JayjeetAtGithub The scalability graph look be more interpretable if it's presented in log-scale or exponential with the speedup shown as 1x,2x,4x,8x, 16x etc.

@Dandandan
Copy link
Contributor

I also suggest for small suggestions to use and review the "comment" functionality in Overleaf.

@alamb
Copy link
Contributor Author

alamb commented Aug 18, 2023

BTW I plan to work on this paper over the weekend (Sat and Sun morning)

@Weijun-H
Copy link
Member

If you are interested in being an author and helping to write the paper, please let me know. I am especially looking for help help running benchmarks / generating performance numbers as described in #6782 (comment)

This work sounds impressive! Could I get involved and contribute to it?

@alamb
Copy link
Contributor Author

alamb commented Aug 21, 2023

Sorry I was not feeling well this weekend and did not have a chance to work on this paper. However, I did take a pass this morning.

So thank you again @JayjeetAtGithub -- I just spent some serious time reviewing the performance scalability section and it was just what I had been hoping for (actually quite a bit better). I have some suggested actions:

  1. Can you post the scripts you used to generate the data (and the plots) somewhere (that we can add to the paper as well as potentially also review)?
  2. Would it be possible to change the charts so both axes are log/log (right now the x axis, is log scale, but the y axis is linear which I think might be misleading)

BTW I think @JayjeetAtGithub also filed the following issues (that prevent us from running all the clickbench queries on the partitioned dataset):
#7342
#7343
#7344
#7345

@alamb
Copy link
Contributor Author

alamb commented Aug 21, 2023

This work sounds impressive! Could I get involved and contribute to it?

Yes, absolutely! Thank you @Weijun-H

I think there are several major areas in need of work:

  1. The writing (but @yjshen @Dandandan myself and @JayjeetAtGithub are making good progress)
  2. Fixing bugs that prevent full performance results (I stared listing them above)
  3. Improving parts of DataFusion that would make the results / description in the paper better (e.g. investigating why the performance slows down in some cases for high core counts).

Are any of those areas interesting to you?

@JayjeetAtGithub
Copy link
Contributor

Script to generate data, run experiments, and plot results (also plots) can be found here.

@Dandandan
Copy link
Contributor

Dandandan commented Aug 21, 2023

Maybe another suggestion for improvement of the scalability graph: the instance shown has 176 virtual cores, can we update the graphs to show scaling to 176 cores?

@JayjeetAtGithub
Copy link
Contributor

@Dandandan Sure. We will take another (probably a couple more) round of benchmarks after the issues in data fusion get fixed. For now, its kind of acting as a placeholder.

@Weijun-H
Copy link
Member

I think there are several major areas in need of work:

  1. The writing (but @yjshen @Dandandan myself and @JayjeetAtGithub are making good progress)
  2. Fixing bugs that prevent full performance results (I stared listing them above)
  3. Improving parts of DataFusion that would make the results / description in the paper better (e.g. investigating why the performance slows down in some cases for high core counts).

Are any of those areas interesting to you?

Hi @alamb .
Task three seems interesting and challenging. I'd like to take it. Additionally, I can refine and proofread your writing and fix bugs in my spare time.

@JayjeetAtGithub
Copy link
Contributor

@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2023

Related work: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf

I added a reference to this in the related work section

Sorry for the delay here -- I was away. I took another pass through this document this morning, and I am hoping to have all the sections drafted by the end of this upcoming week. Then I believe we will be on to the 'editing / refining' phase.

@alamb
Copy link
Contributor Author

alamb commented Sep 10, 2023

I took another pass through the paper this morning. I think it is nearing "complete rough draft" stage ❤️ and then we can move on to honing / refining the content

One thing I think that would make it stronger is a few short case study descriptions of systems built using DataFusion (as concrete examples of what is possible). I can obviously write about IOx but I'll need help to properly describe others. I'll follow up in the next few days with other potential

@Dandandan
Copy link
Contributor

I will take some time this week again to contribute (last weeks were quite busy). I will take a look at the paper and see where I can contribute.
Let me know if I can help in some specific area as well.

@edmondop
Copy link
Contributor

Do you think https://github.com/ArroyoSystems/arroyo can be also added as a "as well as streaming SQL platforms
(e.g. Synnada[69], Arroyo[xyz])"?

@alamb
Copy link
Contributor Author

alamb commented Nov 29, 2023

Do you think https://github.com/ArroyoSystems/arroyo can be also added as a "as well as streaming SQL platforms (e.g. Synnada[69], Arroyo[xyz])"?

I have done so. Thank you

@alamb
Copy link
Contributor Author

alamb commented Nov 29, 2023

Ok, I have submitted a draft of the paper. 😅 🚀 We can still update the submission until tomorrow

Thank you very much @ozankabak who has been copy editing away these last few days and everyone who has worked on this paper over the last several months

Submission

A snapshot of what I uploaded to the site is here:

Conflicts

@Dandandan @JayjeetAtGithub @viirya @sunchao @ozankabak and @yjshen and @ozankabak can you please let me know if you have a conflict of interest (definition) with anyone on the following list (from the CMT site):

Adam Prout Singlestore No   Edit
cheng chen bytedance No   Edit
Daniel Ritter SAP No   Edit
Douglas Burdick Adobe No   Edit
Feifei Li Alibaba Group No   Edit
Ioannis Papapanagiotou Gemini No   Edit
Iraklis Psaroudakis ORACLE No   Edit
Jesús Camacho-Rodríguez Microsoft No   Edit
Jianmin Wang "Tsinghua University, China" No   Edit
Jingren Zhou Alibaba Group No   Edit
Konstantinos Karanasos Konstantinos Karanasos No   Edit
Krishna Kantikiran Pasupuleti Oracle No   Edit
Lyublena Antova Meta No   Edit
Matteo Interlandi Microsoft No   Edit
Mohamed Soliman Apple No   Edit
Norman May SAP SE No   Edit
Onur Kocberber Oracle No   Edit
Pedro Pedreira Meta Platforms Inc. No   Edit
Rebecca Taft Cockroach Labs No   Edit
Richard Sidle IBM Research No   Edit
Russell Sears Apple No   Edit
Shimin Chen Chinese Academy of Sciences No   Edit
Stavros Papadopoulos Intel Labs & MIT No   Edit
Supun Nakandala Databricks Inc No   Edit
Themis Melissaris Snowflake Inc. No
Theodoros Rekatsinas Apple No   Edit
Umar Farooq Minhas Apple No

@yjshen
Copy link
Member

yjshen commented Nov 29, 2023

No one from my side. Thanks for the excellent work!

@sunchao
Copy link
Member

sunchao commented Nov 29, 2023

Thanks @alamb for driving this!

can you please let me know if you have a conflict of interest

I see the definition of conflict of interest contains:

The PC member has been a co-worker in the same company or university within the past five years

Since there are several people from Apple in the above list, which is what @viirya and me are working for, will that be considered as conflicts of interest? Although, we've never collaborated with these people within the company, and this is the first time I've heard their names.

@ozankabak
Copy link
Contributor

I was an employee of Meta at some point during the last five years but no CoI otherwise

@alamb
Copy link
Contributor Author

alamb commented Nov 29, 2023

Since there are several people from Apple in the above list, which is what @viirya and me are working for, will that be considered as conflicts of interest? Although, we've never collaborated with these people within the company, and this is the first time I've heard their names.

Yes I think this is covered by a separate "domain conflict" list:

Screenshot 2023-11-29 at 3 50 33 PM

I was an employee of Meta at some point during the last five years but no CoI otherwise

I have added "meta.com" as another domain conflict just to be safe.

@Dandandan
Copy link
Contributor

conflicts

None from my side

@Dandandan
Copy link
Contributor

Dandandan commented Nov 29, 2023

I found one issue in the benchmarks.
We're using fetchall to retrieve the results for duckdb rather than something more optimized like fetch_arrow_table. This is not a problem when the output is small, but for large outputs it penalizes duckdb as it needs to convert each individual row to python objects (rather than e.g. doing it per batch or keeping most of it outside of Python).

Running locally (query 10 of h2o benchmarks has a large output):

qnum: 10
SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM h2o GROUP BY id1, id2, id3, id4, id5, id6;

8.523120959027437
9.530327208980452
9.380471624986967
9.175977834005607
9.123251708020689

when using fetch_arrow_table it becomes quite a bit faster:

qnum: 10
SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM h2o GROUP BY id1, id2, id3, id4, id5, id6;

3.766681333014276
3.5825100420042872
3.5103830420121085
3.5395747090224177
3.5533452079980634

As we can see in the results, duckdb is performing not so well on this query compared to DataFusion because of this:

image

@viirya
Copy link
Member

viirya commented Nov 30, 2023

Since there are several people from Apple in the above list, which is what @viirya and me are working for, will that be considered as conflicts of interest? Although, we've never collaborated with these people within the company, and this is the first time I've heard their names.

Yes I think this is covered by a separate "domain conflict" list:

Yea, I only have the conflicts with the members from Apple. Otherwise, no conflicts.

Thanks @alamb !

@ozankabak
Copy link
Contributor

I found one issue in the benchmarks.

Thank you for taking a look. It'd be great if double checked all the config parameters and our usage of APIs on both sides to make sure we don't accidentally distort any measurement.

@alamb
Copy link
Contributor Author

alamb commented Nov 30, 2023

I found one issue in the benchmarks.

Thank you for taking a look. It'd be great if double checked all the config parameters and our usage of APIs on both sides to make sure we don't accidentally distort any measurement.

I did review this quite extensively when working on the benchmarks and I believe they are defensible. The runner scripts are based on the (duckdb authored) scripts to run ClickBench (that use fetchall):

https://github.com/ClickHouse/ClickBench/blob/1a8ecca8808378da011a3050d648cb9dbd2a1d95/duckdb-parquet/query.py#L15

So while we can (and should) improve the scripts if the paper is accepted for publication, I recommend keeping the existing results that have fetchall for the draft because:

  1. we used the same methodology as experts in DuckDB
  2. the overall conclusion is the same, even if the some of the relative numbers are slightly different

@alamb
Copy link
Contributor Author

alamb commented Nov 30, 2023

I have started collecting "things to do if the paper is accepted" in #8373

@alamb
Copy link
Contributor Author

alamb commented Dec 1, 2023

Ok, well the deadline is past so I am going to claim we have 'written' the paper and closing this ticket. We'll track the status of the paper in #8373 for anyone interested

@alamb alamb closed this as completed Dec 1, 2023
@JasonLi-cn
Copy link
Contributor

In addition to DuckDB and pola.rs, have you considered comparing Clickhouse?

@alamb
Copy link
Contributor Author

alamb commented Dec 11, 2023

In addition to DuckDB and pola.rs, have you considered comparing Clickhouse?

Hi @JasonLi-cn -- One comparison DataFusion and ClickHouse is ClickBench https://benchmark.clickhouse.com/

While similar, I think DataFusion and ClickBench are different enough to make comparing hard. ClickBench is a real database management, and unlike DataFusion it targets end-users rather than developers of other database systems.

@JasonLi-cn
Copy link
Contributor

In addition to DuckDB and pola.rs, have you considered comparing Clickhouse?

Hi @JasonLi-cn -- One comparison DataFusion and ClickHouse is ClickBench https://benchmark.clickhouse.com/

While similar, I think DataFusion and ClickBench are different enough to make comparing hard. ClickBench is a real database management, and unlike DataFusion it targets end-users rather than developers of other database systems.

thanks~

@alamb
Copy link
Contributor Author

alamb commented Jan 9, 2024

In case someone is looking for it, here is the draft we submitted:
DataFusion_Query_Engine___SIGMOD_2024.pdf

@jimexist
Copy link
Member

congrats on the submission - sorry that i was not able to find time to contribute. hopefully the draft gets good reception!

@alamb
Copy link
Contributor Author

alamb commented Feb 4, 2024

Our paper was accepted. See more details on #8373 (comment)

@rohitrastogi
Copy link
Contributor

Hi, it looks like the link posted on 1/9 is broken. Is there a place to find the finalized paper? Thanks

@alamb
Copy link
Contributor Author

alamb commented Mar 13, 2024

Hi, it looks like the link posted on 1/9 is broken. Is there a place to find the finalized paper? Thanks

@rohitrastogi -- We don't have the final draft of the paper yet (we will have it by the end of the month -- March 28). Here is the draft we submitted:
DataFusion_Query_Engine___SIGMOD_2024 (8).pdf

@alamb
Copy link
Contributor Author

alamb commented Mar 31, 2024

Final paper draft can be found here: #8373 (comment)

@alamb
Copy link
Contributor Author

alamb commented Jul 9, 2024

Final ACM link: https://dl.acm.org/doi/10.1145/3626246.3653368

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests