-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write DataFusion paper for (SIGMOD / VLDB / ICDE) #6782
Comments
Some ideas about the paper: Thesis:We demonstrate it is possible to get DuckDB like performance using standards like Parquet and Arrow as the internal interchange format, both inside of and outside of the engine. Previously the conventional wisdom has been that such performance levels require a tightly integrated engine where the disk format, in memory layout, and processing engine are engineered in tandem to work well together. While the engineering effort required for such an engine is large, it is possible by leveraging the open source model and Apache governance model to poll resources amongst users. Given the availablity of fast, standards based, interoperable vectorized engines like DataFusion, we predict a Cambrian explosion of new analytic systems which would not have been possible before if they had to create their own engines Background(Section on Arrow) -- including in memory model, compute kernels, and Flight / FlightSQL (Section on Parquet) (Section on tokio -- aka not have to reinvent the scheduler) (DataFusion usecases, examples of systems built on DataFusion) ArchitectureInternally DataFusion uses Arrow as the interchange between operators, though internally different, non standard formats are used (such as the Arrow Row Format) Describe planning at a high level List extension points + briefly describe how users can extend the engine (and how that is desiged for speed)
List briefly the features of the engine
Performance(TPCH / ClickBench results) Parquet / pushdown results Talk about bespoke file format vs only using Parquet Related workVelox (focuses on the execution engine side) |
Ideas for experiments I would like to run Comparisons to existing systems:SQL basedClickbench (DataFusion + DuckDB on parquet) DataFrameSome sort of comparison with Pol.rs on aggregation Scalabilty performanceAlso it would be awesome to have charts showing DataFusion scaling:
Since I imagine there will be some tweaks given these numbers, having the run and chart generation automated would be very helpful |
Other material: |
I have begun work on this paper. I am planning to submit it to the SIGMOD industrial track https://2024.sigmod.org/calls_industrial_track_papers.shtml I am planning to use Overleaf (recommended by the ACM) rather than setup the latex stuff myself, as it worked well when I used it on another paper earlier this year, Here is the overleaf https://www.overleaf.com/read/qjhrxqhgksvr -- right now not much more than the template If you are interested in being an author and helping to write the paper, please let me know. I am especially looking for help help running benchmarks / generating performance numbers as described in #6782 (comment) |
Some initial thoughts for the paper organization after checking the current version:
After some discussion with @alamb, we agreed that we could start with this above structure first and adapt it as we proceed. |
I incorporated the above structure into the paper, but split the background part into two sections:
Section 2 mainly talks about DataFusion's foundation technologies/systems, and Section 3 keeps the material from the I also finished my first pass for section 2, |
Thank you @yjshen -- this is amazing ❤️ . I plan to take a pass through the architecture section later this morning, and see how far I get. I also hope to start working on some diagrams to emphasize some of the points in the paper. |
Here is my summary of the state of this project so far. Please correct me if I got it wrong: Current Author List:Goal:The goal is to submit an industrial track paper to SIGMOD 2024:
Proposed Timeline:End of August: Rough Draft complete Logistics
I will also add a todo / task list to the description of this ticket for us to coordinate on remaining open work |
What is a good way to provide feedback? Can we open a chat somewhere for more quick/detailed discussions?
|
I also suggest for small suggestions to use and review the "comment" functionality in Overleaf. |
BTW I plan to work on this paper over the weekend (Sat and Sun morning) |
This work sounds impressive! Could I get involved and contribute to it? |
Sorry I was not feeling well this weekend and did not have a chance to work on this paper. However, I did take a pass this morning. So thank you again @JayjeetAtGithub -- I just spent some serious time reviewing the performance scalability section and it was just what I had been hoping for (actually quite a bit better). I have some suggested actions:
BTW I think @JayjeetAtGithub also filed the following issues (that prevent us from running all the clickbench queries on the partitioned dataset): |
Yes, absolutely! Thank you @Weijun-H I think there are several major areas in need of work:
Are any of those areas interesting to you? |
Script to generate data, run experiments, and plot results (also plots) can be found here. |
Maybe another suggestion for improvement of the scalability graph: the instance shown has 176 virtual cores, can we update the graphs to show scaling to 176 cores? |
@Dandandan Sure. We will take another (probably a couple more) round of benchmarks after the issues in data fusion get fixed. For now, its kind of acting as a placeholder. |
Hi @alamb . |
Related work: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf |
I added a reference to this in the related work section Sorry for the delay here -- I was away. I took another pass through this document this morning, and I am hoping to have all the sections drafted by the end of this upcoming week. Then I believe we will be on to the 'editing / refining' phase. |
I took another pass through the paper this morning. I think it is nearing "complete rough draft" stage ❤️ and then we can move on to honing / refining the content One thing I think that would make it stronger is a few short case study descriptions of systems built using DataFusion (as concrete examples of what is possible). I can obviously write about IOx but I'll need help to properly describe others. I'll follow up in the next few days with other potential |
I will take some time this week again to contribute (last weeks were quite busy). I will take a look at the paper and see where I can contribute. |
Do you think https://github.com/ArroyoSystems/arroyo can be also added as a "as well as streaming SQL platforms |
I have done so. Thank you |
Ok, I have submitted a draft of the paper. 😅 🚀 We can still update the submission until tomorrow Thank you very much @ozankabak who has been copy editing away these last few days and everyone who has worked on this paper over the last several months SubmissionA snapshot of what I uploaded to the site is here: Conflicts@Dandandan @JayjeetAtGithub @viirya @sunchao @ozankabak and @yjshen and @ozankabak can you please let me know if you have a conflict of interest (definition) with anyone on the following list (from the CMT site):
|
No one from my side. Thanks for the excellent work! |
Thanks @alamb for driving this!
I see the definition of conflict of interest contains:
Since there are several people from Apple in the above list, which is what @viirya and me are working for, will that be considered as conflicts of interest? Although, we've never collaborated with these people within the company, and this is the first time I've heard their names. |
I was an employee of Meta at some point during the last five years but no CoI otherwise |
Yes I think this is covered by a separate "domain conflict" list:
I have added "meta.com" as another domain conflict just to be safe. |
None from my side |
I found one issue in the benchmarks. Running locally (query 10 of h2o benchmarks has a large output):
when using
As we can see in the results, duckdb is performing not so well on this query compared to DataFusion because of this: |
Yea, I only have the conflicts with the members from Apple. Otherwise, no conflicts. Thanks @alamb ! |
Thank you for taking a look. It'd be great if double checked all the config parameters and our usage of APIs on both sides to make sure we don't accidentally distort any measurement. |
I did review this quite extensively when working on the benchmarks and I believe they are defensible. The runner scripts are based on the (duckdb authored) scripts to run ClickBench (that use So while we can (and should) improve the scripts if the paper is accepted for publication, I recommend keeping the existing results that have
|
I have started collecting "things to do if the paper is accepted" in #8373 |
Ok, well the deadline is past so I am going to claim we have 'written' the paper and closing this ticket. We'll track the status of the paper in #8373 for anyone interested |
In addition to DuckDB and pola.rs, have you considered comparing Clickhouse? |
Hi @JasonLi-cn -- One comparison DataFusion and ClickHouse is ClickBench https://benchmark.clickhouse.com/ While similar, I think DataFusion and ClickBench are different enough to make comparing hard. ClickBench is a real database management, and unlike DataFusion it targets end-users rather than developers of other database systems. |
thanks~ |
In case someone is looking for it, here is the draft we submitted: |
congrats on the submission - sorry that i was not able to find time to contribute. hopefully the draft gets good reception! |
Our paper was accepted. See more details on #8373 (comment) |
Hi, it looks like the link posted on 1/9 is broken. Is there a place to find the finalized paper? Thanks |
@rohitrastogi -- We don't have the final draft of the paper yet (we will have it by the end of the month -- March 28). Here is the draft we submitted: |
Final paper draft can be found here: #8373 (comment) |
Final ACM link: https://dl.acm.org/doi/10.1145/3626246.3653368 |
UPDATE: Final paper: https://dl.acm.org/doi/10.1145/3626246.3653368
Task List for SIGMOD Paper:
Per #6782 (comment), here is a list of TODO items:
Issues Blocking Full Performance Results
Issues that would make the results more compelling
Is your feature request related to a problem or challenge?
I would like to increase awareness of DataFusion in the broader technical community. One way to build mindshare is to get a paper / talk published in a prestigious conference like VLDB or SIGMOD
Writing a paper is a good way to show the strength of the arrow/datafusion.
Through the papers, more teachers, students and researcher may be involved, and contribute to the project.
Describe the solution you'd like
I would like to write a paper that explains DataFusion
Thesis: "You don't need a tightly integrated execution system to get good performance"
These blogs have some good material in the introduction
https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/
https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/
Then we would compare and contrast the approaches of other tightly integrated systems like pola.rs and duckdb to DataFusion
We would then describe the architecture of DataFusion and its many extension points (DataFrame, functions, aggregates, window functions, sinks, etc)
Performance:
Show DataFusion in the same ballpark as DuckDB for aggregation, grouping, etc (e.g. TPCH)
We already have this for querying parquet
Describe alternatives you've considered
VLDB: https://vldb.org/2024/?call-for-industrial-track
SIGMOD: https://2024.sigmod.org/calls_papers_important_dates.shtml
Industrial track: https://2024.sigmod.org/comingsoon.shtml (TBD)
Research paper submission round 4 (All Deadlines are 11:59 PM Pacific Time)
October 15, 2023: Paper submission
November 26-28, 2023: Author feedback phase
December 20, 2023: Notification of accept/reject/review again
January 20, 2024: Revised paper submission
February 23, 2024: Final notification of accept/reject
ICDE:
Industrial Track: https://icde2024.github.io/CFP_industry.html
All deadlines below are 5 PM Pacific Time.
Paper submission: Monday, November 20, 2023
Notification of accept/reject: Wednesday, January 31, 2024
Camera-ready deadline: Thursday, March 28, 2024
Additional context
No response
The text was updated successfully, but these errors were encountered: