-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement output as CSV #54
base: main
Are you sure you want to change the base?
Conversation
331449b
to
2a05485
Compare
@@ -61,137 +68,323 @@ enum Table { | |||
LineItem, | |||
} | |||
|
|||
impl Table { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has many lines, but it is a lot of documentation and boilerplate
} | ||
|
||
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, ValueEnum)] | ||
enum OutputFormat { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an output format enum and then made the various generation functions methods on Cli
fn generate_part(cli: &Cli) -> io::Result<()> { | ||
let filename = "part.tbl"; | ||
let mut writer = new_table_writer(cli, filename)?; | ||
fn generate_nation(&self) -> io::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to leave the structure one function per table, which then dispatches to a specialized function for format
We could likely avoid the replication with macros / traits, but I think this way is pretty explicit (and the number of tables will never change)
} | ||
} | ||
|
||
// Separate functions for each table/output format combination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The replication of these functions are unfortunate, but I think they are straightforward and easy to understand and guarantees that each table/format gets specialized code
use core::fmt; | ||
use std::fmt::Display; | ||
|
||
/// Write [`Nation`]s in CSV format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To generate CSV, the idea is to make a new type zero copy wrapper for each table row that formats as CSV instead of tbl. It is fairly repetitive and about half the code is doc examples which double as unit test
Add ability to print tpch tables as csv files
This is a step towards parallel generation and parquet output
I also verified you can query these files from datafusion-cli. A teaser:
Details
you can also use
CREATE EXTERNAL TABLE
syntax:Performance
The time to make tbl format hasn't changed:
The time to make CSV format is about th esame:
Testing
The CSV generation is tested via doc examples (which now also run in CI after #55).
I did not add integration tests for tpchgen-cli