Implement output as CSV #54

@alamb alamb commented Mar 21, 2025

Add the ability to write TPC-H tables as CSV files.

This is a step towards parallel generation and Parquet output.

I also verified that you can query these files from datafusion-cli. A teaser:

> select * from '/tmp/tpchdbgen-rs/lineitem.csv' limit 10;
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+-------------------------------------+
| l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct    | l_shipmode | l_comment                           |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+-------------------------------------+
| 1          | 15519     | 785       | 1            | 17         | 24386.67        | 0.04       | 0.02  | N            | O            | 1996-03-13 | 1996-02-12   | 1996-03-22    | DELIVER IN PERSON | TRUCK      | egular courts above the             |
| 1          | 6731      | 732       | 2            | 36         | 58958.28        | 0.09       | 0.06  | N            | O            | 1996-04-12 | 1996-02-28   | 1996-04-20    | TAKE BACK RETURN  | MAIL       | ly final dependencies: slyly bold   |
| 1          | 6370      | 371       | 3            | 8          | 10210.96        | 0.1        | 0.02  | N            | O            | 1996-01-29 | 1996-03-05   | 1996-01-31    | TAKE BACK RETURN  | REG AIR    | riously. regular, express dep       |
| 1          | 214       | 465       | 4            | 28         | 31197.88        | 0.09       | 0.06  | N            | O            | 1996-04-21 | 1996-03-30   | 1996-05-16    | NONE              | AIR        | lites. fluffily even de             |
| 1          | 2403      | 160       | 5            | 24         | 31329.6         | 0.1        | 0.04  | N            | O            | 1996-03-30 | 1996-03-14   | 1996-04-01    | NONE              | FOB        |  pending foxes. slyly re            |
| 1          | 1564      | 67        | 6            | 32         | 46897.92        | 0.07       | 0.02  | N            | O            | 1996-01-30 | 1996-02-07   | 1996-02-03    | DELIVER IN PERSON | MAIL       | arefully slyly ex                   |
| 2          | 10617     | 138       | 1            | 38         | 58049.18        | 0.0        | 0.05  | N            | O            | 1997-01-28 | 1997-01-14   | 1997-02-02    | TAKE BACK RETURN  | RAIL       | ven requests. deposits breach a     |
| 3          | 430       | 181       | 1            | 45         | 59869.35        | 0.06       | 0.0   | R            | F            | 1994-02-02 | 1994-01-04   | 1994-02-23    | NONE              | AIR        | ongside of the furiously brave acco |
| 3          | 1904      | 658       | 2            | 49         | 88489.1         | 0.1        | 0.0   | R            | F            | 1993-11-09 | 1993-12-20   | 1993-11-24    | TAKE BACK RETURN  | RAIL       |  unusual accounts. eve              |
| 3          | 12845     | 370       | 3            | 27         | 47461.68        | 0.06       | 0.07  | A            | F            | 1994-01-16 | 1993-11-22   | 1994-01-23    | DELIVER IN PERSON | SHIP       | nal foxes wake.                     |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+-------------------------------------+
10 row(s) fetched.
Elapsed 0.029 seconds.

> describe '/tmp/tpchdbgen-rs/lineitem.csv';
+-----------------+-----------+-------------+
| column_name     | data_type | is_nullable |
+-----------------+-----------+-------------+
| l_orderkey      | Int64     | YES         |
| l_partkey       | Int64     | YES         |
| l_suppkey       | Int64     | YES         |
| l_linenumber    | Int64     | YES         |
| l_quantity      | Int64     | YES         |
| l_extendedprice | Float64   | YES         |
| l_discount      | Float64   | YES         |
| l_tax           | Float64   | YES         |
| l_returnflag    | Utf8      | YES         |
| l_linestatus    | Utf8      | YES         |
| l_shipdate      | Date32    | YES         |
| l_commitdate    | Date32    | YES         |
| l_receiptdate   | Date32    | YES         |
| l_shipinstruct  | Utf8      | YES         |
| l_shipmode      | Utf8      | YES         |
| l_comment       | Utf8      | YES         |
+-----------------+-----------+-------------+
16 row(s) fetched.
Elapsed 0.006 seconds.

You can also use the CREATE EXTERNAL TABLE syntax to declare the column types explicitly:

DataFusion CLI v46.0.1
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR,
) STORED AS CSV LOCATION '/tmp/tpchdbgen-rs/lineitem.csv';

0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from lineitem limit 10;
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+------------------------------------------+
| l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct    | l_shipmode | l_comment                                |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+------------------------------------------+
| 75523      | 2979      | 232       | 2            | 44.00      | 82806.68        | 0.09       | 0.00  | N            | O            | 1995-10-06 | 1995-11-05   | 1995-11-04    | NONE              | REG AIR    | ructions sleep blithely deposits. fluff  |
| 75523      | 5359      | 615       | 3            | 14.00      | 17700.90        | 0.02       | 0.05  | N            | O            | 1995-12-07 | 1995-11-11   | 1995-12-18    | NONE              | REG AIR    | eas-- finally even depo                  |
| 75524      | 19433     | 434       | 1            | 28.00      | 37868.04        | 0.06       | 0.07  | N            | O            | 1995-12-28 | 1995-11-07   | 1996-01-11    | COLLECT COD       | REG AIR    | e the quickly regular foxes              |
| 75524      | 17789     | 591       | 2            | 27.00      | 46083.06        | 0.08       | 0.08  | N            | O            | 1995-10-09 | 1995-11-19   | 1995-10-30    | COLLECT COD       | MAIL       | ans. furiously even depo                 |
| 75524      | 16599     | 398       | 3            | 43.00      | 65170.37        | 0.00       | 0.01  | N            | O            | 1995-12-24 | 1995-12-16   | 1996-01-17    | TAKE BACK RETURN  | AIR        | ges. boldly ironic foxes p               |
| 75524      | 15039     | 40        | 4            | 15.00      | 14310.45        | 0.05       | 0.05  | N            | O            | 1995-10-29 | 1995-12-10   | 1995-11-20    | DELIVER IN PERSON | AIR        | nd the theodolites sleep carefully ca    |
| 75525      | 9871      | 872       | 1            | 36.00      | 64111.32        | 0.00       | 0.02  | N            | O            | 1995-08-30 | 1995-07-10   | 1995-09-20    | TAKE BACK RETURN  | TRUCK      | e slyly pending deposits. blithely bo    |
| 75525      | 171       | 172       | 2            | 24.00      | 25708.08        | 0.05       | 0.04  | N            | O            | 1995-07-22 | 1995-07-21   | 1995-08-17    | NONE              | RAIL       | kly special deposits. pending, bold shea |
| 75526      | 9745      | 264       | 1            | 42.00      | 69499.08        | 0.08       | 0.01  | R            | F            | 1994-06-23 | 1994-04-26   | 1994-07-14    | DELIVER IN PERSON | TRUCK      |  excuses are idly. qui                   |
| 75527      | 1682      | 934       | 1            | 31.00      | 49094.08        | 0.00       | 0.06  | N            | O            | 1997-11-02 | 1997-10-10   | 1997-11-26    | DELIVER IN PERSON | REG AIR    | among the furiously sile                 |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+------------------------------------------+
10 row(s) fetched.
Elapsed 0.019 seconds.

> describe lineitem;
+-----------------+-------------------+-------------+
| column_name     | data_type         | is_nullable |
+-----------------+-------------------+-------------+
| l_orderkey      | Int64             | YES         |
| l_partkey       | Int64             | YES         |
| l_suppkey       | Int64             | YES         |
| l_linenumber    | Int32             | YES         |
| l_quantity      | Decimal128(15, 2) | YES         |
| l_extendedprice | Decimal128(15, 2) | YES         |
| l_discount      | Decimal128(15, 2) | YES         |
| l_tax           | Decimal128(15, 2) | YES         |
| l_returnflag    | Utf8              | YES         |
| l_linestatus    | Utf8              | YES         |
| l_shipdate      | Date32            | YES         |
| l_commitdate    | Date32            | YES         |
| l_receiptdate   | Date32            | YES         |
| l_shipinstruct  | Utf8              | YES         |
| l_shipmode      | Utf8              | YES         |
| l_comment       | Utf8              | YES         |
+-----------------+-------------------+-------------+
16 row(s) fetched.
Elapsed 0.001 seconds.

Performance

The time to generate the tbl format hasn't changed:

$ time target/release/tpchgen-cli -s 1 --output-dir=/tmp/tpchdbgen-rs
Generation complete!

real	0m5.281s
user	0m5.010s
sys	0m0.250s

The time to generate the CSV format is about the same:

$ time target/release/tpchgen-cli -s 1 --format csv --output-dir=/tmp/tpchdbgen-rs
Generation complete!

real	0m5.319s
user	0m5.041s
sys	0m0.235s

Testing

The CSV generation is tested via doc examples (which now also run in CI after #55).

I did not add integration tests for tpchgen-cli.

@alamb alamb force-pushed the alamb/csv branch 2 times, most recently from 331449b to 2a05485 Compare March 21, 2025 14:28
@alamb alamb marked this pull request as ready for review March 21, 2025 14:42
@@ -61,137 +68,323 @@ enum Table {
LineItem,
}

impl Table {

This PR touches many lines, but much of the change is documentation and boilerplate.

}

#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, ValueEnum)]
enum OutputFormat {

I added an output format enum and then turned the various generation functions into methods on Cli.

fn generate_part(cli: &Cli) -> io::Result<()> {
let filename = "part.tbl";
let mut writer = new_table_writer(cli, filename)?;
fn generate_nation(&self) -> io::Result<()> {

I chose to keep the structure of one function per table, each of which dispatches to a specialized function for the selected format.

We could likely avoid the duplication with macros or traits, but I think this way is pretty explicit (and the number of tables will never change).
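
For readers skimming the diff, the dispatch shape described above might look roughly like the following. This is only a sketch under stated assumptions: the `Cli` struct is cut down to a single field, the two-row `NATIONS` table and the generic `W: Write` parameter are illustrative stand-ins, and the `clap` `ValueEnum` derive from the PR is omitted so the snippet builds with the standard library alone.

```rust
use std::io::{self, Write};

/// Output formats; the PR also derives clap's `ValueEnum`, omitted here.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
enum OutputFormat {
    Tbl,
    Csv,
}

/// Hypothetical, cut-down stand-in for the real `Cli` struct.
struct Cli {
    format: OutputFormat,
}

/// Illustrative rows only: (n_nationkey, n_name, n_regionkey).
const NATIONS: [(i64, &str, i64); 2] = [(0, "ALGERIA", 0), (1, "ARGENTINA", 1)];

impl Cli {
    /// One method per table, dispatching on the requested format.
    fn generate_nation<W: Write>(&self, out: &mut W) -> io::Result<()> {
        match self.format {
            OutputFormat::Tbl => generate_nation_tbl(out),
            OutputFormat::Csv => generate_nation_csv(out),
        }
    }
}

/// tbl flavor: pipe-delimited with a trailing pipe and no header row.
fn generate_nation_tbl<W: Write>(out: &mut W) -> io::Result<()> {
    for (key, name, region) in NATIONS {
        writeln!(out, "{key}|{name}|{region}|")?;
    }
    Ok(())
}

/// CSV flavor: header row followed by comma-delimited values.
fn generate_nation_csv<W: Write>(out: &mut W) -> io::Result<()> {
    writeln!(out, "n_nationkey,n_name,n_regionkey")?;
    for (key, name, region) in NATIONS {
        writeln!(out, "{key},{name},{region}")?;
    }
    Ok(())
}
```

Each table/format pair stays a plain function, so adding a format means adding one match arm and one function per table rather than threading a generic parameter through the generators.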

}
}

// Separate functions for each table/output format combination

The duplication across these functions is unfortunate, but I think they are straightforward and easy to understand, and they guarantee that each table/format combination gets specialized code.

use core::fmt;
use std::fmt::Display;

/// Write [`Nation`]s in CSV format.

To generate CSV, the idea is to add a zero-copy newtype wrapper for each table row that formats the row as CSV instead of tbl. It is fairly repetitive, and about half the code is doc examples, which double as unit tests.
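
A minimal sketch of that wrapper idea follows. The `Nation` struct here is a hypothetical stand-in (the real row type's fields may differ), and both the `header` helper and the choice to double-quote the comment field are assumptions made for illustration, not necessarily what the PR does.

```rust
use std::fmt;

/// Hypothetical stand-in for a generated row; the real `Nation` may differ.
pub struct Nation<'a> {
    pub n_nationkey: i64,
    pub n_name: &'a str,
    pub n_regionkey: i64,
    pub n_comment: &'a str,
}

/// Newtype wrapper that borrows the row, so wrapping copies no row data.
pub struct NationCsv<'a>(pub &'a Nation<'a>);

impl NationCsv<'_> {
    /// Header line matching the column order used by `Display` below.
    pub fn header() -> &'static str {
        "n_nationkey,n_name,n_regionkey,n_comment"
    }
}

impl fmt::Display for NationCsv<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let n = self.0;
        // The comment is quoted because free text may contain commas.
        // Assumption: the text never contains a double quote, so no
        // escaping is attempted in this sketch.
        write!(
            f,
            "{},{},{},\"{}\"",
            n.n_nationkey, n.n_name, n.n_regionkey, n.n_comment
        )
    }
}
```

Because the wrapper implements `Display`, a caller can write a row with `writeln!(writer, "{}", NationCsv(&nation))`, and the same borrowed row could be formatted as tbl by a different wrapper or by its own `Display` impl.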

Successfully merging this pull request may close these issues.

[FEATURE] Directly write csv format (in addition to tbl format)