This repository holds the data generation and query generation utilities (dbgen
and qgen
respectively) for the TPC-H analytics benchmark for DBMSes.
Repository build status on Github Workflow:
Table of contents |
---|
"What? Another fork of tpch-dbgen? Why?" About the TPC-H benchmark Building the generation utilities Using dbgen to generate data |
The tpch-dbgen
utility is a minor(ish) m odification of the original utility available on the TPC-H benchmark's dedicated page. It's somewhat crufty code with all kinds of warts. It's not the intention of this repository to make all of those disappear, or to effect a rewrite - but avoiding compiler warnings, ensuring the build passes on more platforms, and other such minor issues are what this endeavor is about.
Now, there are several repositories on GitHub with such modifications of the official TPC-H dbgen
- each resolving some issues and ignoring others; some adding customizations relevant only to some users; and each starting out with a potentially outdated version of the official utility. This particular repository is an attempt to unify all of them, applying all changes to the code which - in my opinion - are generally useful, and merging them all together. Details of what's already been done can be found on the Closed Issues Page and of course by examining the commit comments.
If you are the author of one of the other repositories - please contact me for better coordination of this effort.
TPC benchmark H is one of several benchmarks created and maintained by the Transaction Processiong Council - TPC for short. The TPC, is industrial consortium with the major DBMS vendors as members, which is intended to standardize the performance benchmarking of DBMSes in various usage scenarios. Other TPC-H benchmarks have been named TPC-A, TPC-B, TPC-R, TPC-DS and so on; some are deprecated and no longer irrelevant. You can read about the actively-supported benchmarks here.
TPC-H is intended for benchmarking the Decision Support performance of a DBMS, or in other words - analytic query processing performance. It follows a commercial business scenario involving suppliers, customers, parts, suppliers of parts and orders of parts shipped to clients, over the course of several business years (1992-1998). Its schema is not very complicated, but not trivial either (e.g. it is not a "star schema"). It has 22 queries ranging from simpler to relatively long and involved - although none of them utilizes more advanced or esoteric SQL features (such as window functions, UDFs and so on). It is in wide use both in its "proper" form - of sending random variants of the queries from the set to a running DBMS over the course of an hour - and in the "artificial" form of running individual queries on a cold DBMS.
The original, TPC-distributed version of the code in this repository required manually creating a Makefile
from am template (makefile.suite
) - that is not necessary with this repository.
The build process is automated using Kitware CMake. There are several settings which you must make before building, which CMake will guide you through if you invoke it properly - that is, using the GUI or terminal-based user interface that presents the configuration options for you to choose from.
Following is an explanation of the values you would need to set using the GUI or TUI:
Variable | Used by | How to set it? | List of options |
---|---|---|---|
DATABASE |
qgen | Select the name of the DBMS closest to the one you're benchmarking in terms of syntax. If unsure, choose DB2 |
INFORMIX DB2 TDAT SQLSERVER SYBASE ORACLE VECTORWISE POSTGRES |
CSV_OUTPUT_FORMAT |
dbgen | Set to ON for output in the CSV format - with comma-delimited fields and double-quoted strings. |
ON OFF |
WORKLOAD |
dbgen, qgen | Use TPCH |
|
EOL_HANDLING |
dbgen | Set to OFF if your DBMS doesn't support loading the data if it has a separator character at the end of each line |
ON or OFF (it's a boolean really) |
CMake will generate files (including a Makefile) which you can then use with your platform-specific build tools, e.g. NMake for Windows or GNU Make on Unix-like systems. If you're not sure how to use them, consult the documentation or Goole.
Note: The PLATFORM
or MACHINE
variable, which in TPC's own version had to be manually set within the Makefile
, is now not necessary: Platform features will be auto-detected and taken into account (like any normal software package these days).
Typically, dbgen is invoked with a specific scale factor, and it must be directed at the dists.dss
file. Thus the command-line should look something like this:
$ /path/to/dbgen -v -s 10 -b /path/to/dists.dss
which will create the various table files (e.g. customer.tbl
, nation.tbl
, region.tbl
, supplier.tbl
and so on) in the current directory, with a scale factor of 10, i.e. 300,000 customer lines. Here are the first few lines of a resulting customer.tbl
:
1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING|
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN 1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE|
3|Customer#000000003|fkRGN8n|ARGENTINA7|ARGENTINA|AMERICA|11-719-748-3364|AUTOMOBILE|
4|Customer#000000004|4u58h f|EGYPT 4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY|
the fields are separated by a pipe character (|
), and there's a trailing pipe at the end of the line - unless you set EOL_HANDLING
option to OFF
.
After generating .tbl
files for all tables, you should now either load them directly into your DBMS, or apply some textual processing on them.
Note: On Unix-like systems, it is also possible to write the generated data into a FIFO filesystem node, reading from the other side with a compression utility, so as to only write compressed data to disk. This may be useful of disk space is limited and you are using a particularly high scale factor.
Have you encountered some other issue with dbgen
or qgen
? Please open a new issue on the Issues Page; be sure to list exactly what you did and enter a copy of the terminal output of the commands you used. Note that the issue might be with the original, unmodified TPC-H dbgen
/qgen
code, in which case the issue may be forwarded rather than resolved.