Added most of TPC-H, some queries need to be fixed.
Major improvements to the build and generate scripts.
cartershanklin committed Mar 28, 2014
1 parent 822aa6b commit 2b4fa2e
Showing 23 changed files with 864 additions and 83 deletions.
58 changes: 31 additions & 27 deletions README.md

Prerequisites
=============

You will need:
* A Linux-based HDP cluster (or Sandbox) with Hadoop 2.2 or later.
* Hive 13 or later.
* Between 15 minutes and 6 hours to generate data (depending on the Scale Factor you choose and available hardware).

Install and Setup
=================

All of these steps should be carried out on your Hadoop cluster.

- Optional: Install a Tez capable version of Hive.

If you want to compare and contrast Hive on Map/Reduce versus Hive on Tez, install a version of Hive that works with Tez. For now that means installing the [Stinger Phase 3 Preview](http://www.hortonworks.com). Hive 13 and later, once released, will include Tez support by default.

- Step 1: Prepare your environment.

In addition to Hadoop and Hive 13+, before you begin, ensure ```gcc``` is installed and available on your system path. If your system does not have it, install it using yum or apt-get.
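
  For example, on a yum-based system (apt-get works analogously on Debian-style systems):

  ```
  # Verify gcc is on the PATH; install it if the check fails.
  which gcc || sudo yum install -y gcc
  ```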

- Step 2: Decide which test suite(s) you want to use.

hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can use either or both of these benchmarks for experimentation. More information about them can be found at the Transaction Processing Performance Council homepage.

- Step 3: Compile and package the appropriate data generator.

For TPC-DS, ```./tpcds-build.sh``` downloads, compiles and packages the TPC-DS data generator.
For TPC-H, ```./tpch-build.sh``` downloads, compiles and packages the TPC-H data generator.
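
  For example, from the root of the testbench checkout:

  ```
  # Build one or both data generators, depending on the suite(s) chosen in Step 2.
  ./tpcds-build.sh
  ./tpch-build.sh
  ```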

- Step 4: Decide how much data you want to generate.

You need to decide on a "Scale Factor", which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, Scale 1000 (1 TB) is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.

- Step 5: Generate and load the data.

The scripts ```tpcds-setup.sh``` and ```tpch-setup.sh``` generate and load data for TPC-DS and TPC-H, respectively. General usage is ```tpcds-setup.sh scale_factor [directory]``` or ```tpch-setup.sh scale_factor [directory]```.

Some examples:

Build 1 TB of TPC-DS data: ```./tpcds-setup.sh 1000```

Build 1 TB of TPC-H data: ```./tpch-setup.sh 1000```

Build 100 TB of TPC-DS data: ```./tpcds-setup.sh 100000```
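
  The optional directory argument controls where the raw text data is staged before loading. For instance (the path below is only illustrative):

  ```
  # Generate 1 TB of TPC-H data, staging the flat files under a custom path.
  ./tpch-setup.sh 1000 /tmp/tpch-generate
  ```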

- Step 6: Run queries.

More than 50 sample TPC-DS queries and all TPC-H queries are included for you to try. You can use ```hive```, ```beeline``` or the SQL tool of your choice. The testbench also includes a set of suggested settings.

This example assumes you have generated 1 TB of TPC-DS data during Step 5:

```
cd sample-queries-tpcds
hive -i testbench.settings
hive> use tpcds_bin_partitioned_orc_1000;
hive> source query55.sql;
```
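
To run several queries back to back, a small shell loop also works (a sketch; the exact query file names in your checkout may differ):

```
cd sample-queries-tpcds
for q in query55.sql query12.sql; do
  # --database selects the database created in Step 5; results go to per-query files.
  hive -i testbench.settings --database tpcds_bin_partitioned_orc_1000 \
      -f "$q" > "result_${q%.sql}.txt"
done
```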

Note that the database is named based on the Data Scale chosen in Step 4. At Data Scale 10000, your database will be named tpcds_bin_partitioned_orc_10000. At Data Scale 1000 it would be named tpcds_bin_partitioned_orc_1000. You can always run ```show databases``` to get a list of available databases.
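
You can also run the same query through ```beeline```. The following is a sketch: the JDBC URL is an assumption about your HiveServer2 host, and ```-i``` support depends on your Beeline version:

```
beeline -u jdbc:hive2://localhost:10000/tpcds_bin_partitioned_orc_1000 \
    -i testbench.settings -f query55.sql
```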

Similarly, if you generated 1 TB of TPC-H data during Step 5:

```
cd sample-queries-tpch
hive -i testbench.settings
hive> use tpch_bin_partitioned_orc_1000;
hive> source tpch_query1.sql;
```


Feedback
========

96 changes: 96 additions & 0 deletions ddl-tpch/bin_flat/alltables.sql
create database if not exists ${DB};
use ${DB};

drop table if exists lineitem;
create external table lineitem
(L_ORDERKEY INT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INT,
L_QUANTITY DOUBLE,
L_EXTENDEDPRICE DOUBLE,
L_DISCOUNT DOUBLE,
L_TAX DOUBLE,
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE STRING,
L_COMMITDATE STRING,
L_RECEIPTDATE STRING,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/lineitem';

drop table if exists part;
create external table part (P_PARTKEY INT,
P_NAME STRING,
P_MFGR STRING,
P_BRAND STRING,
P_TYPE STRING,
P_SIZE INT,
P_CONTAINER STRING,
P_RETAILPRICE DOUBLE,
P_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/part/';

drop table if exists supplier;
create external table supplier (S_SUPPKEY INT,
S_NAME STRING,
S_ADDRESS STRING,
S_NATIONKEY INT,
S_PHONE STRING,
S_ACCTBAL DOUBLE,
S_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/supplier/';

drop table if exists partsupp;
create external table partsupp (PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/partsupp';

drop table if exists nation;
create external table nation (N_NATIONKEY INT,
N_NAME STRING,
N_REGIONKEY INT,
N_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/nation';

drop table if exists region;
create external table region (R_REGIONKEY INT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/region';

drop table if exists customer;
create external table customer (C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/customer';

drop table if exists orders;
create external table orders (O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE STRING,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/orders';
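
The ```${DB}``` and ```${LOCATION}``` placeholders are resolved through Hive variable substitution when the setup scripts run this DDL. A hand-run equivalent might look like this (the database name and path are illustrative):

```
hive --hivevar DB=tpch_text_1000 --hivevar LOCATION=/tmp/tpch-generate/1000 \
    -f ddl-tpch/bin_flat/alltables.sql
```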
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/customer.sql
create database if not exists ${DB};
use ${DB};

drop table if exists customer;

create table customer
stored as ${FILE}
as select * from ${SOURCE}.customer;
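
After substitution this CTAS becomes an ordinary statement; for example, with ```FILE=orc```, ```DB=tpch_flat_orc_1000``` and ```SOURCE=tpch_text_1000``` (illustrative values), it would read:

```
create table customer
stored as orc
as select * from tpch_text_1000.customer;
```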
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/lineitem.sql
create database if not exists ${DB};
use ${DB};

drop table if exists lineitem;

create table lineitem
stored as ${FILE}
as select * from ${SOURCE}.lineitem;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/nation.sql
create database if not exists ${DB};
use ${DB};

drop table if exists nation;

create table nation
stored as ${FILE}
as select * from ${SOURCE}.nation;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/orders.sql
create database if not exists ${DB};
use ${DB};

drop table if exists orders;

create table orders
stored as ${FILE}
as select * from ${SOURCE}.orders;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/part.sql
create database if not exists ${DB};
use ${DB};

drop table if exists part;

create table part
stored as ${FILE}
as select * from ${SOURCE}.part;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/partsupp.sql
create database if not exists ${DB};
use ${DB};

drop table if exists partsupp;

create table partsupp
stored as ${FILE}
as select * from ${SOURCE}.partsupp;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/region.sql
create database if not exists ${DB};
use ${DB};

drop table if exists region;

create table region
stored as ${FILE}
as select * from ${SOURCE}.region;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/supplier.sql
create database if not exists ${DB};
use ${DB};

drop table if exists supplier;

create table supplier
stored as ${FILE}
as select * from ${SOURCE}.supplier;
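
All eight per-table scripts share this shape, so the load stage can drive them with a simple loop; a sketch, with illustrative variable values:

```
for t in customer lineitem nation orders part partsupp region supplier; do
  hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/${t}.sql \
      --hivevar DB=tpch_flat_orc_1000 --hivevar FILE=orc \
      --hivevar SOURCE=tpch_text_1000
done
```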
23 changes: 5 additions & 18 deletions settings/init.sql
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000000;
set hive.exec.max.dynamic.partitions=1000000;
set hive.exec.max.created.files=1000000;
set hive.map.aggr=true;
set mapreduce.reduce.speculative=false;
set hive.auto.convert.join=true;
set hive.stats.autogather=true;

set mapred.map.child.java.opts=-server -Xmx2800m -Djava.net.preferIPv4Stack=true;
set mapred.reduce.child.java.opts=-server -Xmx3800m -Djava.net.preferIPv4Stack=true;
set mapreduce.map.memory.mb=3072;
set hive.optimize.tez=true;
set mapreduce.reduce.memory.mb=4096;
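
These settings are meant to be sourced before query runs; for example (a sketch using Hive's ```-i``` init-file flag and the Step 5 database name):

```
hive -i settings/init.sql --database tpcds_bin_partitioned_orc_1000 \
    -f sample-queries-tpcds/query55.sql
```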
11 changes: 5 additions & 6 deletions settings/load-flat.sql
set hive.exec.max.dynamic.partitions.pernode=1000000;
set hive.exec.max.dynamic.partitions=1000000;
set hive.exec.max.created.files=1000000;

set mapreduce.input.fileinputformat.split.minsize=240000000;
set mapreduce.input.fileinputformat.split.maxsize=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=240000000;
set hive.exec.parallel=true;
set hive.stats.autogather=true;
19 changes: 8 additions & 11 deletions settings/load-partitioned.sql
set hive.exec.reducers.max=2000;
set hive.stats.autogather=true;

set mapred.job.reduce.input.buffer.percent=0.0;
set mapreduce.input.fileinputformat.split.minsize=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=240000000;

-- set mapred.map.child.java.opts=-server -Xmx2800m -Djava.net.preferIPv4Stack=true;
-- set mapred.reduce.child.java.opts=-server -Xms1024m -Xmx3800m -Djava.net.preferIPv4Stack=true;
-- set mapreduce.map.memory.mb=3072;
-- set mapreduce.reduce.memory.mb=4096;
-- set io.sort.mb=800;
29 changes: 29 additions & 0 deletions tpcds-build.sh
#!/bin/sh

# Check for all the stuff I need to function.
for f in gcc; do
  which $f > /dev/null 2>&1
  if [ $? -ne 0 ]; then
    echo "Required program $f is missing. Please install it and try again."
    exit 1
  fi
done

# Check if Maven is installed and install it if not.
which mvn > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "Maven not found, automatically installing it."
  curl -O http://www.us.apache.org/dist/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz 2> /dev/null
  if [ $? -ne 0 ]; then
    echo "Failed to download Maven, check Internet connectivity and try again."
    exit 1
  fi
  tar -zxf apache-maven-3.0.5-bin.tar.gz > /dev/null
  CWD=$(pwd)
  export MAVEN_HOME="$CWD/apache-maven-3.0.5"
  export PATH=$PATH:$MAVEN_HOME/bin
fi

echo "Building TPC-DS Data Generator"
(cd tpcds-gen; make)
echo "TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data."
2 changes: 1 addition & 1 deletion tpcds-gen/Makefile
tpcds_kit.zip:
	curl --output tpcds_kit.zip http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp?download_key=NaN

target/lib/dsdgen.jar: target/tools/dsdgen
	cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )

target/tools/dsdgen: target/tpcds_kit.zip
	test -d target/tools/ || (cd target; unzip tpcds_kit.zip; cd tools; cat ../../*.patch | patch -p0 )
