Added most of TPC-H, some queries need to be fixed.
Major improvements to the build and generate scripts.
cartershanklin committed Mar 28, 2014
1 parent 822aa6b commit 2b4fa2e
Showing 23 changed files with 864 additions and 83 deletions.
58 changes: 31 additions & 27 deletions README.md

Prerequisites
=============

You will need:
* A Linux-based HDP cluster (or Sandbox) with Hadoop 2.2 or later.
* Hive 13 or later.
* Between 15 minutes and 6 hours to generate data (depending on the Scale Factor you choose and available hardware).

Install and Setup
=================

All of these steps should be carried out on your Hadoop cluster.

- Optional: Install a Tez capable version of Hive.

If you want to compare and contrast Hive on Map/Reduce versus Hive on Tez, install a version of Hive that works with Tez. For now that means installing the [Stinger Phase 3 Preview](http://www.hortonworks.com). Hive 13 and later, once released, will include Tez support by default.

- Step 1: Prepare your environment.

In addition to Hadoop and Hive 13+, before you begin, ensure ```gcc``` is installed and available on your system path. If your system does not have it, install it using yum or apt-get.
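
  For example, on a yum-based system (apt-get works analogously on Debian-style systems):

  ```
  # Verify gcc is on the PATH; install it if the check fails.
  which gcc || sudo yum install -y gcc
  ```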

- Step 2: Decide which test suite(s) you want to use.

hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can use either or both of these benchmarks for experimentation. More information about them can be found at the Transaction Processing Performance Council homepage.

- Step 3: Compile and package the appropriate data generator.

For TPC-DS, ```./tpcds-build.sh``` downloads, compiles and packages the TPC-DS data generator.
For TPC-H, ```./tpch-build.sh``` downloads, compiles and packages the TPC-H data generator.
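
  For example, from the root of the testbench checkout:

  ```
  # Build one or both data generators, depending on the suite(s) chosen in Step 2.
  ./tpcds-build.sh
  ./tpch-build.sh
  ```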

- Step 4: Decide how much data you want to generate.

You need to decide on a "Scale Factor", which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, Scale 1000 (1 TB) is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.

- Step 5: Generate and load the data.

The scripts ```tpcds-setup.sh``` and ```tpch-setup.sh``` generate and load data for TPC-DS and TPC-H, respectively. General usage is ```tpcds-setup.sh scale_factor [directory]``` or ```tpch-setup.sh scale_factor [directory]```.

Some examples:

Build 1 TB of TPC-DS data: ```./tpcds-setup.sh 1000```

Build 1 TB of TPC-H data: ```./tpch-setup.sh 1000```

Build 100 TB of TPC-DS data: ```./tpcds-setup.sh 100000```
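
  The optional directory argument controls where the raw text data is staged before loading. For instance (the path below is only illustrative):

  ```
  # Generate 1 TB of TPC-H data, staging the flat files under a custom path.
  ./tpch-setup.sh 1000 /tmp/tpch-generate
  ```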

- Step 6: Run queries.

More than 50 sample TPC-DS queries and all TPC-H queries are included for you to try. You can use ```hive```, ```beeline``` or the SQL tool of your choice. The testbench also includes a set of suggested settings.

This example assumes you have generated 1 TB of TPC-DS data during Step 5:

```
cd sample-queries-tpcds
hive -i testbench.settings
hive> use tpcds_bin_partitioned_orc_1000;
hive> source query55.sql;
```
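
To run several queries back to back, a small shell loop also works (a sketch; the exact query file names in your checkout may differ):

```
cd sample-queries-tpcds
for q in query55.sql query12.sql; do
  # --database selects the database created in Step 5; results go to per-query files.
  hive -i testbench.settings --database tpcds_bin_partitioned_orc_1000 \
      -f "$q" > "result_${q%.sql}.txt"
done
```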

Note that the database is named based on the Data Scale chosen in Step 4. At Data Scale 10000, your database will be named tpcds_bin_partitioned_orc_10000. At Data Scale 1000 it would be named tpcds_bin_partitioned_orc_1000. You can always run ```show databases``` to get a list of available databases.
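
You can also run the same query through ```beeline```. The following is a sketch: the JDBC URL is an assumption about your HiveServer2 host, and ```-i``` support depends on your Beeline version:

```
beeline -u jdbc:hive2://localhost:10000/tpcds_bin_partitioned_orc_1000 \
    -i testbench.settings -f query55.sql
```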

Similarly, if you generated 1 TB of TPC-H data during Step 5:

```
cd sample-queries-tpch
hive -i testbench.settings
hive> use tpch_bin_partitioned_orc_1000;
hive> source tpch_query1.sql;
```


Feedback
========

96 changes: 96 additions & 0 deletions ddl-tpch/bin_flat/alltables.sql
create database if not exists ${DB};
use ${DB};

drop table if exists lineitem;
create external table lineitem
(L_ORDERKEY INT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INT,
L_QUANTITY DOUBLE,
L_EXTENDEDPRICE DOUBLE,
L_DISCOUNT DOUBLE,
L_TAX DOUBLE,
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE STRING,
L_COMMITDATE STRING,
L_RECEIPTDATE STRING,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/lineitem';

drop table if exists part;
create external table part (P_PARTKEY INT,
P_NAME STRING,
P_MFGR STRING,
P_BRAND STRING,
P_TYPE STRING,
P_SIZE INT,
P_CONTAINER STRING,
P_RETAILPRICE DOUBLE,
P_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/part/';

drop table if exists supplier;
create external table supplier (S_SUPPKEY INT,
S_NAME STRING,
S_ADDRESS STRING,
S_NATIONKEY INT,
S_PHONE STRING,
S_ACCTBAL DOUBLE,
S_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/supplier/';

drop table if exists partsupp;
create external table partsupp (PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/partsupp';

drop table if exists nation;
create external table nation (N_NATIONKEY INT,
N_NAME STRING,
N_REGIONKEY INT,
N_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/nation';

drop table if exists region;
create external table region (R_REGIONKEY INT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/region';

drop table if exists customer;
create external table customer (C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/customer';

drop table if exists orders;
create external table orders (O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE STRING,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '${LOCATION}/orders';
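
The ```${DB}``` and ```${LOCATION}``` placeholders are resolved through Hive variable substitution when the setup scripts run this DDL. A hand-run equivalent might look like this (the database name and path are illustrative):

```
hive --hivevar DB=tpch_text_1000 --hivevar LOCATION=/tmp/tpch-generate/1000 \
    -f ddl-tpch/bin_flat/alltables.sql
```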
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/customer.sql
create database if not exists ${DB};
use ${DB};

drop table if exists customer;

create table customer
stored as ${FILE}
as select * from ${SOURCE}.customer;
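
After substitution this CTAS becomes an ordinary statement; for example, with ```FILE=orc```, ```DB=tpch_flat_orc_1000``` and ```SOURCE=tpch_text_1000``` (illustrative values), it would read:

```
create table customer
stored as orc
as select * from tpch_text_1000.customer;
```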
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/lineitem.sql
create database if not exists ${DB};
use ${DB};

drop table if exists lineitem;

create table lineitem
stored as ${FILE}
as select * from ${SOURCE}.lineitem;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/nation.sql
create database if not exists ${DB};
use ${DB};

drop table if exists nation;

create table nation
stored as ${FILE}
as select * from ${SOURCE}.nation;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/orders.sql
create database if not exists ${DB};
use ${DB};

drop table if exists orders;

create table orders
stored as ${FILE}
as select * from ${SOURCE}.orders;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/part.sql
create database if not exists ${DB};
use ${DB};

drop table if exists part;

create table part
stored as ${FILE}
as select * from ${SOURCE}.part;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/partsupp.sql
create database if not exists ${DB};
use ${DB};

drop table if exists partsupp;

create table partsupp
stored as ${FILE}
as select * from ${SOURCE}.partsupp;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/region.sql
create database if not exists ${DB};
use ${DB};

drop table if exists region;

create table region
stored as ${FILE}
as select * from ${SOURCE}.region;
8 changes: 8 additions & 0 deletions ddl-tpch/bin_flat/supplier.sql
create database if not exists ${DB};
use ${DB};

drop table if exists supplier;

create table supplier
stored as ${FILE}
as select * from ${SOURCE}.supplier;
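
All eight per-table scripts share this shape, so the load stage can drive them with a simple loop; a sketch, with illustrative variable values:

```
for t in customer lineitem nation orders part partsupp region supplier; do
  hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/${t}.sql \
      --hivevar DB=tpch_flat_orc_1000 --hivevar FILE=orc \
      --hivevar SOURCE=tpch_text_1000
done
```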
23 changes: 5 additions & 18 deletions settings/init.sql
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000000;
set hive.exec.max.dynamic.partitions=1000000;
set hive.exec.max.created.files=1000000;
set hive.map.aggr=true;
set mapreduce.reduce.speculative=false;
set hive.auto.convert.join=true;
set hive.stats.autogather=true;

set mapred.map.child.java.opts=-server -Xmx2800m -Djava.net.preferIPv4Stack=true;
set mapred.reduce.child.java.opts=-server -Xmx3800m -Djava.net.preferIPv4Stack=true;
set mapreduce.map.memory.mb=3072;
set hive.optimize.tez=true;
set mapreduce.reduce.memory.mb=4096;
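
These settings are meant to be sourced before query runs; for example (a sketch using Hive's ```-i``` init-file flag and the Step 5 database name):

```
hive -i settings/init.sql --database tpcds_bin_partitioned_orc_1000 \
    -f sample-queries-tpcds/query55.sql
```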
11 changes: 5 additions & 6 deletions settings/load-flat.sql
set hive.exec.max.dynamic.partitions.pernode=1000000;
set hive.exec.max.dynamic.partitions=1000000;
set hive.exec.max.created.files=1000000;

set mapreduce.input.fileinputformat.split.minsize=240000000;
set mapreduce.input.fileinputformat.split.maxsize=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=240000000;
set hive.exec.parallel=true;
set hive.stats.autogather=true;
19 changes: 8 additions & 11 deletions settings/load-partitioned.sql
set hive.exec.reducers.max=2000;
set hive.stats.autogather=true;

set mapred.job.reduce.input.buffer.percent=0.0;
set mapreduce.input.fileinputformat.split.minsize=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=240000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=240000000;

-- set mapred.map.child.java.opts=-server -Xmx2800m -Djava.net.preferIPv4Stack=true;
-- set mapred.reduce.child.java.opts=-server -Xms1024m -Xmx3800m -Djava.net.preferIPv4Stack=true;
-- set mapreduce.map.memory.mb=3072;
-- set mapreduce.reduce.memory.mb=4096;
-- set io.sort.mb=800;
29 changes: 29 additions & 0 deletions tpcds-build.sh
#!/bin/sh

# Check for all the stuff I need to function.
for f in gcc; do
  which $f > /dev/null 2>&1
  if [ $? -ne 0 ]; then
    echo "Required program $f is missing. Please install it and try again."
    exit 1
  fi
done

# Check if Maven is installed and install it if not.
which mvn > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "Maven not found, automatically installing it."
  curl -O http://www.us.apache.org/dist/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz 2> /dev/null
  if [ $? -ne 0 ]; then
    echo "Failed to download Maven, check Internet connectivity and try again."
    exit 1
  fi
  tar -zxf apache-maven-3.0.5-bin.tar.gz > /dev/null
  CWD=$(pwd)
  export MAVEN_HOME="$CWD/apache-maven-3.0.5"
  export PATH=$PATH:$MAVEN_HOME/bin
fi

echo "Building TPC-DS Data Generator"
(cd tpcds-gen; make)
echo "TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data."
2 changes: 1 addition & 1 deletion tpcds-gen/Makefile
tpcds_kit.zip:
	curl --output tpcds_kit.zip http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp?download_key=NaN

target/lib/dsdgen.jar: target/tools/dsdgen
	cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )

target/tools/dsdgen: target/tpcds_kit.zip
	test -d target/tools/ || (cd target; unzip tpcds_kit.zip; cd tools; cat ../../*.patch | patch -p0 )
