-
Notifications
You must be signed in to change notification settings - Fork 340
Performance optimization skill:Second half Ordered Grouping
First of all, what is second-half ordered grouping? When data set T is already ordered by field “a” and field “b” and we want to sort or group the table T by field “b”, in this case, in the segment with the same “a” value, field “b” is always ordered. So this is the scenario where the sorting field/grouping field is ordered in each segment grouped by the first-half field, and we name it as second-half ordered grouping.
As we all know, the quick-sort algorithm is to segment data then sort and merge recursively. For data set that is already ordered by the second-half fields, the execution speed of quick-sort algorithm can be very fast already. So if we sort the data set T by field “b” using the quick-sort algorithm, then we can optimize the grouping with the technique introduced in Performance Optimization Skill: Ordered Grouping.
SPL provides such an algorithm of second-half ordered grouping. In the following part, we’ll test it and compare it with the hashing grouping algorithm in Oracle.
The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16-core virtual machine with 8G memory is set for testing.
On the virtual machine, we create the data table “sales” which consists of 4 fields - orderdate, area (strings), salesman (strings) and amount (real numbers) – and 1 billion rows of records. We import the data table into the Oracle database, and use it to generate the esProc SPL composite table for testing.
The data table is sorted by orderdate, area and salesman in ascending order. The task goal is to query the sales amount of every salesman in each area, which requires grouping by “area” and “salesman” and there are 1 million groups in the result set. Since it takes Oracle a lot of time to output such a big result set, we need to filter the result set again to output only the order whose total amount is less than CNY 471,000. And there are only 11 results left, so the output hardly takes up any time.
SQL query statement:
select * from (
select /*+ parallel(n) */
area, salesman, sum(amount) as amount
from sales
group by area, salesman
) where amount<471000;
/*+ parallel(n) */ is for parallel processing, where n is the number of parallel tasks.
SPL script:
A | |
---|---|
1 | =now() |
2 | =file("/home/ctx/sales.ctx").open().cursor@m(area,salesman,amount;;1) |
3 | =A2.groups@h(area,salesman;sum(amount):amount).select(amount< 471000) |
4 | =interval@s(A1,now()) |
The groups function with @h option indicates the grouping fields are second-half sorting fields (ordered in each segment), which is used to sort the grouping fields first with the quick-sort algorithm in SPL and then optimize the performance with ordered grouping technique.
It is worth noting that the second-half ordered grouping is all performed in memory, which requires the memory to be able to hold the grouping result set and n result set when multi-thread processing is performed (n is the number of parallel threads).
To use the hash grouping in SPL, remove the @h option from the groups function in the previous script.
Test results (Unit: second):
Number of parallel threads | 1 | 2 | 4 |
---|---|---|---|
Oracle | 387 | 195 | 104 |
SPL (HASH) | 405 | 208 | 121 |
SPL (Second-half ordered grouping) | 252 | 142 | 83 |
From the test results, the second-half ordered grouping in SPL is over 50% faster than the hash grouping in both SPL and Oracle, which shows significant improvement in the performance. And it’s normal that the Java-based SPL ordinary grouping is slightly slower than the C-based Oracle in performing hash grouping (all columns are involved in the tests, so the columnar storage of SPL shows no strength).
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code