-
Notifications
You must be signed in to change notification settings - Fork 340
SPL High performance Implementation Routine
In performance optimization cases, often SPL runs faster than SQL, having performance orders of magnitude higher. Our explanation of SPL’s winning is this: the language can implement low-complexity algorithms involving less amount of computation that SQL cannot implement; and this results in higher performance; SQL can only turn to the database optimizer, which, however, doesn’t work in the complex computing scenarios. There is nothing wrong with this logic when discussed in the abstract. But the specific technical factors are still wanted.
The fundamental factor lies in the discreteness that we’ve talked about many times.
Real-world business needs include more than computations on the integral sets. Individual members could have freely split up from or combined with an integral set, they could have been computed mixedly, and there could have been no need to specifically design processes for many computations – we could have just written them in a natural way according to the requirement. But as SQL lacks discreteness, it has difficulty in describing the business logic directly. People can only think in a zigzag manner to write very complicated and roundabout code. The true computing target is hard to identify literally. And this is the important reason that the optimizer becomes overwhelmed. We cited a lot of examples when discussing the discreteness, involving order-based computations, locate computations and the use of sequence numbers, grouping & aggregation, etc. Here we’ll skip them (Discreteness Discussion).
In view of what we’ve said above, the algorithmic logic SQL describes is actually “wrong”. Therefore, SPL cannot ensure high performance after translating the SQL statement to SPL, because it is the algorithm rather than the syntax that enables high performance. With the same algorithm logic, there isn’t essential difference between SPL and SQL in performance. Moreover, SPL provides lower optimization level than the traditional databases, and as a result, runs slower.
Having discreteness, SPL redefines the JOIN operation. The SPL JOIN definition emphasizes the role of primary key and classifies the operation into foreign key-based association and primary key-based association, which suit different application scenarios. The classification based on different situations means the emphasis on conditions, and this helps obtain more useful characteristics for performance optimization.
SQL oversimplifies and overgeneralizes the JOIN definition. This appears to be able to cover all scenarios, but on the other hand, almost cannot exploit any useful characteristics. In consequence, the JOIN operation becomes a long-standing SQL performance optimization problem.
The problem is already discussed in the SQL joins series. (JOIN Series Discussion).
Besides reinforcing the discreteness, SPL also offers the order-based storage mechanism. The mechanism ensures that data is stored and retrieved in order and, on the basis of orderliness, enables implementation of the order-based traversal method. This can effectively avoid the biggest SQL headaches – big-table association and grouping operation that returns a big result set.
SQL is designed based on unordered sets and thus cannot make sure that data stored is ordered. Even if it purposely stores data in order, it cannot make use of the orderliness characteristic.
We already analyzed this in https://c.scudata.com/article/1719281511305.
Those SQL problems are irrelevant to the syntactic style but are rooted in SQL’s theoretical basis – the relational algebra. They cannot be solved by modifying or adding certain keywords or functions unless the theoretical basis is reformed. But once the foundation is changed, it is hard to say whether the relational algebra and the SQL should be still called as now they are.
SPL’s true strength comes not from the syntax but the algebraic theory of discrete dataset on which it is based. Actually, the SPL syntax can be also designed in the English-like SQL way.
It is not only the well-crafted code but, more importantly, the creative algebra that helps achieve high performance.
SPL supplies a variety of performance optimization methods, which are enough for writing a book. This seems like complicated, and you may feel confused about where to start when encountering a problem.
In fact, as we mentioned in https://c.scudata.com/article/1669868875289, SPL has its routine for high-performance implementations, and the methods are not many. The general scope was already indicated in the above analysis of key characteristics.
If there is a large amount of data involved, the first and foremost thing is to design a proper storage scheme because there is strong correlation between high performance and storage schemes.
It is relatively easy to choose row-wise or column-wise (search with row-wise storage and traversal with column-wise storage). It is also not difficult to design the multi-file storage (which is similar to the database table partition) as long as we learn about the data size and the generation frequency. The key is to make clear which fields on which data should be ordered.
Order-based storage and order-based traversal technique based on this storage plan is an important SPL feature. The feature helps effectively avoid high-complexity big-table association and grouping operations that return big result sets. Note that it is “avoid” rather than “solve” that we use here. With the order-based storage and the order-based traversal, it is easy to write business logics according to the natural way of thinking instead of turning to a roundabout route, and as a result, grouping operations that return big result sets (and the degenerated COUNT DISTINCT) and many big-table associations (especially the self-association) are eliminated.
In many difficult-to-handle data intensive computing scenarios, data is event records that are associated with a certain object’s ID and should be ordered by this ID rather than arranged in the original time order. It is not hard for programmers who are familiar with algorithms to understand that, if data is ordered by ID, how easy the original difficult COUNT DISTINCT ID will become.
Of course, the sorting is slow, but it is a one-off action. SPL also offers increment sorting mechanism that disperses the huge computation amount to each data maintenance process, making performance compromise almost invisible. Once data is ordered, the subsequent computations become orders of magnitude of faster. Ordering data amounts to a kind of widely applicable pre-computation, and the cost of pre-computation is scattered among the ordinary daily data maintenance activities.
From https://c.scudata.com/article/1719281511305, we can clearly feel the significance of order to such type of computing tasks and learn how to identify the field on which data should be ordered.
SPL and SQL have very different understandings about the JOIN. In SPL, the primary key is always involved in equi-joins, which can be divided into two types – foreign-key-based association and primary-key-based association. The two types of joins are regarded as completely different operations and are optimized in different ways. The essence of foreign-key-based association is search and that of the primary-key-based association is merge; they are indeed not the same thing.
To implement a join in SPL, the first thing is to identify its type correctly and the involved primary key (maybe a logical one). Then we choose the proper optimization algorithm for it.
Take a JOIN between a big table and small table as an example. If they are a big fact table and a small dimension table, load the small table to the memory and traverse the big table to perform the join (this is the usual SQL method); if there are a big dimension table and a small fact table (which is often result of the filtering operation), we should perform the join using the search technique so that big dimension table traversal can be avoided. Completely different methods are used to handle the two cases.
One more example. When the JOIN involves multi-table association – usually between one fact table and multiple smaller dimension tables (maybe multilayer), we can load the smaller dimension tables into the memory, perform the pre-association and then traverse the big fact table once to resolve all associations.
Some foreign-key-based associations are not very noticeable, but the search in a certain set (the dimension table) is still according to key value of the to-be-traversed set (the fact table); and in a wider sense, an association that corresponds to multiple records still belongs to this type. Actually, non-equi-joins can be also identified as two types – respective correspondence search and merge technique; but without involving the primary key, the non-equi-join is handled with a different search method and merge method.
Usually, the database optimizes and handles the enumerable string by converting it to integers. For example, representing regions and genders with integers helps get better performance in both storage and computations. SPL also uses this optimization method.
Dates can be also represented by integers. There are many judges involved during computing date type data, and the computation is slow. If we can convert dates to integers that can be computed through addition, subtraction, multiplication and division, performance will be enhanced. This is already discussed in Performance Optimization .
Besides the ordinary integerization, SPL puts special emphasis on sequence-nuberization, which means using as much as possible the natural numbers (or numbers from which natural numbers can be conveniently computed). SPL provides a large number of sequence-number-related methods that SQL does not have, such as the basic value retrieval by sequence numbers, the advanced sequence-number-based grouping and association, and further, the alignment sequence. Accessing data according to sequence numbers has the lowest computing complexity, and making most use of the characteristic of sequence numbers can noticeably reduce the computing amount.
With the support of discreteness, SPL obtains more powerful ability to handle set operations.
Multilayer sets (members of the set are also sets) often appear in the computing process written in SPL. For example, the grouping result is a two-layer set and the result of getting TopN based on the grouping result is also a set of sets. Using multilayer sets to describe business logics is very convenient. But as it is rare for the other programming languages to use them in the actual practices (though those languages support the approach), programmers are not accustomed to the approach. They can understand it but need some exercises to become familiar with it, otherwise they might misuse it.
SPL not only has the data table object (called table sequence) SQL has, it also particularly prefers to use the single-column data without a structure (sequence, also known as array). The sequence object is simpler and faster to compute. We need to get used to considering whether to use sequences when implementing business logics.
SPL Lambda syntax is extremely powerful. It implements most of the set operations using loop functions rather than the for statement (except that too many layers of sets are involved and the statement becomes too complicated). Loop functions enable shorter, more concise and faster to execute code than the for statement.
Once the set-oriented thinking is developed, you will be able to write elegant and efficient code.
There are also the small performance optimization techniques. We may not be able to think of them and don’t need to think about all of them when we start writing an algorithm. We can first write the code, test it and then optimize it.
SPL can automatically record the execution time of each code cell during debugging, and it is easy in SPL to write auxiliary code besides the main code to record the execution time of a code block. Through the information we can locate the stages causing the poor performance.
Here are some commonly used techniques:
-
Put filtering on big amounts of data on the cursor. The target of some foreign-key-based associations is to filter data, and fjoin() function has a rich collection of parameters to handle such scenarios. For primary-key-based associations, probably when one table is filtered and data in the other can be located accordingly; pjoin() function can handle such a scenario.
-
To avoid the complicated string computations, some “like” operations can be implemented with the fast pos() function.
-
Types of JSON data structures are complicated and the data is slow to handle. We can create redundant data columns when necessary.
-
To perform multiple searches or batch search on a certain set (such as the irregular foreign-key-based associations), we can first sort the set and then use binary search or create the in-memory index; but both sorting and indexing are ineffective for a single search.
-
Try not to fetch and store cursor data (meaning storing it as a temporary file), but try using the program cursor.
As previously mentioned, programmers do not need to master all those techniques (actually not all techniques are listed here), they can look up methods in the related documentation (Performance Optimization and Technical Subject) for stages causing the poor performance after the test. And during this process they become familiar with the techniques.
It is worth noting that SPL often provides more than one method under one functionality to suit different scenarios. So we can try writing different codes for the same target; for example, replacing pjoin() with T.new/news() and sometimes, using the attached table to store data to get better performance.
Cursor is often used to handle big data tables. But many computations on the cursor are delayed ones; the recorded execution time is not accurate, and cursor data is not easy to view and leads to inconvenient debugging. In this case we can use the corresponding in-memory data tables to conduct debugging. SPL intentionally designs basically consistent syntax for computations on cursors and those on table sequences. To conduct debugging, programmers just need to change the cursor creation part at the beginning of the code to table sequence creation. Then change back after debugging is finished.
In addition, SPL at present works under JVM, and Java will frequently start GC actions when available memory become insufficient. This will result in very unsteady performance test results. Programmers need to keep a watchful eye on GC status during debugging.
In the end, there are two points to make.
Five stages are discussed here, which are generally the process of performance optimization. But it is likely that the optimization cannot be done in one round of going through the five stages. Probably we would go back to modify the storage design when new problems appear during debugging. The whole process may involve repeated modification of certain stages before it is completed, particularly for the beginners.
Routine isn’t the key to all problems, as there isn’t a universal methodology for solving all math problems even if many methods are taught at school. However, the routine is useful and important in dealing with many problems.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code