-
Notifications
You must be signed in to change notification settings - Fork 339
Is It Necessary to Use a Specialized In Memory Database?
It is easy to think about using an in-memory database to solve the performance problem of reporting, BI analysis, batch processing, and other data analysis tasks. An in-memory database allows storing all data permanently in the memory so that the computation external memory accesses (disk reads) are not needed, disk I/O can be avoided and data processing performance can be effectively improved.
In-memory databases are advertised to have high performance and be able to solve many performance problems in business data analyses. They are fast because of not only the zero disk I/O costs but also the application of specialized memory optimization techniques. For example, the memory supports random access and has strong parallel processing ability; and supports optimization techniques such as pre-load and pre-index to enhance the computing performance.
But these techniques are not unique to in-memory databases. Actually, all computations are performed in the memory because, after all, the CPU can only compute data in the memory and data stored in the external memory needs to be read into the memory for processing. As general memory optimization techniques, they are adopted by many database products and computing engines. As long as the memory optimization is carried out cleverly, performance will catch up. In this sense, if the memory is large enough to hold all data, every database can be called in-memory database. Now the so-called “specialized in-memory databases” only offer optimization methods for the in-memory computations. They are either not good at performing computations involving a volume of data exceeding the memory capacity, or unable to handle them (because external memory computations are much more difficult). In order to highlight advantages while downplaying disadvantages, vendors name these databases “In-memory databases” in order to distinguish them from ordinary ones. On the contrary, the ability of real specialized in-memory databases is weakened.
Therefore, there’s no such thing as “specialized in-memory databases”, there are only specialized in-memory data processing techniques. It doesn’t matter whether a product claims to be an in-memory database. The really important thing is to find out whether in-memory techniques the product uses are effective or not by testing and comparing the computing performance under a large memory.
As we have understood what in-memory databases are, let’s move on to discuss in-memory processing techniques and computing performance.
We all know that the SQL-based relational databases are still the mainstream database systems – whether the database is the so-called in-memory database or not. But it’s a pity that SQL cannot make the best use of the characteristics of memory, leaving much room for performance increase. Though certain optimization methods, such as compression, column-wise storage, index, parallel processing and vector computing, can be well implemented and employed based on SQL, it is inconvenient to implement other techniques that can greatly help increase computing performance in SQL. This is because SQL has its intrinsic limits.
SQL lacks the data type for representing records. SELECT’s result (record) is a new data table unrelated to the original table, even if it has only one record. As a result, a lot of data needs to be copied during the computation (new memory space is thus needed) and costs increase in terms of both space usage and time consumption. If there is the explicit record type, we can directly use records to form a data table. The records will store memory addresses of data in the original table; data won’t be copied frequently or in a large amount in the computing process, memory will be effectively used and performance will be higher.
For the multi-table join computation, if we can convert the foreign key in the fact table into memory addresses of the records in a dimension table in advance (pre-join), no association will be needed when records are used and we will get performance as good as that of the single table query. Yet, it is a pity that SQL cannot exploit the characteristic.
SQL also has poor support for order-based computations. The language is designed based on unordered sets. It does not define ordinal numbers for members of the unordered sets or offer location operations and neighboring member reference mechanism, making it difficult to exploit data orderliness to implement high-efficiency algorithms. For example, if data in the memory is ordered, we can use binary search to speed up the query. The method can noticeably increase performance of handling scenarios involving a large volume of data. Location by ordinal numbers is a similar and more efficient method, which can make use of the high-speed memory random access characteristic to quickly retrieve data from the memory according to the specified ordinal numbers. Both methods are difficult to implement in SQL.
Making the most use of the memory power also refers to the ability to describe data structure. The relational model (SQL) depends on two-dimensional tables to represent the relationship; it is difficult for it to describe and employ more complex data structures (such as multilevel JSON format) that are very suitable for in-memory storage and has advantages in both space utilization and usage efficiency. Frankly, a language needs to support complex data structures to implement the real grouping operation. SQL forcibly performs aggregation after each grouping action, but sometimes it is the result members of the grouping operation (such as those in each group) that we concern with. To achieve the goal, SQL needs to first invent group ordinal numbers using the subquery, producing cumbersome code, repeated queries and low efficiency.
The root of SQL’s inability to make good use of the features of memory is the theory. During the time when SQL (relational model) was invented, the computer’s memory was very small. It is understandable that the language finds it awkward to adapt to the current large-memory environment. Though contemporary databases have implemented many optimizations in engineering and improved the above-mentioned situation to some extent, the optimization engine becomes useless when the scenario gets even a little complicated; after all, it is difficult to compensate the theoretical shortcomings with engineering. Moreover, the quality and application scope of an optimization engine can only be determined through strict test and evaluation. Generally, tests are limited to a range of scenarios that SQL can easily handle and do not involve the complex ones. This makes the database product selection failure-prone and highly risked.
SPL can solve all the problems.
SPL (Structured Process Language) is an open-source computing engine intended for computing structured and semi-structured data. In order to solve the above-mentioned SQL problems, SPL is designed based on a brand-new model instead of the SQL’s relational algebra theory. The language supports both the in-memory and external memory computations. It specifically offers memory optimization methods that help achieve, even surpass, the performance of an in-memory database. When the volume of data involved exceeds the memory capacity, SPL allows to use the external memory computation (load data to the memory in batches) and in certain computing scenarios the performance is nearly as good as that of the full-memory computation, enabling high-performance computations even with small available memory space. A unique thing about SPL is that it can integrate with an application by being embedded into the latter to supply high-efficiency computing ability.
SPL also offers the already mature engineering optimization techniques that the in-memory databases use, such as the previously mentioned column-wise vector computing, pre-load and pre-indexing. These general engineering strategies are significant enough to achieve performance as good as that of the in-memory database, but they are not enough to bring into play the memory’s characteristics, which, on the contrary, is SPL’s core strength.
As we said previously, we can make use of the characteristics of memory and computations to achieve higher performance when handling scenarios involving complex data structures or computations. In-memory databases cannot exploit those characteristics because of SQL’s limits.
SPL can.
Unlike SQL’s SELECT that copies data and makes data volume bigger and efficiency lower, SPL only keeps the original records’ addresses, that is, their memory addresses, but does not copy records themselves, creating advantages in both space usage and computing efficiency. The reason SPL can do this is that it has the specialized record type for storing the original data’s memory addresses (references). SQL, however, does not have the record data type. In SQL, a single record is actually a data table consisting of a single row; different data tables cannot share records; and the filtering operation copies records to generate new ones to form a new data table, producing unsatisfactory space utilization and time costs.
We all know that the CPU accesses data from the memory via addresses. If we store the addresses of certain data beforehand, it will be very fast to access them later. Take the join operation as an example. If we store addresses of the foreign key table (dimension table)’s records in the fact table, there is no need to perform the join (HASH computations) when trying to use data of the two tables and thus we can achieve an equal performance as that of the single-table query. The pre-association method has obvious advantages in handling scenarios involving data reuse and multi-table association (the number of dimension tables is relatively big). SPL offers such a mechanism to perform a pre-join between the fact table and the multiple dimension tables (store memory addresses of the dimension table’s records in the foreign key field), save the result in the memory and speed up the computation by using the memory addresses. Find more information HERE.
SQL is based on unordered sets. This makes many high-performance algorithms impossible to implement. To solve the problem, SPL directly offers the ordered sets that enable to bring into play the full potential of orderliness. For the very complicated celestial bodies contrast computations, the computation amount can be greatly reduced if we first perform an initial filtering using the order-based binary search and this makes the subsequent computations more conveniently. SQL cannot describe the computation and thus cannot exploit the characteristic of orderliness. In actual practice, it is over 3 magnitudes slower than SPL. Find more information HERE.
SPL can make most use of the ordinal numbers to perform high-efficiency accesses. For a query task, if the value of the to-be-queried key is the target value’s ordinal number in the table sequence, or if the target value’s ordinal number can be easily obtained according to the to-be-queried value, we can use the ordinal-number-based location method to complete the computation in the constant time interval without comparisons.
What’s more, we can implement the association involving a large fact table (external memory computations) according to the ordinal numbers. When a fact table is too large to fit into the memory, the previously mentioned address reference method becomes useless. In that case, we can convert the fact table’s foreign key values into positions of their corresponding records in the dimension table, which is called numberization. Then we are able to use the more efficient ordinal-number-based location method to perform searches when trying to create association between the numberized fact table and the dimension table without any comparisons, producing performance nearly as good as that of memory address reference method. In SPL, the use of ordinal numbers is an important performance optimization strategy.
We can see that various computing scenarios are taken into account when SPL is designed. It isn’t intended just for memory or external memory; it targets both so that problems of all scenarios can be well handled. Find more information HERE.
Sometimes the computing goal has definite requirements regarding the order. For example, the security industry’s consecutive rising/falling computations require comparisons between neighboring data values. Databases that implement SQL well can use the window function to complete this type of computations. But it is very difficult to express the roundabout method even for the simple computations; and the efficiency is slow.
To count the longest consecutive rising days for a certain stock, for example, SPL needs a 3-layer nested query to express the computing process even with a window function, and the execution efficiency is poor (database optimization strategies become useless for complicated scenarios). By contrast, SPL conveniently codes the computation thanks to the support of order-based computations (order-based grouping):
stock.sort(date).group@i(price>price[-1]).max(~.len())
SPL references the directly previous/next record through the relative position.
The performance of SPL’s order-based computation is particularly outstanding in handling ecommerce industry’s customer churn rate. Find detailed explanations in the article SPL computing performance test series: funnel analysis
SPL also shows its ability of making most use the memory in describing the complicated data structures such as JSON format. It supports the multilevel data structure directly. With the generic type, SPL allows using sets as the members of a sequence, that is to say, members of a set are also sets, making the language have the natural ability to describe multilevel data structures such as JSON/XML. This also enables SPL to keep the grouped subsets (a set of sets) and perform a further computation on each grouped subset, effectively avoiding the SQL-style roundabout, nested query and helping achieve higher performance.
By offering all these techniques that enable a full exploit of the memory features, SPL outperforms in-memory databases to achieve a higher computing performance and more concise code for phrasing the computing process.
That being the case, is it still necessary for “specialized in-memory databases” to exist?
Not necessary any more. Though not advertising the concept of in-memory databases, SPL actually possesses in-memory computing techniques that are more powerful than in-memory databases.
SPL also provides external memory computation ability to handle data exceeding the memory capacity, creating a broader range of application scenarios and combining the in-memory ability and external-memory ability to bring into play the most power. The one typical example of this combination is the previously mentioned funnel analysis. A product claiming that it is the in-memory database usually does not have the external memory computation ability; the lack greatly limits the application scope. SPL, however, boasts a wider application scope.
SPL can also work as an embedded computing engine and integrate with the application through its jars. It is deployed together with the application and offers in-memory database-like high-performance computations directly from within the application; it also supports the cooperative use of the in-memory ability and external memory ability.
Having high-efficiency in-memory computing ability, SPL is completely qualified to replace the “specialized in-memory databases”. Together with the external memory ability and the flexible integration ability, SPL gets a broader application scope. It is not important that whether a product is called in-memory database or not when we are examining a product claiming to supply high-performance computation. It is a good product as long as the in-memory computing techniques are awesome and their application scope is wide. According to this rule, SPL is an ideal choice.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code