-
Notifications
You must be signed in to change notification settings - Fork 340
SPL Cloud Data Warehouse
The overwhelming majority of the cloud data warehouse services on the market (actually we can say all of them) are based on SQL. After all, data warehouses’ primary responsibility is analytical computing. NoSQL has technological advantages in handling TP but they are not nearly as good as SQL in dealing with AP. Cloud service providers who participate in the competition think inside this box, believing that supplying SQL alone is enough for a complete collection of functionalities while trying to take the upper hand on performance, operation and maintenance, price and other aspects.
That’s of course a suitable way. There are already many successful service providers.
But my question is this: Is the 50-year-old SQL really right for implementing the cloud data analytics which has become popular only of late years?
And is SPL, a new data computing technique, a better alternative?
Let’s first look at how SQL and SPL implement essential cloud service characteristics.
1. Elastic computing
Cloud services require elastic computing, which requires the separation of storage and computation (here we skip the reason behind this). SQL (represented by RDB), however, has never taken account of the separation. It breaks through the file system to directly access hard disks instead. To transform it into a mechanism where storage and computation is separated and that uses the NFS (Network File System) storage and cloud object storage, we need to restructure the language from the base-level. This is a complex job that will bring about not a few implementation risks. Therefore, the separation of storage and computation has become an important SQL database technology issue.
SPL storage is based on file system, so it naturally supports NFS. There isn’t any technical barrier. It is easy for SPL to use cloud object storage and achieve the separation of storage and computation. We just need to create a simple file object having a cache (In this industry there are many mature open-source technologies for doing this).
Elastic computing demands launching new computing nodes as fast as possible. But SQL execution needs to load metadata first. When data structure is complex, it takes long to load all metadata at once, making quick startup of computing nodes impossible. In order to get rid of the metadata loading time, we can create a center that administers metadata of all users, which amounts to metadata being loaded in advance. But this results in complex operations and maintenance, and increases burden on the center (usually there are a huge number of users on cloud). Another way is to classify the metadata and load it ad hoc (which means loading the desired metadata only), but that complicates the administration and even lengthens the whole loading time. No methods can satisfy both sides.
SPL does not have metadata. We just execute scripts directly.
2. Multitenancy
Multitenancy is preferred by cloud service providers because cloud products need to use the architecture to increase hardware utilization.
Because of the existence of metadata, VM for executing SQL cannot serve multiple users and segregation is required. Shifting to the target metadata is needed when a VM is used to perform another task, which complicates operations and maintenance.
SPL has no metadata, making it unnecessary to segregate users and enabling different tasks to be performed on same JVM. We need only to handle workload balance (SQL needs to handle it, too). The operations and maintenance become really simple.
3. Serverless
Serverless is another hot concept these days.
In SQL, the existence of metadata is equivalent to an indispensable environment that is related to the frontend users on the backend. This is incompatible with the serverless concept. Though we can simulate the serverless effect transparently, we still need to handle the above-mentioned metadata loading issue and the resultant complications of operations and maintenance.
Without the metadata, SPL just execute service script directly.
According to the previous discussion, the existence of metadata is a key difference between SQL cloud architecture and SPL cloud architecture. The SPL cloud that does not have the metadata is lighter than SQL cloud.
The underlying reason is that SQL and SPL organize data in different ways.
SQL organizes data as a whole logically. Data on the same theme forms a database, and there is metadata for describing data structure and relationship in the database. A database is a logically undivided entity, which has a clear boundary between inside and outside. Data in different databases cannot be combined and computed together. These constitute the database closedness. Early databases were thus designed in order to ensure data consistency, which is important for handling TP. But the closedness is pointless for AP tasks where a major part is analytical computing. Relational databases, however, are designed as they originally were, inherit the architecture, and bring it to the cloud.
Having metadata, as mentioned above, adds complexity to the implementation of elastic computing, multi-tenancy and serverless computing. You can accomplish them eventually but this pushes up the cost of operations and maintenance.
Closedness can cause more problems.
These days, data fed to applications come from a much wider range of sources. There are a variety of data sources and so are their formats. The closed SQL databases cannot open their computing ability to data outside it, but can only load the data into it for further computation. This adds an ETL action, increases workload for both the programmer and the database, and results in the loss of data real-timeness. Often data outside the database has irregular formats, and it is not easy to load it into databases having strong constraints. Moreover, even ETL requires to first load the raw data into the database in order to use the latter’s computing ability, resulting in ELT and increased database workload. None can be achieved without increasing the cost of operations and maintenance.
The closed mechanism also makes it difficult for user to speed up queries through flexible space-time tradeoff. We all know that storage resources are much cheaper than computational resources. If we store data redundantly and in different forms for different computing goals, we can obtain better query experience. SQL stores data in tables. Users’ creation of too many tables will cause more metadata and seriously increase the cost of operations and maintenance (There are tens of thousands of intermediate tables in the centralized data warehouses of many large organizations after years of accumulation, which have never been deleted out of extreme caution. Their metadata is huge. No one anticipated when relational databases were invented that the number of data tables can reach an order of magnitude of ten thousand).
Yet the heavy, closed, unified data organization has infiltrated into all aspects of the SQL language, making it hard to root it out.
SPL data organization is logically fragmented. There isn’t the concept of “theme”. Thus, no metadata and no difference between “in-database” and “outside-database” exist. Any accessible data can be computed. The only difference is that sources have different access performance. In SPL, there are specialized high-performance data storage formats (columnar storage, data compression, etc.). Logically, there isn’t any difference between data stored in these formats and data retrieved from a text file or from RESTful. Even data coming from a SQL database can be handled in the same way. These are manifestations of open computing ability.
The file-system-based physical data storage naturally supports separation of storage and computation. Without metadata, it is easy to achieve lightweight elastic computing, multi-tenancy and serverless.
The open computing ability and the elimination of data loading action from data arrangement phase allow users to compute data directly. Mixed computations involving different data sources can be handled directly, too. That creates both lightweight and real-time computations.
The fragmented data organization scheme enables users to devise their own space-time tradeoff strategies, which only involves storing more files. Even if the number of redundant files reaches over ten thousand (it is easy for contemporary file systems to handle data of such a scale), there isn’t any burden caused by metadata (as there isn’t metadata at all). Also, it is simple and clear to manage data files by category under the file system’s tree structure. And the cost of operations and maintenance is not high.
SPL’s data organization schema is naturally open and lightweight, and is suitable to be moved to the cloud.
Another important difference between SPL and SQL is the language ability and resulted algorithmic implementation ability. This also has great impact on the costs of applications of cloud computing.
We all know that SQL’s language ability is not complete. The language even cannot handle a pure data task alone. A familiar example is the consecutive rising stock problem, and more complicated examples include the ecommerce conversion funnel (both are not rare demands but very common in business computing). Many programmers will tell you, almost intuitively, that they are not suitable to be handled in SQL. Instead, they say data needs to be retrieved from the source and handled in Python or Java.
Their gut reaction is reasonable because it is both annoying and hard to code such computations in SQL. The right choice is indeed retrieving data and computing it in another language. Sometimes programmers need to write UDFs using a low-level programming language in the database in order to obtain better performance. When the database is used at home, that is the usual practice. Though it complicates framework and technology stack, the problem does not look so serious as there is only one user having the conn.
When databases are moved up to the cloud, the problem becomes more than the complicated framework and technology stack. Unlike the private environment, cloud is public and individuality is restricted. Familiar languages and tools may not be available on the cloud and the available is the only choice. Usually, cloud databases only allow the use of interpreted languages like JS, which can be simply deployed as the SQL extension, in order to avoid too much load on operations and maintenance. Yet JS performs too bad in terms of both functionality and performance, only better than not at all (as SQL is too annoying in dealing with certain computing scenarios). Providing interfaces of compiled languages, such as Java and C++ (often UDF is written in these languages) amounts to letting each user have their own database engine. This makes operations and maintenance exceptionally complicated (seriously aggravate difficulties of the above-mentioned metadata loading and the VM launch and reuse). Of course, we can eventually accomplish all this, but both cost and risk will increase.
We all know the problem, including cloud data warehouse service providers. Some providers have responded by making every effort to deeply merge Python into SQL. But whether the two very different systems can go harmoniously together remains unclear.
The current situation is that you probably still need to depend on SQL to deal with everything on the cloud. Actually, there are coding aces who have successfully written the conversion funnel in SQL.
You can work out everything in SQL because, in theory, the language has a complete set of computing abilities (plus stored procedure). But a programmer needs to be the best to be able to code them, and the development cost is high (debugging SQL is always not easy). Isn’t it that moving up to the cloud should have made programming more convenient and easier and reduced cost? Plus, developers’ time is precious as they are paid handsomely. What’s more, it is probably that two months later even the programmer themselves cannot understand the code. The maintenance is hard and expensive, too.
SQL’s weak language ability also makes it hard to achieve many low-complexity algorithms, such as user distinct count in the e-commerce sector. If data is already ordered, we can accomplish the computation fast using very small memory space. SQL’s theoretical system does not ensure the orderliness in data storage and cannot enable writing order-based computing logics. The only choice is turning to database optimizer, which is often puzzled by the roundabout nested SQL statements. In view of this, database vendors stand still and adopt high-complexity algorithms. To handle distinct count, for instance, databases need to keep traversed key values, and compute and compare HASH values. The computing process is slow, and CPU and memory resources consumption are high. Now the only way out is to buy more computing resources.
SPL’s language ability is more powerful. It not only can perform order-based computations, such as the above distinct count example needs, but provides a wealth of functions related to matrix, fitting and AI modelling. Most of the data computing tasks can be handled within the SPL system without the need to turn to other languages and write UDFs. Both the framework and the technology stack are simple.
SPL code is simpler. Programmers just need to write code according to the natural way of thinking, even for complex tasks. This is much more direct than SQL. Development cost is greatly reduced (SPL debugger is much more convenient to use than SQL’s), code is simple and easy to read, and maintenance is easy. Of course, programmers need to learn SPL in order to grasp its unique concepts.
It is easy to code low-complexity algorithms in SPL, speeding up execution as much as possible and reducing hardware resource consumption. This helps lower the application cost. Still take the user distinct count as example, SPL can achieve a high-performance computation making use of the ordered storage with small memory usage. The computing process is fast and cost-effective.
There is another advantage of using SPL. The strong computing ability makes it unnecessary to use clusters for most computing tasks for most users. They only need a powerful VM. This makes operations and maintenance much simpler , enabling users to give as much the limited and precious resources to computations as possible.
Resources on the cloud are basically unlimited. Yet user’s money is definitely limited. Actually, resources on the cloud, in terms of cost, are expensive. The computing resources for multi-tenancy (CPU and memory space), particularly, are far from being infinite.
Well, it appears that SPL, compared with SQL, does not have any weaknesses.
Objectively speaking, this is not true. There is no such thing as perfection in the world.
After decades of progress, many SQL databases have developed powerful optimization engines. For simple computing scenarios suitable to be handled in SQL, the optimizers are able to optimize slow statements written by mediocre programmers and increase their performance. In this sense programmers do not need to be the best. Some scenarios (such as multidimensional analysis) have already been heavily optimized, and certain SQL engines can produce amazing performance.
Compared with SQL, SPL has done little in auto-optimization. It depends almost completely on programmers writing low-complexity code to achieve high performance. Programmers need some training and exercises in order to be familiar with SPL concepts and library functions. There is a threshold for getting started. Besides, SPL is written in Java, making it easy to adapt to the cloud environment while being unable to making most use of CPU and memory resources due to JVM limitations. For some simple scenarios, SPL is not as good as a fully optimized SQL engine in terms of performance.
Now it’s time to sum up. Let’s look at and compare SPL cloud data warehouse and SQL cloud data warehouse.
SPL cloud has light architecture, which enables simple administration, high reuse rate, low error rate, as well as low cost of operations and maintenance, creating balanced cost of maintenance without a large-scale business. SQL cloud is heavy. It usually needs a massive-scale business to dilute the cost of operations and maintenance.
SPL’s powerful language ability makes it easy to achieve high-performance algorithms having complex logics, leading to low development cost and hardware resource consumption. This helps keep the cost of users in a rather low level. SQL is only able to achieve low cost for simple scenarios. When facing complex computing goals, the cost becomes uncontrollable.
As data storage is cheap, we can exploit both SQL cloud and SPL cloud through data redundancy. We use SPL cloud, instead of SQL cloud, to deal with resource-intensive tasks, and giving tasks suitable to be handled in SQL to the fully-optimized SQL engine. Respectively, the two clouds handle requirements that they are good at. The mixed use of them can greatly reduce users’ total costs.
SPL’s open computing ability enables straightforward and efficient data arrangement and preparation (ETL). It can be also used to prepare data for SQL cloud (loading the processed raw data into SQL cloud).
SPL’s open computing ability offers solution to HTAP requirements. Users do not need to move the TP system to the new HTAP database (the migration has fairly high risks). Instead, they can still update the cold data to the AP database periodically (or use the SPL cloud directly), and perform mixed computations between the TP database and the AP database using SPL. As the hot data in TP database only takes up a small portion, retrieving it ad hoc for computations does not put too much pressure on resources. On the other hand, we can meticulously arrange data in the AP database to obtain higher computing performance. However, there is not enough time for a pure HTAP database to perform complicated data arrangement and preparation on a huge amount of data in the AP database. Its overall computing performance thus becomes bad.
The uses of SPL cloud are beyond TP. It can be used as data microservices cloud. SPL boasts all-around language abilities (including branch, loop and subroutine) for achieving extremely complex business logics. Each SPL script is like a microservice, where the frontend application requests services, and the cloud executes the script and returns result. We can write accessing the TP database or any other data sources in the script. We can also buffer the intermediate data to the cloud using the cloud storage. The application development only needs to take care of the interface while leaving data handling services to SPL cloud.
By setting up the SPL cloud on the mature, large-scale cloud storage, we can achieve true data lakehouse. Data of non-standardized formats can be first thrown to the “lake” to be computed by SPL, and processed and collated into the “warehouse” when high performance is demanded. This way the building of lakehouse is step by step. SQL’s data organization mechanism refuses unstandardized data. There is the “house” but isn’t the “lake”. Other technologies are thus needed to accomplish the data lake. And it is hard to achieve real lakehouse.
Much as we have said, SPL cloud is currently under development. We are putting forward ideas and views. There are still a lot of things to do to transform a software into a cloud service. This is the key job of our SPL team in the next phase. We are looking forward to the day when our SPL cloud product is launched.
Finally, we think there is an analogy between SQL and SPL.
SQL is like the rail transport, which has complex management and scheduling system and requires pre-planning lines and optimization. Trains run fast on certain fixed lines and have large carrying capacity. Passengers buy tickets and get on board without taking care of the driving. Yet this means of transportation is clumsy and inefficient when it needs to handle flexible and complicated requirements.
SPL is like the auto, which does not have a scheduling system but can handle complicated requirements flexibly and efficiently. Passengers, where the driver is one of them, need to plan the route in their own ways. And the driver should get a special license before hitting the road. On the other hand, autos do not have as large carrying capacity as trains in dealing with bulk cargo shipping demands on certain fixed lines.
The demand for autos is much higher than that for trains probably. Yet autos cannot completely replace trains.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code