
Create Your Distributed Database on Kubernetes with Existing Monolithic Databases


Key Takeaways

  • Cloud-native is becoming the trend; for databases, Kubernetes is the environment provider, and Apache ShardingSphere now provides a solution for bringing monolithic databases onto Kubernetes.
  • Apache ShardingSphere can transform any database into a distributed database system and enhance it with features such as sharding, elastic scaling, and encryption.
  • Users can deploy and manage ShardingSphere clusters and create their distributed database system on Kubernetes using these tools, regardless of where their databases reside.
  • The advantages of running ShardingSphere on Kubernetes include leveraging existing database capacity; efficient and steady migration; a more cloud-native way of running and governance alongside a traditional SQL-like approach; flexible auto-scaling and other necessary database features; a choice of clients; and open-source support.
  • The example in this article demonstrates how to deploy ShardingSphere-Operator, create a sharding table using DistSQL, and test the scaling and HA of the ShardingSphere-Proxy cluster.

Background

Most of the recent convenience upgrades that have blessed people's lives in the 21st century can be traced back to the widespread adoption of the Internet.

Constant connectivity at our fingertips has improved our lives and created new technical infrastructure requirements to support high-performance Internet services. Developers and DevOps teams have become focused on ensuring the backend infrastructure's availability, consistency, scalability, resilience, and fully automated management.

Examples of issues that tech teams constantly struggle with include managing and storing large amounts of business data, creating the conditions for infrastructure to deliver optimal service to applications, designing technical architecture that anticipates future requirements, and evolving modern applications so that they can "live" in the cloud.

The cloud is a game-changing technology, and if you haven't yet, you should get familiar with it. It has already transformed infrastructure as we know it, from development to delivery, deployment, and maintenance. Nowadays, modern applications are embracing the concept of anything-as-a-service from various cloud vendors, and developer and operations teams are considering upgrading legacy workloads to future cloud-native applications.

Microservices on Kubernetes

To address the challenges mentioned above, we are witnessing an evolution of the application layer from monolithic services to microservices. By dividing a single monolithic service into smaller units, modern applications can be developed, deployed, and upgraded independently, without unwanted side effects on one another.

Moreover, to decouple and simplify inter-service communication, such as API calls, the service mesh appeared and took over. Kubernetes provides an abstract platform and mechanism for this evolution, which explains its popularity.

If I had to pinpoint the reason why Kubernetes is so popular, I’d say that it’s because, according to the Kubernetes docs:

Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example, Kubernetes can easily manage a canary deployment for your system. (From "Why you need Kubernetes and what it can do" section.)

Kubernetes is an ideal platform for managing the microservice’s lifecycle, but what about the database, a stateful service?

Databases

The application layer has adopted microservices as the solution to address the issues previously introduced here. Still, when it comes to the database layer, the situation is a little different.

To answer the pain points we raised, the database layer uses a different yet somewhat similar method: sharding, a.k.a. distributed architecture.

Currently, this distributed architecture is ubiquitous, whether we're talking about NoSQL databases, such as MongoDB, Cassandra, HBase, DynamoDB, or NewSQL databases, such as CockroachDB, Google Spanner, Aurora, and so forth. Distributed databases split the monolithic database into smaller units, or shards, for higher performance, improved capacity, elastic scalability, and more.

One thing all of these database vendors have in common is that they all must consider historical migration to streamline this evolution process. They all provide data migration from existing Oracle, MySQL, PostgreSQL, and SQL Server databases, just to name a few, to their new database offerings. That's why CockroachDB is compatible with the PostgreSQL protocol, Vitess provides a sharding feature for MySQL, and AWS offers Aurora-MySQL and Aurora-PostgreSQL.

Database on Cloud and Kubernetes

The advent of the cloud represents the next challenge for databases. Cloud platforms that are "go-on-demand," "everything-as-a-service," or "out-of-box" are currently changing the tech world.

Consider an application developer. To keep pace with current trends, the developer adheres to the cloud-native concept and prefers to deliver applications on the cloud or Kubernetes. Does this mean it is time for databases to be on the cloud or Kubernetes? The majority of readers would probably answer with a resounding yes, which explains why the market share of Database-as-a-Service (DBaaS) is steadily increasing.

Nevertheless, if you're on the buy side of these services, you may wonder which vendor can promise you indefinite support. The truth is that nobody can give a definitive answer, so multi-cloud comes to mind, and databases on Kubernetes seem to have the potential to deliver on this front.

This is because Kubernetes is essentially an abstraction layer for container orchestration and is highly configurable and extensible, allowing users to do custom coding for their specific scenarios. Volumes on Kubernetes, for example, are implemented and provided by many cloud vendors. If services are deployed on Kubernetes, applications will be able to interact with Kubernetes rather than different types of specific cloud services or infrastructure. This philosophy has already proven to work well in the case of stateless applications or microservices. As a result of these successful cases, people are thinking about how to put databases on Kubernetes to become cloud neutral.

A drawback to this solution is that it is more difficult to manage than the application layer, as Kubernetes is designed for stateless applications rather than databases and other stateful applications. Many attempts leverage Kubernetes' fundamental mechanisms, such as StatefulSet and Persistent Volume, and overlay custom code to address the database challenge on Kubernetes. This approach can be seen in the operators for MongoDB, CockroachDB, PostgreSQL, and other databases.

Database Compute-Storage Architecture

This approach has become common, but is it the only one? My answer is no, and the following content will introduce you to and demonstrate another method for converting your existing monolithic database into a distributed database system running on Kubernetes in a more cloud-native pattern.

Let's first consider why this is possible.

A database fundamentally provides two capabilities: computing and storage.

MySQL, PostgreSQL, and other single-node databases combine the two components, deploying them together on a single server or container.

Apache ShardingSphere

Apache ShardingSphere is an ecosystem that transforms any database into a distributed database system and enhances it with sharding, elastic scaling, encryption features, and more. It provides two clients: ShardingSphere-Proxy and ShardingSphere-JDBC.

ShardingSphere-Proxy is a transparent database proxy that acts as a MySQL or PostgreSQL database server while supporting sharding databases, traffic governance (e.g., read/write splitting), automatically encrypting data, SQL auditing, and so on. All of its features are designed as plugins, allowing users to leverage DistSQL (Distributed SQL) or a YAML configuration to select and enable only their desired features.

ShardingSphere-JDBC is a lightweight Java framework that brings additional features to Java’s JDBC layer. This driver shares most of the same features with ShardingSphere-Proxy.

As I've introduced earlier, if we view monolithic databases as shards (aka storage nodes) and ShardingSphere-Proxy or ShardingSphere-JDBC as the global server (aka computing node), then ultimately, the result is a distributed database system.

Because ShardingSphere-Proxy acts as a MySQL or PostgreSQL server, there is no need to change the connection method to your legacy databases while ShardingSphere-JDBC implements the JDBC standard interface. This significantly minimizes the learning curve and migration costs.

Furthermore, ShardingSphere provides DistSQL, a SQL-style language for managing your sharding database and dynamically controlling the distributed database system's workloads, such as SQL audit, read/write splitting, authority, and so on.

For example, you may use `CREATE TABLE t_order ()` SQL to create a new table in MySQL. With ShardingSphere-Proxy, `CREATE SHARDING TABLE RULE t_order ()` will help you create a sharding table in your newly upgraded distributed database system.
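
To give a flavor of the syntax, here is a hedged sketch of such a rule using DistSQL's auto-sharding statements (the keywords follow the 5.2.x syntax, e.g., `STORAGE_UNITS`, and vary across releases; names and properties are illustrative):

-- Declare a sharding rule that splits t_order into 4 shards
-- across two registered storage units (illustrative).
CREATE SHARDING TABLE RULE t_order (
    STORAGE_UNITS(ds_0, ds_1),
    SHARDING_COLUMN=order_id,
    TYPE(NAME="MOD", PROPERTIES("sharding-count"="4"))
);

From the client's point of view, this is just another SQL statement executed over the same MySQL or PostgreSQL connection.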

ShardingSphere-On-Cloud

So far, we’ve solved the sharding problem, but how do we make it work on Kubernetes? ShardingSphere-on-cloud provides ShardingSphere-Operator-Chart and ShardingSphere-Chart to help users deploy ShardingSphere-Proxy and ShardingSphere-Operator clusters on Kubernetes.

ShardingSphere-Chart and ShardingSphere-Operator-Chart

These two charts help users deploy the ShardingSphere-Proxy cluster, including the proxies, the governance center, and the database connection driver, as well as ShardingSphere-Operator, using Helm commands.

ShardingSphere-Operator

ShardingSphere-Operator is a Kubernetes operator built around predefined CustomResourceDefinitions that describe a ShardingSphere-Proxy deployment on Kubernetes. Currently, the operator provides an HPA (Horizontal Pod Autoscaler) based on CPU metrics and maintains the desired replica number to keep ShardingSphere-Proxy highly available. Thanks to community feedback throughout development iterations, we've found that autoscaling and availability are our users' foremost concerns. In the future, the open-source community will release even more useful features.

New solution

Users can easily deploy and manage ShardingSphere clusters and create their distributed database system on Kubernetes using these tools, regardless of where their monolithic databases reside.

As previously stated, a database is made up of computing nodes and storage nodes. A distributed database will divide and distribute these nodes. As a result, you can use your existing databases as the new distributed database system’s storage nodes. The highlight of this solution is adopting a flexible computing-storage-splitting architecture, utilizing Kubernetes to manage stateless computing nodes, allowing your database to reside anywhere and drastically reducing upgrading costs.

ShardingSphere-Proxy will act as the global computing node that handles user requests, obtains local result sets from the sharded storage nodes, and computes the final result set for users. This means there is no need to perform risky manipulation work on your database clusters. You only have to import ShardingSphere into your database infrastructure layer and combine the databases and ShardingSphere into a distributed database system.

ShardingSphere-Proxy is a stateless application that is best suited to being managed on Kubernetes. As a stateful application, your databases can run on Kubernetes, any cloud, or on-premise.

On the other hand, ShardingSphere-Operator works on Kubernetes to provide availability and auto-scaling for the ShardingSphere-Proxy cluster. Users can scale ShardingSphere-Proxy (computing nodes) and databases (storage nodes) in or out as needed. For example, some users simply want more computing power, and ShardingSphere-Operator will automatically scale out ShardingSphere-Proxy in seconds. Others may discover that they require more storage capacity; in this case, they simply need to spin up more empty database instances and execute a DistSQL command. ShardingSphere-Proxy will reshard the data across the old and new databases to improve capacity and performance.

Finally, ShardingSphere can help users smoothly shard existing database clusters and bring them onto Kubernetes in a more native manner. Instead of fundamentally breaking the current database infrastructure and seeking a new distributed database that can be managed efficiently on Kubernetes as a stateful application, why don't we consider the issue from the other side: how can we make the distributed database system more stateless and leverage the existing database clusters? Let me show you two examples of real-world scenarios.

Databases on Kubernetes

Consider that you have already deployed databases, such as MySQL and PostgreSQL, to Kubernetes using Helm charts or other methods and that you are now only using ShardingSphere charts to deploy ShardingSphere-Proxy and ShardingSphere-Operator clusters.

Once the computing nodes have been deployed, we connect to ShardingSphere-Proxy in the usual way and use DistSQL to make the Proxy aware of the databases. Finally, the computing nodes connect to the storage nodes to form the final distributed database solution.
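
Registering the databases is done with DistSQL's resource statements. A minimal sketch, assuming 5.2.x syntax and placeholder in-cluster connection details:

-- Register an existing PostgreSQL database running on Kubernetes
-- as a storage node of the distributed system (placeholder values).
REGISTER STORAGE UNIT ds_0 (
    URL="jdbc:postgresql://postgres-0.sharding-test.svc.cluster.local:5432/demo_db",
    USER="postgres",
    PASSWORD="<password>"
);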

Databases on cloud or on-premise

If you have databases on the cloud or on-premises, the deployment architecture differs only in where the storage nodes reside: the computing nodes, ShardingSphere-Operator and ShardingSphere-Proxy, run on Kubernetes, while your databases, the storage nodes, are located outside of Kubernetes.

Pros and Cons

We’ve seen a high-level introduction to ShardingSphere and some real-world examples of deployment. Let me summarize its pros and cons based on these real-world cases and the previous solution introduction to help you decide whether to adopt it based on your particular case.

Pros

  • Leverage your existing database capability

Instead of discarding your entire legacy database architecture, this is a smooth and safe way to own a distributed database system.

  • Migrate efficiently and steadily

With almost no downtime, ShardingSphere offers a migration process that allows you to move and shard your databases simultaneously.

  • Traditional SQL-like approach to harness it

ShardingSphere’s DistSQL enables you to use the distributed database system’s features, such as sharding, data encryption, traffic governance, and so on, in a database native manner, i.e., SQL.

  • Flexible auto-scaling feature for separate computing and storage power

You can scale-in or scale-out ShardingSphere-Proxy and Databases separately and flexibly depending on your needs, thanks to a non-aggressive computing-storage splitting architecture.

  • More cloud-native running and governance way

ShardingSphere-Proxy is much easier to manage and natively deploy on Kubernetes because it is essentially a type of stateless global computing server that also acts as a database server.

  • Multi-cloud or cross-cloud

As stateful storage nodes, databases can reside on Kubernetes or on any cloud to avoid a single cloud platform lock-in. With ShardingSphere to connect your nodes, you will get a distributed database system.

  • More necessary features around databases

ShardingSphere is a database ecosystem that provides data encryption, authentication, read/write splitting, SQL auditing, and other useful features. Users gradually discover the advantages of these features, even when they don't need sharding.

  • More clients for you to choose from, or a hybrid one

ShardingSphere offers two clients based on user requirements: ShardingSphere-Proxy and ShardingSphere-JDBC. Generally, ShardingSphere-JDBC has better performance than ShardingSphere-Proxy, whereas ShardingSphere-Proxy supports all development languages and database management capabilities. A hybrid architecture with ShardingSphere-JDBC and ShardingSphere-Proxy is also a good way to reconcile their capabilities.

  • Open-source support

Apache ShardingSphere is one of the Apache Software Foundation's Top-Level Projects. It has been open-source for over five years. Backed by a mature community, it is a high-quality project with many use cases, detailed documentation, and strong community support.

Cons

  • Distributed transactions

Transactions are critical even in a distributed database system. However, because this architecture was not built up from the storage layer, it currently relies on the XA protocol to coordinate transaction handling across data sources. It is not yet a perfect, comprehensive distributed transaction solution.

  • SQL-compatibility issue

Some SQL queries work well in a storage node (database) but not yet in this new distributed system. Achieving 100% SQL compatibility is difficult, but thanks to the open-source community, we're getting close.

  • Consistent global backup

Although ShardingSphere defines itself as a computing database server, many users prefer to think of it and their databases as a distributed database. As a result, people must think about obtaining a consistent global backup of this distributed database system. ShardingSphere is working on such a feature, but it is not yet supported (release 5.2.1). Users may require manual or RDS backups of these databases.

  • Some overhead

Each request is received by ShardingSphere, computed, and forwarded to the storage nodes, so some overhead per query is unavoidable. The same mechanism exists in any distributed database compared to a monolithic one.

Hands-on

This section demonstrates how to use ShardingSphere and PostgreSQL RDS to build a distributed PostgreSQL database that will allow users to shard data across two PostgreSQL instances.

For this demonstration, ShardingSphere-Proxy runs on Kubernetes, and PostgreSQL RDS runs on AWS.

This demo will include the following major sections:

  1. Deploy the ShardingSphere-Proxy cluster and ShardingSphere-Operator.
  2. Create a distributed database and table using Distributed SQL.
  3. Test the Scaling and HA of the ShardingSphere-Proxy cluster (computing nodes).

Prepare database RDS

We need to create two PostgreSQL RDS instances on AWS or any other cloud. They will act as storage nodes.

Deploy ShardingSphere-Operator

  1. Download the repo, create a namespace named `sharding-test` on Kubernetes, and deploy ShardingSphere-Operator:

git clone https://github.com/apache/shardingsphere-on-cloud
kubectl create ns sharding-test
cd charts/shardingsphere-operator
helm dependency build
cd ../
helm install shardingsphere-operator shardingsphere-operator -n sharding-test

  2. Set `automaticScaling: true` and `proxy-frontend-database-protocol-type: PostgreSQL` in the values.yaml of `shardingsphere-operator-cluster`, then deploy it:

cd shardingsphere-operator-cluster
vim values.yaml
helm dependency build
cd ..
helm install shardingsphere-cluster shardingsphere-operator-cluster -n sharding-test

  3. Following these operations, you will have a ShardingSphere-Proxy cluster containing one Proxy instance, two Operator instances, and one Proxy governance instance; you can verify the pods with `kubectl get pods -n sharding-test`.

Create a sharding table by using Distributed SQL

  1. Log in to ShardingSphere-Proxy and register the two PostgreSQL instances with the Proxy:

kubectl port-forward --namespace sharding-test svc/shardingsphere-cluster-shardingsphere-operator-cluster 3307:3307
psql --host 127.0.0.1 -U root -p 3307 -d postgres
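
The original walkthrough shows the registration step as a screenshot; the statements below are a hedged reconstruction using 5.2.x DistSQL syntax, with placeholder RDS endpoints and credentials:

-- Register the two PostgreSQL RDS instances as storage nodes ds_0 and ds_1.
REGISTER STORAGE UNIT ds_0 (
    URL="jdbc:postgresql://<rds-instance-0>.rds.amazonaws.com:5432/test_db",
    USER="postgres",
    PASSWORD="<password>"
);
REGISTER STORAGE UNIT ds_1 (
    URL="jdbc:postgresql://<rds-instance-1>.rds.amazonaws.com:5432/test_db",
    USER="postgres",
    PASSWORD="<password>"
);
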
  2. Execute DistSQL to create a sharding table `t_user` sharded by MOD(user_id, 4), and show the actual tables behind the logic table `t_user`, as sketched below.
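
A hedged reconstruction of this step, assuming the 5.2.x DistSQL auto-sharding syntax (the column definitions are illustrative):

-- Sharding rule: split t_user into 4 shards by MOD(user_id, 4)
-- across the two registered storage units.
CREATE SHARDING TABLE RULE t_user (
    STORAGE_UNITS(ds_0, ds_1),
    SHARDING_COLUMN=user_id,
    TYPE(NAME="MOD", PROPERTIES("sharding-count"="4"))
);

-- A regular CREATE TABLE on the Proxy materializes the shards
-- (t_user_0 ... t_user_3) behind the logical table t_user.
CREATE TABLE t_user (user_id INT PRIMARY KEY, user_name VARCHAR(50));

-- List the physical tables backing the logical table.
SHOW SHARDING TABLE NODES;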

  3. Insert some test rows and run a query on ShardingSphere-Proxy to get the merged final result, for example:
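
-- Rows inserted through the Proxy are routed by MOD(user_id, 4).
INSERT INTO t_user (user_id, user_name)
VALUES (1, 'alice'), (2, 'bob'), (3, 'carol'), (4, 'dave');

-- The Proxy gathers the partial results from both PostgreSQL
-- instances and returns one merged result set.
SELECT * FROM t_user ORDER BY user_id;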

  4. Log in to the two PostgreSQL instances to see their local results, for example:
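
Connecting to each RDS instance directly (for example, with `psql`), you can query the physical shard tables and see that each holds only part of the rows; a sketch, assuming the default auto-table naming:

-- First check on the Proxy how shards map to instances:
-- SHOW SHARDING TABLE NODES;

-- Then, on a PostgreSQL instance, query one of its local shards;
-- it returns only the rows routed to that shard.
SELECT * FROM t_user_0;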

This simple test shows that ShardingSphere can manage and shard your databases; users don't need to care about how the data is split across the different shards.

Test the Scaling and HA of the ShardingSphere-Proxy cluster (computing nodes)

If you discover that the TPS (transactions per second) or QPS (queries per second) of this new system are extremely high and users complain that it takes too long to open a webpage, it’s time to upgrade your database system’s computing power.

Compared to other distributed database systems, ShardingSphere-Proxy offers the simplest way to increase computing nodes. ShardingSphere-Operator ensures ShardingSphere-Proxy availability and autoscales it based on CPU metrics. Furthermore, by modifying the deployment specification (for example, the replica count in values.yaml) and upgrading the Helm release, you can scale it in or out manually.

After upgrading the release, you will have two ShardingSphere-Proxy instances and, therefore, more computing power.

If, as mentioned above, you require more storage capacity, you can take the following steps.

  1. Launch additional PostgreSQL instances in the cloud or on-premises.
  2. Add these new storage nodes to the ShardingSphere-Proxy.
  3. Run DistSQL to let ShardingSphere reshard the data across the old and new storage nodes, as sketched below.
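
A hedged sketch of steps 2 and 3, assuming 5.2.x DistSQL syntax (the endpoint and names are placeholders, and the statement that actually moves the data is version-dependent; consult the release's resharding/migration documentation):

-- Step 2: register the new PostgreSQL instance as an extra storage node.
REGISTER STORAGE UNIT ds_2 (
    URL="jdbc:postgresql://<new-instance>.rds.amazonaws.com:5432/test_db",
    USER="postgres",
    PASSWORD="<password>"
);

-- Step 3: extend the sharding rule to cover the new storage unit;
-- the data move itself is carried out by ShardingSphere's
-- resharding/migration jobs.
ALTER SHARDING TABLE RULE t_user (
    STORAGE_UNITS(ds_0, ds_1, ds_2),
    SHARDING_COLUMN=user_id,
    TYPE(NAME="MOD", PROPERTIES("sharding-count"="6"))
);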

Wrap-up

The focus of this article is a new sharding database architecture on Kubernetes that leverages your existing monolithic databases, allowing the DevOps team to evolve their database infrastructure into a modern one efficiently and smoothly.

The database computing-storage split is a vintage architecture that is reinterpreted and fully leveraged on Kubernetes today to help users address the governance issue of stateful databases on Kubernetes.

These days, distributed databases, cloud computing, open source, big data, and modern digital transformation are common buzzwords. But they represent useful new concepts, ideas, and solutions that address production concerns and needs. As I always recommend to our end-users, stay open to new ideas, learn their pros and cons, and then choose the best fit for your specific situation, as there is no such thing as a perfect solution.

 
