If you ask companies that operate mission-critical services, they will tell you:
1) that a relational database system is still the best choice for mission-critical data;
2) that service availability is more important than performance;
3) that high performance is good, but predictable performance is king.
This is a fact, and we know it. At NHN we have over 30,000 Web servers that operate over 150 large-scale Web and mobile services. At such scale we must know what scales, how to provide high availability, and how to operate at a predictable speed.
At Percona Live MySQL Conference 2013 I will talk about CUBRID SHARD, a universal database sharding solution for CUBRID, MySQL, and Oracle. CUBRID SHARD can be used with a heterogeneous database backend, i.e. some shards can be stored in CUBRID, some in MySQL or even Oracle. At NHN we deploy various combinations: MySQL only, MySQL + Oracle, MySQL + CUBRID, CUBRID only, and Oracle only. I will explain how DBAs can easily configure it, and how we have implemented this feature.
CUBRID SHARD lets you store an unlimited number of database shards and distribute data based on modulo, DATETIME, or hash/range calculations. Developers can even plug in their own library to calculate the SHARD_ID using a custom algorithm. At the session I will show how easy it is to set all this up. There is no need for a third-party management tool. With CUBRID SHARD, application developers do not need to modify the application logic to provide data sharding. This is the DBA's job, as all of it is handled by the database system automatically.
CUBRID SHARD is designed to be very efficient. It provides built-in distributed load balancing as well as connection and statement pooling. At the conference I will present several cases where CUBRID SHARD is deployed as a shard manager and a connection manager, or where it is used for seamless data migration between different systems.
Who should come to the session?
If you run a service that spends money on a database solution, or on the tools you need to shard databases or manage connections, you should come and learn how CUBRID SHARD can give your applications native scale-out through a single database view.
5. Sharding in Production
• Uses RDBMS with Sharding
• Data is stored as simple Key-Value.
6. Sharding Solutions
Name               Type                         Requirements (DB)           Requirements (etc.)  Interface
Hibernate Shards   AS framework                 DBMS w/ Hibernate support   Hibernate, JVM       Java
HiveDB             AS framework                 MySQL                       Hibernate, JVM       Java
dbShards           AS & Middleware              MySQL                       -                    Java, C, PHP, Python, Ruby
Gizzard (Twitter)  Middleware                   Any storage                 JVM                  Java
Spider for MySQL   Middleware & Storage Engine  MySQL                       -                    Any
Spock Proxy        Middleware                   MySQL                       -                    Any
Shard-Query        Middleware                   MySQL                       -                    PHP, RESTful API
CUBRID SHARD       Middleware                   CUBRID, MySQL, Oracle       -                    Any
8. Application & Storage Layers
Application Layer
• Hibernate Shards
• HiveDB
Disadvantage
• Requires Hibernate/Java
• Uses many XML files for configuration
• Not for running services
Storage Layer
• Spider for MySQL
Disadvantage
• Requires changing the storage engine
• Not for running services
9. Heavy Middleware
dbShards Gizzard
• Requires changing application code
• Requires agents to be installed on each DB server
• Not for running services
• Not active
10. Lightweight Middleware
• Spock Proxy
– Active project
– Lightweight
– Flexible
– Easy to configure
– No application change
11. Spock Proxy
                        Spock Proxy
Sharding rule storage   Database
Sharding strategy       Modulo
Determine sharding key  Full SQL parsing
Strength                No need to change SQL
Weakness                • Performance degradation:
                          – Extra SQL parsing
                          – Resultset merging
                        • Not all MySQL SQL is supported
                        • Single threaded
Blog post: http://www.cubrid.org/blog/dev-platform/database-sharding-platform-at-nhn/
15. Spock Proxy vs. CUBRID SHARD
                        Spock Proxy                       CUBRID SHARD
Sharding rule storage   Database                          Configuration file
Sharding strategy       Modulo                            • Modulo
                                                          • User defined hash function
Determine sharding key  Full SQL parsing                  SQL hint search
Strength                No need to change SQL             • Supports CUBRID and MySQL
                                                          • Full MySQL SQL support
                                                          • Higher performance
                                                          • No SQL parsing
                                                          • Multi-threaded
                                                          • Connection pooling
                                                          • Load balancing
                                                          • Custom sharding strategy
                                                          • Easy configuration
Weakness                • Performance degradation:        • Requires changing SQL queries to
                          – Extra SQL parsing               insert the sharding hint
                          – Resultset merging
                        • Supports MySQL only
                        • Not all MySQL SQL is supported
                        • Single threaded
Blog post: http://www.cubrid.org/blog/dev-platform/database-sharding-platform-at-nhn/
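The difference between full SQL parsing and hint search can be made concrete with a small sketch. The code below is not CUBRID SHARD's implementation; it only illustrates why locating a hint is cheap compared with parsing the whole statement. The hint spelling `/*+ shard_key */` is how I understand the CUBRID manual to write it and should be treated as an assumption, and `find_shard_key_value_offset` is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Hint marker; the exact spelling is an assumption based on the CUBRID manual. */
#define SHARD_KEY_HINT "/*+ shard_key */"

/*
 * Hypothetical helper: return the byte offset just past the first
 * shard-key hint in sql, or -1 if the statement carries no hint.
 * A single substring scan is enough; no SQL grammar is involved.
 */
long find_shard_key_value_offset(const char *sql)
{
    const char *hit;

    if (sql == NULL)
    {
        return -1;
    }

    hit = strstr(sql, SHARD_KEY_HINT);
    if (hit == NULL)
    {
        return -1;
    }

    return (long)(hit - sql) + (long)strlen(SHARD_KEY_HINT);
}
```

For a statement such as `SELECT name FROM student WHERE student_no = /*+ shard_key */ 123`, the returned offset points at the text that follows the hint, which is where the shard key value sits. A real middleware would also have to cope with whitespace variations and bind parameters, which this sketch ignores.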
16. CUBRID Facts
RDBMS
True Open Source @ www.cubrid.org
Optimized for Web services
High performance
Large DB support
High-Availability feature
DB Sharding support
90+% MySQL compatible SQL syntax + Oracle analytical functions
ACID Transactions
Online Backup
Supported by NHN Corporation
24. # 2. Create Users
• Host 1..N:
$> mysql -ushard -ppassword -hnode1
mysql> USE mysql;
mysql> GRANT ALL PRIVILEGES ON sharddb.* TO 'shard'@'localhost' IDENTIFIED BY 'shard123';
mysql> GRANT ALL PRIVILEGES ON sharddb.* TO 'shard'@'shardBrokerNode' IDENTIFIED BY 'shard123';
25. # 3. Create same tables
• Host 1..N:
$> mysql -ushard -ppassword -hnode1
mysql> USE sharddb;
mysql> CREATE TABLE tbl_users (id BIGINT PRIMARY KEY, name VARCHAR(20), age SMALLINT);
$> mysql -ushard -ppassword -hnode2
…
26. # 4. Simple Configuration
• shard.conf
– Main configuration file for CUBRID SHARD.
• shard_connection.txt
– Predefined list of shard IDs, database and host names for CUBRID/MySQL.
• shard_keys.txt
– A list of shard_key_columns and their mapping with shard_id
Doc page: http://www.cubrid.org/manual/91/en/shard.html#configuration-and-setup
28. shard_connection.txt
Set:
1. Shard ID
2. Real database name
3. Remote/local host name
# shard-id real-db-name connection-info
0 sharddb mysqlA:3306
1 sharddb mysqlB:3306
2 sharddb mysqlC:3306
…
** Host names must be identical to the output of the hostname command on every node.
Doc page: http://www.cubrid.org/manual/91/en/shard.html#setting-shard-metadata
29. shard_keys.txt
Set:
1. Min shard key
2. Max shard key
3. Shard ID
[%student_no]
# min max shard_id
0 63 0
64 127 1
128 191 2
192 255 3
** Default sharding strategy is to apply modulo 256 (SHARD_KEY_MODULAR in shard.conf).
Doc page: http://www.cubrid.org/manual/91/en/shard.html#setting-shard-metadata
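The lookup that shard_keys.txt describes can be sketched in a few lines of C: take the shard key modulo 256, then find the range that contains the result. This is only an illustration of the mapping shown on the slide, not CUBRID SHARD's code. The names `shard_range`, `STUDENT_NO_RANGES`, `KEY_MODULAR`, and `lookup_shard_id` are made up for the example, and folding negative keys back into 0..255 is my own choice, not documented behavior.

```c
#include <stddef.h>

/* One row of shard_keys.txt: an inclusive key range mapped to a shard ID. */
struct shard_range
{
    int min;
    int max;
    int shard_id;
};

/* Ranges copied from the [%student_no] example on the slide. */
static const struct shard_range STUDENT_NO_RANGES[] = {
    {0, 63, 0},
    {64, 127, 1},
    {128, 191, 2},
    {192, 255, 3},
};

/* Stands in for the modulo 256 default mentioned on the slide. */
#define KEY_MODULAR 256

/* Hypothetical helper: return the shard ID for a key, or -1 if no range matches. */
int lookup_shard_id(int shard_key)
{
    size_t i;
    size_t n = sizeof(STUDENT_NO_RANGES) / sizeof(STUDENT_NO_RANGES[0]);
    int hashed = shard_key % KEY_MODULAR;

    /* C's % keeps the sign of the dividend; fold negatives into 0..255 (assumption). */
    if (hashed < 0)
    {
        hashed += KEY_MODULAR;
    }

    for (i = 0; i < n; i++)
    {
        if (hashed >= STUDENT_NO_RANGES[i].min && hashed <= STUDENT_NO_RANGES[i].max)
        {
            return STUDENT_NO_RANGES[i].shard_id;
        }
    }

    return -1;
}
```

Because the four ranges cover 0..255 without gaps, every key lands on some shard. Leaving a gap in shard_keys.txt would make the sketch return -1 for the uncovered values, which is one reason the ranges need to be planned in advance.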
30. Custom Library
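/* SHARD_U_TYPE_* and ERROR_ON_* are assumed to come from the headers shipped with CUBRID SHARD. */
int fn_shard_key_udf(int type, void *val)
{
    int mod = 2;  /* number of shard IDs this function maps keys to */

    if (val == NULL)
    {
        return ERROR_ON_ARGUMENT;
    }

    switch (type)
    {
    case SHARD_U_TYPE_INT:
    {
        /* Integer keys: shard by modulo (in C a negative key gives a negative remainder). */
        int ival = *(int *)val;
        return ival % mod;
    }
    case SHARD_U_TYPE_STRING:
        /* String keys are not handled by this sample. */
        return ERROR_ON_MAKE_SHARD_KEY;
    default:
        return ERROR_ON_ARGUMENT;
    }

    return ERROR_ON_MAKE_SHARD_KEY;
}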
shard.conf
1. SHARD_KEY_LIBRARY_NAME
2. SHARD_KEY_FUNCTION_NAME
[%student_no]
SHARD_KEY_LIBRARY_NAME=$CUBRID/conf/shard_key_udf.so
SHARD_KEY_FUNCTION_NAME=fn_shard_key_udf
Doc page: http://www.cubrid.org/manual/91/en/shard.html#setting-user-defined-hash-function
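The sample function above returns an error for string keys. One way a custom library might handle them is to hash the string and take the result modulo the number of shards. The helper below is a minimal, self-contained sketch of that idea using the well-known djb2 string hash. `hash_string_key` is a hypothetical name, it is not part of CUBRID, and how CUBRID SHARD actually passes string values through `val` is not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: map a NUL-terminated string key to a value in
 * 0..num_shards-1 using the djb2 hash. Returns -1 on invalid arguments.
 */
int hash_string_key(const char *key, int num_shards)
{
    uint32_t hash = 5381u;

    if (key == NULL || num_shards <= 0)
    {
        return -1;
    }

    /* djb2: hash = hash * 33 + next byte; unsigned overflow wraps by definition. */
    while (*key != '\0')
    {
        hash = hash * 33u + (unsigned char)*key;
        key++;
    }

    return (int)(hash % (uint32_t)num_shards);
}
```

The same string always maps to the same value, which is the property a shard key function needs. Changing `num_shards` later moves most keys to a different shard, so the shard count has to be fixed before data is loaded.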
38. CUBRID SHARD
• Easy
– No configuration hassle
– No “moving parts”
• Reliable
– High performance
– No SPOF
• Open source
– Supported by NHN
39. CUBRID SHARD Disadvantages
Need to alter SQL to add Hints
No Data Rebalancing
Need to carefully plan the sharding strategy in advance.
No GUI monitoring tool. Only command line.
40. CUBRID SHARD is great when…
• Services are already running and stable
• But data is growing fast
• And you need a stable solution
• Quick installation and easy configuration
• Time constraints
41. Ndrive cloud storage service
• User file metadata
• Sharding strategy by user ID
• 24 master shards
– Intel(R) Xeon(R) L5640 @ 2.27GHz * 8, 16G RAM, 820G HDD
• 10TB data
• Load pattern:
– 75~80% SELECT vs. 20~25% INSERT
– Avg. ~3000 QPS/shard
– Avg. ~5% CPU load/shard
42. Ndrive cloud storage service
• 1 SHARD BROKER
• 4 Proxies per Broker
• 50 CAS per proxy
• No performance degradation after CUBRID SHARD is used
[Chart; x-axis: Vuser (64, 128, 192, 256, 320)]
To keep this presentation simple, I will focus on three main things.
Ask your questions as I'm going through the presentation. Hold off on longer questions until the end. Do not hesitate to talk to me during the conference. Follow up by email.
We have researched how other companies manage big data. They still rely on relational databases and manage their data through data sharding. At NHN we ended up doing the same.
These are the existing sharding solutions.
- Among open source RDBMS solutions, MySQL doesn't provide database sharding out of the box.
- Google had to significantly change MySQL replication to make it work similarly. But at the time Sun, then the owner of MySQL, didn't accept Google's changes, resulting in a fork from mainstream without mainstream support.
- Twitter has recently opened their MySQL fork.
- http://www.oracle.com/technetwork/database/features/availability/300461-132370.pdf
Spider for MySQL requires changing the storage engine. That is not an option for us: we have running services and don't want to change anything.
The application-aware architecture of dbShards is explained at http://www.dbshards.com/blog/2010/09/black-box-vs-application-aware-sharding/
NHN has many large services in production. We did not want any added middleware to affect performance, so a critical requirement was that the sharding middleware must not decrease the overall performance of the service.
Spock Proxy and CUBRID SHARD have a somewhat similar architecture. Both are lightweight and flexible.