There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views that are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or ElephantDB are to Hadoop + Key/Value.

Hadoop is nowadays the de facto open-source solution for Big Data batch processing. But when the output of a Hadoop process is itself big, there isn't a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may change from one day to the next. Splout decouples database creation from database serving, making it efficient and safe to deploy Hadoop-generated datasets.

There are many databases that can serve Big Data, such as NoSQL solutions, but they don't have a rich query language like SQL. You generally can't aggregate data in real time as you would with a GROUP BY clause. Because you can't precompute everything, SQL is a very convenient feature to have in a Big Data serving solution.

Splout is not a "fast analytics" engine. Splout is made for demanding web or mobile applications where query performance is critical: arbitrary real-time aggregations should complete in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful & open-source.
6. Serving: for key = 'U20', tablespace = 'CLIENTS_INFO'

SELECT Name, SUM(Amount)
FROM CLIENTS c, SALES s
WHERE c.CID = s.CID AND c.CID = 'U20';

Partition U10 – U35:
  Table CLIENTS        Table SALES
  CID  Name            SID   CID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36 – U60:
  Table CLIENTS        Table SALES
  CID  Name            SID   CID  Amount
  U40  John            S223  U40  99
7. Serving: for key = 'U40', tablespace = 'CLIENTS_INFO'

SELECT Name, SUM(Amount)
FROM CLIENTS c, SALES s
WHERE c.CID = s.CID AND c.CID = 'U40';

Partition U10 – U35:
  Table CLIENTS        Table SALES
  CID  Name            SID   CID  Amount
  U20  Doug            S100  U20  102
  U21  Ted             S101  U20  60

Partition U36 – U60:
  Table CLIENTS        Table SALES
  CID  Name            SID   CID  Amount
  U40  John            S223  U40  99
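The routing in slides 6 and 7 can be sketched in a few lines of Python. This is a hypothetical toy, not Splout's actual implementation: each partition is a standalone SQLite database holding only its key range, a router picks the partition from the key, and the join then runs entirely inside that one partition. All names (`PARTITIONS`, `query_by_key`) are illustrative.

```python
import sqlite3

def make_partition(clients, sales):
    # One partition = one self-contained SQLite database (illustrative).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE CLIENTS (CID TEXT, Name TEXT)")
    db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")
    db.executemany("INSERT INTO CLIENTS VALUES (?, ?)", clients)
    db.executemany("INSERT INTO SALES VALUES (?, ?, ?)", sales)
    return db

# The two partitions from the CLIENTS_INFO example above.
PARTITIONS = {
    ("U10", "U35"): make_partition(
        [("U20", "Doug"), ("U21", "Ted")],
        [("S100", "U20", 102), ("S101", "U20", 60)]),
    ("U36", "U60"): make_partition(
        [("U40", "John")],
        [("S223", "U40", 99)]),
}

def query_by_key(key, sql, params):
    # Route by key range, then run the SQL inside that single partition.
    for (lo, hi), db in PARTITIONS.items():
        if lo <= key <= hi:
            return db.execute(sql, params).fetchall()
    raise KeyError(key)

SQL = ("SELECT Name, SUM(Amount) FROM CLIENTS c, SALES s "
       "WHERE c.CID = s.CID AND c.CID = ?")
print(query_by_key("U20", SQL, ("U20",)))  # [('Doug', 162)]
print(query_by_key("U40", SQL, ("U40",)))  # [('John', 99)]
```

Note that the query never crosses partition boundaries, which is exactly why the scheme scales: adding partitions adds capacity without making any single query more expensive.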
8. Why does it scale?
Data is partitioned
Partitions are distributed across nodes
Adding more nodes increases capacity
Queries restricted to a single partition
Generation does not impact serving
14. Building a Google Analytics clone
Imagine that one crazy day you decide to build some kind of Google Analytics…
Zillions of events
Millions of domains
Individual panel per domain
15. Requirements
Time-based charts (day/hour aggregations)
Flexible dimension breakdown
Per page, per browser
Per country, per language
…
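A minimal sketch of how these requirements map onto a partition, assuming a hypothetical per-domain layout: Hadoop rolls events up into (day, page, browser, country, views) rows, the tablespace is partitioned by domain, and each breakdown on slide 15 becomes an ordinary GROUP BY inside one partition. The table name, columns, and sample figures are all invented for illustration.

```python
import sqlite3

# One hypothetical per-domain partition, backed by SQLite.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE pageviews (
    domain TEXT, day TEXT, page TEXT, browser TEXT, country TEXT,
    views INTEGER)""")
db.executemany("INSERT INTO pageviews VALUES (?, ?, ?, ?, ?, ?)", [
    ("example.com", "2013-04-01", "/",      "Firefox", "ES", 120),
    ("example.com", "2013-04-01", "/",      "Chrome",  "US",  80),
    ("example.com", "2013-04-02", "/about", "Chrome",  "US",  30),
])

# Time-based chart: daily totals for one domain.
per_day = db.execute(
    "SELECT day, SUM(views) FROM pageviews "
    "WHERE domain = ? GROUP BY day ORDER BY day",
    ("example.com",)).fetchall()
print(per_day)      # [('2013-04-01', 200), ('2013-04-02', 30)]

# Flexible dimension breakdown: same rows, grouped per browser instead.
per_browser = db.execute(
    "SELECT browser, SUM(views) FROM pageviews "
    "WHERE domain = ? GROUP BY browser ORDER BY browser",
    ("example.com",)).fetchall()
print(per_browser)  # [('Chrome', 110), ('Firefox', 120)]
```

The point is that no breakdown needs to be precomputed: one pre-aggregated table supports day, page, browser, and country views with nothing but SQL at query time.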
20. Each partition is …
Backed by SQLite
Generated on Hadoop
Including any indexes needed
Data can be sorted before insertion to minimize disk seeks at query time
Pre-sampling for balancing partition size
Distributed on Splout SQL cluster
With replication for failover
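The generation step on slide 20 can be sketched as follows. This is a hypothetical stand-in for what the Hadoop job does per partition: rows are sorted by the partition key before the bulk insert so that rows for the same key end up physically collocated, and the index is created only after loading. File and table names are invented.

```python
import os
import sqlite3
import tempfile

# 1000 synthetic SALES rows with keys in the U10–U35 range (illustrative).
rows = [("U%02d" % (10 + i % 26), "S%d" % i, i % 100) for i in range(1000)]
rows.sort(key=lambda r: r[0])  # sort by CID: same-key rows become adjacent

path = os.path.join(tempfile.mkdtemp(), "partition-U10-U35.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE SALES (CID TEXT, SID TEXT, Amount INTEGER)")
db.executemany("INSERT INTO SALES VALUES (?, ?, ?)", rows)
db.execute("CREATE INDEX sales_cid ON SALES (CID)")  # index after bulk load
db.commit()

# A point query now touches a minimal number of disk pages, because all
# rows for the key sit next to each other in insertion order.
hits = db.execute("SELECT COUNT(*) FROM SALES WHERE CID = 'U20'").fetchone()[0]
print(hits)  # 39
```

Building the index after the bulk insert (rather than before) is the usual trick for fast SQLite loads; the sorted insertion order is what the deck means by data collocation.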
21. Atomicity
A tablespace is a set of tables that share the same partitioning schema
Tablespaces are versioned
Only one version served at a time
Several tablespaces can be deployed at once
All-or-nothing semantics (atomicity)
Rollback support
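The versioning semantics of slide 21 boil down to an atomic pointer swap, sketched here as a toy in-memory class (hypothetical, not Splout's code): a new version is loaded fully aside, readers always resolve through a single "current version" pointer, and deploy and rollback are each one atomic switch of that pointer.

```python
import threading

class Tablespace:
    """Toy model of versioned, all-or-nothing tablespace deploys."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}   # version id -> partition data
        self._current = None

    def deploy(self, version, data):
        self._versions[version] = data  # load the new version aside first
        with self._lock:
            self._current = version     # atomic switch: all or nothing

    def rollback(self, version):
        # Old versions are kept around, so rollback is just another switch.
        if version not in self._versions:
            raise KeyError(version)
        with self._lock:
            self._current = version

    def read(self, key):
        with self._lock:
            return self._versions[self._current][key]

ts = Tablespace()
ts.deploy(1, {"U20": "Doug"})
ts.deploy(2, {"U20": "Douglas"})
print(ts.read("U20"))  # 'Douglas'
ts.rollback(1)
print(ts.read("U20"))  # 'Doug'
```

Because readers never see a half-deployed version, serving stays consistent while arbitrarily large datasets are swapped in underneath.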
22. Characteristics
Guaranteed millisecond latencies
Even when queries hit disk
Controlled by the developer by selecting the proper:
- Cluster topology
- Partitioning
- Indexes
- Data collocation (insertion order)
23. Characteristics (II)
100% SQL
But restricted to a single partition
Real-time aggregations
Joins
Scalability
In data capacity
In performance
24. Characteristics (III)
Atomicity
New data replaces old data all at once
High availability
Through the use of replication
Open Source
25. Characteristics (IV)
Easy to manage
Changing the size of the cluster can be done without any downtime
Read only
Data is updated in batches
Updates come from new tablespace deployments
34. Future work
Growing the community
Do you want to collaborate?
Automatic rebalancing on failover
Almost done
Some read/write capabilities
Enabling Splout SQL to become the speed layer in lambda architectures
35. Iván de Prado Alonso – CEO of Datasalt
www.datasalt.es
@ivanprado
@datasalt
Questions?