SlideShare utilise les cookies pour améliorer les fonctionnalités et les performances, et également pour vous montrer des publicités pertinentes. Si vous continuez à naviguer sur ce site, vous acceptez l’utilisation de cookies. Consultez nos Conditions d’utilisation et notre Politique de confidentialité.
SlideShare utilise les cookies pour améliorer les fonctionnalités et les performances, et également pour vous montrer des publicités pertinentes. Si vous continuez à naviguer sur ce site, vous acceptez l’utilisation de cookies. Consultez notre Politique de confidentialité et nos Conditions d’utilisation pour en savoir plus.
There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views which are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Hadoop is nowadays the de-facto open-source solution for Big Data batch-processing. When the output of a Hadoop process is big, there isn`t a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may vary from one day to another. Splout decouples database creation from database serving and makes it efficient and safe to deploy Hadoop-generated datasets. There are many databases that allow serving Big Data such as NoSQL solutions, but they don`t have a rich query language like SQL. You generally can`t aggregate data in real-time like you would do with a GROUP BY clause. Because you can`t precompute everything, SQL is a very convenient feature to have in a Big Data serving solution. Splout is not a “fast analytics” engine. Splout is made for demanding web or mobile applications where query performance is critical. Arbitrary real-time aggregations should be done in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful & open-source.
Each partition is … Backed by SQLite Generated on Hadoop Including any indexes needed Data can be sorted before insertion to minimize disk seeks at query time Pre-sampling for balancing partition size Distributed on Splout SQL cluster With replication for failover
Atomicity A tablespace is a set of tables that share the same partitioning schema Tablespaces are versioned Only one version served at a time Several tablespaces can be deployed at once All-or-nothing semantics (atomicity) Rollback support
Characteristics Ensured ms latencies Even when queries hit disk Controlled by the developer selecting the proper: - Cluster topology - Partitioning - Indexes - Data collocation (insertion order)
Characteristics (II) 100% SQL But restricted to a single partition Real-time aggregations Joins Scalability In data capacity In performance
Characteristics (III) Atomicity New data replaces old data all at once High availability Through the use of replication Open Source
Characteristics (IV) Easy to manage Changing the size of the cluster can be done without any downtime Read only Data is updated in batches Updates come from new tablespace deployments
Future work Growing the community Do you want to collaborate? Automatic rebalancing on failover Almost done Some read/write capabilities Enabling Splout SQL to become the speed layer on lambda architectures
Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt Questions?