Presentation for Big Data Beers @ Berlin, 12-2012. In this presentation I introduce Splout SQL, a new open-source data view for Hadoop which provides high-throughput, low-latency query times. Through these slides I show a recurring problem we have found when doing Hadoop consulting: moving data between processing and serving. I will show how Splout SQL solves it.
2. Who am I?
● Pere Ferrera Bertran, Barcelona @ferrerabertran
● 8 years “backender” @ BCN startups.
● “The shy guy” (aka CTO) @ Datasalt
● Hadoop consulting: PeerIndex, Trovit, BBVA
● Open-source low-level API for Hadoop (Pangool)
– Accepted paper: ICDM 2012
● Jazz pianist in my free time
3. 3.5 Big Data Challenges
Moving Big Data seamlessly is also a challenge!
4. Hadoop
● Mainstream Big Data Storage & Batch Processing
● Open-source.
● Large community.
● Many higher-level tools.
● Many companies around.
● It scales.
● Bad things people say about it:
● Slow
– but we now have MapR!
● Hard to program
– but we have Hive, Pig or Pangool!
– and even things like Datameer!
● Buggy
– but we have a stable 1.0 and supporting companies like Cloudera!
● Getting better and better! - YARN (2.0)
5. The Batch Revolution
● Batch is not the only kind of processing
● But it covers many cases very well.
– All our consulting clients use it.
● Hadoop makes it transparently scalable!
● I see this as a revolution.
● Advantages: Simple, resistant to programming errors.
● Disadvantages: Long-running processes; results are updated in a matter of hours.
● My advice: Can you cope with that? Then use batch processing.
● Ted Dunning & Nathan Marz are good “gurus” to listen to on this topic.
6. The problem (we want to solve)
● Big Data usually means having Big Data as input
● A lot of emphasis nowadays on “analytics”, where the output is usually small
● Small, targeted reports.
“I will eat all this so that almost nothing
remains out of it... “
7. The problem (we want to solve)
● But the problem is that sometimes the output is also Big Data!
● Recommendations
● Aggregated stats
● Listings
● Recurring problem: Take your “Big Data output” and “put it” somewhere
● NoSQL
● Search engine
● To be able to answer real-time queries / low-latency lookups over it.
● Websites, mobile apps.
● A lot of people using the app concurrently.
● Read-only!
8. Current options
● Hadoop-generated files are not (directly) queryable
● They lack appropriate indexes (e.g. B-trees) for making queries really fast
● We can “send” the result of a Hadoop process directly to a
database...
● Problems:
– Latency (random writing / rebalancing / index update)
– Affecting query service (database may slow down while
updating and serving at the same time)
– Incrementality (may lead to inconsistency of results)
9. Meet Splout SQL!
● Store generation decoupled
from store serving
● Data is always optimally indexed.
● Zero fragmentation.
● “Atomic” deployment
● New versions are swapped in without affecting query serving.
● All data replaced at once.
● Flexible.
● 100% SQL
● Rich query language
● Real-time aggregations over data
● Not everything needs to be pre-computed!
10. Details
● A very old idea which everyone implemented by hand at some
point.
● Horizontal partitioning.
● Generates many database files (partitions) and distributes them across a cluster (see the sketch below).
● Replication, fail-over.
● Hadoop (Pangool) for generating the data structures.
● Including all b-trees needed!
● Database files: SQLite files.
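To make the partitioning idea concrete, here is a minimal, hypothetical sketch of how a record could be routed to a partition by hashing its partition key; each partition later becomes one standalone SQLite file. The class and method names are illustrative only, not Splout SQL's actual API (the real work happens inside Pangool jobs, and the partitioning scheme is configurable there).

```java
// Conceptual sketch of horizontal partitioning: each record is routed to a
// partition by hashing its partition key; every partition later becomes one
// standalone SQLite file. Names here are illustrative, not Splout's real API.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PartitionRouter {

    private final int nPartitions;

    public PartitionRouter(int nPartitions) {
        this.nPartitions = nPartitions;
    }

    /** Returns the partition (0..nPartitions-1) a record belongs to. */
    public int partitionFor(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        // CRC32 yields a non-negative value, so the modulo stays in range
        return (int) (crc.getValue() % nPartitions);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(8);
        // All rows for the same key end up in the same partition file,
        // so a query for that key only needs to hit one node.
        System.out.println(router.partitionFor("user-42"));
    }
}
```

The point of the sketch is only to show why all rows sharing a key end up in the same file; the actual number of partitions, keys and replication are handled by the Splout SQL generation job.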
12. SQLite
● Fast (10% slower than MySQL)
● Simple.
● Probably the best embedded SQL database out there.
● Being embedded makes it easy to use inside Hadoop (see the sketch below).
● Still, it lacks some features.
● Not the database one would choose for an enterprise app.
● But Splout SQL is essentially read-only!
● So we don't need that many features.
Splout != SQLite. In the future we might integrate it with
PostgreSQL, for instance.
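Because SQLite is embedded, a reducer can simply open a local file with a standard SQLite JDBC driver, create the tables and B-tree indexes, and bulk-insert its partition's rows. Below is a minimal, self-contained sketch of that idea, assuming the xerial sqlite-jdbc driver is on the classpath; the file name and the "pageviews" schema are made up for illustration and are not part of Splout SQL's actual generation code.

```java
// Minimal sketch of building one partition file with the SQLite JDBC driver
// (org.xerial:sqlite-jdbc assumed on the classpath). Schema and file name are
// illustrative; they are not Splout SQL's real generation code.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class PartitionFileBuilder {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:partition-0.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE pageviews (url TEXT, day TEXT, views INTEGER)");
                // The B-tree index is built at generation time, on Hadoop,
                // so the serving nodes never pay the indexing cost.
                st.execute("CREATE INDEX idx_url ON pageviews (url)");
            }
            conn.setAutoCommit(false); // batch inserts for speed
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO pageviews VALUES (?, ?, ?)")) {
                ps.setString(1, "http://example.com");
                ps.setString(2, "2012-12-01");
                ps.setInt(3, 42);
                ps.executeUpdate();
            }
            conn.commit();
        }
    }
}
```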
13. Making Splout SQL fly
● Because the database is created offline, things like insertion order can be controlled.
● Hadoop sorts the data for you.
● So you insert all your data in the appropriate order for making queries fast (illustrated in the sketch below).
● Even if disk is used, only one seek will be needed (because of data locality).
Real-time GROUP BYs over an average of 2,000 records of 50 bytes each, in an average of 40 milliseconds, on an m1.small EC2 machine.
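To make the locality argument concrete, here is a hedged sketch of a serving-time aggregation against a partition file built like the one above. Because rows were inserted pre-sorted by their key, all rows for one key sit contiguously in the file, so the GROUP BY touches a single region of it. The "pageviews" table is the same hypothetical schema as before, not a real Splout example.

```java
// Hedged sketch of a serving-time aggregation against a pre-built SQLite
// partition file, reusing the hypothetical "pageviews" table from above.
// Because rows were inserted sorted by url, all rows for one url are
// physically contiguous and the GROUP BY needs at most one disk seek.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RealTimeGroupBy {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:partition-0.db");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT day, SUM(views) FROM pageviews WHERE url = ? GROUP BY day")) {
            ps.setString(1, "http://example.com");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}
```

In Splout SQL the same kind of query would be answered by the node holding the right partition; the sketch only shows why the pre-built, pre-sorted SQLite file makes it cheap.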
14. Recap
● We see a recurring problem when the output is also Big Data.
● Moving data between (batch) processing and serving.
● Splout SQL solves it and adds full SQL.
“A web-latency SQL view for Hadoop”
● Web-latency: unlike data warehousing / analytics
● SQL: unlike key/value and other NoSQLs
● View: simply make files queryable → read-only
● For Hadoop: for the Big Data output of batch processing
Check it out and play with it!
http://sploutsql.com