2. About Me: Neil Dahlke
Engineer
Formerly Globus
• high performance data transfer for research scientists
Past talks
• Real-time, Geospatial, Maps
Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke
7. Architecture: It’s SQL All The Way Down
Aggregators: Agg 1, Agg 2
Query received by an aggregator:
select avg(price) from orders;
Queries the aggregator sends to the leaves:
leaf1> using memsql_demo_0
select count(1), sum(price)
from orders;
leaf2> using memsql_demo_12
select count(1), sum(price)
from orders;
...
Leaves: Leaf 1, Leaf 2, Leaf 3, Leaf 4
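The rewrite on this slide (avg on the aggregator, count and sum on each leaf) can be sketched in a few lines. This is only an illustration, not MemSQL code: it uses Python's built-in sqlite3 as a stand-in for leaf partitions, and the helper names make_leaf, leaf_partial, and aggregate_avg are hypothetical.

```python
import sqlite3

def make_leaf(prices):
    """Create an in-memory table standing in for one leaf partition (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (price REAL)")
    conn.executemany("INSERT INTO orders (price) VALUES (?)", [(p,) for p in prices])
    return conn

def leaf_partial(conn):
    """Run the per-leaf query from the slide: count and sum, not avg."""
    count, total = conn.execute("SELECT count(1), sum(price) FROM orders").fetchone()
    # sum() over an empty table yields NULL, so treat it as 0.0
    return count, total or 0.0

def aggregate_avg(leaves):
    """Combine the partial results the way an aggregator would."""
    partials = [leaf_partial(leaf) for leaf in leaves]
    n = sum(count for count, _ in partials)
    total = sum(subtotal for _, subtotal in partials)
    return total / n if n else None

leaves = [
    make_leaf([10.0, 20.0]),
    make_leaf([30.0]),
    make_leaf([]),
    make_leaf([40.0, 50.0, 60.0]),
]
print(aggregate_avg(leaves))
```

Sending count and sum rather than a per-leaf avg matters because partitions hold different numbers of rows, so an average of averages would weight them incorrectly.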
8. Latency in the Enterprise
SLOW DATA LOADING
Batched Loading
Hours to load
Sampled Data Views
No real-time ingestion
LENGTHY QUERY EXECUTION
Slow query responses
Slow reports
Slow applications
No real-time response
LOW CONCURRENCY
Single threaded operations
Challenge with mixed workloads
Overall poor performance
9. REIMAGINE AN EXISTING BUSINESS PROCESS.
What if you had intra-day information to inform your decision making,
instead of daily or even weekly?
13. Why MemSQL?
FAST DATA INGEST
The volume of data that can be ingested into the database
LOW LATENCY QUERIES
The time it takes to execute queries and receive results
HIGH CONCURRENCY
The ability to scale simultaneous operations
20. A massively scalable database and ingest solution allowed for
massive growth, real-time analytic applications and faster, targeted.
21. Before
Kafka
S3
• Persisted all logs to cold storage for eventual analysis
Hadoop
• Nightly map-reduce jobs
Redshift
• Took a full day to load data from previous day
• Reaching overlap of times caused data crisis
• Pre-aggregated
• Limited concurrency
22. Why was this bad for their business?
Late data
Limited access to the data once it’s in
Long waits for insight
Expensive
23. Why was this bad for their data operations?
Not scalable
No deduplication
• aka not exactly-once
Unfiltered and incomplete data (silos)
Pre-aggregated data
FAST DATA INGEST
LOW LATENCY QUERIES
HIGH CONCURRENCY
30. Visualizing the Data
Demo built using
• Mapbox
• Websockets
• Tornado web server
When an image is pinned, the circles on the globe
expand, showing higher volume areas
Reads data from MemSQL directly
32. Introducing MemSQL Pipelines
CREATE PIPELINE is a database construct that enables
data ingestion with exactly-once semantics
• MemSQL stores the Kafka offset in a table
• Exactly-once delivery facilitated by co-locating data and offsets
Extract, transform, and load external data natively
Fully distributed workloads
User-defined transformations
Scalable, highly performant, online ALTER TABLE and ALTER PIPELINE
33. MemSQL Pipelines Sequence
1. Extract from data sources
2. Transform extracted data
3. Load transformed data into Database tables in parallel
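The three-step sequence can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the sources are lists of CSV-like lines, the "tables" are plain Python lists, and the function names are hypothetical. It only shows the shape of extract, transform, and parallel load, with one worker per source partition.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(source):
    """Step 1: read raw records from one source partition (here, a list of lines)."""
    return list(source)

def transform(line):
    """Step 2: turn a raw 'user,price' line into a typed row."""
    user, price = line.split(",")
    return {"user": user.strip(), "price": float(price)}

def load(table, rows):
    """Step 3: append transformed rows to the destination table."""
    table.extend(rows)
    return len(rows)

def run_partition(source, table):
    """Run all three steps for a single partition."""
    rows = [transform(line) for line in extract(source)]
    return load(table, rows)

def run_pipeline(sources, tables):
    """Process every partition in parallel; each worker writes only to its own table."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        return sum(pool.map(run_partition, sources, tables))
```

Giving each worker its own destination avoids shared mutable state, which loosely mirrors loading each source partition into its own database partition.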
36. Getting Data to MemSQL
CREATE PIPELINE | Streamliner
Parallel loading from multiple sources | Parallel loading from multiple sources
Direct to leaf nodes | Data to multiple aggregators, then leaf nodes
Native feature | Built with Apache Spark
Exactly-once semantics |
40. Learn More
[ODBMS Watch] Powering Big Data at Pinterest. Interview with Krishna Gade
[GigaOm] Pinterest is experimenting with MemSQL for real-time data analytics
[InfoQ] Real-time Data Analytics at Pinterest using MemSQL and Spark Streaming
[MemSQL Blog] How Pinterest Measures Real-Time User Engagement with Spark
[Pinterest Engineering Blog] Real-time analytics at Pinterest