The Briefing Room with Dr. Robin Bloor and Splice Machine
Live Webcast August 11, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/onstage/g.php?MTID=e1b33c9d45b178e13784b4a971a4c1349
The ETL process was born out of necessity, and for decades it has been the glue between data sources and target applications. But as data growth soars and increased competition demands real-time data, standard ETL has become brittle and often unmanageable. Scaling up resources can do the trick, but it’s very costly and only a matter of time before the processes hit another bottleneck. When outmoded ETL stands in the way of real-time analytics, it might be time to consider a completely new approach.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he explains how modern, data-driven architectures must adopt an equally capable data integration strategy. He’ll be briefed by Rich Reimer of Splice Machine, who will discuss how his company solves ETL performance issues and enables real-time analytics and reports on big data. He will show that by leveraging the scale-out power of Hadoop and the in-memory speed of Spark, users can bring both analytical and operational systems together, eventually performing transformations only when needed.
Visit InsideAnalysis.com for more information.
3. Twitter Tag: #briefr The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
4. Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today's innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
5. Topics
August: REAL-TIME DATA
September: HADOOP 2.0
October: DATA MANAGEMENT
6. Why Data Gets in a Jam
• ETL is dated technology
• New super-highways are needed
• Data gravity is real
7. Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
8. Splice Machine
• Splice Machine is a SQL-on-Hadoop database
• The product is ACID-compliant and can power both OLAP and OLTP workloads
• Splice Machine is built on Java-based Apache Derby and HBase/Hadoop
9. Guest: Rich Reimer
Rich Reimer, VP of Marketing and Product Management
Rich has over 15 years of sales, marketing and management experience in high-tech companies. Before joining Splice Machine, Rich worked at Zynga as the Treasure Isle studio head, where he used petabytes of data from millions of daily users to optimize the business in real-time. Prior to Zynga, he was the COO and co-founder of a social media platform named Grouply. Before founding Grouply, Rich held executive positions at Siebel Systems, Blue Martini Software and Oracle Corporation as well as sales and marketing positions at General Electric and Bell Atlantic.
10. ETL: Gatekeeper to Real-Time Big Data
Splice Machine Proprietary and Confidential
Rich Reimer
VP, Product Management
rreimer@splicemachine.com
August 11, 2015
11. What Is Real-Time? Are We There Yet?
Depends on where you are in the insight-to-action continuum
Stages: Capture, Analyze, Act
Current
• Capture: Nightly ETL; Data Lakes
• Analyze: Interactive Reports on Old Data; Days for Data Scientists to Analyze
• Act: Days to Update Rules; Months to Update Apps
Real-Time
• Capture: Millisecond Delay
• Analyze: Automated Machine Learning
• Act: Autonomic Applications
Crawl, Walk, Run
12. ETL: Boring, Unglamorous, Inevitable Burden
“ETL is something you do that nobody notices until you don’t do it.”
- Author Unknown
13. But It’s Killing You Slowly…
Inertia and hidden costs dragging your business down
Diagram labels: Systems of Record (ERP, CRM, …), ODS, ETL, Data Warehouse
• Expensive: Scale-up hardware and proprietary software
• Tuning: Ongoing database tuning to address performance issues
• Script Maintenance: Constant updating of ETL scripts to handle changing sources and reports
• Unable to Meet Business Needs: Takes weeks or months to change or create new reports
• Delayed Reports: Errors or performance issues cause miss of ETL window and delay reports
• Data Too Old: Data is hours or days old, when business needs it near real-time
• Too Slow: Can take hours or even days to finish ETL pipeline
14. Big Data Makes It Worse
ETL becomes bigger bottleneck as data grows
Diagram labels: ETL Bottleneck, Applications, Analysis
30-40% data growth per year (Source: 2013 IBM Briefing Book)
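To put the 30-40% annual growth figure in perspective, here is a small sketch of the compound-growth arithmetic. The 100 TB starting volume and the five-year horizon are assumptions chosen for illustration; they do not come from the slide.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for a data volume to double at a constant compound growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def volume_after(start_tb: float, annual_growth: float, years: int) -> float:
    """Projected volume after the given number of years of compound growth."""
    return start_tb * (1 + annual_growth) ** years


# Hypothetical starting point: 100 TB today, projected over 5 years.
for rate in (0.30, 0.40):
    print(
        f"{rate:.0%} growth: doubles in {doubling_time_years(rate):.1f} years, "
        f"100 TB grows to {volume_after(100, rate, 5):.0f} TB in 5 years"
    )
```

If the ETL window is fixed while the input roughly doubles every two to three years, a pipeline that fits comfortably today will not fit for long without more capacity.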
15. Scale-Out: The Future of Databases
Dramatic improvement in price/performance
Scale Up (Increase server size) vs. Scale Out (More small servers)
16. Fixing ETL: Incremental Approach
Incremental evolution to reduce lag from days to seconds
Approach: ETL: Scale-up | ETL: Scale-out | ELT | T Only
Timing: Legacy | Now | Now | Future
Lag: Days/Hours | Hours/Minutes | Minutes/Seconds | No Lag
Architecture: (diagram; labels: OLTP, Transform, OLAP, OLTP/OLAP)
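The difference between the ETL and ELT columns can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for the target database, and the table names, sample rows, and `clean` helper are hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_orders = [("1001", " alice ", "19.99"), ("1002", "BOB", "5.00")]


def clean(row):
    """Transform step: fix types and normalize the customer name."""
    order_id, customer, amount = row
    return int(order_id), customer.strip().lower(), float(amount)


def run_etl(conn, rows):
    """ETL: transform in the pipeline, then load only the finished rows."""
    conn.execute("CREATE TABLE orders_etl (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders_etl VALUES (?, ?, ?)", [clean(r) for r in rows]
    )


def run_elt(conn, rows):
    """ELT: load the raw rows first, then transform with SQL inside the target."""
    conn.execute("CREATE TABLE orders_raw (id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(id AS INTEGER)   AS id,
               LOWER(TRIM(customer)) AS customer,
               CAST(amount AS REAL)  AS amount
        FROM orders_raw
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn, raw_orders)
run_elt(conn, raw_orders)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY id").fetchall()
print(etl_rows == elt_rows)
```

Both paths produce the same rows. What changes is where the transform runs: in ELT the raw data is queryable as soon as it lands, and the transform can be rerun in place with SQL.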
17. Reference Architecture: Typical Data Processing Pipeline
How do you reduce lag from days to minutes to seconds?
Diagram labels:
• Systems of Record: ERP, CRM, Supply Chain, HR, …
• Stream or Batch Updates
• ODS
• ETL: Extract, Transform, Load
• Data Warehouse
• Datamart
• Mixed Workload Apps
• Operational Reports & Analytics
• Executive Business Reports
• Ad Hoc Analytics
18. Reference Architecture: Scale-Out Data Processing Pipeline
Accelerate Data Processing Pipeline to minutes or even seconds
Diagram labels:
• Systems of Record: ERP, CRM, Supply Chain, HR, …
• Stream or Batch Updates
• Operational Data Lake
• ETL: Extract, Transform, Load
• Data Warehouse
• Datamart
• Mixed Workload Apps
• Operational Reports & Analytics
• Executive Business Reports
• Ad Hoc Analytics
Benefits
• 5-10x faster
• 75% less cost
• Elastic scalability
• Unstructured data support
19. You Need More Than Hadoop By Itself For ETL
Errors or data quality issues force ETL restarts
Hadoop ETL
• Restart ETL to fix errors or update records (Hours)
Hadoop RDBMS ETL
• Use transaction to restart step or update records (Seconds)
Diagram labels: Apps, ETL, Analytics
Benefits
• SQL-based transforms
• Improved data quality
• Faster recovery with transactions
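The "use transaction to restart step" idea can be sketched as follows. This is a minimal illustration, not Splice Machine's mechanism: sqlite3 from the Python standard library stands in for a transactional store, and the `orders` table, the batches, and the validation rule are hypothetical.

```python
import sqlite3


def load_batch(conn, batch):
    """Load one batch atomically: every row commits, or none do."""
    with conn:  # commits on success, rolls back if an exception escapes
        for order_id, amount in batch:
            if amount < 0:
                raise ValueError(f"bad amount for order {order_id}")
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.commit()

batches = [
    [(1, 10.0), (2, 20.0)],
    [(3, 30.0), (4, -1.0)],  # hypothetical data quality problem
]

failed = []
for index, batch in enumerate(batches):
    try:
        load_batch(conn, batch)
    except ValueError:
        failed.append(index)  # the partial batch was rolled back

count_after_failure = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Fix the bad record, then rerun only the failed step, not the whole pipeline.
batches[1][1] = (4, 1.0)
for index in failed:
    load_batch(conn, batches[index])

count_after_retry = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count_after_failure, count_after_retry)
```

Because the failed batch leaves no partial rows behind, the retry does not need a cleanup pass or a full restart.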
20. Streamlining the Structured Data Pipeline in Hadoop
Traditional Hadoop Pipeline
• Source Systems: ERP, …, CRM
• Sqoop
• Apply Inferred Schema
• Stored as flat files
• SQL Query Engines
• BI Tools
vs.
Streamlined Hadoop Pipeline
• Source Systems: ERP, …, CRM
• Existing ETL Tool
• Stored in same schema
• BI Tools
Benefits
• Less cost and complexity
• Faster w/ fewer translations
• Improved data quality
• Better SQL support
21. Seamless Integration of Structured and Unstructured Data
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines
Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR, …), Structured Data, Unstructured Data, HCATALOG, Pig
1. SCHEMA ON INGEST: Streamlined, structured-to-structured integration
2. SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data
3. SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data
22. Case Study: Operational Data Lake
Overview
• Computer technology corporation
• Update database technology for:
  - ODS layer replacement
  - ETL processing and analysis of Omniture data
  - Real-time OLTP for Global Tech Support app
Challenges
• Oracle and Teradata too expensive to scale
• Many Oracle queries couldn’t complete
• Can only hold 7 days worth of data in Oracle
• Missing ETL window with current Hadoop data lake
Solution Diagram
• (400TB)
• OLTP Systems: ERP, CRM, Supply Chain
Benefits
• 75% less cost with commodity scale out
• Incremental ETL processing gracefully handle data quality issues
• 5x-10x faster, completing queries on which Oracle failed
23. Use Cases
• Internet of Things
• ETL/Operational Data Lake
• Digital Marketing
• Precision Medicine
• Fraud Detection
24. Who Are We?
THE HADOOP RDBMS
Replace Oracle with Splice Machine to scale out your applications
• Affordable, Scale-Out: Commodity hardware
• Elastic: Easy to expand or scale back
• Transactional: Real-time updates & ACID Transactions
• ANSI SQL: Leverage existing SQL code, tools, & skills
• Flexible: Support operational and analytical workloads
10x Better Price/Perf
25. Proven Building Blocks: Hadoop and Derby
APACHE DERBY
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC Compliant
APACHE HBASE/HDFS
• Auto-sharding
• Real-time updates
• Fault-tolerance
• Scalability to 100s of PBs
• Data replication
26. Distributed, Parallelized Query Execution
• Parallelized computation across cluster
• Moves computation to the data
• Utilizes HBase co-processors
• No MapReduce
Legend: HBase Co-Processor; HBase Server Memory Space
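"Moves computation to the data" can be pictured as a scatter-gather aggregation: each server reduces its own rows and only the small partial results travel to a coordinator. The sketch below is a single-process analogy under stated assumptions; the shards, the `local_sum` function, and the thread pool are hypothetical stand-ins and do not use HBase or its co-processor API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the rows held by one server.
shards = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 8), ("north", 1)],
]


def local_sum(rows):
    """Runs next to the data: reduce one shard to a small partial result."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def distributed_sum(all_shards):
    """Coordinator: fan out the work, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(local_sum, all_shards):
            total.update(partial)  # Counter.update adds counts together
    return dict(total)


totals = distributed_sum(shards)
print(totals)
```

The coordinator only ever sees one small dictionary per shard, which is the point of pushing the aggregation down instead of pulling every row across the network.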
28. Lockless, ACID transactions
• Adds multi-row, multi-table transactions to HBase w/ rollback
• Fast, lockless, high concurrency
• Extends research from Google Percolator, Yahoo Labs, U of Waterloo
• Patent pending technology
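The slide does not describe the mechanism, so the following is only a toy illustration of one general way to get multi-row transactions with rollback and no read or write locks: a multi-version store with snapshot reads and a first-committer-wins check at commit time. All class and method names are hypothetical, it is single-threaded, and it is not Splice Machine's implementation. A real system would also have to make the validate-and-apply step in `commit` atomic.

```python
import itertools


class ConflictError(Exception):
    """Raised when another transaction committed a conflicting write first."""


class Store:
    """Toy multi-version store: key -> list of (commit_ts, value), oldest first."""

    def __init__(self):
        self.versions = {}
        self.clock = itertools.count(1)  # single source of timestamps

    def begin(self):
        return Txn(self, next(self.clock))


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered until commit; dropping them is a rollback

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Snapshot read: only versions committed before this transaction began.
        history = self.store.versions.get(key, [])
        visible = [value for ts, value in history if ts < self.start_ts]
        return visible[-1] if visible else None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any key we wrote has a newer version.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise ConflictError(key)
        commit_ts = next(self.store.clock)
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))


store = Store()
setup = store.begin()
setup.write("a", 100)
setup.write("b", 0)
setup.commit()

t1 = store.begin()
t2 = store.begin()
t1.write("a", 90)   # multi-row change: move 10 from a to b
t1.write("b", 10)
t1.commit()

t2_saw = t2.read("a")  # still the snapshot value from before t1 committed
t2.write("a", 50)
try:
    t2.commit()
    conflict = False
except ConflictError:
    conflict = True     # t2's buffered write is discarded

reader = store.begin()
final_a, final_b = reader.read("a"), reader.read("b")
print(t2_saw, conflict, final_a, final_b)
```

Readers never block writers here because they read older versions, and conflicting writers are detected at commit rather than by holding locks.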
29. What People are Saying…
Recognized as a key innovator in databases
Quotes
• “Scaling out on Splice Machine presented some major benefits over Oracle ...automatic balancing between clusters...avoiding the costly licensing issues.”
• “An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop.”
• “The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop.”
Awards
30. Initial Advisory Board
Advisory Board includes luminaries in databases and technology
• Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC
• Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; Founder of Apache Spark
• Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle
• Ken Rudin: Head of Analytics at Facebook; Former GM of Oracle Data Warehousing
• Abhinav Gupta: Co-Founder, VP Engineering at Rocket Fuel; Runs 15PB HBase Cluster
31. The First Step to Real-Time Big Data Requires Fixing ETL
ETL on Hadoop
• Drive lag down from hours → minutes → seconds
• Start by replacing ODS with Operational Data Lake
• 5-10x faster and ¼ cost
Splice Machine
• Replace RDBMSs like Oracle and MySQL
• Best of both worlds: SQL and transactions of RDBMSs, scale-out of NoSQL
• 10x better price/performance
Diagram: ETL: Scale-up, ETL: Scale-out, ELT, T Only (labels: Transform, OLTP, OLAP, OLTP/OLAP)
32. ETL: Gatekeeper to Real-Time Big Data
Rich Reimer
VP, Product Management
rreimer@splicemachine.com
August 11, 2015
37. Hadoop: One Ring to Rule Them All
Hadoop has become the de facto processing environment for big data.
Is it going to become the de facto environment for ALL SERVER COMPUTING?
38. Empires to Conquer
• Big Data
• Analytics
• Real-time analytics
• OLTP
• Document shares
• Office systems
Status marks on the slide, in order: ✔ ✔ ? ? ??
41. Hadoop Possibilities?
• Hadoop is evolving faster than any equivalent technology I can remember
• It has a very long way to go to become the “server OS for everything.”
• First it would need to become a genuine OS
• It has no stated direction.
• It may vanish into the cloud.
• Nevertheless it is interesting to watch
43.
• It’s not just ETL: it’s ETL, data cleansing, metadata capture, MDM, etc. How do you accommodate that?
• Do you have any ETL customer experiences to report?
• How’s your OLTP business going? (Is this ETL emphasis a complementary activity?)
• How well are you doing versus Oracle?
44.
• How well does it integrate with other technologies?
• Who is your largest current customer (or customers)?
• Do you have any direct competition on Hadoop?
46. Upcoming Topics
www.insideanalysis.com
August: REAL-TIME DATA
September: HADOOP 2.0
October: DATA MANAGEMENT
47. THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons and by basykes [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons (https://upload.wikimedia.org/wikipedia/commons/9/94/Beijing_traffic_jam.jpg)