Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Twitter Tag: #briefr The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today's innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
August: REAL-TIME DATA
September: HADOOP 2.0
October: DATA MANAGEMENT
Why Data Gets in a Jam
• ETL is dated technology
• New super-highways are needed
• Data gravity is real
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
robin.bloor@bloorgroup.com
@robinbloor
Splice Machine
• Splice Machine is a SQL-on-Hadoop database
• The product is ACID-compliant and can power both OLAP and OLTP workloads
• Splice Machine is built on Java-based Apache Derby and HBase/Hadoop
Guest: Rich Reimer
Rich Reimer, VP of Marketing and Product Management
Rich has over 15 years of sales, marketing and management experience in high-tech companies. Before joining Splice Machine, Rich worked at Zynga as the Treasure Isle studio head, where he used petabytes of data from millions of daily users to optimize the business in real-time. Prior to Zynga, he was the COO and co-founder of a social media platform named Grouply. Before founding Grouply, Rich held executive positions at Siebel Systems, Blue Martini Software and Oracle Corporation as well as sales and marketing positions at General Electric and Bell Atlantic.
Splice Machine Proprietary and Confidential
  
ETL: Gatekeeper to Real-Time Big Data
Rich Reimer
VP, Product Management
rreimer@splicemachine.com
August 11, 2015
  
What Is Real-Time? Are We There Yet?
Depends on where you are in the insight-to-action continuum
Capture | Analyze | Act
Current:
• Capture: Nightly ETL; Data Lakes
• Analyze: Interactive Reports on Old Data; Days for Data Scientists to Analyze
• Act: Days to Update Rules; Months to Update Apps
Real-Time:
• Capture: Millisecond Delay
• Analyze: Automated Machine Learning
• Act: Autonomic Applications
Crawl, Walk, Run

ETL: Boring, Unglamorous, Inevitable Burden
“ETL is something you do that nobody notices until you don’t do it.”
- Author Unknown
  
	
  
But It’s Killing You Slowly…
Inertia and hidden costs dragging your business down
Diagram components: Systems of Record (ERP, CRM, …), ODS, ETL, Data Warehouse
• Expensive: Scale-up hardware and proprietary software
• Tuning: Ongoing database tuning to address performance issues
• Script Maintenance: Constant updating of ETL scripts to handle changing sources and reports
• Unable to Meet Business Needs: Takes weeks or months to change or create new reports
• Delayed Reports: Errors or performance issues cause the ETL window to be missed and reports to be delayed
• Data Too Old: Data is hours or days old, when the business needs it in near real-time
• Too Slow: The ETL pipeline can take hours or even days to finish

Big Data Makes It Worse
ETL becomes bigger bottleneck as data grows
Diagram components: Applications, ETL Bottleneck, Analysis
30-40% data growth per year
Source: 2013 IBM Briefing Book
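
To put the 30-40% annual growth figure in perspective, the compounding can be worked out directly. The following is a minimal sketch, assuming a constant growth rate; the function names are illustrative and do not come from the slides.

```python
import math

def doubling_time_years(annual_growth: float) -> float:
    """Years for data volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

def volume_after(years: int, annual_growth: float, start: float = 1.0) -> float:
    """Relative data volume after compounding for the given number of years."""
    return start * (1 + annual_growth) ** years

for rate in (0.30, 0.40):
    print(f"{rate:.0%} growth: doubles in {doubling_time_years(rate):.1f} years, "
          f"{volume_after(3, rate):.2f}x after 3 years")
```

Under this assumption, data volume doubles roughly every two to two and a half years, so an ETL window that is barely met today would be missed within a few years unless capacity grows with it.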
  
Scale-Out: The Future of Databases
Dramatic improvement in price/performance
Scale Up (Increase server size) vs. Scale Out (More small servers)

Fixing ETL: Incremental Approach
Incremental evolution to reduce lag from days to seconds
Approach | Timing | Lag | Architecture
ETL: Scale-up | Legacy | Days/Hours | OLTP, Transform, OLAP
ETL: Scale-out | Now | Hours/Minutes | OLTP, Transform, OLAP
ELT | Now | Minutes/Seconds | OLTP, OLAP with Transform
T Only | Future | No Lag | OLTP/OLAP with Transform

Reference Architecture: Typical Data Processing Pipeline
How do you reduce lag from days to minutes to seconds?
Diagram components:
• Systems of Record: ERP, CRM, Supply Chain, HR, …
• Stream or Batch Updates
• ODS
• ETL: Extract, Transform, Load
• Data Warehouse
• Datamart
• Mixed Workload Apps
• Operational Reports & Analytics
• Executive Business Reports
• Ad Hoc Analytics

Reference Architecture: Scale-Out Data Processing Pipeline
Accelerate Data Processing Pipeline to minutes or even seconds
Diagram components:
• Systems of Record: ERP, CRM, Supply Chain, HR, …
• Stream or Batch Updates
• Operational Data Lake
• ETL: Extract, Transform, Load
• Data Warehouse
• Datamart
• Mixed Workload Apps
• Operational Reports & Analytics
• Executive Business Reports
• Ad Hoc Analytics
Benefits
• 5-10x faster
• 75% less cost
• Elastic scalability
• Unstructured data support
  
You Need More Than Hadoop By Itself For ETL
Errors or data quality issues force ETL restarts
• Hadoop ETL (Apps, ETL, Analytics): Restart ETL to fix errors or update records. Hours
• Hadoop RDBMS ETL (Apps, ETL, Analytics): Use transaction to restart step or update records. Seconds
Benefits
• SQL-based transforms
• Improved data quality
• Faster recovery with transactions
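
The idea of using a transaction to restart a single step, rather than the whole ETL run, can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for a transactional SQL database, and the table name `orders_clean`, the `run_step` helper and the sample rows are all hypothetical, not taken from Splice Machine.

```python
import sqlite3

def run_step(conn, rows):
    """Load one ETL step atomically: either every row commits or none does."""
    try:
        with conn:  # sqlite3 commits on success and rolls back on an exception
            for order_id, amount in rows:
                if amount is None:
                    raise ValueError(f"bad record {order_id}")
                conn.execute(
                    "INSERT INTO orders_clean (order_id, amount) VALUES (?, ?)",
                    (order_id, amount),
                )
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_clean (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

# First attempt hits a data quality issue halfway through the batch.
batch = [(1, 10.0), (2, None), (3, 7.5)]
first_try = run_step(conn, batch)
count_after_failure = conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0]

# Fix the bad record and rerun only this step; nothing needs manual cleanup.
fixed = [(oid, amt if amt is not None else 0.0) for oid, amt in batch]
second_try = run_step(conn, fixed)
count_after_retry = conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0]
```

Because the failed attempt is rolled back, the target table holds no partial rows, and the retry can reuse the same keys without duplicate errors.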
  
Streamlining the Structured Data Pipeline in Hadoop
Traditional Hadoop Pipeline: Source Systems (ERP, CRM, …); Sqoop; Apply Inferred Schema; Stored as flat files; SQL Query Engines; BI Tools
vs.
Streamlined Hadoop Pipeline: Source Systems (ERP, CRM, …); Existing ETL Tool; Stored in same schema; BI Tools
Benefits
• Less cost and complexity
• Faster w/ fewer translations
• Improved data quality
• Better SQL support
  
Seamless Integration of Structured and Unstructured Data
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines
Diagram components: OLTP Systems (ERP, CRM, Supply Chain, HR, …), Structured Data, Unstructured Data, HCATALOG, Pig
1. SCHEMA ON INGEST: Streamlined, structured-to-structured integration
2. SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data
3. SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data
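
The difference between schema on ingest and schema on read can be shown with a small sketch. Everything here is an assumption made for illustration: the `Order` record, the comma-separated input lines and the helper names are hypothetical and do not represent any particular Hadoop tool.

```python
from dataclasses import dataclass

RAW_LINES = ["1001,ERP,250.00", "1002,CRM,abc", "1003,HR,75.50"]

@dataclass
class Order:
    order_id: int
    source: str
    amount: float

def parse(line: str) -> Order:
    """Apply the schema to one raw line; raises ValueError on a bad field."""
    order_id, source, amount = line.split(",")
    return Order(int(order_id), source, float(amount))

def ingest_with_schema(lines):
    """Schema on ingest: validate once while loading and reject bad rows up front."""
    good, rejected = [], []
    for line in lines:
        try:
            good.append(parse(line))
        except ValueError:
            rejected.append(line)
    return good, rejected

def total_schema_on_read(lines):
    """Schema on read: keep raw text and parse on every query, so bad rows surface at query time."""
    total, skipped = 0.0, 0
    for line in lines:
        try:
            total += parse(line).amount
        except ValueError:
            skipped += 1
    return total, skipped

table, rejected = ingest_with_schema(RAW_LINES)
total_on_ingest = sum(order.amount for order in table)
total_on_read, skipped = total_schema_on_read(RAW_LINES)
```

Both paths reach the same total here, but schema on ingest pays the parsing and validation cost once, while schema on read repeats it for each query.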

Case Study: Operational Data Lake
Overview
• Computer technology corporation
• Update database technology for:
  - ODS layer replacement
  - ETL processing and analysis of Omniture data
  - Real-time OLTP for Global Tech Support app
Challenges
• Oracle and Teradata too expensive to scale
• Many Oracle queries couldn’t complete
• Can only hold 7 days worth of data in Oracle
• Missing ETL window with current Hadoop data lake
Solution Diagram
OLTP Systems (ERP, CRM, Supply Chain); (400TB)
Benefits
• 75% less cost: with commodity scale out
• Incremental ETL processing: gracefully handle data quality issues
• 5x-10x faster: completing queries on which Oracle failed
  
Use Cases
• Internet of Things
• ETL/Operational Data Lake
• Digital Marketing
• Precision Medicine
• Fraud Detection
  
Who Are We?
THE HADOOP RDBMS
Replace Oracle with Splice Machine to scale out your applications
• Affordable, Scale-Out – Commodity hardware
• Elastic – Easy to expand or scale back
• Transactional – Real-time updates & ACID Transactions
• ANSI SQL – Leverage existing SQL code, tools, & skills
• Flexible – Support operational and analytical workloads
10x Better Price/Perf
  
Proven Building Blocks: Hadoop and Derby
APACHE DERBY
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC Compliant
APACHE HBASE/HDFS
• Auto-sharding
• Real-time updates
• Fault-tolerance
• Scalability to 100s of PBs
• Data replication
  	
  
	
  
	
  
Distributed, Parallelized Query Execution
• Parallelized computation across cluster
• Moves computation to the data
• Utilizes HBase co-processors
• No MapReduce
LEGEND: HBase Co-Processor; HBase Server Memory Space
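
Moving computation to the data can be illustrated with a toy aggregate that runs once per shard and then merges the small partial results. This sketch does not use HBase or its co-processor API; the `REGIONS` list, the sample rows and the function names are hypothetical stand-ins, and a thread pool merely simulates work happening on several servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regions: each holds one shard of (customer, amount) rows.
REGIONS = [
    [("acme", 10.0), ("globex", 5.0)],
    [("acme", 2.5), ("initech", 8.0)],
    [("globex", 1.0), ("acme", 4.0)],
]

def local_aggregate(rows):
    """Runs next to the data: partial SUM(amount) GROUP BY customer for one region."""
    partial = {}
    for customer, amount in rows:
        partial[customer] = partial.get(customer, 0.0) + amount
    return partial

def merge(partials):
    """Runs on the coordinator: combine the per-region partial sums."""
    result = {}
    for partial in partials:
        for customer, subtotal in partial.items():
            result[customer] = result.get(customer, 0.0) + subtotal
    return result

# Each region computes its partial result in parallel; only the partials travel.
with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
    partials = list(pool.map(local_aggregate, REGIONS))

totals = merge(partials)
```

The point of the design is that only one small dictionary per region crosses the network, instead of every raw row.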
  
ANSI SQL-99 Coverage
• Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
• DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE
• Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
• DML – e.g., INSERT, DELETE, UPDATE, SELECT
• Query specification – e.g., GROUP BY, HAVING
• SET functions – e.g., UNION, ABS, MOD, ALL
• Aggregation functions – e.g., AVG, MAX, COUNT
• String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH
• Constraints – e.g., PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL
• Conditional functions – e.g., CASE, searched CASE
• Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
• Joins – e.g., INNER JOIN, LEFT OUTER JOIN
• Transactions – e.g., COMMIT, ROLLBACK, Snapshot Isolation
• Sub-queries
• Triggers
• User-defined functions (UDFs)
• Views – including grouped views
  
Lockless, ACID transactions
• Adds multi-row, multi-table transactions to HBase w/ rollback
• Fast, lockless, high concurrency
• Extends research from Google Percolator, Yahoo Labs, U of Waterloo
• Patent pending technology
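
A rough feel for lockless transactions with snapshot isolation can be had from a toy multi-version store. This is a single-threaded teaching sketch under simplifying assumptions, not Splice Machine's patent-pending design and not Google Percolator; the `Store`, `Txn` and `Conflict` names are invented for the example.

```python
class Conflict(Exception):
    """Raised when another transaction committed the same key first."""

class Store:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # timestamp of the latest commit

    def begin(self):
        return Txn(self, self.clock)

class Txn:
    def __init__(self, store, start_ts):
        self.store, self.start_ts, self.writes = store, start_ts, {}

    def read(self, key):
        """Return own buffered write, else the newest version visible at start_ts."""
        if key in self.writes:
            return self.writes[key]
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value  # buffered locally; no locks are taken

    def commit(self):
        # Write-write conflict check: first committer wins.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise Conflict(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))

    def rollback(self):
        self.writes.clear()

store = Store()
setup = store.begin()
setup.write("balance", 100)
setup.commit()

t1, t2 = store.begin(), store.begin()  # both see the same snapshot
t1.write("balance", t1.read("balance") - 30)
t1.commit()

snapshot_read = t2.read("balance")     # still the value from t2's snapshot
t2.write("balance", 50)
try:
    t2.commit()
    conflicted = False
except Conflict:
    conflicted = True
    t2.rollback()

latest = store.begin().read("balance")
```

Readers never block writers because each transaction reads from its own snapshot; conflicts are detected only at commit time, and the losing transaction rolls back.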
  
	
  
What People are Saying…
Recognized as a key innovator in databases
Quotes
• "Scaling out on Splice Machine presented some major benefits over Oracle ...automatic balancing between clusters...avoiding the costly licensing issues."
• "An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop."
• "The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop."
Awards
  
Initial Advisory Board
Advisory Board includes luminaries in databases and technology
Roger Bamford
  Former Principal Architect at Oracle
  Father of Oracle RAC
Mike Franklin
  Computer Science Chair, UC Berkeley
  Director, UC Berkeley AMPLab
  Founder of Apache Spark
Marie-Anne Neimat
  Co-Founder, Times-Ten Database
  Former VP, Database Eng. at Oracle
Ken Rudin
  Head of Analytics at Facebook
  Former GM of Oracle Data Warehousing
Abhinav Gupta
  Co-Founder, VP Engineering at Rocket Fuel
  Runs 15PB HBase Cluster
  
The First Step to Real-Time Big Data Requires Fixing ETL
ETL on Hadoop
• Drive lag down from hours → minutes → seconds
• Start by replacing ODS with Operational Data Lake
• 5-10x faster and ¼ cost
Splice Machine
• Replace RDBMSs like Oracle and MySQL
• Best of both worlds
  - SQL and transactions of RDBMSs
  - Scale-out of NoSQL
• 10x better price/performance
Diagram: ETL: Scale-up, ETL: Scale-out, ELT, T Only (OLTP, Transform, OLAP)
  
Focused on Operational Workloads
  
Oracle vs. Splice Machine TCO Comparison

Oracle RAC Costs | List Price | Unit | 3 Year Cost (Discounted 60%)
Oracle Database Enterprise Edition with RAC | $37,750 | 64 | $966,400
3 years DB Maintenance (22% list price/yr) | $24,915 | 64 | $637,824
3 years Operating System Support (Oracle Linux) | $6,897 | 4 | $11,035
Server Costs (mid-range, Intel Xeon-based) | $16,000 | 4 | $64,000
Primary Storage | $143,360 | | $143,360
TOTAL | $228,922 | | $1,822,619
Assumes Oracle Enterprise Edition ($47.5K/CPU) and RAC ($23K/CPU)

Splice Machine Costs | List Price | Unit | 3 Year Cost (without discount)
Splice Machine Annual Subscription | $10,000 | 7 | $210,000
Cloudera Enterprise Edition Annual Subscription | $7,500 | 8 | $180,000
Server Costs with Storage | $5,000 | 8 | $40,000
TOTAL | $22,500 | | $430,000

76% TCO Reduction
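
The totals and the 76% figure can be re-derived from the table rows. The sketch below is a check of the arithmetic only; it assumes, based on how the numbers work out, that the 60% discount applies to the Oracle software and support rows but not to servers or storage, that the two subscriptions are billed annually for three years, and that Primary Storage counts as a single unit.

```python
YEARS = 3
ORACLE_DISCOUNT = 0.60  # slide header: "Discounted 60%"

# (item, list price, units, discounted?) -- discount flags are an inference
oracle_items = [
    ("Oracle Database EE with RAC", 37_750, 64, True),
    ("3 years DB maintenance", 24_915, 64, True),
    ("3 years OS support (Oracle Linux)", 6_897, 4, True),
    ("Servers", 16_000, 4, False),
    ("Primary storage", 143_360, 1, False),  # unit count of 1 is assumed
]

# (item, list price, units, billed annually?)
splice_items = [
    ("Splice Machine annual subscription", 10_000, 7, True),
    ("Cloudera Enterprise annual subscription", 7_500, 8, True),
    ("Servers with storage", 5_000, 8, False),
]

oracle_total = sum(
    price * units * ((1 - ORACLE_DISCOUNT) if discounted else 1)
    for _, price, units, discounted in oracle_items
)
splice_total = sum(
    price * units * (YEARS if annual else 1)
    for _, price, units, annual in splice_items
)
reduction = 1 - splice_total / oracle_total

print(f"Oracle: ${oracle_total:,.0f}  Splice Machine: ${splice_total:,.0f}  "
      f"Reduction: {reduction:.0%}")
```

With these assumptions the script reproduces the slide's three-year totals of $1,822,619 and $430,000, and the reduction rounds to 76%.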
  
Perceptions & Questions
Analyst:
Robin Bloor
Life in the Data Lake
Robin Bloor, Ph.D.
Hadoop: One Ring to Rule Them All
Hadoop has become the de facto processing environment for big data.
Is it going to become the de facto environment for ALL SERVER COMPUTING?
Empires to Conquer
u  Big Data
u  Analytics
u  Real-time analytics
u  OLTP
u  Document shares
u  Office systems
✔︎
✔︎
?
?
??
Just A Few Years Ago
What Hadoop Dreams Of
Hadoop Possibilities?
u  Hadoop is evolving faster than any equivalent
technology I can remember
u  It has a very long way to go to become the
“server OS for everything.”
u  First it would need to become a genuine OS
u  It has no stated direction.
u  It may vanish into the cloud.
u  Nevertheless it is interesting to watch
The Net Net
Meanwhile, it has become a lab for
server software
u  It’s not just ETL: it’s ETL, data cleansing,
metadata capture, MDM, etc. How do you
accommodate that?
u  Do you have any ETL customer experiences to
report?
u  How’s your OLTP business going? (Is this ETL
emphasis a complementary activity?)
u  How well are you doing versus Oracle?
u  How well does it integrate with other
technologies?
u  What is your current largest customer(s)?
u  Do you have any direct competition on Hadoop?
Upcoming Topics
www.insideanalysis.com
August: REAL-TIME DATA
September: HADOOP 2.0
October: DATA MANAGEMENT
THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons and by basykes [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons (https://upload.wikimedia.org/wikipedia/commons/9/94/Beijing_traffic_jam.jpg)

Contenu connexe

Tendances

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Reaching scale limits on a Hadoop platform: issues and errors created by spee...
Reaching scale limits on a Hadoop platform: issues and errors created by spee...Reaching scale limits on a Hadoop platform: issues and errors created by spee...
Reaching scale limits on a Hadoop platform: issues and errors created by spee...
DataWorks Summit
 
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
DataWorks Summit
 

Tendances (19)

Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Spark meetup - Zoomdata Streaming
Spark meetup  - Zoomdata StreamingSpark meetup  - Zoomdata Streaming
Spark meetup - Zoomdata Streaming
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Trafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoopTrafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoop
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Reaching scale limits on a Hadoop platform: issues and errors created by spee...
Reaching scale limits on a Hadoop platform: issues and errors created by spee...Reaching scale limits on a Hadoop platform: issues and errors created by spee...
Reaching scale limits on a Hadoop platform: issues and errors created by spee...
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
2 - Trafodion and Hadoop HBase
2 - Trafodion and Hadoop HBase2 - Trafodion and Hadoop HBase
2 - Trafodion and Hadoop HBase
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
HDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and SupportabilityHDFS: Optimization, Stabilization and Supportability
HDFS: Optimization, Stabilization and Supportability
 
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
 
Real Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with SparkReal Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with Spark
 

En vedette

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Chicago Hadoop Users Group
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Data Con LA
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
Yahoo Developer Network
 

En vedette (9)

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
HBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsHBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbms
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
Hadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both WorldsHadoop and the Relational Database: The Best of Both Worlds
Hadoop and the Relational Database: The Best of Both Worlds
 
Crawl, Walk, Run: How to Get Started with Hadoop
Crawl, Walk, Run: How to Get Started with HadoopCrawl, Walk, Run: How to Get Started with Hadoop
Crawl, Walk, Run: How to Get Started with Hadoop
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineSpark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
 

Similaire à Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL

oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021
ssuser8ccb5a
 
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxCERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
camyla81
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Zalando Technology
 

Similaire à Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL (20)

oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021oracle_soultion_oracledataintegrator_goldengate_2021
oracle_soultion_oracledataintegrator_goldengate_2021
 
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptxCERN_DIS_ODI_OGG_final_oracle_golde.pptx
CERN_DIS_ODI_OGG_final_oracle_golde.pptx
 
Larry Ellison Introduces Oracle Database In-Memory
Larry Ellison Introduces Oracle Database In-MemoryLarry Ellison Introduces Oracle Database In-Memory
Larry Ellison Introduces Oracle Database In-Memory
 
Stream based Data Integration
Stream based Data IntegrationStream based Data Integration
Stream based Data Integration
 
Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7
 
Oracle Database 11g Lower Your Costs
Oracle Database 11g Lower Your CostsOracle Database 11g Lower Your Costs
Oracle Database 11g Lower Your Costs
 
Performance Tuning intro
Performance Tuning introPerformance Tuning intro
Performance Tuning intro
 
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
Informatica to ODI Migration – What, Why and How |  Informatica to Oracle Dat...Informatica to ODI Migration – What, Why and How |  Informatica to Oracle Dat...
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
 
Performance tuning intro
Performance tuning introPerformance tuning intro
Performance tuning intro
 
Times ten 18.1_overview_meetup
Times ten 18.1_overview_meetupTimes ten 18.1_overview_meetup
Times ten 18.1_overview_meetup
 
Querona Presentation 2018
Querona Presentation 2018Querona Presentation 2018
Querona Presentation 2018
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
What's New in Oracle SQL Developer for 2018
What's New in Oracle SQL Developer for 2018What's New in Oracle SQL Developer for 2018
What's New in Oracle SQL Developer for 2018
 
Meetup Oracle Database: 3 Analizar, Aconsejar, Automatizar… las nuevas funcio...
Meetup Oracle Database: 3 Analizar, Aconsejar, Automatizar… las nuevas funcio...Meetup Oracle Database: 3 Analizar, Aconsejar, Automatizar… las nuevas funcio...
Meetup Oracle Database: 3 Analizar, Aconsejar, Automatizar… las nuevas funcio...
 
Tame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data IntegrationTame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data Integration
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
Présentation Oracle DataBase 11g
Présentation Oracle DataBase 11gPrésentation Oracle DataBase 11g
Présentation Oracle DataBase 11g
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE Fukuoka
 
times ten in-memory database for extreme performance
times ten in-memory database for extreme performancetimes ten in-memory database for extreme performance
times ten in-memory database for extreme performance
 

Plus de Inside Analysis

Rethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldRethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile World
Inside Analysis
 

Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL

  • 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  • 2. The Briefing Room. Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
  • 3. Twitter Tag: #briefr. The Briefing Room. Welcome. Host: Eric Kavanagh, eric.kavanagh@bloorgroup.com, @eric_kavanagh
  • 4. Mission: Reveal the essential characteristics of enterprise software, good and bad. Provide a forum for detailed analysis of today's innovative technologies. Give vendors a chance to explain their product to savvy analysts. Allow audience members to pose serious questions... and get answers!
  • 5. Topics. August: REAL-TIME DATA. September: HADOOP 2.0. October: DATA MANAGEMENT.
  • 6. Why Data Gets in a Jam: ETL is dated technology. New super-highways are needed. Data gravity is real.
  • 7. Analyst: Robin Bloor. Robin Bloor is Chief Analyst at The Bloor Group. robin.bloor@bloorgroup.com, @robinbloor
  • 8. Splice Machine: Splice Machine is a SQL-on-Hadoop database. The product is ACID-compliant and can power both OLAP and OLTP workloads. Splice Machine is built on Java-based Apache Derby and HBase/Hadoop.
  • 9. Guest: Rich Reimer, VP of Marketing and Product Management. Rich has over 15 years of sales, marketing and management experience in high-tech companies. Before joining Splice Machine, Rich worked at Zynga as the Treasure Isle studio head, where he used petabytes of data from millions of daily users to optimize the business in real-time. Prior to Zynga, he was the COO and co-founder of a social media platform named Grouply. Before founding Grouply, Rich held executive positions at Siebel Systems, Blue Martini Software and Oracle Corporation as well as sales and marketing positions at General Electric and Bell Atlantic.
  • 10. Splice Machine Proprietary and Confidential. ETL: Gatekeeper to Real-Time Big Data. Rich Reimer, VP, Product Management, rreimer@splicemachine.com. August 11, 2015.
  • 11. What Is Real-Time? Are We There Yet? Depends on where you are in the insight-to-action continuum (Capture, Analyze, Act). Current: Nightly ETL; Data Lakes; Interactive Reports on Old Data; Days for Data Scientists to Analyze; Days to Update Rules; Months to Update Apps. Real-Time: Millisecond Delay; Automated Machine Learning; Autonomic Applications. Crawl, Walk, Run.
  • 12. ETL: Boring, Unglamorous, Inevitable Burden. "ETL is something you do that nobody notices until you don't do it." - Author Unknown
  • 13. But It's Killing You Slowly… Inertia and hidden costs dragging your business down. [Diagram: Systems of Record (ERP, CRM, …), ODS, ETL, Data Warehouse.] Expensive: Scale-up hardware and proprietary software. Tuning: Ongoing database tuning to address performance issues. Script Maintenance: Constant updating of ETL scripts to handle changing sources and reports. Unable to Meet Business Needs: Takes weeks or months to change or create new reports. Delayed Reports: Errors or performance issues cause the ETL window to be missed and delay reports. Data Too Old: Data is hours or days old, when business needs it near real-time. Too Slow: Can take hours or even days to finish ETL pipeline.
  • 14. Big Data Makes It Worse. ETL becomes bigger bottleneck as data grows. 30-40% data growth per year (Source: 2013 IBM Briefing Book). [Diagram: ETL Bottleneck, Applications, Analysis.]
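To make the growth figure on the slide above concrete, here is a minimal sketch of how 30-40% annual growth compounds. The 100 TB starting volume and the five-year horizon are illustrative assumptions, not figures from the deck.

```python
import math


def projected_volume(start_tb: float, annual_growth: float, years: int) -> float:
    """Data volume after compounding `annual_growth` (e.g. 0.30) for `years` years."""
    return start_tb * (1.0 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    start_tb = 100.0  # hypothetical starting volume, not from the slides
    for rate in (0.30, 0.40):
        volume = projected_volume(start_tb, rate, 5)
        doubling = doubling_time_years(rate)
        print(f"{rate:.0%}/yr: {volume:.0f} TB after 5 years, "
              f"doubling every {doubling:.1f} years")
```

Under these assumptions the volume grows roughly 3.7x to 5.4x in five years, so an ETL window that is comfortable today would be under pressure within a few years if throughput stayed flat.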
  • 15. Scale-Out: The Future of Databases. Dramatic improvement in price/performance. Scale Up (increase server size) vs. Scale Out (more small servers).
  • 16. Fixing ETL: Incremental Approach. Incremental evolution to reduce lag from days to seconds. Approach (Timing; Lag): ETL: Scale-up (Legacy; Days/Hours). ETL: Scale-out (Now; Hours/Minutes). ELT (Now; Minutes/Seconds). T Only (Future; No Lag). [Architecture diagrams: OLTP, Transform, OLAP, and OLTP/OLAP components.]
  • 17. Reference Architecture: Typical Data Processing Pipeline. How do you reduce lag from days to minutes to seconds? [Diagram: Systems of Record (ERP, CRM, Supply Chain, HR, …); ODS; ETL (Extract, Transform, Load); Data Warehouse; Datamart; Stream or Batch Updates; Mixed Workload Apps; Operational Reports & Analytics; Executive Business Reports; Ad Hoc Analytics.]
  • 18. Reference Architecture: Scale-Out Data Processing Pipeline. Accelerate Data Processing Pipeline to minutes or even seconds. [Diagram: Systems of Record (ERP, CRM, Supply Chain, HR, …); Operational Data Lake; ETL (Extract, Transform, Load); Data Warehouse; Datamart; Stream or Batch Updates; Mixed Workload Apps; Operational Reports & Analytics; Executive Business Reports; Ad Hoc Analytics.] Benefits: 5-10x faster; 75% less cost; Elastic scalability; Unstructured data support.
  • 19. You Need More Than Hadoop By Itself For ETL. Errors or data quality issues force ETL restarts. Hadoop ETL: Restart ETL to fix errors or update records (Hours). Hadoop RDBMS ETL: Use transaction to restart step or update records (Seconds). [Diagram: Apps, ETL, Analytics.] Benefits: SQL-based transforms; Improved data quality; Faster recovery with transactions.
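The idea on the slide above, using a transaction to redo one step instead of restarting the whole ETL run, can be sketched as follows. This is only an illustration: Python's built-in `sqlite3` stands in for a transactional SQL database, and the `orders` table, its columns, and the helper names are hypothetical rather than anything from Splice Machine.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory target table with constraints that reject bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )
    return conn


def load_batch(conn: sqlite3.Connection, rows: list) -> bool:
    """Load one batch as a single transaction.

    Returns True if the batch committed. If any row violates a constraint,
    the whole batch is rolled back and earlier batches are left untouched.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


def row_count(conn: sqlite3.Connection) -> int:
    """Number of rows currently committed in the target table."""
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    load_batch(conn, [(1, "acme", 10.0), (2, "globex", 5.0)])
    bad = [(3, "initech", 7.0), (4, "umbrella", -1.0)]  # second row fails the CHECK
    if not load_batch(conn, bad):
        fixed = [(3, "initech", 7.0), (4, "umbrella", 1.0)]
        load_batch(conn, fixed)  # retry only the failed step
    print(row_count(conn))
```

The first batch survives the failure of the second, so recovery means fixing and replaying one step rather than the entire pipeline.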
  • 20. Streamlining the Structured Data Pipeline in Hadoop. Traditional Hadoop Pipeline: Source Systems (ERP, CRM, …); Sqoop; Apply Inferred Schema; Stored as flat files; SQL Query Engines; BI Tools. vs. Streamlined Hadoop Pipeline: Source Systems (ERP, CRM, …); Existing ETL Tool; Stored in same schema; BI Tools. Benefits: Less cost and complexity; Faster w/ fewer translations; Improved data quality; Better SQL support.
  • 21. Seamless Integration of Structured and Unstructured Data. Optimizing storage and querying of structured data as part of ELT or Hadoop query engines. [Diagram: OLTP Systems (ERP, CRM, Supply Chain, HR, …); Structured Data; Unstructured Data; HCATALOG; Pig.] (1) SCHEMA ON INGEST: Streamlined, structured-to-structured integration. (2) SCHEMA BEFORE READ: Repository for structured data or metadata from ELT process on unstructured data. (3) SCHEMA ON READ: Ad-hoc Hadoop queries across structured and unstructured data.
  • 22. Case Study: Operational Data Lake. Overview: Computer technology corporation. Update database technology for: ODS layer replacement; ETL processing and analysis of Omniture data; Real-time OLTP for Global Tech Support app. Challenges: Oracle and Teradata too expensive to scale; Many Oracle queries couldn't complete; Can only hold 7 days' worth of data in Oracle; Missing ETL window with current Hadoop data lake. [Solution diagram: OLTP Systems (ERP, CRM, Supply Chain); 400TB.] Benefits: 75% less cost with commodity scale out; Incremental ETL processing to gracefully handle data quality issues; 5x-10x faster, completing queries on which Oracle failed.
  • 23. Use Cases: Internet of Things; ETL/Operational Data Lake; Digital Marketing; Precision Medicine; Fraud Detection.
  • 24. Who Are We? THE HADOOP RDBMS. Replace Oracle with Splice Machine to scale out your applications. Affordable, Scale-Out: Commodity hardware. Elastic: Easy to expand or scale back. Transactional: Real-time updates & ACID Transactions. ANSI SQL: Leverage existing SQL code, tools, & skills. Flexible: Support operational and analytical workloads. 10x Better Price/Perf.
  • 25. Proven Building Blocks: Hadoop and Derby. APACHE DERBY: ANSI SQL-99 RDBMS; Java-based; ODBC/JDBC Compliant. APACHE HBASE/HDFS: Auto-sharding; Real-time updates; Fault-tolerance; Scalability to 100s of PBs; Data replication.
  • 26. Distributed, Parallelized Query Execution. Parallelized computation across cluster. Moves computation to the data. Utilizes HBase co-processors. No MapReduce. [Diagram legend: HBase Co-Processor; HBase Server Memory Space.]
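"Moves computation to the data" on the slide above can be illustrated with a toy aggregate: each data partition computes a small partial result locally, and only those summaries travel to a coordinator. This is a sketch under stated assumptions, not the HBase co-processor API: the `REGIONS` lists are hypothetical stand-ins for rows held by separate servers, and threads stand in for those servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows stored on one region.
REGIONS = [
    [12.0, 7.5, 3.0],
    [20.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]


def partial_aggregate(rows):
    """Runs 'next to the data': returns a small (sum, count) summary for one region."""
    return sum(rows), len(rows)


def distributed_avg(regions):
    """Fan the aggregate out to every region in parallel, then merge the summaries."""
    with ThreadPoolExecutor(max_workers=max(1, len(regions))) as pool:
        partials = list(pool.map(partial_aggregate, regions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(distributed_avg(REGIONS))
```

Only one `(sum, count)` pair per region crosses the "network" here, which is the point of pushing the computation down instead of shipping raw rows to a central node.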
  • 27. ANSI SQL-99 Coverage. Data types (e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT). DDL (e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE). Predicates (e.g., IN, BETWEEN, LIKE, EXISTS). DML (e.g., INSERT, DELETE, UPDATE, SELECT). Query specification (e.g., GROUP BY, HAVING). SET functions (e.g., UNION, ABS, MOD, ALL). Aggregation functions (e.g., AVG, MAX, COUNT). String functions (e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH). Constraints (e.g., PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL). Conditional functions (e.g., CASE, searched CASE). Privileges (e.g., privileges for SELECT, DELETE, INSERT, EXECUTE). Joins (e.g., INNER JOIN, LEFT OUTER JOIN). Transactions (e.g., COMMIT, ROLLBACK, Snapshot Isolation). Sub-queries. Triggers. User-defined functions (UDFs). Views, including grouped views.
  • 28. Lockless, ACID transactions. Adds multi-row, multi-table transactions to HBase w/ rollback. Fast, lockless, high concurrency. Extends research from Google Percolator, Yahoo Labs, U of Waterloo. Patent pending technology.
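The deck lists Snapshot Isolation among its transaction features and describes the transactions above as lockless. As a rough illustration of that general technique, and not of Splice Machine's patent-pending design, here is a toy single-process multi-version store: readers see a snapshot as of their start timestamp, writes are buffered until commit, and a commit aborts if another transaction committed the same key first. All class and method names are hypothetical, and a real system would also need the commit step itself to be atomic across servers.

```python
import itertools


class WriteConflict(Exception):
    """Raised when another transaction committed the same key after our snapshot."""


class MVCCStore:
    def __init__(self):
        self._versions = {}               # key -> list of (commit_ts, value), oldest first
        self._clock = itertools.count(1)  # monotonically increasing timestamps

    def begin(self):
        return Transaction(self, start_ts=next(self._clock))

    def _read(self, key, ts):
        # Newest version committed at or before the snapshot timestamp.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def _commit(self, txn):
        # First committer wins: abort if any written key has a newer committed version.
        for key in txn.writes:
            versions = self._versions.get(key, [])
            if versions and versions[-1][0] > txn.start_ts:
                raise WriteConflict(key)
        commit_ts = next(self._clock)
        for key, value in txn.writes.items():
            self._versions.setdefault(key, []).append((commit_ts, value))


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered writes, invisible to others until commit

    def get(self, key):
        # Read our own uncommitted writes first, then fall back to the snapshot.
        if key in self.writes:
            return self.writes[key]
        return self.store._read(key, self.start_ts)

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.store._commit(self)

    def rollback(self):
        self.writes.clear()  # nothing was published, so discarding is enough
```

In this sketch no reader ever waits on a writer: two overlapping transactions can both read the old value, and only the second one to commit a conflicting write is aborted.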
  • 29. What People are Saying… Recognized as a key innovator in databases. Quotes: "Scaling out on Splice Machine presented some major benefits over Oracle ...automatic balancing between clusters...avoiding the costly licensing issues." "An alternative to today's RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop." "The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop." Awards.
  • 30. Initial Advisory Board. Advisory Board includes luminaries in databases and technology. Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC. Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; Founder of Apache Spark. Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle. Ken Rudin: Head of Analytics at Facebook; Former GM of Oracle Data Warehousing. Abhinav Gupta: Co-Founder, VP Engineering at Rocket Fuel; Runs 15PB HBase Cluster.
  • 31. The First Step to Real-Time Big Data Requires Fixing ETL. ETL on Hadoop: Drive lag down from hours to minutes to seconds; Start by replacing ODS with Operational Data Lake; 5-10x faster and ¼ cost. Splice Machine: Replace RDBMSs like Oracle and MySQL; Best of both worlds (SQL and transactions of RDBMSs, scale-out of NoSQL); 10x better price/performance. [Diagram: ETL: Scale-up; ETL: Scale-out; ELT; T Only.]
  • 32. ETL: Gatekeeper to Real-Time Big Data. Rich Reimer, VP, Product Management, rreimer@splicemachine.com. August 11, 2015.
  • 33. Focused on Operational Workloads
  • 34. Oracle vs. Splice Machine TCO comparison.
      Oracle RAC Costs                                 | List Price | Unit | 3 Year Cost (Discounted 60%)
      Oracle Database Enterprise Edition with RAC      | $37,750    | 64   | $966,400
      3 years DB Maintenance (22% list price/yr)       | $24,915    | 64   | $637,824
      3 years Operating System Support (Oracle Linux)  | $6,897     | 4    | $11,035
      Server Costs (mid-range, Intel Xeon-based)       | $16,000    | 4    | $64,000
      Primary Storage                                  | $143,360   |      | $143,360
      TOTAL                                            | $228,922   |      | $1,822,619
      Assumes Oracle Enterprise Edition ($47.5K/CPU) and RAC ($23K/CPU).
      Splice Machine Costs                             | List Price | Unit | 3 Year Cost (without discount)
      Splice Machine Annual Subscription               | $10,000    | 7    | $210,000
      Cloudera Enterprise Edition Annual Subscription  | $7,500     | 8    | $180,000
      Server Costs with Storage                        | $5,000     | 8    | $40,000
      TOTAL                                            | $22,500    |      | $430,000
      76% TCO Reduction
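The totals on the TCO slide above can be reproduced with a few lines of arithmetic. This sketch rests on assumptions inferred from the numbers rather than stated on the slide: the 60% discount is applied to Oracle software and support but not to servers or storage, Primary Storage is treated as a single unit, and the two annual subscriptions are multiplied by three years.

```python
ORACLE_DISCOUNT = 0.60  # "Discounted 60%" in the slide's column header

# (item, list price, units, years multiplier, discount applied)
ORACLE_ITEMS = [
    ("Database Enterprise Edition with RAC", 37_750, 64, 1, ORACLE_DISCOUNT),
    ("3 years DB maintenance", 24_915, 64, 1, ORACLE_DISCOUNT),  # already a 3-year price
    ("3 years OS support", 6_897, 4, 1, ORACLE_DISCOUNT),
    ("Servers", 16_000, 4, 1, 0.0),           # assumed undiscounted
    ("Primary storage", 143_360, 1, 1, 0.0),  # assumed one unit, undiscounted
]

SPLICE_ITEMS = [
    ("Splice Machine annual subscription", 10_000, 7, 3, 0.0),
    ("Cloudera Enterprise annual subscription", 7_500, 8, 3, 0.0),
    ("Servers with storage", 5_000, 8, 1, 0.0),
]


def three_year_cost(items):
    """Sum of list price x units x years, net of any discount, over all line items."""
    return sum(price * units * years * (1.0 - discount)
               for _, price, units, years, discount in items)


def tco_reduction(baseline, alternative):
    """Fractional saving of `alternative` relative to `baseline`."""
    return 1.0 - alternative / baseline


if __name__ == "__main__":
    oracle = three_year_cost(ORACLE_ITEMS)
    splice = three_year_cost(SPLICE_ITEMS)
    print(f"Oracle RAC:     ${oracle:,.0f}")
    print(f"Splice Machine: ${splice:,.0f}")
    print(f"Reduction:      {tco_reduction(oracle, splice):.0%}")
```

With these assumptions the line items sum to the slide's $1,822,619 and $430,000, and the ratio gives the quoted 76% reduction. Note that the comparison depends heavily on the assumed Oracle discount and on the two clusters being sized for equivalent workloads.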
  • 35. Perceptions & Questions. Analyst: Robin Bloor
  • 36. Life in the Data Lake. Robin Bloor, Ph.D.
  • 37. Hadoop: One Ring to Rule Them All. Hadoop has become the de facto processing environment for big data. Is it going to become the de facto environment for ALL SERVER COMPUTING?
  • 38. Empires to Conquer: Big Data; Analytics; Real-time analytics; OLTP; Document shares; Office systems. Status marks shown on the slide: ✔ ✔ ? ? ??
  • 39. Just A Few Years Ago
  • 41. Hadoop Possibilities? Hadoop is evolving faster than any equivalent technology I can remember. It has a very long way to go to become the "server OS for everything." First it would need to become a genuine OS. It has no stated direction. It may vanish into the cloud. Nevertheless it is interesting to watch.
  • 42. The Net Net: Meanwhile, it has become a lab for server software.
  • 43. It's not just ETL: it's ETL, data cleansing, metadata capture, MDM, etc. How do you accommodate that? Do you have any ETL customer experiences to report? How's your OLTP business going? (Is this ETL emphasis a complementary activity?) How well are you doing versus Oracle?
  • 44. How well does it integrate with other technologies? Who are your current largest customers? Do you have any direct competition on Hadoop?
  • 45. Twitter Tag: #briefr The Briefing Room
  • 46. Upcoming Topics (www.insideanalysis.com). August: REAL-TIME DATA. September: HADOOP 2.0. October: DATA MANAGEMENT.
  • 47. THANK YOU for your ATTENTION! Some images provided courtesy of Wikimedia Commons and by basykes [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons (https://upload.wikimedia.org/wikipedia/commons/9/94/Beijing_traffic_jam.jpg)