SQL vs NoSQL: Why you’ll never
dump your relations
17th March 2015
© 2015 EXASOL AG
BCS Data Management Specialist Group
Dave Shuttleworth – Principal Consultant, Exasol UK
email: dave.shuttleworth@exasol.com
Twitter: @EXA_DaveS
Agenda
• Introduction & background
• SQL vs NoSQL - observations
• Case study
  • King – online gaming
• What’s hot?
• Q & A
My background
• 2014 - 2015 – EXASOL UK – Principal Consultant
  • Introducing EXASOL DBMS technology into the UK
• 2003 - 2014 – Intelligent Edge Group – Principal Consultant
  • Data Warehouse design and migration from older technologies to new MPP DBMS
  • Business Intelligence infrastructure architect
  • New DBMS technology assessment
• 1992 - 2003 – WhiteCross Systems (now Kognitio) – Principal Consultant
  • Pre-sales and post-sales technical support
• 1989 - 1992 – Teradata – Consultant
  • Pre-sales and post-sales technical support
• 1980 - 1989 – Data General (now part of EMC) – Systems engineer
  • Pre-sales and post-sales technical support
• 1975 - 1980 – UK retailer – Analyst programmer
  • Applications design and implementation, system management and tuning
What is Exasol?
The World’s Fastest Analytic Database
• a column store, in-memory, massively parallel processing (MPP) database
• modern software designed for analytics
• runs on standard x86 hardware
• uses standard SQL language (with optional extensions)
• suitable for any scale of data & any number of users
• mature, proven & very cost effective
• quick to implement & easy to operate
We are the benchmark leader
On 1 Terabyte of data - an order of magnitude faster than its closest rival
[Bar chart: TPC-H QphH@1000 GB, queries per hour; axis marked at 1,000,000, 2,000,000, 3,000,000 and 4,000,000]
EXASOL 5,246,338
Microsoft 519,976
Vectorwise 445,529
Oracle 326,454
Sybase IQ 258,474
Microsoft 219,887
Oracle 209,533
Oracle 201,487
Microsoft 134,117
Result dates shown on the chart: Sept ´14, April ´14, June ´12, Feb ´14, Dec ´13, Aug ´11, Sept ´11, Oct ´11, Dec ´11
Source: www.tpc.org / Sept 22, 2015
• Databases and Data Warehouses have evolved to meet the needs of
business (over many years…!)
• Generally using some form of Relational Database (SQL based)
• Originally tightly structured data, now expanding to include unstructured data
• Ever increasing data volumes and complexity
• New technologies have emerged to address (and extend) the storage and
management requirements
• Fast cheap network connectivity
• Cloud services for cheaper and more flexible implementation
• Wider acceptance of open source software for production systems
• Hadoop parallel processing platform – often in a ‘hybrid’ environment
• Alternative database technologies (e.g. document stores, graph databases)
• Publicly accessible data sources (e.g. weather history, flight data, Google
searches, Twitter feeds, census data, mapping data)
• More complex analytics needed to stay competitive
SQL vs NoSQL - background
• Proliferation of NoSQL (‘not only SQL’) databases – over 150 listed on
nosql-database.org – classified by type:
• Wide Column Stores
  • E.g. Hadoop, MapR, Cassandra, MonetDB
• Document stores
  • Elasticsearch, MongoDB, Couchbase, MarkLogic
• Key value/tuple store
  • DynamoDB, Azure Table Storage, Oracle NoSQL, MemcacheDB
• Graph databases
  • Neo4j, YarcData, Graphbase
• Multimodel databases
• Object databases
• etc.
SQL vs NoSQL - background
• The inherent restrictions of relational databases are addressed by
NoSQL implementations :
• More flexible data model – ‘schemaless’ or ‘schema on read’
• ‘Schemaless’ can mean very fast write performance – useful for streaming data
• Simplifies handling of unstructured and semi-structured data such as logfiles,
other machine generated data and text
• Designed for easy scale-up (and scale down) to handle seasonal workloads
• High levels of concurrency can be achieved via distributed processing
• High availability via replication is built in to some NoSQL databases
• Maps well to cloud based infrastructure and capabilities (if done well!)
SQL vs NoSQL - background
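To make the ‘schema on read’ bullet concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular NoSQL product works: the event records, field names and the `read_with_schema` helper are all hypothetical. Records are written as-is, and a schema is only imposed when they are read.

```python
import json

# Hypothetical raw events, stored exactly as they arrive: no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "a", "event": "login", "ts": 1}',
    '{"user": "b", "event": "purchase", "amount": 4.99, "ts": 2}',
    '{"user": "a", "event": "level_up", "level": 7, "ts": 3}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: project each record onto `fields`,
    returning None for attributes a record never had."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# The reader decides which columns matter; the writer never had to.
rows = list(read_with_schema(RAW_EVENTS, ["user", "event", "amount"]))
for row in rows:
    print(row)
```

The write path stays cheap because nothing is validated; the cost moves to every reader, which has to cope with missing or unexpected attributes.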
Hadoop today is …
• Still Open Source !
• Began with HDFS and Map/Reduce
• Now comprises a number of additional technologies
  • File systems (e.g. Tachyon)
  • Cluster Managers (e.g. YARN + Mesos)
  • Execution Engines (e.g. Tez, Spark etc.)
  • Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)
Hadoop With Everything?
• Hadoop was invented to more easily distribute the Nutch web search engine across a cluster of machines.
  • Map/Reduce – distributed processing
  • HDFS – distributed file system
• Began to be used for … just about everything.
• But not all processing tasks are like indexing the Internet
• Hadoop started to attract criticism
  • But usually when it was being used for something it wasn’t designed for
Definitely NOT jobs for Hadoop
• Word processing
• Payroll system
• Anything on a single computer
• Anything with “small” data
Analytical Queries
• “GROUP BY” logic
  • i.e. not concerned with individual data items
• Analytical Functions
  • MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
• Table joins, nested subqueries
Usually short-running, ad-hoc and submitted many at a time.
Map/Reduce and HDFS : the wrong tools for Analytics ?
• Queries tend to be short : fault tolerance is less important
  • If chance of failure in a 5 hour batch is 1 in 300
  • Chance of failure in a 5 second query is 1 in 1,000,000
• Queries tend to be short : start-up time is significant
  • a 20 second start-up time is NOT OK on a 5 second query
• A number of projects started to address these issues
  • e.g. “Hot containers” in Hive on Tez to reduce start-up time
  • Also Pushdown via Hive partitions or ORC predicate pushdown
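The two numbers on this slide can be checked with a few lines of Python. This is a back-of-envelope sketch, assuming failures arrive at a constant rate so that a 5 hour batch behaves like 3,600 independent 5 second slices; the constant names are mine, and only the 1 in 300, 5 hour, 5 second and 20 second figures come from the slide.

```python
BATCH_SECONDS = 5 * 60 * 60   # the 5 hour batch from the slide
QUERY_SECONDS = 5             # the 5 second query from the slide
P_BATCH_FAILURE = 1 / 300     # failure chance quoted for the batch

# Assumption: constant failure rate, so the batch survives only if every
# query-sized slice of it survives independently.
slices = BATCH_SECONDS // QUERY_SECONDS
p_query_failure = 1 - (1 - P_BATCH_FAILURE) ** (1 / slices)
one_in = round(1 / p_query_failure)

# Start-up cost: 20 seconds of start-up in front of 5 seconds of useful work.
STARTUP_SECONDS = 20
startup_share = STARTUP_SECONDS / (STARTUP_SECONDS + QUERY_SECONDS)

print(f"slices per batch: {slices}")
print(f"query failure chance: about 1 in {one_in:,}")
print(f"start-up share of elapsed time: {startup_share:.0%}")
```

Under that assumption the per-query figure lands just above one in a million, which matches the rounded number on the slide, and start-up accounts for four fifths of the elapsed time.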
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation
Map/Reduce: the wrong language for Analytics ?
Stage 0: Map-Shuffle-Reduce
Mapper(row) {
fields = row.split("\t")
emit(fields[0], fields[1]);
}
Reducer(key, values) {
sum = 0;
for (value in values) {
sum += value;
}
emit(key, sum);
}
Stage 1: Map-Shuffle
Mapper(row) {
...
emit(page_views, page_name);
}
... shuffle
Stage 2: Local
data = open("stage1.out")
for (i in 0 to 10) {
print(data.getNext())
}
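For readers who want to run the three stages, here is a single-process Python sketch of the same map, shuffle and reduce flow. It is not Hadoop code and not taken from the Shark presentation: the function names, the `shuffle` helper and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from heapq import nlargest


def mapper(row):
    # Stage 0 map: split a tab-separated row into (page_name, page_views).
    page_name, page_views = row.split("\t")
    yield page_name, int(page_views)


def shuffle(pairs):
    # Group emitted values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    # Stage 0 reduce: total views per page.
    return key, sum(values)


def top_pages(rows, n=10):
    pairs = (pair for row in rows for pair in mapper(row))
    totals = [reducer(key, values) for key, values in shuffle(pairs).items()]
    # Stages 1 and 2: order by view count and keep the first n.
    return nlargest(n, totals, key=lambda item: item[1])


# Hypothetical sample input in the same tab-separated layout.
ROWS = ["Main_Page\t5", "Hadoop\t2", "Main_Page\t7", "SQL\t9"]
result = top_pages(ROWS, n=2)
print(result)
```

Even in this compressed form the developer has to spell out the grouping, the summing and the final sort by hand.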
Equivalent in SQL
SELECT
page_name,
SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
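The query can be tried without any cluster at all. The sketch below runs it against an in-memory SQLite database from Python; SQLite is only a convenient stand-in here, and the table layout and sample rows are assumptions, since the slide shows the query but not the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout for wikistats: one row per page name and view count.
conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
conn.executemany(
    "INSERT INTO wikistats VALUES (?, ?)",
    [("Main_Page", 5), ("Hadoop", 2), ("Main_Page", 7), ("SQL", 9)],
)

top = conn.execute(
    """
    SELECT page_name, SUM(page_views) AS views
    FROM wikistats
    GROUP BY page_name
    ORDER BY views DESC
    LIMIT 10
    """
).fetchall()
conn.close()

print(top)
```

The statement describes the result wanted and leaves grouping, sorting and any parallelism to the engine.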
The SQL language
• Portable
• Well-defined standards exist
• No detailed knowledge of the platform required
  • e.g. you don’t need to manage memory
• SQL is assumed by a lot of reporting tools
• Widely used and understood even by non-technical people
I’m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in
pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured
• “Which part of STRUCTURED Query Language don’t you
understand …?!”
• All I’m saying is that it is an excellent language for
analytical queries.
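As an illustration of the “sessionise” bullet, here is a short procedural sketch in Python. It shows the kind of row-by-row logic (compare each event with the previous one for the same user) that is natural in a general-purpose language. The 30 minute timeout, the `sessionise` function and the sample log are all assumptions, not something taken from a real weblog format.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed inactivity timeout, in seconds


def sessionise(events, gap=SESSION_GAP):
    """Group (user, unix_timestamp) events into per-user sessions.

    A new session starts whenever a user has been idle for longer than `gap`.
    Returns {user: [[ts, ts, ...], [ts, ...], ...]}.
    """
    sessions = defaultdict(list)
    for user, ts in sorted(events):
        user_sessions = sessions[user]
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)   # continue the current session
        else:
            user_sessions.append([ts])     # idle too long: open a new session
    return dict(sessions)


# Hypothetical log: user "a" returns after more than 30 minutes of inactivity.
LOG = [("a", 0), ("a", 600), ("b", 100), ("a", 5000), ("b", 200)]
result = sessionise(LOG)
print(result)
```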
Hadoop could handle SQL (via Hive), but historically …
• High Latency
• Restricted SQL options
• All but simple table joins were difficult
• Little support for compression & indexing
• Merv Adrian (Gartner Research - 2014)
  • “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
• Result : EVERYTHING looked good compared to Hive
Everyone still likes to compare themselves to Hive
EXASOL being no exception !
Hive continues to be improved …
• Completed
  • Views (HIVE-1143)
  • Partitioned Views (HIVE-1941)
  • Storage Handlers (HIVE-705)
  • HBase Integration
  • HBase Bulk Load
  • Locking (HIVE-1293)
  • Indexes (HIVE-417)
  • Bitmap Indexes (HIVE-1803)
  • Filter Pushdown (HIVE-279)
  • Table-level Statistics (HIVE-1361)
  • Dynamic Partitions
  • Binary Data Type (HIVE-2380)
  • Decimal Precision and Scale Support
  • HCatalog
  • HiveServer2 (HIVE-2935)
  • Column Statistics in Hive (HIVE-1362)
  • List Bucketing (HIVE-3026)
  • Group By With Rollup (HIVE-2397)
  • Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
  • Optimizing Skewed Joins (HIVE-3086)
  • Correlation Optimizer (HIVE-2206)
  • Hive on Tez (HIVE-4660)
  • Vectorized Query Execution (HIVE-4160)
• In Progress
  • Atomic Insert/Update/Delete (HIVE-5317)
  • Transaction Manager (HIVE-5843)
  • Cost Based Optimizer in Hive (HIVE-5775)
• Proposed
  • Spatial Queries
  • Theta Join (HIVE-556)
  • JDBC Storage Handler
  • MapJoin Optimization
  • Proposal to standardize and expand Authorization in Hive
  • Dependent Tables (HIVE-3466)
  • AccessServer
  • Type Qualifiers in Hive
  • MapJoin & Partition Pruning (HIVE-5119)
  • SQL Standard based secure authorization (HIVE-5837)
  • Updatable Views (HIVE-1143)
  • Hive on Spark (HIVE-7292)
The dream data architecture for analytics …
Based on the SQL language
but leverages Hadoop’s extreme scalability
and Hadoop’s fault tolerance
while not compromising on speed.
Could it please also have some maturity ?
And be easy to use ?
The current reality
• SQL on SQL, which is arguably
  • Less scalable
  • Less fault tolerant
  • Less good with unstructured data
• SQL on Hadoop, which is arguably
  • Less mature
  • Less easy to use
  • Slower
Choices for SQL and Hadoop
• SQL AND HADOOP
  • A Connector
• HADOOP ON SQL
  • User Defined Functions
• SQL ON HADOOP
  • Something like Hive, but better
Option 1 – SQL AND HADOOP
Run SQL on SQL and Hadoop on Hadoop and use a connector to join the two systems
Pros
• Minimal impact (SQL and Hadoop worlds can function as before)
• Easier to implement
Cons
• Network !
• Challenge of optimising across two technologies
Option 2 – HADOOP ON SQL
• Bring Map/Reduce into the Parallel database
• For example using Java User Defined Functions
select my_java_map_function(words) a_word,
count(*) word_count
from DOCUMENTS
group by 1
• Doesn’t benefit from Hadoop’s storage advantages
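The idea of calling user code from inside a SQL statement can be tried on a laptop. The sketch below registers a Python function with an in-memory SQLite database and groups on its output. This is only a stand-in for a parallel database with Java UDFs. `my_map_function`, `normalise_word` and the sample rows are hypothetical, and unlike a true map function, which may emit several rows per input, a SQLite scalar function returns exactly one value per row, so the table here already holds one word per row.

```python
import sqlite3


def normalise_word(word):
    # Hypothetical stand-in for my_java_map_function: one output value per input row.
    return word.strip(".,!?").lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("my_map_function", 1, normalise_word)

conn.execute("CREATE TABLE documents (words TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("SQL",), ("sql!",), ("Hadoop",), ("hadoop.",), ("SQL.",)],
)

counts = conn.execute(
    """
    SELECT my_map_function(words) AS a_word, COUNT(*) AS word_count
    FROM documents
    GROUP BY a_word
    ORDER BY word_count DESC, a_word
    """
).fetchall()
conn.close()

print(counts)
```

The database does the grouping and counting; the user-supplied function only handles the part that SQL expresses poorly.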
Option 3 - SQL ON HADOOP
Build a relational database on Hadoop storage
• Impala (Cloudera)
• Stinger (Hortonworks)
• Presto (Facebook)
• SparkSQL (UC Berkeley)
• HAWQ (Pivotal)
• BigSQL (IBM)
• Apache Phoenix (for HBase)
• Apache Tajo
• Apache Drill
• etc etc etc …
AND DON’T FORGET HIVE !
Four possible market outcomes…
• Hadoop and SQL databases are on a collision course – only one will survive
  • No sign of that so far
• They are complementary – both will survive
  • Probably - the challenge is how to make them work together
• They will merge and become one
  • Some indications this is already starting to happen
• Something even more amazing will come along and replace them both
  • Sometimes this happens – Spark ?
What do the pundits say?
• Martin Fowler – Thoughtworks
  • The rise of NoSQL databases marks the end of the era of relational database dominance
  • But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.
  • The era of Polyglot Persistence has begun - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data
• Emil Eifrem – Neo Technology
  • When evaluating a NoSQL database, it is critical to demand enterprise-readiness. An enterprise delivering modern applications needs a NoSQL database that can manage today's complex and connected data while still delivering the enterprise strength, transactions and durability that IT departments have relied on for years.
King in numbers
• 100 million daily active users
• 1 billion game plays per day
• 8 offices
And lots and lots of data...
• 14 billion rows per day
• 500 GB per day new
• 700 TB stored
Case Study - King
King - Getting to know 500 million players
Objectives in game analytics
• Metrics and KPIs
• Measure and understand player behaviour
• Player segmentation
• Improve player experience
• Forecasting
• Predictive modelling
Challenges at King
• Extreme scale
• Rate of growth
• Speed of innovation
• Cross platform
• Virtual economies
King - Getting to know 500 million players
The King formula
• Data driven culture
• Engaged business
• Talented embedded data scientists
• AB testing
• Right technology platform
• Right data model
King - Getting to know 500 million players
System architecture
How King does data
[Diagram labels: Game servers, TSV log files, Log server, ETL, Raw data, Data Warehouse, Dimensional model, Reports, Data scientists]
Our data keeps growing...
How King does data
[Chart annotation: King launches on mobile...]
…our technology has to keep up
How King does data
[Chart labels: Qlikview says no; Infobright CE says no; 10 node Hadoop; 80 nodes; 40 nodes; 20 nodes; InfiniDB; Exasol]
Data platform 1.0
How King does data
[Diagram labels: Games, Event data, Hive, ETL, Reports, Data scientists]
Data platform 1.5
How King does data
[Diagram labels: Games, Event data, Hive, DB, ETL, Reports, Data scientists]
Why ExaSolution?
• Speed
• Efficiency
• Tuning free
• Scaling (150 TB and counting...)
• ExaDudes
How King does data
Performance
How King does data
Data platform 2.0
How King does data
[Diagram labels: Games, Event data, Hive, Exasol, ETL, Reports, Data scientists]
Benefits
• ETL times slashed
• Cost saving
• Tuning free
• Scaling
How King does data
Data platform 3.0
Where next?
[Diagram labels: Games, Event data, Exasol, Hive, ETL, Reports, Data scientists]
Future challenges
• Keep on scaling
• Closer Hadoop integration
• Evolving data model
• Microbatch ETL
• Real(er) time…
Where next?
What’s hot?
• A definition:
• The Internet of Things (IoT) is a scenario in which objects, animals or people are
provided with unique identifiers and the ability to transfer data over a network
without requiring human-to-human or human-to-computer interaction
• Basic concept has been around for decades – now accepted into the
mainstream
• Wide range of potential uses:
• Environmental monitoring
• Infrastructure management
• Manufacturing
• Energy management
• Medical and healthcare systems
• Building and home automation
• Transport systems
Internet of Things
• Wearable technologies – e.g. smart watches, Google Glass
• Bio sensors for humans (and other animals)
• Health monitoring
• Already in use on some dairy farms – optimise milk yields and give early
warning for possible disease
• Location based data
• All modern phones provide location data (either GPS or cell based)
• ‘crowd sourcing’ – e.g. traffic flow based on cellphone signals
• Beacons – e.g. Regent Street in London
• Location-based special offers and advertisement
• Facial recognition
• To drive targeted advertisements
Other emerging technologies which produce data
• Cloud being used for evaluation of new technologies and also as a platform for
dev/test (and even DR) environments
• In-database analytics using UDFs in languages such as R, Lua and Python
• Move the processing closer to the data
• Run analytics on full data volumes (no sampling/extract required)
• Get improved performance due to parallelism (where possible)
• Lots of freely available R code on the web
• Automated conversion of analytical results to text (NLG) is emerging
• AI rule-based generation of natural language output
• Readable summaries and recommendations
• Yseop, NarrativeScience, Automated Insights, Arria NLG
Other emerging trends
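A small sketch of what in-database analytics with a Python UDF can look like, using SQLite's user-defined aggregates as a stand-in for an analytic database. The `py_median` aggregate, the `plays` table and its rows are hypothetical. The point is only that the function runs where the data lives, over every row in each group, with no extract or sample taken first.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate: collects one group's values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group, after the last row.
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE plays (game TEXT, score REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("saga", 10.0), ("saga", 30.0), ("saga", 20.0), ("quest", 5.0), ("quest", 7.0)],
)

medians = conn.execute(
    "SELECT game, py_median(score) FROM plays GROUP BY game ORDER BY game"
).fetchall()
conn.close()

print(medians)
```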
• Data and database technology isn’t going away!
• New database approaches are being developed to address the
requirements of flexibility, scalability etc
• These technologies drive an increasing need for more analysts,
database designers, data scientists
• Hybrid systems are becoming the norm, with companies mixing ‘best
of breed’ technologies (possibly open source) to get the best and
most cost-effective results – use ‘the right tool for the job’
• SQL databases will continue to be widely utilised – but alongside
other technologies and integration will become tighter
Summary
Dave Shuttleworth
Twitter: @EXA_DaveS
Email: dave.shuttleworth@exasol.com
Any questions?
Presentation to insert name here 60
Presentation to insert name here 61

Contenu connexe

Tendances

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Rittman Analytics
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 
A7 storytelling with_oracle_analytics_cloud
A7 storytelling with_oracle_analytics_cloudA7 storytelling with_oracle_analytics_cloud
A7 storytelling with_oracle_analytics_cloudDr. Wilfred Lin (Ph.D.)
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?Venu Anuganti
 
Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop Cloudera, Inc.
 
Tame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data IntegrationTame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data IntegrationMichael Rainey
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...James Serra
 
Versa Shore Microsoft APS PDW webinar
Versa Shore Microsoft APS PDW webinarVersa Shore Microsoft APS PDW webinar
Versa Shore Microsoft APS PDW webinarShawn Rao
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analyticsjoshwills
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondDataWorks Summit/Hadoop Summit
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with DatabricksAmazon Web Services
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at WalgreensDataWorks Summit
 
Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsjdijcks
 

Tendances (20)

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
A7 storytelling with_oracle_analytics_cloud
A7 storytelling with_oracle_analytics_cloudA7 storytelling with_oracle_analytics_cloud
A7 storytelling with_oracle_analytics_cloud
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
The EDW Ecosystem
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?SQL/NoSQL How to choose ?
SQL/NoSQL How to choose ?
 
Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop
 
4AA6-4492ENW
4AA6-4492ENW4AA6-4492ENW
4AA6-4492ENW
 
Tame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data IntegrationTame Big Data with Oracle Data Integration
Tame Big Data with Oracle Data Integration
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Versa Shore Microsoft APS PDW webinar
Versa Shore Microsoft APS PDW webinarVersa Shore Microsoft APS PDW webinar
Versa Shore Microsoft APS PDW webinar
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with Databricks
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
Oracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analyticsOracle Big Data Appliance and Big Data SQL for advanced analytics
Oracle Big Data Appliance and Big Data SQL for advanced analytics
 

Similaire à SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL

Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleDatabricks
 
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?brianlangbecker
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfpbonillo1
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...Charley Hanania
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure Antonios Chatzipavlis
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure Antonios Chatzipavlis
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresCCG
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Pentaho
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesCCG
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Graph databases and OrientDB
Graph databases and OrientDBGraph databases and OrientDB
Graph databases and OrientDBAhsan Bilal
 

Similaire à SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL (20)

Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
Why does Microsoft care about NoSQL, SQL and Polyglot Persistence?
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
 
Designing a modern data warehouse in azure


SQL vs NoSQL: Why you’ll never dump your relations - Dave Shuttleworth, EXASOL

  • 1. SQL vs NoSQL: Why you’ll never dump your relations 17th March 2015
  • 2. © 2015 EXASOL AG BCS Data Management Specialist Group Dave Shuttleworth – Principal Consultant, Exasol UK email: dave.shuttleworth@exasol.com Twitter: @EXA_DaveS
  • 3. Agenda
      • Introduction & background
      • SQL vs NoSQL - observations
      • Case study
          • King – online gaming
      • What’s hot?
      • Q & A
  • 4. My background
      • 2014-2015 – EXASOL UK – Principal Consultant
          • Introducing EXASOL DBMS technology into the UK
      • 2003-2014 – Intelligent Edge Group – Principal Consultant
          • Data Warehouse design and migration from older technologies to new MPP DBMS
          • Business Intelligence infrastructure architect
          • New DBMS technology assessment
      • 1992-2003 – WhiteCross Systems (now Kognitio) – Principal Consultant
          • Pre-sales and post-sales technical support
      • 1989-1992 – Teradata – Consultant
          • Pre-sales and post-sales technical support
      • 1980-1989 – Data General (now part of EMC) – Systems engineer
          • Pre-sales and post-sales technical support
      • 1975-1980 – UK retailer – Analyst programmer
          • Applications design and implementation, system management and tuning
  • 5. What is Exasol? The World’s Fastest Analytic Database
      • A column store, in-memory, massively parallel processing (MPP) database
      • Modern software designed for analytics
      • Runs on standard x86 hardware
      • Uses the standard SQL language (with optional extensions)
      • Suitable for any scale of data & any number of users
      • Mature, proven & very cost effective
      • Quick to implement & easy to operate
  • 6. We are the benchmark leader
      [Bar chart: QphH@1000 GB (queries per hour), axis from 1,000,000 to 4,000,000. Source: www.tpc.org / Sept 22, 2015]
      • Leading result: 5,246,338
      • Other results shown: Microsoft 134,117; Oracle 201,487; Oracle 209,533; Microsoft 219,887; Sybase IQ 258,474; Oracle 326,454; Vectorwise 445,529; Microsoft 519,976
      • Result dates shown on the chart: Sept ’14, April ’14, June ’12, Feb ’14, Dec ’13, Aug ’11, Sept ’11, Oct ’11, Dec ’11
      On 1 Terabyte of data - an order of magnitude faster than its closest rival
  • 8. SQL vs NoSQL - background
      • Databases and Data Warehouses have evolved to meet the needs of business (over many years…!)
          • Generally using some form of Relational Database (SQL based)
          • Originally tightly structured data, now expanding to include unstructured data
          • Ever increasing data volumes and complexity
      • New technologies have emerged to address (and extend) the storage and management requirements
          • Fast cheap network connectivity
          • Cloud services for cheaper and more flexible implementation
          • Wider acceptance of open source software for production systems
          • Hadoop parallel processing platform – often in a ‘hybrid’ environment
          • Alternative database technologies (e.g. document stores, graph databases)
          • Publicly accessible data sources (e.g. weather history, flight data, Google searches, Twitter feeds, census data, mapping data)
      • More complex analytics needed to stay competitive
  • 9. SQL vs NoSQL - background
      • Proliferation of NoSQL (‘not only SQL’) databases – over 150 listed on nosql.database.org – classified by type:
          • Wide Column Stores
              • E.g. Hadoop, MapR, Cassandra, MonetDB
          • Document stores
              • Elasticsearch, MongoDB, Couchbase, MarkLogic
          • Key value/tuple store
              • DynamoDB, Azure Table Storage, Oracle NoSQL, MemcacheDB
          • Graph databases
              • Neo4j, YarcData, Graphbase
          • Multimodal databases
          • Object databases
          • etc.
  • 10. SQL vs NoSQL - background
      • The inherent restrictions of relational databases are addressed by NoSQL implementations:
          • More flexible data model – ‘schemaless’ or ‘schema on read’
          • ‘Schemaless’ can mean very fast write performance – useful for streaming data
          • Simplifies handling of unstructured and semi-structured data such as logfiles, other machine generated data and text
          • Designed for easy scale-up (and scale down) to handle seasonal workloads
          • High levels of concurrency can be achieved via distributed processing
          • High availability via replication is built in to some NoSQL databases
          • Maps well to cloud based infrastructure and capabilities (if done well!)
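To make ‘schema on read’ concrete, here is a minimal Python sketch. The event records, field names and the `read_with_schema` helper are all invented for illustration; none of them come from the slides. Records are stored exactly as they arrive, and a schema is only chosen when the data is read.

```python
import json

# Hypothetical raw events, stored as they arrived (no schema enforced on write).
RAW_EVENTS = [
    '{"user": "u1", "event": "login", "ts": 1426579200}',
    '{"user": "u2", "event": "purchase", "amount": 4.99, "currency": "GBP"}',
    '{"user": "u1", "event": "level_up", "level": 12, "device": {"os": "iOS"}}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep the requested fields, default to None."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


# Two readers can project the same raw data onto different schemas.
for row in read_with_schema(RAW_EVENTS, ["user", "event", "amount"]):
    print(row)
```

The write path never has to change when a new field appears, which is where the fast-write claim comes from; the cost is that every reader has to cope with missing or unexpected fields.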
  • 11. Hadoop today is …
      • Still Open Source!
      • Began with HDFS and Map/Reduce
      • Now comprises a number of additional technologies
          • File systems (e.g. Tachyon)
          • Cluster Managers (e.g. YARN + Mesos)
          • Execution Engines (e.g. Tez, Spark etc.)
          • Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)
  • 12. Hadoop With Everything?
      • Hadoop was invented to more easily distribute the Nutch web search engine across a cluster of machines
          • Map/Reduce – distributed processing
          • HDFS – distributed file system
      • Began to be used for … just about everything
          • But not all processing tasks are like indexing the Internet
      • Hadoop started to attract criticism
          • But usually when it was being used for something it wasn’t designed for
  • 13. Definitely NOT jobs for Hadoop
      • Word processing
      • Payroll system
      • Anything on a single computer
      • Anything with “small” data
  • 14. Analytical Queries
      • “GROUP BY” logic
          • i.e. not concerned with individual data items
      • Analytical Functions
          • MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
      • Table joins, nested subqueries
      Usually short-running, ad-hoc and submitted many at a time.
  • 15. Map/Reduce and HDFS: the wrong tools for Analytics?
      • Queries tend to be short: fault tolerance is less important
          • If the chance of failure in a 5 hour batch is 1 in 300
          • then the chance of failure in a 5 second query is about 1 in 1,000,000
      • Queries tend to be short: start-up time is significant
          • A 20 second start-up time is NOT OK on a 5 second query
      • A number of projects started to address these issues
          • e.g. “Hot containers” in Hive on Tez to reduce start-up time
          • Also pushdown via Hive partitions or ORC predicate pushdown
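The two numerical claims on this slide can be checked with a few lines of Python. This sketch assumes a constant failure rate over time, which the slide does not state; the function names are made up for illustration.

```python
BATCH_SECONDS = 5 * 60 * 60   # 5 hour batch job
QUERY_SECONDS = 5             # 5 second query
P_BATCH_FAILURE = 1 / 300     # the slide's figure for the batch job


def failure_probability(p_long, long_seconds, short_seconds):
    """Scale a failure probability to a shorter run, assuming a constant failure rate."""
    survive_long = 1.0 - p_long
    survive_short = survive_long ** (short_seconds / long_seconds)
    return 1.0 - survive_short


def startup_share(startup_seconds, work_seconds):
    """Fraction of total elapsed time spent on start-up."""
    return startup_seconds / (startup_seconds + work_seconds)


p_query = failure_probability(P_BATCH_FAILURE, BATCH_SECONDS, QUERY_SECONDS)
print(f"query failure chance: about 1 in {1 / p_query:,.0f}")
print(f"start-up share with 20 s start-up on a 5 s query: {startup_share(20, 5):.0%}")
```

Under that assumption the query fails roughly once in 1.08 million runs, which matches the slide's rounded "1 in 1,000,000", and a 20 second start-up accounts for 80% of the elapsed time of a 5 second query.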
  • 16. Map/Reduce: the wrong language for Analytics?
      Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation

      Stage 0: Map-Shuffle-Reduce
          Mapper(row) {
              fields = row.split("\t")
              emit(fields[0], fields[1]);
          }
          Reducer(key, values) {
              sum = 0;
              for (value in values) {
                  sum += value;
              }
              emit(key, sum);
          }
      Stage 1: Map-Shuffle
          Mapper(row) {
              ...
              emit(page_views, page_name);
          }
          ... shuffle
      Stage 2: Local
          data = open("stage1.out")
          for (i in 0 to 10) {
              print(data.getNext())
          }
  • 17. Equivalent in SQL
      SELECT page_name, SUM(page_views) views
      FROM wikistats
      GROUP BY page_name
      ORDER BY views DESC
      LIMIT 10;
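The contrast between the two slides can be tried out on a laptop. The sketch below is only an illustration: the sample rows are invented, SQLite stands in for the analytic database, and the hand-written function imitates the map, shuffle and reduce stages in a single process rather than on a cluster.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (page_name, page_views)
ROWS = [("Main_Page", 120), ("Hadoop", 30), ("SQL", 45), ("Main_Page", 80), ("SQL", 5)]


def top_pages_mapreduce(rows, limit=10):
    """Hand-written map, shuffle, reduce and final sort, mirroring the three stages."""
    shuffled = defaultdict(list)
    for page_name, page_views in rows:          # map: emit (key, value)
        shuffled[page_name].append(page_views)  # shuffle: group values by key
    totals = [(name, sum(views)) for name, views in shuffled.items()]  # reduce
    totals.sort(key=lambda pair: pair[1], reverse=True)                # order by views
    return totals[:limit]


def top_pages_sql(rows, limit=10):
    """The slide's query, unchanged apart from a bound LIMIT parameter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
    conn.executemany("INSERT INTO wikistats VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page_name, SUM(page_views) views FROM wikistats "
        "GROUP BY page_name ORDER BY views DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result


print(top_pages_mapreduce(ROWS))
print(top_pages_sql(ROWS))
```

Both functions return the same ranking; the SQL version leaves grouping, sorting and memory management to the engine.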
  • 18. The SQL language
      • Portable
      • Well-defined standards exist
      • No detailed knowledge of the platform required
          • e.g. you don’t need to manage memory
      • SQL is assumed by a lot of reporting tools
      • Widely used and understood even by non-technical people
  • 19. I’m not saying that SQL is perfect
      • Try writing the simple Hadoop “Word Count” example in pure SQL
      • Or try to “sessionise” weblog data
      • Or anything with data that is not structured
          • “Which part of STRUCTURED Query Language don’t you understand …?!”
      • All I’m saying is that it is an excellent language for analytical queries.
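For readers unfamiliar with the term, “sessionising” means grouping each user's hits into visits separated by a period of inactivity. A minimal procedural sketch in Python follows; the 30 minute timeout, the sample hits and the function name are assumptions made for illustration, not taken from the talk.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not from the slides

# Hypothetical weblog hits: (user_id, unix_timestamp)
HITS = [("u1", 1000), ("u2", 1200), ("u1", 1300), ("u1", 5000), ("u2", 1250)]


def sessionise(hits, gap=SESSION_GAP_SECONDS):
    """Return (user_id, session_number, timestamp) for every hit."""
    labelled = []
    ordered = sorted(hits)  # sort by user, then by time
    for user, user_hits in groupby(ordered, key=itemgetter(0)):
        session, previous = 0, None
        for _, ts in user_hits:
            if previous is None or ts - previous > gap:
                session += 1  # first hit, or the gap was too long: new session
            previous = ts
            labelled.append((user, session, ts))
    return labelled


for row in sessionise(HITS):
    print(row)
```

The logic depends on the previous row for the same user, which is the kind of row-to-row state that is awkward to express in plain set-based SQL.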
  • 20. Hadoop could handle SQL (via Hive), but historically …
      • High latency
      • Restricted SQL options
      • All but simple table joins were difficult
      • Little support for compression & indexing
      • Merv Adrian (Gartner Research - 2014)
          • “What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
      • Result: EVERYTHING looked good compared to Hive
  • 21. Everyone still likes to compare themselves to Hive
  • 22. EXASOL being no exception!
  • 23. Hive continues to be improved …
      • Completed: Views (HIVE-1143); Partitioned Views (HIVE-1941); Storage Handlers (HIVE-705); HBase Integration; HBase Bulk Load; Locking (HIVE-1293); Indexes (HIVE-417); Bitmap Indexes (HIVE-1803); Filter Pushdown (HIVE-279); Table-level Statistics (HIVE-1361); Dynamic Partitions; Binary Data Type (HIVE-2380); Decimal Precision and Scale Support; HCatalog; HiveServer2 (HIVE-2935); Column Statistics in Hive (HIVE-1362); List Bucketing (HIVE-3026); Group By With Rollup (HIVE-2397); Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433); Optimizing Skewed Joins (HIVE-3086); Correlation Optimizer (HIVE-2206); Hive on Tez (HIVE-4660); Vectorized Query Execution (HIVE-4160)
      • In Progress: Atomic Insert/Update/Delete (HIVE-5317); Transaction Manager (HIVE-5843); Cost Based Optimizer in Hive (HIVE-5775)
      • Proposed: Spatial Queries; Theta Join (HIVE-556); JDBC Storage Handler; MapJoin Optimization; Proposal to standardize and expand Authorization in Hive; Dependent Tables (HIVE-3466); AccessServer; Type Qualifiers in Hive; MapJoin & Partition Pruning (HIVE-5119); SQL Standard based secure authorization (HIVE-5837); Updatable Views (HIVE-1143); Hive on Spark (HIVE-7292)
  • 24. The dream data architecture for analytics …
      Based on the SQL language, but leverages Hadoop’s extreme scalability and Hadoop’s fault tolerance while not compromising on speed.
      Could it please also have some maturity? And be easy to use?
  • 25. The current reality
      • SQL on SQL, which is arguably
          • Less scalable
          • Less fault tolerant
          • Less good with unstructured data
      • SQL on Hadoop, which is arguably
          • Less mature
          • Less easy to use
          • Slower
  • 26. Choices for SQL and Hadoop
      • SQL AND HADOOP
          • A connector
      • HADOOP ON SQL
          • User Defined Functions
      • SQL ON HADOOP
          • Something like Hive, but better
  • 27. Option 1 – SQL AND HADOOP
      Run SQL on SQL and Hadoop on Hadoop and use a connector to join the two systems
      Pros
      • Minimal impact (SQL and Hadoop worlds can function as before)
      • Easier to implement
      Cons
      • Network!
      • Challenge of optimising across two technologies
  • 28. Option 2 – HADOOP ON SQL
      • Bring Map/Reduce into the parallel database
          • For example using Java User Defined Functions:
              select my_java_map_function(words) a_word, count(*) word_count
              from DOCUMENTS
              group by 1
      • Doesn’t benefit from Hadoop’s storage advantages
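The idea of running map-style logic inside the database can be tried with SQLite's Python bindings. This is only a sketch under stated assumptions: a scalar Python function stands in for the slide's Java UDF, the sample documents are invented, and `my_map_function` here returns one value per row (the lower-cased first word), whereas a real word count UDF would emit one row per word.

```python
import sqlite3


def my_map_function(words):
    """Hypothetical stand-in for the slide's Java UDF: lower-cased first word."""
    tokens = words.lower().split()
    return tokens[0] if tokens else None


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("my_map_function", 1, my_map_function)
conn.execute("CREATE TABLE documents (words TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("Hadoop cluster",), ("hadoop jobs",), ("SQL query",)],
)
# Same shape as the slide's query: the UDF does the "map", GROUP BY does the "reduce".
word_counts = conn.execute(
    "SELECT my_map_function(words) a_word, COUNT(*) word_count "
    "FROM documents GROUP BY 1 ORDER BY 1"
).fetchall()
print(word_counts)
```

The database still handles grouping and counting; only the per-row logic is supplied by user code, which is the division of labour the slide describes.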
  • 29. Option 3 - SQL ON HADOOP
      Build a relational database on Hadoop storage
      • Impala (Cloudera)
      • Stinger (Hortonworks)
      • Presto (Facebook)
      • SparkSQL (UC Berkeley)
      • HAWQ (Pivotal)
      • BigSQL (IBM)
      • Apache Phoenix (for HBase)
      • Apache Tajo
      • Apache Drill
      • etc.
      AND DON’T FORGET HIVE!
  • 30. Four possible market outcomes…
      • Hadoop and SQL databases are on a collision course – only one will survive
          • No sign of that so far
      • They are complementary – both will survive
          • Probably - the challenge is how to make them work together
      • They will merge and become one
          • Some indications this is already starting to happen
      • Something even more amazing will come along and replace them both
          • Sometimes this happens – Spark?
  • 31. What do the pundits say?
      • Martin Fowler – Thoughtworks
          • The rise of NoSQL databases marks the end of the era of relational database dominance
          • But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.
          • The era of Polyglot Persistence has begun - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data
      • Emil Eifrem – Neo Technology
          • When evaluating a NoSQL database, it is critical to demand enterprise-readiness. An enterprise delivering modern applications needs a NoSQL database that can manage today's complex and connected data while still delivering the enterprise strength, transactions and durability that IT departments have relied on for years.
  • 33. Case Study - King
      King in numbers
      • 100 million daily active users
      • 1 billion game plays per day
      • 8 offices
      And lots and lots of data...
      • 14 billion rows per day
      • 500 GB of new data per day
      • 700 TB stored
  • 34. King - Getting to know 500 million players: Objectives in game analytics
      • Metrics and KPIs
      • Measure and understand player behaviour
      • Player segmentation
      • Improve player experience
      • Forecasting
      • Predictive modelling
  • 35. King - Getting to know 500 million players: Challenges at King
      • Extreme scale
      • Rate of growth
      • Speed of innovation
      • Cross platform
      • Virtual economies
  • 36. King - Getting to know 500 million players: The King formula
      • Data driven culture
      • Engaged business
      • Talented embedded data scientists
      • AB testing
      • Right technology platform
      • Right data model
  • 37. How King does data: System architecture
      [Diagram labels: Game servers, Log server, TSV log files, Raw data, ETL, Data Warehouse, Dimensional model, Reports, Data scientists]
  • 38. How King does data: Our data keeps growing...
      [Chart annotation: King launches on mobile...]
  • 39. How King does data: …our technology has to keep up
      [Chart labels: Qlikview says no, Infobright CE says no, 10 node Hadoop, 20 nodes, 40 nodes, 80 nodes, InfiniDB, Exasol]
  • 40. How King does data: Data platform 1.0
      [Diagram labels: Games, Event data, Hive, ETL, Reports, Data scientists]
  • 41. How King does data: Data platform 1.5
      [Diagram labels: Games, Event data, Hive, DB, ETL, Reports, Data scientists]
  • 42. How King does data: Why ExaSolution?
      • Speed
      • Efficiency
      • Tuning free
      • Scaling (150 TB and counting...)
      • ExaDudes
  • 43. How King does data: Performance
  • 44. How King does data: Data platform 2.0
      [Diagram labels: Games, Event data, Hive, Exasol, ETL, Reports, Data scientists]
  • 45. How King does data: Benefits
      • ETL times slashed
      • Cost saving
      • Tuning free
      • Scaling
  • 46. Where next? Data platform 3.0
      [Diagram labels: Games, Event data, Exasol, Hive, ETL, Reports, Data scientists]
  • 47. Where next? Future challenges
      • Keep on scaling
      • Closer Hadoop integration
      • Evolving data model
      • Microbatch ETL
      • Real(er) time…
  • 49. What’s hot?
  • 50. Internet of Things
      • A definition:
          • The Internet of Things (IoT) is a scenario in which objects, animals or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction
      • Basic concept has been around for decades – now accepted into the mainstream
      • Wide range of potential uses:
          • Environmental monitoring
          • Infrastructure management
          • Manufacturing
          • Energy management
          • Medical and healthcare systems
          • Building and home automation
          • Transport systems
  • 51. Other emerging technologies which produce data
      • Wearable technologies – e.g. smart watches, Google Glass
      • Bio sensors for humans (and other animals)
          • Health monitoring
          • Already in use on some dairy farms – optimise milk yields and give early warning for possible disease
      • Location based data
          • All modern phones provide location data (either GPS or cell based)
          • ‘Crowd sourcing’ – e.g. traffic flow based on cellphone signals
          • Beacons – e.g. Regent Street in London
          • Location-based special offers and advertisement
      • Facial recognition
          • To drive targeted advertisements
  • 52. Other emerging trends
      • Cloud being used for evaluation of new technologies and also as a platform for dev/test (and even DR) environments
      • In-database analytics using UDFs in languages such as R, Lua and Python
          • Move the processing closer to the data
          • Run analytics on full data volumes (no sampling/extract required)
          • Get improved performance due to parallelism (where possible)
          • Lots of freely available R code on the web
      • Automated conversion of analytical results to text (NLG) is emerging
          • AI rule-based generation of natural language output
          • Readable summaries and recommendations
          • Yseop, NarrativeScience, Automated Insights, Arria NLG
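As a small-scale illustration of in-database analytics with a Python UDF, the sketch below registers a user-defined aggregate with SQLite so that the statistic is computed where the data lives instead of being extracted first. SQLite is only a stand-in for an MPP database; the table, the sample scores and the `py_median` name are hypothetical.

```python
import sqlite3
import statistics


class Median:
    """Aggregate UDF: collects a group's values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group after the last row.
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)
conn.execute("CREATE TABLE plays (game TEXT, score REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("a", 20.0), ("b", 5.0), ("b", 7.0)],
)
medians = conn.execute(
    "SELECT game, py_median(score) FROM plays GROUP BY game ORDER BY game"
).fetchall()
print(medians)
```

In a parallel database the same pattern lets each node run the user code against its own slice of the data, which is where the parallelism benefit mentioned on the slide would come from.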
  • 53. Summary
      • Data and database technology isn’t going away!
      • New database approaches are being developed to address the requirements of flexibility, scalability etc.
      • These technologies drive an increasing need for more analysts, database designers, data scientists
      • Hybrid systems are becoming the norm, with companies mixing ‘best of breed’ technologies (possibly open source) to get the best and most cost-effective results – use ‘the right tool for the job’
      • SQL databases will continue to be widely utilised, but alongside other technologies, and integration will become tighter
  • 55. Any questions?
      Dave Shuttleworth
      Twitter: @EXA_DaveS
      Email: dave.shuttleworth@exasol.com