NICK HALSTEAD, FOUNDER
DATASIFT, @NIK
Big Data
“Myths and Legends”
#BDW13
Thursday, 25 April 13
#BDW13
BIG DATA + SOCIAL DATA
TV MONITORING, POLITICAL TRACKING, FINANCIAL FEEDS
1.5 BILLION ITEMS PER DAY
1.5 PETABYTES OF STORAGE
5,000 CPU HADOOP CLUSTER
#DATASIFT
BIG DATA PERCEPTION
#GOOGLE
I THOUGHT I WOULD ASK GOOGLE....
BIG DATA VENDOR “MYTHS”
1. YOU MUST BUY ALL OF THIS (for one job!)
2. HOW BIG IS “BIG”?
20 PETABYTES IN EACH SEARCH INDEX REBUILD (this was 2 years ago)
900,000 SERVERS
3.2 BILLION LIKES AND COMMENTS PER DAY
OVER HALF A PETABYTE … EVERY 24 HOURS
150 MILLION SENSORS DELIVERING DATA 40 MILLION TIMES PER SECOND
TENS OF PETABYTES PER YEAR
#BDW13 #HADRON
A TYPICAL COMPANY
100 EMPLOYEES
10,000 CUSTOMERS
1 MILLION TRANSACTION RECORDS
5,000 BYTES PER TRANSACTION
25 DATABASES (customers, transactions, etc.)
≈ 4 GIGABYTES (for largest database)
≈ 20 GIGABYTES (for ALL company data)
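The sizing on this slide is plain multiplication. A minimal sketch using only the slide's record count and record size; the variable names are illustrative, and showing both decimal and binary units is an assumption about how the slide's figure was rounded:

```python
# Figures taken from the slide.
TRANSACTION_RECORDS = 1_000_000
BYTES_PER_TRANSACTION = 5_000

# Size of the largest database, in bytes.
largest_db_bytes = TRANSACTION_RECORDS * BYTES_PER_TRANSACTION

# The same size in decimal gigabytes (10**9 bytes) and binary gibibytes (2**30 bytes).
largest_db_gb = largest_db_bytes / 10**9
largest_db_gib = largest_db_bytes / 2**30

print(f"largest database: {largest_db_gb:.1f} GB ({largest_db_gib:.2f} GiB)")
# prints "largest database: 5.0 GB (4.66 GiB)"
```

Whichever unit is used, the largest database comes to a handful of gigabytes, the same order of magnitude as the slide's round figure.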
A TYPICAL HARD DRIVE
2000 GIGABYTES (2TB)
4000 GIGABYTES (4TB)
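To put the drive sizes next to the 20 gigabyte estimate for all company data, here is a short sketch. The helper name `copies_that_fit` is hypothetical, and the calculation ignores filesystem overhead:

```python
COMPANY_DATA_GB = 20            # slide estimate: all company data
DRIVE_SIZES_GB = [2000, 4000]   # slide: typical hard drives (2 TB and 4 TB)


def copies_that_fit(drive_gb: int, data_gb: int) -> int:
    """How many full copies of the data set fit on one drive."""
    return drive_gb // data_gb


for drive_gb in DRIVE_SIZES_GB:
    share = COMPANY_DATA_GB / drive_gb * 100
    copies = copies_that_fit(drive_gb, COMPANY_DATA_GB)
    print(f"{drive_gb} GB drive: {copies} copies, {share:.1f}% of the drive per copy")
```

On these numbers, one copy of the whole company's data occupies about 1% of the smaller drive.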
3. YOU NEED *LOTS* OF DATA SCIENTISTS
#DILBERT #BDW13
4. HOW BIG DATA IS USED
#BDW13
BANKING
COMMUNICATIONS
GOVERNMENT
4. HOW BIG DATA IS USED
#BDW13
WEB LOGS 51%
CLICK STREAM 35%
5. HADOOP GONE BAD
+ SQL
#BDW13 #HADOOPGONEBAD
RDBMS - RELATIONAL DATABASE
NEEDS TO BE PRE-DEFINED
REQUIRES INDEX TO PERFORM
QUERIES ARE CONSTRAINED
#BDW13
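The first two points can be seen with Python's built-in `sqlite3` module. This is a sketch only: the `transactions` table, its columns, and the index name are made up for illustration, and the exact wording of the query plan differs between SQLite versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema has to be declared before any row can be stored.
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

QUERY = "SELECT amount FROM transactions WHERE customer_id = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's description of how it would run QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The last column of each row holds the human-readable plan step.
    return "; ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)

# With an index on the filtered column, SQLite can search instead of scanning.
conn.execute("CREATE INDEX idx_transactions_customer ON transactions (customer_id)")
plan_with_index = query_plan(conn)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Before the index exists, the plan reports a scan of the whole table; afterwards it reports a search using the index.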
MAP REDUCE
PROCESS CLOSE TO THE DATA
PARALLEL EXECUTION
ANY TYPE OF ANALYSIS
HIDES DETAILS OF FAULT TOLERANCE, LOCALITY AND LOAD BALANCING
#MAPREDUCE #BDW13
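The programming model behind these points can be sketched in a few lines. This is a single-machine word count, not Hadoop: the function names are illustrative, a thread pool stands in for the cluster, and the fault tolerance, data locality and load balancing that a real framework hides are not modelled at all:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key."""
    return key, sum(values)


def map_reduce(chunks: List[str]) -> Dict[str, int]:
    # Each chunk is mapped independently, which is what allows parallel execution.
    with ThreadPoolExecutor() as pool:
        mapped = pool.map(map_phase, chunks)
        all_pairs = [pair for result in mapped for pair in result]
    grouped = shuffle(all_pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["big data big myths", "data myths and legends", "big legends"]
counts = map_reduce(chunks)
print(counts)
```

Swapping in different `map_phase` and `reduce_phase` functions changes the analysis without touching the plumbing, which is the sense in which the model supports any type of analysis.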
BIG DATA SCHEMA #NOSQL
HBASE
COLUMNS FILES
#BDW13
(QUICK ASIDE)
#SIDEBAR
GOOGLE FILE SYSTEM (GFS), GOOGLE MAPREDUCE (GMR)
GOOGLE STARTED ALL THIS...
GOOGLE DREMEL
INTERACTIVE ANALYSIS
SCALE UP TO 10,000 SERVERS
COLUMN STORAGE
http://bit.ly/mS8QxX #BDW13
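Why column storage suits interactive analysis can be shown with a toy example. This sketch assumes flat records with made-up fields (`user`, `country`, `clicks`); it does not reproduce Dremel's actual on-disk format, which also has to handle nested data:

```python
from typing import Any, Dict, List

# Row-oriented layout: each record is stored together.
rows: List[Dict[str, Any]] = [
    {"user": "a", "country": "UK", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "UK", "clicks": 7},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Column-oriented layout: all values of one field are stored together."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only reads that field's column,
# rather than touching every field of every record.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # prints 15
```

A query that touches two fields out of hundreds reads only those two columns, which is where most of the speed-up comes from.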
OpenDremel
GOOGLE BIGQUERY
#BDW13
http://research.google.com/archive/spanner.html
GOOGLE SPANNER
#SPANNER #NEWSQL
RELATIONAL DATABASE
GLOBALLY DISTRIBUTED
USES GPS / TRUETIME
NO OPEN SOURCE EQUIVALENT
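The TrueTime idea, as described in the Spanner paper linked on this slide, is that the clock reports an interval rather than a single instant, and a commit waits until its timestamp is certainly in the past. The sketch below is a toy simulation under that reading: `SimulatedTrueTime`, `Interval`, `commit` and the 7 ms uncertainty are all illustrative assumptions, not Google's API, and no GPS or real clock is involved:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy clock that reports time as an interval of width 2 * epsilon."""

    def __init__(self, epsilon: float) -> None:
        self.epsilon = epsilon
        self.true_time = 0.0  # hidden "real" time, advanced manually

    def now(self) -> Interval:
        return Interval(self.true_time - self.epsilon, self.true_time + self.epsilon)

    def advance(self, seconds: float) -> None:
        self.true_time += seconds


def commit(clock: SimulatedTrueTime, tick: float = 0.001) -> Tuple[float, float]:
    """Pick a commit timestamp, then wait until it is certainly in the past."""
    timestamp = clock.now().latest
    waited = 0.0
    # Commit wait: do not report success until earliest > timestamp.
    while clock.now().earliest <= timestamp:
        clock.advance(tick)
        waited += tick
    return timestamp, waited


clock = SimulatedTrueTime(epsilon=0.007)
timestamp, waited = commit(clock)
print(f"timestamp={timestamp:.3f}s waited={waited:.3f}s")
```

The wait comes out at roughly twice the clock uncertainty, which is why tightly synchronised clocks matter for a globally distributed database.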
BIG DATA IS THE NEW OIL
NICK HALSTEAD, FOUNDER
HTTP://DATASIFT.COM
WE ARE HIRING!!
Big Data Week - Myths and Legends