NICK HALSTEAD, FOUNDER
DATASIFT, @NIK
Big Data
“Myths and Legends”
#BDW13
Thursday, 25 April 13
BIG DATA + SOCIAL DATA
TV MONITORING, POLITICAL TRACKING, FINANCIAL FEEDS
1.5 BILLION ITEMS PER DAY
1.5 PETABYTES OF STORAGE
5000 CPU HADOOP CLUSTER
#DATASIFT
BIG DATA PERCEPTION
#GOOGLE
I THOUGHT I WOULD ASK GOOGLE....
BIG DATA VENDOR “MYTHS”
1. YOU MUST BUY ALL OF THIS (for one job!)
2. HOW BIG IS “BIG”?
20 PETABYTES IN EACH SEARCH INDEX REBUILD (this was 2 years ago)
900,000 SERVERS
3.2 BILLION LIKES AND COMMENTS PER DAY
OVER HALF A PETABYTE … EVERY 24 HOURS
#HADRON
150 MILLION SENSORS DELIVERING DATA 40 MILLION TIMES PER SECOND
TENS OF PETABYTES PER YEAR
A TYPICAL COMPANY
100 EMPLOYEES
10,000 CUSTOMERS
25 DATABASES (customers, transactions, etc.)
1 MILLION TRANSACTION RECORDS
5,000 BYTES PER TRANSACTION
≈ 5 GIGABYTES (for largest database)
= 20 GIGABYTES (for ALL company data)
A TYPICAL HARD DRIVE
2000 GIGABYTES (2TB)
4000 GIGABYTES (4TB)
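The sizing argument on the company and hard drive slides can be checked in a few lines. This is a rough sketch only: the record count, record size, company total and drive capacity are the figures quoted on the slides, while the helper name `dataset_bytes` and the choice of decimal units (1 GB = 10^9 bytes) are assumptions made for illustration. One million records at 5,000 bytes each comes to 5 × 10^9 bytes, which is 5 GB in decimal units or roughly 4.7 GiB in binary units.

```python
# Back-of-envelope sizing for the "typical company" slide.
# Figures come from the slides; decimal units (1 GB = 10**9 bytes) are assumed.

GB = 10**9
TB = 10**12


def dataset_bytes(records: int, bytes_per_record: int) -> int:
    """Raw size of a table, ignoring indexes and storage overhead."""
    return records * bytes_per_record


largest_db = dataset_bytes(records=1_000_000, bytes_per_record=5_000)
all_company_data = 20 * GB  # total quoted on the slide
drive = 2 * TB              # the smaller "typical" drive

print(f"largest database: {largest_db / GB:.1f} GB")
print(f"all company data fills {all_company_data / drive:.1%} of a 2 TB drive")
```

Under these assumptions, everything the company holds occupies about one percent of a single commodity drive.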
3. YOU NEED *LOTS* OF DATA SCIENTISTS
#DILBERT
4. HOW BIG DATA IS USED
BANKING
COMMUNICATIONS
GOVERNMENT
WEB LOGS 51%
CLICK STREAM 35%
5. HADOOP GONE BAD
+ SQL
#HADOOPGONEBAD
RDBMS - RELATIONAL DATABASE
SCHEMA NEEDS TO BE PRE-DEFINED
REQUIRES INDEXES TO PERFORM
QUERIES ARE CONSTRAINED
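A minimal sketch of the three relational database points, using Python's built-in sqlite3 module. The table, column and index names are hypothetical, and reading "constrained" as "limited to what the declared schema contains" is one interpretation of the slide, not something the deck spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. The schema has to be declared before any row can be stored.
conn.execute(
    "CREATE TABLE transactions "
    "(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT SUM(amount) FROM transactions WHERE customer_id = 42"

# 2. Without an index the lookup has to scan the whole table.
plan_before = plan(query)
conn.execute("CREATE INDEX idx_customer ON transactions (customer_id)")
plan_after = plan(query)

# 3. Queries are limited to the declared columns: an unknown one is rejected.
try:
    conn.execute("SELECT browser FROM transactions")
    rejected = False
except sqlite3.OperationalError:
    rejected = True

print(plan_before)
print(plan_after)
print("unknown column rejected:", rejected)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than relying on a fixed string.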
MAP REDUCE
PROCESS CLOSE TO THE DATA
PARALLEL EXECUTION
ANY TYPE OF ANALYSIS
HIDES DETAILS OF FAULT TOLERANCE, LOCALITY AND LOAD BALANCING
#MAPREDUCE
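The map and reduce steps can be illustrated with a single-machine word count. This is a toy sketch, not Hadoop: the function names are made up for the example, threads stand in for cluster nodes, and the parts a real framework hides (fault tolerance, data locality, load balancing) are not modelled at all.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks: list[str]) -> dict[str, int]:
    # Each chunk is mapped independently, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


chunks = ["big data myths", "big data legends", "big myths"]
counts = map_reduce(chunks)
print(counts)
```

Swapping in different `map_phase` and `reduce_phase` functions changes the analysis without touching the parallel plumbing, which is the sense in which the model supports many kinds of analysis.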
BIG DATA SCHEMA #NOSQL
HBASE
COLUMNS FILES
(QUICK ASIDE)
#SIDEBAR
GOOGLE FILE SYSTEM (GFS), GOOGLE MAPREDUCE (GMR)
GOOGLE STARTED ALL THIS...
GOOGLE DREMEL
INTERACTIVE ANALYSIS
SCALE UP TO 10,000 SERVERS
COLUMN STORAGE
http://bit.ly/mS8QxX
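Why column storage helps interactive analysis can be shown with a small sketch. It is a flat, in-memory simplification with made-up field names and data. Dremel's actual format handles nested records and is considerably more elaborate. The point is only that an aggregate over one field reads that field's column rather than every full record.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user": "a", "country": "UK", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "UK", "clicks": 7},
]


def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot row records into one list per field (a column-oriented layout)."""
    columns: dict[str, list] = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only needs to read that field's column.
total_clicks = sum(columns["clicks"])

# A filtered aggregate touches just the two columns involved.
uk_clicks = sum(
    clicks
    for country, clicks in zip(columns["country"], columns["clicks"])
    if country == "UK"
)

print(total_clicks, uk_clicks)
```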
OpenDremel
GOOGLE BIGQUERY
GOOGLE SPANNER
RELATIONAL DATABASE
GLOBALLY DISTRIBUTED
USES GPS / TRUETIME
NO OPEN SOURCE EQUIVALENT
http://research.google.com/archive/spanner.html
#SPANNER #NEWSQL
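The TrueTime idea can be sketched as a clock that returns an interval instead of a single instant, with a commit waiting until its timestamp has definitely passed. The code below is a simulation loosely modelled on the interface described in the Spanner paper linked above. The class name, the integer millisecond units, the fixed uncertainty bound and the manual `advance` method are all assumptions for illustration, not Google's API.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: int
    latest: int


class SimulatedTrueTime:
    """Toy clock with a fixed uncertainty bound (epsilon), in milliseconds."""

    def __init__(self, epsilon: int, start: int = 0) -> None:
        self.epsilon = epsilon
        self._true_time = start  # hidden "real" time, advanced manually

    def advance(self, ms: int) -> None:
        self._true_time += ms

    def now(self) -> Interval:
        # The caller only learns that real time lies somewhere in this range.
        return Interval(self._true_time - self.epsilon,
                        self._true_time + self.epsilon)

    def after(self, t: int) -> bool:
        # True only once t has definitely passed.
        return self.now().earliest > t


def commit(clock: SimulatedTrueTime, tick: int = 1) -> tuple[int, int]:
    """Pick a commit timestamp, then wait out the uncertainty before returning."""
    timestamp = clock.now().latest
    waited = 0
    while not clock.after(timestamp):
        clock.advance(tick)  # stands in for sleeping
        waited += tick
    return timestamp, waited


clock = SimulatedTrueTime(epsilon=4)
first_ts, first_wait = commit(clock)
second_ts, _ = commit(clock)
print(first_ts, first_wait, second_ts)
```

In this toy model each commit waits a little over twice the uncertainty bound, and any commit that starts after an earlier one has finished receives a strictly larger timestamp. A tighter clock bound means a shorter wait, which is why accurate time sources such as GPS matter to the design.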
BIG DATA IS THE NEW OIL
NICK HALSTEAD, FOUNDER
HTTP://DATASIFT.COM
WE ARE HIRING!!
Big Data Week - Myths and Legends