NICK HALSTEAD, FOUNDER
DATASIFT, @NIK
Big Data
“Myths and Legends”
#BDW13
Thursday, 25 April 13
BIG DATA
SOCIAL DATA +
TV MONITORING, POLITICAL TRACKING, FINANCIAL FEEDS
1.5 BILLION ITEMS PER DAY
1.5 PETABYTES OF STORAGE
5000 CPU HADOOP CLUSTER
#DATASIFT
BIG DATA PERCEPTION
#GOOGLE
I THOUGHT I WOULD ASK GOOGLE....
BIG DATA VENDOR “MYTHS”
1. YOU MUST BUY ALL OF THIS (for one job!)
2. HOW BIG IS “BIG”?
20 PETABYTES IN EACH SEARCH INDEX REBUILD (this was 2 years ago)
900,000 SERVERS
3.2 BILLION LIKES AND COMMENTS PER DAY
OVER HALF A PETABYTE … EVERY 24 HOURS
150 MILLION SENSORS DELIVERING DATA 40 MILLION TIMES PER SECOND
TENS OF PETABYTES PER YEAR
#BDW13 #HADRON
A TYPICAL COMPANY
100 EMPLOYEES
10,000 CUSTOMERS
1 MILLION TRANSACTION RECORDS
5,000 BYTES PER TRANSACTION
25 DATABASES (customers, transactions, etc)
=4 GIGABYTES (for largest database)
=20 GIGABYTES (for ALL company data)
A TYPICAL HARD DRIVE
2000 GIGABYTES (2TB)
4000 GIGABYTES (4TB)
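The "typical company" estimate can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted on the slides; the variable names are illustrative. The product comes to 5 × 10^9 bytes, which is 5 GB in decimal units or about 4.66 GiB in binary units, so the slide's "4 gigabytes" reads as a rounded-down figure.

```python
# Back-of-envelope sizing using only the figures quoted on the slides.
TRANSACTION_RECORDS = 1_000_000     # "1 million transaction records"
BYTES_PER_TRANSACTION = 5_000       # "5,000 bytes per transaction"
ALL_COMPANY_DATA_GB = 20            # slide's total for all company data
DRIVE_SIZES_GB = (2_000, 4_000)     # the 2 TB and 4 TB drives on the slide

largest_db_bytes = TRANSACTION_RECORDS * BYTES_PER_TRANSACTION
largest_db_gb = largest_db_bytes / 10**9    # decimal gigabytes
largest_db_gib = largest_db_bytes / 2**30   # binary gibibytes

print(f"largest database: {largest_db_gb:.1f} GB ({largest_db_gib:.2f} GiB)")

# How many copies of ALL the company's data fit on one ordinary drive?
copies_per_drive = {size: size // ALL_COMPANY_DATA_GB for size in DRIVE_SIZES_GB}
for size_gb, copies in copies_per_drive.items():
    print(f"{size_gb} GB drive holds {copies} copies of all company data")
```

On these numbers a single 2 TB drive holds the whole company's data 100 times over.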
3. YOU NEED *LOTS* OF DATA SCIENTISTS
#DILBERT #BDW13
4. HOW BIG DATA IS USED
BANKING
COMMUNICATIONS
GOVERNMENT
WEB LOGS 51%
CLICK STREAM 35%
5. HADOOP GONE BAD
+ SQL
#BDW13 #HADOOPGONEBAD
RDBMS - RELATIONAL DATABASE
SCHEMA NEEDS TO BE PRE-DEFINED
REQUIRES INDEXES TO PERFORM
QUERIES ARE CONSTRAINED
MAP REDUCE
PROCESS CLOSE TO THE DATA
PARALLEL EXECUTION
ANY TYPE OF ANALYSIS
HIDES DETAILS OF FAULT TOLERANCE, LOCALITY AND LOAD BALANCING
#MAPREDUCE #BDW13
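The map / shuffle / reduce pattern named on the slide can be sketched on a single machine. This is only an illustration of the programming model, not Hadoop's or Google's implementation: the function names and the sample text are made up for the example, and a thread pool stands in for a cluster. A real framework would also place each map task next to its data and handle failures and load balancing, which this sketch does not attempt.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Chunks are independent, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


chunks = ["big data myths", "big data legends", "data is the new oil"]
counts = word_count(chunks)
print(counts["data"])  # 3
print(counts["big"])   # 2
```

Swapping in different map and reduce functions changes the analysis without touching the orchestration, which is the sense in which the model supports "any type of analysis".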
BIG DATA SCHEMA #NOSQL
HBASE
COLUMNS FILES
(QUICK ASIDE)
#SIDEBAR
GOOGLE FILE SYSTEM (GFS)
GOOGLE MAPREDUCE (GMR)
GOOGLE STARTED ALL THIS....
GOOGLE DREMEL
INTERACTIVE ANALYSIS
SCALE UP TO 10,000 SERVERS
COLUMN STORAGE
http://bit.ly/mS8QxX #BDW13
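The general idea behind column storage can be shown with a toy example. This is a sketch of the concept only: the records and field names are invented, and Dremel's actual format, which also handles nested data, is considerably more involved. The point is that an aggregate over one field only needs to read that field's column, not every full record.

```python
# Row-oriented layout: one record per entry, all fields stored together.
# The records and field names are hypothetical.
rows = [
    {"user": "a", "country": "UK", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "UK", "clicks": 7},
]


def to_columns(records):
    """Pivot records into one list per field (a column-oriented layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads a single column.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 15

# A filtered aggregate reads two columns and never touches "user".
uk_clicks = sum(
    clicks
    for country, clicks in zip(columns["country"], columns["clicks"])
    if country == "UK"
)
print(uk_clicks)  # 10
```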
OpenDremel
GOOGLE BIGQUERY
http://research.google.com/archive/spanner.html
GOOGLE SPANNER
#SPANNER #NEWSQL
RELATIONAL DATABASE
GLOBALLY DISTRIBUTED
USE GPS / TRUETIME
NO OPEN SOURCE EQUIVALENT
BIG DATA IS THE NEW OIL
NICK HALSTEAD, FOUNDER
HTTP://DATASIFT.COM
WE ARE HIRING!!
Big Data Week - Myths and Legends