Big Data
Big Data: hype or necessity?
Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk
Televic R&D meeting - April 25, 2014
Outline
1 Introduction
Big Data?
2 Big Data Technology
Hadoop
Pig, Hive
NoSQL
3 Big Data in my company?
4 Conclusions
Introduction
Big Data?
Exponential growth of data
[Figure: "Big Data: This is just the beginning" (© 2013 International Business Machines Corporation). Data volume in exabytes, 2010 to 2015, shown together with the percentage of uncertain data for sensors & devices, VoIP, enterprise data and social media; a "You are here" marker indicates the current position.]
Examples
Facebook hosts ≈ 10 billion photos ≈ 1 petabyte
Large Hadron Collider: will produce ≈ 15 petabytes per year
Examples
RFID readers
Vehicle GPS traces
Smart energy meters
Examples relevant to Televic
Seattle Children’s Hospital
Google Now
Union Pacific
Automatic rescheduling
Sensors in rails, GPS, RFID in terminals, ...
Weather forecast, ...
Big Data definition
Definition of Big Data depends on who you ask:
Big Data
“Multiple terabytes or petabytes.”
(according to some professionals)
“I don’t know.”
(today’s big may be tomorrow’s normal)
“Relative to its context.”
Quotes on Big Data
“Big data” is a subjective label attached to situations in
which human and technical infrastructures are unable to
keep pace with a company’s data needs.
It’s about recognizing that for some problems other
storage solutions are better suited.
The Three V’s
Volume: The amount of data is big.
Variety: Different kinds of data:
structured
semi-structured
unstructured
Velocity: Speed issues to consider:
How fast is the data available for analysis?
How fast can we do something with it?
Other V’s: Veracity, Variability, Validity, Value, ...
Structured data
Structured data
Pre-defined schema imposed on the data
Highly structured
Usually stored in a relational database system
Example
numbers: 20, 3.1415, ...
dates: 21/03/1978
strings: "Hello World"
...
Roughly 20% of all data out there is structured.
Semi-structured data
Semi-structured data
Inconsistent structure.
Cannot be stored in rows and tables in a typical database.
Information is often self-describing (label/value pairs).
Example
XML, SGML,. . .
BibTeX files
logs
tweets
sensor feeds
. . .
Semi-structured data: examples
Example
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer’s Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
</catalog>
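Because the labels travel with the values, such a document can be read without a schema defined in advance. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and the catalog above pasted in as a string (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

CATALOG = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
</catalog>"""

root = ET.fromstring(CATALOG)

# Each child tag acts as a label and its text as the value, so the field
# names are discovered from the data itself rather than from a fixed schema.
books = [
    {"id": book.get("id"), **{child.tag: child.text for child in book}}
    for book in root.findall("book")
]
print(books)
```

Note that every value comes back as text; turning the price into a number is left to the application.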
Unstructured data
Definition (Unstructured data)
Lacks structure or parts of it lack structure.
Example
multimedia: videos, photos,
audio files,. . .
email messages
free-form text
word processing documents
presentations
reports
. . .
Experts estimate that 80 to 90% of the data in any organization is unstructured.
Data Storage and Analysis
Storage capacity of hard drives has increased massively over
the years.
Access speeds have not kept up.
Example (Reading a whole disk)
Year | Storage Capacity | Transfer Speed | Time to read
1990 | 1370 MB          | 4.4 MB/s       | ≈ 5 minutes
2010 | 1 TB             | 100 MB/s       | > 2.5 hours
Solution: work in parallel!
Using 100 drives (each holding 1/100th of the data),
reading 1 TB takes less than 2 minutes.
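The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads with no coordination overhead; the function name is illustrative only.

```python
def read_time_seconds(capacity_mb, speed_mb_per_s, drives=1):
    """Time to read a dataset split evenly over `drives` disks read in parallel."""
    return capacity_mb / drives / speed_mb_per_s

t_1990 = read_time_seconds(1370, 4.4)
t_2010 = read_time_seconds(1_000_000, 100)
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 drive:            {t_1990 / 60:.1f} minutes")
print(f"2010 drive:            {t_2010 / 3600:.2f} hours")
print(f"2010, 100 drives:      {t_parallel / 60:.2f} minutes")
```

In practice the parallel figure is a lower bound: it ignores network transfer and the cost of combining the partial results.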
Working in parallel
Problems
1 Hardware failure?
2 Combining data from different disks for analysis?
Solutions
1 HDFS: Hadoop Distributed Filesystem
2 MapReduce: programming model
Big Data Technology
Big Data Landscape
Hadoop
Hadoop is VMware, but the other way around.
Hadoop as the opposite of a virtual machine
VMware
1 take one physical server
2 split it up
3 get many small virtual servers
Hadoop
1 take many physical servers
2 merge them all together
3 get one big, massive, virtual server
Hadoop: core functionality
HDFS: Self-healing, high-bandwidth, clustered storage.
MapReduce: Distributed, fault-tolerant resource management, coupled with scalable data processing.
HDFS architecture
MapReduce
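The programming model can be illustrated with the classic word count. This is a single-process sketch in plain Python, not the Hadoop API: the function names are illustrative, and the "shuffle" step stands in for the grouping that the framework performs between the two phases on a real cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data big hype", "data is big"])
print(counts)
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` at one key only, both phases can be spread over many machines.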
Hadoop: applications
Example Hadoop stack:
→ Hadoop distributions
Example Hadoop distributions
Hadoop vs RDBMS
Relational Database Management Systems (RDBMS):
Very fast to max speed!
some queries → msecs
other queries → hours, days
use when
latency is important
ACID transactions (banking, ...)
100% SQL compliance
Unstructured data → BLOB :-(
Hadoop vs RDBMS
Hadoop:
Slower to (higher) max speed...
some queries → seconds, minutes
other queries → seconds!!!
Use when:
throughput important
scalability of storage/compute
(un|semi)structured data
complex data processing (NoSQL, Java, C, Python, ...)
Pig, Hive
Apache Hadoop essentials: technology stack
Pig
MapReduce requires programmers to
think in terms of map and reduce functions,
and, more than likely, to use the Java language.
Pig provides a high-level language (Pig Latin) that can be used by
Analysts
Data Scientists
Statisticians
Etc. . .
Pig Latin
Pig Latin
Originally from Yahoo! to allow analysts to access data.
Dataflow language.
Makes it simpler to write MapReduce programs.
Abstracts you from specific details
→ focus on data processing.
Has User Defined Functions (UDFs).
Compiles script into a set of MapReduce jobs.
Pig example
Dataflow:
Load users → Filter by age
Load pages
Join on name → Group on URL → Count clicks → Order by clicks → Take top 5
Input data
file with user data
file with website data
Your task
Find the top 5 most visited
pages by users aged 18-25.
In MapReduce
. . . 170 lines of Java MapReduce code . . .
In Pig Latin
Example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Only 9 lines of Pig Latin.
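For readers unfamiliar with Pig Latin, the same dataflow can be sketched in plain Python on in-memory data. The sample users and pages are hypothetical, the function name is illustrative, and user names are assumed to be unique so that the join reduces to a membership test.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
]

def top_pages(users, pages, low=18, high=25, n=5):
    # filter Users by age
    selected = {name for name, age in users if low <= age <= high}
    # join on name, group by url, count clicks
    clicks = Counter(url for user, url in pages if user in selected)
    # order by clicks desc, limit n
    return clicks.most_common(n)

top5 = top_pages(users, pages)
print(top5)
```

Unlike the Pig script, this version runs on one machine and holds everything in memory; Pig compiles the same steps into MapReduce jobs.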
Hive
Originated at Facebook to analyze log data.
HiveQL: Hive Query Language, similar to standard SQL.
Queries are compiled into MapReduce jobs.
Has command-line shell, similar to e.g. MySQL shell.
Hive: example
Example (Create table to hold weather data)
CREATE TABLE records (year STRING,
                      temperature INT,
                      quality INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';
Example (Populate Hive with the data)
LOAD DATA LOCAL INPATH 'input/sample.txt'
OVERWRITE INTO TABLE records;
Hive: example
Example (Run query)
hive> SELECT year, MAX(temperature)
> FROM records
> WHERE temperature != 9999
> AND (quality = 0 OR quality = 1)
> GROUP BY year;
1949 111
1950 22
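Since HiveQL is close to standard SQL, the query can be tried out without a Hadoop cluster. The sketch below uses SQLite through Python's `sqlite3` module instead of Hive; the sample rows are hypothetical, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

# Hypothetical sample rows (year, temperature, quality);
# 9999 marks a missing reading, quality 5 a reading to be ignored.
rows = [
    ("1950", 0, 1), ("1950", 22, 1), ("1950", -11, 1),
    ("1949", 111, 1), ("1949", 78, 1), ("1949", 9999, 1), ("1950", 300, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

result = conn.execute(
    """
    SELECT year, MAX(temperature)
    FROM records
    WHERE temperature != 9999
      AND (quality = 0 OR quality = 1)
    GROUP BY year
    ORDER BY year
    """
).fetchall()
conn.close()
print(result)
```

The difference lies in execution: SQLite answers from a single local file or memory, whereas Hive compiles the query into MapReduce jobs over data in HDFS.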
NoSQL
RDBMS: Codd’s 12 rules
Codd’s 12 rules
A set of rules designed to define what is required from a database
management system in order for it to be considered relational.
Rule 0 The Foundation rule
Rule 1 The Information rule
Rule 2 The guaranteed access rule
Rule 3 Systematic treatment of null values
Rule 4 Active online catalog based on the relational model
. . . . . .
ACID
ACID
A set of properties that guarantee that database transactions are
processed reliably.
Atomicity: A transaction is all or nothing.
Consistency: Only transactions with valid data.
Isolation: Simultaneous transactions will not interfere.
Durability: Written transaction data stays there “forever” (even in case of power loss, crashes, errors, ...).
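Atomicity and consistency can be seen at work in a small sketch. It assumes SQLite via Python's `sqlite3` module; the `accounts` table, the names and the amounts are hypothetical. A `CHECK` constraint plays the role of the "valid data" rule, and a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to the destination account has already been executed when the debit violates the constraint; the rollback undoes it, which is the all-or-nothing behaviour described above.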
Scaling up
What if you need to scale up your RDBMS in terms of
dataset size,
read/write concurrency?
This usually involves
breaking Codd’s rules,
loosening ACID restrictions,
forgetting conventional DBA wisdom,
losing most of the desirable properties that made RDBMS so convenient in the first place.
NoSQL to the rescue!
NoSQL
NoSQL
‘Invented’ by Carlo Strozzi in 1998 (for his file-based database)
“Not only SQL”
It’s NOT about
saying that SQL should never be used,
saying that SQL is dead.
NoSQL databases
Four emerging NoSQL categories: key-value stores, column family based databases, document databases and graph databases.
Key-Value stores or ‘the big hash table’
Keys | Values
13a1 | Nexus 32 GB
13a2 | Nexus 16 GB
13a3 | Nexus 08 GB
Most basic type of NoSQL databases.
Aggregation of key-value pairs.
Typically only 4 operations:
create(key, value)
read(key)
update(key, value)
delete(key)
Fast, scalable, less complex.
Mainly used for systems with simple queries (caches, etc.)
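The four operations map directly onto a hash table. A minimal in-memory sketch, with no persistence, distribution or replication; the class name and the updated value are illustrative only:

```python
class KeyValueStore:
    """In-memory key-value store exposing only the four basic operations."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def read(self, key):
        return self._data[key]

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]

store = KeyValueStore()
store.create("13a1", "Nexus 32 GB")
store.create("13a2", "Nexus 16 GB")
store.update("13a2", "Nexus 64 GB")  # hypothetical new value
store.delete("13a1")
print(store.read("13a2"))
```

There is deliberately no way to search by value: every access goes through the key, which is what keeps such stores fast and easy to scale.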
Column-oriented DBMS
Example
Id LastName FirstName Salary
10 Smith Joe 40000
12 Jones Mary 50000
11 Johnson Cathy 44000
22 Jones Bob 55000
Row-based:
10,Smith,Joe,40000;12,Jones,Mary,50000;11,Johnson,Cathy,44000;22,Jones,Bob,55000
Column-based:
10,12,11,22;Smith,Jones,Johnson,Jones;Joe,Mary,Cathy,Bob;40000,50000,44000,55000
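Both layouts can be derived from the same table, which the following sketch does: the column-based form is the row-based form applied to the transposed data. The helper name is illustrative.

```python
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

def serialize(records):
    """Join the fields of each record with ',' and the records with ';'."""
    return ";".join(",".join(str(field) for field in record) for record in records)

row_based = serialize(rows)
column_based = serialize(zip(*rows))  # transpose: one record per column

# An aggregate over one column only needs that column's values.
salaries = list(zip(*rows))[3]
average_salary = sum(salaries) / len(salaries)

print(row_based)
print(column_based)
print(average_salary)
```

In the column-based layout the four salaries sit next to each other, so an aggregate such as the average reads one contiguous run instead of skipping over names in every row.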
Column family based databases
Like column-oriented DBMS, but with a twist
Columns and supercolumns ≈ RDBMS table columns
Family of columns ≈ RDBMS table
Keyspace ≈ RDBMS database
Column family based databases
Most complex NoSQL database type.
Based on Google’s BigTable paper.
More flexibility than traditional RDBMS:
adding (super)columns is always possible.
Excellent for analysis and mass treatment of data (via MapReduce-type operations)
Document databases
Data is stored as a collection of documents
(JSON, XML, ... but also PDF, Excel, ...)
Documents → collection of key-value pairs
Values can be
simple values
arrays
another document (collection of key-values)
Schemaless
Quite well queryable
Document databases
Example (Document 1)
{
  FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing"
}
Example (Document 2)
{
  FirstName: "Jonathan",
  Address: "15 Wanamassa Road",
  Children: [
    {Name: "Michael", Age: 10},
    {Name: "Jennifer", Age: 8},
    {Name: "Samantha", Age: 5},
    {Name: "Elena", Age: 2}
  ]
}
Best suited for custom queries like the ones in RDBMS.
Quite popular for Content Management Systems.
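What "schemaless but queryable" means can be sketched with the two documents above held as Python dictionaries. The `find` helper is hypothetical and stands in for the query facility of a real document database; it is not the API of any particular product.

```python
documents = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {
        "FirstName": "Jonathan",
        "Address": "15 Wanamassa Road",
        "Children": [
            {"Name": "Michael", "Age": 10},
            {"Name": "Jennifer", "Age": 8},
            {"Name": "Samantha", "Age": 5},
            {"Name": "Elena", "Age": 2},
        ],
    },
]

def find(collection, predicate):
    """Return the documents for which `predicate` holds."""
    return [doc for doc in collection if predicate(doc)]

# Documents need not share fields, so absent keys are read with a default.
parents_of_under_sixes = find(
    documents,
    lambda doc: any(child["Age"] < 6 for child in doc.get("Children", [])),
)
sailors = find(documents, lambda doc: doc.get("Hobby") == "sailing")

print([doc["FirstName"] for doc in parents_of_under_sixes])
print([doc["FirstName"] for doc in sailors])
```

The queries reach into nested values (the children's ages) even though only one of the two documents has a `Children` field at all.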
Document databases: examples
Graph databases
[Figure: example graph with the nodes Julie, Steve, Rock Music, Bob, BMW, Fido, Jim and IBM, connected by edges labelled Sister-in-Law To, Listens To, Married To, Brother Of, Drives, Works For, Colleague Of and Has Pet.]
Based on graph theory.
Employ nodes (objects) and edges (relations between objects).
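The node-and-edge model can be sketched with a list of labelled edges. The node and relation names are taken from the example graph, but which nodes each edge connects is an assumption made here for illustration, as are the two helper functions.

```python
from collections import defaultdict

# Hypothetical edges: names come from the example graph, the exact
# connections between them are assumed.
edges = [
    ("Steve", "Married To", "Julie"),
    ("Steve", "Listens To", "Rock Music"),
    ("Bob", "Listens To", "Rock Music"),
    ("Bob", "Drives", "BMW"),
    ("Bob", "Has Pet", "Fido"),
    ("Bob", "Works For", "IBM"),
    ("Jim", "Works For", "IBM"),
]

# Index the edges by their source node for fast traversal.
outgoing = defaultdict(list)
for source, relation, target in edges:
    outgoing[source].append((relation, target))

def related(node, relation):
    """Targets reachable from `node` over one edge with the given relation."""
    return [target for rel, target in outgoing[node] if rel == relation]

def sources_of(relation, target):
    """Nodes that have an edge with the given relation pointing at `target`."""
    return sorted(src for src, rel, tgt in edges if rel == relation and tgt == target)

print(related("Bob", "Drives"))
print(sources_of("Works For", "IBM"))
```

Questions such as "who works for the same company as Bob?" become short traversals over edges rather than joins over tables.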
Graph databases: examples
Well-suited for problems with network-structure:
mine data from social media
“customers who bought this also looked at. . . ”
relations between persons
healthcare ontologies ???
. . .
Use the right tool for the right job!
http://db-engines.com/
Big Data in my company?
Typical RDBMS scaling story
1. Initial Public Launch
From local workstation → remotely hosted MySQL instance.
2. Service popularity ↑, too many reads hitting the database
Add memcached to cache common queries. Reads are now no
longer strictly ACID; cached data must expire.
3. Popularity ↑↑, too many writes hitting the database
Scale MySQL vertically by buying a beefed-up server:
16 cores
128 GB of RAM
banks of 15k RPM hard drives
→ Costly
4. New features → query complexity ↑, now too many joins
Denormalize your data to reduce joins.
(That’s not what they taught me in DBA school!)
5. Rising popularity swamps the server; things are too slow
Stop doing any server-side computations.
6. Some queries are still too slow
Periodically prematerialize the most complex queries, and try to
stop joining in most cases.
7. Reads are OK, writes are getting slower and slower. . .
Drop secondary indexes and triggers (no indexes?).
If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase.
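Step 2 of the scaling story, caching common queries with expiry, can be sketched as follows. This is an in-process stand-in for memcached, not its client API; the class, the `query_database` function and the 60-second lifetime are hypothetical, and the clock is injectable so that expiry can be shown without waiting.

```python
import time

class TTLCache:
    """Cache-aside helper: entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: may be stale relative to the database
        value = loader(key)  # cache miss or expired entry: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

# Hypothetical stand-in for a slow database query; records each call.
calls = []
def query_database(key):
    calls.append(key)
    return f"result for {key}"

fake_now = [0.0]
cache = TTLCache(ttl=60, clock=lambda: fake_now[0])
first = cache.get_or_load("top-products", query_database)
second = cache.get_or_load("top-products", query_database)  # served from the cache
fake_now[0] = 61.0
third = cache.get_or_load("top-products", query_database)   # expired, reloaded
print(len(calls))
```

Between a write to the database and the expiry of the cached entry, readers see the old value, which is why reads are "no longer strictly ACID" once such a cache is added.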
Two types of companies (personal view)
‘Core Big Data’ company
Core business = big data processing, crunching, analyzing,. . .
Example
Google, Facebook,. . .
Smart metering companies
Video/Image processing companies
Biotech companies with sequencing data
Lots of healthcare data???
. . .
Two types of companies (personal view)
‘General Big Data’ company
Some other core business.
Lots of useful data is available.
Desirable: business analytics, process optimization,. . .
Example
Supermarkets → customer cards
Transport firms → GPS-traces
Hospitals → patient and medical info???
. . .
Use-cases of Big Data
‘Core Big Data’ company
Big Data
crunching,
hacking,
processing,
analyzing,
. . .
‘General Big Data’ company
Business Analytics
improve decision-making,
gain operational insights,
increase overall performance,
track and analyze shopping patterns,
. . .
Both
Explore! Discover hidden gems!
Some examples
IBM: predict heart disease long before it strikes.
Predict and stop the spread of infectious disease.
Some examples
How to predict wine quality?
Skip tasting! Use science!
Weather seems the key variable.
Correlate historical weather & wine data.
Reduce fuel cost and improve driver safety by analyzing geolocation data.
Big Data in your company
Big data is typically a division of the IT department.
Requires skilled people:
sysadmins
software developers
data-scientists
visualization experts
. . .
Advice, trend (Andrew McAfee)
Give geeks a seat at the decision-making table.
IWT TETRA project
Data mining: van relationele database naar Big Data (Data mining: from relational database to Big Data).
Dates
Submitted: 12/03/2014
Notification of acceptance: July, 2014
Runs from 01/10/2014 – 01/10/2016
People involved
Wannes De Smet (researcher)
Bart Vandewoestyne (researcher)
Johan De Gelas (project coordinator)
Thanks for being an interested project partner :-)
Conclusions
“Big” can be small too.
The Big Data landscape is huge.
RDBMS and SQL are not dead.
The right tool for the right job!
Your company can benefit from Big Data technology.
We can help.
Be brave in your quest. . .
Questions?
bart@sizingservers.be
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
knowledge representation in artificial intelligence
knowledge representation in artificial intelligenceknowledge representation in artificial intelligence
knowledge representation in artificial intelligence
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 

Big Data: hype or necessity?

  • 1. Big Data: hype or necessity? Dr. ir. ing. Bart Vandewoestyne, Sizing Servers Lab, Howest, Kortrijk. Televic R&D meeting, April 25, 2014.
  • 2. Outline: 1. Introduction (Big Data?); 2. Big Data Technology (Hadoop; Pig, Hive; NoSQL); 3. Big Data in my company?; 4. Conclusions.
  • 3. Outline, section 1: Introduction (Big Data?).
  • 4. Exponential growth of data. [Figure, © 2013 International Business Machines Corporation: "Big Data: This is just the beginning". Volume in exabytes (axis from 3000 to 9000) and percentage of uncertain data (axis from 0 to 100), 2010 to 2015, with a "You are here" marker. Data sources shown: sensors & devices, VoIP, enterprise data, social media.]
  • 5. Examples: Facebook hosts ≈ 10 billion photos (≈ 1 petabyte). The Large Hadron Collider will produce ≈ 15 petabytes per year.
  • 6. Examples: RFID readers; vehicle GPS traces; smart energy meters.
  • 7. Examples relevant to Televic: Seattle Children’s Hospital; Google Now; Union Pacific (automatic rescheduling; sensors in rails, GPS, RFID in terminals, ...; weather forecast, ...).
  • 8. Big Data definition. The definition of Big Data depends on who you ask: “Multiple terabytes or petabytes.” (according to some professionals); “I don’t know.” (today’s big may be tomorrow’s normal); “Relative to its context.”
  • 9. Quotes on Big Data. “Big data” is a subjective label attached to situations in which human and technical infrastructures are unable to keep pace with a company’s data needs. It’s about recognizing that for some problems other storage solutions are better suited.
  • 10. The Three V’s. Volume: the amount of data is big. Variety: different kinds of data (structured, semi-structured, unstructured). Velocity: speed issues to consider (How fast is the data available for analysis? How fast can we do something with it?). Other V’s: Veracity, Variability, Validity, Value, ...
  • 11. Structured data: a pre-defined schema is imposed on the data; highly structured; usually stored in a relational database system. Examples: numbers (20, 3.1415, ...); dates (21/03/1978); strings ("Hello World"); ... Roughly 20% of all data out there is structured.
  • 12. Semi-structured data: inconsistent structure; cannot be stored in rows and tables in a typical database; information is often self-describing (label/value pairs). Examples: XML, SGML, ...; BibTeX files; logs; tweets; sensor feeds; ...
  • 13. Semi-structured data, example:
        <?xml version="1.0"?>
        <catalog>
          <book id="bk101">
            <author>Gambardella, Matthew</author>
            <title>XML Developer's Guide</title>
            <genre>Computer</genre>
            <price>44.95</price>
          </book>
        </catalog>
  • 14. Unstructured data. Definition: data that lacks structure, or parts of which lack structure. Examples: multimedia (videos, photos, audio files, ...); email messages; free-form text; word processing documents; presentations; reports; ... Experts estimate that 80 to 90% of the data in any organization is unstructured.
  • 15. Data Storage and Analysis. Storage capacity of hard drives has increased massively over the years; access speeds have not kept up. Example (reading a whole disk):
        Year   Storage capacity   Transfer speed   Time to read
        1990   1370 MB            4.4 MB/s         ≈ 5 minutes
        2010   1 TB               100 MB/s         > 2.5 hours
    Solution: work in parallel! Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes.
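The read times on slide 15 follow from dividing capacity by transfer speed. The sketch below redoes that arithmetic; the function name is illustrative, and decimal units (1 TB = 1,000,000 MB) are an assumption, since the slide does not say which convention it uses.

```python
# Figures are taken from slide 15; decimal units (1 TB = 1,000,000 MB) are assumed.
def read_time_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Time to read everything when the data is split evenly over `drives` disks read in parallel."""
    return capacity_mb / drives / speed_mb_per_s


t_1990 = read_time_seconds(1370, 4.4)
t_2010 = read_time_seconds(1_000_000, 100)
t_parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 disk: {t_1990 / 60:.1f} minutes")
print(f"2010 disk: {t_2010 / 3600:.1f} hours")
print(f"2010 data over 100 drives: {t_parallel:.0f} seconds")
```

The parallel case ignores coordination overhead and assumes every drive sustains the full 100 MB/s, so it is a best-case estimate.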
  • 16. Working in parallel. Problems: (1) hardware failure? (2) combining data from different disks for analysis? Solutions: (1) HDFS, the Hadoop Distributed Filesystem; (2) MapReduce, a programming model.
  • 17. Outline, section 2: Big Data Technology (Hadoop; Pig, Hive; NoSQL).
  • 18. Big Data Landscape.
  • 19. Hadoop: Hadoop is VMware, but the other way around.
  • 20. Hadoop as the opposite of a virtual machine. VMware: (1) take one physical server, (2) split it up, (3) get many small virtual servers. Hadoop: (1) take many physical servers, (2) merge them all together, (3) get one big, massive, virtual server.
  • 21. Hadoop: core functionality. HDFS: self-healing, high-bandwidth, clustered storage. MapReduce: distributed, fault-tolerant resource management, coupled with scalable data processing.
  • 22. HDFS architecture.
  • 23. MapReduce.
  • 24. MapReduce.
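The two MapReduce slides carry no text in this transcript. As a rough illustration of the programming model they name, here is a single-process word count in plain Python. It does not use Hadoop, the function names are illustrative, and a real job would run the map and reduce steps on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be spread over many machines without changing the functions themselves.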
  • 25. Hadoop: applications. Example Hadoop stack → Hadoop distributions.
  • 26. Example Hadoop distributions.
  • 27. Hadoop vs RDBMS. Relational Database Management Systems (RDBMS): very fast to max speed! Some queries → msecs; other queries → hours, days. Use when latency is important; ACID transactions (banking, ...); 100% SQL compliance. Unstructured data → BLOB :-(
  • 28. Hadoop vs RDBMS. Hadoop: slower to (higher) max speed... Some queries → seconds, minutes; other queries → seconds!!! Use when: throughput is important; scalability of storage/compute is needed; (un|semi)structured data; complex data processing (NoSQL, Java, C, Python, ...).
  • 29. Apache Hadoop essentials: technology stack.
  • 30. Pig. MapReduce requires programmers to think in terms of map and reduce functions, and more than likely to use the Java language. Pig provides a high-level language (Pig Latin) that can be used by analysts, data scientists, statisticians, etc.
  • 31. Pig Latin: originally from Yahoo!, to allow analysts to access data; a dataflow language; makes it simpler to write MapReduce programs; abstracts you from specific details → focus on data processing; has User Defined Functions (UDFs); compiles a script into a set of MapReduce jobs.
  • 32. Pig example. Input: a file with user data and a file with website data. Your task: find the top 5 most visited pages by users aged 18-25. Dataflow: load users and filter by age; load pages; join on name; group on URL; count clicks; order by clicks; take top 5.
  • 33. In MapReduce: ... 170 lines of Java MapReduce code ...
  • 34. In Pig Latin. Example:
        Users = load 'users' as (name, age);
        Fltrd = filter Users by age >= 18 and age <= 25;
        Pages = load 'pages' as (user, url);
        Jnd   = join Fltrd by name, Pages by user;
        Grpd  = group Jnd by url;
        Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
        Srtd  = order Smmd by clicks desc;
        Top5  = limit Srtd 5;
        store Top5 into 'top5sites';
    Only 9 lines of Pig Latin.
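To make the dataflow of the Pig Latin script concrete, the same steps (filter, join, group, count, order, limit) can be sketched on small in-memory lists. The sample users and pages are made up for illustration, and the sketch assumes user names are unique; it shows the logic only, not how Pig executes it.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical sample data: (name, age) and (user, url).
users = [("alice", 22), ("bob", 30), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
]


def top_pages(users: List[Tuple[str, int]], pages: List[Tuple[str, str]],
              n: int = 5) -> List[Tuple[str, int]]:
    # filter Users by age >= 18 and age <= 25
    eligible = {name for name, age in users if 18 <= age <= 25}
    # join Fltrd by name, Pages by user (names assumed unique)
    joined_urls = [url for user, url in pages if user in eligible]
    # group by url and count clicks
    clicks = Counter(joined_urls)
    # order by clicks desc, then limit n
    return clicks.most_common(n)


top5 = top_pages(users, pages)
print(top5)
```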
  • 35. Hive: originated at Facebook to analyze log data; HiveQL (Hive Query Language) is similar to standard SQL; queries are compiled into MapReduce jobs; has a command-line shell, similar to e.g. the MySQL shell.
  • 36. Hive: example.
    Example (create table to hold weather data):
        CREATE TABLE records (year STRING, temperature INT, quality INT)
        ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t';
    Example (populate Hive with the data):
        LOAD DATA LOCAL INPATH 'input/sample.txt'
        OVERWRITE INTO TABLE records;
  • 37. Hive: example (run query):
        hive> SELECT year, MAX(temperature)
            > FROM records
            > WHERE temperature != 9999
            > AND (quality = 0 OR quality = 1)
            > GROUP BY year;
        1949    111
        1950    22
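Because HiveQL is close to standard SQL, the logic of the query on slide 37 can be tried without a Hadoop cluster. The sketch below uses Python's built-in `sqlite3` module instead of Hive. The rows are hypothetical, chosen so that the result matches the output on the slide, and the added `ORDER BY` only makes the row order deterministic.

```python
import sqlite3

# Hypothetical rows (year, temperature, quality). 9999 marks a missing reading
# and quality 9 a rejected one, so the WHERE clause should drop both.
rows = [
    ("1950", 0, 1), ("1950", 22, 1), ("1950", -11, 1), ("1950", 9999, 1),
    ("1949", 111, 1), ("1949", 78, 1), ("1949", 120, 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

max_per_year = conn.execute(
    "SELECT year, MAX(temperature) FROM records "
    "WHERE temperature != 9999 AND (quality = 0 OR quality = 1) "
    "GROUP BY year ORDER BY year"
).fetchall()
conn.close()

for year, temperature in max_per_year:
    print(year, temperature)
```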
  • 38. NoSQL.
  • 39. RDBMS: Codd’s 12 rules. A set of rules designed to define what is required from a database management system in order for it to be considered relational. Rule 0: the foundation rule. Rule 1: the information rule. Rule 2: the guaranteed access rule. Rule 3: systematic treatment of null values. Rule 4: active online catalog based on the relational model. ...
  • 40. ACID: a set of properties that guarantee that database transactions are processed reliably. Atomicity: a transaction is all or nothing. Consistency: only transactions that leave the data valid are committed. Isolation: simultaneous transactions will not interfere. Durability: written transaction data stays there “forever” (even in case of power loss, crashes, errors, ...).
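Atomicity and consistency from the ACID slide can be observed with a small bank-transfer sketch. It uses Python's built-in `sqlite3` module; the `accounts` table, the names and the amounts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failed, succeeded, after_success)
```

In the failing transfer the credit to the receiver is executed first, yet it is undone when the debit violates the constraint, which is the all-or-nothing behaviour the slide describes.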
  • 41. Scaling up. What if you need to scale up your RDBMS in terms of dataset size or read/write concurrency? This usually involves breaking Codd’s rules, loosening ACID restrictions, forgetting conventional DBA wisdom, and losing most of the desirable properties that made RDBMS so convenient in the first place. NoSQL to the rescue!
  • 42. NoSQL: ‘invented’ by Carlo Strozzi in 1998 (for his file-based database); “Not only SQL”. It’s NOT about saying that SQL should never be used, or saying that SQL is dead.
  • 43. NoSQL databases. Four emerging NoSQL categories: key-value stores, column family based databases, document databases, graph databases.
  • 44. Key-Value stores or ‘the big hash table’. Example keys → values: 13a1 → Nexus 32 GB; 13a2 → Nexus 16 GB; 13a3 → Nexus 08 GB. The most basic type of NoSQL database: an aggregation of key-value pairs. Typically only 4 operations: create(key, value), read(key), update(key, value), delete(key). Fast, scalable, less complex. Mainly used for systems with simple queries (caches etc.).
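The four operations listed on slide 44 map directly onto a hash table. A minimal in-memory sketch follows; the class name is illustrative, the error behaviour is a design choice not taken from the slide, and the pairing of keys with values assumes they are listed in the same order on the slide.

```python
from typing import Any, Dict


class KeyValueStore:
    """In-memory key-value store with the four operations named on the slide."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def create(self, key: str, value: Any) -> None:
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def read(self, key: str) -> Any:
        return self._data[key]  # raises KeyError if the key is absent

    def update(self, key: str, value: Any) -> None:
        if key not in self._data:
            raise KeyError(f"no such key: {key}")
        self._data[key] = value

    def delete(self, key: str) -> None:
        del self._data[key]  # raises KeyError if the key is absent


store = KeyValueStore()
store.create("13a1", "Nexus 32 GB")
store.create("13a2", "Nexus 16 GB")
store.create("13a3", "Nexus 08 GB")
store.update("13a3", "Nexus 8 GB")
store.delete("13a2")

try:
    store.read("13a2")
    deleted = False
except KeyError:
    deleted = True

print(store.read("13a1"), store.read("13a3"), deleted)
```

There is no way to query by value, which is why such stores suit lookups by key (caches, sessions) rather than ad hoc queries.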
  • 45. Key-Value stores or ‘the big hash table’.
  • 46. Column-oriented DBMS. Example:
        Id   LastName   FirstName   Salary
        10   Smith      Joe         40000
        12   Jones      Mary        50000
        11   Johnson    Cathy       44000
        22   Jones      Bob         55000
    Row-based: 10,Smith,Joe,40000;12,Jones,Mary,50000;11,Johnson,Cathy,44000;22,Jones,Bob,55000
    Column-based: 10,12,11,22;Smith,Jones,Johnson,Jones;Joe,Mary,Cathy,Bob;40000,50000,44000,55000
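The two layouts on slide 46 can be produced from the same table with a few lines of Python. This is only a sketch of the serialization idea, using the slide's four rows; variable names are illustrative.

```python
fields = ("Id", "LastName", "FirstName", "Salary")
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

# Pivot the rows into one list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(fields)}

row_based = ";".join(",".join(str(value) for value in row) for row in rows)
column_based = ";".join(",".join(str(value) for value in col) for col in columns.values())

# An aggregate over one column touches one contiguous list in the column layout.
average_salary = sum(columns["Salary"]) / len(columns["Salary"])

print(row_based)
print(column_based)
print(average_salary)
```

The column layout keeps all salaries together, which is why analytical queries over a few columns tend to favour it, while the row layout keeps each record together for record-at-a-time access.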
  • 47. Column family based databases: like column-oriented DBMS, but with a twist. Columns and supercolumns ≈ RDBMS table columns; family of columns ≈ RDBMS table; keyspace ≈ RDBMS database.
  • 48. Column family based databases: the most complex NoSQL database type; based on Google’s BigTable paper; more flexibility than a traditional RDBMS (adding (super)columns is always possible); excellent for analysis and mass treatment of data (via MapReduce-type operations).
  • 49. Document databases. Data is stored as a collection of documents (JSON, XML, ... but also PDF, Excel, ...). Documents → collection of key-value pairs. Values can be simple values, arrays, or another document (collection of key-values). Schemaless. Quite well queryable.
  • 50. Document databases.
    Example (Document 1):
        { FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing" }
    Example (Document 2):
        { FirstName: "Jonathan", Address: "15 Wanamassa Road",
          Children: [ {Name: "Michael", Age: 10}, {Name: "Jennifer", Age: 8},
                      {Name: "Samantha", Age: 5}, {Name: "Elena", Age: 2} ] }
    Best suited for custom queries like the ones in RDBMS. Quite popular for Content Management Systems.
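The two documents on slide 50 have different fields, which is what "schemaless" means in practice. A minimal sketch with Python dictionaries follows; the `find` helper and the query are illustrative and do not correspond to any particular document database's API.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]

documents: List[Document] = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {
        "FirstName": "Jonathan",
        "Address": "15 Wanamassa Road",
        "Children": [
            {"Name": "Michael", "Age": 10},
            {"Name": "Jennifer", "Age": 8},
            {"Name": "Samantha", "Age": 5},
            {"Name": "Elena", "Age": 2},
        ],
    },
]


def find(docs: List[Document], predicate: Callable[[Document], bool]) -> List[Document]:
    """Return every document for which the predicate holds."""
    return [doc for doc in docs if predicate(doc)]


# Query on a nested field that only some documents have.
with_young_children = find(
    documents,
    lambda doc: any(child["Age"] < 6 for child in doc.get("Children", [])),
)
sailors = find(documents, lambda doc: doc.get("Hobby") == "sailing")

print([doc["FirstName"] for doc in with_young_children])
print([doc["FirstName"] for doc in sailors])
```

Using `doc.get(...)` with a default is what lets one query run over documents that lack the field entirely.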
  • 51. Document databases: examples.
  • 52. Graph databases: based on graph theory; employ nodes (objects) and edges (relations between objects). [Figure: example graph with nodes Julie, Steve, Rock Music, Bob, BMW, Fido, Jim and IBM, and edges labelled Sister-in-Law To, Listens To, Married To, Brother Of, Drives, Works For, Colleague Of and Has Pet.]
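Nodes and labelled edges can be sketched as a list of (source, relation, target) triples. The node names and edge labels below come from the slide's figure, but which nodes each edge connects cannot be recovered from this transcript, so the endpoints are assumptions made purely for illustration.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source node, relation, target node)

# ASSUMED endpoints: only the node names and the edge labels come from the slide.
edges: List[Edge] = [
    ("Julie", "Married To", "Steve"),
    ("Steve", "Brother Of", "Bob"),
    ("Julie", "Sister-in-Law To", "Bob"),
    ("Julie", "Listens To", "Rock Music"),
    ("Bob", "Listens To", "Rock Music"),
    ("Bob", "Drives", "BMW"),
    ("Bob", "Has Pet", "Fido"),
    ("Bob", "Works For", "IBM"),
    ("Jim", "Works For", "IBM"),
    ("Bob", "Colleague Of", "Jim"),
]


def targets(edges: List[Edge], source: str, relation: str) -> List[str]:
    """Nodes reached from `source` by following edges labelled `relation`."""
    return sorted(dst for src, rel, dst in edges if src == source and rel == relation)


def sources(edges: List[Edge], relation: str, target: str) -> List[str]:
    """Nodes that have an edge labelled `relation` pointing at `target`."""
    return sorted(src for src, rel, dst in edges if rel == relation and dst == target)


rock_fans = sources(edges, "Listens To", "Rock Music")
ibm_staff = sources(edges, "Works For", "IBM")
bobs_pets = targets(edges, "Bob", "Has Pet")
print(rock_fans, ibm_staff, bobs_pets)
```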
  • 53. Graph databases: examples. Well-suited for problems with network structure: mining data from social media; “customers who bought this also looked at...”; relations between persons; healthcare ontologies???; ...
  • 54. Use the right tool for the right job! http://db-engines.com/
  • 55. Outline, section 3: Big Data in my company?
  • 56. Typical RDBMS scaling story. 1. Initial public launch: from local workstation → remotely hosted MySQL instance. 2. Service popularity ↑, too many reads hitting the database: add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire. 3. Popularity ↑↑, too many writes hitting the database: scale MySQL vertically by buying a beefed-up server (16 cores, 128 GB of RAM, banks of 15k RPM hard drives). Costly.
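Step 2 of the scaling story notes that cached reads are no longer strictly ACID and that cached data must expire. The sketch below shows the idea with an in-process cache with a time-to-live. It is not memcached and does not use its API; the class name, the fake clock and the query string are illustrative.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache query results for `ttl_seconds`; stale entries are reloaded on access."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expiry time)

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit: the database is not touched
        value = loader(key)  # miss or expired: go to the database
        self._entries[key] = (value, now + self._ttl)
        return value


# Demonstration with a fake clock and a fake "database" that counts its reads.
fake_now = [0.0]
db_reads = []


def query_database(sql: str) -> str:
    db_reads.append(sql)
    return f"result of {sql}"


cache = TTLCache(ttl_seconds=10, clock=lambda: fake_now[0])
first = cache.get_or_load("SELECT 1", query_database)
second = cache.get_or_load("SELECT 1", query_database)   # served from the cache
reads_before_expiry = len(db_reads)
fake_now[0] = 11.0                                        # the entry has expired
third = cache.get_or_load("SELECT 1", query_database)
reads_after_expiry = len(db_reads)
print(reads_before_expiry, reads_after_expiry)
```

Between a write to the database and the expiry of the cached entry, readers may see the old value, which is the relaxation of strict consistency that the slide points out.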
  • 57. Typical RDBMS scaling story. 4. New features → query complexity ↑, now too many joins: denormalize your data to reduce joins. (That’s not what they taught me in DBA school!) 5. Rising popularity swamps the server; things are too slow: stop doing any server-side computations.
  • 58. Typical RDBMS scaling story. 6. Some queries are still too slow: periodically prematerialize the most complex queries, and try to stop joining in most cases. 7. Reads are OK, writes are getting slower and slower...: drop secondary indexes and triggers (no indexes?). If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase.
  • 59. Two types of companies (personal view). ‘Core Big Data’ company: core business = big data processing, crunching, analyzing, ... Examples: Google, Facebook, ...; smart metering companies; video/image processing companies; biotech companies with sequencing data; lots of healthcare data???; ...
  • 60. Two types of companies (personal view). ‘General Big Data’ company: some other core business; lots of useful data is available; desirable: business analytics, process optimization, ... Examples: supermarkets → customer cards; transport firms → GPS traces; hospitals → patient and medical info???; ...
  • 61. Use-cases of Big Data. ‘Core Big Data’ company: Big Data crunching, hacking, processing, analyzing, ... ‘General Big Data’ company: business analytics (improve decision-making, gain operational insights, increase overall performance, track and analyze shopping patterns, ...). Both: Explore! Discover hidden gems!
  • 62. Some examples. IBM: predict heart disease long before it strikes. Predict and stop the spread of infectious disease.
  • 63. Some examples. How to predict wine quality? Skip tasting! Use science! Weather seems the key variable: correlate historical weather & wine data. Reduce fuel cost and improve driver safety by analyzing geolocation data.
  • 64. Big Data in your company. Big data is typically a division of the IT department. It requires skilled people: sysadmins, software developers, data scientists, visualization experts, ... Advice, trend (Andrew McAfee): give geeks a seat at the decision-making table.
  • 65. Big Data in your company.
  • 66. IWT TETRA project “Data mining: van relationele database naar Big Data” (Data mining: from relational database to Big Data). Dates: submitted 12/03/2014; notification of acceptance July 2014; runs from 01/10/2014 to 01/10/2016. People involved: Wannes De Smet (researcher), Bart Vandewoestyne (researcher), Johan De Gelas (project coordinator). Thanks for being an interested project partner :-)
  • 67. Outline, section 4: Conclusions.
  • 68. Conclusions. “Big” can be small too. The Big Data landscape is huge. RDBMS and SQL are not dead. The right tool for the right job! Your company can benefit from Big Data technology. We can help. Be brave in your quest...