Here are the slides from a presentation delivered by Big Data Partnership (@BigDataExperts). This masterclass on Big Data Concepts is an hour-long version of the one-day course run by Big Data Partnership (http://www.bigdatapartnership.com/wp-content/uploads/2013/11/BigData-Concepts.pdf).
"This one-day masterclass is an executive briefing on Big Data designed for senior management and business leaders to learn about Big Data concepts and familiarise themselves with the business and technology trends and opportunities.
Includes extensive guidance in applying the right economic, technological and business criteria to the evaluation of Big
Data adoption in your organisation and how it can help you meet business goals, dispelling the myths around Big Data,
and find out what it is and is not, how to get the biggest benefit for your organisation and guidance on the best of breed approach to initiate a Big Data programme."
If you have any questions or would like to learn more about Big Data (including the consultancy, training and support we offer), please get in touch at contact [at] bigdatapartnership dot com.
Imagine you have a legacy database or data warehouse (DW)
=> Data volumes growing rapidly
=> Running out of space
=> We scale up a little, cost goes up a little – nothing too serious
=> Scale up a little more, cost/GB starts to look a little concerning
=> Scale up again, suddenly we hit a discontinuity
=> cost/GB is prohibitive
=> Or performance drops dramatically
=> Or maybe you want to go to the TB-to-PB range and it’s not even possible
=> Problem is, as data increases we have to SCALE UP because of the architecture choices
=> Cost per GB also increases, so it becomes disproportionately more expensive to grow
=> Interestingly, this also broadly holds for the complexity of integrating more business data sources (Variety, the BI problem), not just adding more Volume (the DW problem)
=> Because again the architecture dictates increasingly higher costs as we add more Variety
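Here is a minimal sketch of that scale-up cost curve; the capacities and prices are invented purely for illustration (they are not vendor figures or numbers from the slides):

# Toy model of scaling UP a single machine: each bigger box costs more per GB,
# and past some point there is simply no bigger box to buy.
# All figures below are made up for illustration.

scale_up_tiers = [
    (1_000,     5_000),   # modest server
    (5_000,    40_000),   # bigger box, cost/GB already creeping up
    (20_000,  400_000),   # high-end hardware, cost/GB jumps sharply
    (100_000,    None),   # the discontinuity: no single machine gets you here
]

for capacity_gb, total_cost in scale_up_tiers:
    if total_cost is None:
        print(f"{capacity_gb:>8,} GB: not reachable by scaling up one machine")
    else:
        print(f"{capacity_gb:>8,} GB: ${total_cost:>8,} total -> ${total_cost / capacity_gb:.2f}/GB")

The point is not the specific numbers but the shape of the curve: cost/GB rises with every tier until the next tier simply isn't there.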
The net result of storing large schema-driven databases is that the individual machines they’re housed on must be:
* high quality
* redundant
* highly available
This translates into:
* high costs for servers, infrastructure and support
Clusters of commodity hardware = a distributed system (scale out rather than up)
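A back-of-the-envelope sketch of why that works: replicate each block of data across several cheap nodes, and the block is unavailable only if every replica is down at once. The 5% per-node downtime and the independence of failures are illustrative assumptions, not figures from the slides.

# Replication on commodity nodes: a block of data is unavailable only when
# all of its replicas are down at the same time.
# Assumes independent failures and a made-up 5% per-node unavailability.

node_unavailability = 0.05

for replicas in (1, 2, 3):
    block_unavailability = node_unavailability ** replicas
    print(f"replication factor {replicas}: block unavailable "
          f"{block_unavailability:.4%} of the time")

# replication factor 1: block unavailable 5.0000% of the time
# replication factor 2: block unavailable 0.2500% of the time
# replication factor 3: block unavailable 0.0125% of the time

This is the trade the distributed-system architecture makes: redundancy in software across many cheap boxes, instead of redundancy in hardware inside one expensive box.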
Netflix challenge – the winning team used a very rudimentary algorithm but won because it appended data about the movies from outside the original data set (IMDb).
Google – showed PageRank could outperform the keyword extraction used by other search engines, by leveraging data from outside the page itself (links to the page, treated as votes cast by other page creators).
Facebook – used detailed data about friendships (the real-world social-network topology) to beat other media companies.