Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Big Data Concepts Masterclass
A crash course for executives and managers
@BigDataExperts
Who We Are?
Agenda
Three big questions:
1. What is Big Data?
2. Why should I care?
3. Where do I start?
1. What is Big Data?
What is Big Data?
1. New technology
▪ Volume
▪ Variety
▪ Velocity
2. New philosophy
▪ Value of data
▪ Taming Veracity
▪ Becoming data-driven
▪ Empirical approach: Data Science
3. 1 + 2 = Business Transformation
New technology drivers
What is Big Data?
Volume: The Information Revolution
▪ We are living in an Information Revolution.
▪ The data accumulated over the last 2 years (1 ZB) dwarfs the
entire prior record of human civilization.
▪ Social Media, smart sensors, server logs, finance, e-mail…
Volume: Why can’t I just make it bigger?
Legacy database, experiencing huge growth in data volumes
[Figure: Scale Up. A large application database or data warehouse growing towards TB volumes and beyond; as data volume rises, cost climbs from $ / GB to $$$$ / GB. Axes: data volume, performance, cost.]
Variety: Why won’t it load my data?
▪ Businesses are increasingly moving beyond relational data –
80% of enterprise data is unstructured.
▪ The rise of social media data integrated with other
enterprise data leaves us with the problem of handling
complex graph data.
▪ Machine-generated data such as log data is often semi-
structured.
▪ Often as datasets get much larger, it is more efficient to
leave them in their original format and store them that
way, than to transform everything into a normalised
relational schema.
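The last point, keeping data in its original format and applying structure only when it is read, can be sketched in a few lines of Python. This is a minimal illustration, not a recommendation of a specific tool; the log lines and field names are invented for the example.

```python
import json

# Hypothetical semi-structured log lines, stored exactly as they arrived.
RAW_LOG_LINES = [
    '{"ts": "2014-03-01T10:00:00", "user": "alice", "action": "login"}',
    '{"ts": "2014-03-01T10:00:05", "user": "bob", "action": "play", "track_id": 42}',
    '{"ts": "2014-03-01T10:00:09", "user": "alice", "action": "logout", "device": "mobile"}',
]


def read_field(lines, field, default=None):
    """Apply structure at read time: parse each raw line and pull out one field."""
    return [json.loads(line).get(field, default) for line in lines]


# Fields that only some records carry do not force a schema change up front.
users = read_field(RAW_LOG_LINES, "user")
track_ids = read_field(RAW_LOG_LINES, "track_id")
```

Records with extra or missing fields are simply handled when the question is asked, rather than being forced into fixed columns at load time.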
Velocity: Why can’t I capture everything?
▪ All single-server information systems have limits on
throughput.
▪ The only question is whether you hit that limit or not.
▪ If you do, your options are limited unless you have a
distributed system to capture the data as it arrives.
▪ Distributed systems which are designed in an appropriate
way can scale linearly to accept increasing data
throughput rates, effectively lifting the cap on capture
throughput.
▪ In today’s high data intensity applications, this is
becoming ever more important.
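One common way a distributed system spreads incoming data over many machines is to hash each event's key and use the result to choose a node. The sketch below assumes that approach; the function names and the `sensor-N` keys are hypothetical, and real systems add replication and rebalancing on top.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Pick a capture node for an event key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(event_keys, num_nodes):
    """Group event keys by the node that would capture them."""
    assignment = defaultdict(list)
    for key in event_keys:
        assignment[node_for(key, num_nodes)].append(key)
    return dict(assignment)


# The same stream of 1000 hypothetical sensor events, captured by 4 or 8 nodes.
events = [f"sensor-{i}" for i in range(1000)]
four_nodes = distribute(events, 4)
eight_nodes = distribute(events, 8)
```

Because every event is routed independently, adding nodes gives each machine a smaller share of the same stream, which is the intuition behind scaling capture throughput by scaling out.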
Cheat Sheet: Big Data Jargon
Hadoop
▪ Open-source framework for storing and processing large
data sets.
▪ Uses clusters of commodity hardware to tackle big data
challenges in an affordable way.
▪ Designed to cope with failures automatically.
▪ Can be scaled out from one server to thousands of
machines.

Scale Out
Cheat Sheet: Big Data Jargon
NoSQL
▪ Means “Not Only SQL”
▪ Refers to most databases which post-date the SQL era.
Some support SQL, or SQL-derived languages.
▪ May be capable of handling Big Data (a Distributed
System), or may be limited to a single server.
▪ Often represent data in more flexible ways than
spreadsheets, e.g. a “map” of many Item=>Value pairs,
or a “graph” of many items and the relationships
between them.
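The two representations mentioned in the last bullet can be illustrated with plain Python data structures. This is only a sketch of the shapes involved, not the API of any particular NoSQL database; all names and values are invented.

```python
# A "map" of Item => Value pairs: one record, with whatever fields it happens to have.
customer = {
    "id": "c-1001",
    "name": "Alice",
    "interests": ["jazz", "cycling"],  # values need not be flat columns
}

# A "graph": items plus the relationships between them,
# stored here as an adjacency list (item -> set of related items).
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}


def followers_of(graph, person):
    """Return everyone who has a relationship pointing at `person`."""
    return sorted(src for src, targets in graph.items() if person in targets)


carol_followers = followers_of(follows, "carol")
```

Neither shape fits comfortably into a single spreadsheet-style table, which is the flexibility the bullet refers to.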
New philosophy
What is Big Data?
What’s Data Science all about?
▪ Data + Science
▪ Science: theory + experiment => evidence => insight
▪ Science: “the empirical method” = evidence-based
approach

Never based on assumptions or intuition.
▪ Data Science movement, particularly in the context of Big
Data, is all about making business data-driven and
empirical.
What’s Data Science all about?
▪ Before: analysts used intuition and domain knowledge to
draw conclusions from statistics.
▪ Unfortunately, statistics can be easily manipulated, as we
often see in the media.

“There are lies, damned lies, and statistics” – popularised by Mark Twain, who attributed it to Benjamin Disraeli
▪ Critical evaluation of data empirically is key to avoiding
bias.
▪ More modern techniques such as Bayesian statistics can
help to remove subjective bias.
▪ Machine Learning methods can remove the human
element almost entirely.
Data Science + Big Data
✗ More data + limited compute resource => More aggressive sampling => Less accurate results
✗ Improve accuracy of results + limited compute resource => More complex models => Less accountable results
✓ All data + scalable compute resource => No sampling => More accurate results
✓ All data + scalable compute resource => Less complex models => More accountable results
Often quoted as “more data trumps smarter algorithms” (Google)
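One way to see why heavier sampling costs accuracy: for a simple random sample, the standard error of an estimated mean is the population standard deviation divided by the square root of the sample size. The sketch below assumes independent samples and a known standard deviation; the numbers are illustrative only.

```python
import math


def standard_error(population_std: float, sample_size: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return population_std / math.sqrt(sample_size)


sigma = 10.0  # assumed spread of the quantity being measured
se_small = standard_error(sigma, 100)      # aggressive sampling
se_large = standard_error(sigma, 10_000)   # 100x more data
```

Under these assumptions, using 100 times more data cuts the uncertainty of the estimate by a factor of 10, and analysing the whole dataset removes sampling error altogether.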
Business Transformation
What is Big Data?
Why do we need to change?
▪ New technology: disruptive
▪ New philosophy: challenging to existing processes
▪ Business transformation: new strategy, new roles
Big Data Strategy
Roles: Big Data Engineer, Big Data Architect, Data Scientist, Chief Data Officer
Why do we need a Big Data strategy?
Any major change programme needs a strategy to steer it.
1. Everyone will be pulling in the same direction.
2. Performance can be measured against the strategy later.
3. Target outcomes will be clearly defined.
4. The business will understand the need for the programme.
Why can’t I just ‘build it and they will come’?
Do I need a Data Scientist?
Understanding the Data Scientist role
Data Analyst
▪ SAS, SPSS
▪ Excel, possibly R
▪ Relational Databases
▪ SQL
▪ Some training in statistics
▪ Education in IT or Business
▪ Happy with table or spreadsheet
formatted data
Data Scientist
▪ Statistics
▪ SAS, SPSS
▪ R
▪ Relational Databases
▪ SQL
▪ Education in Maths or Physics
▪ Happy with any data formats
and data varieties
▪ Machine Learning
▪ Big Data
▪ NoSQL, Hadoop, Cassandra
Understanding the Data Scientist role
Because they are a scientist, their job is to explore and discover
within your business’s data. To do that, they need:
1. Access to all data => break down information silos
2. Tools to explore => big data computing infrastructure
3. Freedom to explore and discover => changes to policy and team
structure
Use Case #101:
Data Lake
▪ Consolidation of data silos for
combined analysis, online
archival, and free-range Data
Science exploration.
▪ Often begins as a POC.
▪ Value could take a long time to
emerge, and could be difficult
to plan or predict
(unknown unknowns).
ROI analysis:
Value uplift from new insight
should be greater than
cost of big data implementation +
cost of data source integration +
cost of staffing Data Science team
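The ROI test above is simple arithmetic, and can be written down as a small check. The function name and all figures below are made up for illustration; only the inequality itself comes from the slide.

```python
def data_lake_is_worthwhile(value_uplift, implementation_cost,
                            integration_cost, staffing_cost):
    """True when value uplift exceeds the sum of the three cost items."""
    total_cost = implementation_cost + integration_cost + staffing_cost
    return value_uplift > total_cost


# Made-up figures, all in the same currency unit.
worthwhile = data_lake_is_worthwhile(
    value_uplift=1_200_000,
    implementation_cost=400_000,
    integration_cost=250_000,
    staffing_cost=450_000,
)
```

In practice the hard part is the first argument: as noted above, the value from a Data Lake can be slow to emerge and difficult to predict.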
Where is Big Data heading?
▪ Big Data is here to stay.
Data volumes are not going to decrease!
▪ We see data processing becoming increasingly commoditised.
Vendors are proliferating, and processing is simply a matter of mechanics.
▪ We see Machine Learning becoming far more widespread.
More complex relationships are harder for humans to identify.
▪ We see Data Science permeating a much wider range of
businesses and taking over as the next boom industry.
The 24-hour global economy makes being data-driven
increasingly valuable.
▪ Investment in Big Data technology is a solid foundation, but
investment in Machine Learning and Data Science expertise
will really put you at the front of the pack.
2. Why should I care?
Why should I care?
1. Quality of insight
2. Time to insight
3. Competitive advantage
Quality of Insight
Recap: Data Science + Big Data
✗ More data + limited compute resource => More aggressive sampling => Less accurate results
✗ Improve accuracy of results + limited compute resource => More complex models => Less accountable results
✓ All data + scalable compute resource => No sampling => More accurate results
✓ All data + scalable compute resource => Less complex models => More accountable results
Data Science + Big Data
✓ More accurate results => Better decisions => More efficiency or revenue
✓ More accountable results => Better traceability => Less risk + regulatory compliance
Clear, quantifiable business outcomes (see Use Case #102)
Use Case #102:
Migrate & scale existing analytic models
▪ Identify existing analytic models
which suffer from sampling of
input data, or overly complex
models. Migrate to big data
platform, scale out to whole
dataset and/or simplify model.
▪ Can go directly to POV with real
measurable business value.
▪ Rapid turnaround for POV if
models are not too difficult to
migrate to your chosen
platform.
ROI analysis:
Value uplift from use case =
value from improved model accuracy –
cost of migration work

Case Studies
Case Study: Retail
Predicting fashion trends for retailers
▪ Client: Global publisher providing fashion insight & trend analysis for customers.
▪ Wanted superior market intelligence to inform crucial retail buyer decisions.
▪ Challenges:
▪ Consume vast amounts of unstructured data from the web.
▪ Make accurate, actionable predictions from the data.
▪ Use cases:
▪ Large-scale parallel data processing of unstructured data from uncontrolled sources.
▪ Predictive analytics & machine learning.
▪ Used big data ecosystem technologies (Hadoop, Hive, Pig) to collect, process, transform
the data and serve the front-end.
▪ Outcome:
▪ Platform successfully launched Sept 2013
▪ Opened up new business stream as this was a new product
Case Study: Music Industry
Digital music service play analytics, recommendations, and royalties
▪ Client: leading online music streaming service
▪ Music listening habits of millions of users, measured across millions of tracks.
▪ Challenges:
▪ Connecting datasets from different application systems, too large for a traditional database.
▪ Generating actionable reports and recommendations.
▪ Use cases:
▪ Reporting, and royalty charge computation.
▪ Generating recommendations for users to help them find new music.
▪ Outcome:
▪ Richer information about users in a shorter time frame
▪ Lower overheads and lower cost than the previous system
▪ = significant operational efficiency improvements
Case Study: M2M
Machine-to-machine data across various industries
▪ M2M data = telemetry collected from industrial machines
(e.g. production line robots, power plants, aircraft engines, …).
▪ Can be analysed to increase efficiency of those machines.
▪ Individually,
▪ or optimise many of them as a collective system.
▪ GE conducted a detailed study of the impact of a 1%
improvement in productivity across different industries, as
a result of Machine Data Analytics with big data
technology.
Case Study: M2M
Machine-to-machine data across various industries
▪ Use cases:
▪ Asset management & Predictive maintenance
▪ Aggregate view across geography, machines, components, parts
▪ Deliver optimal number of parts to right location at right time
▪ Minimise parts inventory held, and maintenance costs
▪ Predictive analytics to replace parts before failure
▪ Supply chain optimisation
▪ RFID & smart sensors
▪ Deliver goods at optimal time, e.g. fresh produce
▪ Monitor state of goods in transit, adjust logistics in real-time
▪ Transport fleet optimisation
▪ Interconnected vehicles know their own and other vehicles’
locations
▪ Optimise routing to find most efficient system-level solution
3. Where do I start?
Life cycle of a Big Data programme
Education (1) => Analysis (2) => Discovery (3) => Prototype (4) => Implementation (5) => Evolution (6)
Company Strategy
Big Data Strategy
Cheat Sheet: POC vs POV
Proof of Concept
▪ Select a use case to illustrate
with
▪ Sample, or mock up data on a
smaller scale
▪ Build a scaled-down version of
the full use case
▪ Prove the technology can
deliver as intended for use
case, and can scale to the full
dataset
▪ Preferably through repeatable,
automated unit tests
Proof of Value
▪ Select a use case to illustrate
with
▪ Sample, or partition real data to
a smaller scale
▪ Build a scaled-down version of
the full use case
▪ Prove the technology can deliver
business value from insight
generated for the use case
▪ Document implementation cost
vs business value uplift
rigorously
Business Transformation
1. Make sure there is an owner of data across the
organisation,
▪ Chief Data Officer is an ideal role for this if you can do it.
2. Organise your Data scientists so they are best placed to
support the business goals,
▪ one central team,
▪ one team per analysis type, or
▪ one person dedicated to each business unit, for instance.
3. Make sure IT is able to make the data available to those
individuals in the right way (sandpit, right tools, access
etc.).
Integrating with the enterprise Data Warehouse
▪ Systems like Hadoop are not a full replacement for a Data
Warehouse.
▪ There are overlapping qualities.
▪ Hadoop is not transactional, nor does it support fine-grained
access to data.
▪ Hadoop is fundamentally a batch oriented system, so mixed
workloads are not easily supported.
▪ Best practice is to use Hadoop to complement an existing Data
Warehouse.
▪ Hadoop can offload “cold” or rarely accessed data to act as an
online archive.
▪ Hadoop can offload expensive ETL processing.
▪ Hadoop can efficiently generate aggregations/summaries, and
export these to the Data Warehouse for enterprise use.
▪ Keep only the highest-value data in Data Warehouse.
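The "generate summaries, then export" pattern can be sketched with a map step and a reduce step in plain Python. This only illustrates the shape of the computation and is not the Hadoop API; the play records, field names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed records that would live on the big data platform.
PLAYS = [
    {"day": "2014-03-01", "track": "t1", "seconds": 180},
    {"day": "2014-03-01", "track": "t2", "seconds": 200},
    {"day": "2014-03-01", "track": "t1", "seconds": 175},
    {"day": "2014-03-02", "track": "t1", "seconds": 180},
]


def map_phase(records):
    """Emit one (key, value) pair per record; the key is what we summarise by."""
    for r in records:
        yield (r["day"], r["track"]), r["seconds"]


def reduce_phase(pairs):
    """Group values by key and total them, as a reducer would per key."""
    totals = defaultdict(lambda: {"plays": 0, "seconds": 0})
    for key, seconds in pairs:
        totals[key]["plays"] += 1
        totals[key]["seconds"] += seconds
    return dict(totals)


def to_warehouse_rows(summary):
    """Flatten the summary into rows ready to load into a warehouse table."""
    return sorted(
        (day, track, agg["plays"], agg["seconds"])
        for (day, track), agg in summary.items()
    )


rows = to_warehouse_rows(reduce_phase(map_phase(PLAYS)))
```

The detailed records stay on the cheaper platform, and only the much smaller summary rows are exported for enterprise reporting.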
Summary
▪ Dispelled myths:
▪ Big Data is only about technology
▪ Big Data is only relevant to technologists
▪ Hadoop is magic
▪ Hadoop is an unknown black box
▪ A Big Data approach can help with problems which may
combine Volume, Variety, Velocity.
▪ A Big Data approach is in demand because it is helping to
increase business value and reduce time to insight.
▪ Data Science is key to getting full value from a Big Data
platform.
Further Training
▪ Apache Hadoop 2.0 Developing Java Applications (4 days)
▪ Apache Hadoop 2.0 Development for Data Analysts (4
days)
▪ Apache Hadoop 2.0 Operations Management (3 days)
▪ MapR Hadoop Fundamentals of Administration (3 days)
▪ Apache Cassandra DevOps Fundamentals (3 days)
▪ Apache Hadoop Masterclass (1 day)
▪ Big Data Concepts Masterclass (1 day)
▪ Machine Learning at scale with Apache Mahout (1 day)
Contact Details
Tim Seears
CTO
Big Data Partnership
info@bigdatapartnership.com
@BigDataExperts

Contenu connexe

En vedette

Organigrama
OrganigramaOrganigrama
Organigramanatalid
 
Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...
Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...
Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...eCommerce Institute
 
Oficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart City
Oficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart CityOficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart City
Oficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart CityEOI Escuela de Organización Industrial
 
Imbiss trailers
Imbiss trailersImbiss trailers
Imbiss trailersEUROPAGES
 
Big data Europe: concept, platform and pilots
Big data Europe: concept, platform and pilotsBig data Europe: concept, platform and pilots
Big data Europe: concept, platform and pilotsBigData_Europe
 
Snapchat Marketing: La Guía Sin Censuras
Snapchat Marketing: La Guía Sin CensurasSnapchat Marketing: La Guía Sin Censuras
Snapchat Marketing: La Guía Sin CensurasAllan V. Braverman
 
Essential Management for Sport Managers.
Essential Management for Sport Managers. Essential Management for Sport Managers.
Essential Management for Sport Managers. Joan Celma
 
Vogue uk april_2016
Vogue uk april_2016Vogue uk april_2016
Vogue uk april_2016PrivetOUTLET
 
Classification and Nomenclature of Organic Halides
Classification and Nomenclature of Organic HalidesClassification and Nomenclature of Organic Halides
Classification and Nomenclature of Organic HalidesCyra Mae Soreda
 
El calentamiento 1º
El calentamiento 1ºEl calentamiento 1º
El calentamiento 1ºferbergn
 
La salud mental en la familia
La salud mental en la familiaLa salud mental en la familia
La salud mental en la familiayaritza_93_12
 
Laura Ters, Home Smart Realty - Seller Presentation
Laura Ters, Home Smart Realty - Seller PresentationLaura Ters, Home Smart Realty - Seller Presentation
Laura Ters, Home Smart Realty - Seller PresentationDiane Neslund
 
Big data ... for security
Big data ... for securityBig data ... for security
Big data ... for securityJames Salter
 
Ocimf annual report_2013
Ocimf annual report_2013Ocimf annual report_2013
Ocimf annual report_2013OCIMF OVID
 

En vedette (15)

Organigrama
OrganigramaOrganigrama
Organigrama
 
Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...
Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...
Presentación Mario Miranda -Workshop 3| Abril "Como Aumentar mis Ventas a tra...
 
Oficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart City
Oficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart CityOficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart City
Oficina Técnica para Consultoría de Innovación en Pymes de Elche – Smart City
 
Imbiss trailers
Imbiss trailersImbiss trailers
Imbiss trailers
 
Big data Europe: concept, platform and pilots
Big data Europe: concept, platform and pilotsBig data Europe: concept, platform and pilots
Big data Europe: concept, platform and pilots
 
Snapchat Marketing: La Guía Sin Censuras
Snapchat Marketing: La Guía Sin CensurasSnapchat Marketing: La Guía Sin Censuras
Snapchat Marketing: La Guía Sin Censuras
 
Essential Management for Sport Managers.
Essential Management for Sport Managers. Essential Management for Sport Managers.
Essential Management for Sport Managers.
 
Vogue uk april_2016
Vogue uk april_2016Vogue uk april_2016
Vogue uk april_2016
 
Classification and Nomenclature of Organic Halides
Classification and Nomenclature of Organic HalidesClassification and Nomenclature of Organic Halides
Classification and Nomenclature of Organic Halides
 
El calentamiento 1º
El calentamiento 1ºEl calentamiento 1º
El calentamiento 1º
 
La salud mental en la familia
La salud mental en la familiaLa salud mental en la familia
La salud mental en la familia
 
Laura Ters, Home Smart Realty - Seller Presentation
Laura Ters, Home Smart Realty - Seller PresentationLaura Ters, Home Smart Realty - Seller Presentation
Laura Ters, Home Smart Realty - Seller Presentation
 
Big data ... for security
Big data ... for securityBig data ... for security
Big data ... for security
 
SOC Quantitative Approach
SOC Quantitative ApproachSOC Quantitative Approach
SOC Quantitative Approach
 
Ocimf annual report_2013
Ocimf annual report_2013Ocimf annual report_2013
Ocimf annual report_2013
 

Dernier

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Dernier (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...

Big Data Concepts Masterclass

  • 1. Copyright © 2014 Big Data Partnership Ltd. All rights reserved. Big Data Concepts Masterclass. A crash course for executives and managers. @BigDataExperts
  • 2. Who We Are
  • 3. Agenda. Three big questions: 1. What is Big Data? 2. Why should I care? 3. Where do I start?
  • 4. 1. What is Big Data?
  • 5. What is Big Data?
      1. New technology
          ▪ Volume
          ▪ Variety
          ▪ Velocity
      2. New philosophy
          ▪ Value of data
          ▪ Taming Voracity
          ▪ Becoming data-driven
          ▪ Empirical approach: Data Science
      3. 1 + 2 = Business Transformation
  • 6. New technology drivers (What is Big Data?)
  • 7. Volume: The Information Revolution
      ▪ We are living in an Information Revolution.
      ▪ The data accumulated over the last 2 years (1 ZB) dwarfs the entire prior record of human civilization.
      ▪ Social media, smart sensors, server logs, finance, e-mail…
  • 8. Volume: Why can’t I just make it bigger?
      [Diagram: a legacy database (large application database or data warehouse) experiencing huge growth in data volumes. Each scale-up step raises the price, from $ / GB through $$ / GB and $$$ / GB to $$$$ / GB, with the TB range marked “???”. Labels: Data Volume, Performance, Cost, Scale Up.]
  • 9. Variety: Why won’t it load my data?
      ▪ Businesses are increasingly moving beyond relational data: 80% of enterprise data is unstructured.
      ▪ The rise of social media data integrated with other enterprise data leaves us with the problem of handling complex graph data.
      ▪ Machine-generated data such as log data is often semi-structured.
      ▪ As datasets get much larger, it is often more efficient to store them in their original format than to transform everything into a normalised relational schema.
  • 10. Velocity: Why can’t I capture everything?
      ▪ All single-server information systems have limits on throughput.
      ▪ The only question is whether you hit that limit or not.
      ▪ If you do, your options are limited unless you have a distributed system to capture the data as it arrives.
      ▪ Distributed systems which are designed in an appropriate way can scale linearly to accept increasing data throughput rates, effectively lifting the cap on capture throughput.
      ▪ In today’s high data intensity applications, this is becoming ever more important.
  • 11. Cheat Sheet: Big Data Jargon. Hadoop
      ▪ Open-source framework for storing and processing large data sets.
      ▪ Uses clusters of commodity hardware to tackle the big data challenge in an affordable way.
      ▪ Designed to cope with failures automatically.
      ▪ Can be scaled out from one server to thousands of machines.
      [Diagram label: Scale Out]
  • 12. Cheat Sheet: Big Data Jargon. NoSQL
      ▪ Means “Not Only SQL”.
      ▪ Refers to most databases which post-date the SQL era. Some support SQL, or SQL-derived languages.
      ▪ May be capable of handling Big Data (a Distributed System), or may be limited to a single server.
      ▪ Often represent data in more flexible ways than spreadsheets, e.g. a “map” of many Item=>Value pairs, or a “graph” of many items and the relationships between them.
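As a rough illustration of the two shapes named in the NoSQL cheat sheet, the sketch below models a “map” of Item=>Value pairs as a Python dict and a “graph” as an adjacency mapping. The record fields, user keys, and the `neighbours` helper are hypothetical and are not tied to any particular NoSQL product.

```python
# Hypothetical key-value "map": each item key points to a flexible value.
# Records need not share the same fields, unlike rows in a fixed table.
profile_store = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Bob", "interests": ["music", "retail"]},
}

# Hypothetical "graph": items plus the relationships between them,
# stored as an adjacency mapping from each item to the items it links to.
follows = {
    "user:1": {"user:2"},
    "user:2": {"user:1", "user:3"},
    "user:3": set(),
}


def neighbours(graph, item):
    """Return the items directly related to `item`, sorted for stable output."""
    return sorted(graph.get(item, set()))


print(profile_store["user:2"].get("email"))  # this record has no email field
print(neighbours(follows, "user:2"))
```

The point of the sketch is only that neither structure forces every record into the same columns, which is what the slide means by "more flexible ways than spreadsheets".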
  • 13. New philosophy (What is Big Data?)
  • 14. What’s Data Science all about?
      ▪ Data + Science
      ▪ Science: theory + experiment => evidence => insight
      ▪ Science: “the empirical method” = evidence-based approach. Never based on assumptions or intuition.
      ▪ The Data Science movement, particularly in the context of Big Data, is all about making business data-driven and empirical.
  • 15. What’s Data Science all about?
      ▪ Before: analysts used intuition and domain knowledge to draw conclusions from statistics.
      ▪ Unfortunately, statistics can be easily manipulated, as we often see in the media. “There are lies, damned lies, and statistics” (Mark Twain)
      ▪ Critical evaluation of data empirically is key to avoiding bias.
      ▪ More modern techniques such as Bayesian statistics can help to remove subjective bias.
      ▪ Machine Learning methods can remove the human element almost entirely.
  • 16. Data Science + Big Data
      ✗ More data + limited compute resource => More aggressive sampling => Less accurate results
      ✗ Improve accuracy of results + limited compute resource => More complex models => Less accountable results
      ✓ All data + scalable compute resource => No sampling => More accurate results
      ✓ All data + scalable compute resource => Less complex models => More accountable results
      Often quoted as “more data trumps smarter algorithms” (Google)
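The contrast between sampling and using all the data can be illustrated with a small, deterministic sketch. Every number here is invented for illustration: 1,000 hypothetical transactions, three of which are flagged as rare events, and a 10% systematic sample standing in for "aggressive sampling".

```python
# Hypothetical dataset: 1,000 transactions, three of them flagged as rare events.
# The positions are invented for illustration only.
RARE_POSITIONS = {7, 501, 998}
transactions = [i in RARE_POSITIONS for i in range(1000)]


def rare_rate(records):
    """Fraction of records that are flagged as rare events."""
    return sum(records) / len(records)


# Limited compute: keep only every 10th record (a 10% systematic sample).
sample = transactions[::10]

# Scalable compute: no sampling, use every record.
full_rate = rare_rate(transactions)
sample_rate = rare_rate(sample)

print(f"full data: {full_rate:.3f}")
print(f"10% sample: {sample_rate:.3f}")
```

In this contrived case the sample contains none of the rare events, so anything built on it would underestimate them. A random sample would vary from run to run but carries the same risk when the events of interest are rare.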
  • 17. Business Transformation (What is Big Data?)
  • 18. Why do we need to change?
      • New technology
          • Disruptive
      • New philosophy
          • Challenging to existing processes
      • Business transformation
          • New strategy, new roles: Big Data Strategy, Big Data Engineer, Big Data Architect, Data Scientist, Chief Data Officer
  • 19. Why do we need a Big Data strategy? Any major change programme needs a strategy to steer it.
      1. Everyone will be pulling in the same direction.
      2. Performance can be measured against the strategy later.
      3. Target outcomes will be clearly defined.
      4. The business will understand the need for the programme.
  • 20. Why can’t I just ‘build it and they will come’? ×
  • 21. Do I need a Data Scientist?
  • 22. Understanding the Data Scientist role
      Data Analyst
          ▪ SAS, SPSS
          ▪ Excel, possibly R
          ▪ Relational Databases
          ▪ SQL
          ▪ Some training in statistics
          ▪ Education in IT or Business
          ▪ Happy with table or spreadsheet formatted data
      Data Scientist
          ▪ Statistics
          ▪ SAS, SPSS
          ▪ R
          ▪ Relational Databases
          ▪ SQL
          ▪ Education in Maths or Physics
          ▪ Happy with any data formats and data varieties
          ▪ Machine Learning
          ▪ Big Data
          ▪ NoSQL, Hadoop, Cassandra
  • 23. Understanding the Data Scientist role. Because they are scientists, their job is to explore and discover within your business’s data.
      1. Access to all data => break down information silos
      2. Tools to explore => big data computing infrastructure
      3. Freedom to explore and discover => changes to policy and team structure
  • 24. Use Case #101: Data Lake
      ▪ Consolidation of data silos for combined analysis, online archival, and free-range Data Science exploration.
      ▪ Often begins as a POC.
      ▪ Value could take a long time to emerge, and could be difficult to plan or predict (unknown unknowns).
      ROI analysis: value uplift from new insight should be greater than cost of big data implementation + cost of data source integration + cost of staffing Data Science team.
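The ROI condition for the Data Lake use case amounts to simple arithmetic. A minimal sketch follows; the function name `data_lake_roi_ok` is hypothetical and every figure is a made-up placeholder, not a number from the deck.

```python
def data_lake_roi_ok(value_uplift, implementation_cost, integration_cost, staffing_cost):
    """Return True when the value uplift exceeds the three summed costs from the slide."""
    total_cost = implementation_cost + integration_cost + staffing_cost
    return value_uplift > total_cost


# Placeholder figures (same currency unit throughout); not taken from the deck.
print(data_lake_roi_ok(
    value_uplift=1_200_000,
    implementation_cost=400_000,
    integration_cost=250_000,
    staffing_cost=450_000,
))
```

Because the slide warns that value may take a long time to emerge, the uplift figure is the hardest input to estimate; the cost terms are usually easier to pin down in advance.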
  • 25. Where is Big Data heading?
      ▪ Big Data is here to stay. Data volumes are not going to decrease!
      ▪ We see data processing becoming increasingly commoditised. Vendor proliferation + it is simply a matter of mechanics.
      ▪ We see Machine Learning becoming far more widespread. More complex relationships are harder for humans to identify.
      ▪ We see Data Science permeating a much wider range of businesses and taking over as the next boom industry. The 24-hour global economy makes being data-driven increasingly valuable.
      ▪ Investment in Big Data technology is a solid foundation, but investment in Machine Learning and Data Science expertise will really put you at the front of the pack.
  • 26. 2. Why should I care?
  • 27. Why should I care? 1. Quality of insight 2. Time to insight 3. Competitive advantage
  • 28. Quality of Insight
  • 29. Recap: Data Science + Big Data
      ✗ More data + limited compute resource => More aggressive sampling => Less accurate results
      ✗ Improve accuracy of results + limited compute resource => More complex models => Less accountable results
      ✓ All data + scalable compute resource => No sampling => More accurate results
      ✓ All data + scalable compute resource => Less complex models => More accountable results
  • 30. Data Science + Big Data
      ✓ More accurate results => Better decisions => More efficiency or revenue
      More accountable results => Better traceability => Less risk + regulatory compliance
      ✓ Clear, quantifiable business outcomes: Use Case #102
  • 31. Use Case #102: Migrate & scale existing analytic models
      ▪ Identify existing analytic models which suffer from sampling of input data, or which are overly complex. Migrate to big data platform, scale out to whole dataset and/or simplify model.
      ▪ Can go directly to POV with real measurable business value.
      ▪ Rapid turnaround for POV if models are not too difficult to migrate to your chosen platform.
      ROI analysis: value uplift from use case = value from improved model accuracy - cost of migration work.

  • 32. Case Studies
  • 33. Case Study: Retail. Predicting fashion trends for retailers
      ▪ Client: Global publisher providing fashion insight & trend analysis for customers.
      ▪ Wanted superior market intelligence to inform crucial retail buyer decisions.
      ▪ Challenges:
          ▪ Consume vast amounts of unstructured data from the web.
          ▪ Make accurate, actionable predictions from the data.
      ▪ Use cases:
          ▪ Large-scale parallel data processing of unstructured data from uncontrolled sources.
          ▪ Predictive analytics & machine learning.
          ▪ Used big data ecosystem technologies (Hadoop, Hive, Pig) to collect, process, transform the data and serve the front-end.
      ▪ Outcome:
          ▪ Platform successfully launched Sept 2013
          ▪ Opened up new business stream as this was a new product
  • 34. Case Study: Music Industry. Digital music service play analytics, recommendations, and royalties
      ▪ Client: leading online music streaming service
      ▪ Music listening habits of millions of users, measured across millions of tracks.
      ▪ Challenges:
          ▪ Connecting datasets from different application systems, too large for a traditional database.
          ▪ Generating actionable reports and recommendations.
      ▪ Use cases:
          ▪ Reporting, and royalty charge computation.
          ▪ Generating recommendations for users to help them find new music.
      ▪ Outcome:
          ▪ Richer information about users in a shorter time frame
          ▪ Lower overheads, at lower cost than the previous system
          ▪ = significant operational efficiency improvements
  • 35. Case Study: M2M. Machine-to-machine data across various industries
      ▪ M2M data = telemetry collected from industrial machines (e.g. production line robots, power plants, aircraft engines, …).
      ▪ Can be analysed to increase efficiency of those machines,
          ▪ individually,
          ▪ or to optimise many of them as a collective system.
      ▪ GE conducted a detailed study of the impact of a 1% improvement in productivity across different industries, as a result of Machine Data Analytics with big data technology.
  • 36. Case Study: M2M. Machine-to-machine data across various industries
      ▪ Use cases:
          ▪ Asset management & predictive maintenance
              ▪ Aggregate view across geography, machines, components, parts
              ▪ Deliver optimal number of parts to right location at right time
              ▪ Minimise parts inventory held, and maintenance costs
              ▪ Predictive analytics to replace parts before failure
          ▪ Supply chain optimisation
              ▪ RFID & smart sensors
              ▪ Deliver goods at optimal time, e.g. fresh produce
              ▪ Monitor state of goods in transit, adjust logistics in real-time
          ▪ Transport fleet optimisation
              ▪ Interconnected vehicles know their own + other vehicles’ location
              ▪ Optimise routing to find most efficient system-level solution
  • 37. 3. Where do I start?
  • 38. Life cycle of a Big Data programme: Education (1) => Analysis (2) => Discovery (3) => Prototype (4) => Implementation (5) => Evolution (6). Inputs: Company Strategy, Big Data Strategy.
  • 39. Cheat Sheet: POC vs POV
      Proof of Concept
          ▪ Select a use case to illustrate with
          ▪ Sample, or mock up data on a smaller scale
          ▪ Build a scaled-down version of the full use case
          ▪ Prove the technology can deliver as intended for use case, and can scale to the full dataset
          ▪ Preferably through repeatable, automated unit tests
      Proof of Value
          ▪ Select a use case to illustrate with
          ▪ Sample, or partition real data to a smaller scale
          ▪ Build a scaled-down version of the full use case
          ▪ Prove the technology can deliver business value from insight generated for the use case
          ▪ Document implementation cost vs business value uplift rigorously
  • 40. Business Transformation
  • 41. Business Transformation
      1. Make sure there is an owner of data across the organisation.
          ▪ Chief Data Officer is an ideal role for this if you can do it.
      2. Organise your Data Scientists so they are best placed to support the business goals, for instance:
          ▪ one central team,
          ▪ one team per analysis type, or
          ▪ one person dedicated to each business unit.
      3. Make sure IT is able to make the data available to those individuals in the right way (sandpit, right tools, access etc.).
  • 42. Integrating with the enterprise Data Warehouse
      ▪ Systems like Hadoop are not a full replacement for a Data Warehouse.
          ▪ There are overlapping qualities.
          ▪ Hadoop is not transactional, nor does it support fine-grained access to data.
          ▪ Hadoop is fundamentally a batch-oriented system, so mixed workloads are not easily supported.
      ▪ Best practice is to use Hadoop to complement an existing Data Warehouse.
          ▪ Hadoop can offload “cold” or rarely accessed data to act as an online archive.
          ▪ Hadoop can offload expensive ETL processing.
          ▪ Hadoop can efficiently generate aggregations/summaries, and export these to the Data Warehouse for enterprise use.
          ▪ Keep only the highest-value data in Data Warehouse.
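The aggregation offload described for Hadoop and the Data Warehouse can be pictured with a toy map/group/reduce pass in plain Python. This is not Hadoop API code: the event records, field names, and the `summarise` function are hypothetical, and a real job would run distributed across a cluster rather than in one process.

```python
from collections import defaultdict

# Hypothetical raw events, as they might sit in a large "cold" store.
events = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "APAC", "amount": 80.0},
    {"region": "EMEA", "amount": 30.0},
    {"region": "AMER", "amount": 55.5},
]


def summarise(records):
    """Group records by region and total the amounts (map, group, reduce)."""
    # Map: emit a (key, value) pair per record.
    pairs = ((r["region"], r["amount"]) for r in records)
    # Group and reduce: sum the values for each key.
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)


# Only this small summary would be exported to the Data Warehouse.
summary = summarise(events)
print(sorted(summary.items()))
```

The design point is that the bulky raw events stay on the cheaper batch platform, while only the compact per-region totals move into the warehouse for enterprise use.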
  • 43. Summary
      ▪ Dispelled myths:
          ▪ Big Data is only about technology
          ▪ Big Data is only relevant to technologists
          ▪ Hadoop is magic
          ▪ Hadoop is an unknown black box
      ▪ A Big Data approach can help with problems which may combine Volume, Variety, Velocity.
      ▪ A Big Data approach is in demand because it is helping to increase business value and reduce time to insight.
      ▪ Data Science is key to getting full value from a Big Data platform.
  • 44. Further Training
      ▪ Apache Hadoop 2.0 Developing Java Applications (4 days)
      ▪ Apache Hadoop 2.0 Development for Data Analysts (4 days)
      ▪ Apache Hadoop 2.0 Operations Management (3 days)
      ▪ MapR Hadoop Fundamentals of Administration (3 days)
      ▪ Apache Cassandra DevOps Fundamentals (3 days)
      ▪ Apache Hadoop Masterclass (1 day)
      ▪ Big Data Concepts Masterclass (1 day)
      ▪ Machine Learning at scale with Apache Mahout (1 day)
  • 45. Contact Details
      Tim Seears, CTO, Big Data Partnership
      info@bigdatapartnership.com
      @BigDataExperts

Editor’s notes

  1. Imagine you have a legacy database or DW.
      => Data volumes are growing rapidly
      => Running out of space
      => We scale up a little, cost goes up a little: nothing too serious
      => Scale up a little more, cost/GB starts to look a little concerning
      => Scale up again, suddenly we hit a discontinuity
      => Cost/GB is prohibitive
      => Or performance drops dramatically
      => Or maybe you want to go to the TB-PB range and it’s not even possible
      => Problem is, as data increases we have to SCALE UP because of the architecture choices
      => Cost per GB also increases, so it becomes exponentially more expensive to grow
      => Interestingly, this also broadly holds for the complexity of integrating more business data sources (Variety, a BI problem), not just adding more Volume (a DW problem)
      => Because again the architecture dictates an increasingly higher cost to add more Variety
     The net result of storing large schema-driven databases is that the individual machines they’re housed on must be:
      * high quality
      * redundant
      * highly available
     This translates into:
      * high costs for servers, infrastructure and support
  2. Clusters of hardware = distributed system
  3. Netflix challenge – winning team used a very rudimentary algorithm but won because it appended data about movies from outside the original data set (IMDb). Google – showed PageRank could outperform keyword extraction in other search engines, by leveraging data from outside the page itself (votes, by page creators linking to the page). Facebook – using detailed data about friendships (social network topology of real world) to beat other media companies.