Use-cases and opportunities in Big Data
Lessons learned with Hadoop
* Introduction to BigData & Hadoop Technology
* Market Insights and Typical use-cases
* NetApp technology for Hadoop
* Best practices for your first project with Hadoop
2. Octo and Big Data
Octo Technology has been investing in the big data market since 2010:
R&D
Training
Partnership development
We provide consulting services to our customers:
Use case and opportunity/feasibility studies
Solution choice for Big Data projects
Architecture design of Big Data solutions
Big Data/NoSQL solution deployment
Training
Octo Technology's Big Data unit today consists of a dedicated team of 12 people:
Technical experts
Data analysts
We have performed some 20 Big Data projects so far:
Mainly big data studies and PoCs
Deployment of NoSQL solutions
In very different sectors: insurance, banking, logistics, energy
Technical partnerships with the biggest players on the market (see next slide)
3. Octo expertise & partners on Big Data
Hadoop ecosystem
Complex Event Processing
High Performance Computing
NoSQL
Cloud
DevOps
OCTO has expertise in most of the solutions on the market.
Our multiple partnerships allow us to remain completely independent from solution vendors.
4. Big Data @ OCTO: some data
20: number of conferences on Big Data organized by Octo so far
250 TB: biggest volume of data analyzed by Octo
800 nodes: largest Hadoop cluster deployed by Octo
850 To: largest storage volume used by Octo during a Big Data project
16: number of partnerships of Octo with major players of the Big Data market
80: number of Octo consultants trained on at least one Big Data solution
6. Agenda
Introduction to BigData & Hadoop Technology
Market Insights and Typical use-cases
NetApp technology for Hadoop
Best practices for your first project with Hadoop
8. Big-data is like teenage sex:
everyone talks about it,
nobody knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it!
9. Origins of Big Data
Web giants (Google, Amazon, Facebook, Twitter, …) implemented Big Data solutions for their own needs.
Management consulting firms (McKinsey, BCG, Gartner, …) predicted a big economic change, with Big Data as part of it.
IT vendors (NetApp, IBM, VMware, …) now follow this movement, trying to take hold of this very promising business.
15. Is there a clear definition?
A super datawarehouse? Big databases? NoSQL? Low-cost storage? Unstructured data? Cloud? Real-time analysis? Internet intelligence? Open Data?
There is no clear definition of Big Data.
It is at once a business ambition and a set of technological opportunities.
16. Big Data: a proposed definition
Big Data aims at gaining an economic advantage from the quantitative analysis of internal and external data.
18. Exponential growth of capacities
CPU, memory, network bandwidth, storage… all of them have followed Moore's law.
Source: http://strata.oreilly.com/2011/08/building-data-startups.html
23. Hadoop: a reference in the Big Data landscape
Open Source
• Apache Hadoop
Main distributions
• Cloudera CDH
• Hortonworks HDP
• MapR
Commercial
• Greenplum (EMC)
• IBM InfoSphere BigInsights (CDH)
• Oracle Big Data Appliance (CDH)
• NetApp Analytics (CDH)
• …
Cloud
• Amazon EMR (MapR)
• Rackspace (HDP)
• VirtualScale (CDH)
• …
24. Hadoop Distributed File System (HDFS)
Key principles
Storage of files larger than any single disk
Data distributed across several nodes
Data replication to ensure fail-over, with "rack awareness"
Use of commodity disks instead of a SAN
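The principles above can be sketched in a few lines of Python. This is an illustrative simulation of the idea (128 MiB blocks, 3 replicas, the classic one-rack/two-rack placement policy), not HDFS's actual code; the rack and node names are invented.

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2   # 128 MiB, a common HDFS default
REPLICATION = 3                # default HDFS replication factor

def split_into_blocks(file_size_bytes):
    """A file bigger than one disk is just a sequence of fixed-size blocks."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def place_replicas(block_id, racks):
    """Toy version of HDFS's classic rack-aware policy:
    one replica on a first rack, the other two on a second rack,
    so losing a whole rack never loses the block."""
    rack_names = sorted(racks)
    first = rack_names[block_id % len(rack_names)]
    second = rack_names[(block_id + 1) % len(rack_names)]
    placement = [(first, racks[first][block_id % len(racks[first])])]
    nodes = racks[second]
    placement.append((second, nodes[block_id % len(nodes)]))
    placement.append((second, nodes[(block_id + 1) % len(nodes)]))
    return placement

if __name__ == "__main__":
    racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
    print(split_into_blocks(1 * 1024 ** 4))   # a 1 TiB file: 8192 blocks
    print(place_replicas(0, racks))
```

Because two of the three replicas sit on a second rack, an entire rack can fail and every block still has a surviving copy.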
25. Hadoop distributed processing: MapReduce
Key principles
Parallelize and distribute the processing
Each node processes a small share of the data, so each unit of work is quick
Co-location of processing and data
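The classic illustration of these principles is word count. The sketch below simulates, in a single Python process, what Hadoop runs across many nodes: a map function applied independently to each record, a shuffle that groups intermediate results by key, and a reduce that aggregates each group. It is a toy model of the paradigm, not Hadoop's API.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: runs independently on each split of the data,
    close to where that split is stored (co-location)."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: aggregates all values emitted for one key."""
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle: group the mappers' output by key before reducing
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(l) for l in lines):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

if __name__ == "__main__":
    logs = ["hadoop stores data", "hadoop processes data"]
    print(map_reduce(logs))
    # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```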
30. Limits of traditional BI architectures
ETL tools become bottlenecks:
• they do not scale well
• too much time is spent moving the data (operational stores, ODS, DWH, datamarts: moving the data again and again!)
Traditional DWHs are not adapted to new sources of data:
• changing schemas
• semi-structured or unstructured data
31. Hadoop can help improve the BI architecture
Data from the operational stores can be stored quickly in Hadoop (HDFS), and transformed "in-place" by MapReduce, using processing languages like Pig, or streaming. This approach is called E-L-T: Extract, Load, then Transform.
BI reporting tools (SAS, Tableau Software, QlikTech…) can also query the data stored in Hadoop using Hive or other libraries, more or less interactively.
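With Hadoop Streaming, any executable that reads records on stdin and writes results to stdout can serve as the "transform" step of this E-L-T flow. Below is a minimal sketch of such a mapper in Python; the pipe-separated "timestamp|user|url" input format is an invented example, not a real log layout.

```python
import sys

def transform(line):
    """Turn one raw, semi-structured log line into tab-separated fields.
    The 'timestamp|user|url' format here is a made-up example."""
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None          # drop malformed records instead of failing the job
    timestamp, user, url = (p.strip() for p in parts)
    return "\t".join([timestamp, user, url])

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds the input split on stdin, one record per line,
    # and collects whatever the script writes to stdout.
    for line in stdin:
        out = transform(line)
        if out is not None:
            stdout.write(out + "\n")

if __name__ == "__main__":
    main()
```

Roughly, such a script is submitted through the hadoop-streaming jar as the -mapper; malformed records are silently dropped rather than failing the whole job.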
32. Summary of Hadoop
What Hadoop is:
A distributed storage system
Combined with a distributed batch-processing framework
A platform with linear scalability, designed for commodity hardware
Complementary to traditional BI systems, with a lower price/performance ratio
What Hadoop is not, as of today:
Not a database with random access to data
Not mature for real-time or interactive querying
Not sufficient on its own: you need to add visualization tools, processing libraries, and other elements related to your project
35. Types of projects launched in 2012-2013
Data Science = data mining and learning on business signals
Innovation projects, launched directly by a business department, with or without the IT department
Exploration of new data sources (clickstream, logs, social…)
Iterative projects: average budget around 100-200 k€, ~50-100 k€ per step
IT Optimization = data-warehouse offloading, streamlining of BI appliances (Teradata, Oracle, …) with Hadoop
IT projects, with objectives of cost-cutting and technical improvement
Building hybrid architectures, with Hadoop as raw storage and ETL, to offload massive data warehouses (over 40 TB)
Project budgets around 1 M€ CAPEX and 300 k€ OPEX, with a clear ROI
36. Main use cases by sector
Projects launched in 2012-2013
Retail Banking
  Data Science: behavioral marketing, savings market trends
Corporate & Investment Banking
  Data Science: trade analytics, risk computation
  IT Optimization: market data repository
Insurance
  Data Science: behavioral marketing, health and savings market trends
E-Commerce & media
  Data Science: proactive customer care, behavioral churn
Telecoms
  Data Science: Marketing Data Labs, QoS Data Labs
  IT Optimization: fail prediction, capacity prediction, mobile data log repository
Utilities
  Data Science: behavioral marketing
  IT Optimization: smart metering repository
37. Perspectives for 2014
Q3-2013 seems to have been a turning point in the Big Data Analytics market in Europe
Executive committees are supporting Data Science projects as strategic projects
Big Data Analytics projects are included in the 2014 budget plans, with:
Budgets over 500 k€
Open positions for data scientists
Sectors where this topic seems to be of highest interest:
Retail banks
Telecom
E-commerce
Insurance
Energy (distribution)
40. Behavioral analysis of churners
on channels: Web, mobile, call-center
Objective: anticipate churn.
The Marketing department wanted to analyze new data sources (mobile internet logs), previously ignored because of their size (250 TB for 6 months of data).
"Data Lab" project:
IT and Marketing joined in the same team
Elaboration of a platform to store, process, analyze and discover the behavior of churners, using machine-learning algorithms
From the data, identification of patterns, then marketing rules to make proposals
Duration: 7 months
41. Architecture
Mobile internet logs: 250 TB of data to analyze for churn patterns
Cluster of 8 datanodes + 2 master/support nodes
Total of 96 x 3 TB disks and 128 CPUs
Cloudera CDH 4
Tools: Hive, Pig, Mahout, R…
Behavior analysis: identification of patterns and marketing rules, feeding real-time proposals on a web portal
It is planned to scale the cluster up to 40 nodes
42. ANALYSIS OF SOCIAL DATA TO IDENTIFY CORRELATION WITH
HEALTH-INSURANCE CLAIMS
(GENERALI - INSURANCE)
43. Analysis of social data to identify correlations with health-insurance claims
Objective: anticipate health claims, to improve internal prediction models.
Introduction of statistical variables computed from the analysis of social data (medical forums).
Realization:
Collection of text from forums and other social data
Natural language processing (text cleaning and analysis)
Semantic learning (medical concepts), to identify trends
Identification of correlations in datasets with more than 10 million variables
Datavisualization (e.g. keyword correlations) to evaluate results with business experts
Technology:
Hadoop on Amazon EC2
Machine learning: Python, CloudSearch, NLTK, sklearn
Duration: 6 months
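At a much smaller scale than the project above, the correlation step can be illustrated with a plain Pearson coefficient. This is a hand-rolled sketch (the real project ran Hadoop-scale tooling over 10+ million variables), and the monthly series below are invented numbers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

if __name__ == "__main__":
    # Invented monthly series: mentions of a symptom keyword on medical
    # forums, and health-insurance claims for the matching condition.
    keyword_mentions = [120, 90, 340, 500, 410, 150]
    claims = [30, 25, 80, 120, 100, 40]
    print(round(pearson(keyword_mentions, claims), 3))
```

A keyword whose mention series correlates strongly with the claims series becomes a candidate input variable for the internal prediction models.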
45. Customer interaction timeline in a cross-channel context (web + call-center)
Objective: improve the knowledge of customer behavior, and improve the quality of customer care.
Realization:
Collection of data from Web, CRM and call-center
Analysis using a timeline approach
Determination of typical behaviors
Creation of real-time rules and alerting for web and call-centers
[Screenshot: one customer's timeline, from a Web FAQ visit (27/07/2013) to a web portal session, an outgoing call and an incoming call about the same subject ("problem with attached files"), ending with a case created on 01/08/2013]
46. (bank, confidential)
Axes of analysis
Axis 1: collect existing data, to search for correlations with customer behavior
Axis 2: use data from credit-card expenses
Axis 3: search for social data (Twitter, Facebook) related to customers
Axis 4: fraud analysis
Analyses
Timelines: a database allowing to visualize and navigate a customer's events, in the form of a timeline
Typical customer behaviors (machine learning): purchase, churn, claims, default, fraud
Digital banking trends: centers of interest in communities, evaluation of competitors
Business usage
Personalized direct marketing
Customer care, call-center rules
Real-time alerts in e-banking
Remarketing, digital marketing
Community management
47. (bank, confidential)
Architecture
Data collection: Python scripts
Hadoop / Spark:
storage
data preparation
feature extraction
feature engineering
feature qualification
large-scale machine learning (Mahout or mixture of experts)
NLP (NLTK)
MapReduce scripting (Python)
SQL (Hive)
ELT (Pig)
Statistics and DataViz (R / Python sklearn / SAS):
data mining
sample qualification
statistical dataviz
machine learning
Interactive analysis (ElasticSearch): drill-down, search
Reporting: custom Python, D3.js, Highcharts.js, Tableau Software
Users: Marketing & analysts, IT
52. Wireless provider leverages Hadoop (NetApp Confidential)
Telco industry: provides wireless voice and data services globally
Business challenge
Consolidate large amounts of raw customer log data from multiple data centers into one data center
Run analytical queries on the consolidated data, which currently cannot be done with existing tools
Solution
NetApp Open Solution for Hadoop: an eight-node cluster for ingesting, storing and compressing the data; Solr and Lucene for indexing; HBase for querying the indexed data
Benefits
PoC: 660 GB of data consolidated and indexed, 1.125 billion records processed in six hours
Hadoop storage failover without service interruption
New data processing and analytics capabilities
57. Check that Hadoop is a good choice
Hadoop is not a replacement for database technology
Hadoop is easy to SCALE, but it is a complex technology
Hadoop is batch-oriented: real-time processing and interactive querying tools are emerging, but they are still young
If you have less than a few TB of data, you don't need Hadoop
59. Project framing
Identify a data source you want to explore, with a potential business value
Short-list and choose one business question to evaluate, related to this data
Define your analytics needs at a macro level:
"classical analytics" (aggregates and reports)
exploratory, with datavisualization
statistical, data mining, machine learning
This will help you choose the tools in the ecosystem
Determine the technological constraints:
volume
latency (batch, or not)
data quality
integration with the rest of IT
Size your cluster
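"Size your cluster" can start as a back-of-envelope calculation like the one below. Every default here (3x replication, 25% temporary-space overhead for intermediate MapReduce output, 12 x 3 TB disks per node, 75% usable headroom) is an illustrative assumption to adapt to your own hardware and workload, not a recommendation.

```python
import math

def nodes_needed(raw_tb, replication=3, temp_overhead=0.25,
                 disks_per_node=12, disk_tb=3, usable_ratio=0.75):
    """Back-of-envelope Hadoop datanode count.
    All defaults are illustrative assumptions:
    - replication: HDFS copies each block 3 times by default
    - temp_overhead: extra space for MapReduce intermediate output
    - usable_ratio: headroom so the disks never run full"""
    needed_tb = raw_tb * replication * (1 + temp_overhead)
    per_node_tb = disks_per_node * disk_tb * usable_ratio
    return math.ceil(needed_tb / per_node_tb)

if __name__ == "__main__":
    # The 250 TB volume mentioned earlier in this deck
    print(nodes_needed(250))   # 35 datanodes under these assumptions
```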
60. Cluster setup: this step requires your attention!
Hadoop uses commodity hardware, but probably not the machines you are used to in your datacenter: 2U, internal storage, high memory…
Consider using the solution of a provider like NetApp
Consider using Hadoop in the cloud: blog.octo.com/pt-br/hadoop-na-nuvem/
Benchmark your brand-new cluster before actually starting the project: lots of configuration parameters are involved…
Set up all the tools around Hadoop
61. Project team setup
This is an innovative technology, and a data-science project is an innovative project: you need adapted project management.
Co-locate the people: business data analysts, architects, developers, infrastructure/ops
Use Agile practices:
Work iteratively, with short cycles or sprints (1 or 2 weeks)
Choose small and achievable objectives for each sprint
Use Agile rituals (stand-ups, retrospectives…)
Train your team: a Hadoop project requires skilled people in Hadoop infrastructure, Hadoop development, and data science
Hire experts, and organize the knowledge transfer from them to your team
62. Data collection / data quality
As in a classical "data project" (like in BI), an important part of the effort will be related to data quality:
preparation of the data
clean-up
data transformation
Don't underestimate this.
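A first clean-up pass often looks like the sketch below: trim whitespace, normalize case, parse dates and amounts, and reject what cannot be repaired while counting the rejects. The semicolon-separated "date;customer_id;amount" format, with French decimal commas, is an invented example.

```python
from datetime import datetime

def clean_records(lines):
    """Typical early-project cleanup: trim whitespace, normalize case,
    parse dates and amounts, and discard records that cannot be repaired.
    The 'date;customer_id;amount' format is a made-up example."""
    kept, rejected = [], 0
    for line in lines:
        parts = [p.strip() for p in line.split(";")]
        if len(parts) != 3:
            rejected += 1
            continue
        date_s, customer, amount_s = parts
        try:
            date = datetime.strptime(date_s, "%Y-%m-%d").date()
            amount = float(amount_s.replace(",", "."))  # French decimal comma
        except ValueError:
            rejected += 1
            continue
        kept.append((date.isoformat(), customer.upper(), amount))
    return kept, rejected

if __name__ == "__main__":
    raw = [
        "2013-05-01; c042 ;19,90",
        "2013-05-02;c043;12.50",
        "not a record",
        "2013-99-99;c044;5,00",   # impossible date
    ]
    rows, dropped = clean_records(raw)
    print(rows)      # 2 clean records
    print(dropped)   # 2 rejected
```

Keeping a count of rejected records matters: a sudden jump in the reject rate is usually the first sign that an upstream source has changed its format.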
63. Analytics iterations: a typical iteration of data analytics
select a subset of data
select a machine-learning algorithm to use on it
prepare the data (explore, filter, enrich…)
divide it into a training dataset and a test dataset
execute the algorithm
measure the prediction error
visualize the results
draw a conclusion from the test with this algorithm, and adapt for the next iteration: other data? another algorithm? solve a technical issue?
And start again!
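One such iteration can be sketched end-to-end in plain Python. Everything here is deliberately minimal and invented for illustration: a single numeric feature (say, support calls per month), a churn label, and a midpoint-threshold "algorithm" standing in for a real learner such as those in Mahout or sklearn.

```python
import random

def train_test_split(dataset, test_ratio=0.3, seed=42):
    """Hold out part of the data so the model is scored on examples
    it has never seen."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """A deliberately simple 'algorithm': predict churn when the feature
    exceeds the midpoint between the two class means."""
    churn = [x for x, y in train if y == 1]
    stay = [x for x, y in train if y == 0]
    return (sum(churn) / len(churn) + sum(stay) / len(stay)) / 2

def error_rate(threshold, test):
    """Fraction of held-out examples the threshold misclassifies."""
    wrong = sum(1 for x, y in test if (x > threshold) != (y == 1))
    return wrong / len(test)

if __name__ == "__main__":
    # Invented data: (support calls per month, 1 = churned)
    data = [(1, 0), (2, 0), (1, 0), (3, 0), (8, 1),
            (9, 1), (7, 1), (10, 1), (2, 0), (9, 1)]
    train, test = train_test_split(data)
    threshold = fit_threshold(train)
    print(error_rate(threshold, test))
```

If the measured error is too high, the next iteration swaps in another data subset or another algorithm, exactly as the list above describes.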