Use-cases and opportunities in Big Data
Lessons learned with Hadoop
* Introduction to BigData & Hadoop Technology
* Market Insights and Typical use-cases
* NetApp technology for Hadoop
* Best practices for your first project with Hadoop
2. Octo and Big Data
Octo Technology has been investing in the big data market since 2010:
R&D
Training
Partnership development
We provide consulting services to our customers:
Use case and opportunity/feasibility studies
Solution choice for Big Data projects
Architecture design of Big Data solutions
Big Data/NoSQL solution deployment
Training
Octo Technology's Big Data unit today consists of a dedicated team of 12 people:
Technical experts
Data analysts
We have performed some 20 Big Data projects so far:
Mainly big data studies and PoCs
Deployment of NoSQL solutions
In very different sectors: insurance, banking, logistics, energy
Technical partnerships with the biggest players on the market (see next slide)
3. Octo expertise & partners on Big Data
Hadoop ecosystem
Complex Event Processing
High Performance Computing
NoSQL
Cloud
DevOps
OCTO has expertise in most of the solutions on the market.
Our multiple partnerships allow us to remain completely independent from solution vendors.
4. Big Data @ OCTO: some data
20: number of conferences on Big Data organized by Octo so far
250 TB: biggest volume of data analyzed by Octo
800 nodes: largest Hadoop cluster deployed by Octo
850 To: largest storage volume used by Octo during a Big Data project
16: number of partnerships of Octo with major players of the Big Data market
80: number of Octo consultants trained on at least one Big Data solution
6. Agenda
Introduction to BigData & Hadoop Technology
Market Insights and Typical use-cases
NetApp technology for Hadoop
Best practices for your first project with Hadoop
8. Big-data is like teenage sex:
everyone talks about it,
nobody knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it!
9. Origins of Big Data
Web giants (Google, Amazon, Facebook, Twitter, …) implemented Big Data solutions for their own needs.
Management consulting firms (McKinsey, BCG, Gartner, …) predicted a big economic change, with Big Data as part of it.
IT vendors (NetApp, IBM, VMware, …) now follow this movement, trying to take hold of this very promising business.
15. Is there a clear definition?
A super datawarehouse? Big databases? NoSQL? Low-cost storage? Unstructured data? Cloud? Real-time analysis? Internet intelligence? Open Data?
There is no clear definition of Big Data.
It is at once a business ambition and a set of technological opportunities.
16. Big Data: a proposed definition
Big Data aims at gaining an economic advantage from the quantitative analysis of internal and external data.
18. Exponential growth of capacities
CPU, memory, network bandwidth, storage… all of them have followed Moore's law.
Source: http://strata.oreilly.com/2011/08/building-data-startups.html
23. Hadoop: a reference in the Big Data landscape
Open Source
• Apache Hadoop
Main distributions
• Cloudera CDH
• Hortonworks HDP
• MapR
Commercial
• Greenplum (EMC)
• IBM InfoSphere BigInsights (CDH)
• Oracle Big Data Appliance (CDH)
• NetApp Analytics (CDH)
• …
Cloud
• Amazon EMR (MapR)
• Rackspace (HDP)
• VirtualScale (CDH)
• …
24. Hadoop Distributed File System (HDFS)
Key principles
Storage of files larger than any single disk
Data distributed across several nodes
Data replication to ensure fail-over, with "rack awareness"
Use of commodity disks instead of a SAN
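The principles above can be sketched in a few lines of Python. This is an illustrative simulation of the idea (128 MiB blocks, 3 replicas, the classic one-rack/two-rack placement policy), not HDFS's actual code; the rack and node names are invented.

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2   # 128 MiB, a common HDFS default
REPLICATION = 3                # default HDFS replication factor

def split_into_blocks(file_size_bytes):
    """A file bigger than one disk is just a sequence of fixed-size blocks."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

def place_replicas(block_id, racks):
    """Toy version of HDFS's classic rack-aware policy:
    one replica on a first rack, the other two on a second rack,
    so losing a whole rack never loses the block."""
    rack_names = sorted(racks)
    first = rack_names[block_id % len(rack_names)]
    second = rack_names[(block_id + 1) % len(rack_names)]
    placement = [(first, racks[first][block_id % len(racks[first])])]
    nodes = racks[second]
    placement.append((second, nodes[block_id % len(nodes)]))
    placement.append((second, nodes[(block_id + 1) % len(nodes)]))
    return placement

if __name__ == "__main__":
    racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
    print(split_into_blocks(1 * 1024 ** 4))   # a 1 TiB file: 8192 blocks
    print(place_replicas(0, racks))
```

Because two of the three replicas sit on a second rack, an entire rack can fail and every block still has a surviving copy.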
25. Hadoop distributed processing: MapReduce
Key principles
Parallelize and distribute the processing
Each node processes a small share of the data, so each unit of work is quick
Co-location of processing and data
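The classic illustration of these principles is word count. The sketch below simulates, in a single Python process, what Hadoop runs across many nodes: a map function applied independently to each record, a shuffle that groups intermediate results by key, and a reduce that aggregates each group. It is a toy model of the paradigm, not Hadoop's API.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: runs independently on each split of the data,
    close to where that split is stored (co-location)."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: aggregates all values emitted for one key."""
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle: group the mappers' output by key before reducing
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(l) for l in lines):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

if __name__ == "__main__":
    logs = ["hadoop stores data", "hadoop processes data"]
    print(map_reduce(logs))
    # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```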
30. Limits of traditional BI architectures
ETL tools become bottlenecks:
• they do not scale well
• too much time is spent moving the data (operational stores, ODS, DWH, datamarts: moving the data again and again!)
Traditional DWHs are not adapted to new sources of data:
• changing schemas
• semi-structured or unstructured data
31. Hadoop can help improve the BI architecture
Data from the operational stores can be stored quickly in Hadoop (HDFS), and transformed "in-place" by MapReduce, using processing languages like Pig, or streaming. This approach is called E-L-T: Extract, Load, then Transform.
BI reporting tools (SAS, Tableau Software, QlikTech…) can also query the data stored in Hadoop using Hive or other libraries, more or less interactively.
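With Hadoop Streaming, any executable that reads records on stdin and writes results to stdout can serve as the "transform" step of this E-L-T flow. Below is a minimal sketch of such a mapper in Python; the pipe-separated "timestamp|user|url" input format is an invented example, not a real log layout.

```python
import sys

def transform(line):
    """Turn one raw, semi-structured log line into tab-separated fields.
    The 'timestamp|user|url' format here is a made-up example."""
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None          # drop malformed records instead of failing the job
    timestamp, user, url = (p.strip() for p in parts)
    return "\t".join([timestamp, user, url])

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds the input split on stdin, one record per line,
    # and collects whatever the script writes to stdout.
    for line in stdin:
        out = transform(line)
        if out is not None:
            stdout.write(out + "\n")

if __name__ == "__main__":
    main()
```

Roughly, such a script is submitted through the hadoop-streaming jar as the -mapper; malformed records are silently dropped rather than failing the whole job.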
32. Summary of Hadoop
What Hadoop is:
A distributed storage system
Combined with a distributed batch-processing framework
A platform with linear scalability, designed for commodity hardware
Complementary to traditional BI systems, with a lower price/performance ratio
What Hadoop is not, as of today:
Not a database with random access to data
Not mature for real-time or interactive querying
Not sufficient on its own: you need to add visualization tools, processing libraries, and other elements related to your project
35. Types of projects launched in 2012-2013
Data Science = data mining and learning on business signals
Innovation projects, launched directly by a business department, with or without the IT department
Exploration of new data sources (clickstream, logs, social…)
Iterative projects: average budget around 100-200 k€, ~50-100 k€ per step
IT Optimization = data-warehouse offloading, streamlining of BI appliances (Teradata, Oracle, …) with Hadoop
IT projects, with objectives of cost-cutting and technical improvement
Building hybrid architectures, with Hadoop as raw storage and ETL, to offload massive data warehouses (over 40 TB)
Project budgets around 1 M€ CAPEX and 300 k€ OPEX, with a clear ROI
36. Main use cases by sector
Projects launched in 2012-2013
Retail Banking
  Data Science: behavioral marketing, savings market trends
Corporate & Investment Banking
  Data Science: trade analytics, risk computation
  IT Optimization: market data repository
Insurance
  Data Science: behavioral marketing, health and savings market trends
E-Commerce & media
  Data Science: proactive customer care, behavioral churn
Telecoms
  Data Science: Marketing Data Labs, QoS Data Labs
  IT Optimization: fail prediction, capacity prediction, mobile data log repository
Utilities
  Data Science: behavioral marketing
  IT Optimization: smart metering repository
37. Perspectives for 2014
Q3-2013 seems to have been a turning point in the Big Data Analytics market in Europe
Executive committees are supporting Data Science projects as strategic projects
Big Data Analytics projects are included in the 2014 budget plans, with:
Budgets over 500 k€
Open positions for data scientists
Sectors where this topic seems to be of highest interest:
Retail banks
Telecom
E-commerce
Insurance
Energy (distribution)
40. Behavioral analysis of churners
on channels: Web, mobile, call-center
Objective: anticipate churn.
The Marketing department wanted to analyze new data sources (mobile internet logs), previously ignored because of their size (250 TB for 6 months of data).
"Data Lab" project:
IT and Marketing joined in the same team
Elaboration of a platform to store, process, analyze and discover the behavior of churners, using machine-learning algorithms
From the data, identification of patterns, then marketing rules to make proposals
Duration: 7 months
41. Architecture
Mobile internet logs: 250 TB of data to analyze for churn patterns
Cluster of 8 datanodes + 2 master/support nodes
Total of 96 x 3 TB disks and 128 CPUs
Cloudera CDH 4
Tools: Hive, Pig, Mahout, R…
Behavior analysis: identification of patterns and marketing rules, feeding real-time proposals on a web portal
It is planned to scale the cluster up to 40 nodes
42. ANALYSIS OF SOCIAL DATA TO IDENTIFY CORRELATION WITH
HEALTH-INSURANCE CLAIMS
(GENERALI - INSURANCE)
43. Analysis of social data to identify correlations with health-insurance claims
Objective: anticipate health claims, to improve internal prediction models.
Introduction of statistical variables computed from the analysis of social data (medical forums).
Realization:
Collection of text from forums and other social data
Natural language processing (text cleaning and analysis)
Semantic learning (medical concepts), to identify trends
Identification of correlations in datasets with more than 10 million variables
Datavisualization (e.g. keyword correlations) to evaluate results with business experts
Technology:
Hadoop on Amazon EC2
Machine learning: Python, CloudSearch, NLTK, sklearn
Duration: 6 months
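At a much smaller scale than the project above, the correlation step can be illustrated with a plain Pearson coefficient. This is a hand-rolled sketch (the real project ran Hadoop-scale tooling over 10+ million variables), and the monthly series below are invented numbers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

if __name__ == "__main__":
    # Invented monthly series: mentions of a symptom keyword on medical
    # forums, and health-insurance claims for the matching condition.
    keyword_mentions = [120, 90, 340, 500, 410, 150]
    claims = [30, 25, 80, 120, 100, 40]
    print(round(pearson(keyword_mentions, claims), 3))
```

A keyword whose mention series correlates strongly with the claims series becomes a candidate input variable for the internal prediction models.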
45. Customer interaction timeline in a cross-channel context (web + call-center)
Objective: improve the knowledge of customer behavior, and improve the quality of customer care.
Realization:
Collection of data from Web, CRM and call-center
Analysis using a timeline approach
Determination of typical behaviors
Creation of real-time rules and alerting for web and call-centers
[Screenshot: one customer's timeline, from a Web FAQ visit (27/07/2013) to a web portal session, an outgoing call and an incoming call about the same subject ("problem with attached files"), ending with a case created on 01/08/2013]
46. (bank, confidential)
Axes of analysis
Axis 1: collect existing data, to search for correlations with customer behavior
Axis 2: use data from credit-card expenses
Axis 3: search for social data (Twitter, Facebook) related to customers
Axis 4: fraud analysis
Analyses
Timelines: a database allowing to visualize and navigate a customer's events, in the form of a timeline
Typical customer behaviors (machine learning): purchase, churn, claims, default, fraud
Digital banking trends: centers of interest in communities, evaluation of competitors
Business usage
Personalized direct marketing
Customer care, call-center rules
Real-time alerts in e-banking
Remarketing, digital marketing
Community management
47. (bank, confidential)
Architecture
Data collection: Python scripts
Hadoop / Spark:
storage
data preparation
feature extraction
feature engineering
feature qualification
large-scale machine learning (Mahout or mixture of experts)
NLP (NLTK)
MapReduce scripting (Python)
SQL (Hive)
ELT (Pig)
Statistics and DataViz (R / Python sklearn / SAS):
data mining
sample qualification
statistical dataviz
machine learning
Interactive analysis (ElasticSearch): drill-down, search
Reporting: custom Python, D3.js, Highcharts.js, Tableau Software
Users: Marketing & analysts, IT
52. Wireless provider leverages Hadoop (NetApp Confidential)
Telco industry: provides wireless voice and data services globally
Business challenge
Consolidate large amounts of raw customer log data from multiple data centers into one data center
Run analytical queries on the consolidated data, which currently cannot be done with existing tools
Solution
NetApp Open Solution for Hadoop: an eight-node cluster for ingesting, storing and compressing the data; Solr and Lucene for indexing; HBase for querying the indexed data
Benefits
PoC: 660 GB of data consolidated and indexed, 1.125 billion records processed in six hours
Hadoop storage failover without service interruption
New data processing and analytics capabilities
57. Check that Hadoop is a good choice
Hadoop is not a replacement for database technology
Hadoop is easy to SCALE, but it is a complex technology
Hadoop is batch-oriented: real-time processing and interactive querying tools are emerging, but they are still young
If you have less than a few TB of data, you don't need Hadoop
59. Project framing
Identify a data source you want to explore, with a potential business value
Short-list and choose one business question to evaluate, related to this data
Define your analytics needs at a macro level:
"classical analytics" (aggregates and reports)
exploratory, with datavisualization
statistical, data mining, machine learning
This will help you choose the tools in the ecosystem
Determine the technological constraints:
volume
latency (batch, or not)
data quality
integration with the rest of IT
Size your cluster
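"Size your cluster" can start as a back-of-envelope calculation like the one below. Every default here (3x replication, 25% temporary-space overhead for intermediate MapReduce output, 12 x 3 TB disks per node, 75% usable headroom) is an illustrative assumption to adapt to your own hardware and workload, not a recommendation.

```python
import math

def nodes_needed(raw_tb, replication=3, temp_overhead=0.25,
                 disks_per_node=12, disk_tb=3, usable_ratio=0.75):
    """Back-of-envelope Hadoop datanode count.
    All defaults are illustrative assumptions:
    - replication: HDFS copies each block 3 times by default
    - temp_overhead: extra space for MapReduce intermediate output
    - usable_ratio: headroom so the disks never run full"""
    needed_tb = raw_tb * replication * (1 + temp_overhead)
    per_node_tb = disks_per_node * disk_tb * usable_ratio
    return math.ceil(needed_tb / per_node_tb)

if __name__ == "__main__":
    # The 250 TB volume mentioned earlier in this deck
    print(nodes_needed(250))   # 35 datanodes under these assumptions
```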
60. Cluster setup: this step requires your attention!
Hadoop uses commodity hardware, but probably not the machines you are used to in your datacenter: 2U, internal storage, high memory…
Consider using the solution of a provider like NetApp
Consider using Hadoop in the cloud: blog.octo.com/pt-br/hadoop-na-nuvem/
Benchmark your brand-new cluster before actually starting the project: lots of configuration parameters are involved…
Set up all the tools around Hadoop
61. Project team setup
This is an innovative technology, and a data-science project is an innovative project: you need adapted project management.
Co-locate the people: business data analysts, architects, developers, infrastructure/ops
Use Agile practices:
Work iteratively, with short cycles or sprints (1 or 2 weeks)
Choose small and achievable objectives for each sprint
Use Agile rituals (stand-ups, retrospectives…)
Train your team: a Hadoop project requires skilled people in Hadoop infrastructure, Hadoop development, and data science
Hire experts, and organize the knowledge transfer from them to your team
62. Data collection / data quality
As in a classical "data project" (like in BI), an important part of the effort will be related to data quality:
preparation of the data
clean-up
data transformation
Don't underestimate this.
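A first clean-up pass often looks like the sketch below: trim whitespace, normalize case, parse dates and amounts, and reject what cannot be repaired while counting the rejects. The semicolon-separated "date;customer_id;amount" format, with French decimal commas, is an invented example.

```python
from datetime import datetime

def clean_records(lines):
    """Typical early-project cleanup: trim whitespace, normalize case,
    parse dates and amounts, and discard records that cannot be repaired.
    The 'date;customer_id;amount' format is a made-up example."""
    kept, rejected = [], 0
    for line in lines:
        parts = [p.strip() for p in line.split(";")]
        if len(parts) != 3:
            rejected += 1
            continue
        date_s, customer, amount_s = parts
        try:
            date = datetime.strptime(date_s, "%Y-%m-%d").date()
            amount = float(amount_s.replace(",", "."))  # French decimal comma
        except ValueError:
            rejected += 1
            continue
        kept.append((date.isoformat(), customer.upper(), amount))
    return kept, rejected

if __name__ == "__main__":
    raw = [
        "2013-05-01; c042 ;19,90",
        "2013-05-02;c043;12.50",
        "not a record",
        "2013-99-99;c044;5,00",   # impossible date
    ]
    rows, dropped = clean_records(raw)
    print(rows)      # 2 clean records
    print(dropped)   # 2 rejected
```

Keeping a count of rejected records matters: a sudden jump in the reject rate is usually the first sign that an upstream source has changed its format.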
63. Analytics iterations: a typical iteration of data analytics
select a subset of data
select a machine-learning algorithm to use on it
prepare the data (explore, filter, enrich…)
divide it into a training dataset and a test dataset
execute the algorithm
measure the prediction error
visualize the results
draw a conclusion from the test with this algorithm, and adapt for the next iteration: other data? another algorithm? solve a technical issue?
And start again!
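One such iteration can be sketched end-to-end in plain Python. Everything here is deliberately minimal and invented for illustration: a single numeric feature (say, support calls per month), a churn label, and a midpoint-threshold "algorithm" standing in for a real learner such as those in Mahout or sklearn.

```python
import random

def train_test_split(dataset, test_ratio=0.3, seed=42):
    """Hold out part of the data so the model is scored on examples
    it has never seen."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """A deliberately simple 'algorithm': predict churn when the feature
    exceeds the midpoint between the two class means."""
    churn = [x for x, y in train if y == 1]
    stay = [x for x, y in train if y == 0]
    return (sum(churn) / len(churn) + sum(stay) / len(stay)) / 2

def error_rate(threshold, test):
    """Fraction of held-out examples the threshold misclassifies."""
    wrong = sum(1 for x, y in test if (x > threshold) != (y == 1))
    return wrong / len(test)

if __name__ == "__main__":
    # Invented data: (support calls per month, 1 = churned)
    data = [(1, 0), (2, 0), (1, 0), (3, 0), (8, 1),
            (9, 1), (7, 1), (10, 1), (2, 0), (9, 1)]
    train, test = train_test_split(data)
    threshold = fit_threshold(train)
    print(error_rate(threshold, test))
```

If the measured error is too high, the next iteration swaps in another data subset or another algorithm, exactly as the list above describes.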