Use-cases and opportunities in BigData
Lessons learned with Hadoop

28 Nov. 2013

© OCTO 2013

Rua Funchal, 411 5e andar Vila Olimpia
Sao Paulo - BRASIL

Tel: +55.11.3468.01.03
www.octo.com

1
Octo and Big Data
Octo Technology has been investing in the Big Data market since 2010:
R&D
Training
Partnerships development

We provide consulting services to our customers:
Use case and opportunity/feasibility studies
Solution choice for Big Data projects
Architecture design of Big Data solutions
Big Data/NoSQL solutions deployment
Training

Octo Technology Big Data unit is composed today of a team of 12 dedicated people:
Technical experts

+ Data analysts

We have performed so far some 20 Big Data projects:
Mainly big data studies and PoC
Deployment of NoSQL solutions
In very different sectors: Insurance, Bank, Logistics, Energy

Technical partnerships with the biggest players of the Market (see next slide)
2
Octo expertise & partners on Big Data
Hadoop ecosystem
Complex Event Processing
High Performance Computing
NoSQL
Cloud
DevOps

OCTO has expertise in most of the solutions on the market.
Our multiple partnerships allow us to remain completely independent of solution vendors
3
Big Data @ OCTO: some data
Number of
conferences on Big
Data organized by
Octo so far

20
850
16

250TB:

800

biggest volume of
data analyzed by
Octo

Nodes: Largest
Hadoop cluster
deployed by Octo

TB: largest
storage volume
used by Octo
during a Big Data
project

Number of partnerships of Octo with
major players of the Big Data market

80

Number of Octo
consultants who
have been trained on at
least one Big Data
solution

4
Speakers

Clement ROUQUIE
Director BRAZIL
OCTO
crouquie@octo.com

Diego Flaborea
System Engineer
NetApp
diego.flaborea@netapp.com

Mathieu DESPRIEE
Senior Architect
OCTO
mde@octo.com
Wagner Roberto DOS SANTOS
Architect
OCTO
wds@octo.com

5
Agenda

Introduction to BigData & Hadoop Technology

Market Insights and Typical use-cases

NetApp technology for Hadoop

Best practices for your first project with Hadoop

6
Introduction to BigData
and Hadoop


7
Big-data is like teenage sex:
everyone talks about it,
nobody knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it!

8
Origins of Big Data

Web (Google, Amazon, Facebook, Twitter, …)
Web giants implement Big Data solutions for their own needs

Management (McKinsey, BCG, Gartner, …)
Consulting firms predicted a big economic change, and Big Data is part of it

IT Vendors (NetApp, IBM, VMware, …)
Vendors now follow this movement. They are trying to get a foothold in this very promising business

9
Data deluge!

10
Data and Innovation

Data we traditionally
manipulate
(customers, product catalog…)

Innovation is here !
11
VOLUME
VELOCITY
VARIETY
12
Velocity: Day, Hour, Second, Real time
Volume: MB, GB, TB, PB
Variety: File, API, Web, Social networks; Structured, Text, Audio, Video
13
NEW USAGES
NEW SERVICES
NEW IT SYSTEMS

14
Is there a clear definition ?

Super datawarehouse?
Big databases?
NoSQL?
Low cost storage?
Unstructured data?
Cloud?
Real-time analysis?
Internet Intelligence?
Open Data?

There is no clear definition of Big Data
It is both a business ambition and a set of technological opportunities
15
Big Data : proposed definition

Big Data aims at getting an economic advantage from the quantitative analysis of internal and external data

16
Technology


17
Exponential growth of capacities

CPU, memory, network bandwidth, storage …
all of them have followed Moore’s law

Source :
http://strata.oreilly.com/2011/08/building-data-startups.html

18
[Chart: disk throughput in MB/s, 1990 to 2010. Points shown: IBM DTTA 35010, Seagate Barracuda ATA IV, Seagate Barracuda 7200.10. Throughput rises from 0.7 MB/s to 64 MB/s.]

Storage capacity: x 100,000
Throughput: x 91
We can store 100,000 times more data, but it takes 1,000 times longer to read it!
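The two growth factors on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures shown above (0.7 MB/s, 64 MB/s, capacity x 100,000); the variable names are illustrative and not part of the original deck.

```python
# Figures taken from the slide; variable names are illustrative.
throughput_1990_mb_s = 0.7
throughput_2010_mb_s = 64.0
capacity_growth = 100_000  # "x 100,000" on the slide

# How much faster a disk can be read sequentially.
throughput_growth = throughput_2010_mb_s / throughput_1990_mb_s

# Time to read a full disk scales as capacity / throughput,
# so the slowdown is the ratio of the two growth factors.
full_read_slowdown = capacity_growth / throughput_growth

print(f"throughput growth: x{throughput_growth:.0f}")
print(f"full-disk read time: x{full_read_slowdown:.0f}")
```

The result is on the order of 1,000, which is where the rounded figure on the slide comes from.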

19
Traditional architectures are limited

« Traditional » architectures: RDBMS, application server, ETL, ESB

Storage-oriented applications (IO bound)
Over 10 TB, « classical » architectures require huge software and hardware adaptations.
Alternative: distributed storage, share nothing

Event flow-oriented applications (streaming)
Over 1,000 events / second, « classical » architectures require huge software and hardware adaptations.
Alternative: Event Stream Processing

Transaction-oriented applications (TPS)
Over 1,000 transactions / second, « classical » architectures require huge software and hardware adaptations.
Alternative: XTP

Computation-oriented applications (CPU bound)
Over 10 threads per CPU core, sequential programming reaches its limits (IO).
Alternative: parallel processing
20
Emerging families

Application types: storage-oriented (IO bound), event flow-oriented (streaming), transaction-oriented (TPS), computation-oriented (CPU bound)

Hadoop
The Hadoop ecosystem offers distributed storage, but also distributed computing using MapReduce.

NoSQL / NewSQL
NoSQL: distributed non-relational stores
NewSQL: SQL-compliant distributed stores

Streaming
CEP (Complex Event Processing), ESP (Event Stream Processing)

In-memory analytics
In-memory analytics solutions distribute the data in the memory of several nodes to obtain a low processing time.

Grid / GPU
Grid computing on CPU, or on GPU
21
22
Hadoop : a reference in the Big Data landscape
Open Source
• Apache Hadoop

Main distributions
• Cloudera CDH
• Hortonworks HDP
• MapR
Commercial
• Greenplum (EMC)
• IBM InfoSphere BigInsights (CDH)
• Oracle Big data appliance (CDH)
• NetApp Analytics (CDH)
•…
Cloud
• Amazon EMR (MapR)
• RackSpace (HDP)
• VirtualScale (CDH)
•…

23
Hadoop Distributed File System (HDFS)
Key principles
Storage of files larger than a single disk
Data distributed across several nodes
Data replication to ensure failover, with rack awareness
Use of commodity disks instead of a SAN
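The principles above (files split across nodes, replicas spread over racks) can be illustrated with a toy model. This is a minimal sketch and not HDFS's actual placement policy: the block size, the replication factor, the cluster layout and both function names are assumptions chosen for the example.

```python
BLOCK_SIZE_MB = 128   # assumed block size for the example
REPLICATION = 3       # assumed replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block for a file of the given size."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_replicas(block_index, nodes_by_rack, replication=REPLICATION):
    """Toy rack-aware placement: one replica on a first rack,
    the remaining replicas on distinct nodes of a second rack.
    Requires at least two racks with enough nodes each."""
    racks = sorted(nodes_by_rack)
    first_nodes = nodes_by_rack[racks[block_index % len(racks)]]
    second_nodes = nodes_by_rack[racks[(block_index + 1) % len(racks)]]
    chosen = [first_nodes[block_index % len(first_nodes)]]
    start = block_index % len(second_nodes)
    for i in range(replication - 1):
        chosen.append(second_nodes[(start + i) % len(second_nodes)])
    return chosen

# Hypothetical 6-node cluster on two racks.
cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
blocks = split_into_blocks(300)
placement = {i: place_replicas(i, cluster) for i in range(len(blocks))}
for i, size in enumerate(blocks):
    print(f"block {i} ({size} MB) -> {placement[i]}")
```

Losing any single node, or even a whole rack, still leaves at least one copy of every block, which is the point of replication with rack awareness.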

24
Hadoop distributed processing: MapReduce
Key principles
Parallelise and distribute processing
Faster processing of smaller units of data
Co-location of processing and data
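A word count is the usual way to show the map, shuffle and reduce steps. The sketch below runs in a single Python process and only mimics the programming model; in Hadoop each split would be processed on the node that stores it. All function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word of one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values of one key."""
    return key, sum(values)

def run_job(splits):
    # Each split is processed independently, which is what makes the map step parallel.
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Two hypothetical input splits, as if stored on two different nodes.
splits = [["big data is big"], ["data is everywhere"]]
counts = run_job(splits)
print(counts)
```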

25
Overview of Hadoop architecture

Layers:
Integration with the Information System
Querying
Advanced processing
Orchestration
Distributed Processing
Distributed Storage
Monitoring and Management

26
Available tools in a typical distribution (CDH)

Sqoop

Pig
Cascading
Hive

Mahout
HAMA
Giraph

Oozie
Azkaban
Web
Console

Flume
Scribe

MapReduce
YARN (v2)

Impala

Chukwa

Hue
Cloudera
Manager

HBase

CLI

HDFS

27
Hadoop ecosystem today
sklearn
Spark

Impala

Stinger

Hawq

nltk

HAMA

Mahout

pandas

RHadoop

Python

R

Drill
SAS

Tools

Giraph

HBase

Cassandra

Cascading

Pig

Hive

Talend

Interactive
Transactional
API MR Java

Batch

Analytical queries
ETL
Spark

Scientific Computing

Search
Oozie

Compute

Usages

Solr

Streaming
YARN

MR/Tez

Storage systems

Storage API

Distributed FS
GlusterFS
HDFS
S3
Isilon
MapRFS

Local FS

NoSQL based
Cassandra
DynamoDB
Ceph
Ring
Openstack Swift

Import/export
CLI
Sqoop
Flume
Storm
ETL (Talend, Pentaho)

28
IS HADOOP A REPLACEMENT FOR BI ?

29
Limits of traditional BI architectures

ETL tools become bottlenecks
• does not scale well
• too much time spent moving the data

Traditional DWH are not adapted to new sources of data
• changing schema
• semi-structured, or unstructured data

[Diagram labels: Operational stores, ETL, ODS, ETL, DWH, Datamarts, BI. Caption: Moving the data again!]

30
Hadoop can help improve the BI architecture

Data can be stored fast in Hadoop, and can be transformed “in-place” using processing languages like Pig, or streaming

This approach is called E-L-T: Extract, Load, then Transform

BI reporting tools can also query the data stored in Hadoop using Hive, or other libraries, more or less interactively

[Diagram labels: Operational stores, HDFS, MapReduce, Pig, Hive, Streaming, BI with Hadoop, SAS, Tableau Software, Qliktech …]
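The E-L-T idea described on this slide can be sketched in plain Python: records are loaded exactly as received, and parsing, cleaning and aggregation happen later on the stored data. The `raw_zone` list stands in for files landed in HDFS, and the record fields (`customer`, `amount`) are hypothetical.

```python
import json

raw_zone = []  # stands in for raw files landed in HDFS (illustrative)

def extract_and_load(source_lines):
    """Extract + Load: store records exactly as received, no schema applied."""
    raw_zone.extend(source_lines)

def transform(raw_lines):
    """Transform later, where the data already sits: parse, skip bad records, aggregate."""
    totals = {}
    for line in raw_lines:
        try:
            record = json.loads(line)
            customer, amount = record["customer"], float(record["amount"])
        except (ValueError, KeyError, TypeError):
            continue  # malformed records are skipped but remain in the raw zone
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals

extract_and_load([
    '{"customer": "a", "amount": 10}',
    'garbage line',
    '{"customer": "a", "amount": "2.5"}',
    '{"customer": "b", "amount": 1}',
])
totals = transform(raw_zone)
print(totals)
```

Because nothing is dropped at load time, a later transformation with different rules can still be run on the same raw data.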

31
Summary of Hadoop

What Hadoop is :
A distributed storage system
Combined with a framework of distributed batch processing
A platform with a linear scalability, designed for commodity hardware
Complementary to traditional BI systems, with lower price/performance
ratios

What Hadoop is not, as of today :
Not a database with random-access to data
Not mature on real-time, or interactive query
Not enough : you need to add visualization tools, processing
libraries, and other elements related to your project

32
Q/A

33
Market Trends in Europe


34
Types of projects launched in 2012-2013

Data Science = Data mining and learning on business signals
Innovation projects, launched directly by a business department with
or without the IT department
Exploration of new data sources (clickstream, logs, social…)
Iterative projects : average budget around (100k€-200k€), ~50k-100k€
per step

IT Optimization = Data warehouse offloading, Streamlining of BI
appliances (Teradata, Oracle, …) with Hadoop
IT project, with objectives of cost-killing, and technical improvement
Building hybrid architecture with Hadoop as raw storage and ETL to
offload massive data warehouse (over 40TB)
Project budgets around 1M€ CAPEX and 300k€ OPEX with a clear ROI

35
Main use cases by sector
Projects launched in 2012-2013
Sector

Data Science

Retail Banking

•
•

IT Optimization

Behavioral marketing
Savings market trends
•
•
•

Corporate & Investment Banking

Insurance

•
•

•
•

Proactive Customer Care
Behavioral Churn

E-Commerce & media

•
•
•

Fail prediction
Capacity prediction

Mobile data log repository
Marketing Data Labs
QoS Data Labs

•

Smart metering repository

Behavioral marketing

Utilities

•
•
•

Behavioral marketing
Health and Savings market
trends

Telecoms

Market data repository
Trade analytics
Risk computation

36
Perspectives for 2014
Q3-2013 seems to have been a turning point on the Big Data
Analytics market in Europe
Executive Committees are supporting Data Science projects as
strategic projects
Big Data Analytics projects are included in the 2014 budget
plan, with
Budget over 500k€
Open positions for Data Scientists

Sectors where this topic seems to be of highest interest:
Retail Banks
Telecom
E-commerce

+ Insurance
+ Energy (distribution)

37
Use-cases


38
CHURN ANALYSIS
(TELECOM OPERATOR)

39
Behavioral analysis of churners
on channels: Web, mobile, call-center

• Objective: Anticipate churn.
• The Marketing dept wanted to analyze new data sources (mobile internet logs), previously ignored because of their size (250 TB for 6 months of data)
• “Data Lab” project:
IT and Marketing joined in the same team
Development of a platform to store, process and analyze the data and discover the behavior of churners, using machine learning algorithms
• Duration: 7 months

[Diagram labels: Data, Identification of patterns, Marketing rules to make proposals]
40
Architecture
Internet mobile logs
250 TB of data to analyze for churn patterns

Cluster of 8 datanodes + 2 master/support nodes
Total of :
96 * 3TB disks
128 CPU

Cloudera CDH 4
Tools : HIVE, PIG
Mahout, R…

Web portal
Proposals in
real-time

It is planned to scale out the cluster to 40 nodes

Behavior
Analysis
Identification of
patterns, and
marketing rules
41
ANALYSIS OF SOCIAL DATA TO IDENTIFY CORRELATION WITH
HEALTH-INSURANCE CLAIMS

(GENERALI - INSURANCE)

42
Analysis of social data to identify correlation with
health-insurance claims

• Objective: Anticipation of health claims, to improve internal prediction models.
Introduction of statistical variables computed from analysis of social data (medical forums).

• Realization:
Collection of text from forums and other social data
Natural language processing (text cleaning and analysis)
Semantic learning (medical concepts), to identify trends
Identification of correlations in datasets having more than 10 million variables
Data visualization to evaluate results with business experts

• Technology:
Hadoop on Amazon EC2
Machine learning: Python, CloudSearch, NLTK, sklearn

• Duration: 6 months

[Figures: keyword correlations; data visualization example]

43
CROSS-CHANNEL BEHAVIOR ANALYSIS
(BANK + INSURANCE)

44
Customer interaction Timeline
in a cross-channel context (web + call-center)

• Objective: Improve the knowledge about customer behavior, and improve the quality of customer care.

• Realization:
Collection of data from Web, CRM, Call-center
Analyses using a time-line approach
Determination of typical behaviors
Creation of real-time rules and alerting for web and call-centers

[Screenshot: example customer timeline]
Mr Mathieu DESPRIEE, 4 124569
30/09/1977
50, av des Champs Elysees 75008 Paris
06 17 17 54 12
Segment: HP

Today
Act – 12:16
Type: case created 4 124 569 356
Operator: Mme Catherine LECHU
Incoming call – 12:08
06 64 45 53 73
Duration 12 min
Wait time 6 min
Subject: Problem with attached files

01/08/2013
Outgoing call – 10:12
Subject: Problem with attached files
Web Portal – 08:03
Duration: 23 min
Pages:
• My subscription (2 min)
• Details Case 4 124 586 356 (11 min)
• Attached file (10 min)

27/07/2013
Web FAQ – 12:11
Duration: 20 min
Pages:
• FAQ (13 min)
• Subscriptions section (7 min)
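The time-line approach on this slide amounts to merging per-channel event lists into one chronological view per customer. A minimal sketch follows; the event tuples are loosely modeled on the screenshot, the date used for "Today" is a made-up placeholder, and `build_timeline` is a hypothetical helper name.

```python
import heapq
from datetime import datetime

def ts(text):
    """Parse a day/month/year timestamp as written on the slide."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M")

# One list per channel, each already sorted by time (oldest first).
web_events = [
    (ts("27/07/2013 12:11"), "web", "FAQ, 20 min"),
    (ts("01/08/2013 08:03"), "web", "Portal, 23 min"),
]
call_events = [
    (ts("01/08/2013 10:12"), "call-center", "Outgoing call"),
    (ts("02/08/2013 12:08"), "call-center", "Incoming call, 12 min"),  # placeholder date
]
crm_events = [
    (ts("02/08/2013 12:16"), "crm", "Case created"),  # placeholder date
]

def build_timeline(*sources):
    """Merge sorted per-channel event lists, most recent event first."""
    merged = heapq.merge(*sources, key=lambda event: event[0])
    return list(merged)[::-1]

timeline = build_timeline(web_events, call_events, crm_events)
for when, channel, label in timeline:
    print(f"{when:%d/%m/%Y %H:%M}  {channel:<12} {label}")
```

Once events sit on a single timeline, typical sequences (for example a long web session followed by a call on the same subject) become easy to detect with rules.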

45
(bank, confidential)

Axes of analysis
Axis 1: Collect existing data, to search for correlations with customer behavior
Axis 2: Use data from credit-card expenses
Axis 3: Search for social data (Twitter, Facebook) in relation to customers
Axis 4: Fraud analysis

Analyses
Timelines: a database used to visualize and navigate customers’ events, in the form of a timeline
Typical customer behaviors (Machine Learning): identification of these behaviors:
• Purchase
• Churn
• Claims
• Default
• Fraud
Digital Banking Trends:
• Centers of interest in communities
• Evaluation of competitors

Business usage
Personalized direct marketing
Customer care, call-center rules
Real-time alerts in e-banking
Remarketing
Digital Marketing
Community Management
46
(bank, confidential)

Architecture

Data collection: Python scripts

Hadoop / Spark
• Storage
• Data preparation
• Feature extraction
• Feature engineering
• Feature qualification
• Large-scale machine learning (Mahout or mixture of experts)
• NLP (NLTK)
• MapReduce scripting (Python)
• SQL (Hive)
• ELT (Pig)

R / Python sklearn / SAS (data miner)
• Sample qualification
• Statistical dataviz
• Machine learning
• Statistics

DataViz
ElasticSearch
• Drill-down
• Interactive analysis
• Search
Custom Python
D3.js
Highcharts.js

Reporting
Tableau Software

Users: Marketing & Analysts, IT
47
REAL-TIME ANALYSIS OF ENERGY GRID SENSOR DATA
“SMART-METERING”

(EDF - ENERGY)

48
Real-Time Analytics

Input
Data in motion: Smart Metering Data Stream
Data at rest: Weather Forecast, Static or Dynamic Prices, Customer Data

Analytics
Storm Network
Distributed complex event processing on Hadoop
Machine Learning
Storage

Output
Aggregates
Forecasts

[Chart of the smart metering data stream: axis values not preserved]
49
Input Data

Forecast

~1.5 M smart meter measurements processed
per second to compute forecasts
(6-node cluster)
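Turning a stream of measurements into aggregates usually means grouping them into time windows. The sketch below shows a tumbling-window sum in plain Python. It is not the Storm topology used in the project; the tuple layout, the 60-second window and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def tumbling_window_totals(measures, window_seconds=60):
    """Sum consumption per fixed, non-overlapping time window.

    measures: iterable of (timestamp_seconds, meter_id, kwh) tuples.
    Returns {window_start_seconds: total_kwh}.
    """
    totals = defaultdict(float)
    for ts, _meter_id, kwh in measures:
        window_start = (ts // window_seconds) * window_seconds
        totals[window_start] += kwh
    return dict(totals)

# Hypothetical measurements from two meters.
measures = [
    (0, "m1", 0.5),
    (30, "m2", 0.25),
    (61, "m1", 0.5),
    (119, "m2", 1.0),
    (120, "m1", 2.0),
]
totals = tumbling_window_totals(measures)
print(totals)
```

For scale, 1.5 M measurements per second on 6 nodes is 250,000 measurements per node per second on average, so in practice the aggregation is partitioned across nodes, for example by meter identifier.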

50
BI PLATFORM OPTIMIZATION
(TELECOM)

51
NetApp Confidential

Wireless Provider leverages Hadoop

Telco Industry
Provides wireless voice and data services globally

Business Challenge
• Consolidate large amounts of raw customer log data from multiple data centers into one data center
• Run analytical queries on the consolidated data, which currently can’t be done with existing tools

Solution
• NetApp Open Solution for Hadoop: eight-node cluster for ingesting, storing and compressing data; Solr and Lucene for indexing; HBase for querying indexed data

Benefits
• POC: 660 GB of data consolidated and indexed, 1.125 billion records processed in six hours
• Hadoop storage failover without service interruption
• New data processing and analytics capabilities

Another NetApp solution delivered by
52
Q/A

53
NetApp technology
for Hadoop


54
Q/A

55
Best practices for your first
analytics project
with Hadoop


56
Check that Hadoop is a good choice

Hadoop is not a replacement for a database technology
Hadoop is easy to SCALE, but is a complex technology
Hadoop is batch oriented.
Real-time processing and interactive querying tools are emerging, but
they are still young

If you have less than a few TB of data, you don’t need Hadoop

57
Project framing
Cluster setup
Project team setup
Data collection / Data quality
Analytics iterations

58
Project
Framing

Identify a data-source you want to explore, with a potential
business value
Short-list and choose one business question to
evaluate, related to this data
Define at a macro-level your needs in analytics
“classical analytics” (aggregates and reports)
exploratory, with data visualization
statistical, data mining, machine learning
→ This will help you choose the tools in the ecosystem

Determine the technological constraints
Volume
Latency (batch, or not)
Data quality
Integration with the rest of IT

Size your cluster
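A first-order sizing can be done with simple arithmetic. The sketch below is one possible rule of thumb, not a method from the deck: the replication factor of 3 is the usual HDFS default, while the compression ratio, the 25% allowance for temporary data and the 40 TB example volume are assumptions to adjust for your project. The node profile (12 disks of 3 TB) is derived from the churn use case above (96 disks on 8 datanodes).

```python
import math

def raw_storage_needed_tb(data_tb, replication=3, compression_ratio=1.0,
                          temp_overhead=0.25):
    """Raw disk space needed for a given volume of source data.

    compression_ratio: 2.0 means the data shrinks to half its size (assumption).
    temp_overhead: extra share kept free for intermediate job data (assumption).
    """
    stored = data_tb / compression_ratio * replication
    return stored * (1 + temp_overhead)

def datanodes_needed(raw_tb, disks_per_node, disk_tb):
    """Number of datanodes required to provide the raw capacity."""
    return math.ceil(raw_tb / (disks_per_node * disk_tb))

# Hypothetical project: 40 TB of uncompressed source data.
raw_tb = raw_storage_needed_tb(40)
nodes = datanodes_needed(raw_tb, disks_per_node=12, disk_tb=3)
print(f"raw capacity: {raw_tb:.0f} TB, datanodes: {nodes}")
```

Storage is only one dimension: CPU, memory and network needs depend on the type of analytics identified during framing.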

59
This step requires your attention !
Cluster
setup

Hadoop uses commodity hardware, but probably
not the kind of machines you are used to in your
datacenter
2U, internal storage, high-memory…

Consider using the solution of a provider like NetApp
Consider using Hadoop in Cloud
blog.octo.com/pt-br/hadoop-na-nuvem/

Benchmark your brand new cluster before actually
starting the project
Lots of configuration parameters involved…

Set up all the tools around Hadoop

60
Project Team
setup

This is innovative technology.
A data-science project is an innovative project.
→ You need adapted project management
Co-locate the people
business data analysts, architect, developers, infrastructure /
ops

Use Agile practices :
Work iteratively, with short cycles or sprints (1 or 2 weeks).
Choose small and achievable objectives for each sprint.
Use Agile rituals (stand-up, retrospectives…)

Train your team. A Hadoop project requires skilled people
Hadoop infrastructure
Hadoop development
data-science

Hire experts, and organize the knowledge transfer from
them to your team

61
Data collection /
Data quality

As in a classical “data project” (like in BI), an
important part of efforts will be related to data
quality :
preparation of the data
clean up
data transformation

Don’t underestimate this

62
Typical iteration of data analytics
Analytics
iterations

select subset of data
select machine-learning algorithm to use on it

prepare the data (explore, filter, enrich…)
divide in training dataset & test dataset
execute algorithm
measure prediction error
visualize results
draw a conclusion from the test with this algorithm
and adapt for the next iteration : other data ? other
algorithm ? solve a technical issue ?

And start again !
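One pass through this loop can be sketched with the standard library only. Everything here is illustrative: the data is synthetic, and the "algorithm" is a deliberately simple threshold classifier standing in for whatever a real project would use (for example a model from sklearn or Mahout).

```python
import random

def make_dataset(n, rng):
    """Synthetic data: one numeric feature, label 1 when the feature tends to be high."""
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        center = 2.0 if label else 0.0
        rows.append((rng.gauss(center, 1.0), label))
    return rows

def train_threshold(train):
    """Toy 'algorithm': midpoint between the mean feature value of each class."""
    ones = [x for x, y in train if y == 1]
    zeros = [x for x, y in train if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

def error_rate(threshold, rows):
    """Share of rows where the predicted label differs from the true label."""
    wrong = sum(1 for x, y in rows if int(x > threshold) != y)
    return wrong / len(rows)

rng = random.Random(42)              # fixed seed so the run is repeatable
data = make_dataset(1000, rng)       # select / prepare the data
rng.shuffle(data)
split = int(0.8 * len(data))         # divide into training and test datasets
train, test = data[:split], data[split:]
threshold = train_threshold(train)   # execute the algorithm
test_error = error_rate(threshold, test)  # measure the prediction error
print(f"threshold={threshold:.2f} test error={test_error:.1%}")
```

The measured test error is the input for the conclusion step: if it is too high, the next iteration changes the data, the features or the algorithm.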

63
Q/A

64

Contenu connexe

Tendances

Big data - Key Enablers, Drivers & Challenges
Big data - Key Enablers, Drivers & ChallengesBig data - Key Enablers, Drivers & Challenges
Big data - Key Enablers, Drivers & ChallengesShilpi Sharma
 
doolyk_rev_p_001.compressed
doolyk_rev_p_001.compresseddoolyk_rev_p_001.compressed
doolyk_rev_p_001.compressedDoolytics
 
Dell hans timmerman v1.1
Dell hans timmerman v1.1Dell hans timmerman v1.1
Dell hans timmerman v1.1BigDataExpo
 
The importance of data
The importance of dataThe importance of data
The importance of dataAPNIC
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A reviewShilpa Soi
 
Telco Big Data Workshop Sample
Telco Big Data Workshop SampleTelco Big Data Workshop Sample
Telco Big Data Workshop SampleAlan Quayle
 
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud ComputingBattling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud ComputingEdwin Poot
 
"Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr...
"Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr..."Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr...
"Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr...Dataconomy Media
 
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...Denodo
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...Dataconomy Media
 
Big Data Public-Private Forum_General Presentation
Big Data Public-Private Forum_General PresentationBig Data Public-Private Forum_General Presentation
Big Data Public-Private Forum_General PresentationBIG Project
 
Face Data Challenges of Life Science Organizations With Next-Generation Hitac...
Face Data Challenges of Life Science Organizations With Next-Generation Hitac...Face Data Challenges of Life Science Organizations With Next-Generation Hitac...
Face Data Challenges of Life Science Organizations With Next-Generation Hitac...Hitachi Vantara
 
Hortonworks & IBM solutions
Hortonworks & IBM solutionsHortonworks & IBM solutions
Hortonworks & IBM solutionsThiago Santiago
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data AnalyticsTUSHAR GARG
 
Big data analysis using map/reduce
Big data analysis using map/reduceBig data analysis using map/reduce
Big data analysis using map/reduceRenuSuren
 
ds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suiteds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_SuiteRobin Fong 方俊强
 

Tendances (20)

Combining hadoop with big data analytics
Combining hadoop with big data analyticsCombining hadoop with big data analytics
Combining hadoop with big data analytics
 
Big data - Key Enablers, Drivers & Challenges
Big data - Key Enablers, Drivers & ChallengesBig data - Key Enablers, Drivers & Challenges
Big data - Key Enablers, Drivers & Challenges
 
doolyk_rev_p_001.compressed
doolyk_rev_p_001.compresseddoolyk_rev_p_001.compressed
doolyk_rev_p_001.compressed
 
Study: #Big Data in #Austria
Study: #Big Data in #AustriaStudy: #Big Data in #Austria
Study: #Big Data in #Austria
 
Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
Dell hans timmerman v1.1
Dell hans timmerman v1.1Dell hans timmerman v1.1
Dell hans timmerman v1.1
 
Fundamentals of Big Data
Fundamentals of Big DataFundamentals of Big Data
Fundamentals of Big Data
 
The importance of data
The importance of dataThe importance of data
The importance of data
 
BIG Data and Methodology-A review
BIG Data and Methodology-A reviewBIG Data and Methodology-A review
BIG Data and Methodology-A review
 
Telco Big Data Workshop Sample
Telco Big Data Workshop SampleTelco Big Data Workshop Sample
Telco Big Data Workshop Sample
 
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud ComputingBattling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
Battling the disrupting Energy Markets utilizing PURE PLAY Cloud Computing
 
"Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr...
"Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr..."Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr...
"Empower Developers with HPE Machine Learning and Augmented Intelligence", Dr...
 
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
Implementar una estrategia eficiente de gobierno y seguridad del dato con la ...
 
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",..."From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
"From Big Data To Big Valuewith HPE Predictive Analytics & Machine Learning",...
 
Big Data Public-Private Forum_General Presentation
Big Data Public-Private Forum_General PresentationBig Data Public-Private Forum_General Presentation
Big Data Public-Private Forum_General Presentation
 
Face Data Challenges of Life Science Organizations With Next-Generation Hitac...
Face Data Challenges of Life Science Organizations With Next-Generation Hitac...Face Data Challenges of Life Science Organizations With Next-Generation Hitac...
Face Data Challenges of Life Science Organizations With Next-Generation Hitac...
 
Hortonworks & IBM solutions
Hortonworks & IBM solutionsHortonworks & IBM solutions
Hortonworks & IBM solutions
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Big data analysis using map/reduce
Big data analysis using map/reduceBig data analysis using map/reduce
Big data analysis using map/reduce
 
ds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suiteds_Pivotal_Big_Data_Suite_Product_Suite
ds_Pivotal_Big_Data_Suite_Product_Suite
 

Similaire à Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop

Big data and you
Big data and you Big data and you
Big data and you IBM
 
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssen
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssenDatenstrategie der Zukunft - Technologietrends, die Sie kennen müssen
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssenDenodo
 
Virtualisation de données : Enjeux, Usages & Bénéfices
Virtualisation de données : Enjeux, Usages & BénéficesVirtualisation de données : Enjeux, Usages & Bénéfices
Virtualisation de données : Enjeux, Usages & BénéficesDenodo
 
Big Data - A Real Life Revolution
Big Data - A Real Life RevolutionBig Data - A Real Life Revolution
Big Data - A Real Life RevolutionCapgemini
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Big Data Expo 2015 - Pentaho The Future of Analytics
Big Data Expo 2015 - Pentaho The Future of AnalyticsBig Data Expo 2015 - Pentaho The Future of Analytics
Big Data Expo 2015 - Pentaho The Future of AnalyticsBigDataExpo
 
Traditional data word
Traditional data wordTraditional data word
Traditional data wordorcoxsm
 
Bridging the Last Mile: Getting Data to the People Who Need It
Bridging the Last Mile: Getting Data to the People Who Need ItBridging the Last Mile: Getting Data to the People Who Need It
Bridging the Last Mile: Getting Data to the People Who Need ItDenodo
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 
Connecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnecta Event: Big Query och dataanalys med Google Cloud Platform
Connecta Event: Big Query och dataanalys med Google Cloud PlatformConnectaDigital
 
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. Februa...Tomasz Bednarz
 
Future of big data nick kabra speaker compendium march 2013
Future of big data nick kabra speaker compendium march 2013Future of big data nick kabra speaker compendium march 2013
Future of big data nick kabra speaker compendium march 2013nkabra
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationDenodo
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)Sascha Dittmann
 
Webinar: Improving Time to Value for Enterprise Big Data Analytics
Webinar: Improving Time to Value for Enterprise Big Data AnalyticsWebinar: Improving Time to Value for Enterprise Big Data Analytics
Webinar: Improving Time to Value for Enterprise Big Data AnalyticsStorage Switzerland
 
Open Source DWBI-A Primer
Open Source DWBI-A PrimerOpen Source DWBI-A Primer
Open Source DWBI-A Primerpartha69
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 

Similaire à Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop (20)

Big data and you
Big data and you Big data and you
Big data and you
 
Datenstrategie der Zukunft - Technologietrends, die Sie kennen müssen
Café da manhã - São Paulo - Use-cases and opportunities in BigData with Hadoop

  • 1. Use-cases and opportunities in BigData: Return on experience with Hadoop. 28 Nov. 2013. © OCTO 2013. Rua Funchal, 411, 5e andar, Vila Olimpia, Sao Paulo - BRASIL. Tel: +55.11.3468.01.03. www.octo.com
  • 2. Octo and Big Data. Octo Technology has been investing in the Big Data market since 2010: R&D, training, partnership development. We provide our customers with consulting services: use-case and opportunity/feasibility studies; solution selection for Big Data projects; architecture design of Big Data solutions; deployment of Big Data/NoSQL solutions; training. The Octo Technology Big Data unit is today a team of 12 dedicated people: technical experts and data analysts. We have carried out some 20 Big Data projects so far: mainly Big Data studies and PoCs; deployment of NoSQL solutions; in very different sectors (insurance, banking, logistics, energy). Technical partnerships with the biggest players in the market (see next slide).
  • 3. Octo expertise & partners on Big Data: Hadoop ecosystem, Complex Event Processing, High Performance Computing, NoSQL, Cloud, DevOps. OCTO has expertise in most of the solutions on the market. Our multiple partnerships allow us to remain completely independent of solution vendors.
  • 4. Big Data @ OCTO: some data. 250 TB: biggest volume of data analyzed by Octo. The slide also shows the figures 20, 850, 16, 800 and 80 (pairing not recoverable from the transcript) for these indicators: number of conferences on Big Data organized by Octo so far; number of nodes in the largest Hadoop cluster deployed by Octo; largest storage volume (in TB) used by Octo during a Big Data project; number of partnerships of Octo with major players of the Big Data market; number of Octo consultants trained on at least one Big Data solution.
  • 5. Speakers: Clement ROUQUIE, Director BRAZIL, OCTO, crouquie@octo.com; Diego Flaborea, System Engineer, NetApp, diego.flaborea@netapp.com; Mathieu DESPRIEE, Senior Architect, OCTO, mde@octo.com; Wagner Roberto DOS SANTOS, Architect, OCTO, wds@octo.com
  • 6. Agenda: Introduction to BigData & Hadoop Technology; Market Insights and Typical use-cases; NetApp technology for Hadoop; Best practices for your first project with Hadoop
  • 7. Introduction to BigData and Hadoop
  • 8. Big-data is like teenage sex: everyone talks about it, nobody knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it!
  • 9. Origins of Big Data. Web (Google, Amazon, Facebook, Twitter, …): web giants implement Big Data solutions for their own needs. Management (McKinsey, BCG, Gartner, …): consulting firms predicted a big economic change, and Big Data is part of it. IT vendors (NetApp, IBM, VMware, …): vendors now follow this movement and try to get a hold on this very promising business.
  • 11. Data and Innovation. Data we traditionally manipulate (customers, product catalog…). Innovation is here!
  • 15. Is there a clear definition? Super data warehouse? Big databases? NoSQL? Low-cost storage? Unstructured data? Cloud? Real-time analysis? Internet intelligence? Open Data? There is no clear definition of Big Data. It is at once a business ambition and a set of technological opportunities.
  • 16. Big Data: proposed definition. Big Data aims at gaining an economic advantage from the quantitative analysis of internal and external data.
  • 18. Exponential growth of capacities. CPU, memory, network bandwidth, storage: all of them followed Moore's law. Source: http://strata.oreilly.com/2011/08/building-data-startups.html
  • 19. [Chart: disk storage capacity and throughput, 1990 to 2010; drives shown: IBM DTTA 35010, Seagate Barracuda ATA IV, Seagate Barracuda 7200.10. Throughput grew from 0.7 MB/s to 64 MB/s (×91); storage capacity grew ×100,000.] We can store 100,000 times more data, but it takes about 1,000 times longer to read it!
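
The "1,000 times longer" figure follows from dividing the two growth factors on the chart. A minimal worked check, using only the numbers quoted on the slide (the variable names are illustrative):

```python
# Figures quoted on the slide; nothing else is assumed.
capacity_growth = 100_000        # storage capacity, 1990 -> 2010
throughput_1990_mb_s = 0.7       # MB/s
throughput_2010_mb_s = 64.0      # MB/s

# How much faster a disk can be read sequentially.
throughput_growth = throughput_2010_mb_s / throughput_1990_mb_s

# Time to read a full disk grows as capacity growth / throughput growth.
scan_time_growth = capacity_growth / throughput_growth

print(f"throughput growth: x{throughput_growth:.0f}")
print(f"full-scan time growth: x{scan_time_growth:.0f}")
```

The result is a little under 1,100, which the slide rounds to 1,000. This gap between capacity and read speed is the motivation for reading many disks in parallel.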
  • 20. Traditional architectures are limited. "Traditional" architectures: RDBMS, application server, ETL, ESB. Storage-oriented applications (IO bound): over 10 TB, "classical" architectures require huge software and hardware adaptations; alternative: distributed storage, share-nothing. Event-flow-oriented applications (streaming): over 1,000 events/second, "classical" architectures require huge software and hardware adaptations; alternative: Event Stream Processing. Transaction-oriented applications (TPS): over 1,000 transactions/second, "classical" architectures require huge software and hardware adaptations; alternative: XTP. Computation-oriented applications (CPU bound): over 10 threads per CPU core, sequential programming reaches its limits (IO); alternative: parallel processing.
  • 21. Emerging families, positioned against the four application types (storage-oriented, IO bound; event-flow-oriented, streaming; transaction-oriented, TPS; computation-oriented, CPU bound). Hadoop: the Hadoop ecosystem offers distributed storage, but also distributed computing using MapReduce. NoSQL: distributed non-relational stores. NewSQL: SQL-compliant distributed stores. Streaming: CEP (Complex Event Processing), ESP (Event Stream Processing). Grid, GPU: grid computing on CPU or on GPU. In-memory analytics: solutions that distribute the data in the memory of several nodes to obtain a low processing time.
  • 23. Hadoop: a reference in the Big Data landscape. Open source: Apache Hadoop. Main distributions: Cloudera CDH, Hortonworks HDP, MapR. Commercial: Greenplum (EMC), IBM InfoSphere BigInsights (CDH), Oracle Big Data Appliance (CDH), NetApp Analytics (CDH), … Cloud: Amazon EMR (MapR), RackSpace (HDP), VirtualScale (CDH), …
  • 24. Hadoop Distributed File System (HDFS). Key principles: storage of files larger than a single disk; data distributed across several nodes; data replication to ensure failover, with rack awareness; use of commodity disks instead of a SAN.
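
The HDFS principles above (files cut into blocks, blocks spread over nodes, replicas spread over racks) can be illustrated with a toy model. This is a simplified sketch, not HDFS's actual placement algorithm: the function names, the 128 MB block size, the replication factor of 3 and the deterministic placement rule are all assumptions made for the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file is cut into (last one may be smaller)."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(block_id, nodes_by_rack, replication=3):
    """Toy rack-aware placement: one replica on one rack, the others on a second rack.

    Requires at least two racks and at least `replication - 1` nodes per rack.
    """
    racks = sorted(nodes_by_rack)
    if len(racks) < 2:
        raise ValueError("need at least two racks")
    first_nodes = nodes_by_rack[racks[block_id % len(racks)]]
    other_nodes = nodes_by_rack[racks[(block_id + 1) % len(racks)]]
    if len(other_nodes) < replication - 1:
        raise ValueError("not enough nodes on the second rack")
    placement = [first_nodes[block_id % len(first_nodes)]]
    # Remaining replicas go to distinct nodes of the other rack.
    for offset in range(replication - 1):
        placement.append(other_nodes[(block_id + offset) % len(other_nodes)])
    return placement


# Hypothetical two-rack cluster.
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
for block_id, size in enumerate(split_into_blocks(300)):
    print(block_id, size, place_replicas(block_id, cluster))
```

Because every block has replicas on two racks, losing a single node or a whole rack still leaves at least one copy of each block readable.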
  • 25. Hadoop distributed processing: MapReduce. Key principles: parallelize and distribute processing; quicker processing of smaller (unit) data volumes; co-location of processing and data.
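
The MapReduce model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases on a word count, not Hadoop's Java API; the function names and the sample input are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word of one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run the three phases over input splits (each split is a list of lines)."""
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Each inner list stands for one input split, processed where its data lives.
print(run_job([["big data big"], ["data hadoop"]]))
```

In a real cluster each split would be mapped on the node that stores it, which is the co-location of processing and data named on the slide; only the grouped intermediate pairs travel over the network.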
  • 26. Overview of Hadoop architecture. Layers: integration with the information system; querying; advanced processing; orchestration; distributed processing; distributed storage; monitoring and management.
  • 27. Available tools in a typical distribution (CDH): Sqoop, Pig, Cascading, Hive, Mahout, HAMA, Giraph, Oozie, Azkaban, Web Console, Flume, Scribe, MapReduce, YARN (v2), Impala, Chukwa, Hue, Cloudera Manager, HBase, CLI, HDFS.
  • 28. Hadoop ecosystem today [diagram]. Usages: interactive, transactional, batch, analytical queries, ETL, scientific computing, search, streaming. Tools: sklearn, Spark, Impala, Stinger, Hawq, nltk, HAMA, Mahout, pandas, RHadoop, Python, R, Drill, SAS, Giraph, HBase, Cassandra, Cascading, Pig, Hive, Talend, MR Java API, Oozie, Solr. Compute: YARN, MR/Tez. Storage systems (storage API): distributed FS, GlusterFS, HDFS, S3, Isilon, MapRFS, local FS, NoSQL-based, Cassandra, DynamoDB, Ceph, Ring, OpenStack Swift. Import/export: CLI, Sqoop, Flume, Storm, ETL (Talend, Pentaho).
  • 29. IS HADOOP A REPLACEMENT FOR BI?
  • 30. Limits of traditional BI architectures. [Diagram: operational stores, ETL, ODS, ETL, DWH, datamarts, BI; caption "Moving the data again!"] ETL tools become bottlenecks: ETL does not scale well; too much time is spent moving the data. Traditional DWHs are not adapted to new sources of data: changing schemas; semi-structured or unstructured data.
  • 31. Hadoop can help improve the BI architecture. Data can be stored fast in Hadoop and transformed "in place" using processing languages like Pig, or streaming. This approach is called E-L-T: Extract, Load, then Transform. BI reporting tools (SAS, Tableau Software, QlikTech, …) can also query the data stored in Hadoop using Hive or other libraries, more or less interactively. [Diagram "BI with Hadoop": operational stores, HDFS, MapReduce, Pig, Hive, streaming.]
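
The E-L-T idea (load raw data first, transform later where it is stored) can be sketched as follows. This is a plain-Python illustration, not Pig or Hive: the `raw_zone` list is a hypothetical stand-in for raw files in HDFS, and the JSON records and field names are invented for the example.

```python
import json
from collections import defaultdict


def load(raw_zone, raw_lines):
    """Extract + Load: store lines untouched, with no schema enforced."""
    raw_zone.extend(raw_lines)


def transform(raw_zone):
    """Transform after loading: parse each line and skip what does not fit."""
    for line in raw_zone:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines stay in the raw zone, just not in this view
        if isinstance(record, dict) and "customer" in record and "amount" in record:
            yield record["customer"], float(record["amount"])


def total_per_customer(raw_zone):
    """One possible downstream view, computed directly from the raw data."""
    totals = defaultdict(float)
    for customer, amount in transform(raw_zone):
        totals[customer] += amount
    return dict(totals)


raw_zone = []  # hypothetical stand-in for a directory of raw files in HDFS
load(raw_zone, [
    '{"customer": "a", "amount": 10}',
    "not json",
    '{"customer": "a", "amount": 2.5}',
    '{"customer": "b", "amount": 1}',
])
print(total_per_customer(raw_zone))
```

Because nothing is discarded at load time, a new transformation can be written later against the same raw data without re-extracting it from the operational stores.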
  • 32. Summary of Hadoop. What Hadoop is: a distributed storage system, combined with a distributed batch-processing framework; a platform with linear scalability, designed for commodity hardware; complementary to traditional BI systems, with lower price/performance ratios. What Hadoop is not, as of today: not a database with random access to data; not mature for real-time or interactive queries; not enough on its own: you need to add visualization tools, processing libraries, and other elements related to your project.
  • 34. Market Trends in Europe
  • 35. Types of projects launched in 2012-2013. Data Science = data mining and learning on business signals: innovation projects, launched directly by a business department, with or without the IT department; exploration of new data sources (clickstream, logs, social…); iterative projects, with an average budget around 100k€-200k€ and ~50k-100k€ per step. IT Optimization = data warehouse offloading, streamlining of BI appliances (Teradata, Oracle, …) with Hadoop: IT projects, with cost-cutting and technical-improvement objectives; building a hybrid architecture with Hadoop as raw storage and ETL to offload massive data warehouses (over 40 TB); project budgets around 1M€ CAPEX and 300k€ OPEX, with a clear ROI.
  • 36. Main use cases by sector (projects launched in 2012-2013). [Table with columns Sector, Data Science, IT Optimization; cell layout lost.] Sectors: Retail Banking; Corporate & Investment Banking; Insurance; Telecoms; E-Commerce & media; Utilities. Use cases listed: behavioral marketing; savings market trends; health and savings market trends; proactive customer care; behavioral churn; failure prediction; capacity prediction; mobile data log repository; marketing Data Labs; QoS Data Labs; smart metering repository; market data repository; trade analytics; risk computation.
  • 37. Perspectives for 2014. Q3 2013 seems to have been a turning point for the Big Data analytics market in Europe: executive committees are supporting Data Science projects as strategic projects; Big Data analytics projects are included in 2014 budget plans, with budgets over 500k€; there are open positions for data scientists. Sectors where this topic seems to be of highest interest: retail banks, telecom, e-commerce, plus insurance and energy (distribution).
  • 40. Behavioral analysis of churners across channels: web, mobile, call-center. Objective: anticipate churn. The Marketing department wanted to analyze new data sources (mobile internet logs), previously ignored because of their size (250 TB for 6 months of data). “Data Lab” project: IT and Marketing joined in the same team; building of a platform to store, process and analyze the data and discover the behavior of churners, using machine-learning algorithms. Duration: 7 months. (Diagram: data, identification of patterns, marketing rules to make proposals.)
  • 41. Architecture. Input: mobile internet logs, 250 TB of data to analyze for churn patterns. Cluster of 8 datanodes + 2 master/support nodes, for a total of 96 × 3 TB disks and 128 CPUs, running Cloudera CDH 4. Tools: Hive, Pig, Mahout, R… Behavior analysis: identification of patterns and marketing rules. Output: web portal, with proposals in real time. It is planned to grow the cluster to 40 nodes.
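A rough sizing check for the cluster described on slide 41 can be sketched as follows. The disk count, disk size, datanode count and dataset size come from the slide; the replication factor of 3 (the HDFS default) and the idea that all 96 disks sit in the datanodes are assumptions, since the slides do not state either.

```python
# Figures taken from slide 41.
DATANODES = 8
DISKS = 96
DISK_TB = 3
DATASET_TB = 250

# Assumptions, not stated in the slides:
# - every disk belongs to a datanode (none are reserved for master nodes);
# - HDFS keeps its default replication factor of 3.
REPLICATION = 3

raw_tb = DISKS * DISK_TB               # total raw disk capacity
disks_per_node = DISKS // DATANODES    # disks in each datanode
raw_per_node_tb = raw_tb / DATANODES   # raw capacity of each datanode
usable_tb = raw_tb / REPLICATION       # space left once each block is stored 3 times

print(f"raw capacity:        {raw_tb} TB")
print(f"disks per datanode:  {disks_per_node}")
print(f"raw per datanode:    {raw_per_node_tb} TB")
print(f"usable (x{REPLICATION} repl.):  {usable_tb} TB")
print(f"dataset / usable:    {DATASET_TB / usable_tb:.1f}")
```

Under these assumptions the usable capacity is lower than the 250 TB of raw logs, which is at least consistent with the plan to grow the cluster; the slides do not say how the data was compressed, sampled or loaded.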
  • 42. ANALYSIS OF SOCIAL DATA TO IDENTIFY CORRELATION WITH HEALTH-INSURANCE CLAIMS (GENERALI - INSURANCE)
  • 43. Analysis of social data to identify correlation with health-insurance claims. Objective: anticipate health claims, to improve internal prediction models, by introducing statistical variables computed from the analysis of social data (medical forums). Realization: collection of text from forums and other social data; natural language processing (text cleaning and analysis); semantic learning (medical concepts), to identify trends; identification of correlations in datasets with more than 10 million variables; data visualization to evaluate the results with business experts. Technology: Hadoop on Amazon EC2; machine learning: Python, CloudSearch, NLTK, sklearn. Duration: 6 months. (Figures: keyword correlations; data visualization example.)
  • 45. Customer interaction timeline in a cross-channel context (web + call-center). Objective: improve the knowledge of customer behavior, and improve the quality of customer care. Realization: collection of data from web, CRM and call-center; analyses using a timeline approach; determination of typical behaviors; creation of real-time rules and alerting for web and call-centers. (Screen mock-up: a customer record with a timeline of events. Today: case created at 12:16; incoming call at 12:08, duration 12 min, wait time 6 min, subject: problem with attached files. 01/08/2013: outgoing call at 10:12 on the same subject; web portal session at 08:03, duration 23 min, pages: my subscription, case details, attached file. 27/07/2013: web FAQ visit at 12:11, duration 20 min, pages: FAQ, subscriptions section.)
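The timeline approach of slide 45 can be illustrated with a minimal sketch: events from each channel are merged into one per-customer timeline, then a simple rule is evaluated on it. The customer id, field names, the date used for "today" and the rule itself are all hypothetical; the slides do not describe the actual data model or rules.

```python
from datetime import datetime, timedelta

# Illustrative events per source, loosely modelled on the slide's mock-up.
# Customer id, field names and the "today" date (2013-08-02) are hypothetical.
web_events = [
    {"customer": "C1", "at": datetime(2013, 7, 27, 12, 11), "channel": "web", "what": "FAQ visit"},
    {"customer": "C1", "at": datetime(2013, 8, 1, 8, 3), "channel": "web", "what": "portal session"},
]
call_events = [
    {"customer": "C1", "at": datetime(2013, 8, 1, 10, 12), "channel": "call-center", "what": "outgoing call"},
    {"customer": "C1", "at": datetime(2013, 8, 2, 12, 8), "channel": "call-center", "what": "incoming call"},
]
crm_events = [
    {"customer": "C1", "at": datetime(2013, 8, 2, 12, 16), "channel": "crm", "what": "case created"},
]


def build_timeline(customer, *sources):
    """Merge the events of all sources for one customer, most recent first."""
    events = [e for source in sources for e in source if e["customer"] == customer]
    return sorted(events, key=lambda e: e["at"], reverse=True)


def web_then_call(timeline, window=timedelta(days=7)):
    """Toy rule: True if an incoming call follows a web event within `window`."""
    web_times = [e["at"] for e in timeline if e["channel"] == "web"]
    call_times = [e["at"] for e in timeline if e["what"] == "incoming call"]
    return any(timedelta(0) <= c - w <= window for c in call_times for w in web_times)


timeline = build_timeline("C1", web_events, call_events, crm_events)
for event in timeline:
    print(event["at"].isoformat(sep=" "), event["channel"], event["what"])
print("alert:", web_then_call(timeline))
```

A rule of this kind, flagging a customer who browses the help pages and then calls, is one plausible example of the "real-time rules and alerting" mentioned on the slide; a production version would evaluate it on a stream of events rather than on in-memory lists.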
  • 46. (bank, confidential) Analyses. Axes of analysis: Axis 1: collect existing data, to search for correlations with customer behavior; Axis 2: use data from credit-card expenses; Axis 3: search for social data (Twitter, Facebook) related to customers; Axis 4: fraud analysis. Analyses: timelines (a database for visualizing and navigating a customer’s events in the form of a timeline); typical customer behaviors, identified with machine learning (purchase, churn, claims, default, fraud); digital banking trends (centers of interest in communities, evaluation of competitors). Business usage: personalized direct marketing; customer care and call-center rules; real-time alerts in e-banking; remarketing; digital marketing; community management.
  • 47. (bank, confidential) Architecture. Data collection: Python scripts. Hadoop / Spark: storage; data preparation; feature extraction; feature engineering; feature qualification; large-scale machine learning (Mahout or mixture of experts); NLP (NLTK); MapReduce scripting (Python); SQL (Hive); ELT (Pig). R / Python sklearn / SAS (data miner): sample qualification; statistical dataviz; machine learning; statistics. DataViz: ElasticSearch (drill-down, interactive analysis, search); custom Python, D3.js, Highcharts.js. Reporting: Tableau Software. Users: Marketing & analysts, IT.
  • 48. REAL-TIME ANALYSIS OF ENERGY GRID SENSOR DATA “SMART-METERING” (EDF - ENERGY)
  • 49. Real-time analytics. (Diagram. Input, data in motion: smart-metering data stream. Data at rest: weather forecasts, static or dynamic prices, customer data. Analytics: Storm, distributed complex event processing on Hadoop; machine learning; storage. Output: aggregates, network forecasts.)
  • 50. (Charts: input data, forecast.) About 1.5 million smart-meter measurements processed per second to compute forecasts (6-node cluster).
  • 52. Wireless provider leverages Hadoop. Industry: telco; provides wireless voice and data services globally. Business challenge: consolidate large amounts of raw customer log data from multiple data centers into one data center; run analytical queries on the consolidated data, which currently can’t be done with existing tools. Solution: NetApp Open Solution for Hadoop, an eight-node cluster for ingesting, storing and compressing data; Solr and Lucene for indexing, HBase for querying indexed data. Benefits: in the POC, 660 GB of data consolidated and indexed, and 1.125 billion records processed in six hours; Hadoop storage failover without service interruption; new data processing and analytics capabilities.
  • 54. NetApp technology for Hadoop
  • 56. Best practices for your first analytics project with Hadoop
  • 57. Check that Hadoop is a good choice. Hadoop is not a replacement for a database technology. Hadoop is easy to SCALE, but it is a complex technology. Hadoop is batch-oriented: real-time processing and interactive querying tools are emerging, but they are still young. If you have less than a few TB of data, you don’t need Hadoop.
  • 59. Project framing. Identify a data source you want to explore, with potential business value. Short-list and choose one business question to evaluate, related to this data. Define your analytics needs at a macro level: “classical analytics” (aggregates and reports); exploratory, with data visualization; statistical, data mining, machine learning. This will help you choose the tools in the ecosystem. Determine the technological constraints: volume; latency (batch, or not); data quality; integration with the rest of IT. Size your cluster.
  • 60. Cluster setup (this step requires your attention!). Hadoop uses commodity hardware, but probably not the machines you are used to running in your datacenter: 2U, internal storage, high memory… Consider using a solution from a provider like NetApp. Consider using Hadoop in the cloud: blog.octo.com/pt-br/hadoop-na-nuvem/. Benchmark your brand-new cluster before actually starting the project: lots of configuration parameters are involved… Set up all the tools around Hadoop.
  • 61. Project team setup. This is innovative technology, and a data-science project is an innovative project, so you need adapted project management. Co-locate the people: business, data analysts, architect, developers, infrastructure/ops. Use Agile practices: work iteratively, with short cycles or sprints (1 or 2 weeks); choose small and achievable objectives for each sprint; use Agile rituals (stand-ups, retrospectives…). Train your team: a Hadoop project requires people skilled in Hadoop infrastructure, Hadoop development and data science. Hire experts, and organize the knowledge transfer from them to your team.
  • 62. Data collection / data quality. As in a classical “data project” (as in BI), a large part of the effort will go into data quality: preparation of the data, clean-up, data transformation. Don’t underestimate this.
  • 63. Typical iteration of data analytics. Analytics iterations: select a subset of the data; select a machine-learning algorithm to use on it; prepare the data (explore, filter, enrich…); divide it into a training dataset and a test dataset; execute the algorithm; measure the prediction error; visualize the results; draw a conclusion from the test with this algorithm and adapt for the next iteration (other data? another algorithm? a technical issue to solve?). And start again!
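One pass through the iteration of slide 63 might look like the sketch below. It uses only the Python standard library, a synthetic one-feature dataset and a deliberately naive threshold classifier; none of these come from the slides, which do not name a dataset or an algorithm for this step. In a real project the "algorithm" would typically come from a library such as sklearn or Mahout.

```python
import random


def make_dataset(n, seed=0):
    """Synthetic stand-in for real data: one numeric feature and a 0/1 label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Feature is drawn around 1.0 for label 0 and around 3.0 for label 1.
        feature = rng.gauss(1.0 + 2.0 * label, 1.0)
        rows.append((feature, label))
    return rows


def split(rows, test_ratio=0.3, seed=0):
    """Shuffle, then divide into a training dataset and a test dataset."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def train_threshold(train):
    """Toy algorithm: put the decision threshold halfway between the class means."""
    means = {}
    for label in (0, 1):
        values = [f for f, l in train if l == label]
        means[label] = sum(values) / len(values)
    return (means[0] + means[1]) / 2


def error_rate(threshold, test):
    """Share of test rows where 'feature above threshold' disagrees with the label."""
    wrong = sum(1 for f, l in test if (f > threshold) != (l == 1))
    return wrong / len(test)


rows = make_dataset(1000)             # select a subset of data
train, test = split(rows)             # divide into training and test datasets
threshold = train_threshold(train)    # execute the algorithm
err = error_rate(threshold, test)     # measure the prediction error
print(f"threshold={threshold:.2f}  test error={err:.1%}")
```

The measured error is what drives the next iteration: if it is too high, the team changes the data, the features or the algorithm and runs the loop again.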