Zenika matinale spark-zeppelin_ml

Big Data Morning Session ("Matinale")
Spark and Machine Learning
Zenika Lyon, 25/05/16

Hervé RIVIERE
Big Data / NoSQL developer
Couchbase trainer
Fabrice SZNAJDERMAN
Java / Scala / Web developer
Java / Scala trainer
Co-organizer of ScalaIO 2016 (Lyon, 27-28 October)

Big Data: Spark + Machine Learning
Agenda
1 Big Data: 2016 overview (15 min)
2 Introduction to Apache Spark and Apache Zeppelin (45 min)
3 Break (30 min)
4 Demystifying Machine Learning (45 min)

From 2014 to 2017…
2014
• POC / experimentation
• Analytical use
• Hadoop Map-Reduce / HDFS / Pig / Hive / HBase / Storm…
2015
• Data Lake industrialization / building an analytical Big Data platform
• Streaming POC / operational Big Data platform
• Spark / Cassandra / HDFS / Kafka / Storm / Samza / Mesos
2016
• Streaming industrialization / operational Big Data platform
• Experimentation / POC for predictive Big Data / Machine Learning
• Kafka / Spark / Flink / HDFS / web notebooks / Cassandra / Mesos…
2017
• Industrialization of predictive Big Data / Machine Learning? Internet of Things?
• Kafka Streams? / Kudu? / Spark 2.0? / Flink? …

What is Big Data for?
• Business intelligence: descriptive statistics on data with a high information density
Example: CRM data in a database
• Big Data: data with a low information density, but whose large volume makes it possible to infer laws / rules → inferential statistics
Example: sensor data in a Data Lake
• Fast Data: transform data in real time instead of running daily / weekly / monthly batch jobs
Example: website data in Kafka topics

Example projects
• 360° customer view (banking / retail / services…)
o React to certain cross-channel events
o Recommendation
o Business-specific ad hoc analysis (marketing, fraud…)
• Log / sensor data analysis (industry, services, IT…)
• Automate human monitoring
• Analyze, then optimize
• Offload business intelligence tools with Big Data technologies
• For scalability
• For new capabilities (real time, more flexible schema, speed…)

Our services
• Big Data architecture
• Development and industrialization
• Java / Scala POCs
• Dataviz
• Training
• Industrialization of machine learning algorithms
• NoSQL
• Technical expertise
• Innovation workshops

[Diagram: Big Data technology landscape]
Frameworks
• Categories: Streaming, Query/SQL, ETL, Machine Learning, Search Engine, Scheduler, Service Discovery, Resource Manager
• Tools: Kafka, NiFi, Flink, Storm, Zookeeper, Spark, Yarn, Mesos, SolR, ElasticSearch, Mahout, Tez, Slider, Oozie, Hive, Impala, Hawq, Drill, Pig, MR
Storage/NoSQL
• Categories: File System, OLAP, Columns, Document, Key-Value, Graph, In-memory/Cache, Time-Series
• Tools: HDFS, Hbase, Cassandra, MongoDB, Neo4j, Titan, Couchbase, Druid, InfluxDB, Hazelcast, Redis, Aerospike, Kylin

Big Data architectures
[Diagram: two layers, each labeled with incoming Data and outgoing Queries]
• Real-time / operational layer
• Batch / analytical layer

Analytical
From 3 to 300 nodes!
Store / process a (very) large volume of data (terabytes…) at regular intervals
An analytical system, not an operational one!
[Diagram: storage and execution tools; legend: commonly used tool / complement or alternative]
Scheduler
• NiFi
• Oozie
Web notebook
• Zeppelin
• Jupyter
Data mining / Machine learning
• R / Python
• Mahout / H2O
• Dataiku
I/O
• Sqoop
• Kafka
Resource negotiator
• YARN
• Mesos

Operational
From 3 to 300 nodes!
Process a large volume of data in real time
An operational system, not an analytical one!
[Diagram: storage and execution tools]
Schema registry
• Avro
API I/O
• Akka
• Spring
• Play…
Resource negotiator
• Yarn
• Mesos
SMACK

Our consulting and training partners
NoSQL
Languages & Big Data ecosystem
Integration & continuous delivery

Spark &
Zeppelin
Spark and ML morning session
25/05/16
Fabrice Sznajderman

Agenda
●Apache Spark
●Apache Zeppelin
Introduction

Big picture
Spark introduction

What is it about?
●A cluster computing framework
●Open source
●Written in Scala

History
2009 : Project started at UC Berkeley's AMPLab
2010 : Project open-sourced
2013 : Became an Apache project; creation of the Databricks company
2014 : Became a top-level Apache project and the most active project in the Apache Software Foundation (500+ contributors)
2014 : Release of Spark 1.0, 1.1 and 1.2
2015 : Release of Spark 1.3, 1.4 and 1.5 (1.6 followed in early January 2016)
2015 : IBM, SAP… investment in Spark
2015 : 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam
2016 : New Spark Summit in San Francisco in June 2016

Where is Spark used?
Source : http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf?t=1443057549926
The results reflect the answers and opinions of over 1,417 respondents representing over 842 organizations.

What is it used for?

Multi-language

Spark Shell
●REPL
●Learn API
●Interactive Analysis

Definition
●Resilient
●Distributed
●Datasets

Properties
●Immutable
●Serializable
●Can be persisted in RAM and / or on disk
●Simple or complex types

Use as a collection
●DSL
●Monadic type
●Several operators
–map, filter, count, distinct, flatMap, ...
–join, groupBy, union, ...

Created from
●A collection (List, Set)
●Various file formats
–json, text, Hadoop SequenceFile, ...
●Various databases
–JDBC, Cassandra, ...
●Other RDDs
Sources must be natively distributed (HDFS, Cassandra, ...); otherwise the network becomes the bottleneck

Sample
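import org.apache.spark.{SparkConf, SparkContext}

// Configure a local, single-threaded Spark application
val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)
// Load a text file as an RDD of lines
val rdd = sc.textFile("data.csv")
// Count the lines longer than 10 characters
val nb = rdd.map(s => s.length).filter(i => i > 10).count()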

Lazy-evaluation
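●Intermediate operators (transformations) are lazy: they only describe the computation
–map, filter, distinct, flatMap, …
●Final operators (actions) trigger the actual execution
–count, mean, fold, first, ...
// Nothing runs until count() is called
val nb = rdd.map(s => s.length).filter(i => i > 10).count()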
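The lazy behaviour can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code: generators stand in for intermediate operators and `sum` stands in for the final operator. The `lines` data and the `traced_length` helper are made up for the illustration.

```python
# Plain-Python analogy of lazy evaluation (not the Spark API).
calls = []  # records every time the mapping function actually runs

def traced_length(s):
    calls.append(s)
    return len(s)

lines = ["short", "a much longer line", "another long line here"]  # hypothetical data

# "Intermediate operators": building the chain does not run anything yet.
lengths = (traced_length(s) for s in lines)
long_ones = (n for n in lengths if n > 10)
calls_before = len(calls)

# "Final operator": consuming the chain triggers the whole computation.
nb = sum(1 for _ in long_ones)
calls_after = len(calls)
```

`calls_before` stays at zero until the final step consumes the chain, which is the same idea as Spark waiting for `count()` before reading the file.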

Caching
●Reuse an intermediate result
●cache operator
●Avoids re-computing
// The mapped lengths are computed once, then reused by both actions
val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
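To make the benefit concrete, here is a plain-Python analogy (again not the Spark API) that counts how often the mapping step runs with and without a reused intermediate result. The `stats` counter, the `expensive_length` helper and the data are hypothetical.

```python
# Plain-Python analogy of caching an intermediate result (not the Spark API).
stats = {"runs": 0}

def expensive_length(s):
    stats["runs"] += 1  # count every real computation
    return len(s)

lines = ["short", "a much longer line", "another long line here"]  # hypothetical data

# Without a cache, each final operation re-runs the mapping step.
nb_uncached = sum(1 for s in lines if expensive_length(s) > 10)
sum_uncached = sum(n for n in (expensive_length(s) for s in lines) if n > 10)
runs_without_cache = stats["runs"]

# With a cache, the mapping step runs once and both results reuse it.
stats["runs"] = 0
cached = [expensive_length(s) for s in lines]
nb = sum(1 for n in cached if n > 10)
total = sum(n for n in cached if n > 10)
runs_with_cache = stats["runs"]
```

The results are identical in both variants; only the number of computations changes, which matters when the intermediate step is expensive.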

Distributed
Architecture
Core concept

Run locally
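// Pick one master URL:
val master = "local"        // a single worker thread
// val master = "local[*]"  // one worker thread per available core
// val master = "local[4]"  // four worker threads
val conf = new SparkConf().setAppName("sample")
  .setMaster(master)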

Run on cluster
val master = "spark://..."
val conf = new SparkConf().setAppName("sample")
.setMaster(master)

Cluster
[Diagram: three Spark clients submit work to a Spark Master, which coordinates three Spark Slaves; each slave hosts executors (E)]

Composed of
●Spark Core
●Spark Streaming, MLlib, GraphX, Spark SQL
●ML Pipeline, DataFrames
●Several data sources

Several data sources
http://prog3.com/article/2015-06-18/2824958

Spark SQL
●Structured data processing
●SQL Language
●DataFrame

DataFrame 1/3: What?
●A distributed collection of rows organized into named columns
●An abstraction for selecting, filtering, aggregating and plotting structured data
●Provides a schema
●Not an RDD replacement

DataFrame 2/3: Why?
●RDDs are more efficient than what came before (Hadoop)
●But RDDs are still too complicated for common tasks
●DataFrames are simpler and faster

DataFrame 3/3: How?
●Available since Spark 1.3
●The DataFrame API is just an interface
–The implementation is done once, in the Spark engine
–All languages benefit from the optimizations without rewriting anything

Spark Streaming
●Framework built on top of the RDD and DataFrame APIs
●Real-time data processing
●The RDD becomes a DStream here
●Same as before, but the dataset is not static

Spark Streaming
Internal flow
http://spark.apache.org/docs/latest/img/streaming-flow.png

Spark Streaming
Inputs / Outputs
http://spark.apache.org/docs/latest/img/streaming-arch.png

Spark MLlib
●Makes practical machine learning scalable and easy
●Provides common learning algorithms & utilities

Spark MLlib
●Divided into 2 packages
– spark.mllib
– spark.ml

Spark MLlib: spark.mllib
●Original API, based on RDDs
●Each model has its own interface

Spark MLlib: spark.ml
●Provides a uniform set of high-level APIs
●Built on top of DataFrames
●Pipeline concepts
–Transformer
–Estimator
–Pipeline

Spark MLlib
spark.ml
●Transformer : transform(DF)
–maps a DataFrame to a new one by adding a column
–e.g. predicts the label and adds the result in a new column
●Estimator : fit(DF)
–a learning algorithm
–produces a model from a DataFrame

Spark MLlib
spark.ml
●Pipeline
–a sequence of stages (transformers or estimators)
–run in a specific order
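The three concepts fit together as follows. This is a plain-Python sketch of the idea only: the classes `AddLength`, `MeanThreshold`, `ThresholdModel` and `Pipeline` and the helper `apply_stages` are illustrative stand-ins and not the spark.ml API, and rows are simple dictionaries rather than DataFrames.

```python
# Plain-Python sketch of the Transformer / Estimator / Pipeline idea.
# These classes are illustrative stand-ins, not the spark.ml API.

class AddLength:
    """Transformer: returns new rows with an extra 'length' column."""
    def transform(self, rows):
        return [{**r, "length": len(r["text"])} for r in rows]

class ThresholdModel:
    """Transformer produced by training: adds a 'prediction' column."""
    def __init__(self, threshold):
        self.threshold = threshold
    def transform(self, rows):
        return [{**r, "prediction": int(r["length"] > self.threshold)} for r in rows]

class MeanThreshold:
    """Estimator: fit() learns a threshold (the mean length) and returns a model."""
    def fit(self, rows):
        mean = sum(r["length"] for r in rows) / len(rows)
        return ThresholdModel(mean)

class Pipeline:
    """Runs stages in order; estimators are fitted, then used as transformers."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return fitted

def apply_stages(fitted, rows):
    # Apply already-fitted stages to new rows, in the same order.
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

train = [{"text": "ab"}, {"text": "abcdef"}]  # hypothetical training rows
fitted = Pipeline([AddLength(), MeanThreshold()]).fit(train)
result = apply_stages(fitted, [{"text": "abcdefgh"}])
```

The order matters: the estimator can only learn its threshold after the first stage has added the `length` column.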

Spark 2.0: 3 axes
●Easier
●Faster
●Smarter

Spark 2.0: Easier
●Unifying DataFrames and Datasets in Scala/Java
●SparkSession (replaces SQLContext & HiveContext)
●Simpler, more performant Accumulator API
●spark.ml package emerges as the primary ML API

Spark 2.0: Faster
“According to our 2015 Spark Survey, 91% of users consider performance as the most important aspect of Spark.” (Databricks)

Spark 2.0: Faster
●The second generation of the Tungsten engine
●Builds upon ideas from
– Modern compilers
– Massively parallel processing (MPP) databases
●Improvements to Spark SQL’s Catalyst optimizer

Spark 2.0: Smarter
●Structured Streaming API
●Based on the Catalyst optimizer
●Unifying DataFrames and Datasets

Spark 2.0: Try it
This technical preview version is now available on Databricks : https://databricks.com/try-databricks

Big picture
Zeppelin introduction

What is it about?
●“A web-based notebook that enables interactive data analytics”
●100% open source
●Undergoing Incubation but …

Multi-purpose
●Data Ingestion
●Data Discovery
●Data Analytics
●Data Visualization &
Collaboration

Multiple language backends
●Scala
●shell
●python
●markdown
●your own language, by creating your own interpreter

Data visualization
An easy way to build charts from data

Demystifying Machine Learning
Spark and ML morning session
25/05/16
Hervé RIVIERE

Demystifying Machine Learning
Agenda
1 Machine Learning?
2 Fundamentals
3 Data preparation
4 Algorithms
5 Tools
6 Setting up an ML project

Machine learning: “Field of study that gives computers the ability to learn without being explicitly programmed.” Arthur Samuel
Solves tasks that people are good at, but traditional computation is bad at.
Programs that write new programs

Orange: “Sauvons les livebox” (“Let's save the Liveboxes”)
Prevent lightning damage
→ Ask the customer to unplug their equipment
Fnac: marketing targeting / sending recommendation emails
Move from a solution based on static business rules to machine learning algorithms
Optimize ROI

→ Replace static business rules with a self-learning algorithm.
1- Risk measurement (example: loan rate depending on the application file)
2- Recommendation (example: film recommendations, ads)
3- Revenue prediction
4- Prediction of customer behaviour (churn, hotline calls…)

→ Be able to detect and react to weak signals
1- Forecasting and / or detecting a failure
2- Medical diagnosis
3- Machine control: optimize power consumption

→ Better understand a dataset through the correlations found by ML algorithms
1- Detect / identify weak signals (e.g. fraud, marketing…)
2- Segmentation into different categories (example: advertising campaign)

Machine Learning Regression
Deep Learning Clustering
Data Science Feature engineering
(….)

Predicting a numeric value: regression algorithm
Type        Area (m²)   Rooms   Year built   Price (€)
Apartment   120         4       2005         200 000
House       200         7       1964         250 000
House       450         15      1878         700 000
Apartment   300         8       1986         ?????
Predictive variables = features: Type, Area, Rooms, Year built
Numeric target variable: Price

Predicting a textual value: classification algorithm
Type        Area (m²)   Rooms   Year built   Price (€)
Apartment   120         4       2005         200 000
House       200         7       1964         250 000
House       450         15      1878         700 000
????        300         8       1986         600 000
Predictive variables = features: Area, Rooms, Year built, Price
Textual target variable = class: Type

[Chart: price (K€) for each observation, showing the actual values, the predictive function f(X) and the random noise a]
Actual price = f(X) + a
f(X): ML model
a: unpredictable deviation (random noise)
The prediction is never exact!
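A small simulation shows why the prediction is never exact. The sketch assumes a made-up "true" function `f` and Gaussian noise with an arbitrary standard deviation of 5 K€; neither comes from the slides.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def f(x):
    # Hypothetical "true" predictive function: 20 K€ per unit of x.
    return 20 * x

observations = []
for x in range(1, 21):
    a = random.gauss(0, 5)  # unpredictable noise, assumed Gaussian
    observations.append((x, f(x) + a))

# Even the perfect model f is off by exactly the noise term.
errors = [abs(y - f(x)) for x, y in observations]
mean_error = sum(errors) / len(errors)
```

No model can bring `mean_error` down to zero here, because the remaining gap is the noise itself.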

If “a” is too large…
Actual price = f(X) + a
f(X): ML model
a: unpredictable deviation
The prediction is never exact!
The data is not predictable!
[Chart: price (K€) for each observation, where the actual values are dominated by noise]

[Diagram: machine learning workflow]
Sources: DWH, Open Data, Web crawling, …
→ Hypotheses: target variable, predictive variables
→ Preparation: training dataset with predictive variables and target
→ Model building: generate a program (i.e. the model)
→ Production: use the generated program
→ Prediction

• Predicting the near future based on the past
• Approximating a pattern from examples
• Copying a behaviour as a “black box” (inputs and outputs only)
• Algorithms that adapt

[Diagram: machine learning workflow, preparation step]
Sources (DWH, Open Data, Web crawling, …) → Hypotheses → Preparation → Training dataset with predictive variables and target → Model → Prediction

- Completeness: missing fields?
- Scale: revenue per country and number of purchases per region!
- Accuracy: is the data real?
- Freshness: data from the 19th century?
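Two of these checks are easy to automate. The sketch below counts missing values per field (completeness) and flags old records (freshness); the records and the cut-off year are hypothetical values chosen for the example.

```python
rows = [
    {"country": "FR", "revenue": 120.0, "year": 2015},
    {"country": "DE", "revenue": None, "year": 2016},
    {"country": None, "revenue": 80.0, "year": 1890},
]  # hypothetical records

# Completeness: count missing values per field.
missing = {field: sum(1 for r in rows if r[field] is None) for field in rows[0]}

# Freshness: flag records older than an assumed cut-off year.
CUTOFF_YEAR = 2000  # assumption for the example
stale = [r for r in rows if r["year"] < CUTOFF_YEAR]
```

What to do with the flagged rows (drop, fill in, keep) is a business decision that the code cannot make on its own.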

- Format: CSV, images, JSON, databases → JSON
- Aggregate
- Enrich
A    B    C    D    E    F    G    H
10   3    2    5    7    43   2    4
1    24   34   5    876  7    6    52
43   24   1    558  23   4    5    6
→ ML algorithms

Mean of X: 9
Mean of Y: 7.5
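These two means match those of Anscombe's quartet, a classic set of series that share the same summary statistics but look very different once plotted. The slide does not name it, so that link is an assumption; the sketch uses two of the four series to show that identical means say little about the shape of the data.

```python
# Two of the four series from Anscombe's quartet (assumed reference).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(values):
    return sum(values) / len(values)

mean_x = mean(x)
mean_y1 = round(mean(y1), 2)
mean_y2 = round(mean(y2), 2)
same_points = y1 == y2  # the series differ even though their means agree
```

This is one reason to look at the data during preparation instead of relying on aggregates alone.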

• A potentially (very…) long task
• Thankless?
• Directly influences the model
• Good data preparation is worth more than good algorithms!

[Diagram: machine learning workflow, model step]
Sources (DWH, Open Data, Web crawling) → Hypotheses → Training dataset → Model → Prediction

Illustrated in 2D; most models have 5, 10, even 1,000 dimensions
[Chart: price (K€) per observation, actual values with a linear fit]
Linear: f(X) = aX + b (with “a” and “b” discovered automatically)
[Chart: price (K€) per observation, actual values with a polynomial fit]
Polynomial: f(X) = aX^y + bX^z + … (with “a”, “b”, “y” and “z” discovered automatically)

Program generated by the algorithm after training: a mathematical formula
House price = 2*nbPieces + 3*surface

The algorithm makes successive attempts to find the curve that minimizes the error
Simple to visualize / understand
Supervised algorithm (requires prior training)
Can be used for predictive or descriptive purposes
Very sensitive to the initial preparation (outliers…)
Assumes the data can be modeled as equations
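For the linear case f(X) = aX + b, the coefficients that minimize the squared error can be computed directly. The sketch below uses the closed-form least-squares solution rather than successive attempts, and the training points are made up (generated from y = 2x + 1) so the expected coefficients are known.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training data generated from y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)

# The "generated program" is just the formula with the learned coefficients.
prediction = a * 6 + b
```

A single outlier in `ys` would shift both coefficients, which illustrates the sensitivity to data preparation noted above.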

Price of a house: if 10+ rooms…
[Decision tree diagram: the root splits on Type (House / Apartment); further yes / no splits on Rooms > 10, Area > 300, Floor <= 3 and City = Paris; leaf prices: 300 000 €, 200 000 €, 900 000 €, 700 000 €, 400 000 €, 600 000 €]

Program generated by the algorithm after training: conditions
if (surface > 10 && rooms == 3) {
  if (type == "House") 250 000
  else if (type == "Apartment") 150 000
} else 145 000

Supervised algorithm (requires prior training)
Less sensitive to the quality of data preparation
Parameters to set: number of trees, depth, etc.
Several trees trained on different subsets can be combined → Random Forest
Random forest is currently one of the best-performing algorithms
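The idea of combining trees trained on varied subsets can be sketched in a few lines. This is a deliberately reduced illustration, not a full random forest: each "tree" is a one-level split (a stump) on the area only, subsets are drawn with replacement, and the function names `fit_stump` and `fit_forest` are made up. The three training rows are the (area, price) pairs from the example table.

```python
import random

def fit_stump(rows):
    """One-level tree: find the area threshold that best splits the prices."""
    best = None
    for threshold in sorted({s for s, _ in rows}):
        left = [p for s, p in rows if s <= threshold]
        right = [p for s, p in rows if s > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        error = sum((p - lm) ** 2 for p in left) + sum((p - rm) ** 2 for p in right)
        if best is None or error < best[0]:
            best = (error, threshold, lm, rm)
    if best is None:  # no usable split in this subset: predict the mean
        m = sum(p for _, p in rows) / len(rows)
        return lambda s: m
    _, threshold, lm, rm = best
    return lambda s: lm if s <= threshold else rm

def fit_forest(rows, n_trees=5, seed=0):
    """Train each stump on a random subset, then average their predictions."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]
    return lambda s: sum(t(s) for t in trees) / len(trees)

# (area in m², price in €) taken from the example table
rows = [(120, 200_000), (200, 250_000), (450, 700_000)]
single = fit_stump(rows)(300)
estimate = fit_forest(rows)(300)
```

The single stump answers with one of its two leaf values, while the forest averages several such answers obtained from different subsets.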

Sick / Healthy
Film recommendation
Turning a regression problem (e.g. the price of a house) into a classification problem:
“Will this house sell for more than the average price in the city?” Yes / No
→ Minimize the error

Works with exactly 2 categories!
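Turning the house price question into a Yes / No label takes one comparison per row. The sketch assumes, for the sake of the example, that the three prices from the earlier table are all the sales in the city.

```python
prices = [200_000, 250_000, 700_000]  # prices from the example table

# Assumption: these three sales are the whole city, so this is "the average price".
mean_price = sum(prices) / len(prices)

# The numeric target becomes a two-class target.
labels = ["Yes" if p > mean_price else "No" for p in prices]
```

A two-category classifier can then be trained on `labels` instead of on the raw prices.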

[Diagram: customers grouped by what they buy; yes / no criteria: Drink = alcohol, Price > 30 €, Minced steak, Drink = wine; resulting categories: Child, Teenager, Adult, Senior]

Unsupervised algorithm (no prior training on labeled data)
Used for recommendation algorithms (Netflix)
The number of categories is set by the user or determined dynamically
The name / description of each category has to be defined by the user
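The slides do not name a specific algorithm; k-means is used below as one common example of unsupervised grouping with a user-defined number of categories. The sketch works on a single numeric feature, and the function `kmeans_1d`, the basket totals and the way initial centers are picked are all choices made for the illustration.

```python
def kmeans_1d(values, k, iterations=10):
    """Minimal 1-D k-means with deterministic initial centers (requires k >= 2)."""
    ordered = sorted(values)
    # Spread the initial centers across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

basket_totals = [5, 7, 6, 40, 42, 45]  # hypothetical basket totals in €
centers, groups = kmeans_1d(basket_totals, k=2)
```

The algorithm returns two groups but no names for them; deciding that one is "small baskets" and the other "large baskets" is left to the user, as stated above.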

Prototyping
Think big, start small
Prototyping: test hypotheses quickly and autonomously
• R
• SAS
• Scikit-learn (Python)
• Dataiku
• Excel
• Tableau
• ….

Industrialization: automation, performance, maintainability, large data volumes…
Significant code rewriting work!
• ETL component upstream
• Model building:
• “Small” data volume: industrialized R / SAS / Python
• “Large” data volume: Spark / Hadoop / Mahout (distributed computing)
• Cloud solutions (Azure ML / Amazon ML / Google Prediction API)
• Model distribution downstream:
• Web service
• Embedded in an application
• …

Big Data et machine learning : Manuel du data scientist, Dunod
MOOC Machine Learning, Coursera, Andrew Ng
