Unsupervised learning with Spark


This document discusses unsupervised learning techniques, including distances, clustering algorithms, and examples from Styria practice. It introduces common distance measures like Euclidean, Manhattan, and Mahalanobis distances. For clustering, it describes K-means clustering and provides a Spark example. It also discusses using convolutional neural network outputs as unsupervised learning features and shows examples of semi-manual photo clustering, T-SNE concept visualization, and automatic learned hierarchies from Styria projects.


Marko Velic PhD
Data Science Department
Styria Medijski Servisi d.o.o.
marko.velic@styria.hr
UNSUPERVISED LEARNING
(WITH SPARK)

CONTENTS
▪ Distances
  • Euclidean
  • Manhattan
  • Mahalanobis
  • Cosine Similarity
▪ Clustering
  • K-Means
  • Example (Spark)
▪ Examples from Styria practice (not Spark – for now)
10.03.2016 2

UNSUPERVISED LEARNING
▪ Observations are not assigned to classes
▪ The computer program is not ‘supervised’ throughout the learning process
▪ Usually the task is to find ‘meaningful’ groups within the data
▪ Decisions are based on distances, i.e. similarities among data points

DISTANCES
• To decide on the groups, we have to introduce a similarity measure or, conversely, a distance measure
• Pythagorean theorem – Euclidean distance
• dist((2, -1), (-2, 2)) = √((2 - (-2))² + ((-1) - 2)²) = √((2 + 2)² + (-1 - 2)²) = √((4)² + (-3)²) = √(16 + 9) = √25 = 5
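The worked example can be checked with a few lines of Python. This is a minimal sketch; the function name `euclidean` is an illustrative choice and does not come from the talk.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square root of the summed squared coordinate differences (Pythagorean theorem).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The slide's example: the differences are 4 and -3, so the distance is sqrt(16 + 9).
print(euclidean((2, -1), (-2, 2)))  # 5.0
```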

DISTANCES & APPROACHES
Source: http://en.wikipedia.org/wiki/Manhattan_distance
▪ Manhattan/Cityblock/Taxicab
  • dist((x, y), (a, b)) = |x - a| + |y - b|
▪ Normalization! (otherwise features on larger scales dominate the distance)
▪ Mahalanobis – considers variance
  • “multidimensional z-score”
▪ Cosine similarity
▪ Autoencoders – ‘unsupervised’ neural nets
▪ Not unsupervised, but also based on distances
  • ReliefF measure, KNN classifier, etc.
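The measures listed above can be sketched in plain Python. All function names here are illustrative assumptions, and the Mahalanobis version is restricted to two dimensions so that the covariance inverse can be written out by hand; a real pipeline would more likely use a linear algebra library.

```python
import math
import statistics

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors; ignores their lengths."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def standardize(values):
    """Z-score normalization of one feature column (population standard deviation)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance in 2-D for a symmetric covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b  # must be non-zero for the inverse to exist
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # (dx, dy) times inverse(cov) times (dx, dy), written out for the 2x2 case.
    return math.sqrt((c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)

print(manhattan((2, -1), (-2, 2)))  # 7

cov = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along x, variance 1 along y
print(mahalanobis_2d((2, 0), (0, 0), cov))  # 1.0
print(mahalanobis_2d((0, 2), (0, 0), cov))  # 2.0
```

The last two calls show the "multidimensional z-score" idea: the same Euclidean offset of 2 counts as one standard deviation along the high-variance axis and as two along the low-variance axis.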

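K-MEANS
Simplified:
1. Randomly place the centroids
2. Assign each point to its closest centroid
3. Move each centroid to the middle (mean) of its assigned points
4. GOTO 2 (repeat until the assignments stop changing)
Image source: http://www.javabeat.net/2011/05/k-means-clustering-algorithms-in-mahout/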
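The four steps can be written out as a small, single-machine Python sketch. This is not the Spark implementation from the demo. The function name `kmeans`, the choice to place the initial centroids on randomly picked data points, and the rule that an empty cluster keeps its old centroid are all assumptions made for illustration.

```python
import math
import random

def _dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iter=100, seed=0):
    """Simplified K-means; returns (centroids, labels) for a list of tuples."""
    rng = random.Random(seed)
    # Step 1: random placement, here by picking k distinct data points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: _dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids would not move
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 4: loop back to step 2.
    return centroids, labels

# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.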

DEMO (SPARK!)
▪ K-means clustering of photos (i.e. their vector representations)
▪ A convolutional neural network serves as a supervised model, and its outputs serve as features for unsupervised models
▪ Vector representations taken after the pooling layers, after every convolutional layer (Caffe)
▪ Clustering in Spark
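The slides do not include the demo code, so the following is only a sketch of how the clustering step might look with PySpark's DataFrame-based `pyspark.ml.clustering.KMeans`. The function name `cluster_photo_vectors`, the column names, and the application name are hypothetical, and the talk may well have used a different Spark API. The feature vectors are assumed to have been extracted from the network beforehand.

```python
def cluster_photo_vectors(vectors, k, seed=42):
    """Cluster feature vectors with Spark ML K-means; returns one cluster id per vector."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("photo-kmeans-sketch").getOrCreate()
    # One row per photo: an id plus its dense feature vector.
    rows = [(i, Vectors.dense(v)) for i, v in enumerate(vectors)]
    df = spark.createDataFrame(rows, ["id", "features"])
    model = KMeans(k=k, seed=seed, featuresCol="features").fit(df)
    # transform() adds a "prediction" column holding the assigned cluster id.
    predictions = model.transform(df).orderBy("id").collect()
    return [row["prediction"] for row in predictions]
```

A call such as `cluster_photo_vectors(cnn_features, k=10)` would need a working Spark installation; `cnn_features` stands for a list of equal-length numeric lists and is likewise a placeholder.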

SEMI-MANUAL CLUSTERING OF PHOTOS
Grouping photos based on visual features, Enes Deumić, Styria Data Science Team

NATURAL LANGUAGE PROCESSING
t-SNE concept visualization; vecernji.hr, Styria Data Science Team

AUTOMATIC (LEARNED) HIERARCHIES
Hierarchical clustering, Florijan Stamenković, Styria Data Science Team

CONCLUSION
▪ Distances
  • Euclidean
  • Manhattan
  • Mahalanobis
  • Cosine Similarity
▪ Clustering
  • K-Means
▪ We can nicely combine supervised and unsupervised features
▪ SparkNet: Training Deep Networks in Spark http://arxiv.org/pdf/1511.06051v4.pdf
▪ https://news.developer.nvidia.com/caffe-on-spark-for-deep-learning-from-yahoo/


3. MACHINE LEARNING


9. T-SNE CLUSTER VISUALIZATION


14. VISUAL SEARCH EXAMPLE


16. THANK YOU!
