Marko Velic PhD
Data Science Department
Styria Medijski Servisi d.o.o.
marko.velic@styria.hr
UNSUPERVISED LEARNING
(WITH SPARK)
CONTENTS
 Distances
• Euclidean
• Manhattan
• Mahalanobis
• Cosine Similarity
 Clustering
• K-Means
• Example (Spark)
 Examples from Styria practice (not Spark – for now)
10.03.2016 2
MACHINE LEARNING
UNSUPERVISED LEARNING
 Observations are not assigned to classes
 The computer program is not ‘supervised’
throughout the learning process
 Usually the task is to find ‘meaningful’
groups within the data
 Decisions are made based on distances, i.e.
similarities among data points
DISTANCES
• To decide upon the groups we have to introduce a
similarity measure or, conversely, a distance measure
• Pythagorean theorem: Euclidean distance
• dist((2, -1), (-2, 2)) = √((2 - (-2))² + ((-1) - 2)²) = √(4² + (-3)²) = √(16 + 9) = √25 = 5
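The worked example above can be checked with a few lines of Python. This is a minimal sketch, not code from the talk; the function name `euclidean_distance` is an illustrative choice.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square the per-coordinate differences, sum them, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# The slide's example: sqrt(4² + (-3)²) = sqrt(25)
print(euclidean_distance((2, -1), (-2, 2)))  # 5.0
```

The same function works unchanged for points with more than two coordinates, which is the case for the feature vectors used later in the deck.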
DISTANCES & APPROACHES
Source: http://en.wikipedia.org/wiki/Manhattan_distance
 Manhattan/Cityblock/Taxicab
• dist((x, y), (a, b)) = |x - a| + |y - b|
 Normalization!
 Mahalanobis – considers variance
• “multidimensional z-score”
 Cosine similarity
 Autoencoders – ‘unsupervised’ neural nets
 Not unsupervised, but also based on distances
• ReliefF measure, KNN classifier, etc.
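A minimal Python sketch of three items from this slide: the Manhattan distance, cosine similarity, and a simple per-column z-score as one possible reading of “Normalization!”. The function names are illustrative, and the choice of z-scoring (rather than, say, min-max scaling) is an assumption; the slide does not say which normalization was meant.

```python
import math
from statistics import mean, pstdev


def manhattan_distance(p, q):
    """dist((x, y), (a, b)) = |x - a| + |y - b|, for any number of coordinates."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_p * norm_q)


def zscore_columns(rows):
    """Rescale every column to mean 0 and standard deviation 1 (assumed normalization)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has standard deviation 0; divide by 1.0 so it maps to 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


print(manhattan_distance((2, -1), (-2, 2)))  # 7
print(cosine_similarity((1, 0), (0, 1)))     # 0.0
print(zscore_columns([(1, 100), (2, 200), (3, 300)]))
```

Without rescaling, the second column in the last example would dominate any distance simply because its values are a hundred times larger. When features are uncorrelated, the Euclidean distance on z-scored columns coincides with the Mahalanobis distance, which is one way to read the “multidimensional z-score” remark.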
K-MEANS
Simplified:
1. Randomly place centroids
2. Assign each point to its closest centroid
3. Move each centroid to the middle (mean) of its points
4. GOTO 2 (repeat until the assignments stop changing)
Image source: http://www.javabeat.net/2011/05/k-means-clustering-algorithms-in-mahout/
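The simplified loop on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark code used in the demo: the initial centroids are taken to be k randomly chosen data points, the loop stops when assignments no longer change or after `max_iterations`, and the names `kmeans`, `max_iterations` and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Plain k-means on a list of equal-length tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: randomly place centroids (assumption: start from k distinct data points).
    centroids = rng.sample(points, k)
    assignments = None

    def nearest(point):
        # Index of the centroid closest to the point (Euclidean distance).
        return min(range(k), key=lambda c: math.dist(point, centroids[c]))

    for _ in range(max_iterations):
        # Step 2: find the closest centroid for every point.
        new_assignments = [nearest(p) for p in points]
        if new_assignments == assignments:
            break  # nothing moved, so the centroids would not change either
        assignments = new_assignments
        # Step 3: put each centroid in the middle (mean) of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        # Step 4: GOTO 2 (next loop iteration).
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

On this toy data the two groups are far apart, so the first three points end up in one cluster and the last three in the other. On real data the result depends on the random start, which is why k-means is commonly run several times with different seeds.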
DEMO (SPARK!)
 K-means clustering of photos (i.e.
their vector representations)
 Convolutional neural network as
a supervised model and its
outputs as features for
unsupervised models
 Vector representations after the
pooling layers, after every
convolutional layer (Caffe)
 Clustering in Spark
T-SNE CLUSTER VISUALIZATION
SEMI-MANUAL CLUSTERING OF PHOTOS
Grouping photos based on visual features, Enes Deumić, Styria Data Science Team
NATURAL LANGUAGE PROCESSING
t-SNE concept visualization; vecernji.hr, Styria Data Science Team
AUTOMATIC (LEARNED) HIERARCHIES
Hierarchical clustering, Florijan Stamenković, Styria Data Science Team
VISUAL SEARCH EXAMPLE
CONCLUSION
 Distances
• Euclidean
• Manhattan
• Mahalanobis
• Cosine Similarity
 Clustering
• K-Means
 We can nicely combine supervised and unsupervised
features
 SparkNet: Training Deep Networks in Spark
http://arxiv.org/pdf/1511.06051v4.pdf
 https://news.developer.nvidia.com/caffe-on-spark-for-deep-learning-from-yahoo/
THANK YOU!
