Configuring Mahout Clustering Jobs - Frank Scholten


See conference video: http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011. For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific, and instead want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are the clustered search results of Google News, or tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. It is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.


Configuring Mahout Clustering Jobs
Frank Scholten, DutchWorks
frank.scholten@dutchworks.nl, 19 October 2011

My Background

Frank Scholten @Frank_Scholten

Software Developer at

Blogger at

user & contributor


Agenda

What is clustering?

Intro to Mahout

Clustering

Clustering introduction

So much data...

How to get a nice overview?

What is clustering?

Grouping & summarizing data

Unsupervised machine learning

“...the assignment of a set of observations
into subsets so that observations in the
same clusters are similar in some sense...”
Source: Wikipedia


Applications

Market segmentation

Species identification

Machine vision

Information retrieval & search

...and many more!


2-D Clustering example

[Figure: 2-D plot of points grouped into clusters, annotated with the intra-cluster distance and the inter-cluster distance. Legend: point, cluster, cluster center.]
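
The two distances named on this slide can be sketched in a few lines of plain Java. This is only an illustration, not Mahout code: the class name ClusterDistances is made up, Euclidean distance is an assumption, and "intra-cluster distance" is taken here as the mean distance of a cluster's points to its center, while "inter-cluster distance" is taken as the distance between two centers. Other definitions exist.

```java
import java.util.List;

// Hypothetical helper, not part of Mahout.
final class ClusterDistances {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Cluster center: the per-dimension mean of the cluster's points.
    static double[] center(List<double[]> points) {
        double[] c = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < c.length; i++) {
                c[i] += p[i] / points.size();
            }
        }
        return c;
    }

    // Assumed definition: mean distance from each point to its cluster center.
    static double intraClusterDistance(List<double[]> points) {
        double[] c = center(points);
        double total = 0.0;
        for (double[] p : points) {
            total += distance(p, c);
        }
        return total / points.size();
    }

    // Assumed definition: distance between the centers of two clusters.
    static double interClusterDistance(List<double[]> a, List<double[]> b) {
        return distance(center(a), center(b));
    }
}
```

A good clustering tends to have small intra-cluster distances and large inter-cluster distances.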

K-Means algorithm

Select K random vectors as initial cluster centers

Specify distance measure + convergence threshold

Every iteration
● Assign each vector to its closest cluster
● Recompute each cluster center
● Converged when no center moves more than the threshold
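
The iteration described on this slide can be sketched in plain Java as follows. This is a minimal single-machine illustration, not Mahout's MapReduce implementation. The class name KMeansSketch is made up, Euclidean distance is assumed as the distance measure, and the initial centers are passed in as a parameter (rather than picked at random) so that the result is reproducible.

```java
// Hypothetical single-machine sketch, not Mahout's implementation.
final class KMeansSketch {

    // Assumed distance measure: Euclidean.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to p (the first one wins on ties).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(p, centers[c]) < distance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Iterates until no center moves more than threshold, or maxIterations is reached.
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              double threshold, int maxIterations) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assign each vector to its closest cluster.
            for (double[] p : points) {
                int best = nearest(p, centers);
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += p[d];
                }
            }
            // Recompute each center and check how far it moved.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(centers[c], updated) > threshold) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }
}
```

For example, clustering the four points (0,0), (0,2), (10,0), (10,2) from the initial centers (0,0) and (10,0) moves the centers to (0,1) and (10,1) in the first iteration and stops in the second, because neither center moves any further.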


Classification (Is this SPAM?)

Collaborative Filtering

Clustering

And much more!

The Project

Apache project started in 2008

Scalable machine learning, often with Hadoop

Steadily growing community

Version 0.6 coming soon


bin/mahout
frank@frankthetank:~$ mahout
    no HADOOP_HOME set, running locally
    An example program must be given as the first argument.
    Valid program names are:
    arff.vector: : Generate Vectors from an ARFF file or directory
    canopy: : Canopy clustering
    cat: : Print a file or resource as the logistic regression models would see it
    ...

Help

frank@frankthetank:~$ mahout kmeans --help
  usage: <command> [Generic Options] [Job-Specific Options]
  Generic Options:
    ...
  Job-Specific Options:
    --input (-i) input      Path to job input directory.
    --output (-o) output    The directory pathname for output.
    ...

Java Drivers

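  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ToolRunner;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;

  // input, output and clusters are path strings defined elsewhere
  Configuration conf = new Configuration();
  String[] args = new String[] {
    "--input", input,
    "--output", output,
    "--clusters", clusters,
    "--clustering",
    "--numClusters", "10"
  };

  ToolRunner.run(conf, new KMeansDriver(), args);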

Text clustering process

Text files or Lucene index → Sequence files → Vectors → K-means → Clusters

Vectors: [ 0.03, 0.95, 0.45, 0.34 ], [ 0.02, 0.98, 0.73, 0.55 ]
Find n-grams → Dictionary
K-means output: (CL-1, [0.32, 0.6, .. ]), (CL-1, [0.76, 0.1, .. ]), (CL-2, [0.98, 0.2, .. ])
Points: (CL-1,23), (CL-1,37), (CL-2,45)
Cluster labels: Quick fox, Dog
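
The "find n-grams, build a dictionary, produce vectors" part of this process can be sketched in plain Java. This is a simplified illustration, not what Mahout does internally: the class name NGramDictionary is made up, tokenization is a naive split on non-letters, every adjacent word pair is emitted as a bigram (a real n-gram step is usually more selective), and vector entries are raw term counts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Mahout code.
final class NGramDictionary {

    // Lower-cases the text, splits on non-letters, and returns unigrams followed by bigrams.
    static List<String> terms(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        List<String> result = new ArrayList<>(words);
        for (int i = 0; i + 1 < words.size(); i++) {
            result.add(words.get(i) + " " + words.get(i + 1));
        }
        return result;
    }

    // Dictionary: each distinct term gets the next free vector index, in order of first appearance.
    static Map<String, Integer> dictionary(List<String> documents) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String term : terms(doc)) {
                if (!dict.containsKey(term)) {
                    dict.put(term, dict.size());
                }
            }
        }
        return dict;
    }

    // Term-count vector for one document; terms missing from the dictionary are ignored.
    static double[] vectorize(String document, Map<String, Integer> dict) {
        double[] vector = new double[dict.size()];
        for (String term : terms(document)) {
            Integer index = dict.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }
}
```

With the two documents "Quick fox" and "Dog", the dictionary holds four terms: quick, fox, quick fox and dog, so every document becomes a four-dimensional vector.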


Text clustering programs

Input: text files or Lucene index

$ mahout seqdirectory    → Sequence files
$ mahout seq2sparse      Sequence files → Vectors, e.g. [ 0.03, 0.95, .. ], [ 0.29, 0.98, .. ]
$ mahout kmeans          Vectors → Clusters

Clustering

Publicly available monthly dumps

Posts: ~5.5 GB, ~1.4 M questions (April 2011)

Let's use Mahout to extract a tag cloud!

Clustering

[Diagram: pipeline of four steps.]
Pre-process: XML & HTML → post ID & title, text
Vectorize: text → [ 0,1,0,1,1,1,0,0,1,0,1 ], [ 0,1,1,1,1,0,0,0,0,0,1 ]
Cluster
Index: join content & clusters

Example cluster labels: Java, Git, Lucene, Regular expressions, Version control

Clustering
How to implement?

Pre-process: XML & HTML parsing

Vectorize: custom Analyzer

Cluster

Index

Vectorize

[ 0,1,0,1,1,1,0,0,1,0,1 ]
[ 0,1,1,1,1,0,0,0,0,0,1 ]

Many options and flags

$ mahout seq2sparse
--input ..
--output ..
--analyzerName ..
--maxDFPercent ..
--minDF ..
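
To give a feel for what the two document-frequency flags are for, here is a plain-Java sketch of pruning a vocabulary by document frequency. It is an interpretation, not Mahout's code: the class name DocumentFrequencyFilter and both method names are made up, and the exact boundary handling (inclusive limits) is an assumption. The idea is that terms occurring in too few documents, or in too large a percentage of documents, are dropped before vectors are built.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of document-frequency pruning, not Mahout code.
final class DocumentFrequencyFilter {

    // Counts, for every term, the number of documents it occurs in.
    static Map<String, Integer> documentFrequencies(List<List<String>> documents) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : documents) {
            // A set, so that a term repeated within one document is counted once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps terms found in at least minDf documents and in at most maxDfPercent percent of them.
    static Set<String> keptTerms(List<List<String>> documents, int minDf, int maxDfPercent) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(documents).entrySet()) {
            double percent = 100.0 * e.getValue() / documents.size();
            if (e.getValue() >= minDf && percent <= maxDfPercent) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

In a collection of four documents where "the" occurs in all four, "java" in two and every other term in only one, a minimum document frequency of 2 combined with a maximum of 90 percent keeps only "java".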


Cluster

Run one of the clustering algorithms!

K-means, Fuzzy K-means, Canopy,
Mean-shift, Min-hash, LDA

All with different pros and cons


Index

Custom code to join data at index time

Index clusters
(cluster_id, cluster_name, size)

Index posts
(post_id, post_cluster_id, title)
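
The join at index time could look roughly like the plain-Java sketch below. It is not the code from the talk: the class name IndexJoinSketch and the method names are made up, documents are modelled as simple field maps rather than real index documents, and the cluster size is assumed to be the number of posts assigned to the cluster. Only the field names are taken from the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of joining cluster output with post data before indexing.
final class IndexJoinSketch {

    // One document per cluster: (cluster_id, cluster_name, size).
    static List<Map<String, Object>> clusterDocuments(Map<Integer, String> clusterNames,
                                                      Map<Integer, Integer> postToCluster) {
        // Assumption: cluster size = number of posts assigned to the cluster.
        Map<Integer, Integer> sizes = new TreeMap<>();
        for (int clusterId : postToCluster.values()) {
            sizes.merge(clusterId, 1, Integer::sum);
        }
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : sizes.entrySet()) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("cluster_id", e.getKey());
            doc.put("cluster_name", clusterNames.get(e.getKey()));
            doc.put("size", e.getValue());
            docs.add(doc);
        }
        return docs;
    }

    // One document per post: (post_id, post_cluster_id, title), ordered by post id.
    static List<Map<String, Object>> postDocuments(Map<Integer, String> titles,
                                                   Map<Integer, Integer> postToCluster) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Map.Entry<Integer, String> e : new TreeMap<>(titles).entrySet()) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("post_id", e.getKey());
            doc.put("post_cluster_id", postToCluster.get(e.getKey()));
            doc.put("title", e.getValue());
            docs.add(doc);
        }
        return docs;
    }
}
```

Storing the cluster id on every post document lets a search front end look up all posts behind a cluster label with a single field query.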


Conclusions

Clustering is fun!

Vectorization & labeling improvements

Tools for cluster evaluation?


References
Mahout in Action – Just released!
Sean Owen, Ted Dunning, Robin Anil, Ellen Friedman

{user|dev}@mahout.apache.org
http://jira.apache.org/MAHOUT

http://www.searchworkings.org

