See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
For more than a decade, internet search engines have helped users find the documents they are looking for. But what if users aren't looking for anything specific and instead want a summary of a large document collection, and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are the clustered search results of Google News, or tag clouds, which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
5. What is clustering?
Grouping & summarizing data
Unsupervised machine learning
“...the assignment of a set of observations into subsets so that observations in the same clusters are similar in some sense...”
Source: Wikipedia
8. 2-D Clustering example
[Figure: 2-D plot of points, clusters, and cluster centers, annotated with intra-cluster distance and inter-cluster distance.]
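The two distances labelled in the figure can be made concrete with a small sketch. It assumes Euclidean distance, takes intra-cluster distance to be the mean distance from each point to its own cluster center, and takes inter-cluster distance to be the distance between two cluster centers. These are common definitions, not necessarily the ones used in the talk, and the sample points are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def center(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def intra_cluster_distance(points):
    """Mean distance from each point to its own cluster center."""
    c = center(points)
    return sum(euclidean(p, c) for p in points) / len(points)

def inter_cluster_distance(points_a, points_b):
    """Distance between the centers of two clusters."""
    return euclidean(center(points_a), center(points_b))

# Two hypothetical 2-D clusters (illustrative data only).
cluster_a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
cluster_b = [(10.0, 0.0), (10.0, 2.0), (12.0, 0.0), (12.0, 2.0)]

intra_a = intra_cluster_distance(cluster_a)
inter_ab = inter_cluster_distance(cluster_a, cluster_b)
print(intra_a, inter_ab)
```

A good clustering tends to have small intra-cluster distances and large inter-cluster distances, as in this example.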
9. K-Means algorithm
Select K random vectors as initial cluster centers
Specify distance measure + threshold
Every iteration
● Assign each vector to its closest cluster center
● Recompute each cluster center
● Converged if no cluster center moves more than the threshold
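The steps on this slide can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not Mahout's implementation: the function name `k_means`, its parameters, the Euclidean distance measure, and the sample points are all assumptions made for the example. Convergence is interpreted here as "no center moved more than the threshold".

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def k_means(vectors, k, distance=euclidean, threshold=1e-3,
            max_iterations=100, seed=0):
    """Return (centers, clusters) after iterating until the centers settle."""
    rng = random.Random(seed)
    # Select K random vectors as the initial cluster centers.
    centers = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assign each vector to its closest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: distance(v, centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Converged if no center moved more than the threshold.
        moved = max(distance(old, new)
                    for old, new in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold:
            break
    return centers, clusters

# Two hypothetical, well-separated groups of 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))
```

Because the initial centers are random, K-Means can settle in different local optima on less cleanly separated data, so results may vary with the seed.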
20. Clustering
[Figure: processing pipeline. Stages shown: Pre-process (XML & HTML, regular expressions; Post ID & Title), Text, Vectorize, Cluster, Join content & clusters, Index. Example vectors: [ 0,1,0,1,1,1,0,0,1,0,1 ] and [ 0,1,1,1,1,0,0,0,0,0,1 ]. Example labels: Java, Git, Lucene, Version control.]
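The 0/1 vectors on this slide suggest a term-presence encoding of each document. The sketch below illustrates that idea only: the helper names (`tokenize`, `build_vocabulary`, `vectorize`) and the two sample documents are hypothetical, and it does not reproduce Mahout's own vectorization, which can also weight terms rather than record plain presence.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Sorted list of every distinct token across all documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def vectorize(text, vocabulary):
    """Binary vector: 1 where the vocabulary term occurs in the text, else 0."""
    tokens = set(tokenize(text))
    return [1 if term in tokens else 0 for term in vocabulary]

# Hypothetical pre-processed document texts.
documents = [
    "Lucene index in Java",
    "Git version control",
]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary term, so documents that share many terms end up close together under most distance measures.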
27. References
Mahout in Action – Just released!
Sean Owen, Ted Dunning, Robin Anil, Ellen Friedman
{user|dev}@mahout.apache.org
http://jira.apache.org/MAHOUT
http://www.searchworkings.org