3. Machine Learning
Machine Learning is a subset of Artificial Intelligence.
4. NoSQL, Search and Machine Learning
NoSQL, Search and Machine Learning complement each other well!
5. Machine Learning algorithms
• Recommendations
Suggest relevant items to users
• Classification
Automatically classify documents based on a given set of
examples
• Clustering
Automatically discover groups within a set of documents
• Pattern mining, evolutionary algorithms, ...
9. Recommendation use cases
• Recommend items to users on e-commerce websites
And increase revenue
• Suggest features a user may be interested in within a Web application
As most features usually go undiscovered
• Filter and adapt the scoring of search engine results
Based on similar users' clicks, ...
16. Clustering with K-Means
Cluster centers are moved in order to minimize the sum of distances
[Diagram: data points A, B, C, D, E, F]
17. Clustering with K-Means
The data point C is then attached to the first center as it has become the nearest
[Diagram: data points A, B, C, D, E, F]
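The two steps from the K-Means slides (attach each point to its nearest center, then move the centers) can be sketched in plain Java. This is only an illustration, not Mahout's implementation: the class name, the coordinates given to points A to F and the two starting centers are all assumptions made up for the example.

```java
import java.util.Arrays;

// Minimal K-Means sketch on 2-D points. Illustrative only: the data,
// the starting centers and all names are assumptions, not Mahout code.
class KMeansSketch {

    // Squared Euclidean distance between two 2-D points.
    static double squaredDistance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        return dx * dx + dy * dy;
    }

    // Assignment step: attach each point to its nearest center.
    static int[] assign(double[][] points, double[][] centers) {
        int[] labels = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (squaredDistance(points[i], centers[c])
                        < squaredDistance(points[i], centers[best])) {
                    best = c;
                }
            }
            labels[i] = best;
        }
        return labels;
    }

    // Update step: move each center to the mean of its attached points.
    // A center with no points stays where it was.
    static double[][] update(double[][] points, int[] labels, double[][] centers) {
        int k = centers.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            sums[labels[i]][0] += points[i][0];
            sums[labels[i]][1] += points[i][1];
            counts[labels[i]]++;
        }
        double[][] moved = new double[k][2];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                moved[c] = centers[c].clone();
            } else {
                moved[c][0] = sums[c][0] / counts[c];
                moved[c][1] = sums[c][1] / counts[c];
            }
        }
        return moved;
    }

    // Alternate both steps until the assignments stop changing.
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] labels = assign(points, centers);
        for (int it = 0; it < maxIterations; it++) {
            centers = update(points, labels, centers);
            int[] next = assign(points, centers);
            if (Arrays.equals(next, labels)) {
                break;
            }
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical coordinates for A, B, C, D, E, F.
        double[][] points = {{1, 1}, {2, 1}, {3, 3}, {8, 8}, {9, 8}, {8, 9}};
        double[][] centers = {{1, 1}, {4, 3}};
        // C (index 2) starts on the second center...
        System.out.println(Arrays.toString(assign(points, centers)));      // prints [0, 0, 1, 1, 1, 1]
        // ...and joins the first one once the centers have moved.
        System.out.println(Arrays.toString(cluster(points, centers, 10))); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

With these made-up coordinates, C is first closer to the second center; after one update the second center drifts toward D, E and F, so C switches to the first center, which is the situation the slide describes.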
18. Clustering use cases
• Find key topics in a set of documents
News feeds, business documents, ...
• Find typical behaviors within a set of users
Visit frequency, buying habits, ...
20. In a few words
• Implementation of machine learning algorithms in Java
Continuously growing collection of algorithms
• Most of them come in a MapReduce implementation for Hadoop
Scalable to huge datasets
• Still quite young but growing fast
Started in early 2009
• Intended to be for Machine Learning what Lucene is for Information Retrieval
22. Recommendation example
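// Load preferences from a CSV file (userID,itemID,preference per line)
DataModel model = new FileDataModel(new File("data.csv"));
// Compare users by the Pearson correlation of their preferences
UserSimilarity similarity =
    new PearsonCorrelationSimilarity(model);
// Keep the 2 most similar users as the neighborhood
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, similarity);
// Ask for 1 recommendation for the user with ID 1
List<RecommendedItem> recommendations =
    recommender.recommend(1, 1);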
The code for a basic recommendation is pretty straightforward!
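The example above relies on a Pearson correlation to decide which users are similar. As a rough idea of what that measure computes, here is a plain-Java sketch. It is not Mahout's implementation; the class name, the user names and the ratings are assumptions made up for the illustration.

```java
// Plain-Java sketch of the Pearson correlation between two users' ratings.
// Not Mahout's implementation; names and data are made up for the example.
class PearsonSketch {

    // Correlation of two users' ratings over the same items, in [-1, 1].
    // Returns NaN when either user gave the same rating to every item.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0;
        double varX = 0;
        double varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical ratings given by three users to the same three items.
        double[] alice = {5, 3, 4};
        double[] bob = {4, 2, 3};   // same tastes as alice, one point lower
        double[] carol = {1, 5, 2}; // roughly opposite tastes
        System.out.println("alice/bob:   " + pearson(alice, bob));
        System.out.println("alice/carol: " + pearson(alice, carol));
    }
}
```

Users whose ratings rise and fall together get a value close to 1 and end up in each other's neighborhood; users with opposite tastes get a negative value.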
29. A Search Engine
MyCustomer Search
Document Non Disclosure Agreement 12 days ago
... MyCustomer agrees not to disclose any part of ...
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 2 days ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
30. Indexing Pipeline
PDF → Text Extractor (Tika) → Analyzer (Lucene) → Search Index
Phone Call → Analyzer (Lucene) → Search Index
31. A more complex Search Engine
MyCustomer Search
Sales | Legal | Accounting
Document 2010 Sales Report 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call 2 days ago
Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min
Description: Invoice not received for order #2354E
32. Indexing Pipeline with Mahout
PDF → Text Extractor (Tika) → Classifier (Mahout) → Analyzer (Lucene) → Search Index
Phone Call → Classifier (Mahout) → Analyzer (Lucene) → Search Index
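The classification step of this pipeline can be sketched in plain Java. Everything here is an assumption made for the illustration: the keyword-based classify method is a hypothetical stand-in for a trained Mahout classifier, the in-memory map is a stand-in for a Lucene index, and the category names and sample texts are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch of an indexing pipeline with a classification step.
// classify() is a hypothetical stand-in for a trained Mahout classifier,
// and the map is a stand-in for a Lucene index.
class IndexingPipelineSketch {

    // Hypothetical classifier: picks the category with the most keyword matches.
    static String classify(String text) {
        Map<String, String[]> keywords = new LinkedHashMap<>();
        keywords.put("Sales", new String[] {"deal", "sales", "revenue"});
        keywords.put("Legal", new String[] {"agreement", "disclose", "contract"});
        keywords.put("Accounting", new String[] {"invoice", "order", "payment"});

        String lower = text.toLowerCase(Locale.ROOT);
        String best = "Unknown";
        int bestHits = 0;
        for (Map.Entry<String, String[]> entry : keywords.entrySet()) {
            int hits = 0;
            for (String keyword : entry.getValue()) {
                if (lower.contains(keyword)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Classify the extracted text, then store it under its category.
    // A real pipeline would add the category as a field of the indexed document.
    static void indexDocument(Map<String, List<String>> index, String text) {
        String category = classify(text);
        index.computeIfAbsent(category, c -> new ArrayList<>()).add(text);
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        indexDocument(index, "MyCustomer agrees not to disclose any part of this document");
        indexDocument(index, "MyCustomer: 3 deals closed in 2010");
        indexDocument(index, "Invoice not received for order #2354E");
        System.out.println(index.keySet());
    }
}
```

Storing the category alongside each document at indexing time is what makes it possible to filter results by category at query time.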
33. Query pipeline
Query → Analyzer (Lucene) → Search Index → Results
34. Query pipeline with Mahout
Query → Analyzer (Lucene) → Search Index → Results
Custom Scoring: using Mahout recommendations
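One way to picture custom scoring is to blend each search score with a recommendation score for the current user. The sketch below is plain Java and only an assumption about how such a blend could look: the Hit class, the rescore method, the multiplicative formula and the weight parameter are all hypothetical, and neither Lucene nor Mahout is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of re-scoring search hits with recommendation scores.
// Hit, rescore() and the blending formula are hypothetical; a real setup
// would plug the scores into the search engine's own scoring hooks.
class RescoringSketch {

    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Boost each hit by its recommendation score (0 when none is known),
    // then sort by the blended score, best first.
    static List<Hit> rescore(List<Hit> hits, Map<String, Double> recommendationScores, double weight) {
        List<Hit> rescored = new ArrayList<>();
        for (Hit hit : hits) {
            double recommendation = recommendationScores.getOrDefault(hit.docId, 0.0);
            rescored.add(new Hit(hit.docId, hit.score * (1.0 + weight * recommendation)));
        }
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return rescored;
    }

    public static void main(String[] args) {
        // Hypothetical search scores for three documents.
        List<Hit> hits = Arrays.asList(
                new Hit("sales-report", 0.9),
                new Hit("phone-call", 0.8),
                new Hit("nda", 0.5));
        // Hypothetical recommendation score: similar users clicked the phone call.
        Map<String, Double> recommended = Collections.singletonMap("phone-call", 0.5);

        for (Hit hit : rescore(hits, recommended, 1.0)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

With these made-up numbers the phone call moves ahead of the sales report, because the recommendation boost outweighs its lower search score.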
35. Conclusion
• Machine learning brings a lot of valuable features to enterprises
Increased revenue, better productivity, user adoption, ...
• Mahout is growing fast and is becoming a great choice for Java apps
With easy integration into business applications
• Business people are not used to this kind of use case
Collaboration with technical folks is mandatory