1. September 2011 – HUG – Atlanta, GA Machine Learning With Hadoop Josh Patterson | Sr. Solutions Architect
2. Who is Josh Patterson? josh@cloudera.com Master’s Thesis: self-organizing mesh networks Published in IAAI-09: TinyTermite: A Secure Routing Algorithm Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA) Led the team that designed classification techniques for time series and MapReduce Open source work at http://openpdc.codeplex.com and https://github.com/jpatanooga Today: Sr. Solutions Architect at Cloudera
4. “After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.” --- Excerpt from the book “The American Gas Station” Hadoop Today: The Oil Industry Circa 1900
14. Data Mining “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Peter Norvig, “Artificial Intelligence: A Modern Approach”
15. Basic Concepts What is Data Mining? “The process of extracting patterns from data.” Why are we interested in Data Mining? Raw data is essentially useless: data is simply recorded facts, while information is the patterns underlying the data. We want to learn these patterns; information is key.
16. How does Machine Learning differ from Data Mining? Data Mining: extracting information from data; finds patterns in data. Machine Learning: algorithms for acquiring structural descriptions from data “examples”; the process of learning “concepts”; the “structural descriptions” represent patterns explicitly.
17. Shades of Gray Information Retrieval: information science, information architecture, cognitive psychology, linguistics, and statistics. Natural Language Processing: grounded in machine learning, especially statistical machine learning. Statistics: math and stuff. Machine Learning: considered a branch of artificial intelligence.
18. Types of Machine Learning Classification Association Clustering Numeric Prediction AKA: “Regression”
20. ML Focused on in Mahout Classification: Naïve Bayes in text classification, Stochastic Gradient Descent (logistic regression), Random Forests. Recommendation: collaborative filtering, Taste engine, item-to-item. Clustering: k-means, fuzzy k-means, Dirichlet Process Clustering, Latent Dirichlet Allocation.
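Mahout implements the algorithms above as Java MapReduce jobs. To illustrate just the k-means idea from the clustering list, here is a minimal single-machine sketch in Python; the toy 2-D points and the choice of k=2 are my own, not from the talk:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Plain k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid.
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c
                     else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

Mahout's distributed version does the same assign/recompute loop, but each iteration is a MapReduce pass over the data.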
21. Naïve Bayes and Text Document classification is an important domain in machine learning. Docs are characterized by the words that appear in them. One approach is to treat the presence / absence of each word as a Boolean attribute. Naïve Bayes is popular here: fast and accurate.
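The presence/absence formulation above is the Bernoulli Naïve Bayes model. A minimal sketch with Laplace smoothing; the tiny spam/ham corpus is a made-up example, not data from the talk:

```python
import math
from collections import defaultdict

def train(docs):
    """docs: list of (set_of_words, label) pairs. Estimates P(label)
    and P(word present | label) with Laplace smoothing."""
    vocab = set().union(*(words for words, _ in docs))
    labels = {label for _, label in docs}
    prior, cond = {}, defaultdict(dict)
    for label in labels:
        in_class = [words for words, l in docs if l == label]
        prior[label] = len(in_class) / len(docs)
        for w in vocab:
            present = sum(1 for words in in_class if w in words)
            cond[label][w] = (present + 1) / (len(in_class) + 2)
    return vocab, prior, cond

def classify(doc_words, vocab, prior, cond):
    def log_post(label):
        s = math.log(prior[label])
        for w in vocab:  # every vocab word contributes: present or absent
            p = cond[label][w]
            s += math.log(p if w in doc_words else 1 - p)
        return s
    return max(prior, key=log_post)

docs = [({"cheap", "viagra", "offer"}, "spam"),
        ({"cheap", "pills"}, "spam"),
        ({"meeting", "schedule"}, "ham"),
        ({"project", "schedule", "offer"}, "ham")]
model = train(docs)
print(classify({"cheap", "offer"}, *model))  # → spam
```

The "fast" claim on the slide follows from the structure: training is one counting pass, and classification is a sum of per-word log probabilities.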
22. What Are Recommenders? An algorithm that looks at a user’s past actions and suggests Products Services People
23. Collaborative Filtering Collaborative filtering produces recommendations based on user preferences for items; the “user-based” variant does not require knowledge of the specific properties of the items. In contrast, content-based (“item-based”) recommendation produces recommendations based on intimate knowledge of the properties of the items.
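A sketch of the user-based idea: score an unseen item by a similarity-weighted average of other users' ratings, using only the preference matrix, never the items' properties. The users, movies, and ratings here are hypothetical, and cosine similarity is just one common choice (Mahout's Taste engine offers several):

```python
import math

# Hypothetical user -> {item: rating} preference data.
ratings = {
    "alice": {"star_wars": 5, "dune": 4, "titanic": 1},
    "bob":   {"star_wars": 5, "dune": 5},
    "carol": {"titanic": 5, "notebook": 4, "star_wars": 1},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    return num / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for item."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        s = cosine(ratings[user], theirs)
        num += s * theirs[item]
        den += abs(s)
    return num / den if den else None

print(predict("bob", "titanic"))
```

Bob's tastes track Alice's closely, so the prediction lands near Alice's low rating for "titanic" rather than Carol's high one.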
25. What is time series data? Time series data is a sequence of data points, typically measured at successive, uniformly spaced time intervals. Example in finance: the daily adjusted close price of a stock on the NYSE. Example in sensors / signal processing / smart grid: sensor readings on a power grid occurring 30 times a second. For more reference on time series data: http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
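The linked blog post computes a simple moving average over time series with MapReduce and a secondary sort. The core per-key computation each reducer performs, once values arrive in time order, can be sketched like this (the toy closing prices are mine):

```python
from collections import deque

def moving_average(series, window):
    """Yield (index, mean of the last `window` values) once the
    window is full; deque(maxlen=...) drops the oldest value."""
    buf = deque(maxlen=window)
    for i, x in enumerate(series):
        buf.append(x)
        if len(buf) == window:
            yield i, sum(buf) / window

closes = [10.0, 11.0, 12.0, 11.0, 10.0, 9.0]
print(list(moving_average(closes, 3)))
```

In the MapReduce version, the mapper emits (stock, timestamp) composite keys and the secondary sort guarantees the reducer sees each stock's prices in exactly this time order.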
30. What is Lumberyard? Lumberyard is iSAX time series indexing stored in HBase, for persistent and scalable index storage. It’s interesting for indexing large amounts of time series data and for low-latency fuzzy pattern matching queries on time series data. Lumberyard is open source and Apache 2.0 licensed, on GitHub: https://github.com/jpatanooga/Lumberyard/ Copyright 2011 Cloudera Inc. All rights reserved
31. Genome Data as Time Series A, C, G, and T Could be thought of as “1, 2, 3, and 4”! If we have sequence X, what is the “closest” subsequence in a genome that is most like it? Doesn’t have to be an exact match! Example: ATATAT TATATA Useful in proteomics as well iSAX Indexing Lumberyard use case
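To make the "closest subsequence" idea concrete: code the bases as numbers (the slide's 1–4 mapping; which base gets which number is arbitrary) and slide the query across the genome, keeping the window with the smallest distance. This brute-force scan is only a sketch of the problem statement; iSAX's point is to discretize windows into symbolic words and index them so you avoid exactly this linear scan. The toy genome string is mine:

```python
# Arbitrary numeric coding of the four bases, per the slide's
# "A, C, G, T could be thought of as 1, 2, 3, 4" idea.
BASE = {"A": 1, "C": 2, "G": 3, "T": 4}

def best_match(query, genome):
    """Slide the query across the genome; return (offset, distance)
    of the closest window under squared Euclidean distance.
    Distance 0 means an exact match, but near matches rank too."""
    q = [BASE[b] for b in query]
    best = (None, float("inf"))
    for i in range(len(genome) - len(q) + 1):
        window = [BASE[b] for b in genome[i:i + len(q)]]
        d = sum((a - b) ** 2 for a, b in zip(q, window))
        if d < best[1]:
            best = (i, d)
    return best

offset, dist = best_match("ATATAT", "GGGATATATCCC")
print(offset, dist)
```

Because the score is a distance rather than equality, near-misses like the slide's ATATAT vs. TATATA still get ranked, which is what "doesn't have to be an exact match" buys you.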
32. Bioinformatics Applications in DNA Sequencing Shortest Superstring Problem (SSP) Take lots of reads from sequencing. We want the “superstring” of all the reads: a long string that “explains” all the reads we generated, and we want the shortest such string possible. NP-complete. We can reduce SSP to the Traveling Salesman Problem, so graph processing / algorithms are now applicable.
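Since exact SSP is intractable, a standard approach is the greedy approximation: repeatedly merge the two reads with the largest suffix/prefix overlap. This is my own illustrative sketch of that classic heuristic (the three toy reads are made up), not an algorithm given in the talk:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(reads):
    """Greedy SSP approximation: merge the highest-overlap pair
    until one string remains that contains every read."""
    # Drop reads already contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap, i, j); falls back to concatenation
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    o = overlap(a, b)
                    if o > best[0]:
                        best = (o, i, j)
        o, i, j = best
        merged = reads[i] + reads[j][o:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

result = greedy_superstring(["ATTAG", "TAGGA", "GGACT"])
print(result)  # → ATTAGGACT
```

The pairwise-overlap table this builds is exactly the weighted graph that connects SSP to the Traveling Salesman Problem mentioned above.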
33. Packages For Hadoop DataFu http://sna-projects.com/datafu/ UDFs in Pig, used at LinkedIn in many off-line workflows for data-derived products: “People You May Know”, “Skills”. Techniques: PageRank, quantiles (median), variance, etc., sessionization, convenience bag functions, convenience utility functions.
34. Integration with Libs Mix MapReduce with machine learning libs: WEKA, KXEN, CPLEX. The map side groups data; the reduce side processes each group of data with the lib in parallel. Involves tricks in getting K/V pairs into the lib: pipes, tmp files, the task cache dir, etc.
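The map-groups / reduce-invokes-lib pattern, sketched outside Hadoop so the shape is visible. The sensor records and the averaging "lib" stand in for real data and a real library call (WEKA, KXEN, etc.); in a real job `lib_fn` is where the K/V-pairs-into-the-lib tricks (pipes, tmp files) would live:

```python
from collections import defaultdict

def map_side(records, key_fn):
    """'Map side groups data': emit (key, record) pairs so the
    shuffle can bring each group together."""
    for r in records:
        yield key_fn(r), r

def reduce_side(pairs, lib_fn):
    """'Reduce side' hands each assembled group to the external
    library routine; groups are independent, so this parallelizes."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: lib_fn(vs) for k, vs in groups.items()}

readings = [("sensorA", 1.0), ("sensorA", 3.0), ("sensorB", 10.0)]
result = reduce_side(map_side(readings, key_fn=lambda r: r[0]),
                     lib_fn=lambda vs: sum(x for _, x in vs) / len(vs))
print(result)  # → {"sensorA": 2.0, "sensorB": 10.0}
```

The design point is that the library never needs to be MapReduce-aware: Hadoop only partitions the data, and each reduce task runs the lib on its groups sequentially.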
35. What Hadoop Is Not Good At in Data Mining Anything highly iterative Anything that is extremely CPU bound and not disk bound Algorithms that can’t be inherently parallelized Examples: Stochastic Gradient Descent (SGD), Support Vector Machines (SVM) Doesn’t mean they aren’t great to use
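SGD is a good illustration of why: each update reads the weights the previous update wrote, so the loop is inherently sequential and fits MapReduce poorly. A minimal logistic-regression SGD sketch; the learning rate, epoch count, and toy 1-D data are my own choices:

```python
import math, random

def sgd_logreg(data, lr=0.5, epochs=50, seed=0):
    """Online SGD for logistic regression. Weights are updated after
    every single example: the sequential dependency that makes SGD
    awkward to express as parallel MapReduce passes."""
    rng = random.Random(seed)
    w = [0.0] * (len(data[0][0]) + 1)  # bias + one weight per feature
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of the log-loss
            w[0] -= lr * g
            for i, xi in enumerate(x):
                w[i + 1] -= lr * g * xi
    return w

# Separable toy data: label 0 near x=0, label 1 near x=1.
data = [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)]
w = sgd_logreg(data)
```

Mahout's SGD implementation embraces this: it runs as a fast sequential (online) learner rather than as a MapReduce job.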
37. MRv2 Not everything fits great in MapReduce; Mahout is evidence of this. Examples: Stochastic Gradient Descent (SGD), Support Vector Machines (SVM). As we build further into verticals, our analysis needs will become more complicated. MRv2 gives us new options. CDH4 will be based on 0.23.x (or later); 0.23.0 doesn’t include MRv1 (via Tom White), so CDH4 will *only* include MRv2.
39. Frameworks Currently in Dev – MRv2 Giraph https://issues.apache.org/jira/browse/GIRAPH-13 Hama BSP plans to integrate with MRv2 https://issues.apache.org/jira/browse/HAMA-431 MPI https://issues.apache.org/jira/browse/MAPREDUCE-2911 Spark https://github.com/mesos/spark-yarn GraphLab Discussion in user-mahout
42. Questions? (Thanks!) Hadoop World 2011 You should go Talks are high quality Lots more Machine Learning talks Developer class 10/10/2011 http://www.eventbrite.com/event/1951335497 10% discount with code atlhug
Editor's notes
Theme: they threw away a lot of valuable gas and oil, just like we throw away data today
But what if some constraints changed?
Talk about the changing market dynamics of storage cost. What if some of the previously held constraints changed? Enter Hadoop
Examples of key information: selecting embryos based on 60 features. You may be asking “why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that underlie all of the systems, not just Mahout. Some of the wording may be different, but it’s the same
ML: Can be used to predict the outcome in a new situation. Can be used to understand and explain how the prediction is derived (may be even more important). Methods originate from artificial intelligence, statistics, and research on databases. DM: about the process. ML: about the algorithms. “Can machines really learn?” --- long discussion, but from some perspectives yes. Good philosophical talk over beers.
Mention how different books lay out information with different formatting, or may not group techniques exactly the same. Lots of bleed-over, from NLP to IR to ML
SGD – online learning, non-batch, not parallelizable, good performance
“What do other people w/ similar tastes like?”“strength of associations”
Let’s set the stage in the context of story, why we were looking at big data for time series.
OK, so how did we get to this point? Older SCADA systems take 1 data point per 2–4 seconds --- PMUs sample 30 times a sec, 120 PMUs, growing by a 10x factor
On Monday Steve from Google talked about working with genomic data --- genomic data is time series. Our take-home demo actually works with a small bit of genomic data. Lots of chatter @ OSCON about genomics; I just sat in one today