Hadoop for the Data Scientist: Spark in Cloudera 5.5

Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?

Cloudera Navigator Optimizer is the solution: it analyzes existing SQL workloads to provide instant insights and turns them into an intelligent optimization strategy, so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera’s enterprise Hadoop platform, now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.


  1. © Cloudera, Inc. All rights reserved.
     Hadoop for the Data Scientist: Spark in Cloudera 5.5
     Anand Iyer | Senior Product Manager | Cloudera
     Sandy Ryza | Senior Data Scientist | Cloudera
  2. Agenda
     • Apache Spark Overview
     • Machine Learning with Hadoop and Spark
     • Machine Learning Use Cases
     • What’s Next
  3. Cloudera Enterprise: Making Hadoop Fast, Easy, and Secure
     A new kind of data platform:
     • One place for unlimited data
     • Unified, multi-framework data access
     Cloudera makes it:
     • Fast for business
     • Easy to manage
     • Secure without compromise
     [Platform diagram] Operations; Data Management; Integrate (structured, unstructured); Process, Analyze, Serve (batch, stream, SQL, search, SDK); Unified Services (resource management, security); Store (filesystem, relational, NoSQL)
  4. One Platform, Many Workloads: Batch, Interactive, and Real-Time
     Leading performance and usability in one platform.
     • End-to-end analytic workflows
     • Access more data
     • Work with data in new ways
     • Enable new users
     [Platform diagram]
     • Operations: Cloudera Manager, Cloudera Director
     • Data Management: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
     • Integrate: structured (Sqoop); unstructured (Kafka, Flume)
     • Process, Analyze, Serve: batch (Spark, Hive, Pig, MapReduce); stream (Spark); SQL (Impala); search (Solr); SDK (Kite)
     • Unified Services: resource management (YARN); security (Sentry, RecordService)
     • Store: filesystem (HDFS); relational (Kudu); NoSQL (HBase)
  5. Apache Spark: Flexible, in-memory data processing for Hadoop
     Easy Development:
     • Rich APIs for Scala, Java, and Python
     • Interactive shell
     Flexible Extensible API:
     • APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
     Fast Batch & Stream Processing:
     • In-Memory processing and caching
  6. The Spark Ecosystem & Hadoop
     [Platform diagram]
     • Integrate: structured (Sqoop); unstructured (Kafka, Flume)
     • Batch & Stream: Spark, Spark Streaming, Spark SQL, DataFrames, MLlib, …
     • SQL (Impala); search (Solr); SDK (Kite)
     • Unified Services: resource management (YARN); security (Sentry, RecordService)
     • Store: filesystem (HDFS); relational (Kudu); NoSQL (HBase)
  7. Easy Machine Learning on data distributed over a large cluster of machines
  8. Process Flow for ML Development
     [Process diagram]
     • Phases: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*
     • Supporting layers: Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing
     • Steps: Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring
     *There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
  9. What is MLlib?
     Library of machine learning and data mining algorithms and utilities
     • Implemented in Spark
     • Invoked within Java, Scala, or Python Spark applications
     MLlib applications are Spark applications
     • Running them effectively requires Spark knowledge
     • Recommended deployment on YARN
     • MLlib apps require the same set of parameters that Spark applications require (number of executors, memory per executor, etc.)
  10. What Does MLlib Contain?
      • Machine learning models for classification and regression
      • Recommender System
      • Clustering Algorithms
      • Feature Engineering Algorithms and Utilities
      • Data Mining Algorithms & Basic Statistical Analysis Utilities
  12. Classification & Regression
      Traditional Models
      • Linear and Logistic Regression
      • Naïve Bayes
      • Decision Trees
      • Support Vector Machines
      Next-Gen Models
      • Gradient Boosted Trees
      • Random Forests
  14. Clustering Algorithms
      • K-Means
      • Power Iteration Clustering (PIC)
      • Gaussian Mixture Model
      • Streaming K-Means
      Textual data clustering, i.e. identifying “topics” from a corpus of documents:
      • Latent Dirichlet Allocation (LDA)
  15. Collaborative Filtering For Building Recommender Systems
      • Predicting the interests of a user by collecting a partial list of preferences from many users
      • Predicting missing items of a user-item association matrix
      • Algorithm used: Alternating Least Squares
      • Admittedly limited choice of algorithms
  18. Feature Engineering & Modeling Utilities
      • Feature Scaling & Normalization
      • Statistical Correlation Functions (Pearson & Spearman’s)
      • Tests of Statistical Significance
      • Chi-Squared independence test for feature selection
      • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc.
      Dimensionality Reduction:
      • Principal Component Analysis (PCA)
      • Singular Value Decomposition (SVD)
      Textual Feature Generation:
      • Word2Vec
      • Term Frequency – Inverse Document Frequency (TF-IDF)
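To make the TF-IDF item concrete, here is a minimal sketch in plain Python. It is not MLlib code and does not use Spark. The function name `tf_idf`, the smoothed IDF variant, and the three toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((n_docs + 1) / (count + 1)) for t, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * idf[t] for t in tf})
    return weights


# Toy corpus, invented for this example.
docs = [
    ["spark", "runs", "on", "hadoop"],
    ["spark", "mllib", "trains", "models"],
    ["hadoop", "stores", "data"],
]
weights = tf_idf(docs)
```

A term that occurs in a single document ("runs") ends up with a higher weight than one shared across documents ("spark"), which is the property that makes TF-IDF useful as a text feature.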
  20. Data Mining: Frequent Pattern Mining
      Data Mining Urban Legend: Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men who buy diapers have a very high likelihood of buying beer!”
      Algorithms in MLlib:
      • Frequent Pattern-Growth
      • Association Rule Mining
      • PrefixSpan
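The diapers-and-beer story is a statement about an association rule. As a rough illustration, the sketch below computes the two standard quantities behind such a rule, support and confidence, in plain Python. It is not the FP-Growth algorithm and not MLlib code; the helper names and the five baskets are invented for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets, made up for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```

In this toy data, two of the three baskets with diapers also contain beer, so the rule "diapers -> beer" has a confidence of two thirds. Frequent pattern algorithms exist to find such itemsets efficiently when enumerating all combinations is infeasible.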
  21. What about “Deep Learning”?
      Deep Learning is an umbrella term for large, complex Multi-Layer Neural Networks
      • MLlib contains a robust Multilayer Neural Network implementation
  24. Pipeline API: Hooking the Pieces Together
      • Inspired by scikit-learn pipelines
      • ML involves running multiple sequential steps
      E.g. Text Classification Pipeline: Tokenize → Bag of Words → TF-IDF → LDA → Scale & Normalize Features → Train Classifier
      • Sequence is repeated during Training and Scoring
      • Hyper-Parameter Tuning: repeat the sequence with different parameter values
  25. Overview of Pipeline API
      • Create Pipeline as a sequence of Stages:
        • Transformers: Transform or augment features
        • Estimators: Fit a model
      • Re-use Pipeline
        • Basic save and load functionality available
      • Invoke Pipeline with different set of parameters passed as ParamMap
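The Transformer/Estimator split can be sketched without Spark. The plain-Python classes below are hypothetical stand-ins written for this example; they only mimic the idea and are not the Spark ML Pipeline API. A transformer exposes `transform`, an estimator exposes `fit` and returns a fitted transformer, and the pipeline chains the stages in order.

```python
from collections import Counter


class Tokenizer:
    """Transformer: turns each raw string into a list of lowercase tokens."""

    def transform(self, rows):
        return [row.lower().split() for row in rows]


class BagOfWords:
    """Fitted transformer: maps token lists to count vectors over a fixed vocabulary."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [[row.count(term) for term in self.vocab] for row in rows]


class VocabularyCounter:
    """Estimator: learns a vocabulary and returns a fitted BagOfWords."""

    def __init__(self, min_count=1):
        self.min_count = min_count  # tunable parameter

    def fit(self, rows):
        counts = Counter(tok for row in rows for tok in row)
        vocab = sorted(t for t, c in counts.items() if c >= self.min_count)
        return BagOfWords(vocab)


class PipelineModel:
    """Result of fitting a pipeline: every stage is a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Ordered stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> fitted transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = ["Spark on Hadoop", "Spark MLlib on YARN"]
model = Pipeline([Tokenizer(), VocabularyCounter(min_count=2)]).fit(train)
scored = model.transform(["Spark on Spark"])
```

The same fitted `model` is applied at scoring time, which is the "sequence is repeated during training and scoring" point from the slides. Re-running `fit` with a different `min_count` is a crude stand-in for passing a different parameter set.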
  31. Process Flow for ML Development (same diagram as slide 8), with one addition: Score streaming events in Spark Streaming.
  32. Machine Learning Use Case
  33. Predicting Influencers at a Large Telco
      • Customer loyalty difficult and expensive
      • Aggressive competition
  34. Social Churn
      • Churn is not an isolated event!
      • When influential subscribers leave, they take their friends with them
  35. Casting This as a Data Science Problem
      • Can we quantify: Which lost users were the most influential?
      • Can we predict: Which current subscribers have as much influence?
  36. The Challenge: Lots of Customers, Lots of Data
      • Over 100 million customers
      • Over 1 billion connections
  38. Calculating Influencer Scores
      • Connection: pair of users with communication both ways
      • Influencer score: number of connected users that churn after user X churns
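The two definitions above translate fairly directly into code. The following is a small plain-Python sketch, not the presenters' Spark job. The call records, user names, and churn times are made up, and it assumes that any later churn by a connected user counts, since the slides do not state a time window.

```python
def connections(calls):
    """Return the pairs of users who called each other in both directions.

    calls is an iterable of (caller, callee) tuples.
    """
    directed = set(calls)
    return {
        frozenset((a, b))
        for a, b in directed
        if a != b and (b, a) in directed
    }


def influencer_scores(calls, churn_time):
    """Count, per churned user, the connected users who churned later.

    churn_time maps user -> time of churn; users not present never churned.
    Assumption: no time window, any later churn counts.
    """
    scores = {user: 0 for user in churn_time}
    for pair in connections(calls):
        a, b = tuple(pair)
        if a in churn_time and b in churn_time:
            if churn_time[a] < churn_time[b]:
                scores[a] += 1
            elif churn_time[b] < churn_time[a]:
                scores[b] += 1
    return scores


# Hypothetical call log and churn times.
calls = [
    ("ann", "bob"), ("bob", "ann"),
    ("ann", "cat"), ("cat", "ann"),
    ("ann", "dan"),                  # one-way only, so not a connection
    ("bob", "cat"), ("cat", "bob"),
]
churn_time = {"ann": 1, "bob": 3, "cat": 5}

scores = influencer_scores(calls, churn_time)
```

Here "ann" churns first and both of her two-way contacts churn afterwards, so she receives the highest score. At the scale quoted in the deck (over a billion connections), the same logic would have to be expressed as distributed joins rather than in-memory sets.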
  39. Predicting Influencer Scores: MLlib!
      • Regression model
        • Linear regression
        • Random forests
      • Features
        • # of connections, # calls to connections
        • Internal vs. External
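As a rough picture of the regression step, the sketch below fits an ordinary least-squares line that predicts an influencer score from a single feature, the number of connections. It is plain Python with invented numbers; the deck names MLlib's linear regression and random forests over several features, which this toy version does not reproduce.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: connections per churned user and
# the historical influencer score computed for that user.
num_connections = [2, 4, 6, 8]
historical_scores = [1, 2, 3, 4]

slope, intercept = fit_line(num_connections, historical_scores)

# Predicted score for a current subscriber with 10 connections.
predicted = slope * 10 + intercept
```

The fitted model is trained on users who already churned and then applied to current subscribers, which matches the "quantify, then predict" framing of the use case.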
  40. Breaking Down the Work
      • Building User and Connection Tables
      • Computing Historical Influencer Scores
      • Feature Generation
      • Model Fitting
      • Model Evaluation
  41. What’s Next
  42. Roadmap Update
      Management:
      • Initial Spark-on-YARN integration for shared resource management
      • New metrics for easier diagnosis
      • Improved Spark-on-YARN for better multi-tenancy, performance, ease of use
      • Automated configurations to optimize over time
      • Visibility into resource utilization
      • Improved PySpark integration for Python access
      Security:
      • Kerberos-based authorization
      • Fine-grained access control
      • Auditing and lineage (Governance)
      • Integration with Intel’s Advanced Encryption libraries
      • Full PCI compliance
      Scale:
      • Improved integration with HDFS to enable scheduling
      • Reduced memory pressure on larger jobs
      • Dynamic resource utilization and prioritization
      • Stress test at scale with mixed multi-tenant workloads
      Streaming:
      • Spark Streaming resiliency for zero data loss
      • Data ingest integration for Kafka and Flume
      • Improved state management for better performance
      • Higher-level language extensions
  43. Download Cloudera 5.5: cloudera.com/downloads
  44. Data Science & Spark Training Courses: university.cloudera.com
  45. Thank You
  46. Spark Resources
      • Learn Spark
        • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
        • Cloudera Developer Blog: blog.cloudera.com/spark
        • Spark Page: cloudera.com/spark
      • Get Trained
        • Cloudera Spark Training: university.cloudera.com
      • Try it Out
        • Cloudera Live Spark Tutorial: cloudera.com/live
        • Download Cloudera 5.5: cloudera.com/downloads
