Apache Spark MLlib's Past Trajectory and New Directions

This talk discusses the trajectory of MLlib, the Machine Learning (ML) library for Apache Spark. We will review the history of the project, including major trends and efforts leading up to today. These discussions will provide perspective as we delve into ongoing and future efforts within the community. This talk is geared towards both practitioners and developers and will provide a deeper understanding of priorities, directions and plans for MLlib.

Since the original MLlib project was merged into Apache Spark, some of the most significant efforts have been in expanding algorithmic coverage, adding multiple language APIs, supporting ML Pipelines, improving DataFrame integration, and providing model persistence. At an even higher level, the project has evolved from building a standard ML library to supporting complex workflows and production requirements.

This momentum continues. We will discuss some of the major ongoing and future efforts in Apache Spark based on discussions, planning and development amongst the MLlib community. We (the community) aim to provide pluggable and extensible APIs usable by both practitioners and ML library developers. To take advantage of Projects Tungsten and Catalyst, we are exploring DataFrame-based implementations of ML algorithms for better scaling and performance. Finally, we are making continuous improvements to core algorithms in performance, functionality, and robustness. We will augment this discussion with statistics from project activity.

Apache Spark MLlib's Past Trajectory and New Directions (slide transcript)

  1. Apache Spark MLlib's past trajectory and new directions. Joseph K. Bradley, Spark Summit 2017.
  2. About me: Software engineer at Databricks. Apache Spark committer & PMC member. Ph.D. in Machine Learning from Carnegie Mellon.
  3. About Databricks. TEAM: started the Spark project (now Apache Spark) at UC Berkeley in 2009. PRODUCT: Unified Analytics Platform. MISSION: Making Big Data Simple.
  4. UNIFIED ANALYTICS PLATFORM. Try Apache Spark in Databricks! • Collaborative cloud environment • Free version (Community Edition). DATABRICKS RUNTIME 3.0: • Apache Spark, optimized for the cloud • Caching and optimization layer (DBIO) • Enterprise security (DBES). Try for free today: databricks.com
  5. 5 years ago… Zaharia et al., NSDI 2012: Scalable ML using Spark!
  6. 3 ½ years ago until today: people added some algorithms…
     Classification: Logistic regression & linear SVMs • L1, L2, or elastic net regularization • Decision trees • Random forests • Gradient-boosted trees • Naive Bayes • Multilayer perceptron • One-vs-rest • Streaming logistic regression
     Regression: Least squares • L1, L2, or elastic net regularization • Decision trees • Random forests • Gradient-boosted trees • Isotonic regression • Streaming least squares
     Feature extraction, transformation & selection: Binarizer • Bucketizer • Chi-Squared selection • CountVectorizer • Discrete cosine transform • ElementwiseProduct • Hashing term frequency • Inverse document frequency • MinMaxScaler • NGram • Normalizer • VectorAssembler • VectorIndexer • VectorSlicer • Word2Vec
     Clustering: Gaussian mixture model • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering
     Recommendation: Alternating Least Squares (ALS)
     Frequent itemsets: FP-growth • PrefixSpan
     Statistics: Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation
     Linear algebra: Local & distributed dense & sparse matrices • Matrix decompositions (PCA, SVD, QR)
  7. Today, Apache Spark is widely used for ML: • Many production use cases • 1000s of commits • 100s of contributors • 10,000s of users (on Databricks alone)
  8. This talk. What? How has MLlib developed, and what has it become? So what? What are the implications for the dev & user communities? Now what? Where could or should it go?
  9. What? How has MLlib developed, and what has it become?
  10. Let's do some data analytics: 50+ algorithms and featurizers • Rapid development • Shift from adding algorithms to improving algorithms & infrastructure • Growing dev community
  11.–13. (Charts of project activity; no transcript text captured.)
  14. Major projects, mapped across Spark releases (0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1, master): Trees & ensembles • GLMs • Pipelines • Featurizers • SparkR • DataFrame-based API • ML persistence. Note: these are unofficial "project" labels based on JIRA/PR activity.
  15. So what? What are the implications for the dev & user communities?
  16. Integrating ML with the Big Data world: seamless integration with SQL, DataFrames, Graphs, and Streaming, both within MLlib… • Data sources: CSV, JSON, Parquet, … • DataFrames: simple, scalable pre/post-processing • Graphs: graph-based ML implementations • Streaming: ML models deployed with Streaming
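
For a concrete picture of that integration, here is a minimal sketch: a data source is read into a DataFrame, pre-processed with plain DataFrame operations, and handed to MLlib. The file name and column names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-integration").getOrCreate()
import spark.implicits._

// Read any supported data source (CSV, JSON, Parquet, …) into a DataFrame.
val raw = spark.read.json("events.json")  // hypothetical file

// Scalable pre-processing with ordinary DataFrame/SQL operations.
val prepared = raw
  .filter($"label".isNotNull)
  .select($"text", $"label")

// `prepared` can be passed directly to any DataFrame-based MLlib stage.
```
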
  17. Integrating ML with the Big Data world: seamless integration with SQL, DataFrames, Graphs, and Streaming, both within MLlib… and outside MLlib. E.g., on spark-packages.org: • scikit-learn integration package ("spark-sklearn") • Stanford CoreNLP integration ("spark-corenlp"). Deep Learning libraries: • Deep Learning Pipelines ("spark-deep-learning") • BigDL • TensorFlowOnSpark • Distributed training with Keras ("dist-keras"). A framework for integrating non-"big data" or specialized ML libraries.
  18. Scaling. Workflow scalability: the same code runs on a laptop with small data → a cluster with big data. Big data scalability: scale-out implementations. (Benchmark callouts on the slide: "4x faster than xgboost" and "8x slower than xgboost"; Abuzaid et al. 2016; Chen & Guestrin 2016.)
  19. Building workflows. Simple concepts: Transformers, Estimators, Models, Evaluators • Unified, standardized APIs, including for algorithm parameters. Pipelines: • Repeatable workflows across Spark deployments and languages • DataFrame APIs and optimizations • Automated model tuning. (Workflow stages: ETL, Featurization, Training, Evaluation, Tuning.)
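
A minimal Pipeline sketch, assuming `training` and `test` are DataFrames with "text" and "label" columns (illustrative names; the stages mirror the standard Spark ML text-classification pattern):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Transformers and an Estimator, all sharing the same standardized param API.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// A Pipeline chains the stages into a single Estimator.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() returns a PipelineModel (itself a Transformer) that runs the
// whole workflow; the same code works across deployments and languages.
val model = pipeline.fit(training)
val predictions = model.transform(test)
```
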
  20. Now what? Where could or should it go?
  21. Top items from the dev@ list… • Continued speed & scalability improvements • Enhancements to core algorithms • ML building blocks: linear algebra, optimization • Extensible and customizable APIs. Goals: scale-out ML, a standard library, an extensible API. These are unofficial goals based on my experience plus discussions on the dev@ mailing list.
  22. Scale-out ML. General improvements: • DataFrame-based implementations. Targeted improvements: • Gradient-boosted trees: feature subsampling • Linear models: vector-free L-BFGS for billions of features • … E.g., 10x scaling for Connected Components in GraphFrames.
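
A hedged sketch of the Connected Components example (GraphFrames is a Spark Package, not part of MLlib; `vertices` and `edges` are assumed DataFrames with the "id" and "src"/"dst" columns GraphFrames expects):

```scala
import org.graphframes.GraphFrame

// The scalable Connected Components implementation checkpoints
// intermediate results, so a checkpoint directory must be set.
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")

val graph = GraphFrame(vertices, edges)

// run() returns the vertex DataFrame with an added "component" column.
val components = graph.connectedComponents.run()
components.select("id", "component").show()
```
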
  23. Standard library. General improvements: • NaN/null handling • Instance weights • Warm starts / initial models • Multi-column feature transforms. Algorithm-specific improvements: • Trees: access tree structure from Python • …
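
Instance weights, for example, surface as a weight column on supported estimators. A minimal sketch, assuming a DataFrame `training` with an illustrative "weight" column:

```scala
import org.apache.spark.ml.classification.LogisticRegression

val weightedLR = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")  // each row's loss contribution is scaled by this column

val model = weightedLR.fit(training)
```
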
  24. Extensible and customizable APIs. MLlib has ~50 algorithms; CRAN lists 10,750 packages. → Users must be able to modify algorithms or write their own. (Chart: long tail of algorithm usage.)
  25. Spark Packages: 340+ packages written for Spark; 80+ packages for ML and Graphs. E.g.: • GraphFrames ("graphframes"): DataFrame-based graphs • Bisecting K-Means ("bisecting-kmeans"): now part of MLlib • Stanford CoreNLP wrapper ("spark-corenlp"): UDFs for NLP. spark-packages.org
  26. How to write a custom algorithm. You need to: • Extend an API like Transformer, Estimator, Model, or Evaluator. MLlib provides some help: • UnaryTransformer • DefaultParamsWritable/Readable • defaultCopy. You get: • Natural integration with DataFrames, Pipelines, and model tuning.
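
A minimal sketch of such a custom algorithm (the class name and its logic are illustrative): extending UnaryTransformer supplies the input/output column params, copy(), and the DataFrame plumbing, while DefaultParamsWritable/Readable add persistence.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, StringType}

// A one-column string Transformer: lowercases its input column.
class Lowercaser(override val uid: String)
    extends UnaryTransformer[String, String, Lowercaser]
    with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("lowercaser"))

  // The per-row function applied to the input column.
  override protected def createTransformFunc: String => String = _.toLowerCase

  override protected def outputDataType: DataType = StringType
}

// Companion object enables Lowercaser.load(path) for ML persistence.
object Lowercaser extends DefaultParamsReadable[Lowercaser]

// Usage: integrates with Pipelines like any built-in stage, e.g.
//   new Lowercaser().setInputCol("text").setOutputCol("textLower")
```
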
  27. Challenges in customization. Modify existing algorithms with custom: • Loss functions • Optimizers • Stopping criteria • Callbacks. Implement new algorithms without worrying about: • Boilerplate • Python API • ML building blocks • Feature attributes/metadata.
  28. Example of MLlib APIs: Deep Learning Pipelines for Apache Spark (spark-packages.org/package/databricks/spark-deep-learning). APIs are great for users: • Plug & play algorithms • Standard API. But challenges remain: • Development requires expertise • Community under development.
  29. My challenge to you: can we fill out the long tail of ML use cases? (Chart: long tail of algorithm usage.)
  30. Resources. Spark Packages: spark-packages.org • Plugin for building packages: https://github.com/databricks/sbt-spark-package • Tool for generating package templates: https://github.com/databricks/spark-package-cmd-tool. Example packages: • scikit-learn integration ("spark-sklearn"): Python-only • GraphFrames: Scala/Java/Python
  31. Thank you! Office hours today @ 3:50pm at the Databricks booth. Twitter: @jkbatcmu
