Fighting Fraud with Apache Spark

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Chargement dans…3
×

Consultez-les par la suite

1 sur 29 Publicité

Plus De Contenu Connexe

Diaporamas pour vous (20)

Les utilisateurs ont également aimé (20)

Publicité

Similaire à Fighting Fraud with Apache Spark (20)

Plus récents (20)

Publicité

Fighting Fraud with Apache Spark

  1. Fighting Fraud in Medicare with Apache Spark. Miklos Christine, Solutions Architect (mwc@databricks.com, @Miklos_C)
  2. About Me: Miklos Christine, Solutions Architect @ Databricks. I assist customers in architecting big data platforms and help them understand big data best practices. Previously: Systems Engineer @ Cloudera, supporting customers running some of the largest clusters in the world, and Software Engineer @ Cisco.
  3. Databricks, the company behind Apache Spark. Founded by the creators of Apache Spark in 2013; 75% of the Spark code contributed in 2014 came from Databricks. Databricks is built on top of Spark to make big data simple.
  4. Next Generation Big Data Processing Engine
  5. Spark started as a research project at UC Berkeley in 2009. It is about 600,000 lines of code (75% Scala), open source under the Apache 2.0 license, and built by 1,000+ developers from 200+ companies. The last release was Spark 1.6 (December 2015); the next release is Spark 2.0.
  6. The Apache Spark Engine: Spark Core with Spark Streaming, Spark SQL, SparkML / MLlib, and GraphFrames / GraphX on top. A unified engine across diverse workloads and environments: scale-out and fault tolerant, with Python, Java, Scala, and R APIs plus standard libraries.
  7. History of Spark APIs. RDD (2011): a distributed collection of JVM objects with functional operators (map, filter, etc.). DataFrame (2013): a distributed collection of Row objects with expression-based operations and UDFs, logical plans and an optimizer, and fast, efficient internal representations. Dataset (2015): internally rows, externally JVM objects; almost the "best of both worlds" (type safe and fast), but slower than DataFrames and not as good for interactive analysis, especially in Python.
  8. Apache Spark 2.0 API: Dataset (2016), where DataFrame = Dataset[Row]. The untyped DataFrame API is convenient for interactive analysis and faster; the typed Dataset API is optimized for data engineering and fast.
  9. Benefit of the logical plan: performance parity across languages (DataFrame vs. RDD).
  10. Machine Learning with Apache Spark
  11. Why do Machine Learning? Machine learning uses computers and algorithms to recognize patterns in data. Businesses have to adapt to change faster, data-driven decisions need to be made quickly and accurately, and customers expect faster responses.
  12. From Descriptive to Predictive to Prescriptive
  13. Data Science Time
  14. Iterate on Your Models
  15. Spark ML
  16. Why Spark ML? It provides general-purpose ML algorithms on top of Spark: let Spark handle the distribution of data and queries for scalability, and leverage its improvements (e.g. DataFrames, Datasets, Tungsten). Advantages of MLlib's design: simplicity, scalability, a streamlined end-to-end workflow, and compatibility.
  17. SparkML. ML Pipelines provide integration with DataFrames, a familiar API based on scikit-learn, easy workflow inspection, and simple parameter tuning.
  18. Databricks & SparkML: use DataFrames to directly access data (SQL, raw files); extract, transform, and load data using an elastic cluster; create the model using all of the data; iterate many times on the model; deploy the same model to production using the same code; repeat.
  19. Advantages of Spark ML: data can be accessed directly through the Spark Data Sources API (no more endless hours copying data between systems); data scientists can use all of the data rather than subsamples, taking advantage of the law of large numbers to improve model accuracy; they can scale compute with data size and model complexity; and they can iterate more, giving them the opportunity to create better models and to test and release more frequently.
  20. SparkML tips: understand Spark partitions; use the Parquet file format and compact small files; use coalesce() / repartition(); leverage existing functions and UDFs; leverage DataFrames and SparkML; for iterative algorithms, more cores mean faster processing.
  21. What’s New in Spark 2.0
  22. Spark 2.0 and SparkML: the RDD-based MLlib API is deprecated and in maintenance mode (the DataFrame-based API is now primary). New algorithm support: bisecting k-means clustering, Gaussian mixture models, and the MaxAbsScaler feature transformer. PySpark updates: LDA, Gaussian mixture models, and generalized linear regression. Model persistence now works across languages.
  23. Spark Demo
  24. Thanks! Sign up for Databricks Community Edition: https://databricks.com/try-databricks
  25. Learning more about MLlib. Guides & examples: an example workflow using ML Pipelines (Python) and a power plant data analysis workflow (Scala); both are part of the Databricks Guide, which contains many more examples and references. References: the Apache Spark MLlib User Guide, which contains code snippets for almost all algorithms as well as links to API documentation, and Meng et al., "MLlib: Machine Learning in Apache Spark," 2015, http://arxiv.org/abs/1505.06807 (academic paper).
