Spark One Platform Webinar

Doug Cutting discusses:
- A brief history of Spark and its rise in popularity across developers and enterprises
- Spark's advantages over MapReduce
- The One Platform Initiative and the roadmap for Spark
- The future of data processing in Hadoop

Published in: Technology

Spark One Platform Webinar

  1. Uniting Spark & Hadoop
     The One Platform Initiative
     Doug Cutting | Chief Architect | Cloudera
     Anand Iyer | Senior Product Manager | Cloudera
     © Cloudera, Inc. All rights reserved.
  2. Agenda
     • Emergence of Spark
     • Advantages of Spark
     • Spark replacing MapReduce
     • The “One Platform Initiative”
     • Future of Data Processing in Hadoop
  3. MapReduce: A great tool for its day
     The original scalable, general-purpose processing engine of the Hadoop ecosystem
     - Useful across diverse problem domains
     - Fueled the initial ecosystem explosion
     [Diagram: Hive, Pig, Mahout, Solr, and Crunch on top of the MapReduce Execution Engine]
  4. Enter Apache Spark
     General-purpose computational framework that substantially improves on MapReduce
     Key Properties:
     • Leverages distributed memory
     • Full Directed Graph expressions for data parallel computations
     • Simpler developer experience
     Yet Retains:
     • Linear scalability
     • Fault-tolerance
     • Data locality-based computations
  5. Apache Spark
     Flexible, in-memory data processing for Hadoop
     Easier
     • Rich APIs for Scala, Java, and Python
     • Interactive shell
     More Powerful
     • APIs for different types of workloads:
       • Batch
       • Streaming
       • Machine Learning
       • Graph
     Faster
     • In-Memory processing and caching
  6. Easy Development
     High Productivity Language Support
     • Native support for multiple languages with identical APIs
       • Scala, Java, Python
     • Use of closures, iterations, and other modern language constructs to minimize code
     • 2-5x less code

     Python
         lines = sc.textFile(...)
         lines.filter(lambda s: "ERROR" in s).count()

     Scala
         val lines = sc.textFile(...)
         lines.filter(s => s.contains("ERROR")).count()

     Java
         JavaRDD<String> lines = sc.textFile(...);
         lines.filter(new Function<String, Boolean>() {
           public Boolean call(String s) {
             return s.contains("ERROR");
           }
         }).count();
  7. Easy Development
     Use Interactively
     • Interactive exploration of data for data scientists
     • No need to develop “applications”
     • Developers can prototype applications on a live system

         $ ./bin/spark-shell --master local[*]
         ...
         Welcome to
               ____              __
              / __/__  ___ _____/ /__
             _\ \/ _ \/ _ `/ __/  '_/
            /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
               /_/

         Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
         Type in expressions to have them evaluated.
         Type :help for more information.
         ...
         scala> val words = sc.textFile("file:/usr/share/dict/words")
         ...
         words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

         scala> words.count
         ...
         res0: Long = 235886

         scala>
  8. Spark Takes Advantage of Memory
     Resilient Distributed Datasets (RDD)
     • Memory caching layer that stores data in a distributed, fault-tolerant cache
     • Can fall back to disk when a data set does not fit in memory
     • Created by parallel transformations on data in stable storage
     • Provides fault-tolerance through the concept of lineage
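The lineage idea on slide 8 can be sketched in plain Python. This is a toy model only, not Spark's RDD implementation: `ToyDataset`, `lose_cache`, and the sample numbers are hypothetical names and data introduced for illustration. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be rebuilt from stable storage.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: it records how it was derived."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source        # data in "stable storage" (root datasets only)
        self.parent = parent        # lineage: the dataset this one derives from
        self.transform = transform  # lineage: how to derive it from the parent
        self.cached: Optional[List[int]] = None  # in-memory copy, may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Serve from the cache when present; otherwise recompute via lineage.
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.transform(self.parent.collect())
        self.cached = rows
        return rows

    def lose_cache(self) -> None:
        # Simulate a failure that drops the in-memory copy.
        self.cached = None


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_cache()
recovered = evens_squared.collect()  # rebuilt from the recorded lineage
```

Nothing is replicated here; recovery relies only on the recorded chain of transformations, which is the property the slide attributes to lineage.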
  9. The Spark Ecosystem & Hadoop
     [Diagram: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, and SparkR on top of Spark; Spark alongside Impala, MR, Search, and others; all running on YARN (resource management) over HDFS and HBase (storage)]
  10. Cloudera is Driving the Spark Movement
      Timeline, 2013 to 2016 (milestones in the order they appear on the slide):
      • Identified Spark’s early potential
      • Ships and supports Spark with CDH 4.4
      • Added Spark on YARN integration
      • Announces initiative to make Spark the standard execution engine
      • Launches first Spark training
      • Added security integration
      • Cloudera engineers publish O’Reilly Spark book
      • Driving effort to further performance, usability, and enterprise-readiness
  11. Spark at Cloudera
      • Cloudera was the first Hadoop vendor to ship and support Spark
      • Spark is a fully integrated part of Cloudera’s platform
        • Shared data, metadata, resource management, administration, security, and governance
        • Complements specialized analytic tools for a comprehensive big data platform
      • Cloudera is the first Hadoop vendor to offer Spark training
        • Trained more customers than any other vendor
        • Most popular training course
      • Cloudera has 5x the engineering resources of the next competitor
        • Most committers on staff and most changes contributed
        • Well-trained staff across the globe with expertise implementing a broad range of Spark use cases
  12. Cloudera’s Engineering Commitment to Spark
      Spark Committers by Hadoop Distribution*
        Cloudera       57%
        Intel          29%
        Hortonworks    14%
      * IBM and MapR have 0 committers
      Spark Patches by Hadoop Distribution
        Cloudera       451
        Hortonworks    10
        IBM            29
        MapR           1
        Intel          466
  13. Cloudera Customers
      • More customers running Spark than all other vendors combined
        • Over 150 customers
        • Spark clusters as large as 800 nodes
      • Diverse range of use cases across multiple industries
        • Search personalization
        • Genomics research
        • Insurance modeling
        • Advertising optimization
        • Predictive modeling of disease conditions
  14. Cloudera Customer Use Cases
      Core Spark
        Financial Services
        • Portfolio Risk Analysis
        • ETL Pipeline Speed-Up
        • 20+ years of stock data
        Health
        • Identify disease-causing genes in the full human genome
        • Calculate Jaccard scores on health care data sets
        ERP
        • Optical Character Recognition and Bill Classification
        Data Services
        • Trend analysis
        • Document classification (LDA)
        • Fraud analytics
      Spark Streaming
        Financial Services
        • Online Fraud Detection
        Health
        • Incident Prediction for Sepsis
        Retail
        • Online Recommendation Systems
        • Real-Time Inventory Management
        Ad Tech
        • Real-Time Ad Performance Analysis
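Slide 14 mentions Jaccard scores without defining them. The Jaccard score of two sets is the size of their intersection divided by the size of their union. A minimal Python sketch follows; the patient records are invented sample data, and returning 1.0 for two empty sets is a convention chosen here, not something the slides specify.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard score: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical sample data: sets of diagnoses for two patients.
patient_a = {"hypertension", "diabetes", "asthma"}
patient_b = {"diabetes", "asthma", "arthritis"}

score = jaccard(patient_a, patient_b)  # 2 shared items out of 4 distinct items
```

Scores range from 0 (no overlap) to 1 (identical sets), which makes them convenient for comparing records of different sizes.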
  15. Spark will replace MapReduce as the standard execution engine for Hadoop
  16. Community Initiative: Spark Supersedes MapReduce
      Cloudera is driving community development to port components to Spark:
      Stage 1
      • Crunch on Spark
      • Search on Spark
      Stage 2
      • Hive on Spark (beta)
      • Spark on HBase (beta)
      Stage 3
      • Pig on Spark (alpha)
      • Sqoop on Spark
  17. Uniting Spark and Hadoop
      The One Platform Initiative Investment Areas
      • Management: Leverage Hadoop-native resource management.
      • Security: Full support for Hadoop security and beyond.
      • Scale: Enable 10k-node clusters.
      • Streaming: Support for 80% of common stream processing workloads.
  18. Management
      Leverage Hadoop-native resource management
      Spark-on-YARN
      • Drove Spark-On-YARN Integration
      • Improve Spark-On-YARN Integration for better multi-tenancy, performance and ease of use
      Metrics
      • Improved metrics for debugging and monitoring
      • Improve metrics for visibility into resource utilization
      • Revamp WebUI for better debugging and monitoring (especially at high concurrency)
      Automation
      • Dynamic Resource Allocation based on needs of job
      • Smart auto-selection and tuning of job parameters (when data volumes change)
      Accessibility
      • SparkSQL & Hive integration improvements
      • Easy Python dependency management for PySpark
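As an illustration of the Dynamic Resource Allocation item on slide 18, the sketch below assembles a `spark-submit` command line that turns the feature on. The `spark.dynamicAllocation.*` and `spark.shuffle.service.enabled` properties are documented Spark settings; the helper name `spark_submit_args`, the application file `my_job.py`, and the executor bounds are placeholders. The form of the `--master` value has varied across Spark releases, so treat `yarn` here as an assumption to check against the version in use.

```python
from typing import List


def spark_submit_args(app: str, min_executors: int, max_executors: int) -> List[str]:
    """Build a spark-submit argument list with dynamic allocation enabled."""
    conf = {
        # Let Spark grow and shrink the executor count with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    args = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical job and bounds, for illustration only.
args = spark_submit_args("my_job.py", min_executors=2, max_executors=20)
command_line = " ".join(args)
```

Building the arguments as a list rather than one string keeps each `--conf` pair intact if the list is later handed to a process launcher.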
  19. Security
      Full support for Hadoop security and beyond
      Perimeter
      • Kerberos Integration
      Visibility
      • Audit/Lineage via Cloudera Navigator
      • Full Spark PCI compliance
      Access
      • HDFS Sync (Apache Sentry)
      • Enable column- and view-level security
      Data Protection
      • Integration with Intel’s Advanced Encryption libraries
  20. Scale
      Enable 10K-Node Clusters
      Fault-Tolerance
      • Revamp Scheduler handling of node failure
      • Dynamic resource utilization & prioritization
      Performance
      • Task scheduling based on HDFS data locality & HDFS caching (reduce data movement and enable more jobs)
      • Integrate with HDFS Discardable Distributed Memory (reduce memory pressure)
      • Scheduler improvements for performance at scale
      Stability
      • Sort Based Shuffle Improvements for improved stability at scale
      • Stress test at scale with mixed multi-tenant workloads
      • Scale Spark History Server for 1000s of jobs
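The locality-based task scheduling item on slide 20 can be pictured with a small greedy placement routine. This is a toy sketch and not Spark's scheduler: the function `assign_tasks`, the block and node names, and the slot counts are all hypothetical. Each task prefers a node that already holds its input block and falls back to the node with the most free slots only when no local slot is available.

```python
from typing import Dict, List


def assign_tasks(block_hosts: Dict[str, List[str]],
                 free_slots: Dict[str, int]) -> Dict[str, str]:
    """Greedily place one task per block, preferring nodes that hold the block."""
    slots = dict(free_slots)  # work on a copy so the caller's view is unchanged
    assignment: Dict[str, str] = {}
    for block, hosts in block_hosts.items():
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            node = local[0]  # data-local placement: no block transfer needed
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                break  # cluster is full; remaining tasks would have to wait
            node = max(candidates, key=lambda n: slots[n])  # remote fallback
        assignment[block] = node
        slots[node] -= 1
    return assignment


# Hypothetical cluster state: which nodes hold each block, and free task slots.
block_hosts = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-a"],
    "block-3": ["node-c"],
}
free_slots = {"node-a": 1, "node-b": 1, "node-c": 0, "node-d": 2}

plan = assign_tasks(block_hosts, free_slots)
```

In this run block-1 takes the only slot on node-a, so block-2 loses its local option and runs remotely. A pure greedy pass has that weakness, which is one reason a real scheduler needs more than a first-fit rule.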
  21. Streaming
      Support for 80% of common stream processing workloads
      Zero Data Loss
      • Delivered full data resiliency
      Management
      • Streaming application management via Cloudera Manager for zero downtime
      Ingest
      • Delivered Flume integration and drove Kafka integration
      Accessibility
      • SQL interfaces and API extensions for Streaming jobs
      Performance
      • Improved State Management to enable maintaining a high volume of state information
  22. The Future of Data Processing on Hadoop
      Spark complemented by specialized fit-for-purpose engines
      • General Data Processing w/Spark: Fast Batch Processing, Machine Learning, and Stream Processing
      • Analytic Database w/Impala: Low-Latency Massively Concurrent Queries
      • Full-Text Search w/Solr: Querying textual data
      • On-Disk Processing w/MapReduce: Jobs at extreme scale and extremely disk IO intensive
      Shared:
      • Data Storage
      • Metadata
      • Resource Management
      • Administration
      • Security
      • Governance
  23. Cloudera is Built for Production Success
      A modern data platform plus what the enterprise requires.
      Hadoop delivers:
      • One place for unlimited data
      • Unified, multi-framework data access
      Cloudera delivers:
      • Leading Performance
      • Enterprise Security
      • Data Management
      • Simple Administration
      [Diagram: Process, Discover, Model, and Serve over Unlimited Storage, with Security and Administration]
      Deployment Flexibility: On-Premises, Appliances, Engineered Systems, Public Cloud, Private Cloud, Hybrid Cloud
  24. Spark Resources
      • Learn Spark
        • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
        • Cloudera Developer Blog
        • cloudera.com/spark
      • Get Trained
        • Cloudera Spark Training
      • Try it Out
        • Cloudera Live Spark Tutorial
  25. The conference for and by Data Scientists, from startup to enterprise
      wrangleconf.com
      Public registration is now open!
      Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
      When: Thursday, October 22, 2015
      Where: Broadway Studios, San Francisco
      © 2015 Cloudera, Inc. All rights reserved.
  26. Thank You!
