
Intro to Apache Spark

This presentation is a comprehensive introduction to Apache Spark, covering its rapid ascent and its performance and developer-productivity advantages over MapReduce. We also explore its built-in functionality for streaming, machine learning, and Extract, Transform and Load (ETL) applications.

  1. Introduction to Apache Spark
  2. Mammoth Data, based in downtown Durham (right above Toast): The Leader in Big Data Consulting
     ● BI/Data Strategy: development of a business intelligence / data architecture strategy.
     ● Installation: installation of Hadoop or the relevant technology.
     ● Data Consolidation: load data from diverse sources into a single scalable repository.
     ● Streaming: Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
     ● Visualization Tools: Mammoth will set up visualization tools (e.g. Tableau, Pentaho); we will also create initial reports and provide training to the employees who will analyze the data.
  3. Me!
     ● Lead Consultant on all things DevOps and Spark
     ● @carsondial
  4. What Is Apache Spark?!
     ● Apache Spark™ is a fast and general engine for large-scale data processing
  5. What Is Apache Spark?! No, But Really…
     ● Framework for massive parallel computing (cluster)
     ● Harnesses the power of cheap memory
     ● Directed Acyclic Graph (DAG) computing engine
     ● It goes very fast!
     ● Apache project (spark.apache.org)
  6. History
     ● Began at UC Berkeley in 2009
     ● Apache project since 2013
     ● Top-level Apache project since 2014
     ● Creators formed databricks.com
  7. Why Spark?
     ● Performance
     ● Developer productivity
  8. Performance!
     ● Graysort benchmark (100TB)
     ● Hadoop: 72 minutes / 2100 nodes / datacentre
     ● Spark: 23 minutes / 206 nodes / AWS
     ● HDFS versus memory
  9. Developers!
     ● First-class support for Scala, Java, Python, and R!
     ● Data Science friendly
  10. Word Count: Hadoop
  11. Word Count: Spark

     from pyspark import SparkContext

     logFile = "hdfs:///input"
     sc = SparkContext("spark://spark-m:7077", "WordCount")
     textFile = sc.textFile(logFile)

     # Split each line into words, pair each word with 1, then sum per word.
     wordCounts = textFile.flatMap(lambda line: line.split()) \
                          .map(lambda word: (word, 1)) \
                          .reduceByKey(lambda a, b: a + b)

     wordCounts.saveAsTextFile("hdfs:///output")
  12. Spark: Batteries Included
     ● Spark Streaming
     ● GraphX (graph algorithms)
     ● MLLib (machine learning)
     ● Dataframes (data access)
  13. Applications
     ● Analytics (batch / streaming)
     ● Machine Learning
     ● ETL (Extract - Transform - Load)
     ● …and many more!
  14. RDDs – The Building Block
     ● RDD = Resilient Distributed Dataset
     ● Immutable, fault-tolerant
     ● Operated on in parallel
     ● Can be created manually or from external sources
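     Both creation paths, as a minimal PySpark sketch (assuming sc is an existing SparkContext, e.g. from the pyspark shell):

     # Created manually, by distributing a local Python collection:
     nums = sc.parallelize([1, 2, 3, 4, 5])

     # Created from an external source (here the HDFS path from the word count):
     lines = sc.textFile("hdfs:///input")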
  15. RDDs – The Building Block
     ● Transformations
     ● Actions
     ● Transformations are lazy
     ● Actions evaluate the pipelined transformations as well as performing the action itself (see the sketch after slide 17)
  16. RDDs – Example Transformations
     ● map()
     ● filter()
     ● pipe()
     ● sample()
     ● …and more!
  17. RDDs – Example Actions
     ● reduce()
     ● count()
     ● take()
     ● saveAsTextFile()
     ● …and yes, more
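     A short sketch of how the last three slides fit together (again assuming an existing SparkContext sc). The transformations only record the pipeline; nothing is computed until the action on the final line:

     words = sc.parallelize(["spark", "hadoop", "spark", "mesos"])

     # Transformations: lazily recorded, no work happens yet.
     pairs = words.map(lambda w: (w, 1))
     spark_only = pairs.filter(lambda kv: kv[0] == "spark")

     # Action: triggers evaluation of the whole pipeline.
     print(spark_only.count())  # 2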
  18. Word Count: Spark

     from pyspark import SparkContext

     logFile = "hdfs:///input"
     sc = SparkContext("spark://spark-m:7077", "WordCount")
     textFile = sc.textFile(logFile)
     wordCounts = textFile.flatMap(lambda line: line.split()) \
                          .map(lambda word: (word, 1)) \
                          .reduceByKey(lambda a, b: a + b)
     wordCounts.saveAsTextFile("hdfs:///output")
  19. RDDs – cache()
     ● cache() / persist()
     ● When an action is performed for the first time, keep the result in memory
     ● Different levels of persistence available
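     A sketch of the caching behaviour described above (assuming sc as before); cache() is shorthand for persist() at the default MEMORY_ONLY level:

     from pyspark import StorageLevel

     words = sc.textFile("hdfs:///input").flatMap(lambda line: line.split())

     words.cache()    # equivalent to words.persist(StorageLevel.MEMORY_ONLY)
     words.count()    # first action: computes the RDD and keeps it in memory
     words.count()    # second action: served from the cached copy

     # Other levels trade memory for disk, e.g.:
     # words.persist(StorageLevel.MEMORY_AND_DISK)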
  20. Streaming
     ● Micro-batches (DStreams of RDDs)
     ● Access to other parts of Spark (MLLib, GraphX, Dataframes)
     ● Fault-tolerant
     ● Connectors for Kafka, Flume, Kinesis, ZeroMQ
     ● (we’ll come back to this)
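     A minimal streaming word count, assuming a SparkContext sc and a plain-text source (e.g. nc -lk 9999) on localhost:9999:

     from pyspark.streaming import StreamingContext

     # One-second micro-batches; each batch arrives as an RDD in a DStream.
     ssc = StreamingContext(sc, 1)

     lines = ssc.socketTextStream("localhost", 9999)
     counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
     counts.pprint()  # print each batch's counts as it arrives

     ssc.start()
     ssc.awaitTermination()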
  21. Dataframes
     ● Spark SQL
     ● Support for JSON, Cassandra, SQL databases, etc.
     ● Easier syntax than RDDs
     ● Dataframes ‘borrowed’ from Python/R
     ● Catalyst query planner
  22. Dataframes: Example

     val sc = new SparkContext()
     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     val df = sqlContext.read.json("people.json")

     df.show()
     df.filter(df("age") >= 35).show()
     df.groupBy("age").count().show()
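     The same example in PySpark, for comparison (assuming the same people.json with name and age fields):

     from pyspark.sql import SQLContext

     sqlContext = SQLContext(sc)
     df = sqlContext.read.json("people.json")

     df.show()
     df.filter(df["age"] >= 35).show()
     df.groupBy("age").count().show()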
  23. Dataframes: Catalyst
     ● Optimizes query planning for Spark
     ● Takes Dataframe operations and ‘compiles’ them down to RDD operations
     ● Often faster than writing RDD code manually
     ● Use Dataframes whenever possible (v1.4+)
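     You can inspect what Catalyst compiles a query down to with explain(), reusing the df from the previous example:

     # Prints the physical plan Catalyst produced for this query.
     df.filter(df["age"] >= 35).groupBy("age").count().explain()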
  24. Dataframes: Catalyst
  25. MLLib
     ● Machine Learning
     ● Includes algorithm implementations for Bayes, k-means clustering, ALS, word2vec, random forests, etc.
     ● Matrix operations (dense / sparse), dimensionality reduction, etc.
     ● And basic stats too!
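     A small k-means sketch against the RDD-based MLLib API (assuming sc as before; the points are made up to form two obvious clusters):

     from pyspark.mllib.clustering import KMeans

     points = sc.parallelize([
         [0.0, 0.0], [1.0, 1.0],
         [8.0, 9.0], [9.0, 8.0],
     ])

     model = KMeans.train(points, k=2, maxIterations=10)
     print(model.clusterCenters)
     print(model.predict([0.5, 0.5]))  # index of the nearest cluster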
  26. MLLib - Pipelines
     ● Common interface between different ML solutions
     ● (still in progress, but production-ready as of 1.5)
     ● Pipelines are to MLLib what Dataframes are to RDDs
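     A tiny text-classification pipeline in the spirit of the spark.ml API (assuming the sqlContext from the Dataframes example; the toy training rows are made up):

     from pyspark.ml import Pipeline
     from pyspark.ml.classification import LogisticRegression
     from pyspark.ml.feature import HashingTF, Tokenizer

     training = sqlContext.createDataFrame([
         ("spark is fast", 1.0),
         ("hadoop mapreduce", 0.0),
     ], ["text", "label"])

     # Each stage transforms the Dataframe and hands it to the next stage.
     tokenizer = Tokenizer(inputCol="text", outputCol="words")
     hashingTF = HashingTF(inputCol="words", outputCol="features")
     lr = LogisticRegression(maxIter=10)

     model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)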
  27. GraphX
     ● Graph processing algorithms
     ● Operations on vertices and edges
     ● Includes the PageRank algorithm
     ● Can be combined with Streaming/Dataframes/MLLib
  28. Deploying Spark
     ● Standalone
     ● YARN (Hadoop ecosystem)
     ● Mesos (Hipster ecosystem)
  29. Using Spark
     ● Traditional (write code, submit to cluster)
     ● REPL / shell (write code interactively, backed by the cluster)
     ● Interactive notebooks (iPython/Zeppelin)
  30. Interactive Notebooks
     ● Log / diary approach to data science
     ● Type code into a web page
     ● Visualizations built in
  31. Interactive Notebooks
     ● iPython / Jupyter: most popular
     ● Zeppelin: built for Spark
  32. Interactive Notebooks
  33. Links
     ● spark.apache.org
     ● databricks.com
     ● zeppelin.incubator.apache.org
     ● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
  34. Questions?
