Harnessing Spark and Cassandra with Groovy

This talk is an introduction to a powerful combination in the big data space: Apache Spark and Cassandra. Spark is a cluster-computing framework that allows users to perform calculations against resilient in-memory datasets using a functional programming interface. Cassandra is a linearly scalable, fault tolerant, decentralized datastore. These two technologies are complicated, but integrate well and provide such a level of utility that whole companies have formed around them.

In this talk we’ll learn how Spark and Cassandra can be leveraged within your Groovy application, even though Spark normally expects a Scala environment. We’ll talk about Spark and Cassandra from a high level and walk through code examples. We’ll discuss the pitfalls of working with these technologies (such as modeling your data appropriately to ensure even distribution in Cassandra, and general packaging woes with Spark) and ways to avoid them. Finally, we’ll explore how we at ThirdChannel are using these technologies.

Published in: Software

Harnessing Spark and Cassandra with Groovy

  1. Harnessing the Power of Spark + Cassandra with Groovy. Steve Pember, CTO, ThirdChannel. Gr8Conf US, 2017. @svpember
  2. Relational Databases are Fantastic
  3. SQL makes you Strong
  5. Agenda • Spark • Cassandra • Spark + Cassandra • Working with Spark + Cassandra • Demo
  6. Apache Spark • Distributed Execution Engine
  8. Apache Spark • Distributed Execution Engine • What about Hadoop?
  9. Hadoop: • Map / Reduce • Storage via HDFS • Each calculation step written to disk
     Spark: • More than Map/Reduce • No dependent storage mechanism • Clustered calculations, each step in memory
  10. Apache Spark • Distributed Execution Engine • What about Hadoop? • Creation was a Happy Accident
  12. Apache Spark • Distributed Execution Engine • What about Hadoop? • Creation was a Happy Accident • Architecture
  14. Your Groovy App
  15. Apache Spark • Distributed Execution Engine • What about Hadoop? • Creation was a Happy Accident • Architecture • Programmatic structure
  16. The SparkContext submits Jobs to the Cluster
  17. Operations are performed against RDDs
  18. Resilient Distributed Dataset • Immutable • Partitioned • Parallel operations • Created by performing operations on other RDDs • Reusable & Composable
  20. Apache Spark • Distributed Execution Engine • What about Hadoop? • Creation was a Happy Accident • Architecture • Programmatic structure • APIs
  21. More Than Map/Reduce
  22. RDD operations • map • reduce • filter • flatMap • zip • groupBy • … plus many more
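The operations listed on the RDD operations slide have direct analogues on ordinary in-memory collections. The sketch below uses plain Python lists, not Spark, purely to illustrate what each operation computes. The sample data and the `flat_map` and `group_by` helper names are assumptions made for illustration; in Spark the same operations would run in parallel across the partitions of an RDD.

```python
from functools import reduce
from itertools import chain

# Hypothetical sample data: two lines of text.
lines = ["spark and cassandra", "groovy and spark"]


def flat_map(fn, items):
    """Apply fn to each item and flatten the resulting lists by one level."""
    return list(chain.from_iterable(fn(item) for item in items))


def group_by(key_fn, items):
    """Group items into a dict of lists keyed by key_fn(item)."""
    groups = {}
    for item in items:
        groups.setdefault(key_fn(item), []).append(item)
    return groups


words = flat_map(str.split, lines)                 # flatMap: one line -> many words
lengths = list(map(len, words))                    # map: one word -> its length
long_words = [w for w in words if len(w) > 3]      # filter: keep words longer than 3
total_chars = reduce(lambda a, b: a + b, lengths)  # reduce: fold to a single value
pairs = list(zip(words, lengths))                  # zip: pair up two datasets
by_initial = group_by(lambda w: w[0], words)       # groupBy: bucket by first letter
```

None of these steps modifies `lines` or `words`; each produces a new collection, which mirrors the "Immutable" and "Created by performing operations on other RDDs" points from the earlier slide.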
  24. Apache Spark • Distributed Execution Engine • What about Hadoop? • Creation was a Happy Accident • Architecture • Programmatic structure • APIs • Additional Modules
  25. Spark SQL…!
  26. JDBC?
  27. Spark Streaming!
  28. Agenda • Spark • Cassandra
  29. Apache Cassandra (C*) • NoSQL Datastore
  30. Apache Cassandra (C*) • NoSQL Datastore • Distributed
  31. Deterministic Distribution
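"Deterministic distribution" can be pictured as a token ring: a row's partition key is hashed to a token, and the node responsible for that token range stores the row, so any client can compute where a key lives. The following is a minimal sketch of that idea in plain Python, not Cassandra's implementation. The `TokenRing` class, the node names, and the use of MD5 as the hash are all assumptions for illustration (Cassandra's default partitioner is Murmur3-based, and real clusters assign many tokens per node).

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a key to a 64-bit token. MD5 is only a stand-in hash here."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical ring in which each node owns exactly one token."""

    def __init__(self, node_names):
        # Sort nodes by token so ownership does not depend on input order.
        self._ring = sorted((token_for(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        """Return the first node whose token is >= the key's token, wrapping around."""
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[idx % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
owner_of_store_1 = ring.owner("store-1")  # same answer on every call and every client
```

Because placement is a pure function of the key, a poorly chosen partition key (for example, one with only a few distinct values) concentrates data on a few nodes, which is the even-distribution pitfall the abstract mentions.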
  33. Apache Cassandra (C*) • NoSQL Datastore • Distributed • High Replication
  36. Apache Cassandra (C*) • NoSQL Datastore • Distributed • High Replication • High Durability
  38. Apache Cassandra (C*) • NoSQL Datastore • Distributed • High Replication • High Durability • Linear Scalability
  39. Each new Node results in increased Storage with no loss in performance
  41. Apache Cassandra (C*) • NoSQL Datastore • Distributed • High Replication • High Durability • Linear Scalability • Data Model (CQL)
  42. Column Oriented Database
  43. But it’s SQL-like!
  47. Querying
  48. C* Querying • select * from • all queries must include partition key(s) in where clause • order by limited to clustering keys • cannot alter keys, queries must always be by same keys
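The querying rules on this slide follow from how rows are laid out: grouped by partition key, and stored in sorted order within each partition (by what CQL calls the clustering key). The toy model below is a sketch in plain Python, not Cassandra's storage engine; the `PartitionedTable` class, its method names, and the sample rows are all hypothetical.

```python
import bisect
from collections import defaultdict


class PartitionedTable:
    """Toy model: rows grouped by partition key, kept sorted by clustering key."""

    def __init__(self):
        # partition key -> sorted list of (clustering key, value)
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps each partition ordered by clustering key on write.
        bisect.insort(self._partitions[partition_key], (clustering_key, value))

    def select(self, partition_key=None, descending=False):
        # Without a partition key there is no way to know which node to ask.
        if partition_key is None:
            raise ValueError("query must include the partition key")
        rows = list(self._partitions.get(partition_key, []))
        # The only cheap orderings are the stored order and its reverse.
        return rows[::-1] if descending else rows


sales = PartitionedTable()
sales.insert("store-1", "2017-07-02", "widget")
sales.insert("store-1", "2017-07-01", "gadget")
sales.insert("store-2", "2017-07-01", "gizmo")

store_1_rows = sales.select("store-1")
```

In this model, sorting by `value` or looking rows up by a different key would mean writing the data again under a different layout, which is why the slide says queries must always be by the same keys.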
  49. Apache Cassandra (C*) • NoSQL Datastore • Distributed • High Replication • High Durability • Linear Scalability • Data Model (CQL) • Designing your Data Model
  52. Agenda • Spark • Cassandra • Spark + Cassandra
  53. Spark + Cassandra • Reduce each other’s weaknesses • Filter on the server side (with C*) • Join tables, filter results (with Spark)
  54. Companies have been formed
  56. Cluster Design
  58. Data Locality!
  61. Pipeline architecture
  63. Agenda • Spark • Cassandra • Spark + Cassandra • Working with Spark + Cassandra
  64. Coding Spark + C*
  65. Terminology • SparkConf • JavaSparkContext • JavaFunctions • Mappers
  67. Spark Conf • spark.master -> URL of the master node • spark.app.name -> want to see your client show up in the Spark UI? • spark.executor.memory -> limits memory per executor on workers • spark.executor.cores -> limits cores on each worker (need to share with C*!) • spark.submit.deployMode -> ‘client’ or ‘cluster’ • spark.jars.packages -> maven / gradle type names • spark.jars.ivy -> specify custom repos for packages • more at: http://spark.apache.org/docs/latest/configuration.html#available-properties
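As a rough illustration of how the Spark Conf properties fit together, the sketch below collects a few of them in a plain Python dict and renders them as `--conf key=value` arguments of the kind `spark-submit` accepts. Only the property names come from the slide; every value, and the `to_submit_args` helper, is a hypothetical placeholder.

```python
# Hypothetical values throughout; only the property names come from the slide.
spark_conf = {
    "spark.master": "spark://spark-master.example.com:7077",
    "spark.app.name": "groovy-spark-demo",
    "spark.executor.memory": "2g",
    "spark.executor.cores": "2",  # leave cores free for a co-located C* node
    "spark.submit.deployMode": "client",
}


def to_submit_args(conf):
    """Render properties as repeated `--conf key=value` arguments, sorted by key."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


submit_args = to_submit_args(spark_conf)
```

Inside an application the same key/value pairs would typically be set on a SparkConf object rather than passed on the command line.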
  68. Master Url Overloading • “local” -> use Spark in stand alone mode. One thread • “local[<K>]” -> Spark, stand alone, with K threads • “local[*]” -> Spark, stand alone, with ALL YOUR THREADS! • “spark://<host string>:<port>” -> url for a Spark cluster master node, using Spark’s cluster management • also options for Mesos and Yarn
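The master URL forms on this slide can be told apart with a few pattern checks. The function below is a small sketch that classifies only the forms listed on the slide; `describe_master` and its return strings are hypothetical and not part of any Spark API, and the Mesos and Yarn options are deliberately left unhandled.

```python
import re


def describe_master(url: str) -> str:
    """Classify a Spark master URL into the forms listed on the slide."""
    if url == "local":
        return "local, 1 thread"
    if url == "local[*]":
        return "local, all available cores"
    match = re.fullmatch(r"local\[(\d+)\]", url)
    if match:
        return f"local, {int(match.group(1))} threads"
    match = re.fullmatch(r"spark://([^:/]+):(\d+)", url)
    if match:
        host, port = match.group(1), match.group(2)
        return f"cluster master at {host} port {port}"
    raise ValueError(f"unrecognised master URL: {url!r}")


# Hypothetical host name; 7077 is the default port of a Spark master.
description = describe_master("spark://spark-master.example.com:7077")
```

The "local" forms are convenient for development and tests, since the whole job runs inside the application's own JVM without a cluster.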
  70. However, a Warning
  71. But where does my code live?
  73. CLASS_PATH: org.apache.spark, com.fasterxml.jackson, com.yourco.yourapp.pojos.* • CLASS_PATH: org.apache.spark, com.fasterxml.jackson • CLASS_PATH: org.apache.spark, com.fasterxml.jackson
  74. Agenda • Spark • Cassandra • Spark + Cassandra • Working with Spark + Cassandra • Demo
  75. Thank You! @svpember
  76. Links • Cassandra on AWS official Whitepaper: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf • Demo code: https://github.com/spember/ratpack-spark-cassandra-demo
  77. Images • Database Sharding: https://dzone.com/articles/ebay-secret-database-scaling • Indiana Jones Warehouse: http://logisticalfictions.tumblr.com/page/9 • Strong (Spongebob): www.reactiongifs.com/strongbob/?utm_source=rss&utm_medium=rss&utm_campaign=strongbob • Cheetah: www.livescience.com/21944-usain-bolt-vs-cheetah-animal-olympics.html • Big Data Cartoon: http://www.kdnuggets.com/2016/08/cartoon-make-data-great-again.html • Spark Streaming: http://velvia.github.io/presentations/2015-filodb-spark-streaming/#/ • Picard + Riker: http://www.douxreviews.com/2015/09/star-trek-next-generation-matter-of.html • Software Engineers: http://pyxurz.blogspot.com/2011/10/office-space-page-2-of-6.html
