Heuritech: Apache Spark REX

  1. APACHE SPARK REX
  2. ABOUT ME
     - Didier Marin
     - PhD in Computer Science (UPMC)
     - Machine Learning, Reinforcement Learning & Robotics
     - Co-founder of Heuritech
     - Likes functional programming and distributed computing
  3. We develop tools to make sense of raw text data
     - Customer insight using the text of visited web pages
  4. Data Analytics Platform
     - Qualify users using their web logs
     - 50M lines/day
     - Match CRM and web data
  5. WHY SPARK?
     - Performance, in particular when batch size < total RAM in cluster
     - More general than MR, high-level API
     - Extensions (ML, streaming) and connectors (Cassandra)
     - Growing community
  6. PARSING LOGS
         def parseLine(line: String): Either[ParsingError, LogData] = ???
         val logs = sc.textFile("logfile").map(parseLine(_))
         val validLogs = logs.flatMap(_.right.toOption)
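
The slide leaves `parseLine` unimplemented (`???`). The following is a minimal sketch of what such a parser could look like, not the one Heuritech used. The fields of `LogData` and `ParsingError` and the tab-separated input format are assumptions made for illustration, and a plain `List` stands in for the RDD so that the example runs without Spark.

```scala
object ParsingSketch {
  // Hypothetical types: the slide does not show their fields.
  final case class ParsingError(line: String, reason: String)
  final case class LogData(userId: String, timestamp: Long, url: String)

  // Assumed input format (not from the slides): "userId<TAB>epochMillis<TAB>url".
  def parseLine(line: String): Either[ParsingError, LogData] =
    line.split("\t", -1) match {
      case Array(userId, ts, url) if userId.nonEmpty && url.nonEmpty =>
        scala.util.Try(ts.toLong).toOption match {
          case Some(t) => Right(LogData(userId, t, url))
          case None    => Left(ParsingError(line, s"invalid timestamp: $ts"))
        }
      case _ =>
        Left(ParsingError(line, "expected 3 tab-separated fields with non-empty user and url"))
    }

  // A plain List stands in for sc.textFile("logfile").
  val lines: List[String] = List(
    "u1\t1400000000000\thttp://example.com/a",
    "garbage",
    "u2\tnot-a-number\thttp://example.com/b"
  )

  val logs: List[Either[ParsingError, LogData]] = lines.map(parseLine)

  // Same effect as flatMap(_.right.toOption) on the slide: keep only the successes.
  val validLogs: List[LogData] = logs.collect { case Right(log) => log }

  // The failures stay available for inspection instead of being silently dropped.
  val errors: List[ParsingError] = logs.collect { case Left(err) => err }
}
```

Returning an `Either` rather than throwing keeps malformed lines from failing the whole job, and the `Left` side can be counted or logged separately.
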
  7. LAMBDA ARCHITECTURE
  8. IMPLEMENTATION
  9. CLUSTER CONFIGURATION
     - LXC + Salt
     - N containers: 1 master/executor + (N-1) executors
     - Cassandra node for each Spark executor
     - Using an "uber"-JAR to submit jobs
     - Sharing data through NFS
  10. MANAGING SPARK'S MEMORY
      - Default: 40 % working memory, 60 % cache
      - 20 % of cache used to unroll blocks
      - Explicit caching for huge RDDs we reuse:
            validLogs.persist(StorageLevel.MEMORY_AND_DISK)
      - Partition tuning may be necessary (spark.default.parallelism)
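
As a rough sketch of how the settings on this slide might be expressed in code, assuming a Spark 1.x application with spark-core on the classpath: `spark.storage.memoryFraction` and `spark.storage.unrollFraction` are the keys of the Spark 1.x static memory manager, and later releases manage memory differently, so treat them as version-specific. The application name, the local master, the parallelism value, and the sample data are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object MemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("memory-sketch")                  // hypothetical name
      .setMaster("local[2]")                        // local mode, for illustration only
      .set("spark.storage.memoryFraction", "0.6")   // share of the heap reserved for cached blocks
      .set("spark.storage.unrollFraction", "0.2")   // share of that cache usable to unroll blocks
      .set("spark.default.parallelism", "8")        // hypothetical value, tune per cluster
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the validLogs RDD of the parsing slide.
      val validLogs = sc.parallelize(Seq("log line 1", "log line 2", "log line 1"))

      // Keep partitions in memory and spill to disk what does not fit.
      validLogs.persist(StorageLevel.MEMORY_AND_DISK)

      val total = validLogs.count()                 // first action materializes the cache
      val distinct = validLogs.distinct().count()   // second action reuses the cached partitions
      println(s"total=$total distinct=$distinct")

      validLogs.unpersist()                         // release the cache once the RDD is no longer reused
    } finally {
      sc.stop()
    }
  }
}
```

Persisting only pays off when the RDD is used by more than one action, which is why the slide restricts it to RDDs that are reused.
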
  11. AGGREGATION
          val words = sc.parallelize(List("a", "b", "a", "c"))
          words.groupBy(x => x).mapValues(_.size).collect
          // Array((a,2),(b,1),(c,1))
          words.map(x => (x, 1)).reduceByKey(_ + _).collect
          // Array((a,2),(b,1),(c,1))
  12. AGGREGATION
      - groupBy
  13. AGGREGATION
      - reduceByKey
      - see also combineByKey & foldByKey
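
The two aggregation slides contrast `groupBy` with `reduceByKey`. The sketch below is a plain-Scala simulation, not Spark code: `reduceByKey` combines values inside each partition before the shuffle, whereas `groupBy` ships every record. The two partitions and their contents are hypothetical, and nested lists stand in for a partitioned RDD.

```scala
object ShuffleSketch {
  // Two hypothetical partitions of a word RDD.
  val partitions: List[List[String]] = List(List("a", "a", "b"), List("a", "c", "c"))

  // groupBy-style: every (word, 1) record is shuffled, then counted on the reduce side.
  val shuffledByGroup: List[(String, Int)] = partitions.flatten.map(w => (w, 1))
  val countsByGroup: Map[String, Int] =
    shuffledByGroup.groupBy(_._1).map { case (w, recs) => (w, recs.size) }

  // reduceByKey-style: each partition pre-aggregates locally,
  // so only one partial count per word and partition is shuffled.
  val shuffledByReduce: List[(String, Int)] =
    partitions.flatMap(p => p.groupBy(identity).map { case (w, ws) => (w, ws.size) }.toList)
  val countsByReduce: Map[String, Int] =
    shuffledByReduce.groupBy(_._1).map { case (w, partials) => (w, partials.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled with groupBy:     ${shuffledByGroup.size}")
    println(s"records shuffled with reduceByKey: ${shuffledByReduce.size}")
    println(s"same result: ${countsByGroup == countsByReduce}")
  }
}
```

Both approaches produce the same counts. The gap in shuffled records grows with the number of repeated keys per partition, which is the usual reason to prefer `reduceByKey` for sums and counts.
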
  14. USEFUL LINKS
      - Databricks knowledge base: github.com/databricks/spark-knowledgebase
      - Spark users mailing list: apache-spark-user-list.1001560.n3.nabble.com
      - Parsing Apache logs with Spark (Scala): alvinalexander.com/scala/analyzing-apache-access-logs-files-spark-scala
  15. THANK YOU!
      - contact@heuritech.com
