
Why your Spark job is failing

50,388 views

Published in: Data & Analytics


  1. ● Data science at Cloudera ● Recently led Apache Spark development at Cloudera ● Before that, committer on Apache YARN and MapReduce ● Hadoop project management committee
  2. com.esotericsoftware.kryo.KryoException: Unable to find class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$4$$anonfun$apply$3
  3. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...] Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler. org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages (DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. apply(DAGScheduler.scala:1015) [...]
  4. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...]
  5. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...] (a minimal way to reproduce this kind of failure is sketched after the transcript)
  6. val file = sc.textFile("hdfs://...") file.filter(_.startsWith("banana")) .count()
  7. Job Stage Task Task Task Task Stage Task Task Task Task
  8. val rdd1 = sc.textFile("hdfs://...") .map(someFunc) .filter(filterFunc) textFile map filter
  9. val rdd2 = sc.hadoopFile("hdfs://...") .groupByKey() .map(someOtherFunc) hadoopFile groupByKey map
  10. val rdd3 = rdd1.join(rdd2) .map(someFunc) join map
  11. rdd3.collect() (a runnable sketch of this whole pipeline, and where its stage boundaries fall, follows the transcript)
  12. textFile map filter hadoopFile groupByKey map join map
  13. textFile map filter hadoopFile groupByKey map join map
  14. Stage Task Task Task Task
  15. Stage Task Task Task Task
  16. Stage Task Task Task Task
  17. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...]
  18. 14/04/22 11:59:58 ERROR executor.Executor: Exception in task ID 2866 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:648) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706) at java.io.DataInputStream.read(DataInputStream.java:100) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)
  19. ResourceManager NodeManager NodeManager Container Container Application Master Container Client
  20. ResourceManager Client NodeManager NodeManager Container Map Task Container Application Master Container Reduce Task
  21. ResourceManager Client NodeManager NodeManager Container Map Task Container Application Master Container Reduce Task
  22. ResourceManager Client NodeManager NodeManager Container Map Task Container Application Master Container Reduce Task
  23. Container [pid=63375, containerID=container_1388158490598_0001_01_000003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container.
  24. yarn.nodemanager.resource.memory-mb Executor container spark.yarn.executor.memoryOverhead spark.executor.memory spark.shuffle.memoryFraction spark.storage.memoryFraction (a configuration sketch using these properties follows the transcript)
  25. ExternalAppendOnlyMap Block Block deserialize deserialize
  26. ExternalAppendOnlyMap key1 -> values key2 -> values key3 -> values
  27. ExternalAppendOnlyMap key1 -> values key2 -> values key3 -> values
  28. ExternalAppendOnlyMap Sort & Spill key1 -> values key2 -> values key3 -> values
  29. rdd.reduceByKey(reduceFunc, numPartitions=1000) (a fuller version of this one-liner follows the transcript)
  30. java.io.FileNotFoundException: /dn6/spark/local/spark-local-20140610134115-2cee/30/merged_shuffle_0_368_14 (Too many open files)
  31. Task Task Write stuff key1 -> values key2 -> values key3 -> values out key1 -> values key2 -> values key3 -> values Task
  32. Task Task Write stuff key1 -> values key2 -> values key3 -> values out key1 -> values key2 -> values key3 -> values Task
  33. Partition 1 File Partition 2 File Partition 3 File Records
  34. Records Buffer
  35. Single file Buffer Sort & Spill Partition 1 Records Partition 2 Records Partition 3 Records Index file
  36. conf.set("spark.shuffle.manager", "sort") (shown in context after the transcript)
  37. ● No ● Distributed systems are complicated
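
A minimal way to reproduce the kind of stage failure shown on slides 3-5 and 17, assuming the spark-shell (the $iwC$$iwC... frames come from how the shell wraps closures); the data and the failing expression below are invented purely for illustration:

    // Any exception thrown inside a task (here an ArithmeticException from
    // integer division by zero) fails that task; after spark.task.maxFailures
    // attempts (4 by default) Spark aborts the stage and the job.
    val nums = sc.parallelize(1 to 10)
    val broken = nums.map(x => x / (x - x))   // x - x is 0, so every record throws
    broken.count()                            // the action triggers the failure

The frames pointing at <console>:13 in the slide are the task-side user code; the "Driver stacktrace" section below them only shows the scheduler aborting the stage, which is why it is worth reading past it to the per-task exception.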
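The code on slides 8-11, assembled into one runnable sketch. someFunc, filterFunc, someOtherFunc, and the paths are not given in the deck, so the stand-ins below are assumptions chosen only to keep the types consistent, and the slides' hadoopFile call is swapped for textFile to keep the sketch short. Slides 12-16 make the point that map and filter stay inside a stage, while groupByKey and join introduce shuffles and therefore stage boundaries.

    // Assumes sc is an existing SparkContext (e.g. the spark-shell; use :paste for multi-line input).
    val rdd1 = sc.textFile("hdfs://namenode/path/one")         // hypothetical path
      .map(line => (line.split("\t")(0), line))                // stand-in for someFunc
      .filter { case (_, line) => line.nonEmpty }              // stand-in for filterFunc

    val rdd2 = sc.textFile("hdfs://namenode/path/two")         // hypothetical path
      .map(line => (line.split("\t")(0), 1))
      .groupByKey()                                            // shuffle: a new stage starts here
      .map { case (k, ones) => (k, ones.size) }                // stand-in for someOtherFunc

    val rdd3 = rdd1.join(rdd2)                                  // another shuffle, another stage
      .map { case (k, (line, count)) => s"$k\t$count\t$line" }

    rdd3.collect()   // the action submits the job; Spark splits it into stages at the shuffles above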
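One way to react to the "running beyond physical memory limits" kill on slide 23, using the Spark 1.x property names listed on slide 24; the numbers are illustrative assumptions, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // The executor container Spark asks YARN for is roughly
    // spark.executor.memory + spark.yarn.executor.memoryOverhead, and it has to
    // fit under the NodeManager's yarn.nodemanager.resource.memory-mb.
    val conf = new SparkConf()
      .setAppName("memory-tuning-sketch")                   // hypothetical app name
      .set("spark.executor.memory", "2g")                   // executor JVM heap
      .set("spark.yarn.executor.memoryOverhead", "512")     // MB of off-heap headroom (Spark 1.x name)
      .set("spark.shuffle.memoryFraction", "0.3")           // heap share for shuffle aggregation (pre-1.6)
      .set("spark.storage.memoryFraction", "0.5")           // heap share for cached blocks (pre-1.6)
    val sc = new SparkContext(conf)

The message on slide 23 means the whole process footprint, heap plus off-heap, outgrew the container YARN granted, so raising the overhead (or lowering the heap) is usually the relevant knob rather than the heap alone.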
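Slides 25-28 show the reduce side pulling shuffle blocks into an ExternalAppendOnlyMap that sorts and spills to disk once it grows too large; slide 29's fix spreads the same keys over more reduce tasks. A fuller version of that one-liner, with an assumed input path and a stand-in reduce function:

    // 'pairs' stands in for whatever keyed RDD the job aggregates; the path and
    // the word-count-style reduce function are assumptions for illustration.
    val pairs = sc.textFile("hdfs://namenode/some/input")
      .map(line => (line.split("\t")(0), 1))

    val reduceFunc = (a: Int, b: Int) => a + b

    // More partitions means fewer keys and less data per reduce task, so each
    // task's in-memory map stays smaller and spills less often.
    val counts = pairs.reduceByKey(reduceFunc, 1000)   // 1000 = numPartitions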
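Slides 31-35 contrast the hash-style shuffle (each map task writes one file per reduce partition, which is where slide 30's "Too many open files" comes from) with the sort-based shuffle (one sorted data file plus an index file per map task). Slide 36's setting in context, assuming Spark 1.1+ where the sort shuffle manager is available:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-manager-sketch")          // hypothetical app name
      .set("spark.shuffle.manager", "sort")          // one sorted file + index per map task
    // If staying on the hash shuffle, spark.shuffle.consolidateFiles=true cuts the
    // file count, and raising the executors' ulimit -n is the blunter workaround.
    val sc = new SparkContext(conf)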
