
Spark performance tuning eng


Spark performance tuning according to the number of executors.
(Korea Polytechnics - Fintech)



  1. 1. Spark Performance Tuning 2018. 11. 23(Fri)
  2. 2. Contents 1. Overview 2. Service Configuration 3. Package & Test 2
  3. 3. 1. Overview 3 Objective • To check the processing performance of RDB and HDFS data transfer using Spark Tasks 1) data (15 million) transfer (completed) 2) data (100 million) transfer (completed) Premise • Performance differences are possible depending on network conditions Schedule • 2018.11.22(Thu) ~ 11.23(Fri)
  4. 4. Service configuration 4 1. System Configuration 2. Data Configuration
  5. 5. According to the configuration below, data transfer is performed 5 Load: Source Data (Sellout Data, Oracle DB) → Processing: Spark (Transfer Sellout Data, Spark/Python) → Unload: Output Data (Sellout Data, Oracle DB / HDFS) 1. System Configuration (S/W and H/W per stage)
  6. 6. 1. System Configuration 6 Service configuration / Hadoop Ecosystem: p-master (Hadoop Name Node, Secondary Name Node, Spark Master, Hive Master, Resource Manager, MariaDB), hadoop1 / hadoop2 / hadoop3 (Hadoop DataNode, Spark Worker, NodeManager) No Lv1 Lv2 Version Contents: 1 Oracle Linux 7.3 OS, 2 Hadoop 2.7.6 Distributed Storage, 3 Spark 2.0.2 Distributed Processing, 4 Hive 2.3.3 Support SQL (Master only), 5 MariaDB 10.2.11 RDB (Master only), 6 Oracle Client 18.3.0.0.0 Oracle DB client
  7. 7. 1. System Configuration (Process) 7 Input DB 192.168.110.112 → Hadoop Ecosystem: p-master 192.168.110.117 (Hadoop Name Node, Secondary Name Node, Spark Master, Hive Master, Resource Manager, MariaDB), hadoop1 192.168.110.118 / hadoop2 192.168.110.119 / hadoop3 192.168.110.120 (Hadoop DataNode, Spark Worker, NodeManager) → Output DB 192.168.110.111
  8. 8. 2. Data Configuration 8 Define the outbound and inbound data. Inbound — No InterfaceID Content System Type Count Periods Column cnt Comments: 1 IB-001 Sellout Dev System RDB 100 million - 17; 2 IB-002 Sellout Dev System RDB 15 million 17; 3 IB-003 Parameter Dev System RDB 2 5. Outbound — No InterfaceID Content System Type Count Periods Column cnt Comments: 1 OB-001 Sellout Op System RDB 100 million - 17; 2 OB-002 Sellout Op System RDB 15 million 17
  9. 9. Package & Test 9 1. Parameter Mgmt 2. Source Implementation 3. Package 4. Test
  10. 10. 1. Parameter Mgmt 10 Implement the parameter map by selecting only the necessary data information (for code flexibility); a sketch of such a map follows below.
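A minimal sketch of what such a parameter map could look like in Scala. Every key, table name, and credential here is an illustrative assumption (only the IP addresses come from slide 7); the point is that the load/unload code reads its settings from one place instead of hard-coding them.

```scala
// Hypothetical parameter map: all keys, table names, and credentials are assumptions.
// Keeping connection and query settings in one map keeps the transfer code generic.
object TransferParams {
  val params: Map[String, String] = Map(
    "jdbc.input.url"  -> "jdbc:oracle:thin:@192.168.110.112:1521:ORCL",  // input DB (slide 7)
    "jdbc.output.url" -> "jdbc:oracle:thin:@192.168.110.111:1521:ORCL",  // output DB (slide 7)
    "jdbc.user"       -> "scott",
    "jdbc.password"   -> "tiger",
    "source.table"    -> "SELLOUT",
    "target.table"    -> "SELLOUT_OUT",
    "numPartitions"   -> "18"
  )
}
```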
  11. 11. 2. Source Implementation 11
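The original slide shows the source code as a screenshot that the transcript does not capture. Below is a minimal, hypothetical Scala sketch of an Oracle → Spark → Oracle transfer of the sellout data with the Spark 2.0 JDBC API; the table names, credentials, partition column, and bounds are assumptions, not the deck's actual code.

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch of an Oracle -> Spark -> Oracle transfer (identifiers are assumptions).
object SelloutTransfer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SelloutTransfer").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "scott")                       // hypothetical credentials
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Read the source table in parallel; partition column and bounds are assumptions.
    val sellout = spark.read.jdbc(
      "jdbc:oracle:thin:@192.168.110.112:1521:ORCL",  // input DB (slide 7)
      "SELLOUT",                                      // hypothetical source table
      "SELLOUT_ID", 1L, 100000000L, 18,               // partitionColumn, bounds, numPartitions
      props)

    // Write the result to the target Oracle DB (ORACLE -> SPARK -> ORACLE case).
    sellout.write.mode(SaveMode.Append)
      .jdbc("jdbc:oracle:thin:@192.168.110.111:1521:ORCL", "SELLOUT_OUT", props)

    spark.stop()
  }
}
```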
  12. 12. 3. Package 12 Maven: compile the production code, package it into a jar file, and manage compatible external modules Compile & Package / Manage plug-ins / Manage dependencies
  13. 13. 4. Test 15 million (ORACLE → SPARK → ORACLE) spark-env.sh: Fix / spark-defaults.conf: Configure spark-submit --class com.spark.c1_dataLoadWrite.s9_Meddata sparkProgramming-spark-1.0.jar > test.log & If the number of cores must be limited: spark.cores.max=10
  14. 14. 4. Test 15 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 3 Executor-core 2 Executor-memory 20 Total-core 18 (3 * 3 * 2) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 3 * 20) Cluster * Ex-count * Ex-memory spark-defaults.conf (see the configuration sketch below) 6.6 minutes
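As an illustration of how the 3-executor / 2-core / 20-memory setting above might be expressed, here is a sketch under two assumptions: Spark standalone mode, and memory figures in GB. The same values could equally be placed in spark-defaults.conf.

```scala
import org.apache.spark.SparkConf

// Sketch of the 3-executors-per-node setting, assuming Spark standalone mode and GB units.
// In standalone mode, setting spark.executor.cores below the worker's core count lets each
// worker start several executors for the same application.
object ThreeExecutorConf {
  val conf = new SparkConf()
    .setAppName("SelloutTransfer")
    .set("spark.executor.cores", "2")     // Executor-core
    .set("spark.executor.memory", "20g")  // Executor-memory
    .set("spark.cores.max", "18")         // Total-core = 3 nodes * 3 executors * 2 cores
}
```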
  15. 15. 4. Test 15 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21 (3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-memory spark-defaults.conf 6.7 minutes
  16. 16. 4. Test 15 million (ORACLE → SPARK → HDFS) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21 (3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-memory spark-defaults.conf (see the HDFS write sketch below) 5.7 minutes
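For the ORACLE → SPARK → HDFS variant only the unload step changes. A short sketch, reusing the sellout DataFrame from the source-implementation sketch above; the name node address matches p-master on slide 7, while the port, the path, and the Parquet format are assumptions.

```scala
// HDFS variant of the unload step (port, path, and file format are assumptions).
sellout.write
  .mode("overwrite")
  .parquet("hdfs://192.168.110.117:9000/data/sellout")
```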
  17. 17. 4. Test 100 million (ORACLE → SPARK → ORACLE) 17 Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 3 Executor-core 2 Executor-memory 20 Total-core 18 (3 * 3 * 2) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 3 * 20) Cluster * Ex-count * Ex-memory spark-defaults.conf 44 minutes
  18. 18. 4. Test 100 million (ORACLE → SPARK → ORACLE) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21 (3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-memory spark-defaults.conf 47 minutes
  19. 19. 4. Test 100 million (ORACLE → SPARK → HDFS) Div Value Contents Cluster 3 slave 3 → 118,119,120 Worker 1 Executor-count 1 Executor-core 7 Executor-memory 60 Total-core 21 (3 * 7) Cluster * Ex-count * Ex-core Total-memory 180 (3 * 60) Cluster * Ex-count * Ex-memory spark-defaults.conf 51 minutes
  20. 20. Conclusion 20
  21. 21. Conclusion • Generating a larger number of executors from the same resources can help improve performance • Setting numPartitions is important when manipulating data; next time, a performance check according to the number of partitions is required Div 15 million rows 100 million rows Oracle <> Oracle (1 Executor) 6.7 min 47 min Oracle <> Oracle (3 Executors) 6.6 min 44 min Oracle <> HDFS (1 Executor) 5.7 min 51 min
  22. 22. Conclusion • Generating a larger number of executors from the same resources can help improve performance • Setting numPartitions is important when manipulating data; next time, a performance check according to the number of partitions is required (a sketch of the numPartitions knob follows below)
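Since the follow-up item concerns numPartitions, here is a minimal sketch of where that knob sits, reusing the SparkSession and connection properties from the source-implementation sketch above; the partition column, bounds, and the value 36 are assumptions to be tuned in the next test, not measured settings.

```scala
// numPartitions controls how many concurrent JDBC queries read the source table,
// and therefore how the transfer work is spread over the executors.
val selloutDf = spark.read.jdbc(
  "jdbc:oracle:thin:@192.168.110.112:1521:ORCL",
  "SELLOUT",
  "SELLOUT_ID", 1L, 100000000L, 36,   // numPartitions = 36, e.g. ~2x the total core count
  props)

// repartition() before the write likewise controls the number of concurrent JDBC writers.
selloutDf.repartition(36)
  .write.mode("append")
  .jdbc("jdbc:oracle:thin:@192.168.110.111:1521:ORCL", "SELLOUT_OUT", props)
```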
  23. 23. Thank you 23 End of Document
