2017 DataWorks Summit, San Jose, California
Sandra Guija and Snehal Sakhare
Speed it up and Spark it up
at Intel
IT@Intel
About Sandra
• Capability Engineer in the Big Data Team @ Intel IT
• Past 5 years extending Big Data capabilities beyond Hadoop
• Master of Science in Computer Science, CSU Sacramento
• Specialization in in-memory parallel processing and distributed systems
About Snehal
• 3+ years working with Hadoop as an Application Developer @ Intel IT
• Master of Science in Computer Science, CSU Sacramento
• Publications: Power Efficient MapReduce Workload Acceleration using Integrated-GPU
• Zumba lover
Objective
• Share our experience and key learnings from building a data ingestion framework with Spark
• How to speed up application development using a reusable ingestion framework
• How to improve job performance
Reusable Framework.
Speed it up
1. Rapid Data Ingestion
2. Variety of data sources
3. Reusable solution
4. Skill Challenges
5. Increase productivity
Reusable Framework with Spark
• Spark takes us to the next level
• Data is distributed in memory
• Lazy computation – optimizes the job before executing
• Efficient pipelining – avoids data hitting the hard disk
• Uniform data access using Spark SQL
• Spark SQL is the core of the ingestion framework
Video
Spark SQL Ingestion: Code snippet
JDBC extract:
Google Analytics extract:
SFTP extract:
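The extract snippets on this slide are screenshots and are not recoverable from the transcript. Below is a minimal PySpark sketch of only the JDBC extract; the connection details are placeholders of our own, and the Google Analytics and SFTP extracts feed the same Spark SQL flow through their own connectors:

# JDBC extract (url, table, and credentials are illustrative, not from the slide)
df = (sqlContext.read.format('jdbc')
      .option("url", "jdbc:mysql://db_host:3306/sales")
      .option("dbtable", "orders")
      .option("user", "svc_account")
      .option("password", "********")
      .load())
df.write.mode("overwrite").parquet("/my_hdfs_path/orders")   # land the extract in HDFS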
BDRF: Seamless Spark job deployment
• Job deployment is easier than ever in local, yarn-client, or yarn-cluster mode.
spark_mode: yarn-cluster
• Dependencies management in yarn-cluster mode.
#Run spark batch job
if [ "$SPARK_MODE" = 'yarn-cluster' ]; then
  spark-submit --master ${SPARK_MODE} \
    --conf "spark.driver.extraClassPath=${SPARK_CLUSTER_EXTRACLASSPATH}" \
    --conf "spark.executor.extraClassPath=${SPARK_CLUSTER_EXTRACLASSPATH}" \
    --conf "spark.executor.memory=${EXEC_MEM}" \
    --conf "spark.yarn.dist.files=${SPARK_YARN_DIST_FILES},${INI_FILE}" \
    --py-files "${SPARK_PYTHON_FILES}" \
    ${SPARK_SCRIPT} $(basename ${INI_FILE}) ${HDENV} ${NUM_EXEC} ${GATEWAY}
fi
• Publish a framework utilization dashboard for projects.
Spark it Up
Spark @ Intel IT
Spark it up
Better resources utilization
Lessons learned:
1. JDBC single executor
2. Tune-up executor memory
3. Large datasets/numerous files
4. Custom package cluster mode
Spark it up
Better resources utilization
1. Parallelism
2. Challenge with Big datasets
3. Deploy dependencies
4. Optimization
Parallelism
How do we ensure data and processing are distributed evenly across the worker nodes?
• Increase the number of executors
• Use transformations:
  – reduceByKey(), repartition(), coalesce()
• Set properties (see the sketch below):
  – spark.sql.shuffle.partitions
  – spark.default.parallelism
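A minimal PySpark sketch of these knobs, assuming a SQLContext named sqlContext, a SparkContext named sc, and a DataFrame df as in the earlier snippets; the values are illustrative only:

# Partitions produced by Spark SQL shuffles (joins, aggregations)
sqlContext.setConf("spark.sql.shuffle.partitions", "500")
# Default parallelism for RDD operations is set at submit time; inspect it with:
print(sc.defaultParallelism)

# Spread the data across more (or fewer) partitions explicitly
df_even = df.repartition(500)     # full shuffle, rebalances the partitions
df_small = df_even.coalesce(50)   # narrow dependency, merges partitions without a full shuffle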
JDBC Single Executor
• By default, a JDBC read ingests the data in a single partition:
df = sqlContext.read.format('jdbc').options(...).load()
Solution: add the partition options
df = (sqlContext.read.format('jdbc')
      .options(...)                              # url, dbtable, driver, credentials
      .option("partitionColumn", "emp_no")
      .option("lowerBound", "10001")
      .option("upperBound", "499999")
      .option("numPartitions", "10")
      .load())
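A quick way to confirm the read really is parallelized (assuming the df above):

print(df.rdd.getNumPartitions())   # expect 10 with the partition options, 1 without them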
Data Skewed
[Chart: records processed per executor, 1 executor vs. 10 executors]
Solution:
• Tune the number of partitions
  .option("numPartitions", "10")  →  .option("numPartitions", "1000")
• Tune the number of executors
Performance Comparison: num-executors & partitions
Ingest 4,587,847 records

# executors | # cores per executor | memory per executor | # partitions | Processing Time | Performance
1           | 3                    | 4 GB                | 1            | 18m, 47s        | 1.0 x
5           | 3                    | 4 GB                | 1000         | 9m, 15s         | 2.0 x
10          | 3                    | 4 GB                | 1000         | 8m, 52s         | 2.1 x
10          | 3                    | 4 GB                | 500          | 4m, 44s         | 3.9 x
Performance Comparison: filtering number of columns
Ingest 11,513,057 records with 121 columns

# executors | # cores per executor | memory per executor | # partitions | # columns | Processing Time | Performance
10          | 3                    | 4 GB                | 500          | 121       | 14m, 23s        | 1.0 x
10          | 3                    | 4 GB                | 500          | 90        | 9m, 44s         | ~ 1.5 x
10          | 3                    | 4 GB                | 500          | 60        | 6m, 12s         | ~ 2.3 x
10          | 3                    | 4 GB                | 500          | 30        | 4m, 29s         | ~ 3.2 x
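The column filtering above can be applied right after the extract; a minimal sketch, assuming a DataFrame df holding the 121-column source (column names and target path are hypothetical):

# Keep only the columns the downstream processing actually needs
needed_cols = ["emp_no", "dept", "salary"]     # hypothetical subset of the 121 columns
df_slim = df.select(*needed_cols)
df_slim.write.mode("overwrite").parquet("/my_hdfs_path/slim_table")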
Challenge with Big datasets
• Failures when running queries on:
  – > 2.3 TB
  – > 6 billion records
WARN servlet.ServletHandler: Error for /jobs/job/
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
ERROR server.TransportRequestHandler: Error sending result RpcResponse{requestId=8891220538372697062,
body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=166 cap=166]}} to node019:40175; closing connection
org.apache.spark.SparkException: Error sending message
Caused by: java.nio.channels.ClosedChannelException
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
→ Container is killed
Challenge with Big datasets
Solution: use a formula to set executor memory and cores

memory available to each task =
  (spark.executor.memory × spark.shuffle.memoryFraction × spark.shuffle.safetyFraction) / spark.executor.cores

Example 1: 8 GB executor memory, 6 cores
  (8 × 1024 MB × 0.2 × 0.8) / 6 ≈ 218 MB per task

Example 2: 6 GB executor memory, 4 cores
  (6 × 1024 MB × 0.2 × 0.8) / 4 ≈ 245 MB per task
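A small helper for sizing jobs with this formula; a sketch only, where the 0.2 and 0.8 defaults correspond to spark.shuffle.memoryFraction and spark.shuffle.safetyFraction in Spark 1.x:

def shuffle_memory_per_task(executor_memory_mb, executor_cores,
                            memory_fraction=0.2, safety_fraction=0.8):
    """Approximate shuffle memory (in MB) available to each task."""
    return executor_memory_mb * memory_fraction * safety_fraction / executor_cores

print(shuffle_memory_per_task(8 * 1024, 6))   # ~218 MB
print(shuffle_memory_per_task(6 * 1024, 4))   # ~245 MB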
Deploy dependencies on cluster
• The Spark Python API is a non-JVM language API
  – no JVM platform independence
  – no JAR files or code packaging
• Managing and deploying dependencies on a cluster can be a pain
  – executors need access to third-party or custom libraries
  – Python must be set up on each node
Solution: how to deploy packages?
  – Create a conda environment
  – Install libraries
  – Zip the conda environment and load it into HDFS
  – Set Python environment variables and run the PySpark job
Deploy dependencies on cluster
1. Create conda environments directory
   conda config --add envs_dirs /path_to_conda_dir/
2. Create conda env (env-name=py35)
   conda create -n py35 --copy -y -q python=3.5
3. Install favorite Python packages
   conda install -c conda-forge fuzzywuzzy -n py35
4. Zip the conda environment and ship it
   zip -r py35.zip py35
   ln -sf "/path_to_conda_dir/py35.zip" "PY35"
   hdfs dfs -put py35.zip /my_hdfs_path/.
5. Run command to launch the PySpark job in cluster mode
   PYSPARK_DRIVER_PYTHON=./PY35/py35/bin/python spark-submit \
     --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON="./PY35/py35/bin/python" \
     --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON="./PY35/py35/bin/python" \
     --master yarn-cluster \
     --archives hdfs://path_to_conda_dir/py35.zip#PY35 my_conda_test.py
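my_conda_test.py itself is not shown on the slides; a minimal sketch of what such a job might look like, with the input path and column names being our own illustrations:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from fuzzywuzzy import fuzz          # resolved from the shipped py35 environment

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("/my_hdfs_path/vendor_names")   # hypothetical input
# Fuzzy-match a name column against a reference string on the executors
matched = df.rdd.map(lambda row: (row.vendor_name,
                                  fuzz.ratio(row.vendor_name, "Intel Corporation")))
print(matched.take(5))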
Performance Comparison: deploy dependencies
– Custom libraries: fuzzywuzzy

BroadcastJoin | Processing Time | Performance
Local-mode    | 40m, 16s        | 1.0 x
Cluster-mode  | 6m, 23s         | 6.3 x
Optimization
• spark.sql.autoBroadcastJoinThreshold
  – Enabled  → SET spark.sql.autoBroadcastJoinThreshold=20000000
  – Disabled → SET spark.sql.autoBroadcastJoinThreshold=-1
• explain()

Enabled:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=20000000")
joined.explain()

Disabled:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
joined.explain()
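An alternative to the session-wide threshold is to request the broadcast explicitly on the small table; a sketch, assuming DataFrames big_df and small_df and a hypothetical join key:

from pyspark.sql.functions import broadcast

# Ship the small table to every executor instead of shuffling the big one
joined = big_df.join(broadcast(small_df), on="vendor_id", how="left")
joined.explain()   # the plan should show a broadcast (hash) join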
Performance Comparison: set autoBroadcastJoinThreshold
• Big Table   – Size: 2.3 TB  – Records: > 6 billion
• Small Table – Size: 324 KB  – Records: 1,707

BroadcastJoin | Processing Time | Performance
Default       | 14m, 56s        | 1.0 x
set=20000000  | 9m, 19s         | 1.6 x
Optimization
Use DataFrame Caching
• Caching DataFrames (see the sketch below)
  – for DataFrames that are repeatedly loaded, transformed, or joined
  – avoids re-processing time
  – df.cache()
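A minimal sketch of the pattern, assuming a DataFrame that feeds several downstream aggregations (path and column names are illustrative):

df = sqlContext.read.parquet("/my_hdfs_path/big_table")   # hypothetical source
df.cache()                          # kept in memory once the first action materializes it

df.groupBy("dept").count().show()   # first action: computes df and populates the cache
df.groupBy("site").count().show()   # second pass reuses the cached data, no re-read
df.unpersist()                      # release the memory when done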
... In summary
Run your Spark job → Check logs!!! → Find bottlenecks → Spark it up (and repeat)
Legal Disclaimer
THE INFORMATION PROVIDED IN THIS PAPER IS INTENDED TO BE GENERAL IN NATURE AND IS
NOT SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE
BASED UPON INTEL’S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE
OR WARRANT OTHERS WILL OBTAIN SIMILAR RESULTS.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS AND
SERVICES. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED
IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY,
RELATING TO SALE AND/OR USE OF INTEL PRODUCTS AND SERVICES INCLUDING LIABILITY
OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Intel, the Intel logo, Intel. Experience What’s Inside, the Intel. Experience What’s Inside logo, and Xeon
are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2017 Intel Corporation. All rights reserved.
Q & A