Denis Nagorny - Pumping Python Performance

Sergey Maidanov
Software Engineering Manager,
Scripting Analyzers & Tools, Intel

Copyright © 2015, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
2
Programming Languages by Popularity
Python remains #1
programming language in
hiring demand
followed by Java and C++
Go and Scala
demonstrate strong
growth for last 2 years
STAC Summit, November 2015
* Source: CodeEval, Feb 2015

Optimization Notice
3
Programming Languages Productivity
0
1
2
3
4
5
Python Java C++ C
PROGRAMMING
COMPLEXITY (HOURS)
Prechelt* Berkholz**
0
1
2
3
4
Python Java C++ C
LANGUAGE VERBOSITY
(LOC/FEATURE)
Prechelt* Berkholz**
* L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer,
2000, Vol. 33, Issue 10, pp. 23-29
** RedMonk - D.Berkholz, Programming languages ranked by expressiveness

Optimization Notice
Numerical Modeling: From Prototype To Production
4
1. Pre-
Processing
2. Analysis
3. Modeling
4. Model
Validation
5.
Visualization
1. Pre-
processing
2. Model
Calibration
3. Decision
Making
4. Model
Validation
5. Reporting
Prototyping Production
Python*,
R*, Matlab*,
Excel*
Fortran*,
C++.
Java*/Scala*
HPC/Big Data ClusterWorkstation
MPI*,
OpenMP*,
Hadoop*,
Spark*, …
1x
3-10x
and more
Development cost
Development cost

Optimization Notice
5
Why’s Interpreted Code Unfriendly To Modern HW?
Moore’s law still works and will work for at least next 10 years
We have hit limits in
 Power
 Instruction level parallelism
 Clock speed
But not in
 Transistors (more memory, bigger caches, wider SIMD, specialized HW)
 Number of cores
Flop/Byte continues growing
 10x worse in last 20 years
Source: J.Cownie, HPC Trends and What They Mean for
Me, Imperial College, Oct. 2015
Efficient software development means
 Optimizations for data locality & contiguity
 Vectorization
 Threading

Optimization Notice
6
Performance: Native vs. Interpreted Code
0.2
1
5
25
125
625
3125
15625
Python C C (Parallelism)
BLACK SCHOLES FORMULA
MOPTIONS/SEC
Chapter 19. Performance
Optimization of Black
Scholes Pricing
Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF),
64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
55x
349x
Vectorization,
threading, and data
locality optimizations
Static compilation

Optimization Notice
7
Possible Improvement Approaches
How to make Python more usable in both prototyping &
production?
 Numerical/machine learning packages (NumPy/SciPy/Scikit-learn)
accelerated with native libraries (e.g. Intel® MKL)
 Python language extensions that exploit vectorization and multi-
core parallelism, e.g. Cython (via GCC/ICC), Numba (LLVM)
 Better performance profiling of Python codes
 Packages and extensions for multi-node parallelism, e.g. mpi4py
 Integration with Big Data/ML infrastructures (Hadoop, Spark)

Optimization Notice
Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), Ubuntu* built Python*: Python 2.7.10, NumPy 1.9.2 built with gcc 4.8.4; Hardware: Intel® Xeon® CPU
E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
1 1 1 1 12X 4X 4X 4X 4X
22X
61X
70X
77X
97X
SVD Cholesky QR Inv DGEMM
Speedup
Ubuntu
Intel, 1 thread
Intel, 32 threads
Python* Performance Boost on Select Numerical Functions
Intel® Distribution for Python (Technical Preview) vs. Ubuntu Python
N=10000 N=40000 N=30000 N=25000 N=40000
Python Numerical Packages Acceleration
STAC Summit, November 2015 8

Optimization Notice
9
Python Language Extensions For Performance
0.2
1
5
25
125
625
3125
15625
Python NumExpr Numba Cython C (Parallelism)
BLACK SCHOLES FORMULA
MOPTIONS/SEC
36x
88x
6x
<35%
Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), Ubuntu* built Python*: Python 2.7.10, NumPy 1.9.2 built with gcc 4.8.4, llvm 3.6.0, Numba 0.22.0, icc
15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS.
1x
1.2x
1.5x
10x6x
Development cost 1x

Optimization Notice
Scientific/Engineering
Web/Social
Business
10
Data Analytics Usage Models
1. Pre-
Processing
2. Analysis
3. Modeling
4. Model
Validation
5.
Visualization
1. Pre-
processing
2. Model
Calibration
3. Decision
Making
4. Model
Validation
5. Reporting
Interactive Modeling
Visualization & Reporting Production Use
R*, Python*,
other 3rd party
tools
C++, Java*,
Scala*, Python*
Software
DAAL
Software
DAAL
Visual
IDE
Collaboration
Services
Data
Store
Infrastructure
(Comms, Security)
Management
(Monitoring, Rule
Management)
Client Edge
Data Source Edge
C++, JS, C#
Data Center
Workstation
SQL, noSQL,
Files,
Sensor Data
C++, JS, C#

Optimization Notice
Intel Confidential – NDA presentation; presentation contains Intel
confidential information presented to customers with a non-disclosure
agreement
11
Big Data Requires End-To-End Solutions
What device do I run analytics?
 Perform analysis close to data source to optimize response latency, decrease network
bandwidth utilization, and maximize security.
 Offload data to cluster for large-scale analytics only.
 Make personalized decisions on a client device
Pre-processing Transformation Analysis Modeling Decision Making
Decompression,
Filtering, Normalization
Aggregation,
Dimension Reduction
Summary Statistics
Clustering, etc.
Machine Learning (Training)
Parameter Estimation
Simulation
Forecasting
Decision Trees, etc.
Scientific/Engineering
Web/Social
Business
Compute (Server, Desktop, etc ) Client EdgeData Source Edge
Validation
Hypothesis testing
Model errors

Optimization Notice
Kernel D
Dense
Big Data Flow and Computational Flow
12
…
Observations, n
Time
Memory
Capacity
Categorical
Blank/Missing
Numeric
Attributes,p
Big Data Attributes Computational Solution
Distributed across different nodes/devices •Distributed computing, e.g. comm-avoiding algorithms
Huge data size not fitting into device
memory
•Distributed computing
•Streaming algorithms
Data coming in time •Data buffering
•Streaming algorithms
Non-homogeneous data •CategoricalNumeric (counters, histograms, etc)
•Homogeneous numeric data kernels
• Conversions, Indexing, Repacking
Sparse/Missing/Noisy data •Sparse data algorithms
•Recovery methods (bootstrapping, outlier correction)
Distributed Computing Streaming Computing
D1D2D3
Si+1 = T(Si,Di)
Ri+1 = F(Si+1)
R1
Rk
D1
D2
Dk
R2 R
Si,Ri
Offline Computing
D1Dk-
1
Dk
R
…
Append
R = F(D1,…,Dk)
Converts, Indexing,
Repacking
Data Recovery
F
D
F
I
C
C
B
F
D
F
I
C
C
B
F
D
F
I
C
C
B
F
D
F
I
C
C
B
1
2
7
4
5
6
3
C C C C8
D
D
D
D
D
D
D
D
1
2
D D D D4
D D D D7
D
D
D
D
D
D
D
D
Kernel C
Indexed
D
D
D
D
D
D
D
D
Counter
Histogram
I
I I I I
F
D
F
I
C
C
B
F
D
F
I
C
C
B
F
D
F
I
C
C
B
F
D
F
I
C
C
B
1
2
7
4
5
6
3
C C C C8
Outlier
F
D
F
I
C
C
B
F
D
F
I
C
C
B
F
D
F
I
C
C
B
F
D
F
I
C
C
B
C C C C
Recover

Optimization Notice
13
Intel® DAAL – Essentials for End-To-End Analytics
• C++ and Java*/Scala* library for data analytics
• Targeting Python and R interfaces in future releases
• “MKL” for machine learning with a few key
differences
• Optimizes entire data flow vs. compute part only
• Targets both data center and edge devices
• Supports offline, online and distributed data
processing
• Abstracted from cross-device communication layer
• Allows plugging in different Big Data & IoT analytics
frameworks
• Comes with samples for Hadoop*, Spark*, MPI*
• Builds upon MKL/IPP for best performance
4X
6X 6X
7X 7X
0
2
4
6
8
1M x 200 1M x 400 1M x 600 1M x 800 1M x 1000
Speedup
Table size
PCA (correlation method) on an 8-node Hadoop* cluster
based on Intel® Xeon® Processors E5-2697 v3
Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016,
CDH v5.3.1, Apache Spark* v1.2.0, Weka 3.6.12; Hardware: Intel® Xeon®
Processor E5-2699 v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of
RAM per node; Operating System: CentOS 6.6 x86_64.
2.75X
12X
23X
38X
0
10
20
30
40
50
50000
transactions
250000
transactions
500000
transactions
1000000
transactions
Speedup
Apriori on Intel® Xeon® Processor E5-2699
v3

Optimization Notice Intel Confidential – NDA presentation; presentation
contains Intel confidential information presented to
customers with a non-disclosure agreement
14
Summary
• Python is among top productivity languages
• Python is unfriendly to modern hardware, and hence to production use
• New tools and libraries allow making tradeoff between productivity and performance
• Intel is investing in Python to be more usable in prototyping and production
• Intel® Distribution for Python - in Tech Preview Now!
• Intel® VTune Amplifier for Python – available for evaluation Now!
• Big Data analytics requires end-to-end solutions
• Intel has “end-to-end” response for new analytics challenges
• Intel® Data Analytics Acceleration Library 2016 product available Now!

Optimization Notice
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information
and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when
combined with other products.
Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are
trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are
reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
15STAC Summit, November 2015 15

Denis Nagorny - Pumping Python Performance

Denis Nagorny - Pumping Python Performance

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (19)

En vedette

En vedette (20)

Similaire à Denis Nagorny - Pumping Python Performance

Similaire à Denis Nagorny - Pumping Python Performance (20)

Dernier

Dernier (20)

Denis Nagorny - Pumping Python Performance

Notes de l'éditeur