Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Denis Nagorny - Pumping Python Performance

480 vues

Publié le

http://www.it52.info/events/2016-01-25-rannts-8

Publié dans : Ingénierie
  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

Denis Nagorny - Pumping Python Performance

  1. 1. Sergey Maidanov Software Engineering Manager, Scripting Analyzers & Tools, Intel
  2. 2. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 2 Programming Languages by Popularity Python remains #1 programming language in hiring demand followed by Java and C++ Go and Scala demonstrate strong growth for last 2 years STAC Summit, November 2015 * Source: CodeEval, Feb 2015
  3. 3. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 3 Programming Languages Productivity STAC Summit, November 2015 0 1 2 3 4 5 Python Java C++ C PROGRAMMING COMPLEXITY (HOURS) Prechelt* Berkholz** 0 1 2 3 4 Python Java C++ C LANGUAGE VERBOSITY (LOC/FEATURE) Prechelt* Berkholz** * L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer, 2000, Vol. 33, Issue 10, pp. 23-29 ** RedMonk - D.Berkholz, Programming languages ranked by expressiveness
  4. 4. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Numerical Modeling: From Prototype To Production 4 1. Pre- Processing 2. Analysis 3. Modeling 4. Model Validation 5. Visualization 1. Pre- processing 2. Model Calibration 3. Decision Making 4. Model Validation 5. Reporting Prototyping Production Python*, R*, Matlab*, Excel* Fortran*, C++. Java*/Scala* HPC/Big Data ClusterWorkstation STAC Summit, November 2015 MPI*, OpenMP*, Hadoop*, Spark*, … 1x 3-10x and more Development cost Development cost
  5. 5. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 5 Why’s Interpreted Code Unfriendly To Modern HW? STAC Summit, November 2015 Moore’s law still works and will work for at least next 10 years We have hit limits in  Power  Instruction level parallelism  Clock speed But not in  Transistors (more memory, bigger caches, wider SIMD, specialized HW)  Number of cores Flop/Byte continues growing  10x worse in last 20 years Source: J.Cownie, HPC Trends and What They Mean for Me, Imperial College, Oct. 2015 Efficient software development means  Optimizations for data locality & contiguity  Vectorization  Threading
  6. 6. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 6 Performance: Native vs. Interpreted Code 0.2 1 5 25 125 625 3125 15625 Python C C (Parallelism) BLACK SCHOLES FORMULA MOPTIONS/SEC STAC Summit, November 2015 Chapter 19. Performance Optimization of Black Scholes Pricing Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS. 55x 349x Vectorization, threading, and data locality optimizations Static compilation
  7. 7. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 7 Possible Improvement Approaches How to make Python more usable in both prototyping & production?  Numerical/machine learning packages (NumPy/SciPy/Scikit-learn) accelerated with native libraries (e.g. Intel® MKL)  Python language extensions that exploit vectorization and multi- core parallelism, e.g. Cython (via GCC/ICC), Numba (LLVM)  Better performance profiling of Python codes  Packages and extensions for multi-node parallelism, e.g. mpi4py  Integration with Big Data/ML infrastructures (Hadoop, Spark) STAC Summit, November 2015
  8. 8. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), Ubuntu* built Python*: Python 2.7.10, NumPy 1.9.2 built with gcc 4.8.4; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS. 1 1 1 1 12X 4X 4X 4X 4X 22X 61X 70X 77X 97X SVD Cholesky QR Inv DGEMM Speedup Ubuntu Intel, 1 thread Intel, 32 threads Python* Performance Boost on Select Numerical Functions Intel® Distribution for Python (Technical Preview) vs. Ubuntu Python N=10000 N=40000 N=30000 N=25000 N=40000 Python Numerical Packages Acceleration STAC Summit, November 2015 8
  9. 9. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 9 Python Language Extensions For Performance STAC Summit, November 2015 0.2 1 5 25 125 625 3125 15625 Python NumExpr Numba Cython C (Parallelism) BLACK SCHOLES FORMULA MOPTIONS/SEC 36x 88x 6x <35% Configuration info: - Versions: Intel® Distribution for Python 2.7.10 Technical Preview 1 (Aug 03, 2015), Ubuntu* built Python*: Python 2.7.10, NumPy 1.9.2 built with gcc 4.8.4, llvm 3.6.0, Numba 0.22.0, icc 15.0; Hardware: Intel® Xeon® CPU E5-2698 v3 @ 2.30GHz (2 sockets, 16 cores each, HT=OFF), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz; Operating System: Ubuntu 14.04 LTS. 1x 1.2x 1.5x 10x6x Development cost 1x
  10. 10. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Scientific/Engineering Web/Social Business 10 Data Analytics Usage Models 1. Pre- Processing 2. Analysis 3. Modeling 4. Model Validation 5. Visualization 1. Pre- processing 2. Model Calibration 3. Decision Making 4. Model Validation 5. Reporting Interactive Modeling Visualization & Reporting Production Use R*, Python*, other 3rd party tools C++, Java*, Scala*, Python* Software DAAL Software DAAL Visual IDE Collaboration Services Data Store Infrastructure (Comms, Security) Management (Monitoring, Rule Management) Client Edge Data Source Edge C++, JS, C# Data Center Workstation SQL, noSQL, Files, Sensor Data C++, JS, C# STAC Summit, November 2015
  11. 11. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Intel Confidential – NDA presentation; presentation contains Intel confidential information presented to customers with a non-disclosure agreement 11 Big Data Requires End-To-End Solutions What device do I run analytics?  Perform analysis close to data source to optimize response latency, decrease network bandwidth utilization, and maximize security.  Offload data to cluster for large-scale analytics only.  Make personalized decisions on a client device Pre-processing Transformation Analysis Modeling Decision Making Decompression, Filtering, Normalization Aggregation, Dimension Reduction Summary Statistics Clustering, etc. Machine Learning (Training) Parameter Estimation Simulation Forecasting Decision Trees, etc. Scientific/Engineering Web/Social Business Compute (Server, Desktop, etc ) Client EdgeData Source Edge Validation Hypothesis testing Model errors STAC Summit, November 2015
  12. 12. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Kernel D Dense Big Data Flow and Computational Flow 12 … Observations, n Time Memory Capacity Categorical Blank/Missing Numeric Attributes,p Big Data Attributes Computational Solution Distributed across different nodes/devices •Distributed computing, e.g. comm-avoiding algorithms Huge data size not fitting into device memory •Distributed computing •Streaming algorithms Data coming in time •Data buffering •Streaming algorithms Non-homogeneous data •CategoricalNumeric (counters, histograms, etc) •Homogeneous numeric data kernels • Conversions, Indexing, Repacking Sparse/Missing/Noisy data •Sparse data algorithms •Recovery methods (bootstrapping, outlier correction) Distributed Computing Streaming Computing D1D2D3 Si+1 = T(Si,Di) Ri+1 = F(Si+1) R1 Rk D1 D2 Dk R2 R Si,Ri Offline Computing D1Dk- 1 Dk R … Append R = F(D1,…,Dk) Converts, Indexing, Repacking Data Recovery F D F I C C B F D F I C C B F D F I C C B F D F I C C B 1 2 7 4 5 6 3 C C C C8 D D D D D D D D 1 2 D D D D4 D D D D7 D D D D D D D D Kernel C Indexed D D D D D D D D Counter Histogram I I I I I F D F I C C B F D F I C C B F D F I C C B F D F I C C B 1 2 7 4 5 6 3 C C C C8 Outlier F D F I C C B F D F I C C B F D F I C C B F D F I C C B C C C C Recover STAC Summit, November 2015
  13. 13. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 13 Intel® DAAL – Essentials for End-To-End Analytics • C++ and Java*/Scala* library for data analytics • Targeting Python and R interfaces in future releases • “MKL” for machine learning with a few key differences • Optimizes entire data flow vs. compute part only • Targets both data center and edge devices • Supports offline, online and distributed data processing • Abstracted from cross-device communication layer • Allows plugging in different Big Data & IoT analytics frameworks • Comes with samples for Hadoop*, Spark*, MPI* • Builds upon MKL/IPP for best performance 4X 6X 6X 7X 7X 0 2 4 6 8 1M x 200 1M x 400 1M x 600 1M x 800 1M x 1000 Speedup Table size PCA (correlation method) on an 8-node Hadoop* cluster based on Intel® Xeon® Processors E5-2697 v3 Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0, Weka 3.6.12; Hardware: Intel® Xeon® Processor E5-2699 v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64. 2.75X 12X 23X 38X 0 10 20 30 40 50 50000 transactions 250000 transactions 500000 transactions 1000000 transactions Speedup Apriori on Intel® Xeon® Processor E5-2699 v3 STAC Summit, November 2015
  14. 14. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Intel Confidential – NDA presentation; presentation contains Intel confidential information presented to customers with a non-disclosure agreement 14 Summary • Python is among top productivity languages • Python is unfriendly to modern hardware, and hence to production use • New tools and libraries allow making tradeoff between productivity and performance • Intel is investing in Python to be more usable in prototyping and production • Intel® Distribution for Python - in Tech Preview Now! • Intel® VTune Amplifier for Python – available for evaluation Now! • Big Data analytics requires end-to-end solutions • Intel has “end-to-end” response for new analytics challenges • Intel® Data Analytics Acceleration Library 2016 product available Now!
  15. 15. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 15STAC Summit, November 2015 15

×