Blaze: a large-scale, array-oriented infrastructure for Python
                              Travis E. Oliphant
                          PyData Silicon Valley 2013




Tuesday, March 19, 13
Brief History

          Person                                       Package                    Year
          Jim Fulton                                   Matrix Object in Python    1994
          Jim Hugunin                                  Numeric                    1995
          Perry Greenfield, Rick White, Todd Miller    Numarray                   2001
          Travis Oliphant                              NumPy                      2005



Early pieces of SciPy
                  fftw wrappers     June 1998
                  cephesmodule      November 1998
                  stats.py          December 1998     (Gary Strangman)




1999 : Early SciPy emerges
       Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis
    environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
                 and others. Activity in 1998 led to increased interest in 1999.

     In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be
    present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
                  be creating this uber-package, which eventually became SciPy.

              Gaussian quadrature                              5 Jan 1999
              cephes 1.0                                      30 Jan 1999
              sigtools 0.40                                   23 Feb 1999
              Numeric docs                                    March 1999
              cephes 1.1                                       9 Mar 1999
              multipack 0.3                                   13 Apr 1999
              Helper routines                                 14 Apr 1999
              multipack 0.6 (leastsq, ode, fsolve, quad)      29 Apr 1999
              sparse plan described                           30 May 1999
              multipack 0.7                                   14 Jun 1999
              SparsePy 0.1                                     5 Nov 1999
              cephes 1.2 (vectorize)                          29 Dec 1999

              Plotting??  Gist, XPLOT, DISLIN, Gnuplot
              Helping with f2py


SciPy 2001
     Founded in 2001 with Travis Vaught

     Travis Oliphant     optimize, sparse, interpolate, integrate, special,
                         signal, stats, fftpack, misc
     Eric Jones          weave, cluster, GA*
     Pearu Peterson      linalg, interpolate, f2py



Community effort                          many, many others --- forgive me!
     • Chuck Harris
     • Pauli Virtanen
     • David Cournapeau
     • Stefan van der Walt
     • Jake Vanderplas
     • Josef Perktold
     • Anne Archibald
     • Dag Sverre Seljebotn
     • Robert Kern
     • Matthew Brett
     • Warren Weckesser
     • Ralf Gommers
     • Joe Harrington --- Documentation effort
     • Andrew Straw --- www.scipy.org




1,000,000 to 2,000,000 users of NumPy!

Now What?




What is good about NumPy?
         • Array-oriented
         • Extensive DType System (including structures)
         • C-API --- lots of libraries
         • Simple to understand data-structure
         • Memory mapping
         • Syntax support from Python
         • Large community of users
         • Ufuncs and more
         • Broadcasting
         • Easy to interface C/C++/Fortran code
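
Two of the strengths listed above, broadcasting and ufuncs, fit in a few lines. This is a minimal sketch with made-up data; the variable names are illustrative only.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)          # shape (3, 1)
row = np.arange(4)                        # shape (4,)

# Broadcasting: the (3, 1) and (4,) operands are stretched to (3, 4).
table = col * 10 + row

# Ufuncs carry methods such as reduce; here a column-wise sum.
col_sums = np.add.reduce(table, axis=0)
```

No Python-level loop is written for either step.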


What is wrong with NumPy

           • Dtype system is difficult to extend
           • Immediate mode creates huge temporaries
             (spawning Numexpr)
           • “Almost” an in-memory database comparable
             to SQLite (missing indexes)
           • Integration with sparse arrays
           • Lots of un-optimized parts
           • Minimal support for multi-core / GPU
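
To make the "huge temporaries" point concrete: in immediate mode, `a + b * c` first materializes `b * c` as a full-size array. The sketch below evaluates the same expression block by block so that no full-size temporary is needed. The helper name `blocked_add_mul` and the block size are assumptions for illustration; this shows the general idea behind tools like Numexpr, not their implementation.

```python
import numpy as np

def blocked_add_mul(a, b, c, block=65536):
    """Compute a + b*c for 1-D arrays one block at a time (hypothetical helper)."""
    out = np.empty_like(a)
    for start in range(0, a.shape[0], block):
        stop = start + block
        # Results are written straight into the output block.
        np.multiply(b[start:stop], c[start:stop], out=out[start:stop])
        np.add(a[start:stop], out[start:stop], out=out[start:stop])
    return out

a = np.arange(1_000_000, dtype=np.float64)
b = np.full_like(a, 2.0)
c = np.full_like(a, 3.0)

immediate = a + b * c               # allocates a full-size temporary for b * c
blocked = blocked_add_mul(a, b, c)  # same values, block-sized working set
```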


Improvements needed
       • NDArray improvements
         • Indexes (esp. for Structured arrays)
         • SQL front-end
         • Multi-level, hierarchical labels
         • selection via mappings (labeled arrays)
         • Memory spaces (array made up of regions)
         • Distributed arrays (global array)
         • Compressed arrays
         • Standard distributed persistence
         • fancy indexing as view and optimizations
         • streaming arrays


Improvements needed
         • Dtype improvements
           • Enumerated types (including dynamic enumeration)
           • Derived fields
           • Specification as a class (or JSON)
           • Pointer dtype (i.e. C++ object, or varchar)
           • Missing data: masks and bit-patterns
           • Parameterized field names
           • Computed fields




Improvements needed
         • Ufunc improvements
           • Generalized ufuncs support more than just
             contiguous arrays
           • Specification of ufuncs in Python
           • Move most dtype “array functions” to ufuncs
           • Unify error-handling for all computations
           • Allow lazy-evaluation and remote computation ---
             streaming and generator data
           • Structured and string dtype ufuncs
           • Multi-core and GPU optimized ufuncs
           • Group-by reduction
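
As an illustration of what a group-by reduction computes, current NumPy can approximate it with the `at` method of a ufunc. The keys and values below are made up, and this is a workaround sketch, not the ufunc feature proposed on the slide.

```python
import numpy as np

keys = np.array([0, 1, 0, 2, 1])                     # group label per element
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

sums = np.zeros(keys.max() + 1)
# Unbuffered in-place add: repeated keys accumulate instead of overwriting.
np.add.at(sums, keys, values)
```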



More Improvements needed
         • Miscellaneous improvements
           • ABI-management
           • Eventual Move to library (NDLib)?
           • NDLib could serve as base for Javascript and other
             high-level languages?
           • Integration with LLVM
           • Possible dtype / shape / stride unification into a “table
             interface”
           • Remote computation
           • Fast I/O for CSV and Excel




New Project



              NumPy
              Blaze: Out-of-Core, Distributed and Optimized NumPy


NumPy Array




                        shape




Blaze: Different kinds of Arrays

               Indexable
                   Record Type: NDTable
                       Deferred
                       Concrete
                   Primitive Type: NDArray
                       Deferred
                       Concrete




Blaze Deferred Arrays
           • Symbolic objects which build a graph
           • Represents deferred computation

                        A + B*C              +
                                            / \
                                           A   *
                                              / \
                                             B   C

            Usually what you have when you have a Blaze Array
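
A minimal sketch of how `A + B*C` can build a graph instead of computing a value. The class names `Node`, `Leaf` and `Op` are hypothetical and are not Blaze's API; plain numbers stand in for arrays.

```python
import operator

class Node:
    """Base class: arithmetic on nodes builds new graph nodes."""

    def __add__(self, other):
        return Op(operator.add, "+", self, other)

    def __mul__(self, other):
        return Op(operator.mul, "*", self, other)

class Leaf(Node):
    """A named input holding concrete data."""

    def __init__(self, name, data):
        self.name, self.data = name, data

    def evaluate(self):
        return self.data

    def __repr__(self):
        return self.name

class Op(Node):
    """An operation whose operands are other nodes."""

    def __init__(self, func, symbol, left, right):
        self.func, self.symbol = func, symbol
        self.left, self.right = left, right

    def evaluate(self):
        # Computation happens only here, by walking the graph.
        return self.func(self.left.evaluate(), self.right.evaluate())

    def __repr__(self):
        return f"({self.left!r} {self.symbol} {self.right!r})"

A, B, C = Leaf("A", 1), Leaf("B", 2), Leaf("C", 3)
expr = A + B * C          # builds the graph; nothing is computed yet
result = expr.evaluate()  # walks the graph on demand
```

Because evaluation is a separate step, a real system is free to inspect or rewrite the graph before running it.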


Deferred allows handling large arrays

             Can be handled out-of-core using chunks to stream through memory.

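
Streaming chunks through memory can be sketched with `numpy.memmap`. The helper name `chunked_sum`, the file name and the chunk size are assumptions for illustration; Blaze's own chunking machinery is not shown here.

```python
import os
import tempfile
import numpy as np

def chunked_sum(path, dtype, n_items, chunk=100_000):
    """Sum a flat on-disk array one chunk at a time (hypothetical helper)."""
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_items,))
    total = 0.0
    for start in range(0, n_items, chunk):
        # Only this slice needs to be resident in memory.
        total += float(data[start:start + chunk].sum())
    del data  # release the memory map
    return total

# Write a small example file, then stream through it.
n = 250_000
path = os.path.join(tempfile.mkdtemp(), "example.dat")
np.arange(n, dtype=np.float64).tofile(path)
result = chunked_sum(path, np.float64, n)
```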
Blaze Concrete Array

              Data Descriptor     Where are the bytes?        URLs (one or more), Indexes
              DataShape           What do the bytes mean?     Extensible Type System, which includes shape
              MetaData            Labels, provenance, etc.    Dictionary




Multiple URLs comprising an array

URLs Provide Bytes

                        Memory-Like     Arbitrarily sliced      Random Seeks
                        File-Like       Deal with in chunks     Random Seeks
                        Stream-Like     Deal with in chunks     Sequential Seeks
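
The three access patterns can be mimicked with standard-library objects. This is only an analogy: `memoryview` stands in for memory-like bytes, `io.BytesIO` for a file on disk, and the generator `stream_chunks` (a hypothetical name) for a stream.

```python
import io

payload = bytes(range(16))

# Memory-like: arbitrary (even strided) slicing.
mem = memoryview(payload)
strided = mem[2:10:2].tobytes()

# File-like: seek anywhere, then read a chunk.
f = io.BytesIO(payload)          # stand-in for a file on disk
f.seek(8)
chunk = f.read(4)

# Stream-like: chunks arrive strictly in order.
def stream_chunks(fileobj, size):
    while True:
        block = fileobj.read(size)
        if not block:
            break
        yield block

chunks = list(stream_chunks(io.BytesIO(payload), 6))
```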




Blaze Data Container
          Diagram labels: Index Operation; ByteProvider; Data Buffer; Data Descriptor Protocol
          Also shown: NumPy, BLZ (Persistent Format), RDBMS, Data Stream, CSV


Indexes

                        Contiguous / Strided    NumPy-Like
                        Chunked / Tiled         Special Access
                        Opaque Element-only     Opaque Iterator-access




Indexes allow for many orderings

DataShape Type System

                        Shape      DType

                            DataShape

            • A data description language
            • A super-set of NumPy’s dtype
            • Provides more flexibility


Allows for all kinds of containers

Advanced Types

  Parametrized Types
      type SquareMatrix T = N, N, T

  Alias Types
      type IntMatrix N = N, N, int32

      type Point = {
          x : int;
          y : int
      }

      type Space = {
          a: Point;
          b: Point
      }

      5, 10, Space
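
For comparison, `5, 10, Space` resembles a NumPy array of shape (5, 10) with a nested structured dtype. This is a rough analogy only, and it assumes the slide's `int` means a 32-bit integer, which the slide does not state.

```python
import numpy as np

# Assumption: `int` on the slide is taken to be a 32-bit integer.
Point = np.dtype([("x", np.int32), ("y", np.int32)])
Space = np.dtype([("a", Point), ("b", Point)])

arr = np.zeros((5, 10), dtype=Space)   # roughly "5, 10, Space"
arr["a"]["x"][0, 0] = 7                # nested field access writes through views
```

NumPy folds the shape into the array object and the fields into the dtype; DataShape expresses both in one description.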



Advanced Shapes

           {1,2,4,2,1}, int32

           Could Represent

           [
               [1],
               [1,2],
               [1,3,2,9],
               [3,2],
               [3]
           ]
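
One common way to hold such a ragged array is a flat value buffer plus row offsets. The sketch below uses the rows from the slide; the helper names `pack_ragged` and `get_row` are hypothetical, and this is not necessarily how Blaze stores it.

```python
def pack_ragged(rows):
    """Flatten variable-length rows into (values, offsets)."""
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))   # end of this row, start of the next
    return values, offsets

def get_row(values, offsets, i):
    """Recover row i from the flat buffer."""
    return values[offsets[i]:offsets[i + 1]]

rows = [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]
values, offsets = pack_ragged(rows)
lengths = [offsets[i + 1] - offsets[i] for i in range(len(rows))]
```

The per-row lengths recovered from the offsets match the `{1,2,4,2,1}` dimension in the datashape.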




Execution Model

               • Graphs dispatch to specialized library code
                 that is “registered with the system” based on
                 type and meta-data of array (blaze Modules)
               • Many operations can be compiled with LLVM
                 to machine-code
                 • BLIR (simple typed expression syntax)
                 • Numba (Python compiler)
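
The idea of code "registered with the system" and selected by type can be sketched as a small registry. All names here (`register`, `dispatch`, the `scale` operation, the type labels) are hypothetical; Blaze's real module mechanism is not reproduced.

```python
# Registry mapping (operation, type name) -> implementation.
_registry = {}

def register(op, dtype):
    """Decorator that records an implementation for (op, dtype)."""
    def decorator(func):
        _registry[(op, dtype)] = func
        return func
    return decorator

@register("scale", "int32")
def scale_int32(values, factor):
    return [v * factor for v in values]

@register("scale", "string")
def scale_string(values, factor):
    # For strings, "scaling" is taken to mean repetition.
    return [v * factor for v in values]

def dispatch(op, dtype, *args):
    """Look up the registered implementation and call it."""
    try:
        impl = _registry[(op, dtype)]
    except KeyError:
        raise TypeError(f"no kernel registered for {op!r} on {dtype!r}") from None
    return impl(*args)

ints = dispatch("scale", "int32", [1, 2], 3)
strs = dispatch("scale", "string", ["ab"], 2)
```

A graph node would carry the operation name and the array's type, and evaluation would route each node through a lookup of this kind.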



Blaze Agents
                        Code Graph with Blaze Arrays
                        Code / Data

                        Blaze Agent     CSV Directory
                        Blaze Agent     MongoDB
                        Blaze Agent     Vertica
                        Blaze Agent     HDFS




How?




                        “I think you should be more
                          explicit here in step two.”


Team
                        Travis Oliphant     NumPy, SciPy
                        Peter Wang          Chaco, Bokeh
                        Mark Wiebe          NumPy, DyND
                        Stephen Diehl
                        Mark Florisson      Numba
                        Francesc Alted      PyTables
                        Oscar Villellas




DARPA providing help


                        DARPA-BAA-12-38: XDATA

     TA-1: Scalable analytics and data processing technology
     TA-2: Visual user interface technology




Status




Type System = DataShape




              Best Type System this
                 side of Haskell!




BLZ persistence

                        BLZ layout at a glance

          Dataset                 Super-Chunk                 Chunk
          (Blaze (BLZ) format)    (Bloscpack (BLP) format)    (Blosc format)

          root                    Header                      Header
            meta                  Offset 0                    Offset 0
            data                  Offset 1                    Offset 1
              __0__.blp           Offset 2                    Offset 2
              __1__.blp           -----                       -----
                                  Chunk 0                     Block 0
                                  Chunk 1                     Block 1
                                  Chunk 2                     Block 2




Blaze Server

           https://wakari.io/nb/urls/raw.github.com/ContinuumIO/blaze-web/master/example/notebooks/Kiva-Tiny%20Example.ipynb


                 Computed Columns!




Out-of-core calculations




Distributed Array


                        Coming soon....




Roadmap

                • 0.1 release expected in May
                • 0.3 release at end of August
                • 1.0 Release by PyData west-coast 2014
                • For now, only get involved if you want to develop
                • Then, continue building the PyData ecosystem
                        around a scalable array.




NumFOCUS
    Num(Py) Foundation for Open Code for Usable Science




                        http://www.numfocus.org



Blaze: a large-scale, array-oriented infrastructure for Python

  • 1. Blaze: a large-scale, array- oriented infrastructure for Python Travis E. Oliphant PyData Silicon Valley 2013 Tuesday, March 19, 13
  • 2. Brief History Person Package Year Matrix Object Jim Fulton 1994 in Python Jim Hugunin Numeric 1995 Perry Greenfield, Rick White, Todd Miller Numarray 2001 Travis Oliphant NumPy 2005 Tuesday, March 19, 13
  • 3. Early pieces of SciPy fftw wrappers cephesmodule June 1998 November 1998 stats.py December 1998 Gary Strangman Tuesday, March 19, 13
  • 4. 1999 : Early SciPy emerges Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998, led to increased interest in 1999. In response on 15 Jan, 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package which eventually became SciPy Gaussian quadrature 5 Jan 1999 cephes 1.0 30 Jan 1999 sigtools 0.40 23 Feb 1999 Numeric docs March 1999 cephes 1.1 9 Mar 1999 Plotting?? multipack 0.3 13 Apr 1999 Helper routines 14 Apr 1999 Gist multipack 0.6 (leastsq, ode, fsolve, 29 Apr 1999 XPLOT quad) DISLIN sparse plan described 30 May 1999 Gnuplot multipack 0.7 14 Jun 1999 SparsePy 0.1 cephes 1.2 (vectorize) 5 Nov 1999 29 Dec 1999 Helping with f2py Tuesday, March 19, 13
  • 5. SciPy 2001: founded in 2001 with Travis Vaught. Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc. Eric Jones: weave, cluster, GA*. Pearu Peterson: linalg, interpolate, f2py
  • 6. Community effort (many, many others; forgive me!) • Chuck Harris • Pauli Virtanen • David Cournapeau • Stefan van der Walt • Jake Vanderplas • Josef Perktold • Anne Archibald • Dag Sverre Seljebotn • Robert Kern • Matthew Brett • Warren Weckesser • Ralf Gommers • Joe Harrington (documentation effort) • Andrew Straw (www.scipy.org)
  • 7. 1,000,000 to 2,000,000 users of NumPy!
  • 9. What is good about NumPy? • Array-oriented • Extensive DType System (including structures) • C-API (lots of libraries) • Simple to understand data-structure • Memory mapping • Syntax support from Python • Large community of users • Ufuncs and more • Broadcasting • Easy to interface C/C++/Fortran code
  • 10. What is wrong with NumPy • Dtype system is difficult to extend • Immediate mode creates huge temporaries (spawning Numexpr) • “Almost” an in-memory database comparable to SQLite (missing indexes) • Integration with sparse arrays • Lots of un-optimized parts • Minimal support for multi-core / GPU
  • 11. Improvements needed • NDArray improvements • Indexes (esp. for structured arrays) • SQL front-end • Multi-level, hierarchical labels • Selection via mappings (labeled arrays) • Memory spaces (array made up of regions) • Distributed arrays (global array) • Compressed arrays • Standard distributed persistence • Fancy indexing as view and optimizations • Streaming arrays
  • 12. Improvements needed • Dtype improvements • Enumerated types (including dynamic enumeration) • Derived fields • Specification as a class (or JSON) • Pointer dtype (i.e. C++ object, or varchar) • Missing data: masks and bit-patterns • Parameterized field names • Computed fields
  • 13. Improvements needed • Ufunc improvements • Generalized ufuncs support more than just contiguous arrays • Specification of ufuncs in Python • Move most dtype “array functions” to ufuncs • Unify error-handling for all computations • Allow lazy evaluation and remote computation (streaming and generator data) • Structured and string dtype ufuncs • Multi-core and GPU optimized ufuncs • Group-by reduction
  • 14. More improvements needed • Miscellaneous improvements • ABI management • Eventual move to library (NDLib)? • NDLib could serve as base for JavaScript and other high-level languages? • Integration with LLVM • Possible dtype / shape / stride unification into a “table interface” • Remote computation • Fast I/O for CSV and Excel
  • 15. New Project: NumPy Blaze. Out of Core, Distributed and Optimized NumPy
  • 16. NumPy Array shape
  • 17. Blaze: Different kinds of Arrays. Indexable: NDTable (Record Type) and NDArray (Primitive Type); each is either Deferred or Concrete
  • 18. Blaze Deferred Arrays • Symbolic objects which build a graph • Represents deferred computation • Usually what you have when you have a Blaze Array. Example graph for A + B*C: + (A, * (B, C))
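
The deferred-array idea on slide 18 can be illustrated with a minimal, pure-Python sketch. This is not Blaze's API: the `Node`, `Leaf`, and `Op` classes below are hypothetical names chosen for illustration. Arithmetic on the objects only records a graph; nothing is computed until `evaluate()` is called.

```python
from dataclasses import dataclass
import operator


class Node:
    """Hypothetical base class: arithmetic builds a graph instead of computing."""

    def __add__(self, other):
        return Op("+", operator.add, self, other)

    def __mul__(self, other):
        return Op("*", operator.mul, self, other)


@dataclass
class Leaf(Node):
    """A named array; plain Python lists stand in for real array storage."""
    name: str
    values: list

    def evaluate(self):
        return self.values

    def __str__(self):
        return self.name


@dataclass
class Op(Node):
    """An element-wise binary operation recorded as a graph node."""
    symbol: str
    func: object
    left: Node
    right: Node

    def evaluate(self):
        # Walk the graph only when a result is requested.
        lhs, rhs = self.left.evaluate(), self.right.evaluate()
        return [self.func(x, y) for x, y in zip(lhs, rhs)]

    def __str__(self):
        return f"({self.left} {self.symbol} {self.right})"


A = Leaf("A", [1, 2, 3])
B = Leaf("B", [4, 5, 6])
C = Leaf("C", [7, 8, 9])

expr = A + B * C        # builds the graph, computes nothing yet
print(expr)             # (A + (B * C))
print(expr.evaluate())  # [29, 42, 57]
```

Because the whole expression is available as a graph before anything runs, an evaluator is free to fuse operations or avoid the large temporaries that immediate-mode evaluation creates.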
  • 19. Deferred allows handling large arrays: they can be handled out-of-core using chunks to stream through memory.
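
As a rough illustration of streaming chunks through memory (slide 19), the sketch below computes a sum of element-wise products while holding only one fixed-size block of each input at a time. The helper names `chunks` and `streamed_sum_of_products` are assumptions made for this example and are not part of Blaze; the inputs can be any iterables, such as generators reading from disk.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def streamed_sum_of_products(a, b, chunk_size=4):
    """Compute sum(a[i] * b[i]) one chunk at a time."""
    total = 0
    for block_a, block_b in zip(chunks(a, chunk_size), chunks(b, chunk_size)):
        # Only these two blocks are resident in memory at this point.
        total += sum(x * y for x, y in zip(block_a, block_b))
    return total


result = streamed_sum_of_products(range(10), range(10))
print(result)  # 285
```

The chunk size trades memory use against per-chunk overhead; the result does not depend on it.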
  • 20. Blaze Concrete Array • Data Descriptor (URL, URL, URL, URL, URL; Indexes): Where are the bytes? • DataShape (Extensible Type System, which includes shape): What do the bytes mean? • MetaData (Dictionary): Labels, provenance, etc.
  • 21. Multiple URLs comprising an array
  • 22. URLs Provide Bytes • Memory-Like: arbitrarily sliced, random seeks • File-Like: deal with in chunks, random seeks • Stream-Like: deal with in chunks, sequential seeks
  • 23. Blaze Data Container Index Data Buffer Operation ByteProvider Data Descriptor Protocol NumPy BLZ RDBMS Persistent Data Stream Format CSV
  • 24. Indexes Contiguous / Strided NumPy-Like Chunked / Tiled Special Access Opaque Opaque Element-only Iterator-access
  • 25. Indexes allow for many orderings
  • 26. DataShape Type System (Shape, DType, DataShape) • A data description language • A super-set of NumPy’s dtype • Provides more flexibility
  • 27. Allows for all kinds of containers
  • 28. Advanced Types. type Point = { x : int; y : int }; type Space = { a: Point; b: Point }; 5, 10, Space. Parametrized Types: type SquareMatrix T = N, N, T. Alias Types: type IntMatrix N = N, N, int32
  • 29. Advanced Shapes: {1,2,4,2,1}, int32 could represent [ [1], [1,2], [1,3,2,9], [3,2], [3] ]
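
To make the ragged shape on slide 29 concrete, here is a small sketch that checks nested Python lists against a tuple of per-row lengths and packs them into one flat buffer plus offsets. It does not parse DataShape syntax or use any Blaze API; `matches_ragged_int32` and `flatten_with_offsets` are hypothetical helpers, and the flat-plus-offsets layout is only one possible way such data might be stored.

```python
from itertools import accumulate

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def matches_ragged_int32(row_lengths, rows):
    """Return True if rows[i] holds exactly row_lengths[i] int32-range ints."""
    if len(rows) != len(row_lengths):
        return False
    for length, row in zip(row_lengths, rows):
        if len(row) != length:
            return False
        if not all(isinstance(v, int) and INT32_MIN <= v <= INT32_MAX for v in row):
            return False
    return True


def flatten_with_offsets(rows):
    """Pack ragged rows into a flat list; row i is flat[offsets[i]:offsets[i + 1]]."""
    flat = [v for row in rows for v in row]
    offsets = [0, *accumulate(len(row) for row in rows)]
    return flat, offsets


rows = [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]
print(matches_ragged_int32((1, 2, 4, 2, 1), rows))  # True

flat, offsets = flatten_with_offsets(rows)
print(flat)     # [1, 1, 2, 1, 3, 2, 9, 3, 2, 3]
print(offsets)  # [0, 1, 3, 7, 9, 10]
```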
  • 30. Execution Model • Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (Blaze modules) • Many operations can be compiled with LLVM to machine code • BLIR (simple typed expression syntax) • Numba (Python compiler)
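
One way to picture code that is “registered with the system” and selected by type and metadata (slide 30) is a lookup table keyed on the operation, the element type, and a layout tag. The sketch below is an assumption about how such a registry could look, not Blaze's actual module mechanism; `register`, `dispatch`, and the layout tags `"any"` and `"chunked"` are all hypothetical.

```python
_REGISTRY = {}


def register(op, dtype, layout):
    """Decorator that records a kernel under (op, dtype, layout)."""
    def decorator(func):
        _REGISTRY[(op, dtype, layout)] = func
        return func
    return decorator


def dispatch(op, dtype, layout):
    """Pick the most specific kernel, falling back to the generic 'any' layout."""
    try:
        return _REGISTRY[(op, dtype, layout)]
    except KeyError:
        return _REGISTRY[(op, dtype, "any")]


@register("add", "int32", "any")
def add_generic(a, b):
    # Generic element-wise kernel over flat sequences.
    return [x + y for x, y in zip(a, b)]


@register("add", "int32", "chunked")
def add_chunked(a, b):
    # Specialized kernel: inputs are lists of chunks, processed chunk by chunk.
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


kernel = dispatch("add", "int32", "chunked")
print(kernel([[1, 2], [3]], [[10, 20], [30]]))  # [[11, 22], [33]]

fallback = dispatch("add", "int32", "strided")  # no strided kernel registered
print(fallback([1, 2, 3], [10, 20, 30]))        # [11, 22, 33]
```

In a real system the registered kernels could be library routines or code compiled on demand; the registry only decides which one a graph node is handed to.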
  • 31. Blaze Agents. Code: Blaze code graph with arrays. Data: one Blaze Agent each for a CSV directory, MongoDB, Vertica, and HDFS
  • 32. How? “I think you should be more explicit here in step two.”
  • 33. Team: Travis Oliphant (NumPy, SciPy), Stephen Diehl, Mark Florisson (Numba), Peter Wang (Chaco, Bokeh), Francesc Alted (PyTables), Mark Wiebe (NumPy, DyND), Oscar Villellas
  • 34. DARPA providing help. DARPA-BAA-12-38: XDATA • TA-1: Scalable analytics and data processing technology • TA-2: Visual user interface technology
  • 36. Type System = DataShape. Best Type System this side of Haskell!
  • 37. BLZ persistence. BLZ layout at a glance: Dataset (root, meta, data) in Blaze (BLZ) format; Super-Chunk (header; offsets 0, 1, 2; chunks 0, 1, 2; files __0__.blp, __1__.blp) in Bloscpack (BLP) format; Chunk (header; offsets 0, 1, 2; blocks 0, 1, 2) in Blosc format
  • 38. Blaze Server: https://wakari.io/nb/urls/raw.github.com/ContinuumIO/blaze-web/master/example/notebooks/Kiva-Tiny%20Example.ipynb Computed Columns!
  • 40. Distributed Array: coming soon...
  • 41. Roadmap • 0.1 release expected in May • 0.3 release at end of August • 1.0 release by PyData west-coast 2014 • Now: only get involved if you want to develop • Then: continue building PyData ecosystem around scalable array.
  • 42. NumFOCUS: Num(Py) Foundation for Open Code for Usable Science http://www.numfocus.org