SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
Structured Data Challenges in Finance and Statistics

                            Wes McKinney


                   Rice Statistics, 21 November 2011




 Wes McKinney ()            Structured data challenges   Rice Statistics   1 / 43
Me



     S.B., MIT Math ’07
     3 years in the quant finance business
     Now: starting a software company, initially to build financial data
     analysis and research systems
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! “Python for Data Analysis”, to hit the shelves later next year
     from O’Reilly Media




     Wes McKinney ()          Structured data challenges     Rice Statistics   2 / 43
Structured data



    cname              year   agefrom       ageto             ls     lsc    pop       ccode
0   Australia          1950   15            19                64.3   15.4   558       AUS
1   Australia          1950   20            24                48.4   26.4   645       AUS
2   Australia          1950   25            29                47.9   26.2   681       AUS
3   Australia          1950   30            34                44     23.8   614       AUS
4   Australia          1950   35            39                42.1   21.9   625       AUS
5   Australia          1950   40            44                38.9   20.1   555       AUS
6   Australia          1950   45            49                34     16.9   491       AUS
7   Australia          1950   50            54                29.6   14.6   439       AUS
8   Australia          1950   55            59                28     12.9   408       AUS
9   Australia          1950   60            64                26.3   12.1   356       AUS




     Wes McKinney ()             Structured data challenges                 Rice Statistics   3 / 43
Partial list of structured data necessities


    Table modification: column insertion/deletion/type changes
    Rich axis indexing, metadata
    Easy data alignment
    Aggregation and transformation by group (“group by”)
    Missing data (NA) handling
    Pivoting and reshaping
    Merging and joining
    Time series-specific manipulations
    Fast Input/Output: text files, databases, HDF5, ...




     Wes McKinney ()         Structured data challenges    Rice Statistics   4 / 43
Are existing tools good enough?




   We care nearly equally about
         Ease-of-use (syntax / API fits your mental model)
         Expressiveness
         Performance (speed and memory usage)
   Clean, consistent interface design is hard




    Wes McKinney ()           Structured data challenges    Rice Statistics   5 / 43
Auxiliary concerns




    Any tool needs to integrate well with:
         Statistical modeling tools
         Data visualization (plotting)
    Target users
         Computer scientists, statisticians, software engineers?
         Data scientists?




    Wes McKinney ()            Structured data challenges          Rice Statistics   6 / 43
Are existing tools good enough?



   The typical players
         R data.frame and friends + CRAN libraries
         SQL and other relational databases
         Python / NumPy: structured (record) arrays
         Commercial products: SAS, Stata, MS Excel...
   My conclusion: we still have a ways to go
   R has become demonstrably better in the last 5 years (e.g. via plyr,
   reshape2)




    Wes McKinney ()          Structured data challenges    Rice Statistics   7 / 43
Deeper problems in many industries




   Facilitating the research process only part of problem
   Much of academia: “Production systems?”
   Industry: a wasteland of misshapen wheels or expensive vendor
   products
   Explosive growth in data-driven production systems
   Hybrid-language systems are not always a good idea




    Wes McKinney ()         Structured data challenges      Rice Statistics   8 / 43
The big data conundrum




   Great effort being invested in the (difficult) problem of large-scale
   data processing, e.g. MapReduce-based
   Less effort in the fundamental tooling for data manipulation /
   preparation / integration
   Single-node performance does matter
   Single-node code development time matters too




    Wes McKinney ()        Structured data challenges      Rice Statistics   9 / 43
pandas: my effort in this arena



    Pick your favorite: panel data structures or Python structured data
    analysis
    Starting building April 2008 back at AQR Capital
    Open-sourced (BSD license) mid-2009
    Heavily tested, being used by many companies (inc. lots of financial
    firms) as the cornerstone of their systems
    Goal: optimal balance of ease-of-use, flexibility, and performance
    Heavy development the last 6 months




    Wes McKinney ()         Structured data challenges     Rice Statistics   10 / 43
Why did I start from scratch?



    Accusations of NIH Syndrome abound
    In 2008 I simultaneously needed
         Agile, high performance data structures
         A high productivity programming language for implementing all of the
         non-computational business logic
         A production application platform that would seamlessly integrate with
         an interactive data analysis / research platform
    In short, I was rebuilding major financial systems and I found my
    options inadequate
    Thrilling innovation opportunity!




    Wes McKinney ()           Structured data challenges       Rice Statistics   11 / 43
Why did I use Python?



   High productivity general purpose language
   Well thought-out object-oriented model
   Excellent software-development tools
   Easy for MATLAB/R users to learn
   Flexible built-in data structures (dicts, sets, lists, tuples)
   The right open-source scientific computing tools
         Powerful array processing (NumPy)
         Abundant tools for performance computing




    Wes McKinney ()           Structured data challenges       Rice Statistics   12 / 43
But, Python is not perfect




    For statistical computing, a chicken-and-egg problem
    Python’s plotting libraries are not designed for statistical graphics
    Built-in data structures are not especially optimized for my large data
    use cases
    Occasional semantic / syntactic niggles




    Wes McKinney ()           Structured data challenges      Rice Statistics   13 / 43
Partial list of structured data necessities


    Table modification: column insertion/deletion/type changes
    Rich axis indexing, metadata
    Easy data alignment
    Aggregation and transformation by group (“group by”)
    Missing data (NA) handling
    Pivoting and reshaping
    Merging and joining
    Time series-specific manipulations
    Fast Input/Output: text files, databases, HDF5, ...




     Wes McKinney ()         Structured data challenges    Rice Statistics   14 / 43
DataFrame, the pandas workhorse




   A 2D tabular data structure with row and column indexes
   Fast for row- and column-oriented operations
   Support heterogeneous columns WITHOUT sacrificing performance in
   the homogeneous (e.g. floating point only) case




    Wes McKinney ()        Structured data challenges   Rice Statistics   15 / 43
DataFrame



    cname              year   agefrom       ageto             ls     lsc    pop        ccode
0   Australia          1950   15            19                64.3   15.4   558        AUS
1   Australia          1950   20            24                48.4   26.4   645        AUS
2   Australia          1950   25            29                47.9   26.2   681        AUS
3   Australia          1950   30            34                44     23.8   614        AUS
4   Australia          1950   35            39                42.1   21.9   625        AUS
5   Australia          1950   40            44                38.9   20.1   555        AUS
6   Australia          1950   45            49                34     16.9   491        AUS
7   Australia          1950   50            54                29.6   14.6   439        AUS
8   Australia          1950   55            59                28     12.9   408        AUS
9   Australia          1950   60            64                26.3   12.1   356        AUS




     Wes McKinney ()             Structured data challenges                 Rice Statistics   16 / 43
Axis indexing and metadata



   Basic concept: labeled axes in use throughout the library
   Need to support
         Fast lookups (constant time)
         Data realignment / selection by labels (linear)
         Munging together irregularly indexed data
   Key innovation: index is a data structure itself. Different
   implementations can support more sophisticated indexing
   Axis labels can be any immutable Python object




    Wes McKinney ()            Structured data challenges   Rice Statistics   17 / 43
Irregularly indexed data
                      DataFrame
                                              Columns




                           I
                           N
                           D
                           E
                           X




    Wes McKinney ()            Structured data challenges   Rice Statistics   18 / 43
Axis indexing

                      Axis Index

                      d                        0


                      a                        1


                      b                        2


                      c                        3


                      e                        4


    Wes McKinney ()   Structured data challenges   Rice Statistics   19 / 43
Why does this matter?




   Real world data is highly irregular, especially time series
   Operations between DataFrame objects automatically align on the
   indexes
         Nearly impossible to have errors due to misaligned data
   Can vastly facilitate munging unstructured data into structured form
   Grants immense freedom in writing research code
   Time series are just a special case of a general indexed data structure




    Wes McKinney ()           Structured data challenges      Rice Statistics   20 / 43
Axis indexing

           "I have a brain!"


             year             1965 1970 1975 1980 1985
            cname     agefrom
            Australia 15       85.5 92.6 91.1 91.7 87.7
                      20       57.7 70.6 74.1 73.4 72
                      25       52.8 57.7 71.8 74   73.5
                      30       53.8 57.7 60.4 72.8 74.1
Me too!               35       46.6 52.6 59.5 63.2 72.7
                      40       47.5 52.6 56.3 61.4 62.9
                      45       41.3 48.7 55.9 61.1 61.1
                      50       41.7 48.7 53.8 60.2 60
                      55       36.3 42.1 54.1 61   59.2
                      60       37.5 42.1 48.9 61.6 59.1
                      65       30.2 30.9 48.8 60.4 59.7
                      70       30.2 30.9 41.7 60.4 57.6
                      75       30.2 30.9 41.7 62.5 57.5




      Wes McKinney ()          Structured data challenges   Rice Statistics   21 / 43
Hierarchical indexing



    Basic idea: represent high dimensional data in a lower-dimensional
    structure that is easier to reason about
    Axis index with k levels of indexing
    Slice chunks of data in constant time!
    Provides a very natural way of implementing reshaping operations
    Advantage over a truly N-dimensional object: space-efficient dense
    representation if groups are unbalanced
    Extremely useful for econometric models on panel data




    Wes McKinney ()          Structured data challenges     Rice Statistics   22 / 43
Hierarchical indexing

                       agefrom       15   20    25    30    35
                      cname     year
                      Australia 1965 85.5 57.7 52.8 53.8 46.6
                                1970 92.6 70.6 57.7 57.7 52.6
                                1975 91.1 74.1 71.8 60.4 59.5
                                1980 91.7 73.4 74      72.8 63.2
                                1985 87.7 72     73.5 74.1 72.7
                                1990 80    66.4 72     73.5 74.1
                                1995 66.3 59.9 66.4 72       73.5
                                2000 73.4 38.2 59.9 66.4 72
                                2005 82.3 44.8 38.2 59.9 66.4
                                2010 78.4 41.5 44.8 38.2 59.9
                      Austria   1965 8.1   50.9 50.9 24.2 24.2
                                1970 13.3 61.4 50.9 50.9 39.6
                                1975 23.5 64.3 61.4 56.6 49.6
                                1980 23.8 73.9 64.3 61.4 61.4
                                1985 22.4 69.2 71.1 62.9 59.9
                                1990 27.4 72     68.9 68.3 61.5
                                1995 38.9 64.2 66.9 66.6 66.7
                                2000 41.4 73.2 64.2 63.1 64.5
                                2005 42.9 93.9 75      62.9 59.2
                                2010 45.4 93.5 92.4 73.2 62.7




    Wes McKinney ()               Structured data challenges        Rice Statistics   23 / 43
Joining and merging




   Join and merge-type operations are very easy to implement with
   indexing in place
   Multi-key join same code as aligning hierarchically-indexed
   DataFrames
   Will illustrate this with examples




    Wes McKinney ()         Structured data challenges    Rice Statistics   24 / 43
Supporting size mutability



    In order to have good row-oriented performance, need to store
    like-typed columns in a single ndarray
    “Column” insertion: accumulate 1 × N × . . . homogeneous columns,
    later consolidate with other like-typed into a single block
    I.e. avoid reallocate-copy or array concatenation steps as long as
    possible
    Column deletions can be no-copy events (since ndarrays support
    views)




    Wes McKinney ()          Structured data challenges     Rice Statistics   25 / 43
DataFrame, under the hood


                Actually                                       You see
        6        2    4   5   1        3            1          2   3   4   5       6




        O             F       I       B              I         F   B   F   F      O




    Wes McKinney ()               Structured data challenges               Rice Statistics   26 / 43
Reshaping
    The fundamental operations
          stack: pivot level from columns to rows
          unstack: pivot level from rows to columns
    Completely natural and intuitive with hierarchical indexing
    No munging of column names necessary

   year                1965   1970    1975     1980      1985     1990   1995   2000    2005       2010
 cname     agefrom
 Australia 15          85.5   92.6    91.1     91.7      87.7     80     66.3   73.4    82.3       78.4
           20          57.7   70.6    74.1     73.4      72       66.4   59.9   38.2    44.8       41.5
           25          52.8   57.7    71.8     74        73.5     72     66.4   59.9    38.2       44.8
           30          53.8   57.7    60.4     72.8      74.1     73.5   72     66.4    59.9       38.2
           35          46.6   52.6    59.5     63.2      72.7     74.1   73.5   72      66.4       59.9
 Austria   15          8.1    13.3    23.5     23.8      22.4     27.4   38.9   41.4    42.9       45.4
           20          50.9   61.4    64.3     73.9      69.2     72     64.2   73.2    93.9       93.5
           25          50.9   50.9    61.4     64.3      71.1     68.9   66.9   64.2    75         92.4
           30          24.2   50.9    56.6     61.4      62.9     68.3   66.6   63.1    62.9       73.2
           35          24.2   39.6    49.6     61.4      59.9     61.5   66.7   64.5    59.2       62.7



     Wes McKinney ()                 Structured data challenges                  Rice Statistics    27 / 43
Reshaping

In [5]: df.unstack(’agefrom’).stack(’year’)
                       agefrom       15   20    25    30    35
                      cname     year
                      Australia 1965 85.5 57.7 52.8 53.8 46.6
                                1970 92.6 70.6 57.7 57.7 52.6
                                1975 91.1 74.1 71.8 60.4 59.5
                                1980 91.7 73.4 74      72.8 63.2
                                1985 87.7 72     73.5 74.1 72.7
                                1990 80    66.4 72     73.5 74.1
                                1995 66.3 59.9 66.4 72       73.5
                                2000 73.4 38.2 59.9 66.4 72
                                2005 82.3 44.8 38.2 59.9 66.4
                                2010 78.4 41.5 44.8 38.2 59.9
                      Austria   1965 8.1   50.9 50.9 24.2 24.2
                                1970 13.3 61.4 50.9 50.9 39.6
                                1975 23.5 64.3 61.4 56.6 49.6
                                1980 23.8 73.9 64.3 61.4 61.4
                                1985 22.4 69.2 71.1 62.9 59.9
                                1990 27.4 72     68.9 68.3 61.5
                                1995 38.9 64.2 66.9 66.6 66.7
                                2000 41.4 73.2 64.2 63.1 64.5
                                2005 42.9 93.9 75      62.9 59.2
                                2010 45.4 93.5 92.4 73.2 62.7



    Wes McKinney ()               Structured data challenges        Rice Statistics   28 / 43
Reshaping implementation nuances




   Must carefully deal with unbalanced group sizes / missing data
   I play vectorization tricks with the NumPy memory layout: no for
   loops!
   Care must be taken to handle heterogeneous and homogeneous data
   cases




    Wes McKinney ()        Structured data challenges    Rice Statistics   29 / 43
GroupBy




   High level process
        split data set into groups
        apply function to each group (an aggregation or a transformation)
        combine results intelligently into a result data structure
   Can be used to emulate SQL GROUP BY operations




   Wes McKinney ()           Structured data challenges      Rice Statistics   30 / 43
GroupBy



   Grouping closely related to indexing
   Create correspondence between axis labels and group labels using one
   of:
        Array of group labels (like a DataFrame column)
        Python function to be applied to each axis tick
   Can group by multiple keys
   For a hierarchically indexed axis, can select a level and group by that
   (or some transformation thereof)




   Wes McKinney ()           Structured data challenges    Rice Statistics   31 / 43
Anatomy of GroupBy




     grouped = obj.groupby([key1, key2, key3])

   This returns a GroupBy object
   Each of the keys could be any of:
        A Python function
        A vector
        A column name




   Wes McKinney ()          Structured data challenges   Rice Statistics   32 / 43
Anatomy of GroupBy




   aggregate, transform, and the more general apply supported

   group_means = grouped.agg(np.mean)
   group_means2 = grouped.mean()
   demeaned = grouped.transform(lambda x: x - x.mean())




   Wes McKinney ()       Structured data challenges   Rice Statistics   33 / 43
Anatomy of GroupBy




   The GroupBy object is also iterable

   group_means = {}
   for group_name, group in grouped:
       group_means[group_name] = grouped.mean()




   Wes McKinney ()         Structured data challenges   Rice Statistics   34 / 43
GroupBy and hierarchical indexing

   Hierarchical indexing came about as the natural result of a multi-key
   aggregation:
         >>> group_means = df.groupby(['country', 'agefrom']).mean()
         >>> group_means[['ls', 'lsc', 'pop']].unstack('country')

                      ls                    lsc                      pop
            country   Australia   Austria   Australia      Austria   Australia   Austria
         agefrom
         15           70.03       31.1      26.17          14.67     6163        3310
         20           58.02       59.98     34.51          45.83     1113        531
         25           57.02       45.5      33.02          32.73     5021        2791
         30           59.16       46.56     35.29          33.87     1082        527
         35           59.58       43.29     34.53          30.3      1053        528.8
         40           58.8        40.98     33.92          27.88     1005        522.5
         45           56.79       39.19     31.71          26        927.2       503.5
         50           54.71       37.14     30.37          23.94     836.6       475.2
         55           51.93       34.82     26.98          22.11     735.8       443.2
         60           49.92       32.35     25.85          20.2      632.6       408.8
         65           47.02       29.59     22.98          18.44     522.5       361.9
         70           46.27       28.52     22.53          17.72     410.5       295.8
         75           46.28       28.06     23.41          18.42     624.5       437.6




    Wes McKinney ()                 Structured data challenges                   Rice Statistics   35 / 43
What makes GroupBy hard?




   factor-izing the group labels is very expensive
   Function call overhead on small groups
   To sort or not to sort?
        Cheaper than computing the group labels!
   Munging together results in exceptional cases is tricky




   Wes McKinney ()          Structured data challenges       Rice Statistics   36 / 43
Time series operations




    Fixed frequency indexing (DateRange)
    Domain-specific time offsets (like business day, business month end)
    Frequency conversion
    Forward-filling / back-filling / interpolation
    Leading/lagging
    In the works (later this year), better/faster upsampling/downsampling




    Wes McKinney ()          Structured data challenges   Rice Statistics   37 / 43
Shoot-out with R’s time series objects

“Inner join” addition between two irregular 500K-length time series
sampled from 1 million-length time series. Timestamps are POSIXct (R) /
int64 (Python)

                               package             timing             factor
                                pandas           21.5 ms                 1.0
                                    xts          41.3 ms                1.92
                                    fts           370 ms                17.2
                                    its          1117 ms              51.95
                                   zoo           3479 ms              161.8


                         Where there is smoke, there is fire?
                   Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2




     Wes McKinney ()                     Structured data challenges                          Rice Statistics   38 / 43
Erm, inner join?




                      Intersecting time stamps loses information

    Wes McKinney ()                Structured data challenges      Rice Statistics   39 / 43
My (mild) performance obsession




    Wes McKinney ()   Structured data challenges   Rice Statistics   40 / 43
My (mild) performance obsession




   Good performance stems from many factors
         Well-designed algorithms (from a complexity standpoint)
         Minimizing copying of data
         Minimizing interpreter / function call overhead
         Taking advantage of memory layout
   I value functionality over performance, but I do spend a lot of time
   profiling code and grokking performance characteristics




    Wes McKinney ()           Structured data challenges      Rice Statistics   41 / 43
Demo, time permitting




Wes McKinney ()         Structured data challenges   Rice Statistics   42 / 43

Contenu connexe

Tendances

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using pythonPurna Chander
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Toolsboorad
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandasmaikroeder
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsEdwin de Jonge
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Alexander Hendorf
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Alexander Hendorf
 
Pandas data transformational data structure patterns and challenges final
Pandas   data transformational data structure patterns and challenges  finalPandas   data transformational data structure patterns and challenges  final
Pandas data transformational data structure patterns and challenges finalRajesh M
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream miningAlbert Bifet
 
Python and Data Analysis
Python and Data AnalysisPython and Data Analysis
Python and Data AnalysisPraveen Nair
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 Albert Bifet
 
Osss manual-5-extract data
Osss manual-5-extract dataOsss manual-5-extract data
Osss manual-5-extract datacwarner7_11
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data AggregationZbigniew Jerzak
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environmentizahn
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data scienceLong Nguyen
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsJason Riedy
 

Tendances (20)

SciPy 2010 Review
SciPy 2010 ReviewSciPy 2010 Review
SciPy 2010 Review
 
Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Tools
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
 
Pandas data transformational data structure patterns and challenges final
Pandas   data transformational data structure patterns and challenges  finalPandas   data transformational data structure patterns and challenges  final
Pandas data transformational data structure patterns and challenges final
 
Pandas
PandasPandas
Pandas
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Python and Data Analysis
Python and Data AnalysisPython and Data Analysis
Python and Data Analysis
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Osss manual-5-extract data
Osss manual-5-extract dataOsss manual-5-extract data
Osss manual-5-extract data
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data Aggregation
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data science
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
 

Similaire à Structured Data Challenges in Finance and Statistics

Documenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excelDocumenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excelSandra Nicks
 
Week 1 Lab Directions
Week 1 Lab DirectionsWeek 1 Lab Directions
Week 1 Lab Directionsoudesign
 
Practical Data Visualization
Practical Data VisualizationPractical Data Visualization
Practical Data VisualizationAngela Zoss
 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...Craig Knoblock
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesasnaparveen414
 
2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptxJavier Daza
 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the WebBeyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the WebStefan Dietze
 
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik   statistik deskriptif 1 grafikEe184405 statistika dan stokastik   statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik statistik deskriptif 1 grafikyusufbf
 
Eyeriss Introduction
Eyeriss IntroductionEyeriss Introduction
Eyeriss IntroductionMichael Lee
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdfpaijitk
 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008Peiling Wang
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebStefan Dietze
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningAbcdDcba12
 
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docxSTATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docxrafaelaj1
 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic conceptsPritiRishi
 
Qda ces 2013 toronto workshop
Qda ces 2013 toronto workshopQda ces 2013 toronto workshop
Qda ces 2013 toronto workshopCesToronto
 
Data Science, Data & Dashboards Design
Data Science, Data & Dashboards DesignData Science, Data & Dashboards Design
Data Science, Data & Dashboards DesignKoo Ping Shung
 

Similaire à Structured Data Challenges in Finance and Statistics (20)

Documenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excelDocumenting sacs compliance with microsoft excel
Documenting sacs compliance with microsoft excel
 
Week 1 Lab Directions
Week 1 Lab DirectionsWeek 1 Lab Directions
Week 1 Lab Directions
 
Practical Data Visualization
Practical Data VisualizationPractical Data Visualization
Practical Data Visualization
 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
 
INT 1010 07-6.pdf
INT 1010 07-6.pdfINT 1010 07-6.pdf
INT 1010 07-6.pdf
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx2. DATABASE MODELING_Database Fundamentals.pptx
2. DATABASE MODELING_Database Fundamentals.pptx
 
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the WebBeyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
Beyond Linked Data - Exploiting Entity-Centric Knowledge on the Web
 
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik   statistik deskriptif 1 grafikEe184405 statistika dan stokastik   statistik deskriptif 1 grafik
Ee184405 statistika dan stokastik statistik deskriptif 1 grafik
 
Sql xp 05
Sql xp 05Sql xp 05
Sql xp 05
 
Eyeriss Introduction
Eyeriss IntroductionEyeriss Introduction
Eyeriss Introduction
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
 
CIKM Tutorial 2008
CIKM Tutorial 2008CIKM Tutorial 2008
CIKM Tutorial 2008
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
 
unit 1 DATA MINING.ppt
unit 1 DATA MINING.pptunit 1 DATA MINING.ppt
unit 1 DATA MINING.ppt
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docxSTATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
STATISTICSINFORMED DECISIONS USING DATAFifth EditionChapte.docx
 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic concepts
 
Qda ces 2013 toronto workshop
Qda ces 2013 toronto workshopQda ces 2013 toronto workshop
Qda ces 2013 toronto workshop
 
Data Science, Data & Dashboards Design
Data Science, Data & Dashboards DesignData Science, Data & Dashboards Design
Data Science, Data & Dashboards Design
 

Plus de Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

Plus de Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Dernier

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Dernier (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Structured Data Challenges in Finance and Statistics

  • 1. Structured Data Challenges in Finance and Statistics Wes McKinney Rice Statistics, 21 November 2011 Wes McKinney () Structured data challenges Rice Statistics 1 / 43
  • 2. Me S.B., MIT Math ’07 3 years in the quant finance business Now: starting a software company, initially to build financial data analysis and research systems My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Media Wes McKinney () Structured data challenges Rice Statistics 2 / 43
  • 3. Structured data cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney () Structured data challenges Rice Statistics 3 / 43
  • 4. Partial list of structured data necessities Table modification: column insertion/deletion/type changes Rich axis indexing, metadata Easy data alignment Aggregation and transformation by group (“group by”) Missing data (NA) handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast Input/Output: text files, databases, HDF5, ... Wes McKinney () Structured data challenges Rice Statistics 4 / 43
  • 5. Are existing tools good enough? We care nearly equally about Ease-of-use (syntax / API fits your mental model) Expressiveness Performance (speed and memory usage) Clean, consistent interface design is hard Wes McKinney () Structured data challenges Rice Statistics 5 / 43
  • 6. Auxiliary concerns Any tool needs to integrate well with: Statistical modeling tools Data visualization (plotting) Target users Computer scientists, statisticians, software engineers? Data scientists? Wes McKinney () Structured data challenges Rice Statistics 6 / 43
  • 7. Are existing tools good enough? The typical players R data.frame and friends + CRAN libraries SQL and other relational databases Python / NumPy: structured (record) arrays Commercial products: SAS, Stata, MS Excel... My conclusion: we still have a ways to go R has become demonstrably better in the last 5 years (e.g. via plyr, reshape2) Wes McKinney () Structured data challenges Rice Statistics 7 / 43
  • 8. Deeper problems in many industries Facilitating the research process only part of problem Much of academia: “Production systems?” Industry: a wasteland of misshapen wheels or expensive vendor products Explosive growth in data-driven production systems Hybrid-language systems are not always a good idea Wes McKinney () Structured data challenges Rice Statistics 8 / 43
  • 9. The big data conundrum Great effort being invested in the (difficult) problem of large-scale data processing, e.g. MapReduce-based Less effort in the fundamental tooling for data manipulation / preparation / integration Single-node performance does matter Single-node code development time matters too Wes McKinney () Structured data challenges Rice Statistics 9 / 43
  • 10. pandas: my effort in this arena Pick your favorite: panel data structures or Python structured data analysis Starting building April 2008 back at AQR Capital Open-sourced (BSD license) mid-2009 Heavily tested, being used by many companies (inc. lots of financial firms) as the cornerstone of their systems Goal: optimal balance of ease-of-use, flexibility, and performance Heavy development the last 6 months Wes McKinney () Structured data challenges Rice Statistics 10 / 43
  • 11. Why did I start from scratch? Accusations of NIH Syndrome abound In 2008 I simultaneously needed Agile, high performance data structures A high productivity programming language for implementing all of the non-computational business logic A production application platform that would seamlessly integrate with an interactive data analysis / research platform In short, I was rebuilding major financial systems and I found my options inadequate Thrilling innovation opportunity! Wes McKinney () Structured data challenges Rice Statistics 11 / 43
  • 12. Why did I use Python? High productivity general purpose language Well thought-out object-oriented model Excellent software-development tools Easy for MATLAB/R users to learn Flexible built-in data structures (dicts, sets, lists, tuples) The right open-source scientific computing tools Powerful array processing (NumPy) Abundant tools for performance computing Wes McKinney () Structured data challenges Rice Statistics 12 / 43
  • 13. But, Python is not perfect For statistical computing, a chicken-and-egg problem Python’s plotting libraries are not designed for statistical graphics Built-in data structures are not especially optimized for my large data use cases Occasional semantic / syntactic niggles Wes McKinney () Structured data challenges Rice Statistics 13 / 43
  • 14. Partial list of structured data necessities Table modification: column insertion/deletion/type changes Rich axis indexing, metadata Easy data alignment Aggregation and transformation by group (“group by”) Missing data (NA) handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast Input/Output: text files, databases, HDF5, ... Wes McKinney () Structured data challenges Rice Statistics 14 / 43
  • 15. DataFrame, the pandas workhorse A 2D tabular data structure with row and column indexes Fast for row- and column-oriented operations Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case Wes McKinney () Structured data challenges Rice Statistics 15 / 43
  • 16. DataFrame cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney () Structured data challenges Rice Statistics 16 / 43
  • 17. Axis indexing and metadata Basic concept: labeled axes in use throughout the library Need to support Fast lookups (constant time) Data realignment / selection by labels (linear) Munging together irregularly indexed data Key innovation: index is a data structure itself. Different implementations can support more sophisticated indexing Axis labels can be any immutable Python object Wes McKinney () Structured data challenges Rice Statistics 17 / 43
  • 18. Irregularly indexed data DataFrame Columns I N D E X Wes McKinney () Structured data challenges Rice Statistics 18 / 43
  • 19. Axis indexing Axis Index d 0 a 1 b 2 c 3 e 4 Wes McKinney () Structured data challenges Rice Statistics 19 / 43
  • 20. Why does this matter? Real world data is highly irregular, especially time series Operations between DataFrame objects automatically align on the indexes Nearly impossible to have errors due to misaligned data Can vastly facilitate munging unstructured data into structured form Grants immense freedom in writing research code Time series are just a special case of a general indexed data structure Wes McKinney () Structured data challenges Rice Statistics 20 / 43
  • 21. Axis indexing "I have a brain!" year 1965 1970 1975 1980 1985 cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 20 57.7 70.6 74.1 73.4 72 25 52.8 57.7 71.8 74 73.5 30 53.8 57.7 60.4 72.8 74.1 Me too! 35 46.6 52.6 59.5 63.2 72.7 40 47.5 52.6 56.3 61.4 62.9 45 41.3 48.7 55.9 61.1 61.1 50 41.7 48.7 53.8 60.2 60 55 36.3 42.1 54.1 61 59.2 60 37.5 42.1 48.9 61.6 59.1 65 30.2 30.9 48.8 60.4 59.7 70 30.2 30.9 41.7 60.4 57.6 75 30.2 30.9 41.7 62.5 57.5 Wes McKinney () Structured data challenges Rice Statistics 21 / 43
  • 22. Hierarchical indexing Basic idea: represent high dimensional data in a lower-dimensional structure that is easier to reason about Axis index with k levels of indexing Slice chunks of data in constant time! Provides a very natural way of implementing reshaping operations Advantage over a truly N-dimensional object: space-efficient dense representation if groups are unbalanced Extremely useful for econometric models on panel data Wes McKinney () Structured data challenges Rice Statistics 22 / 43
  • 23. Hierarchical indexing agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9 Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 23 / 43
  • 24. Joining and merging Join and merge-type operations are very easy to implement with indexing in place Multi-key join same code as aligning hierarchically-indexed DataFrames Will illustrate this with examples Wes McKinney () Structured data challenges Rice Statistics 24 / 43
  • 25. Supporting size mutability In order to have good row-oriented performance, need to store like-typed columns in a single ndarray “Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed into a single block I.e. avoid reallocate-copy or array concatenation steps as long as possible Column deletions can be no-copy events (since ndarrays support views) Wes McKinney () Structured data challenges Rice Statistics 25 / 43
  • 26. DataFrame, under the hood Actually You see 6 2 4 5 1 3 1 2 3 4 5 6 O F I B I F B F F O Wes McKinney () Structured data challenges Rice Statistics 26 / 43
  • 27. Reshaping The fundamental operations stack: pivot level from columns to rows unstack: pivot level from rows to columns Completely natural and intuitive with hierarchical indexing No munging of column names necessary year 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 80 66.3 73.4 82.3 78.4 20 57.7 70.6 74.1 73.4 72 66.4 59.9 38.2 44.8 41.5 25 52.8 57.7 71.8 74 73.5 72 66.4 59.9 38.2 44.8 30 53.8 57.7 60.4 72.8 74.1 73.5 72 66.4 59.9 38.2 35 46.6 52.6 59.5 63.2 72.7 74.1 73.5 72 66.4 59.9 Austria 15 8.1 13.3 23.5 23.8 22.4 27.4 38.9 41.4 42.9 45.4 20 50.9 61.4 64.3 73.9 69.2 72 64.2 73.2 93.9 93.5 25 50.9 50.9 61.4 64.3 71.1 68.9 66.9 64.2 75 92.4 30 24.2 50.9 56.6 61.4 62.9 68.3 66.6 63.1 62.9 73.2 35 24.2 39.6 49.6 61.4 59.9 61.5 66.7 64.5 59.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 27 / 43
  • 28. Reshaping In [5]: df.unstack(’agefrom’).stack(’year’) agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9 Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7 Wes McKinney () Structured data challenges Rice Statistics 28 / 43
  • 29. Reshaping implementation nuances Must carefully deal with unbalanced group sizes / missing data I play vectorization tricks with the NumPy memory layout: no for loops! Care must be taken to handle heterogeneous and homogeneous data cases Wes McKinney () Structured data challenges Rice Statistics 29 / 43
  • 30. GroupBy High level process split data set into groups apply function to each group (an aggregation or a transformation) combine results intelligently into a result data structure Can be used to emulate SQL GROUP BY operations Wes McKinney () Structured data challenges Rice Statistics 30 / 43
  • 31. GroupBy Grouping closely related to indexing Create correspondence between axis labels and group labels using one of: Array of group labels (like a DataFrame column) Python function to be applied to each axis tick Can group by multiple keys For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof) Wes McKinney () Structured data challenges Rice Statistics 31 / 43
  • 32. Anatomy of GroupBy grouped = obj.groupby([key1, key2, key3]) This returns a GroupBy object Each of the keys could be any of: A Python function A vector A column name Wes McKinney () Structured data challenges Rice Statistics 32 / 43
  • 33. Anatomy of GroupBy aggregate, transform, and the more general apply supported group_means = grouped.agg(np.mean) group_means2 = grouped.mean() demeaned = grouped.transform(lambda x: x - x.mean()) Wes McKinney () Structured data challenges Rice Statistics 33 / 43
  • 34. Anatomy of GroupBy The GroupBy object is also iterable group_means = {} for group_name, group in grouped: group_means[group_name] = grouped.mean() Wes McKinney () Structured data challenges Rice Statistics 34 / 43
  • 35. GroupBy and hierarchical indexing Hierarchical indexing came about as the natural result of a multi-key aggregation: >>> group_means = df.groupby(['country', 'agefrom']).mean() >>> group_means[['ls', 'lsc', 'pop']].unstack('country') ls lsc pop country Australia Austria Australia Austria Australia Austria agefrom 15 70.03 31.1 26.17 14.67 6163 3310 20 58.02 59.98 34.51 45.83 1113 531 25 57.02 45.5 33.02 32.73 5021 2791 30 59.16 46.56 35.29 33.87 1082 527 35 59.58 43.29 34.53 30.3 1053 528.8 40 58.8 40.98 33.92 27.88 1005 522.5 45 56.79 39.19 31.71 26 927.2 503.5 50 54.71 37.14 30.37 23.94 836.6 475.2 55 51.93 34.82 26.98 22.11 735.8 443.2 60 49.92 32.35 25.85 20.2 632.6 408.8 65 47.02 29.59 22.98 18.44 522.5 361.9 70 46.27 28.52 22.53 17.72 410.5 295.8 75 46.28 28.06 23.41 18.42 624.5 437.6 Wes McKinney () Structured data challenges Rice Statistics 35 / 43
  • 36. What makes GroupBy hard? factor-izing the group labels is very expensive Function call overhead on small groups To sort or not to sort? Cheaper than computing the group labels! Munging together results in exceptional cases is tricky Wes McKinney () Structured data challenges Rice Statistics 36 / 43
  • 37. Time series operations Fixed frequency indexing (DateRange) Domain-specific time offsets (like business day, business month end) Frequency conversion Forward-filling / back-filling / interpolation Leading/lagging In the works (later this year), better/faster upsampling/downsampling Wes McKinney () Structured data challenges Rice Statistics 37 / 43
  • 38. Shoot-out with R’s time series objects “Inner join” addition between two irregular 500K-length time series sampled from 1 million-length time series. Timestamps are POSIXct (R) / int64 (Python) package timing factor pandas 21.5 ms 1.0 xts 41.3 ms 1.92 fts 370 ms 17.2 its 1117 ms 51.95 zoo 3479 ms 161.8 Where there is smoke, there is fire? Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2 Wes McKinney () Structured data challenges Rice Statistics 38 / 43
  • 39. Erm, inner join? Intersecting time stamps loses information Wes McKinney () Structured data challenges Rice Statistics 39 / 43
  • 40. My (mild) performance obsession Wes McKinney () Structured data challenges Rice Statistics 40 / 43
  • 41. My (mild) performance obsession Good performance stems from many factors Well-designed algorithms (from a complexity standpoint) Minimizing copying of data Minimizing interpreter / function call overhead Taking advantage of memory layout I value functionality over performance, but I do spend a lot of time profiling code and grokking performance characteristics Wes McKinney () Structured data challenges Rice Statistics 41 / 43
  • 42. Demo, time permitting Wes McKinney () Structured data challenges Rice Statistics 42 / 43