SlideShare une entreprise Scribd logo
1  sur  63
Télécharger pour lire hors ligne
MAD Skills: New
                          Analysis Practices for
                                Big Data
                                   Presented By:
                                  Christan Grant
                                cgrant@cise.ufl.edu




Thursday, August 25, 11                              1
MAD Skills: New Analysis
                               Practices for Big Data
                     •    Authors

                          •   Jeff Cohen – Greenplum

                          •   Brian Dolan – Fox Audience Network

                          •   Mark Dunlap – Evergreen Technologies

                          •   Joseph M Hellerstein – UC Berkeley

                          •   Caleb Welton – Greenplum

                     •    Presented at Very Large Database Conference 2009 in Lyon, France




Thursday, August 25, 11                                                                      2
If you are looking for a career where your services will be
                          in high demand, you should find something where you
                          provide a scarce, complementary service to something
                          that is getting ubiquitous and cheap.

                          So what’s getting ubiquitous and cheap? Data.

                          And what is complementary to data? Analysis.

                          – Prof. Hal Varian, Chief Economist at Google




Thursday, August 25, 11                                                                 3
• Storage is Cheap
                      • The worlds largest data warehouse of
                          ~15 years ago can be stored on disks for
                          about $2000.




Thursday, August 25, 11                                              4
• Traditionally, a lot of time is spent building
                          well structured data warehouses.

                          • Contains summaries of data
                          • Main analytics location
                          • Jealously guarded by IT

Thursday, August 25, 11                                                 5
• Data analysis is common culture
                       • Culture is to collect and analyze data in
                          difference business units.




Thursday, August 25, 11                                              6
• Data analysis is common culture
                       • Culture is to collect and analyze data in
                          difference business units.

                     • “In God we trust; all others must bring data”




Thursday, August 25, 11                                                6
• Data analysis is common culture
                       • Culture is to collect and analyze data in
                            difference business units.

                     • “In God we trust; all others must bring data”
                     • Analytics are crucial – Analysts need more
                          access and more permissions

                          • Hence M.A.D.
Thursday, August 25, 11                                                6
• Magnetic – Attract data and practitioners.
                          The Database must be painless and efficient.

                     • Agile – Analysts need to ingest and analyze
                          on the fly.

                     • Deep – Sophisticated analytics at scale.
                          Don’t force analyst to work on samples
                          (where the long tail is lost).

                     • Library of tools
Thursday, August 25, 11                                                 7
Outline
                     •    Background
                     •    FOX Audience Network
                     •    Being MAD
                     •    In-Database Operations
                     •    MAD DBMS
                     •    Conclusions
                     •    Question


Thursday, August 25, 11                            8
Background
                     • OLAP and Data Cubes
                          • Data structure to
                             provides descriptive
                             statistics – Simple
                             summaries (sum,
                             average, std)

                          • We want inferential/
                             inductive statistics –
                             Predictions, causality
                             analysis, distributional
                             comparison



Thursday, August 25, 11                                 9
Background
                     •    Databases and Statistical Packages
                          •   Many analysts download data to use in Excel/SAS/
                              Matlab/R or their favorite programming language?
                              FORTRAN??


                          •   Use matrix/vector operations
                          •   Most of these stat packages require data to fit in RAM
                              •   Taking samples from the full data to fit into ram
                                  results in loss of precision
                          •   External toolkits may also lack parallelism



Thursday, August 25, 11                                                               10
Background
                     • MapReduce and Parallel Programming
                      • Strengths of MR are scalable batch
                            parallelism and fault tolerance.
                          • There is work on putting ML algorithms
                            in MapReduce Framework (Mahout)
                          • MR is just a programming framework and
                            works well with MAD


Thursday, August 25, 11                                              11
Background
                     • Data Mining and Analytics in the DB
                      • Plenty of work in this area
                      • Methods are typically implemented as a
                            black box
                          • Algorithms are not flexible as in
                            programmed packages



Thursday, August 25, 11                                          12
FOX Audience Network
                     • Served MySpace.com, IGN.com, Scout.com
                          etc.
                     • About 150 Million users
                     • Ad network, bought by Rubicon in Nov
                          2010




Thursday, August 25, 11                                         13
Fox Audience Network
                     •    42 Nodes (2 Masters, 40 Workers)
                     •    Sun X4500 Thumper
                     •    48 500GB drives
                     •    16GB RAM
                     •    ~ 5TB Daily
                     •    1 Table 1.5 Trillion Rows
                     •    Many different types of users workloads
                     •    Dynamic query ecosystem


Thursday, August 25, 11                                             14
Fox Audience Network
                     •    Use Cases
                     •    Queries to the system are vastly different
                     •    Example queries:
                          •   How many female WWF enthusiasts under the age of 30 visited
                              the Toyota community over the last four days and saw a medium
                              rectangle?
                              •   Expensive
                          •   How are these people similar to those that visited Nissan?
                              •   Open-ended, requires some statistics and the analyst to
                                  be in the loop.



Thursday, August 25, 11                                                                       15
Being MAD
                     • Analysts (Data Scientists) trump DBAs
                      • They crave data
                      • They can work with dirty data
                      • They like all data (not samples)
                      • Have a belt of tools


Thursday, August 25, 11                                        16
Being MAD

                     • Get data to the warehouse as soon as
                          possible
                     • Cleaning and integration should be done
                          intelligently




Thursday, August 25, 11                                          17
Being MAD
                     • Three logical layers for data stores:
                      • Reporting – Specialized static aggregates
                      • Production – Aggregates for most users
                      • Stages – Raw tables and logs
                      • Sandboxes – Play ground for analysts
                     • The logical levels encourages creativity
Thursday, August 25, 11                                             18
Being MAD

                     • Data Scientists need power tools Math +
                          Stats + ML
                     • The MADLib approach is to develop a
                          hierarchy of mathematical concepts in the
                          database.




Thursday, August 25, 11                                               19
In-Database Operations
                     • User Defined Functions (UDF)
                       • SQL or PL/pgSQL
                       • Internal (c)
                       • Dynamically Loaded (c, c++, java, javascript,
                            perl, python, R, ruby, scheme, etc)
                          • SELECT f1(city) FROM states
                            WHERE f2(capital)


Thursday, August 25, 11                                                  20
In-Database Operations
                     •    Value abstraction levels

                     •    1.0                         Scalar

                     •    [0.3, 0.3, 0.4]            Vector

                     •    [[0.3, 0.7], [0.6, 0.4]]   Matrix
                                                     Function
                     •    f( • ) → value




Thursday, August 25, 11                                         21
In-Database Operations


                          Scalar Arithmetic



Thursday, August 25, 11                       22
In-Database Operations

                     • Scalar Arithmetic
                      • SELECT 5*4;
                      • SELECT sqrt(64);
                      • SELECT cos(-3.14159 * sqrt(2) / 2 );

Thursday, August 25, 11                                        23
In-Database Operations


                          Vector Arithmetic



Thursday, August 25, 11                       24
In-Database Operations

                     • Vector
                      • array objects: float8[]
                      • array operations are implemented as
                          UDFs.


                                                  technically violates 1st normal form



Thursday, August 25, 11                                                                  25
In-Database Operations
                     • Vector




                                technically violates 1st normal form



Thursday, August 25, 11                                                26
In-Database Operations


                          Matrix Arithmetic



Thursday, August 25, 11                       27
In-Database Operations

                • Matrix Representation #1
                          CREATE TABLE B (
                               row_number integer,
                               vector numeric[]
                          );


   http://www.postgresql.org/docs/current/static/arrays.html

Thursday, August 25, 11                                        28
In-Database Operations

                • Matrix addition:
                          SELECT A. row_number, A.vector + B.vector
                          FROM A, B
                          WHERE A.row_number = B.row_numbner



       This can be easily implemented as a primitive
       operator:
       SELECT A ** B from A ,B
Thursday, August 25, 11                                               29
In-Database Operations
                • Matrix and vector multiplication: Av

                          SELECT 1, array_accum(row_number, vector*v)
                          FROM A


                          array_accum(x,v) is a custom function. It takes the
                          value v and puts it in the row indexed by x.


Thursday, August 25, 11                                                         30
In-Database Operations
                •         Matrix Representation #2
                          CREATE TABLE A (
                               row_number INTEGER,
                               column_number INTEGER,
                               value DECIMAL
                          );
                • Great sparse representation!
   http://www.postgresql.org/docs/current/static/arrays.html

Thursday, August 25, 11                                        31
In-Database Operations
                • Matrix Multiplication (using rep #2):
                          SELECT A. row_number, B.column_number, SUM(A.value * B.value)

                          FROM A, B

                          WHERE A.column_number = B.row_number

                          GROUP BY A.row_number, B.column_number




Thursday, August 25, 11                                                                   32
In-Database Operations
                • These operations require one pass over the
                          data and can be done in parallel.
                • What about Matrix Division, and Matrix
                          Inverse?
                • SQL is awkward bc of it’s lack of loops for
                          iteration.




Thursday, August 25, 11                                         33
In-Database Operations
                     • Document similarity for fraud detection
                       • Check to see if several advertisers are
                            linking to pages to the same content.
                          • If several advertisers have out links to
                            similar documents it is possible they are
                            using stolen credit card.
                          • Especially if the outbound documents are
                            malicious.


Thursday, August 25, 11                                                 34
In-Database Operations
                • For document similarity we use tf-idf. This is a
                          score for a term in a given document.
                          For a term:
                             tf-idft,d = tft,d * idft
                               tft,d = term count in doc
                               idft = log { |docs| / |docs w. term|}



Thursday, August 25, 11                                                35
In-Database Operations
                 •        Table docs contains triples (document, term, count)

                 • tf UDF in SQL takes a term and document as parameters
                          SELECT count(term)
                          FROM docs
                          WHERE term = t AND document = d



                 • idf UDF in SQL takes a term as a parameter
                          SELECT log (
                                  (SELECT count(document) FROM docs ) /
                                  (SELECT count(document) FROM docs WHERE term = t GROUP BY document))
                            FROM docs




Thursday, August 25, 11                                                                                  36
In-Database Operations
                •         Each doc will have a vector with the tf-idf scores,
                          we can compare the results using cosine similarity.
                          cos θ = v * u / (||v|| * ||u||)
                          small θ means v and u are similar


                • SQL is straight forward using the vector
                          operators



Thursday, August 25, 11                                                         37
In-Database Operations
                •         Ordinary Least Squares is used to fit a line to a collection of points.
                •         This helps to describe seasonal trends - its a first look at the data.
                          Solve Y = Xβ where X is an n × k Matrix.
                          Estimate β using β* = (XTX)-1XTy
                          We can Parallellize using computation
                                Distribute:    A ← XTX, b ← XTy
                                Collect:      A and b.
                                Calculate:    A-1
                                Multiply:     A-1 * b




Thursday, August 25, 11                                                                            38
In-Database Operations


                          Functional Arithmetic



Thursday, August 25, 11                           39
In-Database Operations
                     • A/B Testing (Multivariate testing)
                      • Web companies show people different
                            versions of their website to to see which
                            one is the best.
                          • Ad companies compare different add
                            campaigns and may want to find the one
                            with the higher clicks-through rate.



Thursday, August 25, 11                                                 40
In-Database Operations
                     •    Mann-Whitney U Test
                          •   This is a popular test to compare non-
                              parametric data.
                          •   non-parametric def= data set that does not fit
                              in a well known distribution
                          •   Combine data values and score the values, find
                              the average location of the list for each value.
                          •   Z = U – (n1n2/2) / √(n1n2(n1 + n2 +1)/12)


Thursday, August 25, 11                                                          41
In-Database Operations
                •         Mann-Whitney U Test
                           •   Given Table T with (sample_id, value)
                          CREATE VIEW R AS
                          SELECT sample_id, avg(value) AS sample_avg
                             sum(rown) AS rank_sum, count(*) AS sample_n,
                             sum(rown) - count(*) * count(*) + 1 AS sample_us
                           FROM ( SELECT sample_id,
                             row_number() OVER (ORDER BY value DESC) AS
                             rown, value FROM T) AS ordered
                           GROUP BY sample_id



Thursday, August 25, 11                                                         42
In-Database Operations
                •         Mann-Whitney U Test
                          •   With large sample count we can get the z_score
                          SELECT r.sample_u, r.sample_avg, r.sample_n,
                              (r.sample_u – a.sum_u / 2) / sqrt(a.sum_u *
                              (a.sum_n + 1) / 12) AS z_score
                          FROM R AS r, (SELECT sum(sample_u) AS
                          sum_u, sum(sample_n) AS sum_n FROM R) AS a
                          GROUP BY r.sample_u, r.sample_avg, r.sample_n,
                          a.sum_n, a.sum_u


Thursday, August 25, 11                                                        43
In-Database Operations
                •         Mann-Whitney U Test
                          •   With large sample count we can get the z_score
                          SELECT r.sample_u, r.sample_avg, r.sample_n,
                              (r.sample_u – a.sum_u / 2) / sqrt(a.sum_u *
                              (a.sum_n + 1) / 12) AS z_score
                          FROM R AS r, (SELECT sum(sample_u) AS
                          sum_u, sum(sample_n) AS sum_n FROM R) AS a
                          GROUP BY r.sample_u, r.sample_avg, r.sample_n,
                          a.sum_n, a.sum_u
 This can be easily implemented as a stored procedure:
 SELECT man_whitney(value) FROM table
Thursday, August 25, 11                                                        43
In-Database Operations
                     • Resampling
                       • Sampling several subsets instead of the
                            whole population.
                          • Bootstrapping       def=
                                                  Sampling to buckets
                            (w/ replacement) then combine buckets
                          • Jackknifing   def=
                                           Leave some data out and
                            sample; then combine runs



Thursday, August 25, 11                                                 44
In-Database Operations
                     •    Bootstrapping
                          •   Pre-select which trials contain which rows
                          •    Suppose we have population N = 100 and we
                              have subsamples of M = 3
                          CREATE VIEW design AS
                          SELECT a.trial_id, floor(100 * random()) AS
                          row_id
                          FROM generate_series(1, 10000) AS a (trial_id),
                          generate_series(1, 3) AS b (subsample_id)


Thursday, August 25, 11                                                     45
In-Database Operations
                     •    Bootstrapping
                     •    Calculate the sampling distribution
                          CREATE VIEW trials AS
                          SELECT d.trial_id, AVG(T.value) AS avg_value
                          FROM design d,T
                          WHERE d.row_id = T.row_id
                          GROUP BY d.trial_id



Thursday, August 25, 11                                                  46
In-Database Operations
                     • Bootstrapping
                     • Final distribution parameters
                          SELECT AVG(avg_value), STDDEV(avg_value)
                          FROM trials




Thursday, August 25, 11                                              47
MAD DBMS
                     • Loading/Unloading
                      • Scatter/Gather Streaming – fast!
                      • Extract-Transform-Load vs Extract-Load-
                            Transform
                          • Greenplum has MapReduce support!


Thursday, August 25, 11                                           48
MAD DBMS
                     •    Storage
                          •   Tunable table types:
                              •   external tables (e.g. files)
                              •   heap tables (frequent updates)
          compressible
                              •   append-only tables (rare updates)
                              •   column-stores flexibility
                          •   All tables have a distribution policy


Thursday, August 25, 11                                               49
MAD DBMS
                     •    Partitioning
                          •   Partition by range of values or columns (list)
                              •   i.e. partition by timestamp old stuff goes to
                                  compressed table, new stuff goes to heap
                                  storage.
                          •   Query optimizer knows the partitioning
                              scheme
                          •   This is useful for staging data


Thursday, August 25, 11                                                           50
MAD DBMS
                     •    MAD Programming
                          •   Use your favorite programming language extension
                          •   Programmers must think out the code works w/o
                              shared memory. (data-parallel)
                          •   Map and Reduce functions can be written in Python,
                              Perl, or R.
                          •   They run using the Scatter/Gather technique. (Any
                              table type)
                          •   SQL is not necessary!


Thursday, August 25, 11                                                            51
MAD DBMS
                     • Future
                      • Need a better repository for the library
                            of functions. (Morpheus SIGMOD06)
                          • Automatically choose data and tables
                            layouts automatically
                          • Use online aggregation for faster answers
                            to queries.


Thursday, August 25, 11                                                 52
Questions




Thursday, August 25, 11               53
Questions

                     • Why would one be apposed to the M.A.D.
                          paradigm?




Thursday, August 25, 11                                         53
Questions

                     • Why would one be apposed to the M.A.D.
                          paradigm?
                     • How does MADLib and Greenplum fit into
                          the NoSQL movement?




Thursday, August 25, 11                                         53
Questions

                     • Why would one be apposed to the M.A.D.
                          paradigm?
                     • How does MADLib and Greenplum fit into
                          the NoSQL movement?
                     • Is there a market for an AppStore for
                          functions/utilities for MADLib?



Thursday, August 25, 11                                         53
Conclusion
                     • This paper describes the M.A.D. (Magnetic
                          Agile Deep) mindset and the M.A.D. library.
                     • It demonstrated how to perform analytics
                          inside the DB using SQL and stored
                          procedures.
                     • It also described the Greenplum database.
                          A general data-parallel database for simple
                          and complex queries from flexible tables.


Thursday, August 25, 11                                                 54
Thank You!
Thursday, August 25, 11                55
MADLib Installation
                              (ubuntu)
                     •    sudo apt-get install postgresql flex bison oxygen
                          libboost1.42-* git graphviz libarmadillo-dev
                          libarmadillo-doc
                     •    sudo apt-get install lapack++ blas++
                     •    git clone https://github.com/madlib/madlib.git
                     •    ./configure
                     •    cd build
                     •    cd make
  https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source



Thursday, August 25, 11                                                      56
Matrix Multiplication
                                       Demo
                          -- Create tables in the new matrix format   	

      (2, 1, 7),
                          CREATE TABLE A (                            	

      (2, 2, 8);
                             row_number INTEGER,
                              column_number INTEGER,                  -- Check to see that the values are multiplying correctly
                          	

      	

   value DECIMAL                SELECT A.row_number, B.column_number,
                          );                                          A.value::text || '*' || B.value::text
                          INSERT INTO A VALUES                        FROM A, B
                          	

     (1, 1, 1),                          WHERE A.column_number = B.row_number
                          	

        (1, 2, 2),
                          	

        (2, 1, 3),                       -- Add the Groupby to see the sum work
                          	

        (2, 2, 4);                       SELECT A.row_number, B.column_number, SUM(A.value *
                                                                      B.value)
                          CREATE TABLE B (                            FROM A, B
                                row_number INTEGER,                   WHERE A.column_number = B.row_number
                                column_number INTEGER,                GROUP BY A.row_number, B.column_number;
                          	

     	

   value DECIMAL
                          );                                          DROP TABLE A;
                          INSERT INTO B VALUES                        DROP TABLE B;
                          	

        (1, 1, 5),
                          	

        (1, 2, 6),




Thursday, August 25, 11                                                                                                           57

Contenu connexe

Tendances

Pengertian politik menurut para tokoh dan ahli
Pengertian politik menurut para tokoh dan ahliPengertian politik menurut para tokoh dan ahli
Pengertian politik menurut para tokoh dan ahli
Operator Warnet Vast Raha
 
bab 3 : perdagangan antarabangsa dan imbangan pembayaran
bab 3 : perdagangan antarabangsa dan imbangan pembayaranbab 3 : perdagangan antarabangsa dan imbangan pembayaran
bab 3 : perdagangan antarabangsa dan imbangan pembayaran
grace anzelletha
 
Mikroekonomi Topik 7
Mikroekonomi Topik 7Mikroekonomi Topik 7
Mikroekonomi Topik 7
WanBK Leo
 
Barangan tempatan
Barangan tempatanBarangan tempatan
Barangan tempatan
121101
 
Pasaran persaingan sempurna dan analisis pasaran
Pasaran persaingan sempurna dan analisis pasaranPasaran persaingan sempurna dan analisis pasaran
Pasaran persaingan sempurna dan analisis pasaran
kim rae KI
 
Ecn 2013 pasaran persaingan tak sempurna
Ecn 2013   pasaran persaingan tak sempurnaEcn 2013   pasaran persaingan tak sempurna
Ecn 2013 pasaran persaingan tak sempurna
Sukhairi Husain
 

Tendances (20)

LECTURE 12: Technology Protection
LECTURE 12: Technology ProtectionLECTURE 12: Technology Protection
LECTURE 12: Technology Protection
 
Perkembangan Teknologi Jepun
Perkembangan Teknologi JepunPerkembangan Teknologi Jepun
Perkembangan Teknologi Jepun
 
STPM SEM 1 P.PERNIAGAAN TEMA 3 (PENDEKATAN PEMASARAN)
STPM SEM 1 P.PERNIAGAAN TEMA 3 (PENDEKATAN PEMASARAN)STPM SEM 1 P.PERNIAGAAN TEMA 3 (PENDEKATAN PEMASARAN)
STPM SEM 1 P.PERNIAGAAN TEMA 3 (PENDEKATAN PEMASARAN)
 
Pengertian politik menurut para tokoh dan ahli
Pengertian politik menurut para tokoh dan ahliPengertian politik menurut para tokoh dan ahli
Pengertian politik menurut para tokoh dan ahli
 
PELAJAR ASING /EKSPATRIAT DI MALAYSIA
PELAJAR ASING /EKSPATRIAT DI MALAYSIAPELAJAR ASING /EKSPATRIAT DI MALAYSIA
PELAJAR ASING /EKSPATRIAT DI MALAYSIA
 
Slide kmk
Slide kmkSlide kmk
Slide kmk
 
bab 3 : perdagangan antarabangsa dan imbangan pembayaran
bab 3 : perdagangan antarabangsa dan imbangan pembayaranbab 3 : perdagangan antarabangsa dan imbangan pembayaran
bab 3 : perdagangan antarabangsa dan imbangan pembayaran
 
Mikroekonomi Topik 7
Mikroekonomi Topik 7Mikroekonomi Topik 7
Mikroekonomi Topik 7
 
MALAYSIA BERMINAT UNTUK MENDALAMI BIDANG INOVASI HINGGA KE PERINGKAT ANTARABA...
MALAYSIA BERMINAT UNTUK MENDALAMI BIDANG INOVASI HINGGA KE PERINGKAT ANTARABA...MALAYSIA BERMINAT UNTUK MENDALAMI BIDANG INOVASI HINGGA KE PERINGKAT ANTARABA...
MALAYSIA BERMINAT UNTUK MENDALAMI BIDANG INOVASI HINGGA KE PERINGKAT ANTARABA...
 
Syarikat multinasional
Syarikat multinasionalSyarikat multinasional
Syarikat multinasional
 
Bab7 pengembangan perniagaan[1]
Bab7 pengembangan perniagaan[1]Bab7 pengembangan perniagaan[1]
Bab7 pengembangan perniagaan[1]
 
Dasar luar malaysia
Dasar luar malaysiaDasar luar malaysia
Dasar luar malaysia
 
Barangan tempatan
Barangan tempatanBarangan tempatan
Barangan tempatan
 
Perlembagaan
PerlembagaanPerlembagaan
Perlembagaan
 
Pasaran persaingan sempurna dan analisis pasaran
Pasaran persaingan sempurna dan analisis pasaranPasaran persaingan sempurna dan analisis pasaran
Pasaran persaingan sempurna dan analisis pasaran
 
Industri homestay
Industri homestayIndustri homestay
Industri homestay
 
Ecn 2013 pasaran persaingan tak sempurna
Ecn 2013   pasaran persaingan tak sempurnaEcn 2013   pasaran persaingan tak sempurna
Ecn 2013 pasaran persaingan tak sempurna
 
Pengajian Am Konsep Negara
Pengajian Am Konsep NegaraPengajian Am Konsep Negara
Pengajian Am Konsep Negara
 
Tajuk kajian kerja kursus pengajian am penggal 2
Tajuk kajian kerja kursus pengajian am penggal 2Tajuk kajian kerja kursus pengajian am penggal 2
Tajuk kajian kerja kursus pengajian am penggal 2
 
Perniagaan STPM -PRODUK
Perniagaan STPM -PRODUKPerniagaan STPM -PRODUK
Perniagaan STPM -PRODUK
 

En vedette

Plan pour la paix: Pour un renouveau des relations internationales
Plan pour la paix: Pour un renouveau des relations internationalesPlan pour la paix: Pour un renouveau des relations internationales
Plan pour la paix: Pour un renouveau des relations internationales
Florian Brunner
 
Plumber colorado springs co open rooter
Plumber colorado springs co   open rooterPlumber colorado springs co   open rooter
Plumber colorado springs co open rooter
plumber80903
 

En vedette (16)

Media Pitching - jak efektywnie pisać do dziennikarzy - MailMyDay2017
Media Pitching - jak efektywnie pisać do dziennikarzy - MailMyDay2017Media Pitching - jak efektywnie pisać do dziennikarzy - MailMyDay2017
Media Pitching - jak efektywnie pisać do dziennikarzy - MailMyDay2017
 
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
Thomas Davenport: Analytics at Work: How to Make Better Decisions and Get Bet...
 
Useful Cosmetic Dentistry Information
Useful Cosmetic Dentistry InformationUseful Cosmetic Dentistry Information
Useful Cosmetic Dentistry Information
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache Storm
 
MagenTys Service Overview
MagenTys Service OverviewMagenTys Service Overview
MagenTys Service Overview
 
Oportunidades y desafíos del Acuerdo Comercial del Perú con la India
Oportunidades y desafíos del Acuerdo Comercial del Perú con la IndiaOportunidades y desafíos del Acuerdo Comercial del Perú con la India
Oportunidades y desafíos del Acuerdo Comercial del Perú con la India
 
Ringmakers of-saturn---norman-r.-bergrun
Ringmakers of-saturn---norman-r.-bergrunRingmakers of-saturn---norman-r.-bergrun
Ringmakers of-saturn---norman-r.-bergrun
 
Lazette Harnish: America's Most-Visited Tourist Places
Lazette Harnish: America's Most-Visited Tourist PlacesLazette Harnish: America's Most-Visited Tourist Places
Lazette Harnish: America's Most-Visited Tourist Places
 
Religion and Enviroment
Religion and EnviromentReligion and Enviroment
Religion and Enviroment
 
Goyescas
GoyescasGoyescas
Goyescas
 
Plan pour la paix: Pour un renouveau des relations internationales
Plan pour la paix: Pour un renouveau des relations internationalesPlan pour la paix: Pour un renouveau des relations internationales
Plan pour la paix: Pour un renouveau des relations internationales
 
Plumber colorado springs co open rooter
Plumber colorado springs co   open rooterPlumber colorado springs co   open rooter
Plumber colorado springs co open rooter
 
How to Develop a Social Media Presence in 30 Days or Less
How to Develop a Social Media Presence in 30 Days or LessHow to Develop a Social Media Presence in 30 Days or Less
How to Develop a Social Media Presence in 30 Days or Less
 
Moodle is dead... Iain Bruce, James Blair, Michael O'Loughlin
Moodle is dead... 	Iain Bruce, James Blair, Michael O'LoughlinMoodle is dead... 	Iain Bruce, James Blair, Michael O'Loughlin
Moodle is dead... Iain Bruce, James Blair, Michael O'Loughlin
 
Sintesis informativa 22 de marzo 2017
Sintesis informativa 22 de marzo 2017Sintesis informativa 22 de marzo 2017
Sintesis informativa 22 de marzo 2017
 
Announcements- Thursday, March 23, 2017
Announcements- Thursday, March 23, 2017Announcements- Thursday, March 23, 2017
Announcements- Thursday, March 23, 2017
 

Similaire à MAD Skills: New Analysis Practices for Big Data

Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
larsgeorge
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
MapR Technologies
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
DataWorks Summit
 
Gilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead YetGilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead Yet
Peter O'Kelly
 

Similaire à MAD Skills: New Analysis Practices for Big Data (20)

NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...
NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...
NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...
 
Needs for Data Management & Citation Throughout the Information Lifecycle
Needs for Data Management & Citation Throughout  the Information LifecycleNeeds for Data Management & Citation Throughout  the Information Lifecycle
Needs for Data Management & Citation Throughout the Information Lifecycle
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data Management
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
 
251 carpenter
251 carpenter251 carpenter
251 carpenter
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWS
 
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinar
 
Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...
Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...
Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on Introduction
 
Gilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead YetGilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead Yet
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

MAD Skills: New Analysis Practices for Big Data

  • 1. MAD Skills: New Analysis Practices for Big Data Presented By: Christan Grant cgrant@cise.ufl.edu Thursday, August 25, 11 1
  • 2. MAD Skills: New Analysis Practices for Big Data • Authors • Jeff Cohen – Greenplum • Brian Dolan – Fox Audience Network • Mark Dunlap – Evergreen Technologies • Joseph M Hellerstein – UC Berkeley • Caleb Welton – Greenplum • Presented at Very Large Database Conference 2009 in Lyon, France Thursday, August 25, 11 2
  • 3. If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. – Prof. Hal Varian, Chief Economist at Google Thursday, August 25, 11 3
  • 4. • Storage is Cheap • The worlds largest data warehouse of ~15 years ago can be stored on disks for about $2000. Thursday, August 25, 11 4
  • 5. • Traditionally, a lot of time is spent building well structured data warehouses. • Contains summaries of data • Main analytics location • Jealously guarded by IT Thursday, August 25, 11 5
  • 6. • Data analysis is common culture • Culture is to collect and analyze data in difference business units. Thursday, August 25, 11 6
  • 7. • Data analysis is common culture • Culture is to collect and analyze data in difference business units. • “In God we trust; all others must bring data” Thursday, August 25, 11 6
  • 8. • Data analysis is common culture • Culture is to collect and analyze data in difference business units. • “In God we trust; all others must bring data” • Analytics are crucial – Analysts need more access and more permissions • Hence M.A.D. Thursday, August 25, 11 6
  • 9. • Magnetic – Attract data and practitioners. The Database must be painless and efficient. • Agile – Analysts need to ingest and analyze on the fly. • Deep – Sophisticated analytics at scale. Don’t force analyst to work on samples (where the long tail is lost). • Library of tools Thursday, August 25, 11 7
  • 10. Outline • Background • FOX Audience Network • Being MAD • In-Database Operations • MAD DBMS • Conclusions • Question Thursday, August 25, 11 8
  • 11. Background • OLAP and Data Cubes • Data structure to provides descriptive statistics – Simple summaries (sum, average, std) • We want inferential/ inductive statistics – Predictions, causality analysis, distributional comparison Thursday, August 25, 11 9
  • 12. Background • Databases and Statistical Packages • Many analysts download data to use in Excel/SAS/ Matlab/R or their favorite programming language? FORTRAN?? • Use matrix/vector operations • Most of these stat packages require data to fit in RAM • Taking samples from the full data to fit into ram results in loss of precision • External toolkits may also lack parallelism Thursday, August 25, 11 10
  • 13. Background • MapReduce and Parallel Programming • Strengths of MR are scalable batch parallelism and fault tolerance. • There is work on putting ML algorithms in MapReduce Framework (Mahout) • MR is just a programming framework and works well with MAD Thursday, August 25, 11 11
  • 14. Background • Data Mining and Analytics in the DB • Plenty of work in this area • Methods are typically implemented as a black box • Algorithms are not flexible as in programmed packages Thursday, August 25, 11 12
  • 15. FOX Audience Network • Served MySpace.com, IGN.com, Scout.com etc. • About 150 Million users • Ad network, bought by Rubicon in Nov 2010 Thursday, August 25, 11 13
  • 16. Fox Audience Network • 42 Nodes (2 Masters, 40 Workers) • Sun X4500 Thumper • 48 500GB drives • 16GB RAM • ~ 5TB Daily • 1 Table 1.5 Trillion Rows • Many different types of users workloads • Dynamic query ecosystem Thursday, August 25, 11 14
  • 17. Fox Audience Network • Use Cases • Queries to the system are vastly different • Example queries: • How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle? • Expensive • How are these people similar to those that visited Nissan? • Open-ended, requires some statistics and the analyst to be in the loop. Thursday, August 25, 11 15
  • 18. Being MAD • Analysts (Data Scientists) trump DBAs • They crave data • They can work with dirty data • They like all data (not samples) • Have a belt of tools Thursday, August 25, 11 16
  • 19. Being MAD • Get data to the warehouse as soon as possible • Cleaning and integration should be done intelligently Thursday, August 25, 11 17
  • 20. Being MAD • Three logical layers for data stores: • Reporting – Specialized static aggregates • Production – Aggregates for most users • Stages – Raw tables and logs • Sandboxes – Play ground for analysts • The logical levels encourages creativity Thursday, August 25, 11 18
  • 21. Being MAD • Data Scientists need power tools Math + Stats + ML • The MADLib approach is to develop a hierarchy of mathematical concepts in the database. Thursday, August 25, 11 19
  • 22. In-Database Operations • User Defined Functions (UDF) • SQL or PL/pgSQL • Internal (c) • Dynamically Loaded (c, c++, java, javascript, perl, python, R, ruby, scheme, etc) • SELECT f1(city) FROM states WHERE f2(capital) Thursday, August 25, 11 20
  • 23. In-Database Operations • Value abstraction levels • 1.0 Scalar • [0.3, 0.3, 0.4] Vector • [[0.3, 0.7], [0.6, 0.4]] Matrix Function • f( • ) → value Thursday, August 25, 11 21
  • 24. In-Database Operations Scalar Arithmetic Thursday, August 25, 11 22
  • 25. In-Database Operations • Scalar Arithmetic • SELECT 5*4; • SELECT sqrt(64); • SELECT cos(-3.14159 * sqrt(2) / 2 ); Thursday, August 25, 11 23
  • 26. In-Database Operations Vector Arithmetic Thursday, August 25, 11 24
  • 27. In-Database Operations • Vector • array objects: float8[] • array operations are implemented as UDFs. technically violates 1st normal form Thursday, August 25, 11 25
  • 28. In-Database Operations • Vector technically violates 1st normal form Thursday, August 25, 11 26
  • 29. In-Database Operations Matrix Arithmetic Thursday, August 25, 11 27
  • 30. In-Database Operations • Matrix Representation #1 CREATE TABLE B ( row_number integer, vector numeric[] ); http://www.postgresql.org/docs/current/static/arrays.html Thursday, August 25, 11 28
  • 31. In-Database Operations • Matrix addition: SELECT A. row_number, A.vector + B.vector FROM A, B WHERE A.row_number = B.row_numbner This can be easily implemented as a primitive operator: SELECT A ** B from A ,B Thursday, August 25, 11 29
  • 32. In-Database Operations • Matrix and vector multiplication: Av SELECT 1, array_accum(row_number, vector*v) FROM A array_accum(x,v) is a custom function. It takes the value v and puts it in the row indexed by x. Thursday, August 25, 11 30
  • 33. In-Database Operations • Matrix Representation #2 CREATE TABLE A ( row_number INTEGER, column_number INTEGER, value DECIMAL ); • Great sparse representation! http://www.postgresql.org/docs/current/static/arrays.html Thursday, August 25, 11 31
  • 34. In-Database Operations • Matrix Multiplication (using rep #2): SELECT A. row_number, B.column_number, SUM(A.value * B.value) FROM A, B WHERE A.column_number = B.row_number GROUP BY A.row_number, B.column_number Thursday, August 25, 11 32
  • 35. In-Database Operations • These operations require one pass over the data and can be done in parallel. • What about Matrix Division, and Matrix Inverse? • SQL is awkward bc of it’s lack of loops for iteration. Thursday, August 25, 11 33
  • 36. In-Database Operations • Document similarity for fraud detection • Check to see if several advertisers are linking to pages to the same content. • If several advertisers have out links to similar documents it is possible they are using stolen credit card. • Especially if the outbound documents are malicious. Thursday, August 25, 11 34
  • 37. In-Database Operations • For document similarity we use tf-idf. This is a score for a term in a given document. For a term: tf-idft,d = tft,d * idft tft,d = term count in doc idft = log { |docs| / |docs w. term|} Thursday, August 25, 11 35
  • 38. In-Database Operations • Table docs contains triples (document, term, count) • tf UDF in SQL takes a term and document as parameters SELECT count(term) FROM docs WHERE term = t AND document = d • idf UDF in SQL takes a term as a parameter SELECT log ( (SELECT count(document) FROM docs ) / (SELECT count(document) FROM docs WHERE term = t GROUP BY document)) FROM docs Thursday, August 25, 11 36
  • 39. In-Database Operations • Each doc will have a vector with the tf-idf scores, we can compare the results using cosine similarity. cos θ = v * u / (||v|| * ||u||) small θ means v and u are similar • SQL is straight forward using the vector operators Thursday, August 25, 11 37
  • 40. In-Database Operations • Ordinary Least Squares is used to fit a line to a collection of points. • This helps to describe seasonal trends - its a first look at the data. Solve Y = Xβ where X is an n × k Matrix. Estimate β using β* = (XTX)-1XTy We can Parallellize using computation Distribute: A ← XTX, b ← XTy Collect: A and b. Calculate: A-1 Multiply: A-1 * b Thursday, August 25, 11 38
  • 41. In-Database Operations Functional Arithmetic Thursday, August 25, 11 39
  • 42. In-Database Operations • A/B Testing (Multivariate testing) • Web companies show people different versions of their website to to see which one is the best. • Ad companies compare different add campaigns and may want to find the one with the higher clicks-through rate. Thursday, August 25, 11 40
  • 43. In-Database Operations • Mann-Whitney U Test • This is a popular test to compare non- parametric data. • non-parametric def= data set that does not fit in a well known distribution • Combine data values and score the values, find the average location of the list for each value. • Z = U – (n1n2/2) / √(n1n2(n1 + n2 +1)/12) Thursday, August 25, 11 41
  • 44. In-Database Operations • Mann-Whitney U Test • Given Table T with (sample_id, value) CREATE VIEW R AS SELECT sample_id, avg(value) AS sample_avg sum(rown) AS rank_sum, count(*) AS sample_n, sum(rown) - count(*) * count(*) + 1 AS sample_us FROM ( SELECT sample_id, row_number() OVER (ORDER BY value DESC) AS rown, value FROM T) AS ordered GROUP BY sample_id Thursday, August 25, 11 42
  • 45. In-Database Operations • Mann-Whitney U Test • With large sample count we can get the z_score SELECT r.sample_u, r.sample_avg, r.sample_n, (r.sample_u – a.sum_u / 2) / sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score FROM R AS r, (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n FROM R) AS a GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u Thursday, August 25, 11 43
  • 46. In-Database Operations • Mann-Whitney U Test • With large sample count we can get the z_score SELECT r.sample_u, r.sample_avg, r.sample_n, (r.sample_u – a.sum_u / 2) / sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score FROM R AS r, (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n FROM R) AS a GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u This can be easily implemented as a stored procedure: SELECT man_whitney(value) FROM table Thursday, August 25, 11 43
  • 47. In-Database Operations • Resampling • Sampling several subsets instead of the whole population. • Bootstrapping def= Sampling to buckets (w/ replacement) then combine buckets • Jackknifing def= Leave some data out and sample; then combine runs Thursday, August 25, 11 44
  • 48. In-Database Operations • Bootstrapping • Pre-select which trials contain which rows • Suppose we have population N = 100 and we have subsamples of M = 3 CREATE VIEW design AS SELECT a.trial_id, floor(100 * random()) AS row_id FROM generate_series(1, 10000) AS a (trial_id), generate_series(1, 3) AS b (subsample_id) Thursday, August 25, 11 45
  • 49. In-Database Operations • Bootstrapping • Calculate the sampling distribution CREATE VIEW trials AS SELECT d.trial_id, AVG(T.value) AS avg_value FROM design d,T WHERE d.row_id = T.row_id GROUP BY d.trial_id Thursday, August 25, 11 46
  • 50. In-Database Operations • Bootstrapping • Final distribution parameters SELECT AVG(avg_value), STDDEV(avg_value) FROM trials Thursday, August 25, 11 47
  • 51. MAD DBMS • Loading/Unloading • Scatter/Gather Streaming – fast! • Extract-Transform-Load vs Extract-Load- Transform • Greenplum has MapReduce support! Thursday, August 25, 11 48
  • 52. MAD DBMS • Storage • Tunable table types: • external tables (e.g. files) • heap tables (frequent updates) compressible • append-only tables (rare updates) • column-stores flexibility • All tables have a distribution policy Thursday, August 25, 11 49
  • 53. MAD DBMS • Partitioning • Partition by range of values or columns (list) • i.e. partition by timestamp old stuff goes to compressed table, new stuff goes to heap storage. • Query optimizer knows the partitioning scheme • This is useful for staging data Thursday, August 25, 11 50
  • 54. MAD DBMS • MAD Programming • Use your favorite programming language extension • Programmers must think out the code works w/o shared memory. (data-parallel) • Map and Reduce functions can be written in Python, Perl, or R. • They run using the Scatter/Gather technique. (Any table type) • SQL is not necessary! Thursday, August 25, 11 51
  • 55. MAD DBMS • Future • Need a better repository for the library of functions. (Morpheus SIGMOD06) • Automatically choose data and tables layouts automatically • Use online aggregation for faster answers to queries. Thursday, August 25, 11 52
  • 57. Questions • Why would one be apposed to the M.A.D. paradigm? Thursday, August 25, 11 53
  • 58. Questions • Why would one be apposed to the M.A.D. paradigm? • How does MADLib and Greenplum fit into the NoSQL movement? Thursday, August 25, 11 53
  • 59. Questions • Why would one be apposed to the M.A.D. paradigm? • How does MADLib and Greenplum fit into the NoSQL movement? • Is there a market for an AppStore for functions/utilities for MADLib? Thursday, August 25, 11 53
  • 60. Conclusion • This paper describes the M.A.D. (Magnetic Agile Deep) mindset and the M.A.D. library. • It demonstrated how to perform analytics inside the DB using SQL and stored procedures. • It also described the Greenplum database. A general data-parallel database for simple and complex queries from flexible tables. Thursday, August 25, 11 54
  • 62. MADLib Installation (ubuntu) • sudo apt-get install postgresql flex bison oxygen libboost1.42-* git graphviz libarmadillo-dev libarmadillo-doc • sudo apt-get install lapack++ blas++ • git clone https://github.com/madlib/madlib.git • ./configure • cd build • cd make https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source Thursday, August 25, 11 56
  • 63. Matrix Multiplication Demo -- Create tables in the new matrix format (2, 1, 7), CREATE TABLE A ( (2, 2, 8); row_number INTEGER, column_number INTEGER, -- Check to see that the values are multiplying correctly value DECIMAL SELECT A.row_number, B.column_number, ); A.value::text || '*' || B.value::text INSERT INTO A VALUES FROM A, B (1, 1, 1), WHERE A.column_number = B.row_number (1, 2, 2), (2, 1, 3), -- Add the Groupby to see the sum work (2, 2, 4); SELECT A.row_number, B.column_number, SUM(A.value * B.value) CREATE TABLE B ( FROM A, B row_number INTEGER, WHERE A.column_number = B.row_number column_number INTEGER, GROUP BY A.row_number, B.column_number; value DECIMAL ); DROP TABLE A; INSERT INTO B VALUES DROP TABLE B; (1, 1, 5), (1, 2, 6), Thursday, August 25, 11 57