SlideShare une entreprise Scribd logo
1  sur  49
Télécharger pour lire hors ligne
Numerical Linear Algebra for
  Data and Link Analysis


     Leonid Zhukov, Yahoo! Inc
       David Gleich, Stanford
Outline

•  Intro
•  Scale free graphs
   – Flickr example
•  Page rank computing
Scientific computing

•  Engineering Computing             •  Information Retrieval


   –  continuum problems, PDE           –  Discrete data is given, no
      governed, control over               control over resolution
      discretization                    –  No associated physical space
   –  2D or 3D                             (coordinates)
   –  Uniform distribution of node      –  Not planar, triangulation
      degrees                           –  Power-low degree distribution
                                        –  “Small world” effect
Large graphs

•  Explicit: graphs & networks
    –    Web graph
    –    Internet
    –    Yahoo! Photo Sharing (Flickr)
    –    Yahoo! 360 (Social network)

•  Implicit: transaction history, email, messenger
    –  Yahoo! Search marketing (Overture)
    –  Yahoo! Mail
    –  Yahoo Messenger

•  Constructed: affinity between data points
    –  Yahoo! Music (Launch)
    –  Yahoo! Movies
    –  Yahoo! ….
Flickr – social network
Flickr: adjacency matrix




    580,000 users,   3,500,000 links
Flickr: reverse Cuthill-McKee




    580,000 users,   3,500,000 links
Flickr – a graph
Flickr: some stats

•    # nodes = 584,207
•    # edges (nnz) = 3,555,115

•    max in-degree = 3531
•    max out-degree = 8976
•    < in-degree > = < out-degree > = 6
•    diameter = 18

•    # strongly connected components = 152,324
•    largest strongly connected components = 274,649 374 186 155 11

•    # connected component = 43,189
•    largest connected components = 404,893 378 112 108 103
•  highest core number = 249 (size 668)
Flickr: scale free graph

in-degrees	





out-degrees
Data and link analysis

•  Web Search
   –  ranking of pages/hosts/domains
   –  graph algorithms
   –  web spam

•  Social networks
    –  finding communities
    –  trust propagation

•  Data mining
    –  clustering
    –  classification
Data and links analysis

•  Graph partitioning:
   –  “balanced”, for data distribution
   –  “natural boundaries”, for clustering / communities


•  Numerical computing on graphs:
   –  Node ranking methods (PageRank)
   –  Dimensionality reduction (SVD, HITS)
   –  Low dimensional embeddings
   –  Graph interpolation / regularization
What do we compute?

•  Probability matrix:


•  Graph Laplacian:




 •  SVD/LSI:



        Computing eigenvalues / eigenvectors, linear systems
Yahoo! Search Marketing (Overture)




                       View Bids Tool
Overture data

 search                 $ bid   advertiser
 accessory desk           .83   Office Max
 alfred hitchcock         .01   Videoflicks.com
 educational software     .13   Buy.com
 educational software     .4    eBAY International AG
 educational software     .17   OfficeMax
 epson                    .28   Buy.com
 epson                    .4    eBAY International AG
 fax                      .13   Buy.com
 fax                      .4    eBAY International AG
 fax                      .38   OfficeMax
 game                     .02   Net Business
 game                     .25   Buy.com
 george harrison          .15   eBAY International AG
 george harrison          .05   MusicStack
 george harrison          .01   Videoflicks.com
 jackson michael          .05   MusicStack
 jackson michael          .01   Videoflicks.com
Overture bid graph
                        Terms
     accessory desk
educational software
                                         Office Max
              epson
                                         Buy.com
                 fax

               game                      eBAY International AG

     george harrison                     Net Business

     jackson michael                     MusicStack

     alfred hitchcock                    Videoflicks.com


                                Advertisers
Overture bid matrix

                       1 2 3 4 5 6
    accessory desk

           game                      1. Net Business

               fax                   2. Office Max

educational software                 3. Buy.com

           epson                     4. eBAY

   george harrison                   5. MusicStack

                                     6. Videoflicks.com
  jackson michael

   alfred hitchcock
170,077 terms, 25,971 advertisers, 1,307,316 bids
Overture bid matrix: 18-core




2000 advertisers , 3000 search terms, 92,347 bids
Spectral graph partitioning
                                     •    assign each node indicator pi = §1
 1           2       3   4           •    partition p = {-1,-1,-1,-1, +1, +1, +1}
                             5
     6
                 7       8
     +1	

                   -1	



Relaxation procedure:	

                        =>

Quadratic optimization:	




Min cut solution :	



 Rounding off:
Spectral: ordering
  Eigenvector                                  Eigenvector – sorted




Adjacency matrix                          Adjacency matrix – re-ordered




        perm =   [1   2   6   7   3   8    4    5]
Spectral : additions

•  Recursively
•  Min cut, ratio cut, normalized cut, quotient cut, …
•  Sweep for optimal value
•  Flow based refinement
•  Optimize partition among several eigenvectors
•  Bi-partite graphs
•  + other bells & whistles
Search Market Place
Sample clusters

'A.S.C. Incorporated'
    'Atlantic Telecom'       'Berent Associates Center for
         'AudioLink'          Shyness and Social Therapy'
 'Cost Plus Electronics'                 'Dr. Puff'
   'Headset Express'       'New York Psychotherapy Collective'
        'Hello Direct'                 'Ruby Shoes'
      'PDA Mountain'           'Self-employed psychologist'
     'PK TECH INC‘                 'www.kabbalah.com‘
   …………………..                           ……………

   'accessory phone'                'health mental'
     'cordless headset'                 'help self'
'cordless headset phone'           'improvement self'
          'free hands'                 'parenting‘
            'headset'
  'headset microphone'
       'headset phone'
   'headset plantronics'
     'headset wireless'
              'isdn'
        'plantronics‘
         …………….
Flickr: “10-core” spectral ordering
Websearch Engines

•  Crawler + indexer               •  Front end




                                                  #   Qs    PR
                                                  1   0.7   0.4
                                                  2   0.8   0.1

 Link Analysis                                    3   0.1   0.5
                  Text Analysis
   PageRank       Inverted Index
PageRank model

•  Random walk on graph
•  Markov process: memory-less, homogeneous
•  Stationary distribution: existence, uniqueness, convergence




Perron-Frobenius theorem, irreducible, every state is reachable from every
                    other, and aperiodic – no cycles
PageRank

•  Construct probability matrix:



•  Construct transition row-stochastic matrix for Markov process



 •  Correct reducibility:



 •  Find stationary distribution ( exist & unique):




c – teleportation coefficient, v – personalization vector, d – dangling nodes (sinks) indicator
PageRank linear system

•  Eigensystem:




•  Linear system:
Computational diagram
              PageRank Problem


Eigensystem                 Linear system



   Power        Jacobi
                           GMRES          BiCG         BiCGSTAB
 Iterations   Iterations
                                     Preconditioners

                                         Block         Additive
                            Jacobi
                                         Jacobi        Schwarz




              PageRank Vector
Data: WebGraphs

 Name         # Nodes/size      # Links / nnz             Memory
  bs               0.3M              1.6M                 20.4 MB
  edu               2M                14M                 176MB
yahoo-r2           14M               266M                 3.25GB
  uk              18.5M              300M                 3.67GB
yahoo-r3           60M               850M                 10.4GB
  db               70M                 1B                 12.3GB
  av               1.4B               6.6B                80GB



  Graph in memory, compressed storage, int, int, double
Power law graphs ( y = x - b) ?




bs: 0.3M nodes     y2: 266 M nodes   db: 1B nodes
Computational Methods




                                 Error Metrics for Jacobi Convergence                                                                   Error Metrics for BiCG Convergence
                    0                                                                                                       2
               10                                                                                                      10
                                                                      ||x(k) - x*||                                                                                     ||x(k) - x*||
                    -1
                                                                               (k)
               10                                                     ||A x          - b||                                                                              ||A x(k) - b||
                                                                         (k)         *       (k)
                                                                      ||x      - x ||/||x ||                           10
                                                                                                                            0
                                                                                                                                                                        ||x(k) - x*||/||x(k)||
                    -2
               10

                    -3
               10                                                                                                           -2
Metric Value




                                                                                                        Metric Value
                                                                                                                       10
                    -4
               10
                                                                                                                            -4
                    -5                                                                                                 10
               10

                    -6
               10
                                                                                                                            -6
                                                                                                                       10
                    -7
               10

                    -8                                                                                                      -8
               10                                                                                                      10
                        0   10    20      30      40       50    60                  70            80                           0   5   10    15    20      25   30     35         40            45
                                               Iteration                                                                                             Iteration
Convergence results


             0
                                   uk iteration convergence                                0
                                                                                                                db iteration convergence
        10                                                                            10
                                                                    std                                                                              std
        10
             -1                                                     jacobi            10
                                                                                           -1                                                        jacobi
                                                                    gmres                                                                            gmres
             -2
                                                                    bicg                   -2
                                                                                                                                                     bicg
        10                                                          bcgs              10                                                             bcgs
             -3                                                                            -3
        10                                                                            10
Error




                                                                              Error
             -4                                                                            -4
        10                                                                            10

             -5                                                                            -5
        10                                                                            10

             -6                                                                            -6
        10                                                                            10

             -7                                                                            -7
        10                                                                            10

             -8
        10                                                                                 -8
                 0   10       20      30      40       50     60   70        80       10
                                                                                               0   10      20         30        40         50   60            70
                                           Iteration                                                                    Iteration


                          uk: 18.5 M pages, 300 M links	

                                              db: 70 M pages, 1 B links.
Convergence results


             0
                              uk time convergence                                      0
                                                                                                             db time convergence
        10                                                                        10
                                                            std                                                                                std
        10
             -1                                             jacobi                     -1                                                      jacobi
                                                                                  10
                                                            gmres                                                                              gmres
             -2
                                                            bicg                       -2
                                                                                                                                               bicg
        10                                                  bcgs                  10                                                           bcgs
             -3                                                                        -3
        10                                                                        10
Error




                                                                          Error
             -4                                                                        -4
        10                                                                        10

             -5                                                                        -5
        10                                                                        10

             -6
        10                                                                             -6
                                                                                  10

             -7
        10                                                                        10
                                                                                       -7



             -8
        10                                                                        10
                                                                                       -8
                 0   5   10    15       20       25   30   35        40                    0   100     200       300     400       500   600            700
                                    Time (sec)                                                                   Time (sec)


                     uk: 18.5 M pages, 300 M links                                                   db: 70 M pages, 1 B links.
Summary
 Name       Size     Power       Jacobi     GMRES        BiCG        BCGS
  edu        2M        84          84         21*         44*         21*
20 procs    14M     0.09/7.5s   0.07/6.5s   0.6/13.2s   0.4/17.7s   0.4/8.7s
yahoo-r2    14M        71          65         12*          35*         17*
20 procs    266M    1.8/129s    1.9/126s    16/194s     8.6/300s    9.9/168s

   uk       18.5M      73          71         22*         25*         11*
60 procs    300M    0.09/7s     0.1/10s     0.8/17.6s   0.8/19.4s   1.0/10.8s
yahoo-r3    60M        76          75
60 procs    850M    1.6/119s    1.5/112s
   db       70M        62          58          29          45         15*
60 procs     1B     9.0/557s    8.7/506s    15/432s     15/676s     15/220s
   av       1.4B       72                                              26
140 procs   6.6B    4.6/333s                                        15/391s
Reduced teleportation
                             2
                                          Convergence and Conditioning for db
                        10
                                                                                       std
                                                                                       gmres
                             0
                                                                                       c = 0.85
                        10                                                             c = 0.90
                                                                                       c = 0.95
                                                                                       c = 0.99
       Error/Residual




                             -2
                        10



                             -4
                        10



                             -6
                        10



                             -8
                        10
                                 0   20   40   60   80     100       120   140   160   180        200
                                                         Iteration
db: 70 M pages, 1 B links.
YRL research cluster

                              Custom
      Parallel PageRank

                PETSc
                              Off the shelf
            MPI

                              Gigabit Switch

                              RLX Blades
                              Dual 2.8 GHz Xeon
                              4 GB RAM
                              Gigabit Ethernet
                              120 Total
RLX       RLX           RLX
Typical graph




out-degrees.




in-degrees.
                bs: 70 M pages, 1 B links.
Parallel Matrix-Vector Multiply




                      =
Graph distribution methods

•    Balance rows
•    Balance nnz
•    Random destribution
•    Balance rows and nnz (heuristics)




•  2D distributions
•  Graph partitioning
•  Hypergraph partitioning
Some parallel experiments
Domain lexicographic sorting

http://host.domain.tld/path       !   http://tld.domain.host/path




 bs-cc: 17 k pages, 133 k links
Domain ordering

                             0
                                                 Host Order Improvement
                        10
                                                                            std-140
                        10
                             -1                                             bcgs-140
                                                                            std-140-host
                             -2                                             bcgs-140-host
                        10

                             -3
                        10
       error/residual




                             -4
                        10

                             -5
                        10

                             -6
                        10

                             -7
                        10

                             -8
                        10
                                 0   100   200   300   400   500    600   700    800        900
                                                       Time (sec)
av: 1.4 B pages, 6.6 B links.
Full web graph parallelization

                                                                    Scaling for computing with full-web
 Performance Increase (Percent decrease in time)
                                                   250%
                                                             std
                                                             bcgs

                                                   200%



                                                   150%



                                                   100%



                                                   50%



                                                    0%
                                                     90%    100%    110%   120%    130%    140%   150%    160%   170%

                                                           140                         186                       226
av: 1.4 B pages, 6.6 B links.
FEM mesh for CFD computations
Power-law graph embedding in 3D
final thoughts

•  Power-law graphs are different
•  Not all power-law graphs are the same
•  But all have central core, chains, singletons.
•  Difficult to distribute / parallelize
•  Its is experimental data
•  Lots of cool applications
•  Let’s talk …..
References

•  Spectral graph partitioning:

-    M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992),
    B. Hendrickson (1995), D. Spielman (1996), F. Chang (1996), S. Guattery
    (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon ( 2001), A. Ng (2001),
    H. Zha (2001), C. Ding (2001)


•  PageRank computing:
-   S.Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu
    (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh
    (2003), S. Kamvar (2003), A. Broder (2004)

Contenu connexe

Similaire à Numerical Linear Algebra for Data and Link Analysis

Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Kaggle boschコンペ振り返り
Kaggle boschコンペ振り返りKaggle boschコンペ振り返り
Kaggle boschコンペ振り返りKeisuke Hosaka
 
Facebook Architecture - Breaking it Open
Facebook Architecture - Breaking it OpenFacebook Architecture - Breaking it Open
Facebook Architecture - Breaking it OpenHARMAN Services
 
OBACS - SCINET ENTERPRISE Data Sheet
OBACS - SCINET ENTERPRISE Data SheetOBACS - SCINET ENTERPRISE Data Sheet
OBACS - SCINET ENTERPRISE Data SheetCoskun Oba
 
Análisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackAnálisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackElasticsearch
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用CHENHuiMei
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsAlbert Bifet
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
[SNU Computer Vision Course Project] Image Style Recognition
[SNU Computer Vision Course Project] Image Style Recognition[SNU Computer Vision Course Project] Image Style Recognition
[SNU Computer Vision Course Project] Image Style RecognitionHunjae Jung
 
Elastic Stack roadmap deep dive
Elastic Stack roadmap deep diveElastic Stack roadmap deep dive
Elastic Stack roadmap deep diveElasticsearch
 
สื่อการเรียนการสอน Cai2
สื่อการเรียนการสอน Cai2สื่อการเรียนการสอน Cai2
สื่อการเรียนการสอน Cai2Bombbam Deewiset
 
The Art Of Performance Tuning
The Art Of Performance TuningThe Art Of Performance Tuning
The Art Of Performance TuningJonathan Ross
 
Computer vision for transportation
Computer vision for transportationComputer vision for transportation
Computer vision for transportationWanjin Yu
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Chris Fregly
 

Similaire à Numerical Linear Algebra for Data and Link Analysis (20)

Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Kaggle boschコンペ振り返り
Kaggle boschコンペ振り返りKaggle boschコンペ振り返り
Kaggle boschコンペ振り返り
 
Facebook Architecture - Breaking it Open
Facebook Architecture - Breaking it OpenFacebook Architecture - Breaking it Open
Facebook Architecture - Breaking it Open
 
OBACS - SCINET ENTERPRISE Data Sheet
OBACS - SCINET ENTERPRISE Data SheetOBACS - SCINET ENTERPRISE Data Sheet
OBACS - SCINET ENTERPRISE Data Sheet
 
Kurukshetra - Big Data
Kurukshetra - Big DataKurukshetra - Big Data
Kurukshetra - Big Data
 
Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...
Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...
Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...
 
Análisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackAnálisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic Stack
 
01-pengantar.pdf
01-pengantar.pdf01-pengantar.pdf
01-pengantar.pdf
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
[SNU Computer Vision Course Project] Image Style Recognition
[SNU Computer Vision Course Project] Image Style Recognition[SNU Computer Vision Course Project] Image Style Recognition
[SNU Computer Vision Course Project] Image Style Recognition
 
Elastic Stack roadmap deep dive
Elastic Stack roadmap deep diveElastic Stack roadmap deep dive
Elastic Stack roadmap deep dive
 
สื่อการเรียนการสอน Cai2
สื่อการเรียนการสอน Cai2สื่อการเรียนการสอน Cai2
สื่อการเรียนการสอน Cai2
 
The Art Of Performance Tuning
The Art Of Performance TuningThe Art Of Performance Tuning
The Art Of Performance Tuning
 
Computer vision for transportation
Computer vision for transportationComputer vision for transportation
Computer vision for transportation
 
Steve Watt Presentation
Steve Watt PresentationSteve Watt Presentation
Steve Watt Presentation
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
 

Plus de Leonid Zhukov

Ecosystem challenges around data use
Ecosystem challenges around data useEcosystem challenges around data use
Ecosystem challenges around data useLeonid Zhukov
 
Social Networks: from Micromotives to Macrobehavior
Social Networks: from Micromotives to MacrobehaviorSocial Networks: from Micromotives to Macrobehavior
Social Networks: from Micromotives to MacrobehaviorLeonid Zhukov
 
Big Data at Ancestry.com
Big Data at Ancestry.comBig Data at Ancestry.com
Big Data at Ancestry.comLeonid Zhukov
 
Russian Big Data Startups
Russian Big Data StartupsRussian Big Data Startups
Russian Big Data StartupsLeonid Zhukov
 
Революция Больших Данных
Революция Больших ДанныхРеволюция Больших Данных
Революция Больших ДанныхLeonid Zhukov
 
Профессия Data Scientist
 Профессия Data Scientist Профессия Data Scientist
Профессия Data ScientistLeonid Zhukov
 
Большие Данные
Большие ДанныеБольшие Данные
Большие ДанныеLeonid Zhukov
 
Information cascades
Information cascadesInformation cascades
Information cascadesLeonid Zhukov
 
Инфорамционные каскады
Инфорамционные каскадыИнфорамционные каскады
Инфорамционные каскадыLeonid Zhukov
 
Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network AnalysisLeonid Zhukov
 
Numerical Linear Algebra for Data and Link Analysis.
Numerical Linear Algebra for Data and Link Analysis.Numerical Linear Algebra for Data and Link Analysis.
Numerical Linear Algebra for Data and Link Analysis.Leonid Zhukov
 

Plus de Leonid Zhukov (13)

Ecosystem challenges around data use
Ecosystem challenges around data useEcosystem challenges around data use
Ecosystem challenges around data use
 
Social Networks: from Micromotives to Macrobehavior
Social Networks: from Micromotives to MacrobehaviorSocial Networks: from Micromotives to Macrobehavior
Social Networks: from Micromotives to Macrobehavior
 
Big Data at Ancestry.com
Big Data at Ancestry.comBig Data at Ancestry.com
Big Data at Ancestry.com
 
Russian Big Data Startups
Russian Big Data StartupsRussian Big Data Startups
Russian Big Data Startups
 
Революция Больших Данных
Революция Больших ДанныхРеволюция Больших Данных
Революция Больших Данных
 
Профессия Data Scientist
 Профессия Data Scientist Профессия Data Scientist
Профессия Data Scientist
 
Большие Данные
Большие ДанныеБольшие Данные
Большие Данные
 
Information cascades
Information cascadesInformation cascades
Information cascades
 
Инфорамционные каскады
Инфорамционные каскадыИнфорамционные каскады
Инфорамционные каскады
 
Social Networks
Social NetworksSocial Networks
Social Networks
 
Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network Analysis
 
Numerical Linear Algebra for Data and Link Analysis.
Numerical Linear Algebra for Data and Link Analysis.Numerical Linear Algebra for Data and Link Analysis.
Numerical Linear Algebra for Data and Link Analysis.
 
Monitorium DLP
Monitorium DLPMonitorium DLP
Monitorium DLP
 

Dernier

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Dernier (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Numerical Linear Algebra for Data and Link Analysis

  • 1. Numerical Linear Algebra for Data and Link Analysis Leonid Zhukov, Yahoo! Inc David Gleich, Stanford
  • 2. Outline •  Intro •  Scale free graphs – Flickr example •  Page rank computing
  • 3. Scientific computing •  Engineering Computing •  Information Retrieval –  continuum problems, PDE –  Discrete data is given, no governed, control over control over resolution discretization –  No associated physical space –  2D or 3D (coordinates) –  Uniform distribution of node –  Not planar, triangulation degrees –  Power-low degree distribution –  “Small world” effect
  • 4. Large graphs •  Explicit: graphs & networks –  Web graph –  Internet –  Yahoo! Photo Sharing (Flickr) –  Yahoo! 360 (Social network) •  Implicit: transaction history, email, messenger –  Yahoo! Search marketing (Overture) –  Yahoo! Mail –  Yahoo Messenger •  Constructed: affinity between data points –  Yahoo! Music (Launch) –  Yahoo! Movies –  Yahoo! ….
  • 6. Flickr: adjacency matrix 580,000 users, 3,500,000 links
  • 7. Flickr: reverse Cuthill-McKee 580,000 users, 3,500,000 links
  • 8. Flickr – a graph
  • 9. Flickr: some stats •  # nodes = 584,207 •  # edges (nnz) = 3,555,115 •  max in-degree = 3531 •  max out-degree = 8976 •  < in-degree > = < out-degree > = 6 •  diameter = 18 •  # strongly connected components = 152,324 •  largest strongly connected components = 274,649 374 186 155 11 •  # connected component = 43,189 •  largest connected components = 404,893 378 112 108 103 •  highest core number = 249 (size 668)
  • 10. Flickr: scale free graph in-degrees out-degrees
  • 11. Data and link analysis •  Web Search –  ranking of pages/hosts/domains –  graph algorithms –  web spam •  Social networks –  finding communities –  trust propagation •  Data mining –  clustering –  classification
  • 12. Data and links analysis •  Graph partitioning: –  “balanced”, for data distribution –  “natural boundaries”, for clustering / communities •  Numerical computing on graphs: –  Node ranking methods (PageRank) –  Dimensionality reduction (SVD, HITS) –  Low dimensional embeddings –  Graph interpolation / regularization
  • 13. What do we compute? •  Probability matrix: •  Graph Laplacian: •  SVD/LSI: Computing eigenvalues / eigenvectors, linear systems
  • 14. Yahoo! Search Marketing (Overture) View Bids Tool
  • 15. Overture data search $ bid advertiser accessory desk .83 Office Max alfred hitchcock .01 Videoflicks.com educational software .13 Buy.com educational software .4 eBAY International AG educational software .17 OfficeMax epson .28 Buy.com epson .4 eBAY International AG fax .13 Buy.com fax .4 eBAY International AG fax .38 OfficeMax game .02 Net Business game .25 Buy.com george harrison .15 eBAY International AG george harrison .05 MusicStack george harrison .01 Videoflicks.com jackson michael .05 MusicStack jackson michael .01 Videoflicks.com
  • 16. Overture bid graph Terms accessory desk educational software Office Max epson Buy.com fax game eBAY International AG george harrison Net Business jackson michael MusicStack alfred hitchcock Videoflicks.com Advertisers
  • 17. Overture bid matrix 1 2 3 4 5 6 accessory desk game 1. Net Business fax 2. Office Max educational software 3. Buy.com epson 4. eBAY george harrison 5. MusicStack 6. Videoflicks.com jackson michael alfred hitchcock
  • 18. 170,077 terms, 25,971 advertisers, 1,307,316 bids
  • 19. Overture bid matrix: 18-core 2000 advertisers , 3000 search terms, 92,347 bids
  • 20. Spectral graph partitioning •  assign each node indicator pi = §1 1 2 3 4 •  partition p = {-1,-1,-1,-1, +1, +1, +1} 5 6 7 8 +1 -1 Relaxation procedure: => Quadratic optimization: Min cut solution : Rounding off:
  • 21. Spectral: ordering Eigenvector Eigenvector – sorted Adjacency matrix Adjacency matrix – re-ordered perm = [1 2 6 7 3 8 4 5]
  • 22. Spectral : additions •  Recursively •  Min cut, ratio cut, normalized cut, quotient cut, … •  Sweep for optimal value •  Flow based refinement •  Optimize partition among several eigenvectors •  Bi-partite graphs •  + other bells & whistles
  • 24. Sample clusters 'A.S.C. Incorporated' 'Atlantic Telecom' 'Berent Associates Center for 'AudioLink' Shyness and Social Therapy' 'Cost Plus Electronics' 'Dr. Puff' 'Headset Express' 'New York Psychotherapy Collective' 'Hello Direct' 'Ruby Shoes' 'PDA Mountain' 'Self-employed psychologist' 'PK TECH INC‘ 'www.kabbalah.com‘ ………………….. …………… 'accessory phone' 'health mental' 'cordless headset' 'help self' 'cordless headset phone' 'improvement self' 'free hands' 'parenting‘ 'headset' 'headset microphone' 'headset phone' 'headset plantronics' 'headset wireless' 'isdn' 'plantronics‘ …………….
  • 26. Websearch Engines •  Crawler + indexer •  Front end # Qs PR 1 0.7 0.4 2 0.8 0.1 Link Analysis 3 0.1 0.5 Text Analysis PageRank Inverted Index
  • 27. PageRank model •  Random walk on graph •  Markov process: memory-less, homogeneous •  Stationary distribution: existence, uniqueness, convergence Perron-Frobenius theorem, irreducible, every state is reachable from every other, and aperiodic – no cycles
  • 28. PageRank •  Construct probability matrix: •  Construct transition row-stochastic matrix for Markov process •  Correct reducibility: •  Find stationary distribution ( exist & unique): c – teleportation coefficient, v – personalization vector, d – dangling nodes (sinks) indicator
  • 29. PageRank linear system •  Eigensystem: •  Linear system:
  • 30. Computational diagram PageRank Problem Eigensystem Linear system Power Jacobi GMRES BiCG BiCGSTAB Iterations Iterations Preconditioners Block Additive Jacobi Jacobi Schwarz PageRank Vector
  • 31. Data: WebGraphs Name # Nodes/size # Links / nnz Memory bs 0.3M 1.6M 20.4 MB edu 2M 14M 176MB yahoo-r2 14M 266M 3.25GB uk 18.5M 300M 3.67GB yahoo-r3 60M 850M 10.4GB db 70M 1B 12.3GB av 1.4B 6.6B 80GB Graph in memory, compressed storage, int, int, double
  • 32. Power law graphs ( y = x - b) ? bs: 0.3M nodes y2: 266 M nodes db: 1B nodes
  • 33. Computational Methods Error Metrics for Jacobi Convergence Error Metrics for BiCG Convergence 0 2 10 10 ||x(k) - x*|| ||x(k) - x*|| -1 (k) 10 ||A x - b|| ||A x(k) - b|| (k) * (k) ||x - x ||/||x || 10 0 ||x(k) - x*||/||x(k)|| -2 10 -3 10 -2 Metric Value Metric Value 10 -4 10 -4 -5 10 10 -6 10 -6 10 -7 10 -8 -8 10 10 0 10 20 30 40 50 60 70 80 0 5 10 15 20 25 30 35 40 45 Iteration Iteration
  • 34. Convergence results 0 uk iteration convergence 0 db iteration convergence 10 10 std std 10 -1 jacobi 10 -1 jacobi gmres gmres -2 bicg -2 bicg 10 bcgs 10 bcgs -3 -3 10 10 Error Error -4 -4 10 10 -5 -5 10 10 -6 -6 10 10 -7 -7 10 10 -8 10 -8 0 10 20 30 40 50 60 70 80 10 0 10 20 30 40 50 60 70 Iteration Iteration uk: 18.5 M pages, 300 M links db: 70 M pages, 1 B links.
  • 35. Convergence results 0 uk time convergence 0 db time convergence 10 10 std std 10 -1 jacobi -1 jacobi 10 gmres gmres -2 bicg -2 bicg 10 bcgs 10 bcgs -3 -3 10 10 Error Error -4 -4 10 10 -5 -5 10 10 -6 10 -6 10 -7 10 10 -7 -8 10 10 -8 0 5 10 15 20 25 30 35 40 0 100 200 300 400 500 600 700 Time (sec) Time (sec) uk: 18.5 M pages, 300 M links db: 70 M pages, 1 B links.
  • 36. Summary Name Size Power Jacobi GMRES BiCG BCGS edu 2M 84 84 21* 44* 21* 20 procs 14M 0.09/7.5s 0.07/6.5s 0.6/13.2s 0.4/17.7s 0.4/8.7s yahoo-r2 14M 71 65 12* 35* 17* 20 procs 266M 1.8/129s 1.9/126s 16/194s 8.6/300s 9.9/168s uk 18.5M 73 71 22* 25* 11* 60 procs 300M 0.09/7s 0.1/10s 0.8/17.6s 0.8/19.4s 1.0/10.8s yahoo-r3 60M 76 75 60 procs 850M 1.6/119s 1.5/112s db 70M 62 58 29 45 15* 60 procs 1B 9.0/557s 8.7/506s 15/432s 15/676s 15/220s av 1.4B 72 26 140 procs 6.6B 4.6/333s 15/391s
  • 37. Reduced teleportation 2 Convergence and Conditioning for db 10 std gmres 0 c = 0.85 10 c = 0.90 c = 0.95 c = 0.99 Error/Residual -2 10 -4 10 -6 10 -8 10 0 20 40 60 80 100 120 140 160 180 200 Iteration db: 70 M pages, 1 B links.
  • 38. YRL research cluster Custom Parallel PageRank PETSc Off the shelf MPI Gigabit Switch RLX Blades Dual 2.8 GHz Xeon 4 GB RAM Gigabit Ethernet 120 Total RLX RLX RLX
  • 39. Typical graph out-degrees. in-degrees. bs: 70 M pages, 1 B links.
  • 41. Graph distribution methods •  Balance rows •  Balance nnz •  Random destribution •  Balance rows and nnz (heuristics) •  2D distributions •  Graph partitioning •  Hypergraph partitioning
  • 43. Domain lexicographic sorting http://host.domain.tld/path ! http://tld.domain.host/path bs-cc: 17 k pages, 133 k links
  • 44. Domain ordering 0 Host Order Improvement 10 std-140 10 -1 bcgs-140 std-140-host -2 bcgs-140-host 10 -3 10 error/residual -4 10 -5 10 -6 10 -7 10 -8 10 0 100 200 300 400 500 600 700 800 900 Time (sec) av: 1.4 B pages, 6.6 B links.
  • 45. Full web graph parallelization Scaling for computing with full-web Performance Increase (Percent decrease in time) 250% std bcgs 200% 150% 100% 50% 0% 90% 100% 110% 120% 130% 140% 150% 160% 170% 140 186 226 av: 1.4 B pages, 6.6 B links.
  • 46. FEM mesh for CFD computations
  • 48. final thoughts •  Power-law graphs are different •  Not all power-law graphs are the same •  But all have central core, chains, singletons. •  Difficult to distribute / parallelize •  Its is experimental data •  Lots of cool applications •  Let’s talk …..
  • 49. References •  Spectral graph partitioning: - M. Fiedler (1973), A. Pothen (1990), H. Simon (1991), B. Mohar (1992), B. Hendrickson (1995), D. Spielman (1996), F. Chang (1996), S. Guattery (1998), R. Kannan (1999), J. Shi (2000), I. Dhillon ( 2001), A. Ng (2001), H. Zha (2001), C. Ding (2001) •  PageRank computing: - S.Brin (1998), L. Page (1998), J. Kleinberg (1999), A. Arasu (2002), T. Haveliwala (2002-03), A. Langville (2002), G. Jeh (2003), S. Kamvar (2003), A. Broder (2004)