S3G2: a Scalable Structure-correlated Social Graph Generator

Minh-Duc Pham, Peter Boncz, Orri Erling

Database Architectures Group
Centrum Wiskunde & Informatica (CWI)




                                         S3G2 . 27-Aug-12. Page 1/23
Data correlations between attributes

SELECT personID FROM person
WHERE firstName = 'Joachim' AND addressCountry = 'Germany'

SELECT personID FROM person
WHERE firstName = 'Cesare' AND addressCountry = 'Italy'

[Slide annotations: “Anti-Correlation”; example persons Joachim Loew and Cesare Prandelli]

▪ Query optimizers may underestimate or overestimate the result size of conjunctive predicates

Correlation between predicates has been studied to some extent in database research (e.g. in the LEO project)

But: correlation-aware query optimization is still hardly mainstream in database products

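The estimation error can be made concrete with a small Python sketch. The rows below are invented for illustration (they are not data from the talk), and the two helper functions are hypothetical names. The sketch compares the true result size of a conjunctive predicate with the estimate obtained by multiplying per-column selectivities, i.e. by assuming the columns are independent.

```python
# Toy person table as (firstName, addressCountry); rows are invented for illustration only.
rows = (
    [("Joachim", "Germany")] * 40
    + [("Cesare", "Italy")] * 40
    + [("Joachim", "Italy")] * 10
    + [("Cesare", "Germany")] * 10
)

def independence_estimate(rows, name, country):
    """Estimated result size, assuming firstName and addressCountry are independent."""
    n = len(rows)
    sel_name = sum(1 for r in rows if r[0] == name) / n
    sel_country = sum(1 for r in rows if r[1] == country) / n
    return sel_name * sel_country * n

def true_count(rows, name, country):
    """Actual number of rows matching both predicates."""
    return sum(1 for r in rows if r == (name, country))

for name, country in [("Joachim", "Germany"), ("Cesare", "Germany")]:
    est = independence_estimate(rows, name, country)
    print(name, country, "estimated:", est, "actual:", true_count(rows, name, country))
```

On this toy table the independence estimate is the same for both pairs, while the actual counts differ: the correlated pair is underestimated and the anti-correlated pair is overestimated.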
Data correlations between attributes

SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE   pa1.author = pa2.author     AND
        jn1.name = 'VLDB Journal'   AND   jn2.name = 'TODS'




Data correlations over joins

SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE    pa1.author = pa2.author     AND
         jn1.name = 'VLDB Journal'   AND   jn2.name = 'Bioinformatics'   -- shown in place of 'TODS'

▪ A challenge for optimizers: adjust the estimated join hit ratio of pa1.author = pa2.author depending on the other predicates

Correlated predicates are still a frontier area in database research


Graph database systems

▪ Emerging class of database systems

▪ Higher need for correlation-awareness
   • graph queries navigate over many steps (= joins)
   • well-known effect in RDF systems (many self-joins)
   • implicit structure of the graph/RDF data model re-appears in queries as correlations (structural correlation)

▪ No existing graph benchmark specifically tests for the effects of correlations
   • Synthetic graphs used for benchmarking do not have structural correlations

→ Need a data generator that produces synthetic graphs with data/structure correlations → S3G2


Next …

▪ what data do we generate?
  • social network, Facebook-like

▪ how to generate correlated properties?
  • with a compact data generator

▪ how to generate correlated structure?
  • multiple correlation dimensions
  • scalable MapReduce algorithm (multi-pass)




S3G2: Generating a Correlated Social Graph

[Figure: example social graph. Node labels: User, Post, Comment, Photo. Edge labels: knows, create, upload, like, InRelationShip, hasName, liveAt, studyAt. Property values shown: “Yamaku”, “Switzerland”, “EPFL”.]

Next …

▪ what data do we generate?
  • social network, Facebook-like

▪ how to generate correlated properties?
  • with a compact data generator

▪ how to generate correlated structure?
  • multiple correlation dimensions
  • scalable MapReduce algorithm (multi-pass)




Generating Correlated Property Values

▪ How do data generators generate values? E.g. FirstName




Generating Property Values

▪ How do data generators generate values? E.g. FirstName

▪ Value Dictionary D()
  • a fixed set of values, e.g.,
    {“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”,“Leon”,“Orri”, ..}

▪ Probability density function F()
  • steers how the generator chooses values
  − cumulative distribution over dictionary entries determines which value to pick

  • could be anything: uniform, binomial, geometric, etc.
  − geometric (discrete exponential) seems to explain many natural phenomena
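The interplay of D() and F() can be sketched in a few lines of Python. This is a minimal illustration, not the S3G2 implementation: the dictionary order, the parameter `p`, and the function names `geometric_rank` and `generate_value` are assumptions made for the example. The dictionary is treated as already ordered by rank, and a geometric distribution truncated to the dictionary size chooses the rank.

```python
import random

# Hypothetical value dictionary D(), listed in rank order (rank 1 first).
D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

def geometric_rank(rng, p, size):
    """Draw a rank in 1..size from a geometric distribution truncated to the dictionary."""
    while True:
        rank = 1
        # Count Bernoulli(p) trials until the first success.
        while rng.random() >= p:
            rank += 1
        if rank <= size:
            return rank  # ranks beyond the dictionary are rejected and redrawn

def generate_value(rng, dictionary, p=0.3):
    """Pick a dictionary value; low ranks are exponentially more likely than high ranks."""
    return dictionary[geometric_rank(rng, p, len(dictionary)) - 1]

rng = random.Random(42)
sample = [generate_value(rng, D) for _ in range(10_000)]
counts = {name: sample.count(name) for name in D}
print(counts)
```

The printed counts fall off roughly geometrically from the first dictionary entry to the last, which is the skew the slide attributes to F().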




Generating Correlated Property Values

▪ How do data generators generate values? E.g. FirstName

▪ Value Dictionary D()

▪ Probability density function F()

▪ Ranking Function R()
  • Gives each value a unique rank between 1 and |D|
  − determines which value gets which probability

  • Depends on some parameters (parameterized function)
  − value frequency distribution becomes correlated by the parameters of R()
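A parameterized ranking function can be sketched as follows. Everything here is an assumption for illustration: the single parameter `country`, the use of a seeded shuffle to derive a permutation, and the names `ranking` and `generate_first_name` are not taken from the talk. The point is only that the same F() yields different value frequencies once the rank-to-value mapping depends on a parameter.

```python
import random
from collections import Counter

D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

def ranking(dictionary, country):
    """Hypothetical R(country): the dictionary reordered so that index 0 holds rank 1.

    The permutation is derived deterministically from the parameter, so the
    same country always yields the same ranking.
    """
    order = list(dictionary)
    random.Random(country).shuffle(order)
    return order

def generate_first_name(rng, country, p=0.5):
    """Draw a rank from a geometric F(), then map it to a value through R(country)."""
    ranked = ranking(D, country)
    rank = 0
    # Ranks beyond the dictionary are clamped to the last rank.
    while rng.random() >= p and rank < len(ranked) - 1:
        rank += 1
    return ranked[rank]

rng = random.Random(7)
for country in ("Germany", "Italy"):
    names = Counter(generate_first_name(rng, country) for _ in range(5000))
    print(country, "rank-1 value:", ranking(D, country)[0],
          "most frequent:", names.most_common(1)[0][0])
```

For each country the most frequent generated name is the one that R() placed at rank 1, so the name distribution is correlated with the country parameter.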




Generating Correlated Property Values

▪ How do data generators generate values? E.g. FirstName

▪ Value Dictionary D()
  {“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”,“Leon”,“Orri”, ..}

▪ Probability density function F()
  geometric distribution

▪ Ranking Function R(gender,country,birthyear)
  • gender, country, birthyear → correlation parameters

How to implement R()?
  We need a table storing |Gender| X |Country| X |BirthYear| X |D| ranks
  (limited #combinations of parameter values, but potentially many entries!)

Our Solution:
- Just store the rank of the top-N values, not all |D|
- Assign the rank of the other dictionary values randomly
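The compact solution above can be sketched as a lookup table that stores only the top-N values per parameter combination. The table contents, the seeding scheme, and the function name `ranking` are illustrative assumptions, not the generator's actual data or code.

```python
import random

D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Hypothetical compact table: only the top-N values per (gender, country, birthyear).
TOP_N = {
    ("male", "Germany", 1990): ["Joachim", "Leon"],
    ("male", "Italy", 1990): ["Cesare", "Andrea"],
}

def ranking(gender, country, birthyear):
    """Return all of D in rank order: stored top-N first, remaining values in random order."""
    key = (gender, country, birthyear)
    top = TOP_N.get(key, [])
    rest = [v for v in D if v not in top]
    # Seed from the key so the random tail is reproducible for a given combination.
    random.Random(repr(key)).shuffle(rest)
    return top + rest

print(ranking("male", "Germany", 1990))
print(ranking("male", "Italy", 1990))
```

Under this scheme the table holds N entries per combination instead of |D|, while the geometric F() still places most of the probability mass on the stored top ranks.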
Compact Correlated Property Value Generation
                 Using geometric distribution for function F()




Correlated Properties used in S3G2

▪ Main source of dictionary values: DBpedia (http://dbpedia.org)

▪ Various realistic property value correlations (→)
   e.g.,
   (person.location, person.gender, person.birthDay) → person.firstName
   person.location → person.lastName
   person.location → person.university
   person.createdDate → person.photoAlbum.createdDate
   ….




Next …

▪ what data do we generate?
  • social network, Facebook-like

▪ how to generate correlated properties?
  • with a compact data generator

▪ how to generate correlated structure?
  • multiple correlation dimensions
  • scalable MapReduce algorithm (multi-pass)




Correlated Edges in a social network

[Figure: example graph of persons P1 to P5 connected by <knows> edges. Property edges shown: <firstname> (“Laura”, “Anna”), <birthYear> (“1990”), <studyAt> (“University of Leipzig”, “University of Amsterdam”), <liveAt> (“Germany”, “Netherlands”), <is> (Student) and <like> (<Britney Spears>).]

How to generate correlated edges?

[Figure: the example graph of persons P1 to P5 with candidate edges marked “?”, and a curve of connection probability falling from highly similar to less similar node pairs.]

 • Compute similarity of two nodes based on their (correlated) properties.
 • Use a probability density function w.r.t. this similarity for connecting nodes.

Multiple correlation dimensions:
 - Studying near each other
 - Liking the same music
 - etc.

Continuously accessing possibly any node for correlated edges → expensive random I/Os for graphs of a size > RAM

Our observation

[Figure: the same example graph of persons P1 to P5, with a “Window” marked around similar nodes and a curve of connection probability falling from highly similar to less similar node pairs.]

Probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distribution)

Trick: disregard nodes with too large similarity distance (only connect nodes in a similarity window)

We can Sort nodes on Correlation Dimension

Similarity metric + Probability function
▪ Similarity metric
    Sort nodes on similarity (similar nodes are brought near each other)

    P1      P5       P3       P2       P4
    Munich  Dresden  Leipzig  Leipzig  Potsdam
    <Ranking along the “Having studied together” dimension>
    we use space-filling curves (e.g. Z-order) to get a linear dimension
▪ Probability function
    Pick edge between two nodes based on their ranked distance
    (often: geometric distribution, again)
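The two ingredients above can be sketched in Python. The grid coordinates assigned to the cities are invented for the example, and `z_order` and `edge_probability` are hypothetical helper names. `z_order` interleaves the bits of a two-dimensional coordinate (a Morton code) so that nodes can be sorted along one linear dimension; `edge_probability` uses a geometric distribution over the distance in that sorted order.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y (Morton code) to map a 2-D point to one dimension."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Hypothetical grid coordinates for the study locations; not real geography.
people = {
    "P1": ("Munich", (3, 1)),
    "P2": ("Leipzig", (4, 5)),
    "P3": ("Leipzig", (4, 5)),
    "P4": ("Potsdam", (5, 6)),
    "P5": ("Dresden", (5, 4)),
}

# Sort on the similarity metric; the person id only breaks ties deterministically.
ranked = sorted(people, key=lambda p: (z_order(*people[p][1]), p))

def edge_probability(i, j, p=0.5):
    """Geometric probability over the ranked distance between two distinct positions."""
    distance = abs(i - j)
    return p * (1 - p) ** (distance - 1)

print(ranked)
print(edge_probability(0, 1), edge_probability(0, 3))
```

With these assumed coordinates the two Leipzig students end up adjacent in the sorted order, and the probability of an edge halves with every additional step of ranked distance.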


Generate edges along correlation dimensions

▪ Sort nodes using MapReduce on similarity metric
▪ Reduce() function keeps a window of nodes to generate edges
  • Keep low memory usage (sliding window approach)

▪ Slide the window for multiple passes; each pass corresponds to one correlation dimension (multiple MapReduce jobs)
   • for each node we choose a degree per pass (also using a prob. function)
   ▪ steers how many edges are picked in the window for that node
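The sliding-window step can be sketched as a single-process Python function. This is a simplified stand-in for what a Reduce() function might do, under stated assumptions: the function name `generate_edges`, the `degree_of` callback, and the choice to connect each node only to earlier nodes in the window are all illustrative, and no MapReduce framework is involved.

```python
import random
from collections import deque

def generate_edges(sorted_nodes, window_size, degree_of, rng, p=0.5):
    """One pass along one correlation dimension over nodes already sorted by similarity.

    Only the last `window_size` nodes are kept in memory; each new node picks
    partners inside that window, preferring nodes at a small rank distance.
    """
    window = deque(maxlen=window_size)  # the oldest node drops out automatically
    edges = set()
    for node in sorted_nodes:
        candidates = list(window)[::-1]  # nearest (most similar) node first
        wanted = min(degree_of(node), len(candidates))
        picked = set()
        while len(picked) < wanted:
            # Geometric choice of distance: position 0 is the closest neighbour.
            pos = 0
            while rng.random() >= p and pos < len(candidates) - 1:
                pos += 1
            picked.add(candidates[pos])
        edges.update((other, node) for other in picked)
        window.append(node)
    return edges

rng = random.Random(1)
nodes = ["P1", "P5", "P3", "P2", "P4"]  # already sorted along one dimension
edges = generate_edges(nodes, window_size=2, degree_of=lambda n: 1, rng=rng)
print(sorted(edges))
```

Memory use is bounded by the window size rather than the graph size. A full generator would run one such pass per correlation dimension, each as its own sort-and-reduce job, and take the union of the resulting edges.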


Correlation dimensions for our Social Graph

▪ Having studied together

▪ Having common interests (hobbies)

▪ Random dimension
   • motivation: not all friendships are explainable (…)

(of course, these two correlation dimensions are still a gross simplification of reality, but they provide some interesting material for benchmark queries)




Evaluation (… see the paper)

▪ Social graph characteristics
   • Output graph has characteristics similar to those observed in real social networks
   (i.e., “small-world network” characteristics)
         - Power-law social degree distribution
         - Low average path length
         - High clustering coefficient

▪ Scalability
   • Generates up to 1.2 TB of data (1.2 million users) in half an hour
         - Runs on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org)
   • Scales out linearly
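One of the characteristics listed above, the clustering coefficient, can be checked on a generated graph with a few lines of Python. This is a generic sketch of the average local clustering coefficient, not the evaluation code used for the paper; the adjacency-dictionary input format and the function name are assumptions.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph {node: set(neighbours)}."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count pairs of neighbours that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(clustering_coefficient(triangle), clustering_coefficient(path))
```

A fully connected triangle scores 1.0 and a simple path scores 0.0; social graphs are expected to sit well above what a random graph of the same density would give.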




Conclusion

▪ We propose a novel framework for a scalable graph generator that can
  • Generate huge graphs with correlations between the graph structure and the graph data
  • Exploit the parallelism offered by the MapReduce paradigm for scalability

▪ Future step: use S3G2 as the base for a novel benchmark in graph query processing (www.w3.org/wiki/Social_Network_Intelligence_BenchMark)




08 Exponential Random Graph Models (ERGM)
 
IRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and Python
 
Ijetcas14 347
Ijetcas14 347Ijetcas14 347
Ijetcas14 347
 
IRJET - Face Recognition in Digital Documents with Live Image
IRJET - Face Recognition in Digital Documents with Live ImageIRJET - Face Recognition in Digital Documents with Live Image
IRJET - Face Recognition in Digital Documents with Live Image
 
Graph Databases: Trends in the Web of Data
Graph Databases: Trends in the Web of DataGraph Databases: Trends in the Web of Data
Graph Databases: Trends in the Web of Data
 
Cross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignmentCross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignment
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
 
Automatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksAutomatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative Networks
 
Quiterian modules and_componentes_eng
Quiterian modules and_componentes_engQuiterian modules and_componentes_eng
Quiterian modules and_componentes_eng
 

S3G2 - a Scalable Structure-correlated Social Graph Generator

  • 1. S3G2: a Scalable Structure-correlated Social Graph Generator. Minh-Duc Pham, Peter Boncz, Orri Erling. Database Architectures Group, Centrum Wiskunde & Informatica (CWI). S3G2 . 27-Aug-12. Page 1/23
  • 2. Data correlations between attributes. SELECT personID FROM person WHERE firstName = 'Joachim' AND addressCountry = 'Germany'; SELECT personID FROM person WHERE firstName = 'Cesare' AND addressCountry = 'Italy'. The examples are motivated by Joachim Loew and Cesare Prandelli; swapping the two first names between the queries gives an anti-correlation. Query optimizers may underestimate or overestimate the result size of conjunctive predicates. Correlation between predicates has been studied to some extent in database research (e.g. in the LEO project). But: correlation-aware query optimization is still hardly mainstream in database products.
  • 3. Data correlations between attributes. SELECT COUNT(*) FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID, paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID WHERE pa1.author = pa2.author AND jn1.name = 'VLDB Journal' AND jn2.name = 'TODS'
  • 4. Data correlations over joins. SELECT COUNT(*) FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID, paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID WHERE pa1.author = pa2.author AND jn1.name = 'VLDB Journal' AND jn2.name = 'TODS' (alternatively: jn2.name = 'Bioinformatics'). A challenge to the optimizers: adjust the estimated join hit ratio of pa1.author = pa2.author depending on other predicates. Correlated predicates are still a frontier area in database research.
  • 5. Graph database systems. Emerging class in database systems. Higher need for correlation-awareness: • graph queries navigate over many steps (= joins) • well-known effect in RDF systems (many self-joins) • implicit structure of the graph/RDF data model re-appears in queries as correlations (structural correlation). No existing graph benchmark specifically tests for the effects of correlations: • synthetic graphs used for benchmarking do not have structural correlations. Need a data generator that generates synthetic graphs with data/structure correlations: S3G2.
  • 6. Next … What data do we generate? • social network, Facebook-like. How to generate correlated properties? • with a compact data generator. How to generate correlated structure? • multiple correlation dimensions • scalable MapReduce algorithm (multi-pass).
  • 7. S3G2: Generating a Correlated Social Graph. [Figure: example social graph of User nodes connected by knows and InRelationShip edges, with property values such as “Switzerland”, “Yamaku” and “EPFL” attached via liveAt, studyAt and hasName, and with users that create, upload and like Posts, Photos and Comments.]
  • 8. Next … What data do we generate? • social network, Facebook-like. How to generate correlated properties? • with a compact data generator. How to generate correlated structure? • multiple correlation dimensions • scalable MapReduce algorithm (multi-pass).
  • 9. Generating Correlated Property Values. How do data generators generate values? E.g. FirstName.
  • 10. Generating Property Values. How do data generators generate values? E.g. FirstName. Value Dictionary D(): • a fixed set of values, e.g., {“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ..}. Probability density function F(): • steers how the generator chooses values: the cumulative distribution over dictionary entries determines which value to pick • could be anything: uniform, binomial, geometric, etc.; geometric (discrete exponential) seems to explain many natural phenomena.
  • 11. Generating Correlated Property Values. How do data generators generate values? E.g. FirstName. Value Dictionary D(). Probability density function F(). Ranking Function R(): • gives each value a unique rank between 1 and |D|, which determines which value gets which probability • depends on some parameters (parameterized function): the value frequency distribution becomes correlated through the parameters of R().
  • 12. Generating Correlated Property Values. How do data generators generate values? E.g. FirstName. Value Dictionary D(): {“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ..}. Probability density function F(): geometric distribution. Ranking Function R(gender, country, birthyear): • gender, country and birthyear are the correlation parameters. How to implement R()? We need a table storing |Gender| × |Country| × |BirthYear| × |D| entries: the number of parameter combinations is limited, but the dictionary values are potentially many! Our solution: just store the rank of the top-N values, not all |D|; assign the rank of the other dictionary values randomly.
  • 13. Compact Correlated Property Value Generation. Using geometric distribution for function F().
  • 14. Correlated Properties used in S3G2. Main source of dictionary values: DBpedia (http://dbpedia.org). Various realistic property value correlations (→), e.g.: (person.location, person.gender, person.birthDay) → person.firstName; person.location → person.lastName; person.location → person.university; person.createdDate → person.photoAlbum.createdDate; ….
  • 15. Next … What data do we generate? • social network, Facebook-like. How to generate correlated properties? • with a compact data generator. How to generate correlated structure? • multiple correlation dimensions • scalable MapReduce algorithm (multi-pass).
  • 16. Correlated Edges in a social network. [Figure: persons P1 to P5 connected by <knows> edges, annotated with properties such as <firstname> “Laura” and “Anna”, <birthYear> “1990”, <studyAt> “University of Leipzig” and “University of Amsterdam”, <liveAt> “Germany” and “Netherlands”, <is> Student, and <like> <Britney Spears>.]
  • 17. How to generate correlated edges? • Compute the similarity of two nodes based on their (correlated) properties. • Use a probability density function w.r.t. this similarity for connecting nodes: the connection probability falls from highly similar to less similar. Multiple correlation dimensions: studying near each other, liking the same music, etc. Continuously accessing possibly any node for correlated edges means expensive random I/Os for graphs of a size > RAM. [Figure: the social graph of slide 16 with candidate edges marked “?”.]
  • 18. Our observation. The probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distribution). Trick: disregard nodes with too large a similarity distance (only connect nodes in a similarity window). [Figure: the social graph of slide 16 with a Window drawn around similar nodes; the connection probability falls from highly similar to less similar.]
  • 19. We can sort nodes on a correlation dimension. Correlation dimension = similarity metric + probability function. Similarity metric: sort nodes on similarity (similar nodes are brought near each other), e.g. ranking along the “having studied together” dimension: P1 (Munich), P5 (Dresden), P3 (Leipzig), P2 (Leipzig), P4 (Potsdam); we use space-filling curves (e.g. Z-order) to get a linear dimension. Probability function: pick an edge between two nodes based on their ranked distance (often: geometric distribution, again).
  • 20. Generate edges along correlation dimensions. Sort nodes using MapReduce on the similarity metric. The Reduce() function keeps a window of nodes to generate edges: • keeps memory usage low (sliding window approach). Slide the window for multiple passes, each pass corresponding to one correlation dimension (multiple MapReduce jobs): • for each node we choose a degree per pass (also using a probability function), which steers how many edges are picked in the window for that node.
  • 21. Correlation dimensions for our Social Graph. Having studied together. Having common interests (hobbies). Random dimension: • motivation: not all friendships are explainable (…). (Of course, these two correlation dimensions are still a gross simplification of reality, but this provides some interesting material for benchmark queries.)
  • 22. Evaluation (… see the paper). Social graph characteristics: • the output graph has characteristics similar to those observed in real social networks (i.e., “small-world network” characteristics): power-law social degree distribution, low average path length, high clustering coefficient. Scalability: • generates up to 1.2 TB of data (1.2 million users) in half an hour, running on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org) • scales out linearly.
  • 23. Conclusion. We propose a novel framework for a scalable graph generator that can: • generate huge graphs with correlations between the graph structure and the graph data • exploit the parallelism offered by the MapReduce paradigm for scalability. Future step: use S3G2 as the base for a novel benchmark in graph query processing (www.w3.org/wiki/Social_Network_Intelligence_BenchMark).
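
The compact property value generation of slides 10 to 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the S3G2 implementation: the names `DICTIONARY`, `TOP_N`, `ranking`, `geometric_rank` and `generate_first_name` are hypothetical, the stored top-N table is invented and keyed on country only (the slides also use gender and birth year), and p = 0.5 is an arbitrary choice.

```python
import random

# Dictionary D(): the example first names from slide 10.
DICTIONARY = ["Andrea", "Anna", "Cesare", "Camilla",
              "Duc", "Joachim", "Leon", "Orri"]

# Hypothetical stored top-N ranks (N = 2) per parameter combination.
# Only these ranks are stored; everything else is derived on the fly.
TOP_N = {
    "Germany": ["Joachim", "Leon"],
    "Italy": ["Cesare", "Andrea"],
}


def ranking(country):
    """R(): return all dictionary values ordered by rank for `country`."""
    top = TOP_N.get(country, [])
    rest = [value for value in DICTIONARY if value not in top]
    # Ranks below the top-N are not stored: they are assigned pseudo-randomly,
    # seeded by the parameter so that the assignment is reproducible.
    random.Random(country).shuffle(rest)
    return top + rest


def geometric_rank(rng, size, p=0.5):
    """F(): draw a 0-based rank, geometrically distributed, clamped to size - 1."""
    rank = 0
    while rank < size - 1 and rng.random() >= p:
        rank += 1
    return rank


def generate_first_name(country, rng, p=0.5):
    """Pick a first name whose frequency depends on the country."""
    ranked = ranking(country)
    return ranked[geometric_rank(rng, len(ranked), p)]
```

Calling `generate_first_name("Germany", random.Random(1))` repeatedly should return “Joachim” far more often than “Cesare”, and the reverse for “Italy”, while only two ranks per country are stored. The values outside the top-N are rarely drawn, which is why a random rank assignment for them costs little plausibility.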

Editor's Notes

  1. As you can see, data in real life is correlated; for example, people living in Germany have a different distribution of names than people in Italy. In database systems, data correlations strongly influence query processing performance: they affect the intermediate result sizes of query plans and the effectiveness of indexing techniques, and they cause the absence or presence of locality in data access patterns. As an example, let's look at the influence of data correlation on the intermediate results of a selection. Here we have two queries: one looks for people with first name Joachim in Germany, and the other looks for people with first name Cesare in Italy. As Joachim is more popular in Germany than in other countries, and similarly Cesare in Italy, these queries return a large number of results. These example queries are actually motivated by the names of the coaches of the national football teams of Germany and Italy, Joachim Loew and Cesare Prandelli. What if we change the predicates in these queries, e.g., looking for people named Cesare in Germany? As there is an anti-correlation between these names and the countries, such queries will return a very small number of results. Since query optimizers commonly use the independence assumption for estimating the result size of conjunctive predicates, they may underestimate or overestimate the result size. In relational databases, correlations between predicates on the same table have been studied to some degree. However, employing techniques for detecting correlations is still hardly mainstream in database products.
  2. So far we have talked about correlations between attributes of the same table.
  3. Now consider data correlations between predicates separated by joins. Here, we consider a DBLP example that looks for all authors who have publications in both VLDB Journal and TODS. This query is likely to have a larger result size than a query that substitutes Bioinformatics for TODS, even though Bioinformatics is a much larger publication than TODS. The reason is that database researchers are less likely to do cross-disciplinary work. In this query plan, the query optimizer should be aware of the correlation between predicates on different tables across joins. And this correlation of course influences the best join order. As far as we know, currently no system handles this well. To summarize, correlated predicates are still a frontier area in database research.
  4. The requirement for recognizing correlations in query processing is even higher in graph database systems, an emerging class of database systems with many recent start-up companies. Consider the most popular graph model, the RDF graph model. In an RDF workload, there are many self-joins over a big table of RDF triples. The selection on one property is joined with the selection on another property in the same big table, and there can be more than 20 joins. Thus, the join hit ratio is heavily influenced by the correlations. There are also implicit correlations between the structure of the graph and the data in the graph, which strongly influence the performance of systems and algorithms. However, existing graph benchmarks do not specifically test for the effect of these correlations, because the synthetic graphs they generate do not have structural correlations. Therefore, in our work, we propose a framework for generating a huge, highly connected graph with data/structure correlations.
  5. Now we talk about the data that we generate to demonstrate our framework: a social network graph that simulates the logical schema of the most popular social network, Facebook.
  6. This social network data generator is actually part of our current work on a graph benchmark; however, the benchmark is outside the scope of this talk. As a real social network is a huge graph with many structural and data correlations, it is a very good test case for the performance of graph database systems. In addition, we would like to note that social network data is very precious and interesting. For instance, marketing companies usually try to obtain or crawl a subset of social network data for their analysis. Therefore, social networks seem to be one typical market for graph database systems.
  7. Next we talk about how correlated properties can be generated with a compact data generator.
  8. However, before talking about generating correlated property values,
  9. we talk about how a generator generates non-correlated property values such as first names. To do that, the data generator needs two ingredients. The first is a value dictionary, which contains a fixed set of values; here we have a set of first names. The second is a probability density function, which picks a value from the dictionary according to some distribution. The density function could be a uniform, binomial or geometric distribution.
  10. Back to our question: how to generate correlated property values? To do this, our data generator uses a third ingredient, a ranking function. This function introduces the correlation by taking correlation parameters. It maps each dictionary value to a unique rank; however, given different parameters, it does so in different ways.
  11. Specifically, take the example of generating correlated property values for first names. We use the geometric distribution for the probability density function, which is appropriate for many natural phenomena. We use the parameters gender, country and birth year for the ranking function, since the distribution of first names is influenced by these parameters. However, a question is how we should implement the ranking function. Normally, we would need a table that stores the Cartesian product of all parameters and dictionary values. The number of combinations of the parameters is usually limited; however, there are potentially many dictionary values, which requires too big a table. We don't want our generator to depend on huge data files. Thus, we propose a simple solution: store the ranks of only the top-N values from the dictionary, and assign the ranks of the other dictionary values randomly. The underlying reason is that the values ranked lower than N have a very small probability of being selected; thus, randomly assigning their ranks only slightly decreases the plausibility of the generated values.
  12. This figure shows the distribution of name popularity in Germany (above) and Italy (below). The x-axis is the rank of the dictionary value produced by the ranking function, and the y-axis is the popularity. We store the top-10 ranks, shown in green in the table; the other ranks are produced randomly, and we do not store any of them, so the generator stays compact. The figure also shows that certain names are popular in Germany but not in Italy, and vice versa. Thus, we have location-correlated first names here.
  13. For our social network, the main source of our dictionary values is DBpedia. We generate the property values with various correlations; for example, the last name and the university where people study are correlated with the location. Detailed correlations can be found in the paper.
  14. Finally, we talk about how to generate the structure using something called correlation dimensions, and how to make the generating algorithm scalable with the MapReduce paradigm.
  15. Here, the edges in the social graph are the friendships. The generation of a friendship between two people is usually correlated with their properties. For example, people who study at the same university have a high probability of being friends, and people are likely to be connected with others who have the same hobby.
  16. How are the correlated friendship edges generated? Let us formalize what I have said, that people who study together have a high probability of being friends. For connecting two nodes, we compute the similarity of the two nodes based on their correlated properties, and then use a density function that gives a high probability to two nodes at a small similarity distance, and a low probability at a large similarity distance. We call the combination of the similarity metric and the probability function a correlation dimension, and there are multiple correlation dimensions. However, if you use a Monte Carlo approach and start comparing all nodes, using the probability function to decide whether they should be connected or not, you get a random access pattern. For a large graph, this causes expensive random I/Os, so it is not feasible for generating a huge graph.
  17. To address this problem, we need something smarter. We observe that the probability that two nodes are connected is skewed with regard to the similarity between the nodes, and the connection probability is very small between two nodes that are less similar. Thus, we suggest a trick: disregard nodes with too large a similarity distance, and only consider generating connections for nodes in a similarity window.
  18. To do that with a correlation dimension, we first sort nodes according to the similarity metric so that similar nodes are brought near each other. Here is an example of sorting people according to the similarity metric “having studied together”. We would like to note that for similarity metrics that are multi-dimensional, we use space-filling curves to map them to a linear dimension so that the values can be sorted. The probability function is used for selecting edges according to the distance between the ranks of two nodes along the similarity metric.
  19. We implement this using the MapReduce paradigm. Each MapReduce pass sorts the nodes on one similarity metric. The reducer keeps a window of sorted nodes to generate edges, so that we do not need to keep all nodes in memory. Since there can be many correlation dimensions, the window slides for multiple passes, one pass per correlation dimension. This means that we have to run multiple MapReduce jobs. Here, we also have a degree for each pass, which specifies how many edges we should generate for a node in that pass.
  20. In our social graph generator, we consider three correlation dimensions: having studied together, having common interests or hobbies, and a random dimension. The reason we use a random dimension is that not all friendships are explainable: connectivity does not only follow the two correlation dimensions above. Random noise occurs in practice, and this random dimension can make the data distribution more realistic. And of course, considering two correlations is still a simplification of reality, but we believe that this can provide interesting material for graph benchmark queries.
  21. The evaluation of the generated social graph shows that our graph has all the important characteristics observed in real social networks. The experiments also show that our generating algorithm is scalable: it can generate 1.2 TB of data in about half an hour using a cluster of 16 nodes.
  22. To conclude, we have proposed a novel framework for a scalable graph generator that can generate huge graphs having structure and data correlations. We have exploited the MapReduce paradigm to implement a scalable generator. As a future step, we will use this data generator for a novel benchmark in graph query processing that we call the Social Network Intelligence Benchmark.
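
The windowed edge generation described in notes 16 to 19 can be illustrated with a small single-machine sketch in Python. It rests on several assumptions and is not the S3G2 code: an in-memory `sorted` call stands in for the MapReduce sort, the function names `generate_edges` and `generate_graph` and all parameter defaults are hypothetical, and the connection probability is assumed to decay as `p ** distance` with the rank distance inside the window.

```python
import random


def generate_edges(nodes, similarity_key, window=3, degree=2, p=0.5, seed=0):
    """One pass: sort on a single correlation dimension, then slide a window."""
    rng = random.Random(seed)
    # In-memory sort; a stand-in for the MapReduce sort on the similarity metric.
    ordered = sorted(nodes, key=similarity_key)
    edges = set()
    for i, node in enumerate(ordered):
        picked = 0
        # Candidates are only the next `window` nodes in sorted order.
        candidates = ordered[i + 1 : i + 1 + window]
        for distance, other in enumerate(candidates, start=1):
            if picked >= degree:
                break  # degree cap on edges started from this node in this pass
            # Assumed decay: probability p at distance 1, p**2 at distance 2, ...
            if rng.random() < p ** distance:
                edges.add((node, other))
                picked += 1
    return edges


def generate_graph(nodes, dimensions, **kwargs):
    """Multiple passes, one per correlation dimension; edges are unioned."""
    edges = set()
    for pass_no, key in enumerate(dimensions):
        one_pass = generate_edges(nodes, key, seed=pass_no, **kwargs)
        # Friendships are undirected, so store each pair as a frozenset.
        edges |= {frozenset(pair) for pair in one_pass}
    return edges
```

Each key function in `dimensions` plays the role of one similarity metric, for example a hypothetical `lambda person: person.university_rank` for “having studied together”. Because a node only looks at the next `window` positions, a pass needs memory proportional to the window size rather than to the graph, which is the point of the trick in note 17.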