Searching Data with Substance and Style
Amélie Marian
Rutgers University

http://www.cs.rutgers.edu/~amelie




Semi-structured Data Processing
• Large amount of data online and in personal
  devices
 ▫ Structure (style)
 ▫ Text content (substance)
 ▫ Different sources (soul)

 ▫ Finding the data we need can be difficult



Semi-structured Data Processing at Rutgers
SPIDR Lab
• Personal Information Search
  ▫ Semi-structured data
  ▫ Need for high-quality search tools
• Structuring of User Web Posts
  ▫ Large amount of user-generated data untapped
  ▫ Text has inherent structure
  ▫ Use of text to guide search and analyze data
• Data Corroboration
  ▫ Conflicting sources of data
  ▫ Need to identify true facts




Joint work with:

Wei Wang
Christopher Peery
Thu Nguyen
                Computer Science, Rutgers University



Personal Information Search
• Web: search for relevant documents
• Personal data: search for specific documents

Information that can be used for personal information search
• Content (keywords)
• Metadata (file size, modification time, etc.)
• Structure
   ▫ Directory (external)
   ▫ File structure (internal): XML, LaTeX tags, Picture tags, etc.
   ▫ Partially known


PIMS Project Description
Publications: EDBT’08, ICDE’08 (demo), DEB’09, EDBT’11, TKDE (accepted)

• Data and query models that unify content and structure

• Scoring framework to rank unified search results

• Query processing algorithms and index structures to
  score and rank answers efficiently

• Evaluation of the quality and efficiency of the unified
  scoring

                                  NSF CAREER Award July 2009-2014


Separate Structure and Content
Target file: Halloween party pictures taken at home where someone
  wears a witch costume

[Figure: the query split at the file boundary into two separate conditions]
   Directory: //Home
   Keywords: Halloween, witch



Current Search Tools
Current search tools (e.g., web, desktop, GDS) mostly rely on
 ranking and filtering.
  ▫ Ranking: content keywords
  ▫ Filtering: additional conditions (e.g., metadata, structure)
     Find a jpg file saved in directory /Desktop/Pictures/Home
             that contains the words “Halloween witch”

This approach is often insufficient.
  ▫ Filtering forces a binary decision. Gif files and files under
    directory /Archive/Pictures/Home are not returned.
  ▫ Structure and content are strictly separated. Files under
    directory /Pictures/Halloween are not returned.






Unified Approach
Goal: Unify structure and content
 ▫ Develop a unified view of directory and file structure
 ▫ Allow for a single query to contain both structure and
   content components and to be answered at once
 ▫ Return results even if queries are incomplete or contain
   mistakes

Approach:
 ▫   Define a unified data model by ignoring file boundaries
 ▫   Define a unified query model
 ▫   Define relaxations to approximate unified queries
 ▫   Define relevance score for unified queries


Unified Structure and Content
Target file: Halloween party pictures taken at home where someone
  wears a witch costume

Unified query: //Home[.//“Halloween” and .//“witch”]

[Figure: unified data tree ignoring the file boundary:
 root → Home → “Halloween”, “witch”]




From Query to Answers
User query → Relaxation (builds a DAG of relaxed queries)
           → Matching (produces matches / answers)
           → Scoring and ranking (TA algorithm)
           → Ranked answers returned to the user






Query Relaxations
Target: IMG_1391.gif
• Edge Generalization ── missing terms
  ▫ /Desktop/Home → /Desktop//Home
• Path Extension ── only remember prefix
  ▫ /Desktop/Pictures → /Desktop/Pictures//*
• Node Generalization ── misremember structure/content
  ▫ //Home//Halloween → //Home//{Halloween}
• Node Inversion ── misremember order
  ▫ /Desktop//Home//{Halloween} → /Desktop//(Home//{Halloween})
• Node Deletion ── extraneous terms
  ▫ /Desktop/Backup/Pictures//Home → /Desktop//Pictures//Home
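
A minimal sketch of the first two relaxations as string rewrites (illustrative
only; the actual system relaxes a query tree, and these helper names are
assumptions, not the system's code):

    def edge_generalizations(query):
        # /Desktop/Home -> /Desktop//Home : relax one parent/child step
        # to ancestor/descendant.
        steps = query.split("/")            # "" marks leading or doubled slashes
        for i in range(2, len(steps)):
            if steps[i] and steps[i - 1]:   # a single "/" between two labels
                yield "/".join(steps[:i]) + "//" + "/".join(steps[i:])

    def path_extension(query):
        # /Desktop/Pictures -> /Desktop/Pictures//* : also match anything below.
        return query + "//*"

    print(list(edge_generalizations("/Desktop/Home")))   # ['/Desktop//Home']
    print(path_extension("/Desktop/Pictures"))           # '/Desktop/Pictures//*'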





DAG Representation
(p – Pictures, h – Home)

[Figure: relaxation DAG rooted at the exact match /p/h, relaxing through
 //p/h, /p//h, /(p/h), /p/h//*, //p//h, //p/h//*, //(p/h), //n, //p//* and
 //h//*, down to //* (match all)]

IDF score
 ▫ Function of how many files match the query
 ▫ The DAG stores IDF scoring information




Query Evaluation
• Top-k query processing
  ▫ Branch-and-bound approach
• Lazy evaluation of the relaxed DAG structure
  ▫ DAG is query dependent and has to be generated at runtime
  ▫ We developed two algorithms to speed up query evaluation
      DAGJump allows skipping unnecessary parts of the DAG (sorted
       accesses)
      RandomDAG zooms in on the relevant part of the DAG (random
       accesses)
• Matching of answers using dedicated data structures
     We extended PathStack (Bruno et al. ICDE’02) to support
      permutations (NIPathstack)
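
A minimal top-k branch-and-bound skeleton in the spirit of this pipeline,
assuming each relaxed query carries an IDF upper bound and that relaxation only
lowers scores; `match` stands in for the DAG/PathStack machinery and is an
assumption, not the actual DAGJump or RandomDAG code:

    import heapq

    def top_k(relaxed_queries, match, k):
        # relaxed_queries: list of (idf_upper_bound, query) pairs.
        # match(query): iterable of (file, score) answers for one DAG node.
        top = []                                    # min-heap of best k (score, file)
        for bound, query in sorted(relaxed_queries, reverse=True):
            # Branch-and-bound cut: no remaining relaxation can beat the
            # current k-th best score, so stop expanding the DAG.
            if len(top) == k and bound <= top[0][0]:
                break
            for f, score in match(query):           # lazy evaluation of this node
                if len(top) < k:
                    heapq.heappush(top, (score, f))
                elif score > top[0][0]:
                    heapq.heapreplace(top, (score, f))
        return sorted(top, reverse=True)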




Traditional Content TF∙IDF Scoring
• Consider files as “bag of terms”
• TF (Term Frequency)
  ▫ A file that mentions a query term more often is more relevant
  ▫ TF could be normalized by file length

• IDF (Inverse Document Frequency)
  ▫ Terms that appear in too many files have little differentiation
    power in determining relevance
• TF∙IDF Scoring
  ▫ Aggregate TF and IDF scores across all query terms


             score(q, d) = \sum_{t \in q} tf_{t,d} \cdot idf_t
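
For concreteness, a bag-of-terms TF·IDF scorer along these lines (a standard
textbook formulation; real systems differ in normalization details):

    import math
    from collections import Counter

    def tfidf_score(query_terms, doc, corpus):
        # doc and each corpus entry are lists of terms ("bags of terms").
        tf = Counter(doc)
        score = 0.0
        for t in set(query_terms):
            df = sum(1 for d in corpus if t in d)        # document frequency
            if df:
                # TF normalized by file length; IDF dampens common terms.
                score += (tf[t] / len(doc)) * math.log(len(corpus) / df)
        return score

    docs = [["halloween", "witch", "party"], ["witch"], ["party", "home"]]
    print(tfidf_score(["halloween", "witch"], docs[0], docs))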




Unified IDF Score
 For a unified data tree T, a path query PQ, and a file F, we define:
 • IDF Score

                score_{idf}(PQ) = \frac{\log(N / |matches(T, PQ)|)}{\log N}

   where N is the total number of files, and matches(T, PQ) is the set of files
   that match PQ in T.
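   For example, with N = 100 files of which 10 match PQ, score_{idf}(PQ) =
   \log(100/10) / \log 100 = 0.5; a query matched by a single file scores 1,
   and one matched by every file scores 0.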







TF Score
Path query: //a//{b}

[Figure: example file F under directory /a, with structure nodes c, b, d and
 content “b e f b f”, next to a plot of f(x) for x ∈ [0, 1]]

• match_struct = 1, nodes_struct = 4 → normalized structure TF = 0.25
• match_content = 2, nodes_content = 5 → normalized content TF = 0.4
• TF score = ∑ f(x) = f(0.25) + f(0.4)

   f(x) = \log(1 + x^{1/n}), n ∈ {2, 3, ...}; n affects the relative impact of
   TF on unified scores
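
A small numeric check of the TF computation above, using the reconstructed f
(the exact dampening function may differ in the published version):

    import math

    def f(x, n=2):
        # Dampening: f(x) = log(1 + x**(1/n)); larger n reduces the
        # relative impact of TF on the unified score.
        return math.log(1 + x ** (1.0 / n))

    # Example from the slide: //a//{b} against file F.
    tf_struct = 1 / 4       # match_struct / nodes_struct   = 0.25
    tf_content = 2 / 5      # match_content / nodes_content = 0.4
    print(f(tf_struct) + f(tf_content))   # TF score = f(0.25) + f(0.4)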



Unified Score
Aggregate IDF and TF scores across all relaxed queries:

  Relaxed query         idf    tf     tf·idf
  /a/b (exact match)    1.0    0.15   0.15
  //a/b                 0.8    0.25   0.2
  /a//b                 0.8    0.1    0.08
  ...

  Unified score = 0.15 + 0.2 + 0.08 + ... = 0.875





Experimental Setup
• Platform
  PC with a 64-bit hyper-threaded 2.8GHz Intel Xeon
   processor, 2GB memory, a 10K RPM 70GB SCSI disk,
   Linux 2.6.16 kernel, Sun Java 1.5.0 JVM.
• Data Set
  ▫ Files and directories from the environment of a
    graduate student (15 GB)
  ▫ 95,172 files (document 59%, email 34%) in 7,788
    directories. Average directory depth is 6.3 with the
    longest being 12.
  ▫ 57M nodes in the unified data tree, with 49M (86%)
    leaf content nodes






Relevance Comparison
• Use Lucene as a comparison basis
• Content-only
 Use the standard Lucene content indexing and
  search
• Content:Dir
 Create two Lucene indexes: content terms, and
  terms from the directory pathnames (treated as a
  small file)
• Content+Dir
  Augment content index with directory path terms





Case Study
  ▫ Search for a witch costume picture taken at home on Halloween
     Target: IMG_1391.gif (tagged with “witch” and “Halloween”)

Query Type   Query Condition                              Comment                        Rank
U            //home[.//”witch” and .//”halloween”]        Accurate condition             1
U            //halloween/witch/”home”                     Structure / content switched   1
C            {witch, halloween}                           Accurate condition             20
C:D          {witch, halloween} : {home}                  Accurate condition             1
C:D          {witch, home} : {halloween}                  Structure / content switched   245-252



CDFs (Impact of Inaccuracies)

[Figure: CDFs of the target file’s rank (percentage of queries vs. rank, 1 to
 100) for U, C:D and C+D, under four inaccuracy settings: 50% error with 1
 swap, 100% error with 1 swap, 50% error with 2 swaps, 100% error with 2 swaps]



Query Processing Performance

[Figure: CDF of query processing time (0 to 10 sec) for U and C:D]






Personal Information Search
Contributions
• A multi-dimensional search framework that supports
  fuzzy query conditions
• Scoring techniques for fuzzy query conditions against a
  unified view of structure and content
    Improves search accuracy over content-based methods by leveraging
     both structure and content information as well as relationships between
     the terms
     Shows improvements over existing techniques (GDS, TopX)

• Efficient index structures and optimizations to process
  multi-dimensional and unified queries efficiently
    Significantly reduced the overall query processing time

• Future work directions:
       User studies, Twig matching, Result granularity, Context
Joint work with:

Gayatree Ganu
          Computer Science, Rutgers University
Noémie Elhadad
          Biomedical Informatics, Columbia University
                     User Review Structure Analysis Project – URSA
    Patient Emotion and stRucture SEarch USer interface - PERSEUS


URSA: User Review Structure Analysis
Project Description (WebDB’09)

  • Aim:
    Better understanding of user reviews
    Better search and access to user reviews
  • Tasks:
    Structure Identification and Analysis
    Text and Structure Search
    Similarity Search in Social Networks

                                                 Google Research Award – April 2008

Online Reviewing Systems: Citysearch

Data in reviews:
•  Structured metadata
•  Textual review body
    Sentiment information
    Information on product-specific features

Users are inconvenienced because:
• Large number of reviews available
• Hard to find relevant reviews
• Vague or undefined information needs



Data Description
• Restaurant reviews extracted from
  Citysearch, New York
  (http://newyork.citysearch.com)
• The corpus contains:
 ▫ 5531 restaurants
   - associated structured information (name, location, cuisine type)
   - a set of reviews
 ▫ 52264 reviews, of which 1359 are editorial reviews
   - structured information (star rating, username, date)
   - unstructured text (title, body, pros, cons)
 ▫ 32284 distinct users
   - Distinct username information
• Dataset accessible at
 http://www.research.rutgers.edu/~gganu/datasets/



Structure Identification
• Classification of review sentences with topic
  and sentiment information
  ▫ Sentence topics: Food, Price, Service, Ambience, Anecdotes, Miscellaneous
  ▫ Sentence sentiment: Positive, Negative, Neutral, Conflict


Text-Based Recommendation System: Evaluation Setting

• For evaluation, we separated three non-overlapping
  test sets of about 260 reviews:
  ▫ Test A and B: users who have reviewed at least two
    restaurants (so that the training set has at least one
    review)
  ▫ Test C: users with at least 5 reviews
• To measure prediction accuracy we use the
  Root Mean Square Error (RMSE)
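  For reference, over N (predicted, actual) rating pairs,
  \mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N} (\hat{r}_i - r_i)^2};
  lower values mean more accurate predictions.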




Text-Based Recommendation System:
Steps

• Text-derived rating score
 ▫ Regression-based rating
• Goals
 1. Predicting the metadata star rating
 2. Predicting the text-derived score
   • Only predicts the score, not the content of the reviews
   • Lower standard deviations: lower RMSE
• Prediction Strategies
 ▫ Average-based prediction
 ▫ Personalized prediction





Regression-based Text Rating
• Use text of reviews to generate a rating
• Different categories and sentiment should have different
  importance in the rating

Method
• We use multivariate quadratic regression
• Each normalized sentence type, i.e., a (category, sentiment) pair, is
  a variable in the regression
• Dependent variable is metadata star-rating

• Used training sets to learn the weights for each sentence
  type; weights are used in computing text-based score
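
A minimal sketch of this regression step with NumPy least squares; the feature
layout (one column per normalized (category, sentiment) proportion) follows
the slide, but the synthetic data and dimensions are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    # One row per review; one column per (category, sentiment) sentence type,
    # holding the normalized proportion of such sentences in the review.
    X = rng.random((500, 24))
    X /= X.sum(axis=1, keepdims=True)
    stars = rng.integers(1, 6, 500)          # metadata star ratings (1-5)

    # Quadratic design matrix: constant + first-order + second-order terms.
    design = np.hstack([np.ones((len(X), 1)), X, X ** 2])
    weights, *_ = np.linalg.lstsq(design, stars, rcond=None)

    def text_rating(x):
        # Text-derived rating for one review's sentence-type vector.
        return float(np.concatenate(([1.0], x, x ** 2)) @ weights)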


Regression-based Text Rating
  • Regression constant: 3.68
  • Regression weights (first-order variables):

   Regression Weights    Positive   Negative   Neutral   Conflict
   Food                    2.62       -2.65     -0.08     -0.69
   Price                   0.39       -2.12     -1.27      0.93
   Service                 0.85       -4.25     -1.83      0.36
   Ambience                0.75       -0.27      0.16      0.21
   Anecdotes               0.95       -1.75      0.06     -0.19
   Miscellaneous           1.30       -2.62     -0.30      0.36

   Food, and negative Price and Service, appear to be most important.


  • Regression weights (second-order variables):
   Regression Weights    Positive             Negative               Neutral     Conflict
   Food                   -1.99                  2.05                    -0.14    0.67
   Price                  -0.27                  2.04                    2.17     -1.01
   Service                -0.52                  3.15                    1.76     0.34
   Ambience               -0.44                  0.81                    -0.28    -0.61
   Anecdotes              -0.40                  2.03                    -0.03    -0.20
   Miscellaneous          -0.65                  2.38                    0.5      -0.10


Regression-Based Text Rating: Baseline Case

 Restaurant Average-based Prediction
 • Prediction using the average rating given to a restaurant by all users
   (we also tried user-average and combined)
 • RMSE errors:

Predicting Star Ratings              TEST A    TEST B    TEST C
Using Star Rating                     1.127     1.267     1.126
Using Sentiment-based text rating     1.126     1.224     1.046

Predicting Sentiment Text Rating     TEST A    TEST B    TEST C
Using Star Rating                     0.703     0.718     0.758
Using Sentiment-based text rating     0.545     0.557     0.514

   Predicting using text does better than the popularly used star rating.



Clustering-based strategies for
recommendations
• KNN based on a clustering over star ratings
  ▫   Little improvement over baseline
  ▫   Does not take into account the textual information
  ▫   Sparse data
  ▫   Cold start problem
  ▫   Hard clustering not appropriate
• Soft clustering
  ▫ Partitions objects into clusters
  ▫ Each user has a membership probability for each
    cluster


Information Bottleneck Method
• Foundations in Rate Distortion Theory
• Allows choosing a tradeoff between
 ▫ Compression (number of clusters T)
 ▫ Quality, estimated through the average distortion
   between cluster points and the cluster centroid (β
   parameter)
• Shown to work well with sparse datasets


                                    N. Slonim, SIGIR 2002



Leveraging text content for
personalized predictions
• Use the sentence types (categories, sentiments)
  within the reviews as features
• Users clustered based on the type of information
  in their reviews
• Predictions are made using membership
  probabilities of clusters to find neighbors






  Example: Clustering using the iIB algorithm

Star ratings (“???” marks the rating to predict):

                   Restaurant1          Restaurant2       Restaurant3
   User1                4                    -                    -
   User2                2                    5                    4
   User3                4                   ???                   3
   User4                5                    2                    -
   User5                -                    -                    1

Input matrix to the iIB algorithm (before normalization): sentence-type
features per user and restaurant.

                       Restaurant1                                    Restaurant2                              Restaurant3
        Food         Food        Price      Price      Food       Food       Price      Price      Food       Food       Price      Price
        Positive     Negative    Positive   Negative   Positive   Negative   Positive   Negative   Positive   Negative   Positive   Negative

User1      0.6         0.2         0.2            -        -           -         -         -           -         -          -          -
User2      0.3         0.6         0.1            -      0.9           -       0.1         -         0.6        0.1        0.2        0.1
User3      0.7          0.1       0.15        0.05         -           -         -         -         0.2        0.8         -          -
User4      0.9         0.05       0.05            -      0.3          0.4      0.2        0.1          -         -          -          -
User5       -            -          -             -        -           -         -         -           -        0.7        0.3         -


     Example: Soft-clustering Prediction

User ratings (star or text; “*” marks the rating to predict):

            Restaurant1   Restaurant2   Restaurant3
User1             4              -            -
User2             2              5            4
User3             4              *            3
User4             5              2            -
User5             -              -            1

Cluster membership probabilities:

            Cluster1    Cluster2    Cluster3
User1          0.040       0.057      0.903
User2          0.396       0.202      0.402
User3          0.380       0.502      0.118
User4          0.576       0.015      0.409
User5          0.006       0.990      0.004

• For each cluster we compute the cluster contribution for the test
  restaurant: the weighted average of the ratings given to the restaurant.
     Contribution(c2, r2) = 4.793, Contribution(c3, r2) = 3.487
• We compute the final prediction from the cluster contributions for the
  test restaurant, weighted by the test user’s membership probabilities:
     Prediction(User3, Restaurant2) = 4.042
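
A minimal sketch reproducing the prediction above from the two tables (the
array layout is an illustration of the method, not the project's code):

    import numpy as np

    # Ratings (np.nan = no rating); rows User1..User5, cols Restaurant1..3.
    R = np.array([[4, np.nan, np.nan],
                  [2, 5,      4     ],
                  [4, np.nan, 3     ],   # User3's Restaurant2 rating is the target
                  [5, 2,      np.nan],
                  [np.nan, np.nan, 1]])

    # Cluster membership probabilities; rows users, cols Cluster1..3.
    M = np.array([[0.040, 0.057, 0.903],
                  [0.396, 0.202, 0.402],
                  [0.380, 0.502, 0.118],
                  [0.576, 0.015, 0.409],
                  [0.006, 0.990, 0.004]])

    def predict(user, restaurant):
        rated = ~np.isnan(R[:, restaurant])          # users who rated it
        # Cluster contribution: membership-weighted average of their ratings.
        contrib = ((M[rated] * R[rated, restaurant][:, None]).sum(axis=0)
                   / M[rated].sum(axis=0))
        w = M[user]                                  # test user's memberships
        return w @ contrib / w.sum()

    print(round(predict(2, 1), 3))                   # User3, Restaurant2 -> 4.042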

iIB Algorithm
  • Experimented with different values of β and T; used
    β=20, T=100.
     RMSE errors and percentage improvement over the baseline:

Predicting Star Ratings              TEST A          TEST B          TEST C
Using Star Rating                   1.103 (2.13%)   1.242 (1.74%)   1.106 (1.78%)
Using Sentiment-based text rating   1.113 (1.15%)   1.211 (1.06%)   1.046 (0%)

Predicting Sentiment Text Rating     TEST A          TEST B          TEST C
Using Star Rating                   0.692 (1.56%)   0.704 (1.95%)   0.742 (2.11%)
Using Sentiment-based text rating   0.544 (0.18%)   0.549 (1.44%)   0.514 (0%)

  • Using text features for clustering always improves the traditional
    goal of predicting star ratings
  • Even small improvements in RMSE are useful (Netflix,
    precision in top-k)




URSA: Qualitative Predictions
• Predict sentiment towards each topic
• Cluster users along each dimension separately
• Use threshold to classify sentiment (actual and
  predicted)
[Figure: prediction accuracy for positive Ambience as a function of the
 actual-sentiment threshold θ_act (A-0 through A-1), shown in accuracy bands
 from 0–20% up to 80–100%]



PERSEUS Project Description
Patient Emotion and StRucture SEarch USer
 Interface
 ▫ Large amount of patient-produced data
   • Difficult to search and understand
   • Patients need help finding information
   • Health professionals could learn from the data
 ▫ Analyze and search patient forums, mailing lists, and blogs
   • Topical information
   • Specific Language
   • Time sensitive
   • Emotionally charged
                                        Google Research Award – April 2010
                                        NSF CDI Type I – October 2010-2013



PERSEUS Project Description
▫ Automatically add structure to free-text
  • Use of context information
    • “hair loss”: side effect or symptom?
  • Approximate structure
▫ Use structure to guide search
  • Need for high recall, but good precision
  • Find users with similar experiences
  • Various results granularities
    • Thread vs. sentence
  • Context dependent
  • Needs to take approximation into account







Structuring and Searching Web Content
Contributions
• Leveraged automatically generated structure to improve
  predictions
  ▫ Around 2% RMSE improvements
  ▫ Used inferred structure to group users using soft clustering
    techniques
• Qualitative predictions
  ▫ High Accuracy
• Future directions
  ▫   Extension to healthcare domains
  ▫   Use of inferred structure to guide search
  ▫   Use user clusters in search
  ▫   Adapt to various result granularities
  ▫   Take classification inaccuracies into account





Joint work with:
Minji Wu           Computer Science, Rutgers University

Collaborators:
Serge Abiteboul, Alban Galland                           INRIA
Pierre Senellart                                         Telecom ParisTech
Magda Procopiuc, Divesh Srivastava                       AT&T Research Labs
Laure Berti-Equille                                      IRD





Motivations
• Information on web sources is unreliable
  ▫   Erroneous
  ▫   Misleading
  ▫   Biased
  ▫   Outdated
• Users need to check web sites to confirm the
  information
  ▫ Data corroboration




Example: What is the gas mileage of my Honda Civic?

Query: “honda civic 2007 gas mileage” on MSN Search
• Is the top hit, the honda.com site, unbiased?
• Is the autoweb.com web site trustworthy?
• Are all these values referring to the correct
  year of the model?

Users may check several web sites to get an answer.






Example: Identifying good business
listings
• NYC restaurant information from 6 sources
 ▫   Yellowpages
 ▫   Menupages
 ▫   Yelp
 ▫   Foursquare
 ▫   OpenTable
 ▫   Mechanical Turk (check streetview)

                   Which listings are correct?


Data Corroboration Project Description
Publications: WebDB’07, WSDM’10, IS’11, DEB’11

      Trustworthy sources report true facts
      True facts come from trustworthy sources
• Sources have different
  ▫ Coverage
  ▫ Domain
  ▫ Dependencies
  ▫ Overlap
• Goal: conflict resolution with maximum coverage

                  Microsoft Live Labs Search Award – May 2006



Top-k Join: Project Description
Publications: CleanDB’06, PVLDB’10

 Integrate and aggregate information from several sources

[Figure: sources linked by scored associations]
   (“minji”, “vldb10”, 0.2)
   (“minji”, “amélie”, 1.0)
   (“amélie”, “vldb10”, 0.5)
   (“amélie”, “SIN”, 0.3)
   (“minji”, “SIN”, 0.1)
   (“SIN”, “vldb10”, 0.9)
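
A toy sketch of joining these scored associations; combining scores
multiplicatively along a join path is an illustrative assumption, not the
project's aggregation function:

    from itertools import permutations

    edges = {("minji", "vldb10"): 0.2, ("minji", "amélie"): 1.0,
             ("amélie", "vldb10"): 0.5, ("amélie", "SIN"): 0.3,
             ("minji", "SIN"): 0.1, ("SIN", "vldb10"): 0.9}

    def score(a, b):
        # Associations are treated as undirected.
        return edges.get((a, b)) or edges.get((b, a))

    def path_scores(src, dst):
        # Enumerate simple join paths src -> ... -> dst; multiply edge
        # scores along each path (assumption, for illustration).
        others = {x for e in edges for x in e} - {src, dst}
        for r in range(len(others) + 1):
            for mid in permutations(others, r):
                path = (src, *mid, dst)
                if all(score(u, v) for u, v in zip(path, path[1:])):
                    s = 1.0
                    for u, v in zip(path, path[1:]):
                        s *= score(u, v)
                    yield path, s

    # Three strongest join paths connecting "minji" to "vldb10".
    for path, s in sorted(path_scores("minji", "vldb10"), key=lambda t: -t[1])[:3]:
        print(" -> ".join(path), round(s, 3))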







Data Corroboration
Contributions
• Probabilistic model for corroboration
  ▫   Fact uncertainty
  ▫   Source trustworthiness
  ▫   Source coverage
  ▫   Conflict between sources
• Fixpoint techniques to compute truth values of facts and
  source quality estimates
• Top-k query algorithms for computing corroborated answers
• Open Issues:
  ▫   Functional dependencies
  ▫   Time
  ▫   Social network
  ▫   Uncertain data
  ▫   Source dependence
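
A minimal sketch of the fixpoint idea: estimate fact truth from source trust
and source trust from fact truth until the values stabilize (a simple voting
variant for illustration; the papers use more refined probabilistic
estimators):

    # Each source claims facts to be true or false.
    claims = {"s1": {"f1": True,  "f2": True},
              "s2": {"f1": True,  "f2": False},
              "s3": {"f1": False, "f2": False}}

    facts = {f for votes in claims.values() for f in votes}
    trust = {s: 0.5 for s in claims}            # initial trust in each source

    for _ in range(20):                         # iterate toward a fixpoint
        # Truth of a fact: trust-weighted vote of the sources claiming it.
        truth = {f: sum(trust[s] for s, v in claims.items() if v.get(f)) /
                    sum(trust[s] for s, v in claims.items() if f in v)
                 for f in facts}
        # Trust of a source: average agreement of its claims with the truth.
        trust = {s: sum(truth[f] if val else 1 - truth[f]
                        for f, val in v.items()) / len(v)
                 for s, v in claims.items()}

    print(truth, trust)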





Conclusions
• New Challenges in web data management
 ▫ Semi-structured data
    PIMS
    User reviews
 ▫ Multiple sources of data
    Conflicting information
     Low-quality data providers (Web 2.0)
• SPIDR lab at Rutgers focuses on helping users
  identify useful data in the wealth of information
  available
                     Amélie Marian - Rutgers University
53




Amélie Marian - Rutgers University

Contenu connexe

En vedette

A strong voice through unity
A strong voice through unityA strong voice through unity
A strong voice through unitysmespire
 
Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2
Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2
Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2t575ae
 
Cсправочник по продукции LR
Cсправочник по продукции LRCсправочник по продукции LR
Cсправочник по продукции LRt575ae
 
Moz agric extension-contact-farmers_zambezi-valley
Moz agric extension-contact-farmers_zambezi-valleyMoz agric extension-contact-farmers_zambezi-valley
Moz agric extension-contact-farmers_zambezi-valleyIFPRI-Maputo
 
Hale esempio di mapping di dati istat
Hale esempio di mapping di dati istatHale esempio di mapping di dati istat
Hale esempio di mapping di dati istatsmespire
 
позакласна діяльність
позакласна діяльністьпозакласна діяльність
позакласна діяльністьHalyna Kasyan
 
Panel Discussion – Grooming Data Scientists for Today and for Tomorrow
Panel Discussion – Grooming Data Scientists for Today and for TomorrowPanel Discussion – Grooming Data Scientists for Today and for Tomorrow
Panel Discussion – Grooming Data Scientists for Today and for TomorrowHPCC Systems
 
Block 4 task 4.6 part 3 (1) ttss
Block 4  task 4.6 part 3 (1) ttssBlock 4  task 4.6 part 3 (1) ttss
Block 4 task 4.6 part 3 (1) ttssTony Perez
 
Communicating science
Communicating scienceCommunicating science
Communicating sciencetonivanuzzo
 
здоровое питание
здоровое питаниездоровое питание
здоровое питаниеMarta Butkevic
 
La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!
La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!
La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!Fantuz
 
Enjoy Upto 50% Discounts on all computer training courses
Enjoy Upto 50% Discounts on all computer training coursesEnjoy Upto 50% Discounts on all computer training courses
Enjoy Upto 50% Discounts on all computer training coursesCMS Computer
 
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...Eyal Doron
 

En vedette (20)

A strong voice through unity
A strong voice through unityA strong voice through unity
A strong voice through unity
 
Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2
Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2
Каталог продукции LR HEALTH&BEAUTY SYSTEMS 2014_2
 
Cсправочник по продукции LR
Cсправочник по продукции LRCсправочник по продукции LR
Cсправочник по продукции LR
 
Taller final de ingles
Taller final de inglesTaller final de ingles
Taller final de ingles
 
Coca Cola
Coca ColaCoca Cola
Coca Cola
 
Moz agric extension-contact-farmers_zambezi-valley
Moz agric extension-contact-farmers_zambezi-valleyMoz agric extension-contact-farmers_zambezi-valley
Moz agric extension-contact-farmers_zambezi-valley
 
Hale esempio di mapping di dati istat
Hale esempio di mapping di dati istatHale esempio di mapping di dati istat
Hale esempio di mapping di dati istat
 
Ruin of Rebellion
Ruin of RebellionRuin of Rebellion
Ruin of Rebellion
 
позакласна діяльність
позакласна діяльністьпозакласна діяльність
позакласна діяльність
 
Panel Discussion – Grooming Data Scientists for Today and for Tomorrow
Panel Discussion – Grooming Data Scientists for Today and for TomorrowPanel Discussion – Grooming Data Scientists for Today and for Tomorrow
Panel Discussion – Grooming Data Scientists for Today and for Tomorrow
 
Block 4 task 4.6 part 3 (1) ttss
Block 4  task 4.6 part 3 (1) ttssBlock 4  task 4.6 part 3 (1) ttss
Block 4 task 4.6 part 3 (1) ttss
 
Communicating science
Communicating scienceCommunicating science
Communicating science
 
Il condizionale
Il condizionaleIl condizionale
Il condizionale
 
1
11
1
 
здоровое питание
здоровое питаниездоровое питание
здоровое питание
 
Poem for palestin
Poem for palestinPoem for palestin
Poem for palestin
 
La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!
La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!
La testa di stampa GraphJet Zanasi copie i suoi primi 10 anni!
 
Enjoy Upto 50% Discounts on all computer training courses
Enjoy Upto 50% Discounts on all computer training coursesEnjoy Upto 50% Discounts on all computer training courses
Enjoy Upto 50% Discounts on all computer training courses
 
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20...
 
Taurus y bovina
Taurus y bovinaTaurus y bovina
Taurus y bovina
 

Similaire à Searching data with substance and style

Remembrance of data past
Remembrance of data pastRemembrance of data past
Remembrance of data pastAmélie Marian
 
Provenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingProvenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingUniversity of Arizona
 
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014aceas13tern
 
It's 2015. Do You Know Where Your Data Are?
It's 2015. Do You Know Where Your Data Are?It's 2015. Do You Know Where Your Data Are?
It's 2015. Do You Know Where Your Data Are?Patricia Hswe
 
Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Riccardo Albertoni
 
Research Data Management: What is it and why is the Library & Archives Servic...
Research Data Management: What is it and why is the Library & Archives Servic...Research Data Management: What is it and why is the Library & Archives Servic...
Research Data Management: What is it and why is the Library & Archives Servic...GarethKnight
 
Introduction to Data Management Powerpoint
Introduction to Data Management PowerpointIntroduction to Data Management Powerpoint
Introduction to Data Management Powerpointichanismo
 
Silicon valley nosql meetup april 2012
Silicon valley nosql meetup  april 2012Silicon valley nosql meetup  april 2012
Silicon valley nosql meetup april 2012InfiniteGraph
 
2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smith2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smithVince Smith
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides DuraSpace
 
On demand access to Big Data through Semantic Technologies
 On demand access to Big Data through Semantic Technologies On demand access to Big Data through Semantic Technologies
On demand access to Big Data through Semantic TechnologiesPeter Haase
 
Preventing data loss
Preventing data lossPreventing data loss
Preventing data lossIUPUI
 
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseTowards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseHilmar Lapp
 
Automated Target Definition Using Existing Evidence and Outlier Analysis
Automated Target Definition Using Existing Evidence and Outlier AnalysisAutomated Target Definition Using Existing Evidence and Outlier Analysis
Automated Target Definition Using Existing Evidence and Outlier AnalysisGeorge Ang
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic WebRoberto García
 
RDAP13 Mark Leggott: Stewarding research data using the Islandora framework
RDAP13 Mark Leggott: Stewarding research data using the Islandora frameworkRDAP13 Mark Leggott: Stewarding research data using the Islandora framework
RDAP13 Mark Leggott: Stewarding research data using the Islandora frameworkASIS&T
 
DataUp: Data Curation for Excel
DataUp: Data Curation for Excel DataUp: Data Curation for Excel
DataUp: Data Curation for Excel Carly Strasser
 

Similaire à Searching data with substance and style (20)

Remembrance of data past
Remembrance of data pastRemembrance of data past
Remembrance of data past
 
Provenance Management to Enable Data Sharing
Provenance Management to Enable Data SharingProvenance Management to Enable Data Sharing
Provenance Management to Enable Data Sharing
 
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014
 
Data managementbasics issr_20130301
Data managementbasics issr_20130301Data managementbasics issr_20130301
Data managementbasics issr_20130301
 
It's 2015. Do You Know Where Your Data Are?
It's 2015. Do You Know Where Your Data Are?It's 2015. Do You Know Where Your Data Are?
It's 2015. Do You Know Where Your Data Are?
 
Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...
 
Research Data Management: What is it and why is the Library & Archives Servic...
Research Data Management: What is it and why is the Library & Archives Servic...Research Data Management: What is it and why is the Library & Archives Servic...
Research Data Management: What is it and why is the Library & Archives Servic...
 
B01DataMgt.ppt
B01DataMgt.pptB01DataMgt.ppt
B01DataMgt.ppt
 
Introduction to Data Management Powerpoint
Introduction to Data Management PowerpointIntroduction to Data Management Powerpoint
Introduction to Data Management Powerpoint
 
Silicon valley nosql meetup april 2012
Silicon valley nosql meetup  april 2012Silicon valley nosql meetup  april 2012
Silicon valley nosql meetup april 2012
 
2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smith2013 02 data portal science group update -v smith
2013 02 data portal science group update -v smith
 
Ldl2012
Ldl2012Ldl2012
Ldl2012
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides
 
On demand access to Big Data through Semantic Technologies
 On demand access to Big Data through Semantic Technologies On demand access to Big Data through Semantic Technologies
On demand access to Big Data through Semantic Technologies
 
Preventing data loss
Preventing data lossPreventing data loss
Preventing data loss
 
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic DatabaseTowards a Simple, Standards-Compliant, and Generic Phylogenetic Database
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database
 
Automated Target Definition Using Existing Evidence and Outlier Analysis
Automated Target Definition Using Existing Evidence and Outlier AnalysisAutomated Target Definition Using Existing Evidence and Outlier Analysis
Automated Target Definition Using Existing Evidence and Outlier Analysis
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
RDAP13 Mark Leggott: Stewarding research data using the Islandora framework
RDAP13 Mark Leggott: Stewarding research data using the Islandora frameworkRDAP13 Mark Leggott: Stewarding research data using the Islandora framework
RDAP13 Mark Leggott: Stewarding research data using the Islandora framework
 
DataUp: Data Curation for Excel
DataUp: Data Curation for Excel DataUp: Data Curation for Excel
DataUp: Data Curation for Excel
 

Plus de Amélie Marian

Integration and Exploration of Connected Personal Digital Traces
Integration and Exploration of Connected Personal Digital TracesIntegration and Exploration of Connected Personal Digital Traces
Integration and Exploration of Connected Personal Digital TracesAmélie Marian
 
Miettes de données - Keynote BDA 2015
Miettes de données - Keynote BDA 2015Miettes de données - Keynote BDA 2015
Miettes de données - Keynote BDA 2015Amélie Marian
 
Personal Information Management Systems - EDBT/ICDT'15 Tutorial
Personal Information Management Systems - EDBT/ICDT'15 TutorialPersonal Information Management Systems - EDBT/ICDT'15 Tutorial
Personal Information Management Systems - EDBT/ICDT'15 TutorialAmélie Marian
 
Personal Information Search and Discovery
Personal Information Search and DiscoveryPersonal Information Search and Discovery
Personal Information Search and DiscoveryAmélie Marian
 
Personalizing Forum Search using Multidimensional Random Walks
Personalizing Forum Search using Multidimensional Random WalksPersonalizing Forum Search using Multidimensional Random Walks
Personalizing Forum Search using Multidimensional Random WalksAmélie Marian
 
Corroborating Facts from Affirmative Statements
Corroborating Facts from Affirmative StatementsCorroborating Facts from Affirmative Statements
Corroborating Facts from Affirmative StatementsAmélie Marian
 

Plus de Amélie Marian (7)

Integration and Exploration of Connected Personal Digital Traces
Integration and Exploration of Connected Personal Digital TracesIntegration and Exploration of Connected Personal Digital Traces
Integration and Exploration of Connected Personal Digital Traces
 
Miettes de données - Keynote BDA 2015
Miettes de données - Keynote BDA 2015Miettes de données - Keynote BDA 2015
Miettes de données - Keynote BDA 2015
 
Personal Information Management Systems - EDBT/ICDT'15 Tutorial
Personal Information Management Systems - EDBT/ICDT'15 TutorialPersonal Information Management Systems - EDBT/ICDT'15 Tutorial
Personal Information Management Systems - EDBT/ICDT'15 Tutorial
 
Personal Information Search and Discovery
Personal Information Search and DiscoveryPersonal Information Search and Discovery
Personal Information Search and Discovery
 
Personalizing Forum Search using Multidimensional Random Walks
Personalizing Forum Search using Multidimensional Random WalksPersonalizing Forum Search using Multidimensional Random Walks
Personalizing Forum Search using Multidimensional Random Walks
 
Corroborating Facts from Affirmative Statements
Corroborating Facts from Affirmative StatementsCorroborating Facts from Affirmative Statements
Corroborating Facts from Affirmative Statements
 
Searching Web Forums
Searching Web ForumsSearching Web Forums
Searching Web Forums
 

Searching data with substance and style

10

Unified Structure and Content
Target file: Halloween party pictures taken at home where someone
wears a witch costume

  //Home[.//“Halloween” and .//“witch”]

[Figure: unified data tree rooted at root, with the directory node Home
and the content nodes “Halloween” and “witch”; the file boundary is
ignored]
11

From Query to Answers
[Figure: processing pipeline. The user's query is relaxed into a DAG of
relaxed queries; matching produces matches, which are scored and ranked
(TA algorithm) into the answers returned to the user]
12

Query Relaxations
Target: IMG_1391.gif
• Edge Generalization ── missing terms
  ▫ /Desktop/Home → /Desktop//Home
• Path Extension ── only remember prefix
  ▫ /Desktop/Pictures → /Desktop/Pictures//*
• Node Generalization ── misremember structure/content
  ▫ //Home//Halloween → //Home//{Halloween}
• Node Inversion ── misremember order
  ▫ /Desktop//Home//{Halloween} → /Desktop//(Home//{Halloween})
• Node Deletion ── extraneous terms
  ▫ /Desktop/Backup/Pictures//Home → /Desktop//Pictures//Home
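Purely as an illustration (the system compiles relaxations into a scored DAG rather than rewriting strings), a minimal Python sketch of three of these relaxations as rewrites on path queries; parse/unparse and the function names are ad hoc:

```python
import re

# Parse a path query such as "/Desktop/Backup/Pictures//Home" into
# (edge, node) steps, where the edge is "/" (child) or "//" (descendant).
def parse(query):
    return re.findall(r"(//|/)([^/]+)", query)

def unparse(steps):
    return "".join(edge + node for edge, node in steps)

def edge_generalizations(query):
    """Edge Generalization: relax one child edge '/' into '//'."""
    steps = parse(query)
    for i, (edge, node) in enumerate(steps):
        if edge == "/":
            yield unparse(steps[:i] + [("//", node)] + steps[i + 1:])

def path_extension(query):
    """Path Extension: only the prefix is remembered; allow any suffix."""
    return query + "//*"

def node_deletions(query):
    """Node Deletion: drop one extraneous node, reconnecting with '//'."""
    steps = parse(query)
    for i in range(len(steps)):
        rest = steps[:i] + steps[i + 1:]
        if i < len(rest):                  # a step followed the deleted one
            rest[i] = ("//", rest[i][1])
        yield unparse(rest)

print(list(edge_generalizations("/Desktop/Home")))
# ['//Desktop/Home', '/Desktop//Home']
print(path_extension("/Desktop/Pictures"))
# /Desktop/Pictures//*
print(list(node_deletions("/Desktop/Backup/Pictures//Home")))
# includes '/Desktop//Pictures//Home', as on the slide
```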
13

DAG Representation
• IDF score
  ▫ Function of how many files match the query
  ▫ The DAG stores IDF scoring information

[Figure: relaxation DAG for /p/h (p = Pictures, h = Home), from the
exact match /p/h through relaxations such as //p/h, /p//h, /(p/h),
//p//h, /p/h//*, //p/h//*, //(p/h), //p//*, //h//*, down to //*
(match all)]
14

Query Evaluation
• Top-k query processing
  ▫ Branch-and-bound approach
• Lazy evaluation of the relaxed DAG structure
  ▫ The DAG is query dependent and has to be generated at runtime
  ▫ We developed two algorithms to speed up query evaluation
    - DAGJump allows skipping unnecessary parts of the DAG (sorted accesses)
    - RandomDAG allows zooming in on the relevant part of the DAG (random accesses)
• Matching of answers using dedicated data structures
  ▫ We extended PathStack (Bruno et al., ICDE’02) to support permutations
    (NIPathstack)
15

Traditional Content TF·IDF Scoring
• Consider files as “bags of terms”
• TF (Term Frequency)
  ▫ A file that mentions a query term more often is more relevant
  ▫ TF can be normalized by file length
• IDF (Inverse Document Frequency)
  ▫ Terms that appear in too many files have little differentiation power
    in determining relevance
• TF·IDF scoring
  ▫ Aggregate TF and IDF scores across all query terms:

    score(q, d) = Σ_{t ∈ q} tf_{t,d} · idf_t
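A toy sketch of this standard scheme, using raw term counts for TF and a log-based IDF (the system's exact normalizations differ):

```python
import math
from collections import Counter

# Toy TF·IDF scorer over "files" treated as bags of terms.
files = [
    ["halloween", "witch", "party"],
    ["witch", "costume"],
    ["budget", "report"],
]

def idf(term):
    n_matching = sum(term in f for f in files)
    return math.log(len(files) / n_matching) if n_matching else 0.0

def score(query, file_terms):
    tf = Counter(file_terms)
    return sum(tf[t] * idf(t) for t in query)

for f in files:
    print(f, round(score(["halloween", "witch"], f), 3))
```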
16

Unified IDF Score
For a unified data tree T, a path query PQ, and a file F, we define:
• IDF score

    score_idf(PQ) = log(N / |matches(T, PQ)|) / log N

  where N is the total number of files and matches(T, PQ) is the set of
  files that match PQ in T.
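As a quick numeric illustration (the 95,172-file count comes from the experimental setup later in the deck; the match count of 20 is hypothetical):

```python
import math

# Unified IDF from the slide: score_idf(PQ) = log(N / |matches|) / log N.
def score_idf(n_files, n_matches):
    return math.log(n_files / n_matches) / math.log(n_files)

print(round(score_idf(95172, 20), 3))   # ~0.739: rarer queries score near 1
```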
17

TF Score
Path query: //a//{b}, evaluated against file F
• Structure: match_struct = 1 of nodes_struct = 4, normalized to 0.25
• Content: match_content = 2 of nodes_content = 5 (“b e f b f”),
  normalized to 0.4
• TF Score = Σ f(x) = f(0.25) + f(0.4)
• f is a dampening function, e.g. f(x) = log(1 + x) or f(x) = x^(1/n),
  n = 2, 3, …; the choice of f affects the relative impact of TF on
  unified scores

[Figure: unified tree for file F with structure nodes a, b, c, d and
content “b e f b f”; plot of f(x) over x ∈ [0, 1]]
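A two-line numeric check of the slide's example, assuming f(x) = log(1 + x):

```python
import math

# //a//{b} matches 1 of 4 structure nodes and 2 of 5 content terms.
f = lambda x: math.log(1 + x)
print(round(f(1 / 4) + f(2 / 5), 3))   # ~0.56
```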
18

Unified Score
Aggregate IDF and TF scores across all relaxed queries:

  Query            idf   tf     tf·idf
  /a/b (exact)     1.0   0.15   0.15
  //a/b            0.8   0.25   0.20
  /a//b            0.8   0.10   0.08
  …                …     …      …
                         Sum:   0.875 (unified score)
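The arithmetic for the three rows shown, as a sketch (the slide's elided relaxations account for the rest of the 0.875 total):

```python
# Unified-score aggregation: sum tf*idf over all relaxed queries.
relaxed = [              # (query, idf, tf), values from the slide
    ("/a/b", 1.0, 0.15),
    ("//a/b", 0.8, 0.25),
    ("/a//b", 0.8, 0.10),
]
print(sum(idf * tf for _, idf, tf in relaxed))   # 0.43 (partial sum)
```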
19

Experimental Setup
• Platform
  ▫ PC with a 64-bit hyper-threaded 2.8GHz Intel Xeon processor, 2GB memory,
    a 10K RPM 70GB SCSI disk, Linux 2.6.16 kernel, Sun Java 1.5.0 JVM
• Data set
  ▫ Files and directories from the environment of a graduate student (15 GB)
  ▫ 95,172 files (documents 59%, email 34%) in 7,788 directories; average
    directory depth is 6.3, with the longest being 12
  ▫ 57M nodes in the unified data tree, with 49M (86%) leaf content nodes
20

Relevance Comparison
• Use Lucene as a comparison basis
• Content-only: use the standard Lucene content indexing and search
• Content:Dir: create two Lucene indexes, one for content terms and one for
  terms from the directory pathnames (treated as a small file)
• Content+Dir: augment the content index with directory path terms
21

Case Study
Search for a witch costume picture taken at home on Halloween
Target: IMG_1391.gif (tagged with “witch” and “Halloween”)

  Query Type  Query Condition                          Comment                      Rank
  U           //home[.//”witch” and .//”halloween”]    Accurate condition           1
  U           //halloween/witch/”home”                 Structure/content switched   1
  C           {witch, halloween}                       Accurate condition           20
  C:D         {witch, halloween} : {home}              Accurate condition           1
  C:D         {witch, home} : {halloween}              Structure/content switched   245–252
22

CDFs (Impact of Inaccuracies)
[Figure: four CDF plots (panels: 50% error/1 swap, 100% error/1 swap,
50% error/2 swaps, 100% error/2 swaps) of rank (x-axis, ticks at 1, 10,
100) vs. percentage of queries (y-axis, 0–100%) for U, C:D, and C+D]
23

Query Processing Performance
[Figure: CDF of query processing time (0–10 sec) for U and C:D]
24

Personal Information Search Contributions
• A multi-dimensional search framework that supports fuzzy query conditions
• Scoring techniques for fuzzy query conditions against a unified view of
  structure and content
  ▫ Improves search accuracy over content-based methods by leveraging both
    structure and content information, as well as relationships between the terms
  ▫ Shows improvements over existing techniques (GDS, TopX)
• Efficient index structures and optimizations to process multi-dimensional
  and unified queries
  ▫ Significantly reduced the overall query processing time
• Future work directions: user studies, twig matching, result granularity,
  context
25

Joint work with:
Gayatree Ganu (Computer Science, Rutgers University)
Noémie Elhadad (Biomedical Informatics, Columbia University)

User Review Structure Analysis Project – URSA
Patient Emotion and stRucture SEarch USer interface – PERSEUS
26

URSA: User Review Structure Analysis Project Description (WebDB’09)
• Aim:
  ▫ Better understanding of user reviews
  ▫ Better search and access of user reviews
• Tasks:
  ▫ Structure identification and analysis
  ▫ Text and structure search
  ▫ Similarity search in social networks

Google Research Award – April 2008
27

Online Reviewing Systems: Citysearch
Data in reviews:
• Structured metadata
• Textual review body
  ▫ Sentiment information
  ▫ Information on product-specific features

Users are inconvenienced because:
• A large number of reviews is available
• Relevant reviews are hard to find
• Information needs are vague or undefined
28

Data Description
• Restaurant reviews extracted from Citysearch New York
  (http://newyork.citysearch.com)
• The corpus contains:
  ▫ 5,531 restaurants
    - Associated structured information (name, location, cuisine type)
    - A set of reviews
  ▫ 52,264 reviews, of which 1,359 are editorial reviews
    - Structured information (star rating, username, date)
    - Unstructured text (title, body, pros, cons)
  ▫ 32,284 distinct users
    - Distinct username information
• Dataset accessible at http://www.research.rutgers.edu/~gganu/datasets/
29

Structure Identification
• Classification of review sentences with topic and sentiment information
  ▫ Sentence topics: Food, Price, Service, Ambience, Anecdotes, Miscellaneous
  ▫ Sentence sentiment: Positive, Negative, Neutral, Conflict
30

Text-Based Recommendation System: Evaluation Setting
• For evaluation, we separated three non-overlapping test sets of about 260
  reviews:
  ▫ Test A and Test B: users who have reviewed at least two restaurants
    (so that the training set has at least one review)
  ▫ Test C: users with at least 5 reviews
• For measuring prediction accuracy we use the Root Mean Square Error
  (RMSE); a short sketch follows
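For concreteness, the standard RMSE definition as a small sketch (not tied to the project's code):

```python
import math

# Root Mean Square Error between predicted and actual ratings.
def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

print(round(rmse([4.0, 3.5, 2.0], [5, 3, 2]), 2))   # ~0.65
```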
31

Text-Based Recommendation System: Steps
• Text-derived rating score
  ▫ Regression-based rating
• Goals:
  1. Predicting the metadata star rating
  2. Predicting the text-derived score
     - Only predicts the score, not the content of the reviews
     - Lower standard deviations: lower RMSE
• Prediction strategies
  ▫ Average-based prediction
  ▫ Personalized prediction
32

Regression-Based Text Rating
• Use the text of reviews to generate a rating
• Different categories and sentiments should have different importance
  in the rating

Method (a sketch follows this list):
• We use multivariate quadratic regression
• Each normalized sentence type [(category, sentiment)] is a variable in
  the regression
• The dependent variable is the metadata star rating
• Training sets are used to learn the weights for each sentence type;
  the weights are used in computing the text-based score
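A minimal sketch of such a fit with NumPy least squares; the toy feature matrix and ratings are hypothetical, and only the setup (a constant plus first- and second-order sentence-type variables predicting the star rating) follows the slide:

```python
import numpy as np

# Sketch: quadratic regression from sentence-type fractions to star
# ratings. Toy data; the real features are the 6 categories x 4
# sentiments from the slides.
X = np.array([        # e.g. [food_pos, food_neg, service_neg] per review
    [0.6, 0.1, 0.0],
    [0.2, 0.5, 0.2],
    [0.8, 0.0, 0.0],
    [0.1, 0.6, 0.3],
])
y = np.array([5, 2, 5, 1])          # metadata star ratings

# Design matrix: constant + first-order + second-order (squared) terms.
A = np.hstack([np.ones((len(X), 1)), X, X ** 2])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

def text_rating(features):
    f = np.asarray(features)
    return float(np.concatenate(([1.0], f, f ** 2)) @ weights)

print(round(text_rating([0.7, 0.1, 0.0]), 2))
```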
33

Regression-Based Text Rating
• Regression constant: 3.68
• Food, and negative Price and Service, appear to be most important

Regression weights (first-order variables):

  Category       Positive  Negative  Neutral  Conflict
  Food            2.62     -2.65     -0.08    -0.69
  Price           0.39     -2.12     -1.27     0.93
  Service         0.85     -4.25     -1.83     0.36
  Ambience        0.75     -0.27      0.16     0.21
  Anecdotes       0.95     -1.75      0.06    -0.19
  Miscellaneous   1.30     -2.62     -0.30     0.36

Regression weights (second-order variables):

  Category       Positive  Negative  Neutral  Conflict
  Food           -1.99      2.05     -0.14     0.67
  Price          -0.27      2.04      2.17    -1.01
  Service        -0.52      3.15      1.76     0.34
  Ambience       -0.44      0.81     -0.28    -0.61
  Anecdotes      -0.40      2.03     -0.03    -0.20
  Miscellaneous  -0.65      2.38      0.50    -0.10
34

Regression-Based Text Rating: Baseline Case
Restaurant average-based prediction:
• Prediction using the average rating given to a restaurant by all users
  (we also tried user-average and combined)
• RMSE errors: predicting using text does better than the popularly used
  star rating

  Predicting star ratings             Test A   Test B   Test C
  Using star rating                   1.127    1.267    1.126
  Using sentiment-based text rating   1.126    1.224    1.046

  Predicting sentiment text rating    Test A   Test B   Test C
  Using star rating                   0.703    0.718    0.758
  Using sentiment-based text rating   0.545    0.557    0.514
35

Clustering-Based Strategies for Recommendations
• KNN based on a clustering over star ratings
  ▫ Little improvement over the baseline
  ▫ Does not take the textual information into account
  ▫ Sparse data
  ▫ Cold start problem
  ▫ Hard clustering is not appropriate
• Soft clustering
  ▫ Partitions objects into clusters
  ▫ Each user has a membership probability for each cluster
36

Information Bottleneck Method
• Foundations in rate distortion theory
• Allows choosing a tradeoff between
  ▫ Compression (number of clusters T)
  ▫ Quality, estimated through the average distortion between cluster
    points and the cluster centroid (β parameter)
• Shown to work well with sparse datasets (N. Slonim, SIGIR 2002)
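For background, the standard information bottleneck objective (Tishby et al.; not stated on the slide): compress the objects X into clusters T while preserving information about a relevance variable Y, with β trading off the two:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```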
37

Leveraging Text Content for Personalized Predictions
• Use the sentence types (categories, sentiments) within the reviews as
  features
• Users are clustered based on the type of information in their reviews
• Predictions are made using the membership probabilities of clusters to
  find neighbors
38

Example: Clustering Using the iIB Algorithm
Star ratings (input matrix to the iIB algorithm, before normalization):

          Restaurant1  Restaurant2  Restaurant3
  User1       4            -            -
  User2       2            5            4
  User3       4           ???           3
  User4       5            2            -
  User5       -            -            1

Text features per (user, restaurant), as fractions of sentence types
(Food+ / Food- / Price+ / Price-):

          Restaurant1             Restaurant2            Restaurant3
  User1   0.6 / 0.2 / 0.2 / -     - / - / - / -          - / - / - / -
  User2   0.3 / 0.6 / 0.1 / -     0.9 / - / 0.1 / -      0.6 / 0.1 / 0.2 / 0.1
  User3   0.7 / 0.1 / 0.15 / 0.05 - / - / - / -          0.2 / 0.8 / - / -
  User4   0.9 / 0.05 / 0.05 / -   0.3 / 0.4 / 0.2 / 0.1  - / - / - / -
  User5   - / - / - / -           - / - / - / -          - / 0.7 / 0.3 / -
39

Example: Soft-Clustering Prediction
User ratings (star or text) and cluster membership probabilities:

          R1   R2   R3           Cluster1  Cluster2  Cluster3
  User1    4    -    -    User1    0.040     0.057     0.903
  User2    2    5    4    User2    0.396     0.202     0.402
  User3    4    *    3    User3    0.380     0.502     0.118
  User4    5    2    -    User4    0.576     0.015     0.409
  User5    -    -    1    User5    0.006     0.990     0.004

• For each cluster, we compute the cluster's contribution for the test
  restaurant as a weighted average of the ratings given to it:
  Contribution(c2, r2) = 4.793, Contribution(c3, r2) = 3.487
• We compute the final prediction from the cluster contributions for the
  test restaurant and the test user's membership probabilities: 4.042
  (a worked sketch follows)
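A small sketch that reproduces the slide's numbers, assuming a cluster's contribution is the membership-weighted average of the known ratings for the restaurant, and the prediction is the membership-weighted average of the contributions:

```python
# Soft-clustering prediction for User3 on Restaurant2, as on the slide.
memberships = {                     # user -> P(cluster 1..3)
    "User2": [0.396, 0.202, 0.402],
    "User3": [0.380, 0.502, 0.118],
    "User4": [0.576, 0.015, 0.409],
}
ratings_r2 = {"User2": 5, "User4": 2}   # known ratings for Restaurant2

def contribution(c):
    """Membership-weighted average rating of Restaurant2 in cluster c."""
    num = sum(memberships[u][c] * r for u, r in ratings_r2.items())
    den = sum(memberships[u][c] for u in ratings_r2)
    return num / den

contribs = [contribution(c) for c in range(3)]
print([round(x, 3) for x in contribs])   # [3.222, 4.793, 3.487]

pred = sum(m * x for m, x in zip(memberships["User3"], contribs))
print(round(pred, 3))                    # 4.042, matching the slide
```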
40

iIB Algorithm
• Experimented with different values of β and T; used β = 20, T = 100
• RMSE errors and percentage improvement over the baseline:

  Predicting star ratings             Test A          Test B          Test C
  Using star rating                   1.103 (2.13%)   1.242 (1.74%)   1.106 (1.78%)
  Using sentiment-based text rating   1.113 (1.15%)   1.211 (1.06%)   1.046 (0%)

  Predicting sentiment text rating    Test A          Test B          Test C
  Using star rating                   0.692 (1.56%)   0.704 (1.95%)   0.742 (2.11%)
  Using sentiment-based text rating   0.544 (0.18%)   0.549 (1.44%)   0.514 (0%)

• Using text features for clustering always improves the traditional goal
  of predicting star ratings
• Even small improvements in RMSE are useful (Netflix, precision in top-k)
41

URSA: Qualitative Predictions
• Predict the sentiment towards each topic
• Cluster users along each dimension separately
• Use a threshold to classify sentiment (actual and predicted)

[Figure: prediction accuracy for positive ambience as a function of the
threshold θ_act, from 0 to 1]
42

PERSEUS Project Description
Patient Emotion and StRucture SEarch USer Interface
▫ Large amount of patient-produced data
  • Difficult to search and understand
  • Patients need help finding information
  • Health professionals could learn from the data
▫ Analyze and search patient forums, mailing lists, and blogs
  • Topical information
  • Specific language
  • Time sensitive
  • Emotionally charged

Google Research Award – April 2010
NSF CDI Type I – October 2010-2013
43

PERSEUS Project Description
▫ Automatically add structure to free text
  • Use of context information
    (is “hair loss” a side effect or a symptom?)
  • Approximate structure
▫ Use structure to guide search
  • Need for high recall, but good precision
  • Find users with similar experiences
  • Various result granularities
    (thread vs. sentence; context dependent)
  • Needs to take approximation into account
44

Structuring and Searching Web Content: Contributions
• Leveraged automatically generated structure to improve predictions
  ▫ Around 2% RMSE improvement
  ▫ Used inferred structure to group users via soft clustering techniques
• Qualitative predictions
  ▫ High accuracy
• Future directions
  ▫ Extension to healthcare domains
  ▫ Use of inferred structure to guide search
  ▫ Use of user clusters in search
  ▫ Adaptation to various result granularities
  ▫ Taking classification inaccuracies into account
45

Joint work with:
Minji Wu (Computer Science, Rutgers University)

Collaborators:
Serge Abiteboul, Alban Galland (INRIA)
Pierre Senellart (Telecom ParisTech)
Magda Procopiuc, Divesh Srivastava (AT&T Research Labs)
Laure Berti-Equille (IRD)
46

Motivations
• Information on web sources is unreliable
  ▫ Erroneous
  ▫ Misleading
  ▫ Biased
  ▫ Outdated
• Users need to check web sites to confirm the information
  ▫ Data corroboration
47

Example: What is the gas mileage of my Honda Civic?
Query: “honda civic 2007 gas mileage” on MSN Search
• Is the top hit, the honda.com site, unbiased?
• Is the autoweb.com web site trustworthy?
• Are all these values referring to the correct year of the model?

Users may check several web sites to get an answer.
48

Example: Identifying Good Business Listings
• NYC restaurant information from 6 sources
  ▫ Yellowpages
  ▫ Menupages
  ▫ Yelp
  ▫ Foursquare
  ▫ OpenTable
  ▫ Mechanical Turk (check Street View)

Which listings are correct?
49

Data Corroboration Project Description
(WebDB’07, WSDM’10, IS’11, DEB’11)

Trustworthy sources report true facts; true facts come from trustworthy
sources.
• Sources have different
  ▫ Coverage
  ▫ Domain
  ▫ Dependencies
  ▫ Overlap
• Conflict resolution with maximum coverage

Microsoft Live Labs Search Award – May 2006
50

Top-k Join: Project Description
(CleanDB’06, PVLDB’10)

Integrate and aggregate information from several sources:
  (“minji”, “vldb10”, 0.2)
  (“minji”, “amélie”, 1.0)
  (“amélie”, “vldb10”, 0.5)
  (“amélie”, “SIN”, 0.3)
  (“minji”, “SIN”, 0.1)
  (“SIN”, “vldb10”, 0.9)
51

Data Corroboration Contributions
• Probabilistic model for corroboration
  ▫ Fact uncertainty
  ▫ Source trustworthiness
  ▫ Source coverage
  ▫ Conflict between sources
• Fixpoint techniques to compute truth values of facts and source quality
  estimates (a sketch follows this list)
• Top-k query algorithms for computing corroborated answers
• Open issues: functional dependencies, time, social networks, uncertain
  data, source dependence
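As a flavor of the fixpoint idea, a much-simplified voting sketch (not the papers' probabilistic model; the sources and claimed values echo the earlier Honda Civic example and are hypothetical):

```python
# Simplified corroboration fixpoint: each object has conflicting candidate
# values; source trust and value truth reinforce each other until the
# estimates stabilize.
claims = {                          # source -> {object: claimed value}
    "honda.com":   {"mpg": 36, "year": 2007},
    "autoweb.com": {"mpg": 30, "year": 2007},
    "some-blog":   {"mpg": 30, "year": 2006},
}
objects = {o for c in claims.values() for o in c}
trust = {s: 0.8 for s in claims}    # uniform initial trust

for _ in range(20):                 # iterate toward a fixpoint
    # Truth of a value: normalized trust mass of the sources claiming it.
    truth = {}
    for o in objects:
        votes = {}
        for s, c in claims.items():
            if o in c:
                votes[c[o]] = votes.get(c[o], 0.0) + trust[s]
        total = sum(votes.values())
        truth[o] = {v: w / total for v, w in votes.items()}
    # Trust of a source: average truth of the values it claims.
    trust = {s: sum(truth[o][c[o]] for o in c) / len(c)
             for s, c in claims.items()}

print({o: {v: round(w, 2) for v, w in vs.items()} for o, vs in truth.items()})
print({s: round(t, 2) for s, t in trust.items()})
```

On this toy input the iteration settles with the majority-backed values (mpg = 30, year = 2007) scoring highest and the source that agrees with both ranked most trustworthy.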
52

Conclusions
• New challenges in web data management
  ▫ Semi-structured data
    - PIMS
    - User reviews
  ▫ Multiple sources of data
    - Conflicting information
    - Low-quality data providers (Web 2.0)
• The SPIDR lab at Rutgers focuses on helping users identify useful data
  in the wealth of information available