Data Enrichment using Web APIs

Karthik Gomadam, Peter Z. Yeh, Kunal Verma
Accenture Technology Labs
50 W. San Fernando St, San Jose, CA
{karthik.gomadam, peter.z.yeh, k.verma}@accenture.com

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.



Abstract

As businesses seek to monetize their data, they are leveraging Web-based delivery mechanisms to provide publicly available data sources. Also, as analytics becomes a central part of many business functions such as customer segmentation, competitive intelligence, and fraud detection, many businesses are seeking to enrich their internal data records with data from these data sources. As the number of sources with varying degrees of accuracy and quality proliferates, it is a non-trivial task to effectively select which sources to use for a particular enrichment task. The old model of statically buying data from one or two providers becomes inefficient because of the rapid growth of new forms of useful data, such as social media, and the lack of dynamism to plug sources in and out. In this paper, we present the data enrichment framework, a tool that uses data mining and other semantic techniques to automatically guide the selection of sources. The enrichment framework also monitors the quality of the data sources and automatically penalizes sources that continue to return low-quality results.

Introduction

As enterprises become more data and analytics driven, many businesses are seeking to enrich their internal data records with data from data sources available on the Web. Consider the example where a company's consumer database might have the name and address of its consumers. Being able to use publicly available data sources such as LinkedIn, White Pages, and Facebook to find information such as employment details and interests can help the company collect features for tasks such as customer segmentation. Currently, this is done in a static fashion: a business buys data from one or two sources and statically integrates it with its internal data. However, this approach has the following shortcomings:

1. Inconsistencies in input data: It is not uncommon to have different sets of missing attributes across records in the input data. For example, for some consumers the street address might be missing, and for others the city might be missing. To address this issue, one must be able to select the data sources at a granular level, based on the input data that is available and the data sources that can be used.

2. Quality of a data service may vary depending on the input data: Calling a data source with some missing values (even though they may not be mandatory) can result in poor quality results. In such a scenario, one must be able to find the missing values before calling the source, or use an alternative.

In this paper, we present an overview of a data enrichment framework which attempts to automate many sub-tasks of data enrichment. The framework uses a combination of data mining and semantic technologies to automate tasks such as calculating which attributes are more important than others for source selection, selecting sources based on the information available about a data record and the past performance of the sources, using multiple sources to reinforce low-confidence values, monitoring the quality of sources, and adapting the use of sources based on past performance. The data enrichment framework makes the following contributions:

1. Granular, context-dependent ordering of data sources: We have developed a novel approach to ordering the data sources that are called, based on the data that is available at that point.

2. Automated assessment of data source quality: Our data enrichment algorithm measures the quality of the output of a data source and its overall utility to the enrichment process.

3. Dynamic selection adaptation of data sources: Using the data availability and utility scores from prior invocations, the enrichment framework penalizes or rewards data sources, which affects how often they are selected.

The rest of the paper is organized as follows: we first motivate the need for data enrichment, then describe the enrichment algorithm and the system architecture, and close with our experiences and related work.

Motivation

We motivate the need for data enrichment through three real-world examples gathered from large Fortune 500 companies that are clients of Accenture.

Creating comprehensive customer models: Creating comprehensive customer models has become a holy grail in the consumer business, especially in verticals like retail and healthcare. The information that is collected impacts key decisions like inventory management, promotions and rewards, and the delivery of a more personal experience.
1. Personalized and targeted promotions: The more information a business has about its customers, the better it can personalize deals and promotions. Further, a business could also use aggregated information (such as the interests of people living in a neighborhood) to manage inventory and run local deals and promotions.

2. Better segmentation and analytics: The provider may need more information about their customers, beyond what they have in the profile, for better segmentation and analytics. For example, an e-commerce site may know a person's browsing and buying history on its site and have some limited information about the person, such as their address and credit card information. However, understanding their professional activities and hobbies may yield more features for customer segmentation that can be used for suggestions or promotions.

3. Fraud detection: The provider may need more information about their customers for detecting fraud. Providers typically create detailed customer profiles to predict their behaviors and detect anomalies. Having demographic and other attributes such as interests and hobbies helps build more accurate customer behavior profiles. Most e-commerce providers are under a lot of pressure to detect fraudulent activities on their site as early as possible, so that they can limit their exposure to lawsuits, compliance penalties, or loss of reputation.

Often businesses engage customers to register for programs like reward cards, or connect with customers using social media. This limited initial engagement gives them access to basic information about a customer, such as name, email, address, and social media handles. However, in the vast majority of cases, such information is incomplete and the gaps are not uniform. For example, for a customer John Doe, a business might have the name, street address, and a phone number, whereas for Jane Doe the available information will be name, email, and a Twitter handle. Leveraging the basic information and completing the gaps, also called creating a 360-degree customer view, is a significant challenge. Current approaches to addressing this challenge largely revolve around subscribing to data sources like Experian. This approach has the following shortcomings:

1. The enrichment task is restricted to the attributes provided by the one or two data sources that the business buys from. If other attributes about the customers are needed, they are hard to get.

2. The selected data sources may have high quality information about some attributes, but poor quality about others. Even if the e-commerce provider knows about other sources that have those attributes, it is hard to manually integrate more sources.

3. There is no good way to monitor whether there is any degradation in the quality of the data sources.

Using the enrichment framework in this context would allow the e-commerce provider to dynamically select the best set of sources for a particular attribute in a particular data enrichment task. The proposed framework can switch sources across customer records if the most preferred source does not have information about some attributes for a particular record. For low-confidence values, the proposed system uses reconciliation across sources to increase the confidence of the value. The framework also continuously monitors and downgrades sources if there is any loss of quality.

Capital Equipment Maintenance: Companies within the energy and resources industry have significant investments in capital equipment (i.e., drills, oil pumps, etc.). Accurate data about this equipment (e.g., manufacturer, model, etc.) is paramount to operational efficiency and proper maintenance.

The current process for capturing this data begins with manual entry, followed by manual, periodic "walk-downs" to confirm and validate the information. However, this process is error-prone and often results in incomplete and inaccurate data about the equipment.

This does not have to be the case. A wealth of structured data sources (e.g., from manufacturers) exists that provides much of the incomplete, missing information. Hence, a solution that can automatically leverage these sources to enrich existing, internal capital equipment data can significantly improve the quality of the data, which in turn can improve operational efficiency and enable proper maintenance.

Competitive Intelligence: The explosive growth of external data (i.e., data outside the business, such as Web data, data providers, etc.) can enable businesses to gather rich intelligence about their competitors. For example, companies in the energy and resources industry are very interested in competitive insights such as where a competitor is drilling (or planning to drill); disruptions to drilling due to accidents, weather, etc.; and more.

To gather these insights, companies currently purchase relevant data from third-party sources (e.g., IHS and Dodson are just two examples of third-party data sources that aggregate and sell drilling data) to manually enrich existing internal data and generate a comprehensive view of the competitive environment. However, this process is a manual one, which makes it difficult to scale beyond a small handful of data sources. Many useful data sources that are open (public access), such as sources that provide weather data based on GIS information, are omitted, resulting in gaps in the intelligence gathered.

A solution that can automatically perform this enrichment across a broad range of data sources can provide more in-depth, comprehensive competitive insight.

Overview of Data Enrichment Algorithm

Our Data Enrichment Framework (DEF) takes two inputs: 1) an instance of a data object to be enriched and 2) a set of data sources to use for the enrichment. It outputs an enriched version of the input instance.

DEF enriches the input instance through the following steps. DEF first assesses the importance of each attribute in the input instance. This information is then used by DEF to guide the selection of appropriate data sources to use. Finally, DEF determines the utility of the sources used, so it can adapt its usage of these sources (either in a favorable or unfavorable manner) going forward.
Preliminaries

A data object D is a collection of attributes describing a real-world object of interest. We formally define D as {a_1, a_2, ..., a_n}, where a_i is an attribute.

An instance d of a data object D is a partial instantiation of D; i.e., some attributes a_i may not have an instantiated value. We formally define d as having two elements, d_k and d_u. d_k consists of attributes whose values are known (i.e., instantiated), which we define formally as d_k = {<a, v(a), k_a, k_v(a)>, ...}, where v(a) is the value of attribute a, k_a is the importance of a to the data object D that d is an instance of (ranging from 0.0 to 1.0), and k_v(a) is the confidence in the correctness of v(a) (ranging from 0.0 to 1.0). d_u consists of attributes whose values are unknown and hence are the targets for enrichment. We define d_u formally as d_u = {<a, k_a>, ...}.
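To make this notation concrete, the following sketch models a data object instance with its known and unknown attribute sets. It is a minimal illustration in Python; the class and field names are ours and are not part of DEF's interface.

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class KnownAttribute:
        value: object      # v(a): the instantiated value
        importance: float  # k_a, in [0.0, 1.0]
        confidence: float  # k_v(a), in [0.0, 1.0]

    @dataclass
    class Instance:
        """A partial instantiation d of a data object D."""
        known: Dict[str, KnownAttribute] = field(default_factory=dict)  # d_k
        unknown: Dict[str, float] = field(default_factory=dict)         # d_u: attribute -> importance k_a

    # Example: a Customer instance with a known name and an unknown occupation.
    customer = Instance(
        known={"name": KnownAttribute("John Smith", importance=0.9, confidence=1.0)},
        unknown={"occupation": 0.7},
    )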

Attribute Importance Assessment

Given an instance d of a data object, DEF first assesses (and sets) the importance k_a of each attribute a to the data object that d is an instance of. DEF uses the importance to guide the subsequent selection of appropriate data sources for enrichment (see the next subsection).

Our definition of importance is based on the intuition that an attribute a has high importance to a data object D if its values are highly unique across all instances of D. For example, the attribute e-mail contact should have high importance to the Customer data object because it satisfies this intuition. However, the attribute Zip should have lower importance to the Customer object because it does not; i.e., many instances of the Customer object have the same zipcode.

DEF captures the above intuition formally with the following equation:

    k_a = \frac{X^2}{1 + X^2}    (1)

where

    X = H_{N(D)}(a) \, \frac{U(a, D)}{|N(D)|}    (2)

and

    H_{N(D)}(a) = -\sum_{v \in a} P_v \log P_v    (3)

U(a, D) is the number of unique values of a across all instances of the data object D observed by DEF so far, and N(D) is the set of all instances of D observed by DEF so far. H_{N(D)}(a) is the entropy of the values of a across N(D), and serves as a proxy for the distribution of the values of a.

We note that DEF recomputes k_a as new instances of the data object containing a are observed. Hence, the importance of an attribute to a data object will change over time.
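The importance computation of Eqs. 1-3 can be sketched as follows, assuming the values of an attribute observed so far are available as a simple list; the function and variable names are illustrative, not DEF's.

    import math
    from collections import Counter
    from typing import List

    def attribute_importance(values: List[object]) -> float:
        """Compute k_a from the observed values of attribute a (Eqs. 1-3)."""
        n = len(values)
        if n == 0:
            return 0.0
        counts = Counter(values)
        # Entropy H_{N(D)}(a) over the empirical distribution of the values (Eq. 3).
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        # Scale the entropy by the fraction of unique values, U(a, D) / |N(D)| (Eq. 2).
        x = entropy * (len(counts) / n)
        # Squash into [0, 1) (Eq. 1).
        return x ** 2 / (1 + x ** 2)

    # E-mail (nearly all unique) scores much higher than Zip (many repeats).
    emails = ["user%d@example.com" % i for i in range(100)]
    zips = ["95113"] * 60 + ["95112"] * 40
    print(attribute_importance(emails), attribute_importance(zips))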
Data Source Selection

DEF selects data sources to enrich attributes of a data object instance d whose values are unknown. DEF repeats this step until either there are no attributes in d whose values are unknown or there are no more sources to select.

DEF considers two important factors when selecting the next best source to use: 1) whether the source will be able to provide values if called, and 2) whether the source targets unknown attributes in d_u (especially attributes with high importance). DEF satisfies the first factor by measuring how well the known values of d match the inputs required by the source. If there is a good match, then the source is more likely to return values when it is called. DEF also considers the number of times a source was called previously (while enriching d) to prevent "starvation" of other sources.

DEF satisfies the second factor by measuring how many high-importance, unknown attributes the source claims to provide. If a source claims to provide a large number of these attributes, then DEF should select the source over others. This second factor serves as the selection bias.

DEF formally captures these two considerations with the following equation:

    F_s = \frac{1}{2^{M-1}} \, B_s \, \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|} + \frac{\sum_{a \in d_u \cap O_s} k_a}{|d_u|}    (4)

where B_s is the base fitness score of a data source s being considered (this value is randomly set between 0.5 and 0.75 when DEF is initialized), I_s is the set of input attributes to the data source, O_s is the set of output attributes from the data source, and M is the number of times the data source has been selected in the context of enriching the current data object instance.

The data source with the highest score F_s that also exceeds a predefined minimum threshold R is selected as the next source to use for enrichment.
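A minimal sketch of the source scoring in Eq. 4 follows; the plain dictionaries standing in for d_k, d_u, I_s, and O_s, and the assumption that M is at least 1 when a source is first considered, are ours.

    def source_fitness(known_confidence, unknown_importance,
                       source_inputs, source_outputs, base_score, times_selected):
        """Fitness F_s of a candidate source for the current instance (Eq. 4).

        known_confidence: {attribute: k_v(a)} for the attributes in d_k.
        unknown_importance: {attribute: k_a} for the attributes in d_u.
        source_inputs / source_outputs: the source's I_s and O_s attribute sets.
        base_score: B_s; times_selected: M (assumed to be at least 1).
        """
        if not source_inputs or not unknown_importance:
            return 0.0
        # How well the instance's known values cover the source's required inputs.
        input_term = sum(c for a, c in known_confidence.items()
                         if a in source_inputs) / len(source_inputs)
        # How much importance the source claims to provide for unknown attributes.
        output_term = sum(k for a, k in unknown_importance.items()
                          if a in source_outputs) / len(unknown_importance)
        # The 1 / 2^(M-1) factor damps repeatedly selected sources to prevent starvation.
        return (base_score * input_term) / (2 ** (times_selected - 1)) + output_term

    # A source whose required inputs are well covered and which targets the important gap.
    print(source_fitness({"name": 1.0, "city": 0.9}, {"occupation": 0.7, "employer": 0.6},
                         {"name", "city"}, {"occupation"}, base_score=0.6, times_selected=1))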
For each unknown attribute a' enriched by the selected data source, DEF moves it from d_u to d_k and computes the confidence k_v(a') in the value provided for a' by the selected source. This confidence is used in subsequent iterations of the enrichment process, and is computed using the following formula:

    k_{v(a')} = \begin{cases} e^{\frac{1}{|V_{a'}|} - 1} \, W, & \text{if } k_{v(a')} = Null \\ e^{\lambda (k_{v(a')} - 1)}, & \text{if } k_{v(a')} \neq Null \end{cases}    (5)

where

    W = \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|}    (6)

W is the confidence over all input attributes to the source, and V_{a'} is the set of output values returned by a data source for an unknown attribute a'.

This formula captures two important factors. First, if multiple values are returned, then there is ambiguity, and hence the confidence in the output should be discounted. Second, if an output value is corroborated by output values given by previously selected data sources, then the confidence should be further increased. The λ factor is the corroboration factor (< 1.0), and defaults to 1.0.
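The confidence update of Eqs. 5-6 can be sketched as follows; treating a missing prior confidence as the Null case, and reading the second branch as applying when an earlier source already supplied a value, are our interpretation of the formula.

    import math

    def input_confidence(known_confidence, source_inputs):
        """W: mean confidence of the known attributes passed to the source (Eq. 6)."""
        return sum(c for a, c in known_confidence.items()
                   if a in source_inputs) / len(source_inputs)

    def value_confidence(num_values, W, prior_confidence=None, corroboration=1.0):
        """Confidence k_v(a') for a value returned by the selected source (Eq. 5)."""
        if prior_confidence is None:
            # Ambiguity discount: the more candidate values returned, the lower the confidence.
            return math.exp(1.0 / num_values - 1.0) * W
        # Corroboration: a value confirmed again by another source moves confidence toward 1.
        return math.exp(corroboration * (prior_confidence - 1.0))

    W = input_confidence({"name": 1.0, "city": 0.9}, {"name", "city"})
    print(value_confidence(num_values=2, W=W))   # ambiguous answer, discounted
    print(value_confidence(num_values=1, W=W, prior_confidence=0.7, corroboration=0.8))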
In addition to selecting appropriate data sources to use, DEF must also resolve ambiguities that occur during the enrichment process. For example, given the following instance of the Customer data object:

    (Name: John Smith, City: San Jose, Occupation: NULL)

a data source may return multiple values for the unknown attribute Occupation (e.g., Programmer, Artist, etc.).

To resolve this ambiguity, DEF will branch the original instance (one branch for each returned value), and each branched instance will be subsequently enriched using the same steps above. Hence, a single data object instance may result in multiple instances at the end of the enrichment process.

DEF will repeat the above process until either d_u is empty or there are no sources whose score F_s exceeds R. Once this process terminates, DEF computes the fitness for each resulting instance using the following equation:

    \frac{\sum_{a \in d_k \cap d_U} k_{v(a)} \, k_a}{|d_k \cup d_u|}    (7)

and returns the top K instances.
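A sketch of the branching step and of the instance fitness in Eq. 7 follows; representing the instance as plain dictionaries and reading d_U as the set of attributes that were originally enrichment targets are our simplifications.

    import copy

    def branch_on_values(instance, attribute, values, importance, confidence):
        """One enriched copy of the instance per candidate value returned for `attribute`.

        instance: {"known": {attr: {"value": v, "importance": k_a, "confidence": k_v}},
                   "unknown": {attr: k_a}}  -- a plain-dict version of d_k / d_u.
        """
        branches = []
        for v in values:
            b = copy.deepcopy(instance)
            b["unknown"].pop(attribute, None)
            b["known"][attribute] = {"value": v, "importance": importance, "confidence": confidence}
            branches.append(b)
        return branches

    def instance_fitness(instance, originally_unknown):
        """Fitness of an enriched instance (Eq. 7): confidence-weighted importance of the
        enriched attributes, normalized by the total number of attributes."""
        total = len(instance["known"]) + len(instance["unknown"])
        score = sum(att["confidence"] * att["importance"]
                    for name, att in instance["known"].items() if name in originally_unknown)
        return score / total if total else 0.0

    inst = {"known": {"name": {"value": "John Smith", "importance": 0.9, "confidence": 1.0}},
            "unknown": {"occupation": 0.7}}
    for b in branch_on_values(inst, "occupation", ["Programmer", "Artist"],
                              importance=0.7, confidence=0.5):
        print(instance_fitness(b, originally_unknown={"occupation"}))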

Data Source Utility Adaptation

Once a data source has been called, DEF determines the utility of the source in enriching the data object instance of interest. Intuitively, DEF models the utility of a data source as a "contract"; i.e., if DEF provides a data source with high-confidence input values, then it is reasonable to expect the data source to provide values for all the output attributes that it claims to target. Moreover, these values should not be generic and should have low ambiguity. If these expectations are violated, then the data source should be penalized heavily.

On the other hand, if DEF did not provide a data source with good inputs, then the source should be penalized minimally (if at all) if it fails to provide any useful outputs.

Alternatively, if a data source is able to provide unambiguous values for unknown attributes in the data object instance (especially high-importance attributes), then DEF should reward the source and give it more preference going forward.

DEF captures this notion formally with the following equation:

    U_s = W \, \frac{1}{|O_s|} \left( \sum_{a \in O_s^+} e^{\frac{1}{|V_a|} - 1} \, P_T(v(a)) \, k_a \; - \; \sum_{a \in O_s^-} k_a \right)    (8)

where

    P_T(v(a)) = \begin{cases} P_T(v(a)), & \text{if } |V_a| = 1 \\ \underset{v(a) \in V_a}{\arg\min} \; P_T(v(a)), & \text{if } |V_a| > 1 \end{cases}    (9)

O_s^+ are the output attributes from a data source for which values were returned, O_s^- are the output attributes from the same source for which values were not returned, and P_T(v(a)) is the relative frequency of the value v(a) over the past T values returned by the data source for the attribute a. W is the confidence over all input attributes to the source, and is defined in the previous subsection.

The utilities U_s of a data source from its past n calls are then used to adjust the base fitness score of the data source. This adjustment is captured with the following equation:

    B_s = B_s + \gamma \, \frac{1}{n} \sum_{i=1}^{n} U_s(T - i)    (10)

where B_s is the base fitness score of a data source s, U_s(T - i) is the utility of the data source i time steps back, and γ is the adjustment rate.
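The utility computation (Eqs. 8-9) and the base-score adjustment (Eq. 10) can be sketched as below; the per-attribute bookkeeping of returned values and their historical relative frequencies, and the default adjustment rate, are illustrative and reflect our reading of the formulas.

    import math

    def source_utility(W, importances, returned, relative_freq):
        """Utility U_s of a source after a call (our reconstruction of Eqs. 8-9).

        W: confidence over the inputs given to the source.
        importances: {attribute: k_a} for every output attribute the source claims (O_s).
        returned: {attribute: [values]} for the attributes actually returned (O_s+).
        relative_freq: {attribute: {value: P_T(v)}} over the source's past T answers.
        """
        if not importances:
            return 0.0
        reward, penalty = 0.0, 0.0
        for a, k_a in importances.items():
            values = returned.get(a)
            if values:  # a in O_s+
                # Ambiguity discount e^(1/|V_a| - 1); P_T is the relative frequency
                # of the least frequent returned value when several are returned (Eq. 9).
                p = min(relative_freq.get(a, {}).get(v, 0.0) for v in values)
                reward += math.exp(1.0 / len(values) - 1.0) * p * k_a
            else:       # a in O_s-: promised but not delivered
                penalty += k_a
        return W * (reward - penalty) / len(importances)

    def adapt_base_score(base_score, recent_utilities, gamma=0.1):
        """Adjust B_s with the mean utility of the last n calls (Eq. 10); gamma is illustrative."""
        if not recent_utilities:
            return base_score
        return base_score + gamma * sum(recent_utilities) / len(recent_utilities)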
System Architecture

[Figure 1: Design overview of enrichment framework]

The main components of DEF are illustrated in Figure 1. The task manager starts a new enrichment project by instantiating and executing the enrichment engine. The enrichment engine uses the attribute computation module to calculate the attribute relevance. The relevance scores are used in source selection. Using the HTTP helper module, the engine then invokes the connector for the selected data source. A connector is a proxy that communicates with the actual data source and is itself a RESTful Web service. The enrichment framework requires every data source to have a connector, and each connector must support two operations: 1) a "return schema" GET operation that returns the input and output schema, and 2) a "get data" POST operation that takes the input for the data source as POST parameters and returns the response as JSON. For internal databases, we have special connectors that wrap queries as RESTful end points. Once a response is obtained from the connector, the enrichment engine computes the output value confidence, applies the necessary mapping rules, and integrates the response with the existing data object. In addition, the source degradation factor is also computed. The mapping, invocation, confidence, and source degradation computation steps are repeated until either values for all attributes are computed or all sources have been invoked. The result is then written into the enrichment database.
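As an illustration of the connector contract described above, the following sketch shows what a connector for a hypothetical source could look like using Flask; the route paths, schema attributes, and the upstream lookup are assumptions, not the framework's actual interface.

    # A minimal connector sketch (Flask) for a hypothetical "white pages" source.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/schema", methods=["GET"])
    def schema():
        # 1) "return schema" operation: advertise the input and output attributes.
        return jsonify({"inputs": ["name", "city"], "outputs": ["phone", "street_address"]})

    @app.route("/data", methods=["POST"])
    def data():
        # 2) "get data" operation: take the source inputs as POST parameters, answer in JSON.
        name = request.form.get("name")
        city = request.form.get("city")
        return jsonify(lookup_white_pages(name, city))

    def lookup_white_pages(name, city):
        # Placeholder for the actual proxy logic that calls the underlying data source.
        return {"phone": [], "street_address": []}

    if __name__ == "__main__":
        app.run(port=5000)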
In designing the enrichment framework, we have adopted a service-oriented approach, with the goal of exposing the enrichment framework as a "platform as a service". The core tasks in the framework are exposed as RESTful end points. These include end points for creating data objects, importing datasets, adding data sources, and starting an enrichment task. When the "start enrichment task" resource is invoked and a task is successfully started, the framework responds with a JSON document that contains the enrichment task identifier. This identifier can then be used to GET the enriched data from the database. The framework supports both batch GET and streaming GET using the comet (?) pattern.
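A client of the platform could interact with these end points roughly as follows; the base URL, route paths, and JSON field names are hypothetical, since the paper does not specify the exact API.

    # Hypothetical client interaction with the enrichment platform's REST end points.
    import requests

    BASE = "http://enrichment.example.com/api"

    # Start an enrichment task over a previously imported dataset.
    resp = requests.post(f"{BASE}/tasks", json={"dataset": "customers", "data_object": "Customer"})
    task_id = resp.json()["task_id"]

    # Later, fetch the enriched records in batch.
    enriched = requests.get(f"{BASE}/tasks/{task_id}/results").json()
    for record in enriched["records"]:
        print(record)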
Data mapping is one of the key challenges in any data integration system. While an extensive research literature exists on automated and semi-automated approaches to mapping (?; ?; ?; ?; ?), it is our observation that these techniques do not guarantee the high level of accuracy required in enterprise solutions. So, we currently adopt a manual approach, aided by a graphical interface for data mapping. The source and target schemas are shown to the users as two trees, one on the left and one on the right. Users can select attributes from the source schema and draw a line between them and attributes of the target schema. Currently, our mapping system supports assignment, merge, split, numerical operations, and unit conversions. When the user saves the mappings, the maps are stored as mapping rules. Each mapping rule is represented as a tuple containing the source attributes, target attributes, mapping operations, and conditions. Conditions include merge and split delimiters and conversion factors.
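A mapping rule of the kind described above could be represented along these lines; the field and operation names are illustrative.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class MappingRule:
        """One saved mapping between source and target schema attributes."""
        source_attributes: List[str]
        target_attributes: List[str]
        operation: str                       # e.g. "assign", "merge", "split", "convert"
        conditions: Dict[str, str] = field(default_factory=dict)  # delimiters, conversion factors

    # Merge first and last name into a single target attribute, joined by a space.
    rule = MappingRule(
        source_attributes=["first_name", "last_name"],
        target_attributes=["full_name"],
        operation="merge",
        conditions={"delimiter": " "},
    )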
Challenges and Experiences

• Data services and APIs are an integral part of the system.
• They have been around for a while, but have gained minimal traction in the enterprise.
• API rate limiting, and a lack of SLA-driven API contracts.
• Poor API documentation, often leading the developer in circles; minimalism in APIs might not be a good idea.
• Changes are not "pushed" to developers, who often find out only when an application breaks.
• Inconsistent auth mechanisms across different APIs; mandatory auth makes it expensive even when pulling information that is open on the Web.
• It is easy to develop prototypes, but the APIs do not instill enough confidence to develop a deployable client solution.
Related Work

Web APIs and data services have helped establish the Web as a platform. Web APIs have enabled the deployment and use of services on the Web using standardized communication protocols and message formats. Leveraging Web APIs, data services have allowed access, in a standardized manner, to vast amounts of data that were hitherto hidden in proprietary silos. A notable outcome is the development of mashups, or Web application hybrids. A mashup is created by integrating data from various services on the Web using their APIs. Although early mashups were consumer-centric applications, their adoption within the enterprise has been increasing, especially for addressing data-intensive problems such as data filtering, data transformation, and data enrichment.

Contenu connexe

Tendances

DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASETDATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
AM Publications
 
GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378
Parag Kapile
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
Kun Le
 
Data warehouse Project Report
Data warehouse Project ReportData warehouse Project Report
Data warehouse Project Report
Himanshu Yadav
 
White Paper: Causata Big Data Architecture
White Paper: Causata Big Data ArchitectureWhite Paper: Causata Big Data Architecture
White Paper: Causata Big Data Architecture
Causata
 

Tendances (10)

DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASETDATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
 
Information Extraction and Aggregation from Unstructured Web Data for Busines...
Information Extraction and Aggregation from Unstructured Web Data for Busines...Information Extraction and Aggregation from Unstructured Web Data for Busines...
Information Extraction and Aggregation from Unstructured Web Data for Busines...
 
Customer Segmentation
Customer SegmentationCustomer Segmentation
Customer Segmentation
 
Intro to insight as-a-service
Intro to insight as-a-serviceIntro to insight as-a-service
Intro to insight as-a-service
 
GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378
 
Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...Best practices for building and deploying predictive models over big data pre...
Best practices for building and deploying predictive models over big data pre...
 
The Comparison of Big Data Strategies in Corporate Environment
The Comparison of Big Data Strategies in Corporate EnvironmentThe Comparison of Big Data Strategies in Corporate Environment
The Comparison of Big Data Strategies in Corporate Environment
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedIn
 
Data warehouse Project Report
Data warehouse Project ReportData warehouse Project Report
Data warehouse Project Report
 
White Paper: Causata Big Data Architecture
White Paper: Causata Big Data ArchitectureWhite Paper: Causata Big Data Architecture
White Paper: Causata Big Data Architecture
 

En vedette

presentacion mabel condemarin
presentacion mabel condemarinpresentacion mabel condemarin
presentacion mabel condemarin
guest2495c5c
 
Attitude is Everything
Attitude is EverythingAttitude is Everything
Attitude is Everything
grsalomon
 
100 ways to boost yourself confidence
100 ways to boost yourself   confidence100 ways to boost yourself   confidence
100 ways to boost yourself confidence
Tejesh Rao
 

En vedette (17)

Lean 6 Sigma Searchtec
Lean 6 Sigma SearchtecLean 6 Sigma Searchtec
Lean 6 Sigma Searchtec
 
Urari 2010
Urari 2010Urari 2010
Urari 2010
 
presentacion mabel condemarin
presentacion mabel condemarinpresentacion mabel condemarin
presentacion mabel condemarin
 
Gomadam Dissertation
Gomadam DissertationGomadam Dissertation
Gomadam Dissertation
 
Millenia Presentation Compressed
Millenia Presentation CompressedMillenia Presentation Compressed
Millenia Presentation Compressed
 
Apa Guidelines
Apa GuidelinesApa Guidelines
Apa Guidelines
 
Pcori2013 (23)
Pcori2013 (23)Pcori2013 (23)
Pcori2013 (23)
 
Millenia presentation
Millenia presentationMillenia presentation
Millenia presentation
 
Green is the New Blue
Green is the New BlueGreen is the New Blue
Green is the New Blue
 
Frumos 1
Frumos 1Frumos 1
Frumos 1
 
Attitude is Everything
Attitude is EverythingAttitude is Everything
Attitude is Everything
 
Semantics Enriched Service Environments
Semantics Enriched Service EnvironmentsSemantics Enriched Service Environments
Semantics Enriched Service Environments
 
Final Defense
Final DefenseFinal Defense
Final Defense
 
Eastern Urban Center
Eastern Urban CenterEastern Urban Center
Eastern Urban Center
 
Kgomadam Candidacy
Kgomadam CandidacyKgomadam Candidacy
Kgomadam Candidacy
 
100 ways to boost yourself confidence
100 ways to boost yourself   confidence100 ways to boost yourself   confidence
100 ways to boost yourself confidence
 
mabel condemarin
mabel condemarinmabel condemarin
mabel condemarin
 

Similaire à Data Enrichment using Web APIs

1. What are the business costs or risks of poor data quality Sup.docx
1.  What are the business costs or risks of poor data quality Sup.docx1.  What are the business costs or risks of poor data quality Sup.docx
1. What are the business costs or risks of poor data quality Sup.docx
SONU61709
 
Data Mining Presentation for College Harsh.pptx
Data Mining Presentation for College Harsh.pptxData Mining Presentation for College Harsh.pptx
Data Mining Presentation for College Harsh.pptx
hp41112004
 

Similaire à Data Enrichment using Web APIs (20)

Data valuation
Data valuationData valuation
Data valuation
 
Semantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data LakeSemantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data Lake
 
Semantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data LakeSemantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data Lake
 
1. What are the business costs or risks of poor data quality Sup.docx
1.  What are the business costs or risks of poor data quality Sup.docx1.  What are the business costs or risks of poor data quality Sup.docx
1. What are the business costs or risks of poor data quality Sup.docx
 
Developing A Universal Approach to Cleansing Customer and Product Data
Developing A Universal Approach to Cleansing Customer and Product DataDeveloping A Universal Approach to Cleansing Customer and Product Data
Developing A Universal Approach to Cleansing Customer and Product Data
 
Big data
Big dataBig data
Big data
 
Derak Presentation
Derak Presentation Derak Presentation
Derak Presentation
 
Big Data use cases in telcos
Big Data use cases in telcosBig Data use cases in telcos
Big Data use cases in telcos
 
Big Data use cases in telcos
Big Data use cases in telcosBig Data use cases in telcos
Big Data use cases in telcos
 
Technology & Innovation - User Experience in Business Database Systems
Technology & Innovation - User Experience in Business Database SystemsTechnology & Innovation - User Experience in Business Database Systems
Technology & Innovation - User Experience in Business Database Systems
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Aziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaAziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jha
 
Abstract
AbstractAbstract
Abstract
 
Cis 500 assignment 4
Cis 500 assignment 4Cis 500 assignment 4
Cis 500 assignment 4
 
Turning Big Data to Business Advantage
Turning Big Data to Business AdvantageTurning Big Data to Business Advantage
Turning Big Data to Business Advantage
 
Data Mining Presentation for College Harsh.pptx
Data Mining Presentation for College Harsh.pptxData Mining Presentation for College Harsh.pptx
Data Mining Presentation for College Harsh.pptx
 
How Can You Calculate the Cost of Your Data?
How Can You Calculate the Cost of Your Data?How Can You Calculate the Cost of Your Data?
How Can You Calculate the Cost of Your Data?
 
Top Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsTop Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their Applications
 
To Become a Data-Driven Enterprise, Data Democratization is Essential
To Become a Data-Driven Enterprise, Data Democratization is EssentialTo Become a Data-Driven Enterprise, Data Democratization is Essential
To Become a Data-Driven Enterprise, Data Democratization is Essential
 
Big Data, Analytics and Data Science
Big Data, Analytics and Data ScienceBig Data, Analytics and Data Science
Big Data, Analytics and Data Science
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Data Enrichment using Web APIs

  • 1. Data Enrichment using Web APIs Karthik Gomadam, Peter Z. Yeh, Kunal Verma Accenture Technology Labs 50, W. San Fernando St, San Jose, CA { karthik.gomadam, peter.z.yeh, k.verma} @accenture.com Abstract at a granular level, based on the input data that is available and the data sources that can be used. As businesses seek to monetize their data, they are leveraging Web-based delivery mechanisms to provide publically avail- 2. Quality of a data service may vary, depending on the able data sources. Also, as analytics becomes a central part input data: Calling a data source with some missing val- of many business functions such as customer segmentation, ues (even though they may not be mandatory), can result competitive intelligence and fraud detection, many businesses in poor quality results. In such a scenario, one must be are seeking to enrich their internal data records with data from able to find the missing values, before calling the source these data sources. As the number of sources with varying or use an alternative. degrees of accuracy and quality proliferate, it is a non-trivial task to effectively select which sources to use for a particu- In this paper, we present an overview of a data enrichment lar enrichment task. The old model of statically buying data framework which attempts to automate many sub-tasks of from one or two providers becomes inefficient because of the data enrichment. The framework uses a combination of data rapid growth of new forms of useful data such as social me- mining and semantic technologies to automate various tasks dia and the lack of dynamism to plug sources in and out. In such as calculating which attributes are more important than this paper, we present the data enrichment framework, a tool others for source selection, selecting sources based on infor- that uses data mining and other semantic techniques to au- tomatically guide the selection of sources. The enrichment mation available about a data record and past performance framework also monitors the quality of the data sources and of the sources, using multiple sources to reinforce low con- automatically penalizes sources that continue to return low fidence values, monitoring the quality of sources, as well as quality results. adaptation of sources based on past performance. The data enrichment framework makes the following contributions: 1. Granular context dependent ordering of data sources: Introduction We have developed a novel approach to the order in which As enterprises become more data and analytics driven many data sources are called, based on the data that is available businesses are seeking to enrich their internal data records at that point. with data from data sources available on the Web. Consider 2. Automated assessment of data source quality: Our data the example where a companys consumer database might enrichment algorithm measures the quality of the output have the name and address of its consumers. Being able of a data source and its overall utility to the enrichment to use publically available data sources such as LinkedIn, process. White Pages and Facebook to find information such as em- ployment details and interests, can help the company col- 3. Dynamic selection adaptation of data sources: Using lect features for tasks such as customer segmentation. 
Cur- the data availability and utility scores from prior invo- rently, this is done in a static fashion where a business buys cations, the enrichment framework penalized or rewards data from one or two sources and statically integrates with data sources, which affects how often they are selected. their internal data. However, this approach has the following The rest of the paper is organized as follows: shortcomings: 1. Inconsistencies in input data: It is not uncommon to Motivation have different set of missing attributes across records in the input data. For example, for some consumers, the We motivate the need for data enrichment through three real- street address information might be missing and for oth- world examples gathered from large Fortune 500 companies ers, information about the city might be missing. To ad- that are clients of Accenture. dress this issue, one must be able to select the data sources Creating comprehensive customer models: Creating comprehensive customer models has become a holy grail in Copyright c 2012, Association for the Advancement of Artificial the consumer business, especially in verticals like retail and Intelligence (www.aaai.org). All rights reserved. healthcare. The information that is collected impacts key de-
1. Personalized and targeted promotions: The more information a business has about its customers, the better it can personalize deals and promotions. Further, a business could also use aggregated information (such as the interests of people living in a neighborhood) to manage inventory and run local deals and promotions.

2. Better segmentation and analytics: The provider may need more information about their customers, beyond what they have in the profile, for better segmentation and analytics. For example, an e-commerce site may know a person's browsing and buying history on its site and have some limited information about the person, such as their address and credit card information. However, understanding the person's professional activities and hobbies may yield additional features for customer segmentation that can be used for suggestions or promotions.

3. Fraud detection: The provider may need more information about their customers for detecting fraud. Providers typically create detailed customer profiles to predict their behaviors and detect anomalies. Having demographic and other attributes such as interests and hobbies helps build more accurate customer behavior profiles. Most e-commerce providers are under a lot of pressure to detect fraudulent activities on their sites as early as possible, so that they can limit their exposure to lawsuits, compliance penalties, or even loss of reputation.

Often businesses engage customers to register for programs like reward cards, or connect with them using social media. This limited initial engagement gives them access to basic information about a customer, such as name, email, address, and social media handles. However, in the vast majority of cases, such information is incomplete and the gaps are not uniform. For example, for a customer John Doe, a business might have the name, street address, and a phone number, whereas for Jane Doe the available information might be name, email, and a Twitter handle. Leveraging the basic information and completing the gaps, also called creating a 360 degree customer view, is a significant challenge. Current approaches to addressing this challenge largely revolve around subscribing to data sources like Experian. This approach has the following shortcomings:

1. The enrichment task is restricted to the attributes provided by the one or two data sources that they buy from. If they need some other attributes about the customers, it is hard to get them.

2. The selected data sources may have high quality information about some attributes, but poor quality about others. Even if the e-commerce provider knows about other sources that have those attributes, it is hard to manually integrate more sources.

3. There is no good way to monitor whether there is any degradation in the quality of the data sources.

Using the enrichment framework in this context would allow the e-commerce provider to dynamically select the best set of sources for a particular attribute in a particular data enrichment task. The proposed framework can switch sources across customer records if the most preferred source does not have information about some attributes for a particular record. For low confidence values, the proposed system uses reconciliation across sources to increase the confidence of the value. The framework also continuously monitors and downgrades sources if there is any loss of quality.

Capital Equipment Maintenance: Companies within the energy and resources industry have significant investments in capital equipment (i.e. drills, oil pumps, etc.). Accurate data about this equipment (e.g. manufacturer, model, etc.) is paramount to operational efficiency, proper maintenance, etc. The current process for capturing this data begins with manual entry, followed by manual, periodic "walk-downs" to confirm and validate the information. However, this process is error-prone, and often results in incomplete and inaccurate equipment data.

This does not have to be the case. A wealth of structured data sources (e.g. from manufacturers) exists that provides much of the incomplete, missing information. Hence, a solution that can automatically leverage these sources to enrich existing, internal capital equipment data can significantly improve the quality of the data, which in turn can improve operational efficiency and enable proper maintenance.

Competitive Intelligence: The explosive growth of external data (i.e. data outside the business such as Web data, data providers, etc.) can enable businesses to gather rich intelligence about their competitors. For example, companies in the energy and resources industry are very interested in competitive insights such as where a competitor is drilling (or planning to drill); disruptions to drilling due to accidents, weather, etc.; and more.

To gather these insights, companies currently purchase relevant data from third party sources – IHS and Dodson, for example, aggregate and sell drilling data – and manually enrich existing internal data to generate a comprehensive view of the competitive environment. However, this process is a manual one, which makes it difficult to scale beyond a small handful of data sources. Many useful data sources that are open (public access), such as sources that provide weather data based on GIS information, are omitted, resulting in gaps in the intelligence gathered. A solution that can automatically perform this enrichment across a broad range of data sources can provide more in-depth, comprehensive competitive insight.

Overview of Data Enrichment Algorithm

Our Data Enrichment Framework (DEF) takes two inputs – 1) an instance of a data object to be enriched and 2) a set of data sources to use for the enrichment – and outputs an enriched version of the input instance.

DEF enriches the input instance through the following steps. DEF first assesses the importance of each attribute in the input instance. This information is then used by DEF to guide the selection of appropriate data sources to use. Finally, DEF determines the utility of the sources used, so it can adapt its usage of these sources (either in a favorable or unfavorable manner) going forward.
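To make the control flow concrete, the following minimal Python skeleton (our illustration, not the authors' implementation) shows one possible rendering of this loop. The representation of an instance as "known"/"unknown" attribute sets and the injected callables select_source, call_source, and update_utility are assumptions that stand in for the components defined in the following sections.

def enrich(instance, sources, select_source, call_source, update_utility):
    """Enrich `instance` in place by repeatedly calling selected sources.

    `instance` is assumed to be a dict with two keys:
      "known":   {attribute: value}          (d_k, importance assessed beforehand)
      "unknown": set of attribute names      (d_u, the enrichment targets)
    """
    while instance["unknown"]:
        source = select_source(instance, sources)    # highest-fitness source above threshold R
        if source is None:                           # no remaining source is worth calling
            break
        response = call_source(source, instance)     # dict of attribute -> value
        for attr, value in response.items():
            if attr in instance["unknown"]:
                instance["unknown"].discard(attr)    # move the attribute from d_u ...
                instance["known"][attr] = value      # ... to d_k
        update_utility(source, instance, response)   # reward or penalize the source
    return instance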
Preliminaries

A data object D is a collection of attributes describing a real-world object of interest. We formally define D as {a_1, a_2, ..., a_n}, where a_i is an attribute.

An instance d of a data object D is a partial instantiation of D – i.e. some attributes a_i may not have an instantiated value. We formally define d as having two elements, d_k and d_u. d_k consists of attributes whose values are known (i.e. instantiated), which we define formally as d_k = {<a, v(a), k_a, k_{v(a)}>, ...}, where v(a) is the value of attribute a, k_a is the importance of a to the data object D that d is an instance of (ranging from 0.0 to 1.0), and k_{v(a)} is the confidence in the correctness of v(a) (ranging from 0.0 to 1.0). d_u consists of attributes whose values are unknown and hence are the targets for enrichment. We define d_u formally as d_u = {<a, k_a>, ...}.

Attribute Importance Assessment

Given an instance d of a data object, DEF first assesses (and sets) the importance k_a of each attribute a to the data object that d is an instance of. DEF uses the importance to guide the subsequent selection of appropriate data sources for enrichment (see the next subsection).

Our definition of importance is based on the intuition that an attribute a has high importance to a data object D if its values are highly unique across all instances of D. For example, the attribute e-mail contact should have high importance to the Customer data object because it satisfies this intuition. However, the attribute Zip should have lower importance to the Customer object because it does not – i.e. many instances of the Customer object have the same zipcode.

DEF captures the above intuition formally with the following equations:

k_a = \frac{X^2}{1 + X^2}    (1)

where

X = \frac{U(a, D)}{|N(D)|} \, H_{N(D)}(a)    (2)

and

H_{N(D)}(a) = -\sum_{v \in a} P_v \log P_v    (3)

U(a, D) is the number of unique values of a across all instances of the data object D observed by DEF so far, and N(D) is the set of all instances of D observed by DEF so far. H_{N(D)}(a) is the entropy of the values of a across N(D), and serves as a proxy for the distribution of the values of a.

We note that DEF recomputes k_a as new instances of the data object containing a are observed. Hence, the importance of an attribute to a data object will change over time.
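For illustration, a small, self-contained Python sketch of this computation (variable names are ours, not the paper's) could look as follows, where the argument is the list of values of one attribute seen across all observed instances of D.

import math
from collections import Counter

def attribute_importance(observations):
    """Compute k_a for one attribute from its observed values (Equations 1-3)."""
    n = len(observations)
    if n == 0:
        return 0.0
    counts = Counter(observations)
    # Entropy of the value distribution (Equation 3), a proxy for how spread out the values are.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Scale entropy by the fraction of unique values (Equation 2).
    x = (len(counts) / n) * entropy
    # Squash into [0, 1) (Equation 1).
    return x ** 2 / (1 + x ** 2)

# Example: a mostly repeated attribute (e.g. Zip) scores low,
# while a highly unique attribute (e.g. e-mail) scores high.
print(attribute_importance(["95110"] * 8 + ["95112"] * 2))
print(attribute_importance([f"user{i}@example.com" for i in range(10)]))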
Data Source Selection

DEF selects data sources to enrich attributes of a data object instance d whose values are unknown. DEF will repeat this step until either there are no attributes in d whose values are unknown or there are no more sources to select.

DEF considers two important factors when selecting the next best source to use: 1) whether the source will be able to provide values if called, and 2) whether the source targets unknown attributes in d_u (especially attributes with high importance). DEF satisfies the first factor by measuring how well the known values of d match the inputs required by the source. If there is a good match, then the source is more likely to return values when it is called. DEF also considers the number of times a source was called previously (while enriching d) to prevent "starvation" of other sources. DEF satisfies the second factor by measuring how many high-importance, unknown attributes the source claims to provide. If a source claims to provide a large number of these attributes, then DEF should select the source over others. This second factor serves as the selection bias.

DEF formally captures these two considerations with the following equation:

F_s = \frac{1}{2^{M-1}} B_s + \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|} + \frac{\sum_{a \in d_u \cap O_s} k_a}{|d_u|}    (4)

where B_s is the base fitness score of a data source s being considered (this value is randomly set between 0.5 and 0.75 when DEF is initialized), I_s is the set of input attributes to the data source, O_s is the set of output attributes from the data source, and M is the number of times the data source has been selected in the context of enriching the current data object instance.

The data source with the highest score F_s that also exceeds a predefined minimum threshold R is selected as the next source to use for enrichment.

For each unknown attribute a' enriched by the selected data source, DEF moves it from d_u to d_k and computes the confidence k_{v(a')} in the value provided for a' by the selected source. This confidence is used in subsequent iterations of the enrichment process, and is computed using the following formula:

k_{v(a')} = \begin{cases} e^{\frac{1}{|V_{a'}|} - 1} \, W, & \text{if } k_{v(a')} = Null \\ e^{\lambda (k_{v(a')} - 1)}, & \text{if } k_{v(a')} \neq Null \end{cases}    (5)

where

W = \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|}    (6)

W is the confidence over all input attributes to the source, and V_{a'} is the set of output values returned by a data source for an unknown attribute a'.

This formula captures two important factors. First, if multiple values are returned, then there is ambiguity and hence the confidence in the output should be discounted. Second, if an output value is corroborated by output values given by previously selected data sources, then the confidence should be further increased. The λ factor is the corroboration factor (≤ 1.0), and defaults to 1.0.
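The self-contained Python sketch below (our illustration; the dictionaries and attribute names are hypothetical stand-ins for d, I_s, and O_s) shows how the fitness score of Equation 4 and the confidence update of Equations 5 and 6 could be computed.

import math

def source_fitness(base_score, times_selected, known_conf, importance,
                   unknown_attrs, input_attrs, output_attrs):
    """Fitness F_s of one source (Equation 4).

    known_conf:    {attribute: confidence k_v(a)} for attributes in d_k
    importance:    {attribute: importance k_a}
    unknown_attrs: set of attribute names in d_u
    input_attrs:   set I_s of the source's input attributes
    output_attrs:  set O_s of the source's output attributes
    """
    input_match = sum(c for a, c in known_conf.items() if a in input_attrs) / len(input_attrs)
    coverage = sum(importance[a] for a in unknown_attrs & output_attrs) / len(unknown_attrs)
    return base_score / 2 ** (times_selected - 1) + input_match + coverage

def value_confidence(returned_values, input_match, prior_conf=None, lam=1.0):
    """Confidence k_v(a') in a newly enriched value (Equations 5 and 6).

    returned_values: values the source returned for the attribute (V_a')
    input_match:     W, the confidence over the source's input attributes
    prior_conf:      existing confidence if the attribute already had a value
    """
    if prior_conf is None:                        # no prior value: discount by ambiguity
        return math.exp(1.0 / len(returned_values) - 1.0) * input_match
    return math.exp(lam * (prior_conf - 1.0))     # corroboration boosts an existing value

# Example with hypothetical attribute names.
print(source_fitness(base_score=0.6, times_selected=1,
                     known_conf={"name": 0.9, "city": 0.8},
                     importance={"email": 0.8, "occupation": 0.4},
                     unknown_attrs={"email", "occupation"},
                     input_attrs={"name", "city"},
                     output_attrs={"email"}))
print(value_confidence(["john@example.com"], input_match=0.85))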
In addition to selecting appropriate data sources to use, DEF must also resolve ambiguities that occur during the enrichment process. For example, given the following instance of the Customer data object:

(Name: John Smith, City: San Jose, Occupation: NULL)

a data source may return multiple values for the unknown attribute Occupation (e.g. Programmer, Artist, etc.). To resolve this ambiguity, DEF will branch the original instance – one branch for each returned value – and each branched instance will be subsequently enriched using the same steps above. Hence, a single data object instance may result in multiple instances at the end of the enrichment process.

DEF will repeat the above process until either d_u is empty or there are no sources whose score F_s exceeds R. Once this process terminates, DEF computes the fitness of each resulting instance using the following equation:

\frac{\sum_{a \in d_k \cap d_U} k_{v(a)} \, k_a}{|d_k \cup d_u|}    (7)

and returns the top K instances.
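A compact, illustrative Python sketch of this branch-and-rank behavior follows; it is our construction, and the instance layout and attribute names are hypothetical.

def branch(instance, attribute, returned_values):
    """Create one copy of the instance per ambiguous returned value."""
    branches = []
    for value in returned_values:
        copy = {"known": dict(instance["known"]), "unknown": set(instance["unknown"])}
        copy["known"][attribute] = value
        copy["unknown"].discard(attribute)
        branches.append(copy)
    return branches

def instance_fitness(instance, importance, confidence, enriched_attrs):
    """Fitness of a finished instance (Equation 7): confidence-weighted importance of
    the enriched attributes, normalized by the total number of attributes."""
    total = len(instance["known"]) + len(instance["unknown"])
    gained = sum(confidence[a] * importance[a]
                 for a in enriched_attrs if a in instance["known"])
    return gained / total

# Example: the Customer instance above branches on two Occupation values.
customer = {"known": {"Name": "John Smith", "City": "San Jose"}, "unknown": {"Occupation"}}
for b in branch(customer, "Occupation", ["Programmer", "Artist"]):
    print(b["known"]["Occupation"],
          instance_fitness(b, importance={"Occupation": 0.7},
                           confidence={"Occupation": 0.6},
                           enriched_attrs={"Occupation"}))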
The en- DEF captures this notion formally with the following richment framework requires every data source to have a equation: connector and that each connector have two operations: 1)    a return schema GET operation that returns the input and 1  1 −1 PT v(a) output schema, and 2) a get data POST operation that takes Us = W  e |Va | ka − ka   as input the input for the data source as POST parameters |Os | a∈Os + a∈Os − and returns the response as a JSON. For internal databases, (8) we have special connectors that wrap queries as RESTful where, end points. Once a response is obtained from the connec- tor, the enrichment engine computes the output value con- PT (v(a)) , if |Va | = 1 fidence, applies the necessary mapping rules, and integrates PT (v(a)) = argmin PT (v(a)) , if |Va | > 1 (9) v(a)∈Va the response with the existing data object. In addition to this, the source degradation factor is also computed. The map- + Os are the output attributes from a data source for which ping, invocation, confidence and source degradation value − values were returned, Os are the output attributes from computation steps are repeated until either all values for all the same source for which values were not returned, and attributes are computed or if all sources have been invoked. PT (v(a)) is the relative frequency of the value v(a) over the The result is then written into the enrichment database.
In designing the enrichment framework, we have adopted a service oriented approach, with the goal of exposing the enrichment framework as a "platform as a service". The core tasks in the framework are exposed as RESTful end points. These include end points for creating data objects, importing datasets, adding data sources, and starting an enrichment task. When the "start enrichment task" resource is invoked and a task is successfully started, the framework responds with a JSON document that contains the enrichment task identifier. This identifier can then be used to GET the enriched data from the database. The framework supports both batch GET and streaming GET using the comet (?) pattern.

Data mapping is one of the key challenges in any data integration system. While an extensive research literature exists on automated and semi-automated approaches to mapping (?; ?; ?; ?; ?), it is our observation that these techniques do not guarantee the high level of accuracy required in enterprise solutions. So, we currently adopt a manual approach, aided by a graphical interface for data mapping. The source and target schemas are shown to the user as two trees, one to the left and one to the right. Users can select attributes from the source schema and draw a line between them and attributes of the target schema. Currently, our mapping system supports assignment, merge, split, numerical operations, and unit conversions. When the user saves the mappings, they are stored as mapping rules. Each mapping rule is represented as a tuple containing the source attributes, target attributes, mapping operation, and conditions. Conditions include merge and split delimiters and conversion factors.
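To illustrate the rule representation just described, the sketch below shows one plausible (hypothetical, not the authors') encoding of mapping rules as tuples, together with a small interpreter for the assignment, merge, and unit-conversion operations; the attribute names, delimiter, and conversion factor are invented for the example.

# Each rule: (source_attributes, target_attribute, operation, conditions).
RULES = [
    (["email_address"], "email", "assign", {}),
    (["first_name", "last_name"], "name", "merge", {"delimiter": " "}),
    (["height_in"], "height_cm", "convert", {"factor": 2.54}),
]

def apply_rules(record, rules):
    """Apply mapping rules to one source record and return the target record."""
    target = {}
    for source_attrs, target_attr, operation, conditions in rules:
        values = [record[a] for a in source_attrs if a in record]
        if not values:
            continue
        if operation == "assign":
            target[target_attr] = values[0]
        elif operation == "merge":
            target[target_attr] = conditions["delimiter"].join(str(v) for v in values)
        elif operation == "convert":
            target[target_attr] = values[0] * conditions["factor"]
    return target

print(apply_rules({"first_name": "Jane", "last_name": "Doe", "height_in": 65}, RULES))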
Challenges and Experiences

• Data services and APIs are an integral part of the system.
• They have been around for a while, but have gained minimal traction in the enterprise.
• API rate limiting and the lack of SLA-driven API contracts are obstacles.
• Poor API documentation often leads the developer in circles; minimalism in APIs might not be a good idea.
• Changes are not "pushed" to developers, who often find out only when an application breaks.
• Inconsistent auth mechanisms across different APIs, and mandatory auth, make usage expensive (even when pulling information that is open on the Web).
• It is easy to develop prototypes, but the APIs do not instill enough confidence to develop a deployable client solution.

Related Work

Web APIs and data services have helped establish the Web as a platform. Web APIs have enabled the deployment and use of services on the Web using standardized communication protocols and message formats. Leveraging Web APIs, data services have allowed access, in a standardized manner, to vast amounts of data that were hitherto hidden in proprietary silos. A notable outcome is the development of mashups, or Web application hybrids. A mashup is created by integrating data from various services on the Web using their APIs. Although early mashups were consumer centric applications, their adoption within the enterprise has been increasing, especially in addressing data intensive problems such as data filtering, data transformation, and data enrichment.