Data Enrichment using Web APIs
Karthik Gomadam, Peter Z. Yeh, Kunal Verma
Accenture Technology Labs
50, W. San Fernando St,
San Jose, CA
{karthik.gomadam, peter.z.yeh, k.verma}@accenture.com
Abstract

As businesses seek to monetize their data, they are leveraging Web-based delivery mechanisms to provide publicly available data sources. Also, as analytics becomes a central part of many business functions such as customer segmentation, competitive intelligence, and fraud detection, many businesses are seeking to enrich their internal data records with data from these data sources. As the number of sources with varying degrees of accuracy and quality proliferates, it is a non-trivial task to effectively select which sources to use for a particular enrichment task. The old model of statically buying data from one or two providers becomes inefficient because of the rapid growth of new forms of useful data, such as social media, and the lack of dynamism to plug sources in and out. In this paper, we present the data enrichment framework, a tool that uses data mining and other semantic techniques to automatically guide the selection of sources. The enrichment framework also monitors the quality of the data sources and automatically penalizes sources that continue to return low quality results.

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

As enterprises become more data and analytics driven, many businesses are seeking to enrich their internal data records with data from data sources available on the Web. Consider the example where a company's consumer database might have the name and address of its consumers. Being able to use publicly available data sources such as LinkedIn, White Pages, and Facebook to find information such as employment details and interests can help the company collect features for tasks such as customer segmentation. Currently, this is done in a static fashion, where a business buys data from one or two sources and statically integrates it with its internal data. However, this approach has the following shortcomings:

1. Inconsistencies in input data: It is not uncommon to have different sets of missing attributes across records in the input data. For example, for some consumers the street address information might be missing, and for others information about the city might be missing. To address this issue, one must be able to select the data sources at a granular level, based on the input data that is available and the data sources that can be used.

2. Quality of a data service may vary, depending on the input data: Calling a data source with some missing values (even though they may not be mandatory) can result in poor quality results. In such a scenario, one must be able to find the missing values before calling the source, or use an alternative.

In this paper, we present an overview of a data enrichment framework which attempts to automate many sub-tasks of data enrichment. The framework uses a combination of data mining and semantic technologies to automate various tasks, such as calculating which attributes are more important than others for source selection, selecting sources based on the information available about a data record and the past performance of the sources, using multiple sources to reinforce low confidence values, monitoring the quality of sources, and adapting the choice of sources based on past performance. The data enrichment framework makes the following contributions:

1. Granular, context-dependent ordering of data sources: We have developed a novel approach to deciding the order in which data sources are called, based on the data that is available at that point.

2. Automated assessment of data source quality: Our data enrichment algorithm measures the quality of the output of a data source and its overall utility to the enrichment process.

3. Dynamic adaptation of data source selection: Using the data availability and utility scores from prior invocations, the enrichment framework penalizes or rewards data sources, which affects how often they are selected.

The rest of the paper is organized as follows:

Motivation

We motivate the need for data enrichment through three real-world examples gathered from large Fortune 500 companies that are clients of Accenture.

Creating comprehensive customer models: Creating comprehensive customer models has become a holy grail in the consumer business, especially in verticals like retail and healthcare. The information that is collected impacts key decisions like inventory management, promotions and rewards, and the delivery of a more personal experience.

1. Personalized and targeted promotions: The more information a business has about its customers, the better it can personalize deals and promotions. Further, a business could also use aggregated information (such as the interests of people living in a neighborhood) to manage inventory and run local deals and promotions.

2. Better segmentation and analytics: The provider may need more information about their customers, beyond what they have in the profile, for better segmentation and analytics. For example, a certain e-commerce site may know a person's browsing and buying history on their site and have some limited information about the person, such as their address and credit card information. However, understanding the person's professional activities and hobbies may help them get more features for customer segmentation that they can use for suggestions or promotions.

3. Fraud detection: The provider may need more information about their customers for detecting fraud. Providers typically create detailed customer profiles to predict their behaviors and detect anomalies. Having demographic and other attributes such as interests and hobbies helps build more accurate customer behavior profiles. Most e-commerce providers are typically under a lot of pressure to detect fraudulent activities on their site as early as possible, so that they can limit their exposure to lawsuits, compliance penalties, or even loss of reputation.

Often businesses engage customers to register for programs like reward cards, or connect with customers using social media. This limited initial engagement gives them access to basic information about a customer, such as name, email, address, and social media handles. However, in the vast majority of cases, such information is incomplete, and the gaps are not uniform. For example, for a customer John Doe, a business might have the name, street address, and a phone number, whereas for Jane Doe, the available information will be name, email, and a Twitter handle. Leveraging the basic information and completing the gaps, also called creating a 360 degree customer view, is a significant challenge. Current approaches to addressing this challenge largely revolve around subscribing to data sources like Experian. This approach has the following shortcomings:

1. The enrichment task is restricted to the attributes provided by the one or two data sources that they buy from. If they need some other attributes about the customers, it is hard to get them.

2. The selected data sources may have high quality information about some attributes, but poor quality about some others. Even if the e-commerce provider knows about other sources which have those attributes, it is hard to manually integrate more sources.

3. There is no good way to monitor whether there is any degradation in the quality of data sources.

Using the enrichment framework in this context would allow the e-commerce provider to dynamically select the best set of sources for a particular attribute in a particular data enrichment task. The proposed framework can switch sources across customer records if the most preferred source does not have information about some attributes for a particular record. For low confidence values, the proposed system uses reconciliation across sources to increase the confidence of the value. The framework also continuously monitors and downgrades sources if there is any loss of quality.

Capital Equipment Maintenance: Companies within the energy and resources industry have significant investments in capital equipment (e.g., drills, oil pumps). Accurate data about this equipment (e.g., manufacturer, model) is paramount to operational efficiency, proper maintenance, and more.

The current process for capturing this data begins with manual entry, followed by manual, periodic "walk-downs" to confirm and validate the information. However, this process is error-prone, and often results in incomplete and inaccurate data about the equipment.

This does not have to be the case. A wealth of structured data sources (e.g., from manufacturers) exists that provides much of the incomplete, missing information. Hence, a solution that can automatically leverage these sources to enrich existing, internal capital equipment data can significantly improve the quality of the data, which in turn can improve operational efficiency and enable proper maintenance.

Competitive Intelligence: The explosive growth of external data (i.e., data outside the business, such as Web data and data providers) can enable businesses to gather rich intelligence about their competitors. For example, companies in the energy and resources industry are very interested in competitive insights such as where a competitor is drilling (or planning to drill); disruptions to drilling due to accidents, weather, etc.; and more.

To gather these insights, companies currently purchase relevant data from third party sources (IHS and Dodson are just two examples of third party data sources that aggregate and sell drilling data) to manually enrich existing internal data and generate a comprehensive view of the competitive environment. However, this is a manual process, which makes it difficult to scale beyond a small handful of data sources. Many useful data sources that are open to public access (e.g., sources that provide weather data based on GIS information) are omitted, resulting in gaps in the intelligence gathered.

A solution that can automatically perform this enrichment across a broad range of data sources can provide more in-depth, comprehensive competitive insight.

Overview of Data Enrichment Algorithm

Our Data Enrichment Framework (DEF) takes two inputs, 1) an instance of a data object to be enriched and 2) a set of data sources to use for the enrichment, and outputs an enriched version of the input instance.

DEF enriches the input instance through the following steps. DEF first assesses the importance of each attribute in the input instance. This information is then used by DEF to guide the selection of appropriate data sources to use. Finally, DEF determines the utility of the sources used, so it can adapt its usage of these sources (either in a favorable or unfavorable manner) going forward.
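The overall loop can be sketched in a few lines of self-contained code. This is our own illustrative toy, not the authors' implementation: the two sources, their lookup tables, and the simplified fitness score (input coverage plus coverage of unknown attributes) are invented for the example.

```python
# Toy sources: each declares the inputs it requires, the outputs it claims
# to provide, and a small lookup table standing in for the remote API.
SOURCES = {
    "whitepages": {"inputs": {"name"}, "outputs": {"city"},
                   "data": {"John Smith": {"city": "San Jose"}}},
    "linkedin": {"inputs": {"name", "city"}, "outputs": {"occupation"},
                 "data": {"John Smith": {"occupation": "Programmer"}}},
}

def fitness(record, source):
    """Simplified stand-in for DEF's fitness score: how well the known
    attributes cover the source's inputs, plus how many unknown
    attributes the source claims to provide."""
    known = {a for a, v in record.items() if v is not None}
    unknown = {a for a, v in record.items() if v is None}
    input_match = len(known & source["inputs"]) / len(source["inputs"])
    output_gain = len(unknown & source["outputs"]) / max(len(unknown), 1)
    return input_match + output_gain

def enrich(record, sources, threshold=0.5):
    """DEF-style loop: repeatedly pick the fittest unused source above a
    minimum threshold, integrate its values, and stop when nothing is
    unknown or no source qualifies."""
    used = set()
    while any(v is None for v in record.values()):
        candidates = [(fitness(record, s), name)
                      for name, s in sources.items() if name not in used]
        if not candidates:
            break
        score, best = max(candidates)
        if score < threshold:
            break
        used.add(best)
        returned = sources[best]["data"].get(record.get("name"), {})
        for attr, value in returned.items():
            if record.get(attr) is None:
                record[attr] = value   # move the attribute from d_u to d_k
    return record
```

Note how the loop naturally calls White Pages first (its single input is already known), and only then LinkedIn, whose inputs are fully satisfied by the newly enriched city.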
Preliminaries

A data object D is a collection of attributes describing a real-world object of interest. We formally define D as {a1, a2, ..., an}, where ai is an attribute.

An instance d of a data object D is a partial instantiation of D, i.e., some attributes ai may not have an instantiated value. We formally define d as having two elements, dk and du. dk consists of attributes whose values are known (i.e., instantiated), which we define formally as dk = {<a, v(a), ka, kv(a)>, ...}, where v(a) is the value of attribute a, ka is the importance of a to the data object D that d is an instance of (ranging from 0.0 to 1.0), and kv(a) is the confidence in the correctness of v(a) (ranging from 0.0 to 1.0). du consists of attributes whose values are unknown and hence are the targets for enrichment. We define du formally as du = {<a, ka>, ...}.
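The split of an instance into dk and du can be captured in a small data structure. The following is an illustrative sketch; the class and field names are our own, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """A data object instance d: known attributes (d_k) carry a value and a
    confidence k_v(a); unknown attributes (d_u) are the enrichment targets;
    importance holds k_a for every attribute."""
    known: dict = field(default_factory=dict)       # d_k: attr -> (value, confidence)
    unknown: set = field(default_factory=set)       # d_u: attrs without values
    importance: dict = field(default_factory=dict)  # k_a per attribute

    def learn(self, attr, value, confidence):
        """Move an attribute from d_u to d_k once a source provides a value."""
        self.unknown.discard(attr)
        self.known[attr] = (value, confidence)

d = Instance(known={"name": ("John Doe", 1.0)},
             unknown={"email", "occupation"},
             importance={"name": 0.9, "email": 0.8, "occupation": 0.4})
d.learn("email", "jdoe@example.com", 0.7)
```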
Attribute Importance Assessment

Given an instance d of a data object, DEF first assesses (and sets) the importance ka of each attribute a to the data object that d is an instance of. DEF uses the importance to guide the subsequent selection of appropriate data sources for enrichment (see the next subsection).

Our definition of importance is based on the intuition that an attribute a has high importance to a data object D if its values are highly unique across all instances of D. For example, the attribute e-mail contact should have high importance to the Customer data object because it satisfies this intuition. However, the attribute Zip should have lower importance to the Customer object because it does not, i.e., many instances of the Customer object have the same zipcode.

DEF captures the above intuition formally with the following equations:

    ka = X^2 / (1 + X^2)    (1)

where

    X = (U(a, D) / |N(D)|) * H_N(D)(a)    (2)

and

    H_N(D)(a) = - Σ_{v ∈ a} P_v log P_v    (3)

U(a, D) is the number of unique values of a across all instances of the data object D observed by DEF so far, and N(D) is all instances of D observed by DEF so far. H_N(D)(a) is the entropy of the values of a across N(D), and serves as a proxy for the distribution of the values of a.

We note that DEF recomputes ka as new instances of the data object containing a are observed. Hence, the importance of an attribute to a data object will change over time.

Data Source Selection

DEF selects data sources to enrich attributes of a data object instance d whose values are unknown. DEF will repeat this step until either there are no attributes in d whose values are unknown or there are no more sources to select.

DEF considers two important factors when selecting the next best source to use: 1) whether the source will be able to provide values if called, and 2) whether the source targets unknown attributes in du (esp. attributes with high importance). DEF satisfies the first factor by measuring how well the known values of d match the inputs required by the source. If there is a good match, then the source is more likely to return values when it is called. DEF also considers the number of times a source was called previously (while enriching d) to prevent "starvation" of other sources.

DEF satisfies the second factor by measuring how many high-importance, unknown attributes the source claims to provide. If a source claims to provide a large number of these attributes, then DEF should select the source over others. This second factor serves as the selection bias.

DEF formally captures these two considerations with the following equation:

    Fs = (1 / 2^(M-1)) Bs + (Σ_{a ∈ dk ∩ Is} kv(a)) / |Is| + (Σ_{a ∈ du ∩ Os} ka) / |du|    (4)

where Bs is the base fitness score of the data source s being considered (this value is randomly set between 0.5 and 0.75 when DEF is initialized), Is is the set of input attributes to the data source, Os is the set of output attributes from the data source, and M is the number of times the data source has been selected in the context of enriching the current data object instance.

The data source with the highest score Fs that also exceeds a predefined minimum threshold R is selected as the next source to use for enrichment.

For each unknown attribute a' enriched by the selected data source, DEF moves it from du to dk, and computes the confidence kv(a') in the value provided for a' by the selected source. This confidence is used in subsequent iterations of the enrichment process, and is computed using the following formula:

    kv(a') = W * e^(1/|Va'| - 1),       if kv(a') = Null
    kv(a') = e^(λ (kv(a') - 1)),        if kv(a') ≠ Null    (5)

where

    W = (Σ_{a ∈ dk ∩ Is} kv(a)) / |Is|    (6)

W is the confidence over all input attributes to the source, and Va' is the set of output values returned by a data source for an unknown attribute a'.

This formula captures two important factors. First, if multiple values are returned, then there is ambiguity, and hence the confidence in the output should be discounted. Second, if an output value is corroborated by output values given by previously selected data sources, then the confidence should be further increased. The λ factor is the corroboration factor (< 1.0), and defaults to 1.0.
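To make Equations 1-3 concrete, here is a small sketch of the importance computation. It is our own illustrative code under two assumptions not fixed by the paper: P_v is taken as the relative frequency of value v, and the logarithm is natural.

```python
import math
from collections import Counter

def attribute_importance(values):
    """Importance k_a (Eqs. 1-3): entropy of the observed values scaled by
    the uniqueness ratio, squashed into [0, 1) via x^2 / (1 + x^2)."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())  # Eq. 3
    x = (len(counts) / n) * entropy                                     # Eq. 2
    return x * x / (1 + x * x)                                          # Eq. 1

# Highly unique values (e.g., e-mail) score much higher than repetitive
# ones (e.g., Zip), matching the paper's intuition.
emails = [f"user{i}@example.com" for i in range(100)]
zips = ["95113"] * 90 + ["95112"] * 10
assert attribute_importance(emails) > attribute_importance(zips)
```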
In addition to selecting appropriate data sources to use, DEF must also resolve ambiguities that occur during the enrichment process. For example, given the following instance of the Customer data object:

    (Name: John Smith, City: San Jose, Occupation: NULL)

a data source may return multiple values for the unknown attribute Occupation (e.g., Programmer, Artist, etc.).

To resolve this ambiguity, DEF will branch the original instance, one branch for each returned value, and each branched instance will be subsequently enriched using the same steps above. Hence, a single data object instance may result in multiple instances at the end of the enrichment process.

DEF will repeat the above process until either du is empty or there are no sources whose score Fs exceeds R. Once this process terminates, DEF computes the fitness of each resulting instance using the following equation:

    (Σ_{a ∈ dk} kv(a) ka) / |dk ∪ du|    (7)

and returns the top K instances.
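The branching and final ranking can be illustrated as follows. This is our own sketch; following Equation 7, each branch's fitness is the confidence-weighted importance of its known attributes, normalized by the total number of attributes.

```python
def instance_fitness(known, unknown, importance):
    """Eq. 7: sum of confidence * importance over known attributes,
    divided by the total number of attributes in the instance."""
    total = len(known) + len(unknown)
    return sum(conf * importance[a] for a, (_, conf) in known.items()) / total

# Two branches produced for an ambiguous Occupation value:
importance = {"name": 0.9, "occupation": 0.6}
branch_a = {"name": ("John Smith", 1.0), "occupation": ("Programmer", 0.8)}
branch_b = {"name": ("John Smith", 1.0), "occupation": ("Artist", 0.3)}

ranked = sorted([branch_a, branch_b],
                key=lambda b: instance_fitness(b, set(), importance),
                reverse=True)
top_k = ranked[:1]  # keep the K best-supported instances (K = 1 here)
```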
Data Source Utility Adaptation

Once a data source has been called, DEF determines the utility of the source in enriching the data object instance of interest. Intuitively, DEF models the utility of a data source as a "contract", i.e., if DEF provides a data source with high confidence input values, then it is reasonable to expect the data source to provide values for all the output attributes that it claims to target. Moreover, these values should not be generic and should have low ambiguity. If these expectations are violated, then the data source should be penalized heavily.

On the other hand, if DEF did not provide a data source with good inputs, then the source should be penalized minimally (if at all) if it fails to provide any useful outputs.

Alternatively, if a data source is able to provide unambiguous values for unknown attributes in the data object instance (esp. high importance attributes), then DEF should reward the source and give it more preference going forward.

DEF captures this notion formally with the following equation:

    Us = (W / |Os|) * ( Σ_{a ∈ Os+} e^(1/|Va| - 1) PT(v(a)) ka  -  Σ_{a ∈ Os-} ka )    (8)

where

    PT(v(a)) = PT(v(a)),                      if |Va| = 1
    PT(v(a)) = min_{v(a) ∈ Va} PT(v(a)),      if |Va| > 1    (9)

Os+ are the output attributes from a data source for which values were returned, Os- are the output attributes from the same source for which values were not returned, and PT(v(a)) is the relative frequency of the value v(a) over the past T values returned by the data source for the attribute a. W is the confidence over all input attributes to the source, and is defined in the previous subsection.

The utilities Us of a data source from the past n calls are then used to adjust the base fitness score of the data source. This adjustment is captured with the following equation:

    Bs = Bs + γ (1/n) Σ_{i=1}^{n} Us(T - i)    (10)

where Bs is the base fitness score of a data source s, Us(T - i) is the utility of the data source i time steps back, and γ is the adjustment rate.

System Architecture

Figure 1: Design overview of enrichment framework

The main components of DEF are illustrated in Figure 1. The task manager starts a new enrichment project by instantiating and executing the enrichment engine. The enrichment engine uses the attribute computation module to calculate the attribute relevance. The relevance scores are used in source selection. Using the HTTP helper module, the engine then invokes the connector for the selected data source. A connector is a proxy that communicates with the actual data source and is a RESTful Web service in itself. The enrichment framework requires every data source to have a connector, and each connector must have two operations: 1) a return schema GET operation that returns the input and output schema, and 2) a get data POST operation that takes the input for the data source as POST parameters and returns the response as JSON. For internal databases, we have special connectors that wrap queries as RESTful end points. Once a response is obtained from the connector, the enrichment engine computes the output value confidence, applies the necessary mapping rules, and integrates the response with the existing data object. In addition to this, the source degradation factor is also computed. The mapping, invocation, confidence, and source degradation value computation steps are repeated until either all values for all attributes are computed or all sources have been invoked. The result is then written into the enrichment database.
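The connector contract described above (a schema-returning GET and a data-returning POST) can be mocked without any network machinery. The following sketch uses invented attribute names and payloads; it illustrates the two-operation contract, not the actual DEF connector API.

```python
import json

class WhitePagesConnector:
    """Mock connector: every data source sits behind a proxy that exposes
    (1) a GET returning the source's input/output schema, and
    (2) a POST taking the source's inputs and returning enrichment data
    as JSON."""

    SCHEMA = {"inputs": ["name", "city"],
              "outputs": ["phone", "street_address"]}
    # Stand-in for the remote data source.
    TABLE = {("John Smith", "San Jose"): {"phone": "408-555-0100",
                                          "street_address": "50 W San Fernando St"}}

    def get_schema(self):
        # Corresponds to the "return schema" GET operation.
        return json.dumps(self.SCHEMA)

    def get_data(self, params):
        # Corresponds to the "get data" POST operation; unknown keys
        # yield an empty JSON object.
        key = (params.get("name"), params.get("city"))
        return json.dumps(self.TABLE.get(key, {}))

connector = WhitePagesConnector()
schema = json.loads(connector.get_schema())
response = json.loads(connector.get_data({"name": "John Smith",
                                          "city": "San Jose"}))
```

The engine only ever sees the two JSON-returning operations, which is what lets internal databases and public Web APIs be plugged in behind the same interface.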
In designing the enrichment framework, we have adopted a service oriented approach, with the goal of exposing the enrichment framework as a "platform as a service". The core tasks in the framework are exposed as RESTful end points. These include end points for creating data objects, importing datasets, adding data sources, and starting an enrichment task. When the "start enrichment task" resource is invoked and a task is successfully started, the framework responds with a JSON that contains the enrichment task identifier. This identifier can then be used to GET the enriched data from the database. The framework supports both batch GET and streaming GET using the comet (?) pattern.

Data mapping is one of the key challenges in any data integration system. While extensive research literature exists on automated and semi-automated approaches to mapping (?; ?; ?; ?; ?), it is our observation that these techniques do not guarantee the high level of accuracy required in enterprise solutions. So, we currently adopt a manual approach, aided by a graphical interface for data mapping. The source and target schemas are shown to the users as two trees, one to the left and one to the right. Users can select attributes from the source schema and draw a line between them and attributes of the target schema. Currently, our mapping system supports assignment, merge, split, numerical operations, and unit conversions. When the user saves the mappings, the maps are stored as mapping rules. Each mapping rule is represented as a tuple containing the source attributes, target attributes, mapping operations, and conditions. Conditions include merge and split delimiters and conversion factors.
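A mapping rule of the kind described above could be represented as follows. This is a sketch with hypothetical field names and a single merge operation; the paper does not specify the actual rule format.

```python
from dataclasses import dataclass

@dataclass
class MappingRule:
    """A mapping rule tuple: source attributes, target attributes, the
    mapping operation, and conditions (e.g., delimiters, conversion
    factors)."""
    source_attrs: tuple
    target_attrs: tuple
    operation: str      # assignment | merge | split | numeric | unit_conversion
    conditions: dict

def apply_rule(rule, record):
    """Apply a merge rule: join the source values with the configured
    delimiter and assign the result to the target attribute."""
    if rule.operation == "merge":
        joined = rule.conditions["delimiter"].join(
            record[a] for a in rule.source_attrs)
        return {rule.target_attrs[0]: joined}
    raise NotImplementedError(rule.operation)

rule = MappingRule(("first_name", "last_name"), ("full_name",), "merge",
                   {"delimiter": " "})
out = apply_rule(rule, {"first_name": "John", "last_name": "Doe"})
```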
Challenges and Experiences

• Data services and APIs are an integral part of the system.

• They have been around for a while, but have gained minimal traction in the enterprise.

• API rate limiting, and the lack of SLA-driven API contracts.

• Poor API documentation, often leading the developer in circles; minimalism in APIs might not be a good idea.

• Changes are not "pushed" to developers, who often find out only when an application breaks.

• Inconsistent authentication mechanisms across different APIs; mandatory authentication makes it expensive (even when pulling information that is open on the Web).

• APIs make it easy to develop prototypes, but do not instill enough confidence to develop a deployable client solution.

Related Work

Web APIs and data services have helped establish the Web as a platform. Web APIs have enabled the deployment and use of services on the Web using standardized communication protocols and message formats. Leveraging Web APIs, data services have allowed access, in a standardized manner, to vast amounts of data that were hitherto hidden in proprietary silos. A notable outcome is the development of mashups, or Web application hybrids. A mashup is created by integrating data from various services on the Web using their APIs. Although early mashups were consumer centric applications, their adoption within the enterprise has been increasing, especially in addressing data intensive problems such as data filtering, data transformation, and data enrichment.