Data pricing and data license agreements
Magdalena Balazinska, U. Washington (videoconference)
-Keynote-
International conference on
“DATA, DIGITAL BUSINESS MODELS, CLOUD COMPUTING AND ORGANIZATIONAL DESIGN”
24-25 November 2014,
Université Paris–Sud
1. Magdalena Balazinska
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
ESCIENCE INSTITUTE
UNIVERSITY OF WASHINGTON
http://www.cs.washington.edu/people/faculty/magda
Data Pricing and Data License Agreements in the Cloud
2. Acknowledgments
Research team on project
• Prasang Upadhyaya (UW, lead on licensing)
• Paraschos Koutris (UW, lead on pricing)
• Prof. Dan Suciu (UW)
• Dr. Hakan Hacigumus (NEC Labs)
Sponsors
• National Science Foundation
• NEC
• Microsoft Research
Magdalena Balazinska - University of Washington 2
3. Data Has Value
• First wave of computing: value in hardware
– IBM, Intel, DEC
• Second wave of computing: value in software
– Microsoft, Oracle, Google
• Third wave of computing: value in data
– Dun & Bradstreet, Factual, Facebook, Google
4. Data itself is now a product that is being created, improved, bought and sold on the Web
5. What are the Technical Challenges?
• Challenge 1: Data License Agreements
– All data comes with terms of use
– Can we automate their enforcement?
• Challenge 2: Data Pricing
– Existing pricing methods are limited
– Can we support flexible pricing?
6. Data Comes with Terms of Use
• Maps: Overlaying map data with any other data is prohibited (Navteq)
• Digital Books: Each book may be lent once for 2 weeks while being inaccessible by the lender (Kindle)
• Medical Data: Queries that try to identify an individual referenced in the database are prohibited (MIMIC II)
Name          Ailment              Birth date     Sex  Location
John Doe      Asthma               Jan 7th 1972   M    Seattle
Mary Jane     Dislocated shoulder  Mar 21st 1965  F    San Diego
Alice Summer  Flu                  May 28th 1986  M    San Francisco
…             …                    …              …    …
Bob B         Flu                  Oct 14th 2000  M    Miami
7. More Examples
Terms of use | Source
Overlaying Navteq data with any other data is prohibited | Navteq
Each book may be lent once for a duration of 14 days and will not be readable by the lender during the loan period | Amazon Kindle
In a month, all queries may, in total, return up to 2M characters of data at the free tier | Microsoft Translator
OAuth calls are limited to 350 requests per hour | Twitter and Foursquare
Queries that try to identify an individual referenced in the database are prohibited | MIMIC II
You are required to display all attribution information and any proprietary notices associated with the Foursquare Data | Foursquare, Yelp, World Bank
Don’t aggregate or blend our star ratings and review counts with other providers. You may show content from multiple providers, but Yelp data should stand on its own … | Yelp
8. Terms of Use Control Use, Not Access
Example for medical research data
• Fine-grained access: look up a specific patient
• Augmenting data sources: join medical data with a voter registry
• Histogram [figure: histogram plot]
• Linear regression [figure: regression plot]
Legend: each use of the data is marked Permitted or Denied
Goal: Enforce policies to constrain how data is used
14. Approach Overview
• Data seller defines policies
• Data and policies are loaded into a DataLawyer-enabled database system
• Buyer queries the data
• DataLawyer checks all queries before execution
15. Challenge: Semantics
Example Policy 1: Can access up to 10K records/month.
If the buyer computes a histogram on the data and filters out some buckets, did he use the input tuples from the filtered bucket or not?
16. Challenge: Performance
Example Policy 2: Only allow aggregate queries where each output tuple must aggregate over at least 10 values.
Policies are expensive to check online!
[Figure labels: Cheap, Expensive]
17. DataLawyer Setup
• Data
• Usage logs: metadata on data usage
– Capture features of user and query behavior
– Generated by arbitrary code
– Shared across multiple policies
• Policies: declarative (DataLawyer uses SQL)
SELECT DISTINCT 'P5 violated: Fewer than 10 patients contribute to an answer'
AS errorMessage
FROM Provenance p
WHERE p.irid = 'patients'
GROUP BY p.qid, p.otid
HAVING COUNT(DISTINCT p.itid) < 10
18. Usage Log
Usage logs capture features of a query that are used in the policies
We require them to:
1. Be deterministic
2. Be append only
3. Contain a timestamp with each tuple
Examples are:
1. Provenance
2. User log
3. Static analysis of the query
4. Pricing
5. …
20. DataLawyer Workflow Example
Data: Patients
pid  disease  treatment  outcome
1    asthma   albuterol  positive
…
Query: What fraction of asthma patients were treated with albuterol?
DataLawyer: Populates the usage logs
uid  query  table    column
1    1      Patient  treatment
1    1      Patient  outcome
DataLawyer checks policies, which are queries over the usage logs
• Queries are not allowed to access column pid
• Queries must aggregate data from at least 10 rows in Patients
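The workflow on this slide can be sketched in a few lines of Python with the standard sqlite3 module. This is only an illustration under assumptions, not DataLawyer's implementation: the table name SchemaLog, the column names tbl and col (renamed because "table" and "column" are SQL keywords), and the helper violations are all hypothetical. It shows the first policy, "queries are not allowed to access column pid", written as a query over the usage log.

```python
import sqlite3

# Hypothetical schema-level usage log, mirroring the slide's (uid, query, table, column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SchemaLog (uid INTEGER, query INTEGER, tbl TEXT, col TEXT)")

# Log entries for query 1 by user 1: it reads treatment and outcome from Patient.
conn.executemany(
    "INSERT INTO SchemaLog VALUES (?, ?, ?, ?)",
    [(1, 1, "Patient", "treatment"), (1, 1, "Patient", "outcome")],
)

# A policy is a query over the log; any returned row is a violation message.
NO_PID_POLICY = """
SELECT DISTINCT 'Policy violated: query accesses column pid' AS errorMessage
FROM SchemaLog
WHERE tbl = 'Patient' AND col = 'pid'
"""

def violations(connection, policy_sql):
    """Return the error messages produced by one policy query."""
    return [row[0] for row in connection.execute(policy_sql)]

# The albuterol query never touches pid, so the policy result is empty.
allowed = violations(conn, NO_PID_POLICY)

# A second (hypothetical) query that reads pid would be flagged.
conn.execute("INSERT INTO SchemaLog VALUES (1, 2, 'Patient', 'pid')")
blocked = violations(conn, NO_PID_POLICY)
```

An empty result means the query may run; a non-empty result carries the message to show the buyer.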
21. Example Using SQL
Policy: Stop queries where fewer than 10 patients contribute to any output tuple.
SELECT DISTINCT 'P5 violated: Fewer than 10 patients contribute to an answer'
AS errorMessage
FROM Provenance p
WHERE p.irid = 'patients'
GROUP BY p.qid, p.otid
HAVING COUNT(DISTINCT p.itid) < 10
HAVING condition: if false, no violation; if true, at least one example of a violation
The policy refers to the Provenance usage log, which captures how each tuple in the query result was derived from records on disk:
Provenance(ts, // Timestamp
qid, // Query id
otid, // Output tuple id, a hash of the output tuple
irid, // Input relation id, usually the name
itid // Input tuple id, usually the name
)
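To make the policy concrete, here is a small sketch that runs the P5 query from this slide against an in-memory SQLite table with the Provenance schema. The sample rows, the column types, and the helper check_p5 are assumptions made for illustration; only the policy text and the column names come from the slide.

```python
import sqlite3

# The P5 policy from the slide, with plain ASCII quotes so it parses as SQL.
P5 = """
SELECT DISTINCT 'P5 violated: Fewer than 10 patients contribute to an answer'
       AS errorMessage
FROM Provenance p
WHERE p.irid = 'patients'
GROUP BY p.qid, p.otid
HAVING COUNT(DISTINCT p.itid) < 10
"""

def check_p5(rows):
    """Load provenance rows into a fresh log and return P5's error messages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Provenance (ts INTEGER, qid INTEGER, otid TEXT, irid TEXT, itid TEXT)"
    )
    conn.executemany("INSERT INTO Provenance VALUES (?, ?, ?, ?, ?)", rows)
    return [r[0] for r in conn.execute(P5)]

# Hypothetical provenance: output tuple "a" is derived from 12 patients,
# output tuple "b" from only 3.
wide = [(0, 1, "a", "patients", f"p{i}") for i in range(12)]
narrow = [(0, 1, "b", "patients", f"p{i}") for i in range(3)]

ok = check_p5(wide)            # every output tuple has at least 10 contributors
bad = check_p5(wide + narrow)  # tuple "b" has fewer than 10 contributors
```

Because the policy groups by (qid, otid), one under-populated output tuple is enough to produce the error message, even when other tuples of the same query are fine.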
22. Need for Optimizations
[Figure: average time for the batch (in ms) vs. batch number, where each batch consists of 120 queries. Series: Baseline, Hard; DataLawyer, Hard; Baseline, Easy; DataLawyer, Easy]
• Without optimizations, time to process queries & check policies grows
• Time grows even for simple policies
• DataLawyer keeps time constant
• DataLawyer keeps time low
23. Policy Evaluation
• There are three major steps:
– Generate the usage logs
– Evaluate policies
– Write log to disk if everything is okay, else abort
• Our optimizations:
– Avoid generating the logs
– Prune the logs by removing data no longer needed
– Avoid evaluating all policies
– Try to evaluate cheaper, partial policies first
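The three steps can be sketched as a single transaction: append the log rows, run the policies, then commit or roll back. This is a minimal sketch under assumptions and not DataLawyer's code: the UsageLog schema, the two policies, their cost estimates, and the function run_with_policies are hypothetical, and sorting by estimated cost is only a loose stand-in for the "cheaper policies first" optimization.

```python
import sqlite3

def run_with_policies(conn, log_rows, policies):
    """Append usage-log rows, evaluate policies, then commit or abort.

    policies is a list of (estimated_cost, sql) pairs; returns None when the
    query is allowed, otherwise the first violation message.
    """
    # Step 1: generate the usage log for the incoming query (opens a transaction).
    conn.executemany("INSERT INTO UsageLog VALUES (?, ?, ?)", log_rows)
    # Step 2: evaluate the policies, cheapest first, stopping at the first violation.
    for _cost, sql in sorted(policies):
        row = conn.execute(sql).fetchone()
        if row is not None:
            conn.rollback()  # Step 3b: abort, the log is left unchanged
            return row[0]
    conn.commit()            # Step 3a: everything is okay, persist the log
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UsageLog (ts INTEGER, uid INTEGER, nrecords INTEGER)")
conn.commit()

# Hypothetical policies; for brevity the quota ignores month boundaries.
POLICIES = [
    (2, "SELECT 'P2 violated: user read more than 10000 records in total' "
        "FROM UsageLog GROUP BY uid HAVING SUM(nrecords) > 10000"),
    (1, "SELECT 'P1 violated: one query returned more than 10000 records' "
        "FROM UsageLog WHERE nrecords > 10000"),
]

first = run_with_policies(conn, [(1, 7, 6000)], POLICIES)   # within quota
second = run_with_policies(conn, [(2, 7, 5000)], POLICIES)  # pushes user 7 over quota
logged = conn.execute("SELECT COUNT(*) FROM UsageLog").fetchone()[0]
```

After the second call is rejected, the log still holds only the first query's row, which is what "else abort" requires.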
25. Data License Agreements Summary
• Data comes with terms of use
– Even free data often has terms of use
• Today, terms of use are written in natural language
– Compliance for buyers is tedious and error-prone
• Possible to automate the process: DataLawyer
– Enables more precise terms of use specification
– Enables efficient enforcement
• Open problems
– Malicious users
– Data leaving database system
26. What are the Technical Challenges?
• Challenge 1: Data License Agreements
– All data comes with terms of use
– Can we automate their enforcement?
• Challenge 2: Data Pricing
– Existing pricing methods are limited
– Can we support flexible pricing?
27. Data Pricing Today: Fixed
Not flexible!
28. Data Pricing Today: Subscriptions
Not flexible!
29. Data Pricing Today: Private Price
Not scalable!
30. Example Scenario
• Seller has a database of cities and business contact information
– Businesses in one province or state: $300
– One type of business: $150
– Cities with given climate: $10
• Buyer:
– Q1: “Businesses with more than 200 employees” (selection)
– Q2: “West-coast businesses in cities with high yearly precipitation” (join)
• How to satisfy the buyer?
31. Current Pricing: Fixed Prices
• Fixed price for entire dataset
• Must create and price views specific to queries Q1 and Q2
• OR user must buy entire dataset if view not available
• AND user must perform joins by herself
• Certainly the case if datasets have different owners
32. Current Pricing: Subscriptions
• Subscriptions
– Fixed number of transactions per month
– Must create and price appropriate parameterized queries
– Today queries are dataset specific (i.e., no joins!)
– Can satisfy Q1: “Businesses with more than 200 employees”
– Cannot satisfy Q2: “West-coast businesses in cities with high yearly precipitation”
33. Other Data Pricing Issues
• Today’s data pricing can also have bad properties
• Example: Weather Imagery on Azure DataMarket
– 1,000,000 transactions -> $2,400
– 100,000 -> $600
– 10,000 -> $120
– 2,500 -> $0
• Arbitrage opportunity:
– Emulate many users
– Get as much data as you want for free!
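The arbitrage can be checked with simple arithmetic on the tiers listed above. The sketch below is illustrative only: the function names are hypothetical, and it assumes a buyer can open as many free-tier accounts as needed.

```python
# Tiers from the slide: (transactions included, price in dollars).
TIERS = [(1_000_000, 2400), (100_000, 600), (10_000, 120), (2_500, 0)]

def cheapest_single_account(needed):
    """Price of the cheapest single tier that covers the needed transactions."""
    return min(price for quota, price in TIERS if quota >= needed)

def cost_with_many_accounts(needed, tier):
    """Accounts required and total price when splitting the work across one tier."""
    quota, price = tier
    accounts = -(-needed // quota)  # ceiling division
    return accounts, accounts * price

# One honest buyer who needs 1,000,000 transactions pays the top tier.
honest = cheapest_single_account(1_000_000)

# The same volume spread over emulated free-tier users costs nothing.
accounts, arbitrage = cost_with_many_accounts(1_000_000, (2_500, 0))
```

With these tiers, 400 emulated free accounts cover the same 1,000,000 transactions that cost $2,400 on a single account.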
35. Query-Based Pricing
• Seller specifies a set of queries Q1, … Qn
• These queries form views on the data for sale
• Seller prices the views: price(Q1), …, price(Qn)
– D = all cities and businesses in North America
– V1 (businesses in one state) = $300
– V2 (businesses of one type) = $150
– V3 (cities with a given climate) = $10
36. Query-Based Pricing
• QueryMarket system computes other query prices
– Q2: “West-coast businesses in cities with high yearly precipitation”
– Key idea: Compute least expensive set of views that can be used to answer the query. The sum of the price of these views is the price of the query
• System guarantees price properties
– Arbitrage-free prices
– Maximal prices (no unintended discounts)
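The key idea can be illustrated with a toy brute-force search over sets of priced views. This is not the QueryMarket algorithm, only a sketch of "cheapest set of views that answers the query" under stated assumptions: the number of business types (7), the view names, and the hand-written test answers_q2 for when a set of views suffices to answer Q2 are all hypothetical. View prices follow the example scenario ($300 per state, $150 per business type, $10 per climate).

```python
from itertools import combinations

WEST_COAST = ["WA", "OR", "CA"]
BUSINESS_TYPES = [f"type{i}" for i in range(7)]  # hypothetical: 7 business types

# Priced views relevant to Q2.
VIEWS = {f"state={s}": 300 for s in WEST_COAST}
VIEWS.update({f"type={t}": 150 for t in BUSINESS_TYPES})
VIEWS["climate=high-precipitation"] = 10

def answers_q2(bought):
    """Assumed sufficiency test: the rainy-city view plus every west-coast
    business, obtained either state by state or type by type."""
    has_cities = "climate=high-precipitation" in bought
    all_states = all(f"state={s}" in bought for s in WEST_COAST)
    all_types = all(f"type={t}" in bought for t in BUSINESS_TYPES)
    return has_cities and (all_states or all_types)

def price_query(views, sufficient):
    """Return (price, views) for the cheapest sufficient set, by exhaustive search."""
    best = None
    names = sorted(views)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            if sufficient(set(subset)):
                cost = sum(views[v] for v in subset)
                if best is None or cost < best[0]:
                    best = (cost, subset)
    return best

price, chosen = price_query(VIEWS, answers_q2)
```

Under these assumptions the three state views plus the climate view ($910) beat buying all seven business-type views plus the climate view ($1,060). Exhaustive search is exponential in the number of views, so it only works for toy inputs like this one.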
37. Conclusion
• Data has value
• Data is bought and sold online
• Supporting modern data markets requires
– New tools for managing license agreements
– New methods for pricing data
• Much work remains to be done
http://cloud-data-pricing.cs.washington.edu
39. Potential Techniques
[Figure: techniques placed on two axes, Online vs. Offline and Fuzzy Semantics vs. Precise Semantics]
• Privacy mechanisms (reduces data utility)
• Intrusion detection system (assumes the user is malicious; offline)
• Access control (want full access to data)
• Auditing systems (offline)
None of these work!
Editor's notes
Once hardware became commoditized, the value moved to the software…
Dun & Bradstreet: To be the most trusted source of commercial insight so our customers can decide with confidence. With over 225 million records on businesses from around the world, we have more data and insight about businesses than any other information company on the planet. Founded 1841.
How they sell their data: It always says: “To get started, contact us”.
Factual: Business info, services such as translating lat/long into a neighborhood, data integration APIs (such as mapping locations across different services such as Yelp, Foursquare, Facebook, and Twitter).
Companies that own data, aggregate & curate data, sell data-related services such as cleaning, integration, and analysis.
Microsoft Azure Marketplace: online market for buying and selling finished Software as a Service (SaaS) applications and premium datasets.
131 free datasets. 95 paid datasets (26 have free trial). Prices are posted. Subscription based.
Factual: Focus is on supporting mobile apps. Business info, services such as translating lat/long into a neighborhood, data integration APIs (such as mapping locations across different services such as Yelp, Foursquare, Facebook, and Twitter). Private pricing (just like Dun & Bradstreet).
AggData: Business contact information. Flexible pricing: full datasets, subscriptions to get updates, or personalized data products.
Gnip: Provider of social data (real time and historical). For example, real-time “firehose” data. We get every activity from social networks like Twitter, Tumblr, and WordPress, add in our own enrichments and then pass the data on to you, all in the blink of an eye. Also sell subsets of data based on user-specified selection predicates. Private pricing.
Patientslikeme: See 25 million data points about disease. We take the information patients like you share about your experience with the disease and sell it to our partners.
Xignite: Sell financial market data. Private pricing appears to be subscription based (hits).
The MIMIC-II research database (Multiparameter Intelligent Monitoring in Intensive Care) is notable for three factors: it is publicly and freely available; it encompasses a diverse and very large population of ICU patients; and it contains high temporal resolution data including lab results, electronic documentation, and bedside monitor trends and waveforms. The database can support a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development.
What medical researchers actually do is this. You email them saying you want the data. They send you the link to a three-hour-long course. If you are like me, and use common sense, you flunk it and spend an additional hour retaking it. Now, all the course does is tell you what you can’t do with the data: you can’t share it with anyone else and you shouldn’t try to infer the identity of patients.
Here are more examples. The matter of interest to us is that these license agreements were over 8 pages of text and impose significant cognitive burden on the users of these datasets.
You can do live processing on Twitter through the Twitter Stream API. But the Stream API doesn't send you all tweets: you have to specify keywords to filter tweets containing them, user IDs to follow tweets coming from given users, or a geolocation to filter tweets coming from a particular place. There are also limits on the number of keywords, user IDs to follow, etc. Lastly, Twitter will give you at most 1% of the tweets in the global traffic. So, for example, if you come up with a filter that tracks less than 1% of the global traffic, you will get all the matching tweets (in theory); otherwise you will miss some tweets, as Twitter won't share them with you.
We performed a survey of 13 data providers (Foursquare [8], Yelp [20], Azure Marketplace [17], Twitter [14], Infochimps [9], Socrata [15], Xignite [19], Digital Folio [6], DataSift [5], World Bank [18], Navteq [12], data.gov.uk [13], and DataMarket [4]).
I will give the overall idea. Please find me during the poster session to discuss the algorithms in depth.
To check policies given a user query, we first run the various feature extractors, populate their output into a log that we call the usage log, evaluate all the policies, and check that all of their results are empty. If so, we return the query answers; if not, we return an error message for the policy that failed.
We have a number of optimizations that try to evaluate policies without evaluating all of the features, since feature extraction is the primary overhead.
We found that this abstraction is powerful enough to capture a vast number of usage terms we found in real life.
Our prototype extracts three features:
• User information such as user-id and query time.
• Schema information about the input relations of the query and its output columns.
• Where-provenance of tuples.
Given just these three tables, it is possible to write very sophisticated policy checking queries.
Note that it is also easy to extend the system to newer domains, say query prices in data marketplaces.
No additional SQL constructs, and extensible.
Scaleup
The seller needs to carefully design the views to sell so as to cater to as many buyers as she can, without knowing the exact queries from the buyers. Similarly, the buyer needs to carefully evaluate the various views that are available for sale to determine which views suffice to answer her query and are also the cheapest way to answer that query.
Currently supports only selection queries on a single attribute of a relation
Currently, non-trivial subset of full conjunctive queries without self-joins [8] called the "Generalized Chain Queries" (Section 5), a set that includes all path join queries, star join queries, and their combination.
(1) an algorithm with polynomial time data complexity for computing the price of any generalized chain query, by reducing the problem to network flow, (2) a complete characterization of the class of Conjunctive Queries without self-joins that can be priced with PTIME data complexity (this class is slightly larger than generalized chain queries), and (3) a proof that pricing all other queries is NP-complete, thus establishing a dichotomy on the complexity of the pricing problem when all views are selection queries. For the queries that are not in PTIME, we reduce the problem to an Integer Linear Program and compute the prices.
IDS: The user is not adversarial, but a legal user.
Highlight about why each doesn’t work.
Privacy: We discussed.
Access control: Do not want it.
Auditing systems: After the fact.
Intrusion detection system: Very different aims. Aims to find malicious users whose actions are deliberate. Our users aren’t malicious and offenses.
Access control
Offline audits
Hippocratic databases
----- Meeting Notes (1/4/13 16:47) -----
Intrusion detection argument was bad.