This presentation was given by Christian Reimsbach-Kounatze of the OECD at the CERI Conference on Innovation, Governance and Reform in Education on 5 November 2014 during session 6.b: The Role of “Big Data”.
Data driven innovation for education
1. DATA-DRIVEN INNOVATION
FOR EDUCATION
5 November 2014
CERI CONFERENCE ON
INNOVATION, GOVERNANCE AND REFORM IN EDUCATION
Christian.Reimsbach-Kounatze@oecd.org
Directorate for Science, Technology and Innovation (DSTI)
2. 1. Why should we care about “big data” or
data-driven innovation (DDI)?
2. What is really new about it?
3. What are the key opportunities?
4. What are the key challenges?
Structure
4. A lot of “big data” buzz
• “Data is the new oil.” Andreas Weigend, Stanford (ex Amazon)
• “The future belongs to companies and people that turn
data into products”, Mike Loukides, O’Reilly Media
“Why big data
is a big deal”
InfoWorld – 9/1/11
“Keeping Afloat
in a Sea of 'Big
Data'”
ITBusinessEdge – 9/6/11
“The challenge–
and opportunity–
of big data”
McKinsey Quarterly—5/11
“Getting a Handle
on Big Data with
Hadoop”
Businessweek-9/7/11
“Ten reasons why
Big Data will
change the travel
industry”
Tnooz -8/15/11
“The promise of
Big Data”
Intelligent Utility-8/28/11
Source: http://www.google.com/trends/explore#q=%22big%20data%22
5. What is “big data”? And why should we
rather refer to Data-Driven Innovation?
• Defining “big data” is challenging:
– Data for which the “size is beyond the ability of
typical database software tools to capture, store,
manage, and analyse” (McKinsey Global Institute,
2011)
– Data that is characterized by the 3Vs: volume,
velocity (real-time data) and variety (unstructured
data) (Gartner, 2011).
• DDI refers to the use of data and analytics to
improve or foster new products, processes,
organisational methods and markets.
6. Data: unlimited source for growth
Health and Aging
Public Administration Retail
Transportation and
energy
Agriculture
Science and Education
8. Data has always been key to social
and economic activities
• “Business intelligence” and “data
warehousing” already emerged in the
1960s and became popular in the late
1980s (Luhn, 1958; Keen, 1978).
• “Formal education has always been a data-rich
activity, with many data collected by
teachers and schools about learning
outcomes, attendance, enrolments”
(see agenda)
9. DDI is not only about data,
it is about the data value cycle
10. The exponential growth in data
generated and collected
Monthly global IP traffic, 2005-16
In exabytes (billions of gigabytes)
Average data storage cost, 1998-2012
In USD per gigabyte (log scale)
Sources: OECD based on Cisco (2012); OECD based on Pingdom (2011)
11. The democratisation of computation
and analytic capacities
Open source data
processing and analytics
Data requests in Netflix, 2010-11
Data centre capacity
Sources: Netflix.com
In billions
12. A new paradigm in decision making?
Machine learning is now mainstream
15. Personal data is increasingly used
for customization
Personalised services Collaborative filtering
16. Data and analytics are empowering
process automation
• Automatic adjustment of production (e.g. smart grids)
• Autonomous machines in retail warehousing or
self-driving cars
Growth in algorithmic trading as share of total trading
Source: The Economist (2012)
20. Loss of autonomy and freedom
• Discrimination may result in greater
efficiencies, but also limits an individual’s
ability to escape the impact of prejudices
• Filter bubbles: users become separated
from information that disagrees with their
viewpoints, effectively isolating them in
their own cultural or ideological bubbles.
21. Lack of data scientists across the
economy
United States, 2013:
• Professional and business services, 43%
• Financial activities, 12%
• Manufacturing, 11%
• Educational and health services, 11%
• Public administration, 7%
• Information, 6%
• Wholesale and retail trade, 5%
• Others, 5%
EU, 2013:
• Professional, scientific and technical activities, 43%
• Public administration, defence, and social services, 15%
• Manufacturing industry, 12%
• Information and communication, 9%
• Financial and insurance activities, 7%
• Wholesale and retail trade, 6%
• Transportation and storage, 2%
• Others, 7%
* Based on preliminary working definition of “data scientists”; ICT services included in “Professional *”.
Source: OECD based on US CPS (March Supplement 2013) and EU LFS
22. • Data ownership?
• Data interoperability?
• Data portability?
Better data sharing platforms and common
standards could be needed;
Privacy as well as IPR concerns may better be
addressed in a more differentiated manner;
Getting data governance
frameworks right
23. Thank you for your attention!
• OECD project site: http://oe.cd/bigdata
• OECD (2013), “Exploring Data-Driven Innovation as a
New Source of Growth: Mapping the Policy Issues
Raised by ‘Big Data’”: http://oe.cd/bigdata1
• OECD (2015), Data-Driven Innovation for Growth and
Well-being
– Preliminary synthesis report on “Data-Driven Innovation for
Growth and Well-being”: http://oe.cd/bigdata2
• Contact: Christian.Reimsbach-Kounatze@oecd.org
Editor's notes
Good morning everyone!
It is my pleasure to share with you today // the interim results of the work carried out // under the data pillar of KBC2.
These results have been provided to you via the first draft of the synthesis report // which has the cote DSTI/ICCP(2014)11 // as well as the overall report // including 10 draft chapters // provided as ANNEX document.
The synthesis report is the basis of this presentation.
Explain structure,
On 1.) Highlight that at the end of this section you should also understand what we mean by data-driven innovation
On 3.) The policy opportunities discussed are not only relevant for the EU but for other OECD countries as well as some of its key partner economies.
Then give the disclaimer that your presentation reflects your expert opinion and does not necessarily reflect the position of the OECD SG or that of its member countries.
More data was created in 2013 than in all the preceding years of human history combined, and every minute the world generates enough data to fill more than 360,000 standard DVDs
This includes tweets, public Facebook posts, geotags that locate where photos were taken and news stories. It can also include de-identified records of mobile phone activity
MGI estimates suggest that:
Private sector retailers using big data can boost productivity growth and increase their operating margin by over 60%
Public administration could generate EUR 100 billion in savings from operational efficiency improvements.
The use of geo-location data could generate almost USD 500 billion by 2020 in consumer surplus attributable to saved time and fuel.
We have to be cautious about these numbers. But what is more important here is that data is now increasingly used across the economy, even in agriculture. Companies such as John Deere (US) or Lely (NL) are increasingly innovating based on the data they collect.
Why is it important to highlight this? Because data-driven innovation in the past was mainly about internet firms!
However, to have a more nuanced view on DDI, it is helpful to also consider the full data value cycle. This can help for example to identify specific issues that occur at the different phases of the data value cycle.
Decision makers do not necessarily need to understand a phenomenon, before they act on it. In other words: first comes the analytical fact, then the action, and last, if at all, the understanding.
For example, a company such as Wal-Mart Stores may change the product placement in its stores based on correlations without the need to know why the change will have a positive impact on its revenue.
As Anderson (2008) explains: “Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity.” He challenges the usefulness of models in an age of massive data sets, arguing that with large enough data sets, machines can detect complex patterns and relationships that are invisible to researchers. He concludes that the scientific method has in most cases become obsolete, because correlations are enough.
1.) The use of geo-location data will generate almost USD 500 billion by 2020 in consumer surplus attributable to saved time and fuel.
2.) NSOs are exploring the use of “big data” to supplement official statistics: analysts at the Chilean Central Bank have used Google Insights to create a Google Trend Activity Index (GTAI) to successfully forecast the year-over-year (y-o-y) growth in the volume of car sales in Chile.
3.) Twitter is not only for flu trends: Twitter is a potential (unstructured) data source for analysing and even predicting the “emotional roller coaster” and its impact on the ups and downs of stock markets (Grossman, 2010; MIT Technology Review, 2010).
GNS Healthcare, a Cambridge-based company, uses different data sets (gene expression, SNPs, proteomics, metabolomics and, more recently, next-generation gene sequence data as well as Electronic Health Records and Health Information Exchanges data) to deliver personalized medical recommendations.
Some have suggested that with big data, decision makers could base their actions only on analytical facts without the need to understand the phenomenon.
This is because with big data correlations can often appear statistically significant even if there is no causal relationship
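The point about spurious significance can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are pure random noise, the function names (`pearson`, `count_spurious`) are made up for this illustration, and the cut-off 1.96 / sqrt(n) is a large-sample rule of thumb for a two-sided 5% test rather than an exact critical value.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_obs=100, n_vars=1000, seed=0):
    """Count noise variables that look 'significantly' correlated with a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    # Approximate two-sided 5% cut-off for |r| when the variables are independent.
    threshold = 1.96 / math.sqrt(n_obs)
    hits = 0
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to target by construction
        if abs(pearson(noise, target)) > threshold:
            hits += 1
    return hits


spurious_hits = count_spurious()
print(f"{spurious_hits} of 1000 unrelated variables pass the 5% cut-off")
```

Roughly one in twenty unrelated variables should clear the cut-off by chance alone, so the more variables a data set contains, the more "significant" correlations it will offer without any causal link.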
Changing data environment!
Data analytics, in particular when used for decision automation, can sometimes be easily “gamed” once the factors affecting the underlying algorithms have been understood, for example, through reverse engineering.
Marcus and Davis (2014) present for example the case where essay evaluation analytics that relied on measures like sentence length and word sophistication to determine typical scores given by human graders, were gamed by students who suddenly started “writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text”.
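A toy version of such a gameable scorer might look as follows. It is a hypothetical illustration, not the system Marcus and Davis describe: the function `naive_essay_score`, its weighting and both sample essays are invented here, and the scorer deliberately looks only at sentence length and word length.

```python
import re


def naive_essay_score(text):
    """Hypothetical scorer: rewards long sentences and long words, nothing else."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_sentence_len = len(words) / len(sentences)          # words per sentence
    avg_word_len = sum(len(w) for w in words) / len(words)  # letters per word
    return avg_sentence_len + 2 * avg_word_len


clear_essay = "Data can help teachers. It shows which pupils need support. Use it with care."
gamed_essay = (
    "Notwithstanding multitudinous epistemological considerations, pedagogical "
    "practitioners indubitably necessitate comprehensive informational infrastructures "
    "whereby heterogeneous quantitative observations are perpetually accumulated."
)

clear_score = naive_essay_score(clear_essay)
gamed_score = naive_essay_score(gamed_essay)
print(f"clear: {clear_score:.1f}, gamed: {gamed_score:.1f}")
```

The padded essay scores far higher although it says less, which is the behaviour a student can exploit once the scoring factors are known.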
Data analytics does not need to be intentionally gamed to lead to wrong results. Often it is simply not robust to unexpected changes in the data environment.
The elephant in the room when we speak of big data really is privacy.
The key challenge to regulation is that the concept of personal data is becoming less and less operational, because data that seem non-personal can convey personal information when linked to other seemingly non-personal data.
Computers and devices are encoding a lot of information about what we are doing, when we are doing it, and where we are doing it from.
This comes with some key risks such as:
Discrimination: Customer segmentation can support dynamic pricing, raising issues related to equality. Predictive analytics can perpetuate existing stereotypes. Consumers may not realise that they are treated differently, and have little opportunity to contest such treatment. Could be extended to employment, insurance and credit.
Information asymmetry: Yes, the web puts a wealth of information at a surfer’s fingertips. Price comparisons, user reviews, etc. have an important empowering impact. But businesses are likewise obtaining information about individual customers of greater and greater refinement. There is a general lack of transparency about these processes to consumers that may put them at a commercial disadvantage.
Privacy frameworks in need of adjustment: OECD Privacy Guidelines revised. To be submitted to Council on 11 July for adoption. Further adjustments may be needed to specifically protect privacy in the context of big data (e.g. the [intact] basic principles may need to be adapted to better address the issue of secondary uses of personal data).
Let me give another example of a cross-cutting issue: data security breach.
The security dimension is clear: the compromise of IT systems has been a long-standing problem.
Where the lost or stolen data is personal data, you have a privacy problem.
And then there are consumer risks: identity theft has for years been at or near the top of the list of consumer complaints.
Security breaches are regrettably commonplace. This slide notes 3 breaches – a very partial list of breaches announced this month alone.
One is a breach affecting at least 70 million customers of the 3rd largest US retailer. Following the breach, Target reduced its 4th quarter earnings forecast by 25%.
Another involved 3 Korean credit card companies and affected 20 million individuals – 40 % of the population. Some 3 dozen executives lost jobs.
A 3rd breach involves data from several million users of an app to send secure private messages. Snapchat. How do you measure damage to a start-up whose business is trust?
Current debate about tackling the information asymmetry issue: increasing transparency about the use of personal data and increasing users’ (consumers’) control over their personal data by giving them open access to these data sets. One example of the latter emerged last year in the UK and is known as the “midata” initiative. It aims at giving consumers access to the data created through their household utility use, banking, internet transactions and high street loyalty cards (see https://www.gov.uk/government/consultations/midata-2012-review-and-consultation).
This leads to the issues related to open data >>
Data analytics make it increasingly easy to infer information about individuals, even if they never shared this information with anyone.
Privacy regimes are based on the concept of personal data. However, data analytics make it possible to infer personal information from non-personal data. In particular when data sets are linked!!!
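How linking works in practice can be sketched in a few lines. Everything in this example is hypothetical: the two toy data sets, the field names and the `link_records` helper are invented for illustration, and real linkage attacks deal with far messier data.

```python
# Hypothetical "anonymised" records: no names, only quasi-identifiers plus a sensitive field.
health_records = [
    {"postcode": "1011", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "1011", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "2042", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public register: names plus the same quasi-identifiers, nothing sensitive.
public_register = [
    {"name": "Person A", "postcode": "1011", "birth_year": 1975, "sex": "F"},
    {"name": "Person B", "postcode": "1011", "birth_year": 1982, "sex": "M"},
    {"name": "Person C", "postcode": "3050", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link_records(anonymous, identified, keys=QUASI_IDENTIFIERS):
    """Join two data sets on shared quasi-identifiers; keep only unambiguous matches."""
    index = {}
    for person in identified:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    linked = []
    for record in anonymous:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate: the record is re-identified
            linked.append((names[0], record["diagnosis"]))
    return linked


reidentified = link_records(health_records, public_register)
print(reidentified)
```

Neither data set on its own ties a name to a diagnosis, yet the join does so for every record whose combination of quasi-identifiers is unique.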
While Hal is famous for promoting the sexy nature of being a statistician, processing and mining Big Data takes a special type of statistician, increasingly called a “Data Scientist”.
MGI (2011) estimates that demand for “deep analytical talent” in the US could be 50 to 60% greater than its projected supply by 2018.
This suggests that NSOs would be bidding against private firms for people who have these skills and could be forced to pay a premium to attract this talent.
Why work for ABS when you can work for Google?
PIAAC data across economies reveal that between 7% and 27% of adults have no experience in using computers or lack the most elementary computer skills, such as the ability to use a mouse.
Highlight that 35% within the 43% in Professional, scientific, and technical activities are in ICT services.
Michael made the point about looking at personal data as a binary concept (0 and 1)
In the case of PSI:
Knowledge is a source of competitive advantage in the “information economy” and a major source of growth
Wide diffusion of data can be economically significant
Benefits from improving access to and facilitating reuse of data include:
Developing new products built directly on PSI
Developing complementary products, software and services
Reducing transaction costs in accessing and using information
Improving efficiency and productivity
Enabling efficiency gains in the public sector
Mixing public and private information in new goods and services
Almost all countries have Creative Commons (CC) or Creative Commons-like unrestricted licensing models to encourage use and innovation
Attribution is the main licence requirement
Most public pricing practices moved progressively from seeing public sector information and data as resources to be exploited
…..To
Seeing them as potential drivers of innovation, business creation and expansion
Making data free or available at marginal cost
Finally, here is the outline of the overall publication.
Finally, I would like to thank you for your attention and highlight that this work is based on collaboration across divisions and directorates.
I may also have failed to highlight some of the important work done by my colleagues during the presentation.
The most successful high-tech internet companies such as Google and Amazon have built their business models on the collection and exploitation of big data.
These companies were able to scale without mass:
Talk about revenue per employee: Google 1 million USD per employee.
At Google, physical assets accounted for only about 13% of Google’s worth as of 31 December 2012
(calculated from annual balance sheet data as follows: (p – d) / a, where p is the total gross value of property, plant, and equipment; d is total accumulated depreciation; and a is total assets.)
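The ratio above can be written as a small helper. This is a sketch only: the function name `physical_asset_share` is made up here, and the input figures are illustrative placeholders, not Google's balance sheet.

```python
def physical_asset_share(gross_ppe, accumulated_depreciation, total_assets):
    """Share of total assets held as net property, plant and equipment: (p - d) / a."""
    if total_assets <= 0:
        raise ValueError("total_assets must be positive")
    return (gross_ppe - accumulated_depreciation) / total_assets


# Placeholder figures in arbitrary currency units (assumed, not taken from any filing).
share = physical_asset_share(gross_ppe=40.0, accumulated_depreciation=15.0, total_assets=250.0)
print(f"{share:.0%}")
```

A low share means most of the company's value sits in intangible assets such as data and software, which is what "scale without mass" refers to.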
In 2008, Google already processed over 20 petabytes of data per day (100 petabyte in 2012)
through 1 to 10 million servers operating every day
1 Petabyte = 1 million gigabytes = 0.5 billion HQ photos
20 Petabytes = Total production of hard-disk drives in 1995 = volume 1000 times the quantity of all printed material in the U.S. Library of Congress
The 3rd phase of the Internet will be the “Internet of Things” or M2M.
It will be less PC / personal device centric and more embedded devices;
– that open up huge new opportunities for controlling supply chains, tracking objects and monitoring the environment
-- But they also pose some issues regarding security and privacy.
-- These devices will throw off huge amounts of data;
-- Ericsson estimates that by 2020 there will already be 50 billion devices connected to the Internet
Available evidence confirms that DDI is a NEW SOURCE OF GROWTH.
[CLICK]
Looking first at the supply side for data and analytics // estimates suggest that the global market for data analytics is growing by 40% a year on average // and will reach 17 billion USD by 2015;
According to our estimates // the OECD market for public sector data was worth 97 billion USD in 2008.
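To get a feel for what 40% average annual growth implies, the figure for the data analytics market can be run backwards from the USD 17 billion estimate for 2015. This is a back-of-the-envelope sketch that assumes constant compound growth; the earlier-year values it prints are implied by that assumption and are not estimates from the source. The function name `implied_market_size` is made up for this example.

```python
def implied_market_size(target_value, annual_growth, years_before_target):
    """Back out the size implied k years before a target, assuming constant compound growth."""
    return target_value / (1 + annual_growth) ** years_before_target


# Figures quoted in the notes: 40% average annual growth, USD 17 billion in 2015.
for year in range(2015, 2009, -1):
    size = implied_market_size(17.0, 0.40, 2015 - year)
    print(year, round(size, 1))
```

At that rate the market more than quintuples in five years, which is why a steady 40% is a strong claim and worth treating with the same caution as the other estimates.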
[CLICK]
What is more relevant from a policy maker perspective // however // is the impact of the use of data and analytics // that is // the impact of DDI // across the economy.
Empirical firm-level studies confirm that the use of data analytics can boost firms’ productivity. Depending on the study, the impact ranges from 5% to 13%.
We believe that 5-10% is a reasonable conservative estimate // which is still an impressive figure // in particular if you consider that productivity growth in the OECD area was at 1.6% between 2009-12;
[CLICK]
At this point // it is very important to be aware that these figures DO NOT capture the full social benefits of data and analytics, // such as the social benefits of better transparency of government activities through open data // or the benefits of the personal use of data and analytics for health care or self-awareness raising // as promoted for example by the quantified-self movement.
These social benefits // that relate to consumer surplus // or to aspects of well-being // are still poorly captured by economic statistics // if at all.
It is important to recall this // also because in contrast to the economic benefits which are well captured quantitatively // potential social costs due to the inappropriate use of data and analytics are hard to measure // and may not appear on a radar screen which only captures quantitative figures.
Policy makers also need to understand the risks and challenges that come with DDI.
What are these risks and challenges?
[Click]
Looking at the supply side first again//
[Click]
Barriers to the free flow of data can be identified as one of the most critical challenges preventing possible spill-over effects.
These barriers are not only an issue across borders, // but also across sectors and organisations, // including between organisations and individuals // the latter is relevant when we talk about data portability.
It is important to note that there are some legitimate reasons for the limitation of the free flow of data // privacy is often cited as one, as well as security // or the protection of trade secrets.
[Click]
Another challenge relates to the limited applicability of the concept of ownership //
The concept of ownership entails the right of exclusion, as well as the right to fully dispose of the data including the right to delete the data at will.
However, when it comes to personal data in particular // there are some unrestrictable control rights granted to data subjects // that limit the control rights of the data controller // to such an extent that data controllers can hardly be seen as data owners in the traditional sense.
[Click]
At this point // it is important to note that // the limited applicability of the concept of ownership is at the source of some of the incentive problems // related to data quality control or data curation that we see in science but also in health care, as well as some of the incentive issues related to data sharing.
[Click]
Looking now at the demand side //
[Click]
Lack of skills and competencies is an issue that emerged in all working streams of the project, be it // skills in the area of science, health care, or even public administration.
A number of empirical studies have also confirmed the lack of skills as an important barrier to DDI in businesses.
I have already talked about skills and
[Click]
Organisational change
And
[Click]
Entrepreneurship as important demand side issues.
[Click]
So please let me now highlight some of the societal challenges that are affecting not only the supply side or the demand side but society at large.
The first issue is related to the economic property of data discussed in the previous slide:
[Click]
As I highlighted, the increasing returns to scale and scope favour market concentration and dominance.
This can raise competition // as well as consumer protection issues // where such market dominance is abused.
[Click]
Furthermore, the agglomeration of data can also lead to greater information asymmetry between the data controller and the data subjects.
This information asymmetry may lead to a shift in power away from the data subject and // may exacerbate existing inequalities; leading to a new type of digital divide : a digital divide 3.0 if you want.
[Click]
Last, but not least, policy makers need to consider the deterioration of trust in the face of (i) the risks of losing autonomy and freedom and (ii) increased cybersecurity risks.
Finally, here is the outline of the overall publication.
I would like to take this opportunity to thank the Netherlands for their in-kind contribution through a module produced by TNO. The content of the module was very much appreciated and used for chapter 3 and chapter 10.
Many thanks to the NL.