What Do We Do with
All This Big Data?
Fostering Insight and Trust in the Digital Age
A Market Definition Report
January 21, 2015
By Susan Etlinger
Edited by Rebecca Lieb
Every day, we hear new stories about data: how much there is, how fast it
moves, how it’s used for good or ill. Data ubiquity affects our businesses,
our educational and legal systems, our society, and increasingly, our
dinner-table conversation. I had the opportunity to speak at TED@IBM
in San Francisco on September 23, 2014, about the implications of a
data-rich world, and what we can do as businesspeople, citizens, and
consumers, to use it to our best advantage.1
That talk, as well as this document, examines two themes that underlie
many conversations about data and technology. They correspond to fears
that George Orwell and Aldous Huxley chronicled in their novels 1984 and
Brave New World. As the culture critic Neil Postman put it in his 1985 book,
Amusing Ourselves to Death:
What Orwell feared were those who would ban books. What
Huxley feared was that there would be no reason to ban a book,
for there would be no one who wanted to read one. Orwell
feared those who would deprive us of information. Huxley
feared those who would give us so much that we would be
reduced to passivity and egotism. Orwell feared that the truth
would be concealed from us. Huxley feared the truth would
be drowned in a sea of irrelevance. Orwell feared we would
become a captive culture. Huxley feared we would become a
trivial culture.
These two themes—irrelevance and narcissism on one hand (Huxley) and
surveillance and power on the other (Orwell)—anticipate modern fears
about the explosion of data in our personal and professional lives. As
individuals, we crave insight and convenience, yet we simultaneously fear
loss of control over our privacy and our digital identities.
Susan Etlinger speaking at TED@IBM at SFJAZZ, San Francisco, California, September 23, 2014.
Photo: Daniel K. Davis/TED
Table of Contents
What’s So Hard About Big Data?
With Big Data, Size Isn’t Everything
Unstructured Data Demands New Analytical Approaches
Traditional Methodologies Must Adapt
From Data to Insight
Big Data Requires Linguistic Expertise
Big Data Requires Expertise in Data Science and Critical Thinking
Legal and Ethical Issues of Big Data
Planning for Data Ubiquity
This document proposes an approach to better understand and address:
• How we extract insight from data
• How we use data in such a way as to earn and protect trust: the trust of customers,
constituents, patients, and partners
To be clear, these twin challenges of insight and trust will occupy data scientists, engineers,
analysts, ethicists, linguists, lawyers, social scientists, journalists, and, of course, the public for
many years to come. To derive insight from data while protecting and sustaining trust with
communities, organizations must think deeply about how they source and analyze it and clarify and
communicate their roles as stewards of increasingly revealing information. This is only a first step,
but it’s a critical one if we are to derive sustainable advantage from data, big and small.
WITH BIG DATA, SIZE ISN’T EVERYTHING
The idea of big data isn’t new; it was defined in the late ’90s by analysts at META Group (now
Gartner Group). According to META/Gartner, big data has three main attributes, known as
the Three Vs:
• Volume (the amount of data)
• Velocity (the speed at which the data moves)
• Variety (the many types of data)3
Now nearly two decades old, this construct has become increasingly pertinent. As IBM has famously said,
“90% of all the data in the world was created in the past two years.”4
To understand why this is, we need to
compare the business conditions that existed when big data was originally defined with today’s. In the early
2000s, technologists were grappling with a burgeoning variety of data types, spurred in large part by the rise of
electronic commerce. Today, social media is a major catalyst of data proliferation. Consider that:
• 100 hours of video are uploaded to YouTube every minute.5
• On WordPress alone, users produce about 64.8 million new blog posts and 60.4 million new comments
• 500 million tweets are sent per day.7
Much data is unstructured. It is, as Gartner defines it, “content that does not conform to a specific, pre-
defined data model. It tends to be the human-generated and people-oriented content that does not fit neatly
into database tables.”8
As a result, the primary challenge of what we think of as big data isn’t actually the size;
it’s the variety. For this reason, the term “big data” can sometimes be misleading.
If this seems counterintuitive, consider this example: the New York Stock Exchange (NYSE) recorded
approximately 9.3 billion shares traded on December 16, 2014, more than 18 times the average number of
tweets (approximately 500 million) created per day.9
Even though the number of trades is much larger than
the number of tweets (volume) and the speed of the market may change from hour to hour and day to day
(velocity), the basic attributes of a trade—price, trade time, change from previous trade, previous close, price/
earnings ratio, and so on—are the same every time. A trade is a trade. It is homogeneous and predictable from a
data perspective (variety).
In contrast, social data is far more complex and variable. While a tweet contains some structured data
(metadata about the time it was posted, the user who posted it, whether it includes hashtags or media, such as
photography, and other attributes), it can express anything that fits into 140 characters. It is a mix of structured
metadata and unstructured text and images that can be expressed with variable lengths, languages, meanings,
and formats. It can contain a news headline, a haiku, a sales message, or a random thought. For this reason, a
much smaller number of tweets can be far more complex to analyze from a data standpoint. Size isn’t everything.
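The contrast between volume and variety can be sketched in a few lines of Python. This is only an illustration: the field names are assumptions made for the example, not an actual NYSE or Twitter schema, and the sample tweets are invented. The two volume figures are the ones quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Trade:
    """Every trade carries the same fields (hypothetical names)."""
    price: float
    trade_time: str
    change_from_previous: float
    previous_close: float


@dataclass
class Tweet:
    """Structured metadata plus free-form text (hypothetical names)."""
    posted_at: str
    user: str
    text: str  # anything that fits into 140 characters
    hashtags: list = field(default_factory=list)


# Volume: the figures quoted in the text.
shares_traded = 9_300_000_000   # NYSE, December 16, 2014
tweets_per_day = 500_000_000    # approximate daily average
volume_ratio = shares_traded / tweets_per_day

# Variety: same metadata fields, entirely different kinds of content.
tweets = [
    Tweet("2014-12-16T09:30", "news_desk",
          "Markets open higher on energy rally"),
    Tweet("2014-12-16T09:31", "poet",
          "Morning fog lifts slow / the ticker begins to hum / coffee going cold",
          ["haiku"]),
]
text_lengths = [len(t.text) for t in tweets]

print(f"Trades outnumber tweets by a factor of {volume_ratio:.1f}")
```

A `Trade` can be validated and aggregated field by field, while the `text` of a `Tweet` has to be interpreted before it can be counted as anything.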
UNSTRUCTURED DATA DEMANDS NEW ANALYTICAL APPROACHES
The human-generated and people-oriented nature of
unstructured data is both an unprecedented asset and a
disruptive force. Data’s value lies in its ability to capture the
desires, hopes, dreams, preferences, buying habits, likes,
and dislikes of everyday people, whether individually or in
aggregate. The disruptive nature of this data stems from two characteristics:
• It’s raw material. It requires processing to translate it
into a format that machines, and therefore people, can
understand and act upon at scale.
• It offers a window into human behavior and attitudes.
When enriched with demographic and location
information, data can introduce an unprecedented
level of insight and, potentially, privacy concerns.
Unstructured data requires a number of processes and technologies to:
• Identify the appropriate sources
• Crawl and extract it
• Detect and interpret the language being used
• Filter it for spam
• Categorize it for relevance (e.g., “Gap store” versus
• Analyze the content for context (sentiment, tone,
intensity, keywords, location, demographic information)
• Classify it so the business can act on it (a customer
service issue, a request for a product enhancement,
a question, etc.)
Each of these steps is rife with nuances that require both
sophisticated technologies and processes to address
(see Figure 1).
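The steps listed above can be sketched as a simple pipeline. This is a minimal illustration under stated assumptions, not a production design: the spam markers, brand terms, routing rules, and sample posts are all invented for the example, language is assumed to be already detected and stored in a `lang` field, and real systems replace each keyword check with far more sophisticated models.

```python
# All rule lists below are hypothetical placeholders.
SPAM_MARKERS = ("buy now", "free followers")
RELEVANT_TERMS = ("gap store", "gap jeans")
ACTION_RULES = {
    "customer_service": ("refund", "broken"),
    "product_request": ("wish", "please add"),
    "question": ("?",),
}


def is_spam(text):
    """Filter step: crude substring check for spam markers."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)


def is_relevant(text):
    """Categorize step: keep only posts that mention the brand terms."""
    lowered = text.lower()
    return any(term in lowered for term in RELEVANT_TERMS)


def classify(text):
    """Classify step: return the first action label whose cue appears."""
    lowered = text.lower()
    for label, cues in ACTION_RULES.items():
        if any(cue in lowered for cue in cues):
            return label
    return "other"


def run_pipeline(posts, language="en"):
    """Apply language, spam, and relevance filters, then classify."""
    routed = []
    for post in posts:
        if post["lang"] != language:
            continue
        if is_spam(post["text"]) or not is_relevant(post["text"]):
            continue
        routed.append((post["text"], classify(post["text"])))
    return routed


posts = [
    {"lang": "en", "text": "My order from the Gap store arrived broken"},
    {"lang": "en", "text": "Mind the gap between the train and the platform"},
    {"lang": "en", "text": "Gap store deals, buy now for free followers"},
    {"lang": "fr", "text": "J'adore le Gap store"},
    {"lang": "en", "text": "Does the Gap store on Main St open Sunday?"},
]
routed = run_pipeline(posts)
for text, label in routed:
    print(label, "<-", text)
```

Note that the French post is dropped without ever being read, which is exactly the kind of silent loss that a lack of language support introduces.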
The above challenges add up to a host of risks: missed
signals, inaccurate conclusions, bad decisions, high total
cost of data and tool ownership, and an inability to scale,
among others. Even a small misstep, such as a missing
source, a disparity in filtering algorithms, or a lack of
language support, can have a significant detrimental effect
on the trustworthiness of the results.
A recent story in Foreign Policy magazine provides a timely
example. “Why Big Data Missed the Early Warning Signs of
Ebola” highlights the importance of an early media report
published by Xinhua’s French-language newswire covering
a press conference about an outbreak of an unidentified
hemorrhagic fever in the Macenta prefecture in Guinea.10
The Foreign Policy article debunks some of the hyperbole
about the role of big data in identifying Ebola, not because
the technology wasn’t available (it was) or because the
indications weren’t there (they were), but because, as
author Kalev Leetaru writes, “part of the problem is that
the majority of media in Guinea is not published in English,
while most monitoring systems today emphasize English-
• Identify Data Sources: Not all data sources provide reliable APIs or consistent access.
• Crawl and Extract Data: Different tools use different crawlers, which can return different samples.
• Detect and Interpret Language: Not all tools support multiple languages, or support them equally well.
• Filter for Spam: Different spam filtering algorithms can also return different samples and accuracy levels.
• Categorize for Relevance: Inconsistent levels of accuracy.
• Analyze for Sentiment: Sentiment analysis is highly subjective and subject to interpretation or error. Even with human coding (which reduces scalability) and machine learning, no tool is perfect.
• Classify for Action: Requires both organizational and technology resources to tag data so that it is appropriately classified and shared with the right people.
FIGURE 1 CHALLENGES OF UNSTRUCTURED DATA
Even in the unlikely event that all relevant data is in English
or another single language, there’s no guarantee that it
will be easy to interpret or that the path to doing so will
be clear. For this reason, researchers in both industry
and academia are grappling with the many challenges
that large, unstructured human data poses as a tool
for conducting scientific or business research. The
following provides an example of how one organization is
addressing these significant methodological issues.
Case Study: Health Media Collaboratory
Applying Methodological Rigor to Big Data
The Health Media Collaboratory (HMC) at the University
of Illinois at Chicago’s Institute for Health Research and
Policy is focused on understanding social data, most of
which is unstructured, to “positively impact the health
behavior of individuals and communities,” according
to its website. In the broadest sense, HMC’s mission is
to develop and propagate a new paradigm for health
media research using innovative strategies to apply
methodological rigor to the analysis of big data.11
The focus of a recent project was to look at how people
talk about quitting smoking on Twitter so that HMC and
the Centers for Disease Control and Prevention (CDC)
could learn how they might promote behavior change.
Recently, HMC turned to Twitter to explore two questions
about the impact, if any, of social data on smoking
cessation. The initial research questions were:
• How much electronic-cigarette promotion is there on Twitter?
• How much organic conversation about electronic
cigarettes exists on Twitter?
In another project, HMC also looked at whether Twitter
could be used as a tool to evaluate the efficacy of health-
oriented media campaigns. In particular, the CDC wanted
to assess the impact of several provocative and graphic
television commercials, one of which featured a woman
with a hole in her throat. The questions HMC sought to answer were:
• Did the commercials work?
• How can we prove it?
This type of research, as well as the data it presents, is
vastly different from fielding a conventional multiple-
choice survey in which the questions and answers are
predefined and results tabulate the percentage of answers
in each column. HMC instead had to determine, with an
appropriate level of confidence, how people talk about
smoking on Twitter and whether this data could serve as a
useful indicator of public opinion and even of likely behavior.
To do this, the team needed to understand how much
of the Twitter conversation about smoking was spam,
how much was off topic (“smoking marijuana,” “smoking
ribs,” “smoking hot women”), and how much was relevant
(“I’ve really got to quit smoking cigarettes”). For the first
project, it also meant understanding how people talk about
electronic cigarettes in particular. Figure 2 is a recreation
of the search string HMC used in its research, illustrating
why this effort isn’t as simple as it might seem.
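A keyword filter of this kind might look like the sketch below. The include and exclude patterns are a short, assumed subset chosen for illustration; they are not HMC's search string, which is far longer. The sample tweets reuse the phrases quoted above plus invented ones.

```python
import re

# Assumed, abbreviated pattern lists (hypothetical, for illustration only).
INCLUDE = [r"\bcigarettes?\b", r"\be-?cigs?\b",
           r"\bvap(?:e|er|ers|ing)\b", r"\bquit smoking\b"]
EXCLUDE = [r"\bmarijuana\b", r"\bribs\b", r"\bsmoking hot\b"]

include_re = re.compile("|".join(INCLUDE), re.IGNORECASE)
exclude_re = re.compile("|".join(EXCLUDE), re.IGNORECASE)


def is_candidate(text):
    """True if the text matches an include pattern and no exclude pattern."""
    return bool(include_re.search(text)) and not exclude_re.search(text)


def filter_tweets(tweets):
    """Keep candidate tweets, counting each tweet ID only once."""
    seen = set()
    kept = []
    for tweet_id, text in tweets:
        if tweet_id in seen:
            continue
        if is_candidate(text):
            seen.add(tweet_id)
            kept.append((tweet_id, text))
    return kept


tweets = [
    (1, "I've really got to quit smoking cigarettes"),
    (2, "smoking ribs all afternoon"),
    (3, "He should quit smoking marijuana"),
    (4, "Trying an ecig instead of a cigarette"),
    (1, "I've really got to quit smoking cigarettes"),  # collected twice
]
kept_ids = [tweet_id for tweet_id, _ in filter_tweets(tweets)]
print(kept_ids)
```

Even this toy version shows the trade-off: every exclusion that removes noise can also remove a relevant tweet, which is why human review of the filtered sample remains necessary.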
The methodology that HMC used to collect, clean, and
analyze the Twitter conversation related to smoking
topics closely mirrors the big data challenges outlined
in Figure 1. While it adheres to scientific method, it’s
important to know that this was a methodology that
HMC itself devised to account for the nuances and
challenges of unstructured data.
1. Data collection. Determine the appropriate source
and sample size of the data to be collected.
2. Keyword selection. Generate the most comprehensive
possible list of keywords, encompassing nonstandard
English usages, slang terms, and misspellings.
3. Metadata. Collect metadata related to each tweet, including:
a. A tweet ID (a unique numerical identifier assigned
to each tweet)
b. The username and biographical profile of the
account used to post the tweet
c. Geolocation (if enabled by the user)
d. Number of followers of the posting account
e. The number of accounts the posting account follows
f. The posting account’s Klout score
h. URL links
i. Media content attached to the tweet.
4. Filtering for engagement. Because engagement with
the campaign was the determining factor for relevance,
the team filtered tweets that described televised
commercials, later de-duplicating them to ensure that
tweets with multiple keywords would not be counted twice.
5. Human coding. Throughout the process, human
coders reviewed the data to assess relevance and code
Key Words for E-Cigs
E cigarettes blue cigarette e cigarettes njoy cigarette e cigarettes blu cig e cigarettes njoy cig e cigarettes ecig e
cigarettes e cig e cigarettes @blucigs e cigarettes e-cigarette e cigarettes ecigarette ecigarettes from:blucigs e
cigarettes e-cigarette e cigarettes e-cigs e cigarettes ecigarettes e cigarettes e-cigarettes e cigarettes green
smoke e cigarettes south beach smoke e cigarettes cartomizer ecigarette (atomizer OR atomizers)-perfume e
cigarettes ehookah OR e-hookah e cigarettes ejuice OR ejuices OR e-juice OR e-juice ecigarettes eliquid OR
eliquids OR e-liquid OR e-liquids e cigarettes e-smoke OR e-smokes e cigarettes (esmoke OR esmokes)
sample:5 lang: en e cigarettes lavatube OR lavatubes e cigarettes logicecig OR logicecigs e cigarettes
smartsmoker e cigarettes smokestik OR Smokestiks e cigarettes v2 cig OR “v2 cigs” OR v2cig OR v2cigs vaper
or vapers OR vaping e cigarettes zerocig OR Zerocigs e cigarettes cartomizers e cigarettes e-cigarettes
FIGURE 2 HOW PEOPLE TALK ABOUT E-CIGARETTES
Source: University of Illinois at Chicago’s Institute for Health Research and Policy
6. Precision and relevance. The team used a
combination of human and machine coding to assess
relevance and eliminate false positives, using three
teams of trained coders and a process to assess
intercoder reliability using a Kappa score, a statistic
“used to assess inter-rater reliability when observing or
otherwise coding qualitative/categorical variables.”12
According to HMC, “the human-coded tweets were then
used to train a naïve Bayes classifier to automatically
classify the larger dataset of Tips engagement tweets
for relevance. Precision was calculated as the percent
of Tips-relevant tweets yielded by the keyword filters.”13
7. Recall. To assess whether the tweet sample was
representative of and could be generalized to all
potentially relevant Twitter content, the team compared
its sample to a larger sample of unretrieved tweets,
again using trained coders and a Kappa score to
assess how well the filtered tweet sample represented
the larger data set.14
8. Content coding. Finally, the team coded the content
to better understand “fear appeals,” that is, whether the
user accepted, rejected, or disregarded the message.
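Two of the measures named in this methodology, intercoder agreement and precision, are simple to compute once tweets have been labeled. The sketch below assumes two coders and uses Cohen's kappa; the source does not say which kappa variant HMC applied across its three teams, and the labels here are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


def precision(labels, positive="relevant"):
    """Share of retrieved items that were coded as relevant."""
    return sum(1 for label in labels if label == positive) / len(labels)


# Hypothetical codes for ten tweets returned by a keyword filter.
coder_a = ["relevant", "relevant", "relevant", "off_topic", "off_topic",
           "relevant", "off_topic", "relevant", "relevant", "off_topic"]
coder_b = ["relevant", "relevant", "relevant", "off_topic", "off_topic",
           "relevant", "off_topic", "relevant", "relevant", "relevant"]

kappa = cohens_kappa(coder_a, coder_b)
filter_precision = precision(coder_a)
print(f"kappa = {kappa:.2f}, precision = {filter_precision:.2f}")
```

Kappa discounts the agreement two coders would reach by chance, so it is a stricter test than the raw share of matching labels (nine of ten in this example).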
So, did the CDC’s graphic and disturbing anti-smoking ads
and the Twitter conversation surrounding them actually
lead people to quit? HMC didn’t overstate its data; rather, it
concluded that approximately 87% of the tweets about the
TV commercials expressed fear and that the ads had “the
desired result of jolting the audience into a thought process
that might have some impact on future behavior.”15
HMC’s case study illustrates that unstructured data
requires significant adaptations to analytics methodology
to extract meaning. Certainly it would have been a lot
simpler for the CDC to host a focus group or field a survey
to collect impressions about its anti-smoking campaign,
but that data, as comparatively simple as it would have
been to analyze, would lack the spontaneity and rich variety
of expression available on Twitter or other social networks,
had the teams extended the research to other sources.
The nature of human language demands rigorous and
repeatable processes to extract meaning in a transparent
and defensible way. As a result, analytics methodology is
undergoing an explosive period of change.
BIG DATA REQUIRES LINGUISTIC EXPERTISE
As counterintuitive as it might seem, an influx of
unstructured data demands not only new and more
sophisticated technologies to process and store it but a
renewed emphasis on the humanistic disciplines as well.
This is because, as Gartner has said, big data “tends to be
the human-generated and people-oriented content” rather
than highly structured data that fits neatly into databases.
Naturally, “human-generated and people-oriented content”
includes language, which is rife with contractions,
sarcasm, slang, and metaphors expressed in multiple
written forms, in hundreds of languages, 24 hours a day,
seven days a week.
Furthermore, language changes constantly, a fact Oxford
Dictionaries marks each November by publishing a word
of the year that encapsulates that year’s zeitgeist. 2014’s
word was “vape,” salient in light of HMC’s research. Five
years ago, “vape” would have been impossible to interpret,
because it—and its cultural context—didn’t exist yet.
A recent article in MIT Technology Review illustrates just
how quickly language and meaning can evolve, both in
obvious and subtle ways.16
Vivek Kulkarni, a PhD student
in the Data Science Lab at Stony Brook University, along
with several of his colleagues, used linguistic mapping
to illustrate the speed at which word meanings change,
gathering inputs from sources such as Google Books,
Amazon, and Twitter.
“Mouse” acquired an entirely new meaning following the
introduction of the computer mouse in the early 1970s, and
“sandy” changed literally overnight with Hurricane Sandy in
2012. Today we see a constant stream of examples both
of redefined words and of new ones (“vaping,” “selfie”) that
require both technological and humanistic expertise to map,
place in context, and understand.
BIG DATA REQUIRES EXPERTISE IN DATA SCIENCE AND CRITICAL THINKING
The speed, size, and variety of data around us—and the
availability of platforms used to visualize and analyze
it—have democratized the function of analytics within
organizations. At the same time, fundamental analytics
education has lagged, creating a situation in which
organizations are at risk of misinterpreting data of all
kinds. Says Philip B. Stark, professor and chair of statistics
at the University of California, Berkeley, “the type of data
(structured, text, etc.) isn’t the point at all. The way of
Stark emphasizes that good data science requires having
subject matter expertise, access to the appropriate
computational tools, and most importantly, critical thinking
and statistics skills. Figure 3 lays out the consequences of
overlooking any of these three foundational elements.
FIGURE 3 FUNDAMENTALS OF DATA SCIENCE
1. Irrelevant conclusions. If tools and critical thinking
are present but subject matter expertise is absent, the
organization risks asking the wrong questions, which
can result in irrelevant conclusions and valueless
answers. In addition, the organization will lack the
context necessary to design experiments that will yield
the answers it needs. It will be unable to understand
the intrinsic limitations of the data, says Stark: noise,
sampling issues, response bias, measurement bias,
and so on. This creates a domino effect that can
squander resources and lead to ineffectual—or worse,
2. Inability to execute. If subject matter expertise and
critical thinking are present, but tools are absent, the
organization will be unable to extract insights at scale
and must resort to time-consuming manual methods.
As a result, the organization risks burning out and
eventually losing top analysts, who now must focus on
brute-force methods of processing and analyzing data,
rather than using their skills for more sophisticated and
3. Incorrect conclusions. If subject matter expertise
and tools are present, but critical thinking and a
knowledge of applied statistics are absent, the
organization risks drawing the wrong conclusions from
good data, making poor decisions that may ignore
other critical business signals. Like a lack of subject
matter expertise, this can have harmful consequences
to decision making and, therefore, business results.
Given the spread of data throughout organizations and
the impracticality of hiring legions of trained analysts to
keep pace with its growth, the next step is to evolve from
analytics that simply describe a situation to analytics that
predict what may happen next and then to analytics that
prescribe a course of action.18
But even assuming access to the most sophisticated
algorithms that incorporate the most detailed business
knowledge, widespread access to data necessitates
that more people, irrespective of role, grasp the basics of
logic and statistics to understand that data. This doesn’t
mandate universal PhDs in applied statistics, but it does
require an awareness of basic principles of logic.
The good news is that, while the big data industry is still
in its infancy, many of the most valuable tools for analysis
are widely available—and more than two thousand years
old to boot. As early as 350 BCE, Aristotle described 13
logical fallacies, which logicians and philosophers have
built upon during the last 2,400 years.19
Overlooking these fallacies leaves organizations vulnerable to a host of risks,
which can harm competitive position, financial success,
customer sentiment and trust, and other critical objectives.
One common example is mistaking correlation for
causation, in which organizations erroneously attribute
one outcome (for example, increased revenue) to a
corresponding data point (for example, reach of a
marketing campaign). The increasing use of technologies
that present complex data visually can exacerbate the
problem. Harvard law student Tyler Vigen succinctly (and
sometimes hilariously) presents this phenomenon on his
Spurious Correlations blog.20
Year    Divorce rate in Maine    Per capita consumption of margarine (US)
2000    5.0                      8.2
2001    4.7                      7.0
2002    4.6                      6.5
2003    4.4                      5.3
2004    4.3                      5.2
2005    4.1                      4.0
2006    4.2                      4.6
2007    4.2                      4.5
2008    4.2                      4.2
2009    4.1                      3.7
FIGURE 4 MISTAKING CORRELATION FOR CAUSATION
Source: Tyler Vigen
In Figure 4, Vigen’s calculations show that there is a 99%
correlation between the divorce rate in Maine and per-
capita margarine consumption. Does the Maine divorce
rate somehow cause US residents to eat margarine? Does
US margarine consumption somehow lead to divorce in
Maine? While these questions are absurd, charts such as this
visually suggest a link.
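The correlation itself is easy to reproduce, which is part of the problem. The sketch below computes a Pearson correlation coefficient from the values shown in Figure 4; the choice of Pearson's r is an assumption about how the figure was calculated, and the function is a plain textbook implementation rather than Vigen's own code.

```python
from math import sqrt

# Values as shown in Figure 4, 2000 through 2009.
divorce_rate_maine = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_per_capita = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


r = pearson_r(divorce_rate_maine, margarine_per_capita)
print(f"r = {r:.3f}")  # about 0.99
```

Nothing in the calculation asks whether the two series have any plausible connection. The arithmetic is correct; the inference is what fails.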
The correlation/causation fallacy is just one of many
logical fallacies that have been documented and described
over the years, including formal fallacies (fallacies of
logic) and informal fallacies (fallacies of evidence or
As more tools become available to visualize
data sets quickly and easily, organizations must invest
as much in critical thinking and data science expertise
as they do in tools to visualize data. Otherwise, they risk
succumbing to logical fallacies.
BIG DATA RAISES MULTIPLE LEGAL AND ETHICAL ISSUES
The good news—and the bad news—about big data is that
it can provide unprecedented insight into people, both as
individuals and in aggregate. While surveys can, arguably,
reveal human attitudes, Christian Rudder, CEO of dating
site OKCupid, points out in his 2014 book, Dataclysm:
Who We Are When We Think No One’s Looking, that “we can
pinpoint the speaker, the words, the moment, even the
latitude and longitude of human communication.”22
Many people know the story of how Target discovered
that a young girl was pregnant before her father did; such
stories have become mainstream.23
But much of the
challenge with recent discussions on ethics and privacy
stems from the extremely broad nature of these terms,
the spectrum of personal preferences, and the beliefs of
individuals about the media environment we live in today.
Consider these recent examples:
• Seeking to prevent suicides, Samaritans Radar raises
privacy concerns. In October 2014, the BBC reported
that the Samaritans had launched an app that would
monitor words and phrases such as “hate myself” and
“depressed” on Twitter and would notify users if any
of the people they follow appear to be suicidal.24
Although the app was developed to help people reach out to
those in need, privacy advocates expressed concern
that the information could be used to target and profile
individuals without their consent. According to a
petition filed on Change.org, the Samaritans app was
monitoring approximately 900,000 Twitter accounts
as of late October.25
By November 7, the app was
suspended based on public feedback.26
• Facebook’s “Emotional Contagion” experiment
provokes outrage about its methodology. In June
2014, Facebook’s Adam Kramer published a study in
Proceedings of the National Academy of Sciences,
revealing that “emotional states can be transferred
to others via emotional contagion, leading people
to experience the same emotions without their awareness.”27
In other words, seeing negative stories
on Facebook can make you sad. The experiment
provoked outrage about the perceived lack of informed
consent, the ethical repercussions of such a study,
the concern over appropriate peer review, the privacy
implications, and the precedent such a study might
set for research using digital data.
• Uber knows when and where (and possibly with
whom) you’ve spent the night. In March 2012, Uber
posted, and later deleted, a blog post entitled “Rides of
Glory,” which revealed patterns, by city, of Uber rides
after “brief overnight weekend stays,” also known as
the passenger version of the walk of shame.28
Uber was later criticized for allegedly revealing its “God
View” at an industry event, showing attendees the
precise location of a particular journalist without his
knowledge, while a December 1, 2014, post on Talking
Points Memo disclosed the story of a job applicant who
was allegedly shown individuals’ live travel information
during an interview.29, 30
• A teenager becomes an Internet celebrity—and a
target—in one day. Alex Lee, a 16-year-old Target clerk,
became a meme (#AlexFromTarget) and a celebrity
within hours, based on a photo taken of him unawares
at work. He was invited to appear on The Ellen Show
and was reported to have received death threats on
These stories illustrate several attributes of the data
environment we live in now and the attendant ethical
issues they represent:
• Data collection. The Samaritans example illustrates
the law of unintended consequences: what may
happen when an app collects data that may, albeit
unintentionally, compromise privacy or put people in
• Methodology and usage. The Facebook example
demonstrates what happens when a company uses
its vast reservoir of data to run technically legal but
ethically ambiguous experiments on its users, raising
questions about the nature of informed consent and
ethical data use in the digital age.
• Aggregation, storage, and stewardship. The Uber posts
illustrate, albeit with aggregated data, the intensely
intimate nature of the data users entrust to companies,
raising questions of stewardship, ethics (is aggregating
such data ethical?), and privacy (what happens if data is
intentionally or accidentally disclosed?).
• Communication. All of the above examples illustrate
the gray areas between law and ethics, or, from
an organizational point of view, risk management
and customer experience. As data becomes even
more valuable and ubiquitous, the way in which
organizations communicate—about collection,
analysis, intent and usage—will affect not only their
legal risk profile, but their ability to attract and retain
the trust and loyalty of their communities.
Finally, there is, as former secretary of defense Donald
Rumsfeld so famously called it, “the unknown unknown.”
The #AlexFromTarget story demonstrates not only how an
everyday 16-year-old (by definition, a minor) can become
an instant Internet celebrity but also
how a company can unwittingly and suddenly find itself
at the center of a crisis not of its own creation, one that
raises issues (compounded because of Lee’s age) of
employee privacy and even safety.
Figure 5 lays out these issues at a high level.
In the past, many of these ethical issues related to data
were cloaked behind proprietary systems and siloed data
stores. As data becomes ubiquitous, more integrated,
and more portable, however, the number and type of
ethical gray areas will multiply, along with a need to
distinguish the organization’s legal responsibilities, such
as what it discloses in a terms of service, from its ethical
ones—the actions it takes that promote or erode the trust
of its community.
• How the data may have been filtered, enriched, or otherwise modified, including human or algorithmic coding and the process for assessing precision, relevance, and recall
• How the organization may change the experience based on data
• Whether the organization plans to sell the data in any form to a third party
• How data is combined and its impact on personally identifiable information (PII) or user experience in general
• What data is collected
• How and for how long data is stored
• Who owns the data
• Who has the right to delete data (posts or entire profiles)
• Process for deleting data (posts or entire profiles)
• Who has the right to view/modify/share data (administration)
• Whether and how the data can be extracted
• The extent to which the organization communicates with users about what and how it collects, analyzes, stores, aggregates, and uses their data
FIGURE 5 ETHICAL ISSUES RELATED TO DATA
Define data strategy and operating model
If data is to be considered a business-critical asset, it must be treated as such by leaders who drive and
instill strategy across the organization. In 2015, leaders must define what critical data streams are needed
to drive business goals, how they will source them, and what operating model is needed to process,
interpret, and act on them at the right time.
The challenge is that an organization’s departments (and therefore the data) tend to be siloed, which can
result in blind spots, organizational politics, and spiraling costs. Organizations must balance their need
for insight and competitive advantage on the one hand and privacy and rational cost of ownership on the
other. All too frequently, these dual imperatives are in conflict, sometimes unnecessarily so, because the
organization does not have a clear strategy for what data will be used and stored, what data will be used
but not stored, and what data is simply unnecessary.
Update analytics methodology to reflect new data realities
Analyzing unstructured data will never yield the same confidence levels as a simple binary choice; it will
always require interpretation. The key is to make that interpretation transparent, rigorous, and repeatable so
that others can reliably repeat analyses and yield the same or substantially similar results.
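One way to make such interpretation checkable is to have two people code the same sample of posts independently and measure how far their agreement exceeds chance, for example with Cohen's kappa (cited in the notes to this report). The following is a minimal sketch in Python, not the method used by any organization named here; the function name and the six sample labels are hypothetical and purely illustrative.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, estimated from each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six posts (illustrative only, not data from this report).
coder_a = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
coder_b = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant", "relevant"]
kappa = cohens_kappa(coder_a, coder_b)
```

A value near 1 suggests the coding rules are clear enough for others to repeat; a value near 0 suggests agreement is no better than chance and that the coding guidelines need to be tightened before the analysis is relied upon.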
This is one area in which there is a tremendous difference between private and public institutions. In
private institutions, work process, product, and data tend to be proprietary. In public institutions, such as
universities, research is subject to the highest levels of scrutiny among academic publications and journals.
It’s also important to engineer the method of measurement into initiatives to reduce ambiguity and provide
a greater ability to trace impact. The broader the topic, the more hashtags can help confirm the provenance
and relevance of social conversation. Tracking codes and multivariate testing are also a useful if not
Seek out critical thinking and diverse skill sets
Unquestionably, engineering and analytical skills, not to mention skills in applied statistics and data science,
will continue to gain value as organizations become ever more dependent on multiple data types. At the
same time, the demands of analyzing unstructured data also require skill in interpreting context related
to language and behavior, a challenge humans have had since we developed language. After all, even the
cleanest, most reliable data can be misinterpreted, whether intentionally or unintentionally.
Minimizing misinterpretation means valuing not only math and engineering but also social sciences and
humanities. These disciplines—sociology, psychology, anthropology, linguistics, ethics, philosophy, and
rhetoric—provide context and help us become better critical thinkers. Without a balance of critical thinking,
business knowledge, and smart analytics tools, we’re in danger of making the wrong decision much more
efficiently, quickly, and with far greater impact than we have in the past.
If we—individually and collectively—are to make the best use of data and extract relevant
insight from it in a trustworthy manner, we must approach data strategy thoughtfully.
Following are some basic tenets of a strategic data plan.
The hype over “big data” has partially obscured the fact that our ability to collect, analyze,
and act on data—and to some extent predict outcomes based upon it—is a potentially
transformative force for business and humanity alike. While Aldous Huxley couldn’t have
anticipated the impact of a Kim Kardashian magazine cover or the challenges inherent
in understanding how people talk about smoking, he was prescient to call out the ever-
increasing difficulty of identifying relevance in a “sea of irrelevance.”32
It seems likely that the privacy and ethical implications of data ubiquity, not to mention
recent disclosures about government access to and use of personal data, would have
confirmed many of Orwell’s worst fears. At the same time, however, we do not need to
blindly accept the dystopian nightmare he envisioned as our only future. We have an
opportunity, and an obligation, to examine not only the legal but also the ethical implications
of ubiquitous data, and to use this understanding to decide how we will use it, sustainably
and responsibly, for years to come.
Insist on ethical data use and transparent disclosure
Earl Warren, former chief justice of the United States, once said, “In civilized life, law floats in a sea of ethics.”
This is especially true of the digital age, in which few of the implications of digital transformation
have found their way into case law and, as a result, organizational policy. As organizations become
more data centric, for their own benefit as well as their customers’, they must also look closely at
the affirmative and passive decisions they make about where they get their data; their analytics
methodology; how they store, steward, aggregate, and use the data; and how transparently they disclose
Reward and reinforce humility and learning
It is nearly impossible to calculate the impact that data will have on our lives in the next decade.
Technologies such as IBM’s Watson, Ayasdi, and others are illustrating the many applications for big
data, whether in healthcare, consumer products, financial services, energy, or elsewhere. Meanwhile, the
Internet of Things introduces data feeds from sensors, which can be combined with other data streams
to deliver specific, relevant, and even predictive insights that will only compound volume, velocity, and variety.
Yet the world is just starting to come to terms with the impact of data ubiquity from the technology,
business, research, cultural, and ethical perspectives. The most important and perhaps most difficult
impact of data ubiquity is the fact that it radically undermines traditional methods of analysis and laughs
at our desire for certainty. The only strategy to combat the fear of uncertainty is to accept and work with
the limits of the data and approach the science of challenging data sets with an appetite for continuous
learning, whether the goal is to sell a pair of shoes or to help prevent cancer.
You can view the talk at http://www.ted.com/talks/susan_
Neil Postman, Amusing Ourselves to Death: Public Discourse in the
Age of Show Business (New York: Penguin Books, 1985), vii.
For a more detailed view, a good starting point is “3D Data
Management: Controlling Data Volume, Velocity and Variety,”
published by META Group on February 6, 2001, http://blogs.
“What Is Big Data?” IBM, accessed January 6, 2015, http://www-
“Statistics,” YouTube, accessed January 6, 2015, https://www.
“Stats,” WordPress, cached on November 2, 2014, http://
“About,” Twitter, accessed January 6, 2015, https://about.twitter.
Darin Stewart, “Big Content: The Unstructured Side of Big Data,”
Gartner Group, May 1, 2013, http://blogs.gartner.com/darin-
Zacks Equity Research, “Stock Market News for
December 17, 2014 - Market News,” Yahoo! Finance,
December 17, 2014, http://finance.yahoo.com/news/
Kalev Leetaru, “Why Big Data Missed the Early Warning Signs of
Ebola,” Foreign Policy, September 26, 2014, http://foreignpolicy.
See also: Sherry L. Emery, Glen Szczypka, Eulàlia P. Abril,
Yoonsang Kim, and Lisa Vera, “Are You Scared Yet? Evaluating
Fear Appeal Messages in Tweets About the Tips Campaign,”
Journal of Communication, 64 (2014): 278–295, doi: 10.1111/
“Cohen’s Kappa,” University of Nebraska–Lincoln, accessed
January 6, 2015, http://psych.unl.edu/psycrs/handcomp/
Sherry L. Emery, “Are You Scared Yet?”
“Linguistic Mapping Reveals How Word Meanings Sometimes
Change Overnight,” MIT Technology Review, November 23, 2014,
Philip Stark, Twitter comment, November 24, 2014, https://
For a quick primer on descriptive, predictive, and prescriptive
analytics, see this interview with data scientist Michael Wu
of Lithium by Jeff Bertolucci in InformationWeek: http://www.
To download the text, go to http://classics.mit.edu/Aristotle/
Vigen maintains a running list of spurious correlations at his
blog, Spurious Correlations (http://tylervigen.com/).
For an excellent tutorial on logical fallacies, see chapter 2 of
“SticiGui,” an online statistics textbook by Philip B. Stark, professor
and chair of the department of statistics, University of California,
Rudder, Dataclysm, 146.
Kashmir Hill, “How Target Figured Out a Teen Girl Was Pregnant
Before Her Father Did,” Forbes, February 16, 2012, http://www.
Zoe Kleinman, “Samaritans App Monitors Twitter Feeds for
Suicide Warnings,” BBC News, October 28, 2014, http://www.bbc.
Adrian Short, “Shut Down Samaritans Radar,” Change.org,
accessed January 6, 2015, https://www.change.org/p/twitter-
“Samaritans Radar announcement - Friday 7 November,”
Samaritans, November 7, 2014, http://www.samaritans.org/
Adam D.I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock,
“Experimental Evidence of Massive-Scale Emotional Contagion
Through Social Networks,” Proceedings of the National Academy
of Sciences of the United States of America, vol. 111 (24), DOI:
Voytek, “Rides of Glory,” Uber, cached March 26, 2012, https://
Kashmir Hill, “‘God View’: Uber Allegedly Stalked Users for
Party-Goers’ Viewing Pleasure (Updated),” Forbes, October 3,
pleasure/.
Caitlin MacNeal, “Report: Uber Let Job Applicant Access
Controversial ‘God View’ Mode,” Talking Points Memo, December
1, 2014, http://talkingpointsmemo.com/livewire/uber-job-
Nick Bilton, “Alex from Target: The Other Side of Fame,” The
New York Times, November 12, 2014, http://www.nytimes.
Aldous Huxley, Brave New World Revisited (New York:
HarperCollins Publishers, 1958), 36.
Earl Warren, speech at the Louis Marshall Award Dinner of the
Jewish Theological Seminary (Americana Hotel, New York City,
November 11, 1962).
SOURCES AND ACKNOWLEDGMENTS
This document was developed as a companion piece to a talk
given at TED@IBM in San Francisco, California, on September 23,
2014. As such, it was built on online and in-person conversations
with market influencers, technology vendors, brands, academics,
and others on the effective and ethical use of big data, as well as
secondary research, including relevant and timely books, articles,
and news stories. My deepest gratitude to the following:
• The team at the Health Media Collaboratory at the University
of Illinois at Chicago, specifically Sherry Emery, Eman Aly, and
Glen Szczypka for sharing their research and methodology and
educating me about the nuances of interpreting big data for
• My fellow board members at the Big Boulder Initiative for
their insights and perspective on the effective and ethical
use of social data: Pernille Bruun-Jensen, CMO, NetBase;
Damon Cortesi, Founder and CTO, Simply Measured; Jason
Gowans, Director, Data Lab, Nordstrom; Will McInnes, CMO,
Brandwatch; Chris Moody, Vice President, Data Strategy,
Twitter (Chair); Stuart Shulman, Founder and CEO, Texifier;
Carmen Sutter, Product Manager, Social, Adobe; and Tom
Watson, Head of Sales, Hanweck Associates, LLC.
• The team at TED who helped me hone and focus the talk and
provided invaluable feedback throughout: Juliet Blake
and Anna Bechtol.
• The team at IBM Social Business for planning, executing and
marketing a superb event: Michela Stribling, Beth McElroy,
Jacqueline Saenz and Michelle Killebrew.
• My fellow TED@IBM speakers: Gianluca Ambrosetti, Kare
Anderson, Brad Bird, Monika Blaumueller, Erick Brethenoux,
Lisa Seacat DeLuca, Jon Iwata, Bryan Kramer, Tan Le, Charlene
Li, Florian Pinel, Inhi Cho Suh, Marie Wallace,
and Kareem Yusuf.
• Philip Stark, professor and chair of Statistics, University of
California, Berkeley, for an extremely insightful perspective on
the methodological and organizational requirements of big
data, as well as access to his superb course materials.
• The organizers and speakers at the International Symposium
on Digital Ethics at Loyola University in November 2014, with
whom I had some incredibly insightful conversations: Don
Heider, dean, School of Communication, Loyola University
Chicago; Thorsten Busch, senior research fellow, Institute for
Business Ethics, University of St. Gallen; Michael Koliska, PhD
candidate at University of Maryland; and Caitlin Ring, assistant
professor of strategic communication at Seattle University.
• Farida Vis, research fellow in the Social Sciences in the
Information School at the University of Sheffield.
• The teams at DataSift (Nick Halstead, Tim Barker, Jason Rose,
Seth Catalli); Lithium Technologies (Katy Keim and Nicol
Addison); and Oracle (Tara Roberts and Christine Wan) for
valuable insights along the way.
• Tyler Vigen for his Spurious Correlations blog, which makes a
complex topic simple and fun to explain; Gary Schroeder for
his wonderful visual storytelling of my TED talk; Daniel K. Davis
for his superb photography at TED@IBM; Vladimir Mirkovic for
graphic design; and Erin Brenner for copyediting.
• My talented teammates at Altimeter Group: Rebecca Lieb, who
edited this report, Cheryl Graves, Jessica Groopman, Jaimy
Szymanski, Christine Tran, and, of course, Charlene Li.
Input into this document does not represent a complete
endorsement of the report by the individuals or organizations
listed above. Finally, any errors are mine alone.
This independent research report was 100% funded by Altimeter
Group. This report is published under the principle of Open
Research and is intended to advance the industry at no cost. This
report is intended for you to read, utilize, and share with others; if
you do so, please provide attribution to Altimeter Group.
The Creative Commons License is Attribution-Noncommercial-
ShareAlike 3.0 United States, which can be found at https://
ALTHOUGH THE INFORMATION AND DATA USED IN THIS REPORT HAVE BEEN PRODUCED
AND PROCESSED FROM SOURCES BELIEVED TO BE RELIABLE, NO WARRANTY EXPRESSED
OR IMPLIED IS MADE REGARDING THE COMPLETENESS, ACCURACY, ADEQUACY, OR USE
OF THE INFORMATION. THE AUTHORS AND CONTRIBUTORS OF THE INFORMATION AND
DATA SHALL HAVE NO LIABILITY FOR ERRORS OR OMISSIONS CONTAINED HEREIN OR FOR
INTERPRETATIONS THEREOF. REFERENCE HEREIN TO ANY SPECIFIC PRODUCT OR VENDOR
BY TRADE NAME, TRADEMARK, OR OTHERWISE DOES NOT CONSTITUTE OR IMPLY ITS
ENDORSEMENT, RECOMMENDATION, OR FAVORING BY THE AUTHORS OR CONTRIBUTORS
AND SHALL NOT BE USED FOR ADVERTISING OR PRODUCT ENDORSEMENT PURPOSES.
THE OPINIONS EXPRESSED HEREIN ARE SUBJECT TO CHANGE WITHOUT NOTICE.
How to Work with Us
Altimeter Group research is applied and brought to life in our client engagements. We help organizations understand and
take advantage of digital disruption. There are several ways Altimeter can help you with your business initiatives:
• Strategy Consulting. Altimeter creates strategies and plans to help companies act on disruptive business and
technology trends. Our team of analysts and consultants works with senior executives, strategists, and marketers on
needs assessment, strategy roadmaps, and pragmatic recommendations across disruptive trends.
• Education and Workshops. Engage an Altimeter speaker to help make the business case to executives or arm
practitioners with new knowledge and skills.
• Advisory. Retain Altimeter for ongoing research-based advisory: conduct an ad-hoc session to address an immediate
challenge; or gain deeper access to research and strategy counsel.
To learn more about Altimeter’s offerings, contact email@example.com.
Altimeter is a research and
consulting firm that helps
companies understand and
act on technology disruption.
We give business leaders the
insight and confidence to help
their companies thrive in the
face of disruption. In addition to
publishing research, Altimeter
Group analysts speak and
provide strategy consulting
on trends in leadership, digital
transformation, social business,
data disruption and content
1875 S Grant St #680
San Mateo, CA 94402
Susan Etlinger, Industry Analyst
Susan Etlinger is an industry analyst at Altimeter Group,
where she works with global organizations to develop
data and analytics strategies that support their business
objectives. Susan has a diverse background in marketing
and strategic planning within both corporations and
agencies. She’s a frequent speaker on social data and
analytics and has been extensively quoted in outlets
including Fast Company, BBC, The New York Times, and The
Wall Street Journal. Find her on Twitter at @setlinger and at
her blog, Thought Experiments, at susanetlinger.com.
Rebecca Lieb, Industry Analyst
Rebecca Lieb (@lieblink) covers digital advertising and
media, encompassing brands, publishers, agencies and
technology vendors. In addition to her background as a
marketing executive, she was VP and editor-in-chief of the
ClickZ Network for over seven years. She’s written two
books on digital marketing: The Truth About Search Engine
Optimization (2009) and Content Marketing (2011). Rebecca
blogs at www.rebeccalieb.com/blog.