www.it-ebooks.info
Foster Provost and Tom Fawcett
Data Science for Business
Data Science for Business
by Foster Provost and Tom Fawcett
Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Rebecca Demarest
July 2013: First Edition
Revision History for the First Edition:
2013-07-25: First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449361327 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Many of the designations used by man‐
ufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations
appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been
printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information contained
herein.
ISBN: 978-1-449-36132-7
[LSI]
Dream no small dreams for they have no power to
move the hearts of men.
—Johann Wolfgang von Goethe
CHAPTER 1
Introduction: Data-Analytic Thinking
The past fifteen years have seen extensive investments in business infrastructure, which
have improved the ability to collect data throughout the enterprise. Virtually every as‐
pect of business is now open to data collection and often even instrumented for data
collection: operations, manufacturing, supply-chain management, customer behavior,
marketing campaign performance, workflow procedures, and so on. At the same time,
information is now widely available on external events such as market trends, industry
news, and competitors’ movements. This broad availability of data has led to increasing
interest in methods for extracting useful information and knowledge from data—the
realm of data science.
The Ubiquity of Data Opportunities
With vast amounts of data now available, companies in almost every industry are fo‐
cused on exploiting data for competitive advantage. In the past, firms could employ
teams of statisticians, modelers, and analysts to explore datasets manually, but the vol‐
ume and variety of data have far outstripped the capacity of manual analysis. At the
same time, computers have become far more powerful, networking has become ubiq‐
uitous, and algorithms have been developed that can connect datasets to enable broader
and deeper analyses than previously possible. The convergence of these phenomena has
given rise to the increasingly widespread business application of data science principles
and data-mining techniques.
Probably the widest applications of data-mining techniques are in marketing for tasks
such as targeted marketing, online advertising, and recommendations for cross-selling.
Data mining is used for general customer relationship management to analyze customer
behavior in order to manage attrition and maximize expected customer value. The
finance industry uses data mining for credit scoring and trading, and in operations via
fraud detection and workforce management. Major retailers from Walmart to Amazon
apply data mining throughout their businesses, from marketing to supply-chain man‐
agement. Many firms have differentiated themselves strategically with data science,
sometimes to the point of evolving into data mining companies.
The primary goals of this book are to help you view business problems from a data
perspective and understand principles of extracting useful knowledge from data. There
is a fundamental structure to data-analytic thinking, and basic principles that should
be understood. There are also particular areas where intuition, creativity, common
sense, and domain knowledge must be brought to bear. A data perspective will provide
you with structure and principles, and this will give you a framework to systematically
analyze such problems. As you get better at data-analytic thinking you will develop
intuition as to how and where to apply creativity and domain knowledge.
Throughout the first two chapters of this book, we will discuss in detail various topics
and techniques related to data science and data mining. The terms “data science” and
“data mining” often are used interchangeably, and the former has taken on a life of its own
as various individuals and organizations try to capitalize on the current hype surround‐
ing it. At a high level, data science is a set of fundamental principles that guide the
extraction of knowledge from data. Data mining is the extraction of knowledge from
data, via technologies that incorporate these principles. As a term, “data science” often
is applied more broadly than the traditional use of “data mining,” but data mining tech‐
niques provide some of the clearest illustrations of the principles of data science.
It is important to understand data science even if you never intend to
apply it yourself. Data-analytic thinking enables you to evaluate pro‐
posals for data mining projects. For example, if an employee, a con‐
sultant, or a potential investment target proposes to improve a partic‐
ular business application by extracting knowledge from data, you
should be able to assess the proposal systematically and decide wheth‐
er it is sound or flawed. This does not mean that you will be able to
tell whether it will actually succeed—for data mining projects, that
often requires trying—but you should be able to spot obvious flaws,
unrealistic assumptions, and missing pieces.
Throughout the book we will describe a number of fundamental data science principles,
and will illustrate each with at least one data mining technique that embodies the prin‐
ciple. For each principle there are usually many specific techniques that embody it, so
in this book we have chosen to emphasize the basic principles in preference to specific
techniques. That said, we will not make a big deal about the difference between data
science and data mining, except where it will have a substantial effect on understanding
the actual concepts.
1. Of course! What goes better with strawberry Pop-Tarts than a nice cold beer?
Let’s examine two brief case studies of analyzing data to extract predictive patterns.
Example: Hurricane Frances
Consider an example from a New York Times story from 2004:
Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct
hit on Florida’s Atlantic coast. Residents made for higher ground, but far away, in Ben‐
tonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great
opportunity for one of their newest data-driven weapons … predictive technology.
A week ahead of the storm’s landfall, Linda M. Dillman, Wal-Mart’s chief information
officer, pressed her staff to come up with forecasts based on what had happened when
Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes’ worth of
shopper history that is stored in Wal-Mart’s data warehouse, she felt that the company
could ‘start predicting what’s going to happen, instead of waiting for it to happen,’ as she
put it. (Hays, 2004)
Consider why data-driven prediction might be useful in this scenario. It might be useful
to predict that people in the path of the hurricane would buy more bottled water. Maybe,
but this point seems a bit obvious, and why would we need data science to discover it?
It might be useful to project the amount of increase in sales due to the hurricane, to
ensure that local Wal-Marts are properly stocked. Perhaps mining the data could reveal
that a particular DVD sold out in the hurricane’s path—but maybe it sold out that week
at Wal-Marts across the country, not just where the hurricane landing was imminent.
The prediction could be somewhat useful, but is probably more general than Ms. Dill‐
man was intending.
It would be more valuable to discover patterns due to the hurricane that were not ob‐
vious. To do this, analysts might examine the huge volume of Wal-Mart data from prior,
similar situations (such as Hurricane Charley) to identify unusual local demand for
products. From such patterns, the company might be able to anticipate unusual demand
for products and rush stock to the stores ahead of the hurricane’s landfall.
Indeed, that is what happened. The New York Times (Hays, 2004) reported that: “… the
experts mined the data and found that the stores would indeed need certain products
—and not just the usual flashlights. ‘We didn’t know in the past that strawberry Pop-
Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,’
Ms. Dillman said in a recent interview. ‘And the pre-hurricane top-selling item was
beer.’”1
Example: Predicting Customer Churn
How are such data analyses performed? Consider a second, more typical business sce‐
nario and how it might be treated from a data perspective. This problem will serve as a
running example that will illuminate many of the issues raised in this book and provide
a common frame of reference.
Assume you just landed a great analytical job with MegaTelCo, one of the largest tele‐
communication firms in the United States. They are having a major problem with cus‐
tomer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone
customers leave when their contracts expire, and it is getting increasingly difficult to
acquire new customers. Since the cell phone market is now saturated, the huge growth
in the wireless market has tapered off. Communications companies are now engaged
in battles to attract each other’s customers while retaining their own. Customers switch‐
ing from one company to another is called churn, and it is expensive all around: one
company must spend on incentives to attract a customer while another company loses
revenue when the customer departs.
You have been called in to help understand the problem and to devise a solution. At‐
tracting new customers is much more expensive than retaining existing ones, so a good
deal of marketing budget is allocated to prevent churn. Marketing has already designed
a special retention offer. Your task is to devise a precise, step-by-step plan for how the
data science team should use MegaTelCo’s vast data resources to decide which customers
should be offered the special retention deal prior to the expiration of their contracts.
Think carefully about what data you might use and how they would be used. Specifically,
how should MegaTelCo choose a set of customers to receive their offer in order to best
reduce churn for a particular incentive budget? Answering this question is much more
complicated than it may seem initially. We will return to this problem repeatedly through
the book, adding sophistication to our solution as we develop an understanding of the
fundamental data science concepts.
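As a starting point for thinking about the question, here is a deliberately naive baseline sketched in Python: rank customers by an estimated probability of churning and make the offer to as many of the highest-risk customers as the budget covers. Everything in it is an assumption made for illustration, including the `Customer` record, the `churn_probability` field (imagined as the output of some predictive model), the cost per offer, and the budget. It ignores how much each customer is worth and whether the offer would change their decision, which is part of why the real problem is harder than it first appears.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    churn_probability: float  # hypothetical model estimate in [0, 1]


def choose_offer_recipients(customers, offer_cost, budget):
    """Return IDs of the highest-risk customers that the budget can cover."""
    if offer_cost <= 0:
        raise ValueError("offer_cost must be positive")
    # Number of offers the budget pays for, rounded down.
    max_offers = int(budget // offer_cost)
    # Highest estimated churn probability first.
    ranked = sorted(customers, key=lambda c: c.churn_probability, reverse=True)
    return [c.customer_id for c in ranked[:max_offers]]


# Made-up customers and costs, for illustration only.
customers = [
    Customer("A", 0.10),
    Customer("B", 0.65),
    Customer("C", 0.30),
    Customer("D", 0.80),
]
selected = choose_offer_recipients(customers, offer_cost=50.0, budget=120.0)
print(selected)
```

With a budget that covers two offers, the sketch selects the two customers with the highest estimated churn probability.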
In reality, customer retention has been a major use of data mining
technologies—especially in telecommunications and finance business‐
es. These more generally were some of the earliest and widest adopt‐
ers of data mining technologies, for reasons discussed later.
Data Science, Engineering, and Data-Driven Decision Making
Data science involves principles, processes, and techniques for understanding phenomena
via the (automated) analysis of data. In this book, we will view the ultimate goal
of data science as improving decision making, as this generally is of direct interest to
business.
Figure 1-1. Data science in the context of various data-related processes in the
organization.
Figure 1-1 places data science in the context of various other closely related and data-
related processes in the organization. It distinguishes data science from other aspects
of data processing that are gaining increasing attention in business. Let’s start at the top.
Data-driven decision-making (DDD) refers to the practice of basing decisions on the
analysis of data, rather than purely on intuition. For example, a marketer could select
advertisements based purely on her long experience in the field and her eye for what
will work. Or, she could base her selection on the analysis of data regarding how con‐
sumers react to different ads. She could also use a combination of these approaches.
DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or
lesser degrees.
The benefits of data-driven decision-making have been demonstrated conclusively.
Economist Erik Brynjolfsson and his colleagues from MIT and Penn’s Wharton School
conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim,
2011). They developed a measure of DDD that rates firms as to how strongly they use
If you can’t explain it simply, you don’t understand
it well enough.
—Albert Einstein
CHAPTER 14
Conclusion
The practice of data science can best be described as a combination of analytical engi‐
neering and exploration. The business presents a problem we would like to solve. Rarely
is the business problem directly one of our basic data mining tasks. We decompose the
problem into subtasks that we think we can solve, usually starting with existing tools.
For some of these tasks we may not know how well we can solve them, so we have to
mine the data and conduct evaluation to see. If that does not succeed, we may need to
try something completely different. In the process we may discover knowledge that will
help us to solve the problem we had set out to solve, or we may discover something
unexpected that leads us to other important successes.
Neither the analytical engineering nor the exploration should be omitted when con‐
sidering the application of data science methods to solve a business problem. Omitting
the engineering aspect usually makes it much less likely that the results of mining data
will actually solve the business problem. Omitting the understanding of process as one
of exploration and discovery often keeps an organization from putting the right man‐
agement, incentives, and investments in place for the project to succeed.
The Fundamental Concepts of Data Science
Both the analytical engineering and the exploration and discovery are made more sys‐
tematic and thereby more likely to succeed by the understanding and embracing of the
fundamental concepts of data science. In this book we have introduced a collection of
the most important fundamental concepts. Some of these concepts we made into headliners
for the chapters, and others were introduced more naturally through the discussions
(and not necessarily labeled as fundamental concepts). These concepts span the
process from envisioning how data science can improve business decisions, to applying
data science techniques, to deploying the results to improve decision-making. The con‐
cepts also undergird a large array of business analytics.
We can group our fundamental concepts roughly into three types:
1. General concepts about how data science fits in the organization and the compet‐
itive landscape, including ways to attract, structure, and nurture data science teams,
ways for thinking about how data science leads to competitive advantage, ways that
competitive advantage can be sustained, and tactical principles for doing well with
data science projects.
2. General ways of thinking data-analytically, which help us to gather appropriate data
and consider appropriate methods. The concepts include the data mining process,
the collection of different high-level data science tasks, as well as principles such as
the following.
• Data should be considered an asset, and therefore we should think carefully about
what investments we should make to get the best leverage from our asset
• The expected value framework can help us to structure business problems so we
can see the component data mining problems as well as the connective tissue of
costs, benefits, and constraints imposed by the business environment
• Generalization and overfitting: if we look too hard at the data, we will find patterns;
we want patterns that generalize to data we have not yet seen
• Applying data science to a well-structured problem versus exploratory data mining
require different levels of effort in different stages of the data mining process
3. General concepts for actually extracting knowledge from data, which undergird the
vast array of data science techniques. These include concepts such as the following.
• Identifying informative attributes—those that correlate with or give us informa‐
tion about an unknown quantity of interest
• Fitting a numeric function model to data by choosing an objective and finding a
set of parameters based on that objective
• Controlling complexity is necessary to find a good trade-off between generalization
and overfitting
• Calculating similarity between objects described by data
Once we think about data science in terms of its fundamental concepts, we see the same
concepts underlying many different data science strategies, tasks, algorithms, and pro‐
cesses. As we have illustrated throughout the book, these principles not only allow us
to understand the theory and practice of data science much more deeply, they also allow
us to understand the methods and techniques of data science very broadly, because these
methods and techniques are quite often simply particular instantiations of one or more
of the fundamental principles.
At a high level we saw how structuring business problems using the expected value
framework allows us to decompose problems into data science tasks that we understand
better how to solve, and this applies across many different sorts of business problems.
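To make the decomposition concrete, the following Python sketch computes the expected value of targeting one customer as the probability of a response times the value of a response, plus the probability of no response times the value of no response. The function names and the dollar figures are assumptions chosen for illustration. In practice the probability would come from a model mined from data, while the two values would come from the business's costs and benefits.

```python
def expected_value(p_response, value_if_response, value_if_no_response):
    """Expected value of targeting one customer."""
    return (p_response * value_if_response
            + (1.0 - p_response) * value_if_no_response)


def break_even_probability(value_if_response, value_if_no_response):
    """Response probability at which the expected value of targeting is zero."""
    return -value_if_no_response / (value_if_response - value_if_no_response)


# Illustrative numbers only: a response nets 99 after the cost of the offer,
# and a non-response loses the 1 spent on making the offer.
ev = expected_value(0.05, 99.0, -1.0)
threshold = break_even_probability(99.0, -1.0)
print(ev, threshold)
```

Under these assumed values, targeting is worthwhile whenever the estimated response probability is above the break-even threshold, which is how the framework ties a business decision to a probability estimation task.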
For extracting knowledge from data, we saw that our fundamental concept of deter‐
mining the similarity of two objects described by data is used directly, for example to
find customers similar to our best customers. It is used for classification and for re‐
gression, via nearest-neighbor methods. It is the basis for clustering, the unsupervised
grouping of data objects. It is the basis for finding documents most related to a search
query. And it is the basis for more than one common method for making recommen‐
dations, for example by casting both customers and movies into the same “taste space,”
and then finding movies most similar to a particular customer.
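A minimal Python sketch of the direct use of similarity follows: it finds the customers closest to a chosen "best" customer. Euclidean distance is only one of many possible distance measures, and the two features (age and number of products held) and all customer records are hypothetical. Real features would usually need to be put on comparable scales first so that no single feature dominates the distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(target, candidates, k):
    """Return the names of the k candidates closest to the target."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: euclidean_distance(target, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical feature vectors: (age, number of products held).
best_customer = (35.0, 3.0)
others = {
    "C1": (37.0, 3.0),
    "C2": (60.0, 1.0),
    "C3": (33.0, 4.0),
    "C4": (22.0, 2.0),
}
similar = nearest_neighbors(best_customer, others, k=2)
print(similar)
```

The same distance calculation could drive nearest-neighbor classification or regression by combining the known target values of the neighbors it returns.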
When it comes to measurement, we see the notion of lift—determining how much more
likely a pattern is than would be expected by chance—appearing broadly across data
science, when evaluating very different sorts of patterns. One evaluates algorithms for
targeting advertisements by computing the lift one gets for the targeted population. One
calculates lift for judging the weight of evidence for or against a conclusion. One cal‐
culates lift to help judge whether a repeated co-occurrence is interesting, as opposed to
simply being a natural consequence of popularity.
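For the co-occurrence case, lift can be computed as the observed probability of two items appearing together divided by the probability expected if they were independent. The Python sketch below does this for a small, made-up set of shopping baskets. The basket contents and the `lift` function name are assumptions for illustration.

```python
def lift(transactions, item_a, item_b):
    """Lift of item_a and item_b co-occurring: p(A and B) / (p(A) * p(B))."""
    n = len(transactions)
    p_a = sum(item_a in t for t in transactions) / n
    p_b = sum(item_b in t for t in transactions) / n
    p_ab = sum(item_a in t and item_b in t for t in transactions) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both items must appear in at least one transaction")
    return p_ab / (p_a * p_b)


# Eight made-up baskets: beer and chips each appear in four,
# and appear together in three.
baskets = [
    {"beer", "chips"},
    {"beer", "chips"},
    {"beer", "chips"},
    {"beer"},
    {"chips"},
    {"milk"},
    {"milk"},
    {"milk"},
]
beer_chips_lift = lift(baskets, "beer", "chips")
print(beer_chips_lift)
```

A lift above 1 means the items occur together more often than their individual popularity alone would predict, and a lift near 1 suggests the co-occurrence is just a consequence of both being popular.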
Understanding the fundamental concepts also facilitates communication between busi‐
ness stakeholders and data scientists, not only because of the shared vocabulary, but
because both sides actually understand better. Instead of missing important aspects of
a discussion completely, we can dig in and ask questions that will reveal critical aspects
that otherwise would not have been uncovered.
For example, let’s say your venture firm is considering investing in a data science-based
company producing a personalized online news service. You ask how exactly they are
personalizing the news. They say they use support vector machines. Let’s even pretend
that we had not talked about support vector machines in this book. You should feel
confident enough in your knowledge of data science now that you should not simply
say “Oh, OK.” You should be able to confidently ask: “What’s that exactly?” If they really
do know what they are talking about, they should give you some explanation based upon
our fundamental principles (as we did in Chapter 4). You also are now prepared to ask,
“What exactly are the training data you intend to use?” Not only might that impress
data scientists on their team, but it actually is an important question to be asked to see
whether they are doing something credible, or just using “data science” as a smokescreen
to hide behind. You can go on to think about whether you really believe building any
predictive model from these data—regardless of what sort of model it is—would be
likely to solve the business problem they’re attacking. You should be ready to ask whether
you really think they will have reliable training labels for such a task. And so on.
Applying Our Fundamental Concepts to a New Problem: Mining
Mobile Device Data
As we’ve emphasized repeatedly, once we think about data science as a collection of
concepts, principles, and general methods, we will have much more success both un‐
derstanding data science activities broadly, and also applying data science to new busi‐
ness problems. Let’s consider a fresh example.
Recently (as of this writing), there has been a marked shift in consumer online activity
from traditional computers to a wide variety of mobile devices. Companies, many still
working to understand how to reach consumers on their desktop computers, now are
scrambling to understand how to reach consumers on their mobile devices: smart
phones, tablets, and even increasingly mobile laptop computers, as WiFi access becomes
ubiquitous. We won’t talk about most of the complexity of that problem, but from our
perspective, the data-analytic thinker might notice that mobile devices provide a new
sort of data from which little leverage has yet been obtained. In particular, mobile devices
are associated with data on their location.
For example, in the mobile advertising ecosystem, depending on my privacy settings,
my mobile device may broadcast my exact GPS location to those entities who would
like to target me with advertisements, daily deals, and other offers. Figure 14-1 shows
a scatterplot of a small sample of locations that a potential advertiser might see, sampled
from the mobile advertising ecosystem. Even if I do not broadcast my GPS location, my
device broadcasts the IP address of the network it currently is using, which often conveys
location information.
Figure 14-1. A scatterplot of a sample of GPS locations captured from mobile devices.
As an interesting side point, this is just a scatterplot of the latitudes and
longitudes broadcast by mobile devices; there is no map! It gives a
striking picture of population density across the world. And it makes
us wonder what’s going on with mobile devices in Antarctica.
How might we use such data? Let’s apply our fundamental concepts. If we want to get
beyond exploratory data analysis (as we’ve started with the visualization in
Figure 14-1), we need to think in terms of some concrete business problem. A particular
firm might have certain problems to solve, and be focused on one or two. An entrepre‐
neur or investor might scan across different possible problems she sees that businesses
or consumers currently have. Let’s pick one related to these data.
Advertisers face the problem that in this new world, we see a variety of different devices
and a particular consumer’s behavior may be fragmented across several. In the desktop
world, once the advertisers identify a good prospect, perhaps via a cookie in a particular
consumer’s browser or a device ID, they can then begin to take action accordingly; for
example, by presenting targeted ads. In the mobile ecosystem, this consumer’s activity
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 

Dernier (20)

YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 

Data Science for Business PDF Free Ebook Textbook

  • 2. Foster Provost and Tom Fawcett, Data Science for Business
  • 3. Data Science for Business by Foster Provost and Tom Fawcett. Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Mike Loukides and Meghan Blanchette. Production Editor: Christopher Hearse. Proofreader: Kiel Van Horn. Indexer: WordCo Indexing Services, Inc. Cover Designer: Mark Paglietti. Interior Designer: David Futato. Illustrator: Rebecca Demarest. July 2013: First Edition. Revision History for the First Edition: 2013-07-25: First release. See http://oreilly.com/catalog/errata.csp?isbn=9781449361327 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. ISBN: 978-1-449-36132-7 [LSI]
  • 5. Dream no small dreams for they have no power to move the hearts of men. (Johann Wolfgang von Goethe)
    CHAPTER 1
    Introduction: Data-Analytic Thinking
    The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors’ movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data: the realm of data science.
    The Ubiquity of Data Opportunities
    With vast amounts of data now available, companies in almost every industry are focused on exploiting data for competitive advantage. In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far outstripped the capacity of manual analysis. At the same time, computers have become far more powerful, networking has become ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses than previously possible. The convergence of these phenomena has given rise to the increasingly widespread business application of data science principles and data-mining techniques.
    Probably the widest applications of data-mining techniques are in marketing for tasks such as targeted marketing, online advertising, and recommendations for cross-selling.
  • 6. Data mining is used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value. The finance industry uses data mining for credit scoring and trading, and in operations via fraud detection and workforce management. Major retailers from Walmart to Amazon apply data mining throughout their businesses, from marketing to supply-chain management. Many firms have differentiated themselves strategically with data science, sometimes to the point of evolving into data mining companies.
    The primary goals of this book are to help you view business problems from a data perspective and understand principles of extracting useful knowledge from data. There is a fundamental structure to data-analytic thinking, and basic principles that should be understood. There are also particular areas where intuition, creativity, common sense, and domain knowledge must be brought to bear. A data perspective will provide you with structure and principles, and this will give you a framework to systematically analyze such problems. As you get better at data-analytic thinking you will develop intuition as to how and where to apply creativity and domain knowledge.
    Throughout the first two chapters of this book, we will discuss in detail various topics and techniques related to data science and data mining. The terms “data science” and “data mining” often are used interchangeably, and the former has taken a life of its own as various individuals and organizations try to capitalize on the current hype surrounding it. At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. As a term, “data science” often is applied more broadly than the traditional use of “data mining,” but data mining techniques provide some of the clearest illustrations of the principles of data science.
    It is important to understand data science even if you never intend to apply it yourself. Data-analytic thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a consultant, or a potential investment target proposes to improve a particular business application by extracting knowledge from data, you should be able to assess the proposal systematically and decide whether it is sound or flawed. This does not mean that you will be able to tell whether it will actually succeed (for data mining projects, that often requires trying), but you should be able to spot obvious flaws, unrealistic assumptions, and missing pieces.
    Throughout the book we will describe a number of fundamental data science principles, and will illustrate each with at least one data mining technique that embodies the principle. For each principle there are usually many specific techniques that embody it, so in this book we have chosen to emphasize the basic principles in preference to specific techniques. That said, we will not make a big deal about the difference between data
  • 7. science and data mining, except where it will have a substantial effect on understanding the actual concepts.
    Let’s examine two brief case studies of analyzing data to extract predictive patterns.
    Example: Hurricane Frances
    Consider an example from a New York Times story from 2004:
    Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct hit on Florida’s Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their newest data-driven weapons … predictive technology. A week ahead of the storm’s landfall, Linda M. Dillman, Wal-Mart’s chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes’ worth of shopper history that is stored in Wal-Mart’s data warehouse, she felt that the company could ‘start predicting what’s going to happen, instead of waiting for it to happen,’ as she put it. (Hays, 2004)
    Consider why data-driven prediction might be useful in this scenario. It might be useful to predict that people in the path of the hurricane would buy more bottled water. Maybe, but this point seems a bit obvious, and why would we need data science to discover it? It might be useful to project the amount of increase in sales due to the hurricane, to ensure that local Wal-Marts are properly stocked. Perhaps mining the data could reveal that a particular DVD sold out in the hurricane’s path, but maybe it sold out that week at Wal-Marts across the country, not just where the hurricane landing was imminent. The prediction could be somewhat useful, but is probably more general than Ms. Dillman was intending.
    It would be more valuable to discover patterns due to the hurricane that were not obvious. To do this, analysts might examine the huge volume of Wal-Mart data from prior, similar situations (such as Hurricane Charley) to identify unusual local demand for products. From such patterns, the company might be able to anticipate unusual demand for products and rush stock to the stores ahead of the hurricane’s landfall.
    Indeed, that is what happened. The New York Times (Hays, 2004) reported that: “… the experts mined the data and found that the stores would indeed need certain products, and not just the usual flashlights. ‘We didn’t know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,’ Ms. Dillman said in a recent interview. ‘And the pre-hurricane top-selling item was beer.’” [1]
    [1] Of course! What goes better with strawberry Pop-Tarts than a nice cold beer?
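The idea of "unusual local demand" in the hurricane example can be made concrete with a small sketch. The one below is only an illustration under stated assumptions: the sales figures, the product names, the `demand_ratios` and `unusual_local_demand` helpers, and the threshold of 2.0 are all hypothetical and do not come from the book or from Wal-Mart. It compares how much a product's sales rate rose in stores in the storm's path with how much it rose elsewhere, so that a nationwide spike (like the DVD in the text) is not mistaken for a hurricane effect.

```python
from collections import defaultdict

# Hypothetical records: (product, store_in_storm_path, period, units_per_day).
# "baseline" is a normal week; "pre_storm" is the week before landfall.
SALES = [
    ("pop_tarts", True, "baseline", 10.0),
    ("pop_tarts", True, "pre_storm", 70.0),
    ("pop_tarts", False, "baseline", 10.0),
    ("pop_tarts", False, "pre_storm", 11.0),
    ("new_dvd", True, "baseline", 5.0),
    ("new_dvd", True, "pre_storm", 40.0),
    ("new_dvd", False, "baseline", 5.0),
    ("new_dvd", False, "pre_storm", 40.0),
]


def demand_ratios(sales):
    """Return {product: (local_ratio, elsewhere_ratio, relative_ratio)}."""
    rates = defaultdict(dict)
    for product, in_path, period, units in sales:
        rates[product][(in_path, period)] = units
    result = {}
    for product, r in rates.items():
        # Increase over the normal rate for stores in the storm's path.
        local = r[(True, "pre_storm")] / r[(True, "baseline")]
        # Increase over the normal rate for stores everywhere else.
        elsewhere = r[(False, "pre_storm")] / r[(False, "baseline")]
        result[product] = (local, elsewhere, local / elsewhere)
    return result


def unusual_local_demand(sales, threshold=2.0):
    """Products whose local increase clearly exceeds the increase elsewhere.

    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    ratios = demand_ratios(sales)
    return sorted(p for p, (_, _, rel) in ratios.items() if rel >= threshold)


ratios = demand_ratios(SALES)
flagged = unusual_local_demand(SALES)
```

With these made-up numbers, the Pop-Tarts sell at seven times their normal rate only near the storm and are flagged, while the DVD rises equally everywhere and is not.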
  • 8. Example: Predicting Customer Churn
    How are such data analyses performed? Consider a second, more typical business scenario and how it might be treated from a data perspective. This problem will serve as a running example that will illuminate many of the issues raised in this book and provide a common frame of reference.
    Assume you just landed a great analytical job with MegaTelCo, one of the largest telecommunication firms in the United States. They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. Communications companies are now engaged in battles to attract each other’s customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around: one company must spend on incentives to attract a customer while another company loses revenue when the customer departs.
    You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo’s vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts.
    Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget?
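As a deliberately naive first pass at the question just posed, one could rank customers by a rough estimate of what the offer is expected to gain and make offers until the budget runs out. The sketch below assumes things the text does not specify: that each customer already has an estimated churn probability and an estimated value, that every offer costs the same, and that a fixed fraction `p_accept` of would-be churners is retained by the offer. The `Customer` class, the `choose_targets` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    p_churn: float  # assumed estimate of the chance of leaving at contract expiry
    value: float    # assumed revenue kept if the customer stays


def choose_targets(customers, budget, offer_cost, p_accept=0.5):
    """Pick customers for the retention offer, greedily by expected gain.

    Expected gain is modeled (as an assumption) as
    p_churn * p_accept * value - offer_cost, with the cost paid per offer sent.
    """

    def expected_gain(c):
        return c.p_churn * p_accept * c.value - offer_cost

    ranked = sorted(customers, key=expected_gain, reverse=True)
    chosen = []
    spent = 0.0
    for c in ranked:
        if expected_gain(c) <= 0:
            break  # remaining customers are not worth the cost of the offer
        if spent + offer_cost > budget:
            break  # budget exhausted
        chosen.append(c.customer_id)
        spent += offer_cost
    return chosen


customers = [
    Customer("A", p_churn=0.8, value=1000.0),
    Customer("B", p_churn=0.05, value=1000.0),
    Customer("C", p_churn=0.5, value=300.0),
    Customer("D", p_churn=0.6, value=900.0),
]
small_budget = choose_targets(customers, budget=100.0, offer_cost=50.0)
large_budget = choose_targets(customers, budget=1000.0, offer_cost=50.0)
```

Note that this ranks by expected gain rather than by churn probability alone, so a likely churner of low value can rank below a less likely churner of high value. It ignores many real complications, such as whether the offer changes behavior at all.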
    Answering this question is much more complicated than it may seem initially. We will return to this problem repeatedly through the book, adding sophistication to our solution as we develop an understanding of the fundamental data science concepts.
    In reality, customer retention has been a major use of data mining technologies, especially in telecommunications and finance businesses. These more generally were some of the earliest and widest adopters of data mining technologies, for reasons discussed later.
    Data Science, Engineering, and Data-Driven Decision Making
    Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data. In this book, we will view the ultimate goal
  • 9. of data science as improving decision making, as this generally is of direct interest to business.
    [Figure 1-1. Data science in the context of various data-related processes in the organization.]
    Figure 1-1 places data science in the context of various other closely related and data-related processes in the organization. It distinguishes data science from other aspects of data processing that are gaining increasing attention in business. Let’s start at the top.
    Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition. For example, a marketer could select advertisements based purely on her long experience in the field and her eye for what will work. Or, she could base her selection on the analysis of data regarding how consumers react to different ads. She could also use a combination of these approaches. DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or lesser degrees.
    The benefits of data-driven decision-making have been demonstrated conclusively. Economist Erik Brynjolfsson and his colleagues from MIT and Penn’s Wharton School conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim, 2011). They developed a measure of DDD that rates firms as to how strongly they use
  • 10. If you can’t explain it simply, you don’t understand it well enough. (Albert Einstein)
    CHAPTER 14
    Conclusion
    The practice of data science can best be described as a combination of analytical engineering and exploration. The business presents a problem we would like to solve. Rarely is the business problem directly one of our basic data mining tasks. We decompose the problem into subtasks that we think we can solve, usually starting with existing tools. For some of these tasks we may not know how well we can solve them, so we have to mine the data and conduct evaluation to see. If that does not succeed, we may need to try something completely different. In the process we may discover knowledge that will help us to solve the problem we had set out to solve, or we may discover something unexpected that leads us to other important successes.
    Neither the analytical engineering nor the exploration should be omitted when considering the application of data science methods to solve a business problem. Omitting the engineering aspect usually makes it much less likely that the results of mining data will actually solve the business problem. Omitting the understanding of process as one of exploration and discovery often keeps an organization from putting the right management, incentives, and investments in place for the project to succeed.
    The Fundamental Concepts of Data Science
    Both the analytical engineering and the exploration and discovery are made more systematic and thereby more likely to succeed by the understanding and embracing of the fundamental concepts of data science. In this book we have introduced a collection of the most important fundamental concepts. Some of these concepts we made into headliners for the chapters, and others were introduced more naturally through the discussions
  • 11. (and not necessarily labeled as fundamental concepts). These concepts span the process from envisioning how data science can improve business decisions, to applying data science techniques, to deploying the results to improve decision-making. The concepts also undergird a large array of business analytics.
    We can group our fundamental concepts roughly into three types:
    1. General concepts about how data science fits in the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams, ways for thinking about how data science leads to competitive advantage, ways that competitive advantage can be sustained, and tactical principles for doing well with data science projects.
    2. General ways of thinking data-analytically, which help us to gather appropriate data and consider appropriate methods. The concepts include the data mining process, the collection of different high-level data science tasks, as well as principles such as the following.
       • Data should be considered an asset, and therefore we should think carefully about what investments we should make to get the best leverage from our asset
       • The expected value framework can help us to structure business problems so we can see the component data mining problems as well as the connective tissue of costs, benefits, and constraints imposed by the business environment
       • Generalization and overfitting: if we look too hard at the data, we will find patterns; we want patterns that generalize to data we have not yet seen
       • Applying data science to a well-structured problem versus exploratory data mining require different levels of effort in different stages of the data mining process
    3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science techniques. These include concepts such as the following.
       • Identifying informative attributes: those that correlate with or give us information about an unknown quantity of interest
       • Fitting a numeric function model to data by choosing an objective and finding a set of parameters based on that objective
       • Controlling complexity is necessary to find a good trade-off between generalization and overfitting
       • Calculating similarity between objects described by data
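To make the last concept in the list above concrete, here is a minimal sketch of calculating similarity between objects described by data, using Euclidean distance as one common choice among many. The feature vectors, the customer names, and the `most_similar` helper are hypothetical, and the sketch assumes the features are already on comparable scales.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=1):
    """Names of the k candidates closest to the target (smaller distance = more similar)."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: euclidean_distance(target, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical customers described by (age, monthly spend).
best_customer = (40.0, 120.0)
others = {
    "c1": (41.0, 118.0),
    "c2": (25.0, 30.0),
    "c3": (60.0, 200.0),
}
nearest_two = most_similar(best_customer, others, k=2)
```

The same distance function could serve nearest-neighbor classification, clustering, or retrieval; only what is done with the ranked neighbors changes.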
  • 12. Once we think about data science in terms of its fundamental concepts, we see the same concepts underlying many different data science strategies, tasks, algorithms, and processes. As we have illustrated throughout the book, these principles not only allow us to understand the theory and practice of data science much more deeply, they also allow us to understand the methods and techniques of data science very broadly, because these methods and techniques are quite often simply particular instantiations of one or more of the fundamental principles.
    At a high level we saw how structuring business problems using the expected value framework allows us to decompose problems into data science tasks that we understand better how to solve, and this applies across many different sorts of business problems.
    For extracting knowledge from data, we saw that our fundamental concept of determining the similarity of two objects described by data is used directly, for example to find customers similar to our best customers. It is used for classification and for regression, via nearest-neighbor methods. It is the basis for clustering, the unsupervised grouping of data objects. It is the basis for finding documents most related to a search query. And it is the basis for more than one common method for making recommendations, for example by casting both customers and movies into the same “taste space,” and then finding movies most similar to a particular customer.
    When it comes to measurement, we see the notion of lift (determining how much more likely a pattern is than would be expected by chance) appearing broadly across data science, when evaluating very different sorts of patterns. One evaluates algorithms for targeting advertisements by computing the lift one gets for the targeted population. One calculates lift for judging the weight of evidence for or against a conclusion.
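For the advertising case just mentioned, lift can be sketched as the response rate in the targeted population divided by the response rate in the population as a whole. The function name `lift` and the campaign counts below are hypothetical and serve only to illustrate the ratio.

```python
def lift(targeted_responses, targeted_size, total_responses, total_size):
    """Response rate among targeted people divided by the overall response rate.

    A value above 1 means the targeted group responds more often than
    would be expected from the overall rate.
    """
    if targeted_size <= 0 or total_size <= 0:
        raise ValueError("population sizes must be positive")
    if total_responses <= 0:
        raise ValueError("overall response rate must be non-zero")
    targeted_rate = targeted_responses / targeted_size
    base_rate = total_responses / total_size
    return targeted_rate / base_rate


# Hypothetical campaign: 50 of 1,000 targeted people respond (5%),
# against 1,000 of 100,000 people overall (1%).
campaign_lift = lift(50, 1_000, 1_000, 100_000)
```

Here the targeted group responds about five times as often as the population as a whole, so the lift is roughly 5.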
    One calculates lift to help judge whether a repeated co-occurrence is interesting, as opposed to simply being a natural consequence of popularity.
    Understanding the fundamental concepts also facilitates communication between business stakeholders and data scientists, not only because of the shared vocabulary, but because both sides actually understand better. Instead of missing important aspects of a discussion completely, we can dig in and ask questions that will reveal critical aspects that otherwise would not have been uncovered.
    For example, let’s say your venture firm is considering investing in a data science-based company producing a personalized online news service. You ask how exactly they are personalizing the news. They say they use support vector machines. Let’s even pretend that we had not talked about support vector machines in this book. You should feel confident enough in your knowledge of data science now that you should not simply say “Oh, OK.” You should be able to confidently ask: “What’s that exactly?” If they really do know what they are talking about, they should give you some explanation based upon our fundamental principles (as we did in Chapter 4). You also are now prepared to ask, “What exactly are the training data you intend to use?” Not only might that impress data scientists on their team, but it actually is an important question to be asked to see
  • 13. whether they are doing something credible, or just using “data science” as a smokescreen to hide behind. You can go on to think about whether you really believe building any predictive model from these data (regardless of what sort of model it is) would be likely to solve the business problem they’re attacking. You should be ready to ask whether you really think they will have reliable training labels for such a task. And so on.
    Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
    As we’ve emphasized repeatedly, once we think about data science as a collection of concepts, principles, and general methods, we will have much more success both understanding data science activities broadly, and also applying data science to new business problems. Let’s consider a fresh example.
    Recently (as of this writing), there has been a marked shift in consumer online activity from traditional computers to a wide variety of mobile devices. Companies, many still working to understand how to reach consumers on their desktop computers, now are scrambling to understand how to reach consumers on their mobile devices: smart phones, tablets, and even increasingly mobile laptop computers, as WiFi access becomes ubiquitous. We won’t talk about most of the complexity of that problem, but from our perspective, the data-analytic thinker might notice that mobile devices provide a new sort of data from which little leverage has yet been obtained. In particular, mobile devices are associated with data on their location.
    For example, in the mobile advertising ecosystem, depending on my privacy settings, my mobile device may broadcast my exact GPS location to those entities who would like to target me with advertisements, daily deals, and other offers. Figure 14-1 shows a scatterplot of a small sample of locations that a potential advertiser might see, sampled from the mobile advertising ecosystem. Even if I do not broadcast my GPS location, my device broadcasts the IP address of the network it currently is using, which often conveys location information.
  • 14. [Figure 14-1. A scatterplot of a sample of GPS locations captured from mobile devices.]
    As an interesting side point, this is just a scatterplot of the latitudes and longitudes broadcast by mobile devices; there is no map! It gives a striking picture of population density across the world. And it makes us wonder what’s going on with mobile devices in Antarctica.
    How might we use such data? Let’s apply our fundamental concepts. If we want to get beyond exploratory data analysis (as we’ve started with the visualization in Figure 14-1), we need to think in terms of some concrete business problem. A particular firm might have certain problems to solve, and be focused on one or two. An entrepreneur or investor might scan across different possible problems she sees that businesses or consumers currently have. Let’s pick one related to these data.
    Advertisers face the problem that in this new world, we see a variety of different devices and a particular consumer’s behavior may be fragmented across several. In the desktop world, once the advertisers identify a good prospect, perhaps via a cookie in a particular consumer’s browser or a device ID, they can then begin to take action accordingly; for example, by presenting targeted ads. In the mobile ecosystem, this consumer’s activity