1. Statistical Engineering and BIG DATA
"Big Data" – A Challenge for Statistical Leadership
Chicago Chapter ASA
SAY Award Luncheon
Roger W. Hoerl
Union College
Schenectady, NY
With significant input from Ron Snee
2. Abstract
The Wall Street Journal, New York Times and other respected publications have had major features
recently on Big Data - the massive data sets which are becoming commonplace, and on the new,
"sexy" data mining methods developed to analyze them. These articles, as well as much of the
professional data mining and Big Data literature, may give casual users the impression that if one
has a powerful enough algorithm and a lot of data, good models and good results are guaranteed at
the push of a button. Obviously, this is not the case. The leadership challenge to the statistical
profession is to ensure that Big Data projects are built upon a sound foundation of good modeling,
and not upon the sandy foundation of hype and unstated assumptions. Further, we need to
accomplish this without giving the impression that we are "against" Big Data or newer methods. I feel
that the principles of statistical engineering (see Anderson-Cook and Lu 2012) can provide a path to
do just this. Three statistical engineering principles that are often overlooked or underemphasized by
Big Data enthusiasts are the importance of data quality - knowing the "pedigree" of the data; the
need to view statistical studies as part of the sequential process of scientific discovery - versus the
"one-shot study" so common in textbooks; and the criticality of using subject-matter knowledge when
developing models. I will present examples of the severe problems that can arise in Big Data studies
when these principles are not understood or ignored. In summary, I argue that the development of
Big Data analytics provides significant opportunities to the profession, but at the same time requires
a more proactive role from us, if we are to provide true leadership in the Big Data phenomenon.
3. Outline
Statistical Leadership (Advocacy)
The "Big Data" Phenomenon
What Could Possibly Go Wrong?
Statistical Engineering, and How It Can Help
Leading the Way – Doing Big Data the Right Way
Summary
4. Statistical Leadership
Leadership: taking people from one paradigm to another.
Enabling people to think statistically, and apply statistical methods, requires leadership.
Opinion: too many statisticians are satisfied being experts in the tools themselves,
without worrying much about the overall impact our profession is having on society.
Can’t see the forest for the trees.
As a result, society too often compartmentalizes statisticians as narrow specialists, and
does not view us as thought leaders; they look elsewhere for leadership.
Passive consultants versus proactive leaders.
As a case in point, most professionals view the "Big Data" phenomenon as being led by
computer scientists, engineers, or data scientists (whatever that means), rather than by
statisticians.
Ron Snee, Gerry Hahn, and other leaders have been noting for years that statisticians
need to be more proactive, and guide society as to what needs to be done.
We shouldn’t be satisfied being the "tools guys".
“Everything Rises and Falls on Leadership.” John Maxwell
5. Data Mining and Big Data
The technology for acquiring, storing, and processing data has been advancing exponentially ("Big Data"), providing new opportunities to "mine" the data.
According to IBM, there are now 1.6 zettabytes (10²¹ bytes) of digital data available.
To use 1.6 zettabytes of bandwidth, you would need to watch HD TV for 47,000 years.
"I keep saying that the sexy job in the next 10 years will be statisticians," Hal Varian, chief economist at Google. "And I’m not kidding."
March 2012: The White House announced a national "Big Data Initiative" that
consisted of six Federal departments and agencies committing more than $200
million to Big Data research projects.
As noted by Ron Snee, data mining has been around for decades:
1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson
to analyze refinery data
1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan and others
at Bell Labs to gain insight from large data sets
1970s: DuPont uses data compression algorithms for on-line process monitoring
Big Data and Data Mining are Growing Rapidly, but Are Not New.
6. What’s New?
Sheer size of data – often requires compression, parallel processing, and sampling to store and analyze.
Some traditional methods are no longer relevant, e.g., hypothesis testing: with billions of observations, virtually every effect tests as "statistically significant".
Insight from graphical methods must be rethought – it is difficult to find outliers in zettabytes of data.
The large sample sizes, coupled with faster computing, enable much more complex models than were feasible with data sets of, say, 30 observations.
Due to the above, newer techniques have become popular (see the sketch below):
CART and other tree-based methods; recursive splits on the data.
Neural networks; non-linear models involving combinations of variables – very flexible.
Methods based on bootstrapping – resampling and combining models; random forests,
"bagging", etc.
Clustering and classification methods designed for massive data sets; K-means
clustering, support vector machines, etc.
Good News: We Have More Data and Powerful Analysis Methods.
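To make the list above concrete, here is a minimal sketch – our illustration, not from the deck – of two of the newer methods applied to synthetic data with scikit-learn: a random forest (bagging of CART-style trees built from recursive splits) and k-means clustering. All data, names, and parameters are illustrative.

```python
# A minimal sketch (our addition) of two of the newer methods named above,
# run on synthetic data; all parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification, make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Supervised: a random forest ("bagging" of many trees, each built from
# CART-style recursive splits on the data).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Holdout accuracy:", forest.score(X_test, y_test))

# Unsupervised: k-means clustering, designed to scale to large data sets.
Xc, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xc)
print("Cluster sizes:", np.bincount(labels))
```

Note how little statistical thinking the snippet demands of its user; this is exactly the "push of a button" convenience the abstract warns about.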
8. What Could Possibly Go Wrong?
Duke Genomics Center published several groundbreaking articles
conclusively identifying cancer biomarkers in the 2005-2010 timeframe.
Unfortunately, clinical trials based on this research did not pan out.
Women died unexpectedly.
Two statisticians, Keith Baggerly and Kevin Coombes, dug into the
research.
New York Times, July 8, 2011:
Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed
careless – moving a row or column over by one in a giant spreadsheet – while others
seemed inexplicable. The Duke team shrugged them off as "clerical errors"... In the end,
four gene signature papers were retracted. Duke shut down three trials using the results.
(Lead investigator) Dr. Potti resigned from Duke...His collaborator and mentor, Dr.
Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling.
Large Amounts of Data Plus Sophisticated Algorithms Do Not Guarantee Success.
9. What Could Possibly Go Wrong?
Financial giant Lehman Brothers declared bankruptcy on September 15th,
2008.
This was the largest bankruptcy filing in US history, with Lehman Brothers
holding roughly $600 billion in assets.
The Dow Jones Industrial Average dropped over 500 points that day, several other financial institutions followed Lehman Brothers into bankruptcy... and the rest is history.
A few years earlier, I had visited Lehman Brothers headquarters in NY with
representatives of GE Capital:
Lehman was selling models to predict corporate defaults.
Their models were quite sophisticated, and based on large amounts of historical
financial data.
Virtually all financial institutions impacted by the crisis had models.
“Historical Results Do Not Guarantee Future Performance.”
10. What Could Possibly Go Wrong?
On April 18th, 2011, the book "The Making of a Fly" goes on sale on Amazon.com.
An automated pricing algorithm places a price of $1,730,045 on the book.
Later in the day, the Amazon price goes up to $23,698,656.
Plus $3.55 for shipping and handling.
No one buys the book that day.
Days later, the Amazon price was $106.
People started to buy the book.
“We are Writing Things That No One Can Read.” Kevin Slavin (2011 TED Conference)
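The widely reported explanation, from biologist Michael Eisen’s analysis of the incident, is that two third-party sellers’ algorithms were repricing against each other: one setting its price at roughly 0.9983 times the competitor’s, the other at roughly 1.2706 times. A minimal simulation of that feedback loop, using those reported ratios and illustrative starting prices:

```python
# A sketch (our addition) of the repricing feedback loop said to have
# produced the $23.7M price. The ratios 0.9983 and 1.270589 come from
# Michael Eisen's analysis; the starting prices are assumptions.
seller_a, seller_b = 35.00, 40.00
for day in range(1, 61):
    seller_a = 0.998300 * seller_b   # undercut the competitor slightly
    seller_b = 1.270589 * seller_a   # price well above the competitor
    if seller_b > 23_698_656:        # the price quoted on the slide
        print(f"Day {day}: seller B's price reaches ${seller_b:,.2f}")
        break
# Each cycle multiplies the price by 0.9983 * 1.270589 ~= 1.268, so the
# price grows exponentially until a human notices.
```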
11. What Could Possibly Go Wrong?
Our quandary:
All other things being equal, "Big Data" is better than "little data".
The newer data mining tools are powerful and work quite
well in numerous cases.
Yet, modeling disasters continue to occur; why?
Clearly, we are missing something in the equation.
Could It Be That the Fundamentals Are Still Important?
12. Can Statistical Engineering Principles Help?
Some Background, and a Definition
13. Interesting Course Taught at Harvard
Stat 399: Problem Solving in Statistics
“…emphasizes deep, broad, and creative statistical
thinking instead of technical problems that correspond
to a recognizable textbook chapter.”*
*Xiao-Li Meng, American Statistician, August 2009
Do the Important Problems We Face “Correspond to a
Recognizable Textbook Chapter?”
14. Susan Hockfield – MIT President
Around the dawn of the 20th century, physicists discovered the basic building blocks of the universe; a "parts list", if you will. Engineers said "we can build something from this list," and produced the electronics revolution, and subsequently the computer revolution.
More recently, biologists have discovered and mapped the basic "parts list" of life – the human genome. Engineers have said "we can build something from this list," and are producing a revolution in personalized medicine.*
Who is Building Something Meaningful From the Statistical Science Parts List of Tools?
*Loosely quoted from a January 2010 seminar at GE Global Research
15. Statistical Engineering Definition
Statistical engineering:
The study of how to best utilize statistical concepts, methods, and tools
and integrate them with information technology and other relevant
sciences to generate improved results (Hoerl and Snee 2010a).
In other words, trying to build something meaningful from the statistical
science tools list.
Enables us to attack the large, complex, unstructured problems “that do
not correspond to a recognizable textbook chapter.”
Notes
This definition differs from that used by Eisenhart, who we believe was the first to use the term, in 1950.
Good statisticians have always done this, but little practical guidance has
been documented in the literature.
This Definition is Consistent with Dictionary Definitions of Engineering.
16. Typical Phases of Statistical Engineering Projects
1. Identify problems: find the high-impact issues inhibiting
achievement of the organization’s strategic goals.
2. Create structure: carefully define the problem, objectives,
constraints, metrics for success, and so on.
3. Understand the context: identify important stakeholders (e.g.,
customers, organizations, individuals, management), research the
history of the issue, identify unstated complications and cultural
issues, locate relevant data sources.
4. Develop a strategy: create an overall, high-level approach to attacking the problem, based on phases 2 and 3.
5. Establish tactics: develop and implement diverse initiatives or
projects that collectively will accomplish the strategy.
There Are No “Seven Easy Steps” to Statistical Engineering Projects.
17. Statistical Engineering – Critical Considerations for BIG DATA
Data Quality (see the audit sketch below)
Free of omissions, errors, missing values, etc.
Missing variables
High measurement variation
Biases – human, equipment, etc.
Subject Matter Knowledge – Used in Many Different Ways
Variable selection and appropriate scales (e.g., log, inverse, square, ...)
Selection of model form: linear, curvilinear, multiplicative
Interpretation of results
Ability to extrapolate findings
Use of Sequential Approaches
Big problems are not solved with one analysis or even one data set.
Strategy must move beyond the "one-shot study" mindset.
Three Macro Issues That Seem to Be Overlooked in the Big Data Literature.
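As a concrete illustration of the data quality checks listed above, here is a minimal audit sketch in Python with pandas. The file name, column names, and thresholds are hypothetical, not from the deck:

```python
# A minimal data-quality ("pedigree") audit sketch -- our illustration.
# The file name, column names, and thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("process_data.csv")  # hypothetical data source

# Omissions and missing values
print("Rows:", len(df))
print("Missing values per column:\n", df.isna().sum())

# Duplicate records (possible merge or data-entry errors)
print("Duplicate rows:", df.duplicated().sum())

# Out-of-range values, assuming a known physical range for 'temperature'
bad = df[(df["temperature"] < 0) | (df["temperature"] > 150)]
print("Out-of-range temperature readings:", len(bad))

# High measurement variation: spread of repeat measurements per unit
rep_sd = df.groupby("unit_id")["temperature"].std()
print("Units with repeat-measurement SD > 5:", (rep_sd > 5).sum())
```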
18. Understanding the “Data Pedigree”
Trust but Verify - Data pedigree must be assessed when
analyzing Big Data. Data quality is an issue with all sources of
data.
Careful thought must be given to the model form needed to
answer the question, and whether the current data is sufficient
for that purpose.
Multiple sources of data require careful thought as to data pedigree, and as to how to fit the databases together to produce useful results (see the sketch below).
Different data sources are typically associated with political
issues, different agendas, different objectives, etc.
Good Principle: Data Are Guilty Until Proven Innocent.
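And a small sketch, again our addition, of fitting databases together without assuming they line up: pandas can validate the join keys and flag records that appear in only one source. The file and column names are hypothetical:

```python
# A sketch (our addition) of merging two data sources while checking
# pedigree: validate the join keys instead of assuming they match.
import pandas as pd

lab = pd.read_csv("lab_results.csv")      # hypothetical source 1
field = pd.read_csv("field_records.csv")  # hypothetical source 2

# validate="one_to_one" raises an error if sample_id is duplicated in
# either source; indicator=True flags records missing from one side.
merged = lab.merge(field, on="sample_id", how="outer",
                   validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())  # matched vs. one-sided records
```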
19. The Advantages of a Sequential Approach
Much of our professional literature, and virtually all of our textbooks, assume that statistical problems are, by their nature, "one-shot studies":
We are handed a fixed data set, and must develop the "best" model to fit the data.
Articles are frequently published challenging previously published analyses, and proposing a better model for the same data.
This is clearly the tone of many high-profile data analysis competitions, beginning with the Netflix Prize, and continuing today with Kaggle.com.
Are Most Statistical Problems One-Shot Studies?
20. The Advantages of a Sequential Approach
In 30 years working as a statistician in the private sector, I almost
always needed a sequential approach, involving more than one
statistical tool, to solve the important problems I faced.
If one is in the midst of a sequential process, he or she approaches data analysis from a very different viewpoint than in one-shot studies.
A key goal in the process is to direct the next round of data gathering and analysis, as opposed to finding the "optimal" model (see the sketch below).
Sequential approaches, as proposed by Box, Hunter, and Hunter (2005), also offer the opportunity for using hindsight to our advantage:
"The best time to design an experiment is after examining the results."
Are Netflix and Kaggle.com Missing Something?
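To illustrate the sequential mindset, here is a minimal sketch – our illustration, assuming a hypothetical process with mild curvature – in which each round fits a model, runs a residual diagnostic, and lets what is learned direct the next round of data collection, rather than polishing an "optimal" model on round one’s data:

```python
# A sketch (our addition) of a sequential modeling loop: fit, diagnose,
# then let the diagnosis direct the next round of data gathering.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def collect_data(n, x_max):
    """Hypothetical stand-in for a new round of data collection."""
    x = rng.uniform(0, x_max, size=(n, 1))
    y = 2.0 * x[:, 0] + 0.05 * x[:, 0] ** 2 + rng.normal(0, 1, n)  # true curvature
    return x, y

x, y = collect_data(100, x_max=10)  # round 1: initial operating range
for round_ in range(2, 5):
    model = LinearRegression().fit(x, y)
    resid = y - model.predict(x)
    # Diagnosis: residuals trending with x^2 suggest unmodeled curvature
    # and point to where the next round of data should be collected.
    corr = np.corrcoef(x[:, 0] ** 2, resid)[0, 1]
    print(f"Round {round_ - 1}: residual vs. x^2 correlation = {corr:.2f}")
    x_new, y_new = collect_data(100, x_max=10 * round_)  # extend the range
    x, y = np.vstack([x, x_new]), np.concatenate([y, y_new])
```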
21. The Importance of Subject Matter Knowledge
"Data have no meaning in themselves; they are meaningful only in relation to a conceptual model of the phenomenon being studied." Box, Hunter, and Hunter.
Implied message of the data mining, machine learning, and Big Data literature: "Data have complete meaning in themselves; no theory is required."
For example, only subject matter theory, NOT statistics, allows us to extrapolate the results of a study, say a clinical trial, to a broader population (see the sketch below).
Subject matter theory guides the statistical process, including data collection, analysis, and interpretation.
This is a "scientific method" approach to statistics, as opposed to a "test" approach to statistics.
Such an approach allows statistics and statisticians an active role in developing new
theories, as opposed to simply providing yes/no answers to existing theories
(proactive leadership vs. passive consulting paradigm).
New subject matter insights lead naturally to new questions, and new data,
directly linking this principle to the sequential approach principle.
Data and Understanding Are Not Synonyms
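A small sketch, our addition with a made-up exponential-growth mechanism, of what "subject matter knowledge selects the scale and model form" means in practice: if theory says the response grows multiplicatively, fitting on the log scale gives a simple model that extrapolates, where a flexible black box cannot:

```python
# A sketch (our addition, hypothetical mechanism): letting subject-matter
# theory choose the model form and scale versus a flexible black box.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=(200, 1))
y = 3.0 * np.exp(0.8 * x[:, 0]) * rng.lognormal(0, 0.1, 200)  # multiplicative error

# Theory-guided: the log transform linearizes the assumed mechanism.
theory = LinearRegression().fit(x, np.log(y))

# Black box: flexible, but knows nothing outside the observed range.
box = RandomForestRegressor(n_estimators=200, random_state=1).fit(x, y)

x_new = np.array([[8.0]])  # extrapolation beyond the data (x was in [0, 5])
print("Theory-guided prediction:", float(np.exp(theory.predict(x_new)[0])))
print("Black-box prediction:   ", float(box.predict(x_new)[0]))
# The forest's prediction plateaus at the edge of the training range;
# only the subject-matter model extrapolates sensibly.
```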
22. Integration of Subject Matter Knowledge
[Figure: an iterative cycle in which subject matter theory and data inform each other; with each pass through the loop, knowledge of the business process and its customers increases.]
From Hoerl & Snee, Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley, 2012
23. Putting It All Together – Providing Leadership to Ensure We Do Big Data the Right Way
24. Statistical Engineering Approach to Big Data
Leadership is needed to avoid the "Big Data + powerful algorithms = success" fallacy; if we don’t lead the way, it probably won’t happen.
The fundamentals still apply – in fact, they are even more critical.
The phases of Statistical Engineering provide a framework with which to attack Big Data projects more scientifically:
1. Identify problems: find the high-impact Big Data problems – don’t wait for them to
come to you
2. Create structure: carefully define the real (versus stated) problem, objectives,
constraints, metrics for success, and so on.
3. Understand the context: obtain as much subject-matter knowledge as possible,
research the history of the issue, locate relevant data sources, and so on.
4. Develop a strategy: create an overall, high-level approach to attacking the problem, based on phases 2 and 3; incorporate a sequential approach – applying what we learn in the initial analysis.
5. Establish tactics: develop and implement individual steps in the strategy – stay
flexible, but start with a defined plan.
Big Data Constitutes One of the Best Leadership Opportunities in Our Profession’s History.
25. Summary
The glass is half-full: Big Data and associated tools offer a unique opportunity to
solve important problems that were previously intractable.
Fundamentals of good science, analytical modeling and interpretation still apply.
Ignoring these fundamentals increases the probability that invalid
conclusions are reached and inappropriate actions taken.
Statistical Engineering provides a useful approach for using Big Data to solve
important problems.
A five-phase framework is suggested to guide the work associated with Big
Data problems that are typically large, complex and unstructured.
Probability of success is significantly increased when the following aspects of
Statistical Engineering are incorporated in the approach:
Understanding of data pedigree
Utilization of sequential approaches
Integration of subject matter knowledge
Statistical Engineering Can Help Big Data Projects Be Successful
26. References
Davenport, T. H. and J. G. Harris (2007), Competing on Analytics, Harvard Business School Press, Boston, MA.
De Veaux, R. D. and D. J. Hand (2005), "How to Lie with Bad Data", Statistical Science, Vol. 20, No. 3, pp. 231-238.
Hoerl, R. W. and R. D. Snee (2012), Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.
Pierrard, J. M. (1974), "Relating Automotive Emissions and Urban Air Quality", DuPont Innovation, Vol. 5, No. 2, pp. 6-9.
Pierrard, J. M., R. D. Snee and J. Zelson (1973), "A New Approach to Setting Vehicle Emission Standards", presented at the Air Pollution Control Association Annual Meeting, June 24-28, 1973.
Pierrard, J. M., R. D. Snee and J. Zelson (1974), "A New Approach to Setting Vehicle Emission Standards", Air Pollution Control Association Journal, Vol. 24, No. 9, pp. 841-848.
Snee, R. D. and R. W. Hoerl (2003), Leading Six Sigma – A Step-by-Step Guide Based on Experience with General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.
Snee, R. D. and R. W. Hoerl (2012), "Inquiry on Pedigree – Do You Know the Quality and Origin of Your Data?", Quality Progress, December 2012, pp. 66-68.
Snee, R. D. and J. M. Pierrard (1977), "The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality", Air Pollution Control Association Journal, Vol. 27, No. 2, pp. 131-133.
27. Articles on Statistical Engineering by Hoerl and Snee
Hoerl, R. W. and R. D. Snee (2009), "Post Financial Meltdown: What Do Services Industries Need From Us Now?", Applied Stochastic Models in Business and Industry, December 2009, pp. 509-521.
Hoerl, R. W. and R. D. Snee (2010), "Moving the Statistics Profession Forward to the Next Level", The American Statistician, February 2010, pp. 10-14.
Hoerl, R. W. and R. D. Snee (2010), "Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools", Quality Progress, May 2010, pp. 52-53.
Hoerl, R. W. and R. D. Snee (2010), "Tried and True – Organizations Put Statistical Engineering to the Test and See Real Results", Quality Progress, June 2010, pp. 58-60.
Hoerl, R. W. and R. D. Snee (2010), "Statistical Thinking and Methods in Quality Improvement: A Look to the Future", Quality Engineering, 22, 3, pp. 119-139.
Hoerl, R. W. and R. D. Snee (2011), "Statistical Engineering: Is This Just Another Term for Applied Statistics?", Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, pp. 4-6.
Snee, R. D. and R. W. Hoerl (2010), "Further Explanation: Clarifying Points About Statistical Engineering", Quality Progress, December 2010, pp. 68-72.
Snee, R. D. and R. W. Hoerl (2011), "Engineering an Advantage", Six Sigma Forum Magazine, Guest Editorial, February 2011, pp. 6-7.
Snee, R. D. and R. W. Hoerl (2011), "Proper Blending: Finding the Right Mix of Statistical Engineering and Traditional Applied Statistics", Quality Progress, June 2011.