Statistical Engineering and BIG DATA
‘Big Data’ - A Challenge for Statistical Leadership
Chicago Chapter ASA
SAY Award Luncheon
Roger W. Hoerl
Union College
Schenectady, NY
With significant input from Ron Snee
Abstract
The Wall Street Journal, New York Times and other respected publications have had major features
recently on Big Data - the massive data sets which are becoming commonplace, and on the new,
"sexy" data mining methods developed to analyze them. These articles, as well as much of the
professional data mining and Big Data literature, may give casual users the impression that if one
has a powerful enough algorithm and a lot of data, good models and good results are guaranteed at
the push of a button. Obviously, this is not the case. The leadership challenge to the statistical
profession is to ensure that Big Data projects are built upon a sound foundation of good modeling,
and not upon the sandy foundation of hype and unstated assumptions. Further, we need to
accomplish this without giving the impression that we are "against" Big Data or newer methods. I feel
that the principles of statistical engineering (see Anderson-Cook and Lu 2012) can provide a path to
do just this. Three statistical engineering principles that are often overlooked or underemphasized by
Big Data enthusiasts are the importance of data quality - knowing the "pedigree" of the data; the
need to view statistical studies as part of the sequential process of scientific discovery - versus the
"one-shot study" so common in textbooks; and the criticality of using subject-matter knowledge when
developing models. I will present examples of the severe problems that can arise in Big Data studies
when these principles are not understood or are ignored. In summary, I argue that the development of
Big Data analytics provides significant opportunities to the profession, but at the same time requires
a more proactive role from us, if we are to provide true leadership in the Big Data phenomenon.
Outline
Statistical Leadership (Advocacy)
The “Big Data” Phenomenon
What Could Possibly Go Wrong?
Statistical Engineering, and How It Can Help
Leading the Way – Doing Big Data the Right Way
Summary
Statistical Leadership
Leadership: taking people from one paradigm to another.
Enabling people to think statistically, and apply statistical methods, requires leadership.
Opinion: too many statisticians are satisfied being experts in the tools themselves,
without worrying much about the overall impact our profession is having on society.
Can’t see the forest for the trees.
As a result, society too often compartmentalizes statisticians as narrow specialists, and
does not view us as thought leaders; they look elsewhere for leadership.
Passive consultants versus proactive leaders.
As a case in point, most professionals view the “Big Data” phenomenon as being led by
computer scientists, engineers, or data scientists (whatever that means), rather than by
statisticians.
Ron Snee, Gerry Hahn, and other leaders have been noting for years that statisticians
need to be more proactive, and guide society as to what needs to be done.
We shouldn’t be satisfied being the “tools guys”.
“Everything Rises and Falls on Leadership.” John Maxwell
Data Mining and Big Data
The technology for acquiring, storing, and processing data has been advancing
exponentially (“Big Data”), providing new opportunities to “mine” the data.
According to IBM, there are now 1.6 zettabytes (10^21 bytes) of digital data available.
To use 1.6 zettabytes of bandwidth, you would need to watch HD TV for 47,000 years.
“I keep saying that the sexy job in the next 10 years will be statisticians,” Hal Varian,
chief economist at Google. “And I’m not kidding.”
March 2012: The White House announced a national "Big Data Initiative" that
consisted of six Federal departments and agencies committing more than $200
million to Big Data research projects.
As noted by Ron Snee, data mining has been around for decades:
1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson
to analyze refinery data
1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan and others
at Bell Labs to gain insight from large data sets
1970s: DuPont uses data compression algorithms in process monitoring using
on-line systems
Big Data and Data Mining are Growing Rapidly, but Are Not New.
What’s New?
Sheer size of data – often requires compression, parallel processing, and sampling,
to store and analyze.
Some traditional methods are no longer relevant, e.g., hypothesis testing.
Insight from graphical methods must be rethought – difficult to find outliers in
zettabytes of data.
The sample sizes coupled with faster computing enable much more complex
models, relative to data sets of 30.
Due to the above, newer techniques have become popular:
CART and other tree-based methods; recursive splits on the data.
Neural networks; non-linear models involving combinations of variables – very flexible.
Methods based on bootstrapping – resampling and combining models; random forests,
“bagging”, etc.
Clustering and classification methods designed for massive data sets; K-means
clustering, support vector machines, etc.
Good News: We Have More Data and Powerful Analysis Methods.
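One common reading of the remark that hypothesis testing loses relevance at this scale is that, with enough observations, even a practically negligible effect becomes "statistically significant". The sketch below illustrates that reading with a two-sided one-sample z-test; the effect size (0.01 standard deviations), the known standard deviation, and the sample sizes are assumptions chosen only for illustration, not figures from the talk.

```python
import math


def two_sided_p_value(effect, sigma, n):
    """p-value of a two-sided one-sample z-test when the observed mean shift
    equals `effect`, the standard deviation `sigma` is known, and the sample
    size is `n`."""
    z = effect / (sigma / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative assumption: a shift of 0.01 standard deviations.
for n in (30, 10_000, 1_000_000, 100_000_000):
    p = two_sided_p_value(effect=0.01, sigma=1.0, n=n)
    print(f"n = {n:>11,}   p-value = {p:.3g}")
```

With a data set of 30 the shift is indistinguishable from noise; with millions of rows the same shift is flagged as overwhelmingly significant, so the test no longer separates important effects from trivial ones.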
What Could Possibly Go Wrong?
Duke Genomics Center published several groundbreaking articles
conclusively identifying cancer biomarkers in the 2005-2010 timeframe.
Unfortunately, clinical trials based on this research did not pan out.
Women died unexpectedly.
Two statisticians, Keith Baggerly and Kevin Coombes, dug into the
research.
New York Times, July 8, 2011:
Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed
careless – moving a row or column over by one in a giant spreadsheet – while others
seemed inexplicable. The Duke team shrugged them off as “clerical errors”... In the end,
four gene signature papers were retracted. Duke shut down three trials using the results.
(Lead investigator) Dr. Potti resigned from Duke...His collaborator and mentor, Dr.
Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling.
Large Amounts of Data Plus Sophisticated Algorithms Do Not Guarantee Success.
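The "row or column over by one" error quoted above is the kind of slip a cheap automated check can catch before any modeling begins. The following is a minimal sketch, not the procedure Baggerly and Coombes used: it assumes each table carries a sample identifier per row, and the function name `best_row_offset` and the sample IDs are hypothetical.

```python
def best_row_offset(reference_ids, data_ids, max_shift=2):
    """Return the shift s in [-max_shift, max_shift] that maximizes the number
    of positions i with reference_ids[i] == data_ids[i + s].
    A result other than 0 suggests the two tables are misaligned."""

    def matches(shift):
        return sum(
            1
            for i, ref in enumerate(reference_ids)
            if 0 <= i + shift < len(data_ids) and data_ids[i + shift] == ref
        )

    # Prefer more matches; break ties in favor of the smaller absolute shift.
    return max(
        range(-max_shift, max_shift + 1),
        key=lambda s: (matches(s), -abs(s)),
    )


# Hypothetical sample IDs: the second table has slipped down by one row.
labels = ["S1", "S2", "S3", "S4", "S5"]
shifted = ["HEADER", "S1", "S2", "S3", "S4"]
print(best_row_offset(labels, labels))   # aligned tables
print(best_row_offset(labels, shifted))  # misaligned tables
```

A check like this only works when an identifier travels with every row, which is itself an argument for recording the pedigree of each data source.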
What Could Possibly Go Wrong?
Financial giant Lehman Brothers declared bankruptcy on September 15th,
2008.
This was the largest bankruptcy filing in US history, with Lehman Brothers
holding roughly $600 billion in assets.
The Dow Jones Industrial Average dropped over 500 points that day;
several other financial institutions followed Lehman Brothers into
bankruptcy ... and the rest is history.
A few years earlier, I had visited Lehman Brothers headquarters in NY with
representatives of GE Capital:
Lehman was selling models to predict corporate defaults.
Their models were quite sophisticated, and based on large amounts of historical
financial data.
Virtually all financial institutions impacted by the crisis had models.
“Historical Results Do Not Guarantee Future Performance.”
What Could Possibly Go Wrong?
On April 18th, 2011 the book “The Making of a Fly” goes on sale on
Amazon.com.
Amazon’s automated algorithm places a price of $1,730,045 on the
book.
Later in the day, the Amazon price goes up to $23,698,656.
Plus $3.55 for shipping and handling.
No one buys the book that day.
Days later, the Amazon price was $106.
People started to buy the book.
“We are Writing Things That No One Can Read.” Kevin Slavin (2011 TED Conference)
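The slides do not say how the price climbed so far. One plausible mechanism, sketched here purely as an assumption, is a feedback loop between two automated repricing rules, each reacting to the other's latest price with no human review. The function name, the starting price, and both multipliers are hypothetical.

```python
def simulate_repricing(start_price, undercut=0.998, markup=1.27, rounds=45):
    """Return seller B's price after each round of mutual repricing.

    Hypothetical rules: seller B always prices at `markup` times seller A,
    and seller A then prices at `undercut` times seller B.
    """
    price_a = start_price
    history = []
    for _ in range(rounds):
        price_b = markup * price_a    # B prices above A
        price_a = undercut * price_b  # A then slightly undercuts B
        history.append(price_b)
    return history


history = simulate_repricing(start_price=100.0)
for round_number in (1, 15, 30, 45):
    print(f"round {round_number:>2}: ${history[round_number - 1]:,.2f}")
```

Because the product of the two multipliers exceeds 1, every round raises both prices, and the growth is exponential. Each rule looks sensible on its own; the failure only appears when the two interact.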
What Could Possibly Go Wrong?
Our quandary:
All other things being equal, “Big Data” is better than “little
data”.
The newer data mining tools are powerful and work quite
well in numerous cases.
Yet, modeling disasters continue to occur; why?
Clearly, we are missing something in the equation.
Could It Be That the Fundamentals Are Still Important?
Can Statistical Engineering Principles Help?
Some Background, and a Definition
Interesting Course Taught at Harvard
Stat 399: Problem Solving in Statistics
“…emphasizes deep, broad, and creative statistical
thinking instead of technical problems that correspond
to a recognizable textbook chapter.”*
*Xiao-Li Meng, American Statistician, August 2009
Do the Important Problems We Face “Correspond to a
Recognizable Textbook Chapter?”
Susan Hockfield – MIT President
Around the dawn of the 20th century, physicists discovered the
basic building blocks of the universe; a “parts list”, if you
will. Engineers said “we can build something from this list,”
and produced the electronics revolution, and subsequently
the computer revolution.
More recently, biologists have discovered and mapped the
basic “parts list” of life – the human genome. Engineers
have said “we can build something from this list,” and are
producing a revolution in personalized medicine.*
Who is Building Something Meaningful From the Statistical Science Parts List of Tools?
*Loosely quoted from January, 2010 seminar at GE Global Research
Statistical Engineering Definition
Statistical engineering:
The study of how to best utilize statistical concepts, methods, and tools
and integrate them with information technology and other relevant
sciences to generate improved results (Hoerl and Snee 2010a).
In other words, trying to build something meaningful from the statistical
science tools list.
Enables us to attack the large, complex, unstructured problems “that do
not correspond to a recognizable textbook chapter.”
Notes
This is a different definition than that used by Eisenhart, who we believe was
the first to use this term in 1950.
Good statisticians have always done this, but little practical guidance has
been documented in the literature.
This Definition is Consistent with Dictionary Definitions of Engineering.
Typical Phases of Statistical Engineering Projects
1. Identify problems: find the high-impact issues inhibiting
achievement of the organization’s strategic goals.
2. Create structure: carefully define the problem, objectives,
constraints, metrics for success, and so on.
3. Understand the context: identify important stakeholders (e.g.,
customers, organizations, individuals, management), research the
history of the issue, identify unstated complications and cultural
issues, locate relevant data sources.
4. Develop a strategy: create an overall, high level approach to
attacking the problem, based on phases 2 and 3.
5. Establish tactics: develop and implement diverse initiatives or
projects that collectively will accomplish the strategy.
There Are No “Seven Easy Steps” to Statistical Engineering Projects.
Statistical Engineering – Critical Considerations for BIG DATA
Data Quality
Free of omissions, errors, missing values, etc.
Missing variables
High measurement variation
Biases – human, equipment
Subject Matter Knowledge – Used in Many Different Ways
Variables selection and appropriate scales (e.g., log, inverse, square, …)
Selection of model form; linear, curvilinear, multiplicative
Interpretation of results
Ability to extrapolate findings
Use of Sequential Approaches
Big problems are not solved with one analysis or even one data set
Strategy must move beyond the one-shot study mindset
Three Macro Issues That Seem to Be Overlooked in the Big Data Literature.
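The points above about choosing appropriate scales and model form, and about the ability to extrapolate, can be illustrated with a small sketch. It assumes a hypothetical multiplicative process, y = 2·x^1.5, observed without noise only for x between 1 and 10; neither the process nor the numbers come from the talk. A straight line fitted on the raw scale and a straight line fitted on the log-log scale both track the observed range reasonably well, but they disagree sharply once we extrapolate.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical multiplicative process, observed only for x = 1..10.
xs = [float(x) for x in range(1, 11)]
ys = [2.0 * x ** 1.5 for x in xs]

# Model 1: straight line on the raw scale (no subject-matter input).
a_raw, b_raw = fit_line(xs, ys)

# Model 2: straight line on the log-log scale, i.e. the multiplicative
# form that subject-matter knowledge would suggest here.
a_log, b_log = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])

x_new = 100.0  # far outside the observed range
truth = 2.0 * x_new ** 1.5
pred_raw = a_raw + b_raw * x_new
pred_log = math.exp(a_log + b_log * math.log(x_new))

print(f"truth          : {truth:10.1f}")
print(f"raw-scale line : {pred_raw:10.1f}")
print(f"log-log line   : {pred_log:10.1f}")
```

Nothing in the data for x between 1 and 10 says which form will hold at x = 100; that judgment has to come from knowledge of the underlying process.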
Understanding the “Data Pedigree”
Trust but Verify - Data pedigree must be assessed when
analyzing Big Data. Data quality is an issue with all sources of
data.
Careful thought must be given to the model form needed to
answer the question, and whether the current data is sufficient
for that purpose.
Multiple sources of data require careful thought as to data
pedigree and how to fit the databases together to produce
useful results.
Different data sources are typically associated with political
issues, different agendas, different objectives, etc.
Good Principle: Data Are Guilty Until Proven Innocent.
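Treating data as "guilty until proven innocent" can start with a very simple audit run before any model is fitted. The sketch below counts three of the problems listed earlier (missing values, implausible values, duplicated records). The function name `audit_records`, the field names, and the plausible range are hypothetical, and a real pedigree assessment would also ask how, when, and by whom the data were collected.

```python
def audit_records(records, required_fields, valid_ranges):
    """Count basic quality problems in a list of dict records.

    required_fields: field names that must be present and not None.
    valid_ranges:    {field: (low, high)} plausible bounds for numeric fields.
    """
    report = {"missing": 0, "out_of_range": 0, "duplicates": 0}
    seen = set()
    for rec in records:
        if any(rec.get(field) is None for field in required_fields):
            report["missing"] += 1
        for field, (low, high) in valid_ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                report["out_of_range"] += 1
        key = tuple(sorted(rec.items()))  # exact-duplicate detection
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report


# Hypothetical sensor log with one gap, one implausible reading, one repeat.
records = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},
    {"id": 3, "temp_c": 950.0},
    {"id": 1, "temp_c": 21.5},
]
print(audit_records(records, ["id", "temp_c"], {"temp_c": (-50.0, 60.0)}))
```

The plausible range is itself a piece of subject-matter knowledge: the audit cannot flag a reading as impossible unless someone who understands the process says what is possible.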
The Advantages of a Sequential Approach
Much of our professional literature, and virtually all of our textbooks,
assume that statistical problems are, by their nature, “one-shot
studies”:
We are handed a fixed data set, and must develop the “best”
model to fit the data.
Articles are frequently published challenging previously published
analyses, and proposing a better model for the same data.
This is clearly the tone of many high-profile data analysis
competitions, beginning with the Netflix Challenge, and continuing
today with Kaggle.com.
Are Most Statistical Problems One-Shot Studies?
The Advantages of a Sequential Approach
In 30 years working as a statistician in the private sector, I almost
always needed a sequential approach, involving more than one
statistical tool, to solve the important problems I faced.
If one is in the midst of a sequential process, he or she approaches
data analysis from a very different viewpoint than in one-shot studies.
A key goal in the process is to direct the next round of data gathering
and analysis, as opposed to finding the “optimal” model.
Sequential approaches, as proposed by Box, Hunter, and Hunter
(2005), also offer the opportunity for using hindsight to our advantage.
“The best time to design an experiment is after examining the
results.”
Are Netflix and Kaggle.com Missing Something?
The Importance of Subject Matter Knowledge
“Data have no meaning in themselves; they are meaningful only in relation to a
conceptual model of the phenomenon being studied.” Box, Hunter, and Hunter.
Implied message of the data mining, machine learning, and Big Data literature: “Data
have complete meaning in themselves; no theory is required”.
For example, only subject matter theory, NOT statistics, allows us to extrapolate
the results of a study, say a clinical trial, to a broader population.
Subject matter theory guides the statistical process, including data collection,
analysis, and interpretation.
This is a “scientific method” approach to statistics, as opposed to a “test” approach to
statistics.
Such an approach allows statistics and statisticians an active role in developing new
theories, as opposed to simply providing yes/no answers to existing theories
(proactive leadership vs. passive consulting paradigm).
New subject matter insights lead naturally to new questions, and new data,
directly linking this principle to the sequential approach principle.
Data and Understanding Are Not Synonyms
Integration of Subject Matter Knowledge
[Figure: diagram with the labels Data, Subject Matter Theory, Business Process, and Customer, annotated “Process Knowledge Increases”.]
From Hoerl & Snee, Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley, 2012
Putting It All Together
- Providing Leadership to Ensure We Do Big Data the
Right Way
Statistical Engineering Approach to Big Data
Leadership is needed to avoid the pitfalls of the “Big Data + powerful algorithms = success”
fallacy; if we don’t lead the way, it probably won’t happen.
The fundamentals still apply – in fact they are even more critical.
The phases of Statistical Engineering provide a framework with which to attack Big
Data projects more scientifically:
1. Identify problems: find the high-impact Big Data problems – don’t wait for them to
come to you
2. Create structure: carefully define the real (versus stated) problem, objectives,
constraints, metrics for success, and so on.
3. Understand the context: obtain as much subject-matter knowledge as possible,
research the history of the issue, locate relevant data sources, and so on.
4. Develop a strategy: create an overall, high level approach to attacking the problem,
based on phases 2 and 3; incorporate a sequential approach – applying what we
learn in the initial analysis.
5. Establish tactics: develop and implement individual steps in the strategy – stay
flexible, but start with a defined plan.
Big Data Constitutes One of Our Profession’s Best Leadership Opportunities in Our History.
Summary
The glass is half-full: Big Data and associated tools offer a unique opportunity to
solve important problems that were previously intractable.
Fundamentals of good science, analytical modeling and interpretation still apply.
Ignoring these fundamentals increases the probability that invalid
conclusions are reached and inappropriate actions taken.
Statistical Engineering provides a useful approach for using Big Data to solve
important problems.
A five-phase framework is suggested to guide the work associated with Big
Data problems that are typically large, complex and unstructured.
Probability of success is significantly increased when the following aspects of
Statistical Engineering are incorporated in the approach:
Understanding of data pedigree
Utilization of sequential approaches
Integration of subject matter knowledge
Statistical Engineering Can Help Big Data Projects Be Successful
References
Davenport, T. H. and J. G. Harris (2007) Competing on Analytics, Harvard Business School Press, Boston, MA.
DeVeaux, R. D. and D. J. Hand (2005) “How to Lie with Bad Data”, Statistical Science, Vol. 20, No. 3, pp. 231-238.
Hoerl, R. W. and R. D. Snee (2012) Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.
Pierrard, J. M. (1974) “Relating Automotive Emissions and Urban Air Quality”, DuPont Innovation, Vol. 5, No. 2, pp. 6-9.
Pierrard, J. M., R. D. Snee and J. Zelson (1973) “A New Approach to Setting Vehicle Emission Standards”, Presented at Air Pollution Control Association Annual Meeting, June 24-28, 1973.
Pierrard, J. M., R. D. Snee and J. Zelson (1974) “A New Approach to Setting Vehicle Emission Standards”, Air Pollution Control Association Journal, Vol. 24, No. 9, pp. 841-848.
Snee, R. D. and R. W. Hoerl (2003) Leading Six Sigma – A Step by Step Guide Based on Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.
Snee, R. D. and R. W. Hoerl (2012) “Inquiry on Pedigree – Do You Know the Quality and Origin of Your Data?”, Quality Progress, December 2012, pp. 66-68.
Snee, R. D. and J. M. Pierrard (1977) “The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality”, Air Pollution Control Association Journal, Vol. 27, No. 2, pp. 131-133.
Articles on Statistical Engineering by Hoerl and Snee
Roger W. Hoerl and Ronald D. Snee (2009) “Post Financial Meltdown: What Do Services Industries Need From Us Now?”, Applied Stochastic Models in Business and Industry, December 2009, pp. 509-521.
Roger W. Hoerl and Ronald D. Snee (2010) “Moving the Statistics Profession Forward to the Next Level”, The American Statistician, February 2010, pp. 10-14.
Roger W. Hoerl and Ronald D. Snee (2010) “Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools”, Quality Progress, May 2010, pp. 52-53.
Roger W. Hoerl and Ronald D. Snee (2010) “Tried and True: Organizations Put Statistical Engineering to the Test and See Real Results”, Quality Progress, June 2010, pp. 58-60.
Roger W. Hoerl and Ronald D. Snee (2010) “Statistical Thinking and Methods in Quality Improvement: A Look to the Future”, Quality Engineering, 22, 3, pp. 119-139.
Roger W. Hoerl and Ronald D. Snee (2011) “Statistical Engineering: Is This Just Another Term for Applied Statistics?”, Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, pp. 4-6.
Ronald D. Snee and Roger W. Hoerl (2010) “Further Explanation; Clarifying Points About Statistical Engineering”, Quality Progress, December 2010, pp. 68-72.
Ronald D. Snee and Roger W. Hoerl (2011) “Engineering an Advantage”, Six Sigma Forum Magazine, Guest Editorial, February 2011, pp. 6-7.
Ronald D. Snee and Roger W. Hoerl (2011) “Proper Blending: Finding the Right Mix of Statistical Engineering and Traditional Applied Statistics”, Quality Progress, June 2011.

Data Science Innovations : Democratisation of Data and Data Science suresh sood
 

Similaire à Roger hoerl say award presentation 2013 (20)

The Future of Big Data
The Future of Big Data The Future of Big Data
The Future of Big Data
 
Data Scientist - Good Rebels -
Data Scientist - Good Rebels -Data Scientist - Good Rebels -
Data Scientist - Good Rebels -
 
What Data Can Do: A Typology of Mechanisms . Angèle Christin
What Data Can Do: A Typology of Mechanisms . Angèle Christin What Data Can Do: A Typology of Mechanisms . Angèle Christin
What Data Can Do: A Typology of Mechanisms . Angèle Christin
 
Big Data Research Trend and Forecast (2005-2015): An Informetrics Perspective
Big Data Research Trend and Forecast (2005-2015): An Informetrics PerspectiveBig Data Research Trend and Forecast (2005-2015): An Informetrics Perspective
Big Data Research Trend and Forecast (2005-2015): An Informetrics Perspective
 
Ayasdi Case Study
Ayasdi Case StudyAyasdi Case Study
Ayasdi Case Study
 
Ayasdi: Demystifying the Unknown
Ayasdi: Demystifying the UnknownAyasdi: Demystifying the Unknown
Ayasdi: Demystifying the Unknown
 
Data Science definition
Data Science definitionData Science definition
Data Science definition
 
Let's talk about Data Science
Let's talk about Data ScienceLet's talk about Data Science
Let's talk about Data Science
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data, Republicans and 2016
Big Data, Republicans and 2016Big Data, Republicans and 2016
Big Data, Republicans and 2016
 
Baban Hasnat is a professor of international business and ec.docx
Baban Hasnat is a professor of international business and ec.docxBaban Hasnat is a professor of international business and ec.docx
Baban Hasnat is a professor of international business and ec.docx
 
Big data 4 4 the art of the possible 4-en-web
Big data 4 4 the art of the possible 4-en-webBig data 4 4 the art of the possible 4-en-web
Big data 4 4 the art of the possible 4-en-web
 
Australia bureau of statistics some initiatives on big data - 23 july 2014
Australia bureau of statistics   some initiatives on big data - 23 july 2014Australia bureau of statistics   some initiatives on big data - 23 july 2014
Australia bureau of statistics some initiatives on big data - 23 july 2014
 
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
Guidance for Incorporating Big Data into Humanitarian Operations - 2015 - web...
 
How is Data Made? From Dataset Literacy to Data Infrastructure Literacy
How is Data Made? From Dataset Literacy to Data Infrastructure LiteracyHow is Data Made? From Dataset Literacy to Data Infrastructure Literacy
How is Data Made? From Dataset Literacy to Data Infrastructure Literacy
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyer
 
Insight white paper_2014
Insight white paper_2014Insight white paper_2014
Insight white paper_2014
 
Data Science Innovations : Democratisation of Data and Data Science
Data Science Innovations : Democratisation of Data and Data Science  Data Science Innovations : Democratisation of Data and Data Science
Data Science Innovations : Democratisation of Data and Data Science
 

Dernier

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Dernier (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Roger hoerl say award presentation 2013

  • 1. Statistical Engineering and BIG DATA
      'Big Data' - A Challenge for Statistical Leadership
      Chicago Chapter ASA, SAY Award Luncheon
      Roger W. Hoerl, Union College, Schenectady, NY
      With significant input from Ron Snee
  • 2. Abstract
      The Wall Street Journal, New York Times and other respected publications have had major features recently on Big Data - the massive data sets which are becoming commonplace, and on the new, "sexy" data mining methods developed to analyze them. These articles, as well as much of the professional data mining and Big Data literature, may give casual users the impression that if one has a powerful enough algorithm and a lot of data, good models and good results are guaranteed at the push of a button. Obviously, this is not the case. The leadership challenge to the statistical profession is to ensure that Big Data projects are built upon a sound foundation of good modeling, and not upon the sandy foundation of hype and unstated assumptions. Further, we need to accomplish this without giving the impression that we are "against" Big Data or newer methods. I feel that the principles of statistical engineering (see Anderson-Cook and Lu 2012) can provide a path to do just this. Three statistical engineering principles that are often overlooked or underemphasized by Big Data enthusiasts are the importance of data quality - knowing the "pedigree" of the data; the need to view statistical studies as part of the sequential process of scientific discovery - versus the "one-shot study" so common in textbooks; and the criticality of using subject-matter knowledge when developing models. I will present examples of the severe problems that can arise in Big Data studies when these principles are not understood or are ignored. In summary, I argue that the development of Big Data analytics provides significant opportunities to the profession, but at the same time requires a more proactive role from us, if we are to provide true leadership in the Big Data phenomenon.
  • 3. Outline
      • Statistical Leadership (Advocacy)
      • The “Big Data” Phenomenon
      • What Could Possibly Go Wrong?
      • Statistical Engineering, and How It Can Help
      • Leading the Way – Doing Big Data the Right Way
      • Summary
  • 4. Statistical Leadership
      • Leadership: taking people from one paradigm to another.
      • Enabling people to think statistically, and apply statistical methods, requires leadership.
      • Opinion: too many statisticians are satisfied being experts in the tools themselves, without worrying much about the overall impact our profession is having on society. Can’t see the forest for the trees.
      • As a result, society too often compartmentalizes statisticians as narrow specialists, and does not view us as thought leaders; they look elsewhere for leadership. Passive consultants versus proactive leaders.
      • As a case in point, most professionals view the “Big Data” phenomenon as being led by computer scientists, engineers, or data scientists (whatever that means), rather than by statisticians.
      • Ron Snee, Gerry Hahn, and other leaders have been noting for years that statisticians need to be more proactive, and guide society as to what needs to be done. We shouldn’t be satisfied being the “tools guys”.
      “Everything Rises and Falls on Leadership.” John Maxwell
  • 5. Data Mining and Big Data
      • Technologies for acquiring, storing, and processing data have been advancing exponentially (“Big Data”), providing new opportunities to “mine” the data.
      • According to IBM, there are now 1.6 zettabytes (1 zettabyte = 10^21 bytes) of digital data available. To use 1.6 zettabytes of bandwidth, you would need to watch HD TV for 47,000 years.
      • “I keep saying that the sexy job in the next 10 years will be statisticians,” Hal Varian, chief economist at Google. “And I’m not kidding.”
      • March 2012: The White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to Big Data research projects.
      • As noted by Ron Snee, data mining has been around for decades:
          • 1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson to analyze refinery data
          • 1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan and others at Bell Labs to gain insight from large data sets
          • 1970s: DuPont uses data compression algorithms in process monitoring using on-line systems
      Big Data and Data Mining Are Growing Rapidly, but Are Not New.
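The two figures quoted on this slide can be cross-checked with simple arithmetic. The sketch below treats the quoted numbers as exact and assumes an average year of 365.25 days; the function name is illustrative only. It computes the average data rate implied by consuming 1.6 zettabytes in 47,000 years.

```python
# Sanity check of the slide's figures; the year length is an assumption.
ZETTABYTE = 10**21                      # bytes
total_bytes = 1.6 * ZETTABYTE           # volume quoted on the slide
years = 47_000                          # viewing time quoted on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed average year length

def implied_rate_bytes_per_second(n_bytes, n_years):
    """Average data rate needed to consume n_bytes in n_years."""
    return n_bytes / (n_years * SECONDS_PER_YEAR)

rate = implied_rate_bytes_per_second(total_bytes, years)
print(f"{rate / 1e9:.2f} GB/s, {rate * 8 / 1e9:.2f} Gbit/s")
```

Under these assumptions the implied rate is on the order of one gigabyte per second.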
  • 6. What’s New?
      • Sheer size of data – often requires compression, parallel processing, and sampling, to store and analyze.
      • Some traditional methods are no longer relevant, e.g., hypothesis testing.
      • Insight from graphical methods must be rethought – it is difficult to find outliers in zettabytes of data.
      • The sample sizes, coupled with faster computing, enable much more complex models, relative to data sets of 30.
      • Due to the above, newer techniques have become popular:
          • CART and other tree-based methods; recursive splits on the data.
          • Neural networks; non-linear models involving combinations of variables – very flexible.
          • Methods based on bootstrapping – resampling and combining models; random forests, “bagging”, etc.
          • Clustering and classification methods designed for massive data sets; K-means clustering, support vector machines, etc.
      Good News: We Have More Data and Powerful Analysis Methods.
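The slide does not say why hypothesis testing loses relevance. One common reading, offered here as an assumption, is that with very large samples even practically negligible effects become statistically significant. The sketch below uses a one-sample z-test with known standard deviation; the effect size of 0.01 standard deviations and the sample sizes are hypothetical.

```python
import math

def z_statistic(effect, sigma, n):
    """One-sample z statistic for a mean shift `effect` with known sigma."""
    return effect / (sigma / math.sqrt(n))

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical effect: a shift of 1% of one standard deviation.
effect, sigma = 0.01, 1.0
for n in (30, 10_000, 1_000_000):
    z = z_statistic(effect, sigma, n)
    print(f"n={n:>9,}  z={z:6.2f}  p={two_sided_p(z):.3g}")
```

The same tiny shift is far from significant at n = 30 and overwhelmingly significant at n = 1,000,000, so the test result mainly reflects the sample size rather than the practical importance of the effect.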
  • 7. What Could Possibly Go Wrong?
  • 8. What Could Possibly Go Wrong?
      • Duke Genomics Center published several groundbreaking articles conclusively identifying cancer biomarkers in the 2005-2010 timeframe.
      • Unfortunately, clinical trials based on this research did not pan out. Women died unexpectedly.
      • Two statisticians, Keith Baggerly and Kevin Coombes, dug into the research.
      • New York Times, July 8, 2011:
          Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless – moving a row or column over by one in a giant spreadsheet – while others seemed inexplicable. The Duke team shrugged them off as “clerical errors”...In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. Potti resigned from Duke...His collaborator and mentor, Dr. Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling.
      Large Amounts of Data Plus Sophisticated Algorithms Do Not Guarantee Success.
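The "row or column moved over by one" error mentioned in the quotation is easy to reproduce in miniature. The sketch below uses entirely hypothetical sample IDs and labels (not data from the studies discussed). It shows that joining two columns by row position silently mislabels samples after a one-row shift, whereas keeping each label attached to its sample ID and joining by key is unaffected by row order.

```python
# Hypothetical sample IDs and labels, for illustration only.
true_labels = {"S1": "sensitive", "S2": "resistant",
               "S3": "sensitive", "S4": "resistant"}
sample_ids = list(true_labels)
label_column = list(true_labels.values())

# Simulate the slip: the label column is moved down by one row.
shifted_column = label_column[-1:] + label_column[:-1]

# Positional join: trusts that row i of each column is the same sample.
positional = dict(zip(sample_ids, shifted_column))
mislabeled = [s for s in sample_ids if positional[s] != true_labels[s]]
print("mislabeled by position:", mislabeled)   # no error is raised

# Keyed join: each row carries its own ID, so row order does not matter.
label_rows = list(true_labels.items())
rotated_rows = label_rows[-1:] + label_rows[:-1]
keyed = dict(rotated_rows)
print("keyed join correct:", keyed == true_labels)
```

The point of the design is that the positional join fails without any warning, which is why such errors can survive into published results.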
  • 9. What Could Possibly Go Wrong?
      • Financial giant Lehman Brothers declared bankruptcy on September 15th, 2008.
      • This was the largest bankruptcy filing in US history, with Lehman Brothers holding roughly $600 billion in assets.
      • The Dow Jones Industrial Average dropped over 500 points that day; several other financial institutions followed Lehman Brothers into bankruptcy ... and the rest is history.
      • A few years earlier, I had visited Lehman Brothers headquarters in NY with representatives of GE Capital:
          • Lehman was selling models to predict corporate defaults.
          • Their models were quite sophisticated, and based on large amounts of historical financial data.
      • Virtually all financial institutions impacted by the crisis had models.
      “Historical Results Do Not Guarantee Future Performance.”
  • 10. What Could Possibly Go Wrong?
      • On April 18th, 2011 the book “The Making of a Fly” goes on sale on Amazon.com.
      • Amazon’s automated algorithm places a price of $1,730,045 on the book.
      • Later in the day, the Amazon price goes up to $23,698,656. Plus $3.55 for shipping and handling.
      • No one buys the book that day.
      • Days later, the Amazon price was $106. People started to buy the book.
      “We are Writing Things That No One Can Read.” Kevin Slavin (2011 TED Conference)
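The slide attributes the price to an automated algorithm but does not describe it. One mechanism that can produce this kind of runaway price, offered here purely as an assumption, is two automated repricing rules reacting to each other with no upper bound. In the sketch below the sellers, starting prices, and multipliers are all hypothetical.

```python
def simulate_repricing(price_a, price_b, factor_a, factor_b, days):
    """Each day seller A reprices relative to B, then B relative to A.

    Returns a list of (price_a, price_b) pairs, one per day.
    """
    history = []
    for _ in range(days):
        price_a = factor_a * price_b   # A slightly undercuts B
        price_b = factor_b * price_a   # B marks up over A
        history.append((round(price_a, 2), round(price_b, 2)))
    return history

# Hypothetical rules: A prices at 99% of B; B prices at 125% of A.
history = simulate_repricing(100.0, 100.0, factor_a=0.99, factor_b=1.25, days=30)
for day in (1, 10, 20, 30):
    print(day, history[day - 1])
```

Because the product of the two multipliers exceeds 1, both prices grow geometrically every day, even though each rule looks reasonable in isolation. A simple price ceiling or human review step would stop the loop.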
  • 11. What Could Possibly Go Wrong?
      Our quandary:
      • All other things being equal, “Big Data” is better than “little data”.
      • The newer data mining tools are powerful and work quite well in numerous cases.
      • Yet, modeling disasters continue to occur; why?
      • Clearly, we are missing something in the equation.
      Could It Be That the Fundamentals Are Still Important?
  • 12. Can Statistical Engineering Principles Help? Some Background, and a Definition
  • 13. Interesting Course Taught at Harvard
      Stat 399: Problem Solving in Statistics
      “…emphasizes deep, broad, and creative statistical thinking instead of technical problems that correspond to a recognizable textbook chapter.”*
      *Xiao-Li Meng, American Statistician, August 2009
      Do the Important Problems We Face “Correspond to a Recognizable Textbook Chapter?”
  • 14. Susan Hockfield – MIT President
      • Around the dawn of the 20th century, physicists discovered the basic building blocks of the universe; a “parts list”, if you will.
      • Engineers said “we can build something from this list,” and produced the electronics revolution, and subsequently the computer revolution.
      • More recently, biologists have discovered and mapped the basic “parts list” of life – the human genome.
      • Engineers have said “we can build something from this list,” and are producing a revolution in personalized medicine.*
      Who is Building Something Meaningful From the Statistical Science Parts List of Tools?
      *Loosely quoted from January, 2010 seminar at GE Global Research
  • 15. Statistical Engineering Definition
      • Statistical engineering: The study of how to best utilize statistical concepts, methods, and tools and integrate them with information technology and other relevant sciences to generate improved results (Hoerl and Snee 2010a).
      • In other words, trying to build something meaningful from the statistical science tools list.
      • Enables us to attack the large, complex, unstructured problems “that do not correspond to a recognizable textbook chapter.”
      Notes
      • This is a different definition than that used by Eisenhart, who we believe was the first to use this term in 1950.
      • Good statisticians have always done this, but little practical guidance has been documented in the literature.
      This Definition is Consistent with Dictionary Definitions of Engineering.
  • 16. Typical Phases of Statistical Engineering Projects
      1. Identify problems: find the high-impact issues inhibiting achievement of the organization’s strategic goals.
      2. Create structure: carefully define the problem, objectives, constraints, metrics for success, and so on.
      3. Understand the context: identify important stakeholders (e.g., customers, organizations, individuals, management), research the history of the issue, identify unstated complications and cultural issues, locate relevant data sources.
      4. Develop a strategy: create an overall, high level approach to attacking the problem, based on phases 2 and 3.
      5. Establish tactics: develop and implement diverse initiatives or projects that collectively will accomplish the strategy.
      There Are No “Seven Easy Steps” to Statistical Engineering Projects.
  • 17. Statistical Engineering – Critical Considerations for BIG DATA
      • Data Quality
          • Free of omissions, errors, missing values, etc.
          • Missing variables
          • High measurement variation
          • Biases – human, equipment
      • Subject Matter Knowledge – Used in many different ways
          • Variables selection and appropriate scales (e.g., log, inverse, square, …)
          • Selection of model form; linear, curvilinear, multiplicative
          • Interpretation of results
          • Ability to extrapolate findings
      • Use of Sequential Approaches
          • Big problems are not solved with one analysis or even one data set
          • Strategy must move beyond the one-shot study mindset
      Three Macro Issues That Seem to Be Overlooked in the Big Data Literature.
  • 18. Understanding the “Data Pedigree”
      • Trust but Verify - Data pedigree must be assessed when analyzing Big Data.
      • Data quality is an issue with all sources of data.
      • Careful thought must be given to the model form needed to answer the question, and whether the current data are sufficient for that purpose.
      • Multiple sources of data require careful thought as to data pedigree and how to fit the databases together to produce useful results.
      • Different data sources are typically associated with political issues, different agendas, different objectives, etc.
      Good Principle: Data Are Guilty Until Proven Innocent.
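A "guilty until proven innocent" stance can be made concrete with a small audit that runs before any modeling. The following is a minimal sketch, not a method from the talk: the record fields, plausible ranges, and the `audit` helper are all hypothetical. It flags missing values, out-of-range measurements, and exact duplicate rows.

```python
def audit(records, required, ranges):
    """Return a list of (row_index, field, problem) findings."""
    findings = []
    seen = set()
    for i, rec in enumerate(records):
        # Exact duplicate rows often indicate a merge or export problem.
        key = tuple(sorted(rec.items()))
        if key in seen:
            findings.append((i, None, "duplicate row"))
        seen.add(key)
        # Required fields must be present and not None.
        for field in required:
            if rec.get(field) is None:
                findings.append((i, field, "missing"))
        # Measurements must fall inside a plausible physical range.
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value is not None and not low <= value <= high:
                findings.append((i, field, "out of range"))
    return findings

# Hypothetical process data, for illustration only.
records = [
    {"unit": "A", "temp_c": 71.2, "pressure_kpa": 101.0},
    {"unit": "B", "temp_c": None, "pressure_kpa": 99.5},
    {"unit": "C", "temp_c": 712.0, "pressure_kpa": 100.2},
    {"unit": "A", "temp_c": 71.2, "pressure_kpa": 101.0},
]
findings = audit(
    records,
    required=("unit", "temp_c", "pressure_kpa"),
    ranges={"temp_c": (-40.0, 150.0), "pressure_kpa": (80.0, 120.0)},
)
for finding in findings:
    print(finding)
```

The plausible ranges are exactly where subject-matter knowledge enters: only someone who knows the process can say whether 712 degrees is a sensor fault or a misplaced decimal point.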
  • 19. The Advantages of a Sequential Approach
      • Much of our professional literature, and virtually all of our textbooks, assume that statistical problems are, by their nature, “one-shot studies”:
          • We are handed a fixed data set, and must develop the “best” model to fit the data.
          • Articles are frequently published challenging previously published analyses, and proposing a better model for the same data.
      • This is clearly the tone of many high-profile data analysis competitions, beginning with the Netflix Challenge, and continuing today with Kaggle.com.
      Are Most Statistical Problems One-Shot Studies?
  • 20. The Advantages of a Sequential Approach
      • In 30 years working as a statistician in the private sector, I almost always needed a sequential approach, involving more than one statistical tool, to solve the important problems I faced.
      • If one is in the midst of a sequential process, he or she approaches data analysis from a very different viewpoint than in one-shot studies.
          • A key goal in the process is to direct the next round of data gathering and analysis, as opposed to finding the “optimal” model.
      • Sequential approaches, as proposed by Box, Hunter, and Hunter (2005), also offer the opportunity for using hindsight to our advantage.
          • “The best time to design an experiment is after examining the results.”
      Are Netflix and Kaggle.com Missing Something?
  • 21. The Importance of Subject Matter Knowledge
      • “Data have no meaning in themselves; they are meaningful only in relation to a conceptual model of the phenomenon being studied.” Box, Hunter, and Hunter.
      • Implied message of the data mining, machine learning, and Big Data literature: “Data have complete meaning in themselves; no theory is required”.
      • For example, only subject matter theory, NOT statistics, allows us to extrapolate the results of a study, say a clinical trial, to a broader population.
      • Subject matter theory guides the statistical process, including data collection, analysis, and interpretation.
      • This is a “scientific method” approach to statistics, as opposed to a “test” approach to statistics.
      • Such an approach allows statistics and statisticians an active role in developing new theories, as opposed to simply providing yes/no answers to existing theories (proactive leadership vs. passive consulting paradigm).
      • New subject matter insights lead naturally to new questions, and new data, directly linking this principle to the sequential approach principle.
      Data and Understanding Are Not Synonyms
  • 22. Integration of Subject Matter Knowledge
      [Figure: diagram with the labels Data, Subject Matter Theory, Process Knowledge Increases, Business Process, and Customer Data.]
      From Hoerl & Snee, Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley, 2012
  • 23. Putting It All Together - Providing Leadership to Ensure We Do Big Data the Right Way
  • 24. Statistical Engineering Approach to Big Data
      • Leadership is needed to avoid the pitfalls of the “Big Data + powerful algorithms = success” fallacy; if we don’t lead the way, it probably won’t happen.
      • The fundamentals still apply – in fact they are even more critical.
      • The phases of Statistical Engineering provide a framework with which to attack Big Data projects more scientifically:
          1. Identify problems: find the high-impact Big Data problems – don’t wait for them to come to you.
          2. Create structure: carefully define the real (versus stated) problem, objectives, constraints, metrics for success, and so on.
          3. Understand the context: obtain as much subject-matter knowledge as possible, research the history of the issue, locate relevant data sources, and so on.
          4. Develop a strategy: create an overall, high level approach to attacking the problem, based on phases 2 and 3; incorporate a sequential approach – applying what we learn in the initial analysis.
          5. Establish tactics: develop and implement individual steps in the strategy – stay flexible, but start with a defined plan.
      Big Data Constitutes One of Our Profession’s Best Leadership Opportunities in Our History.
  • 25. Summary. The glass is half-full: Big Data and associated tools offer a unique opportunity to solve important problems that were previously intractable. The fundamentals of good science, analytical modeling, and interpretation still apply. Ignoring these fundamentals increases the probability that invalid conclusions are reached and inappropriate actions taken. Statistical Engineering provides a useful approach for using Big Data to solve important problems. A five-phase framework is suggested to guide the work associated with Big Data problems, which are typically large, complex, and unstructured. The probability of success is significantly increased when the following aspects of Statistical Engineering are incorporated in the approach: understanding of data pedigree, utilization of sequential approaches, and integration of subject matter knowledge. Statistical Engineering Can Help Big Data Projects Be Successful.
  • 26. References
      Davenport, T. H. and J. G. Harris (2007) Competing on Analytics, Harvard Business School Press, Boston, MA.
      DeVeaux, R. D. and D. J. Hand (2005) "How to Lie with Bad Data", Statistical Science, Vol. 20, No. 3, pp 231-238.
      Hoerl, R. W. and R. D. Snee (2012) Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.
      Pierrard, J. M. (1974) "Relating Automotive Emissions and Urban Air Quality", DuPont Innovation, Vol. 5, No. 2, pp 6-9.
      Pierrard, J. M., R. D. Snee and J. Zelson (1973) "A New Approach to Setting Vehicle Emission Standards", Presented at Air Pollution Control Association Annual Meeting, June 24-28, 1973.
      Pierrard, J. M., R. D. Snee and J. Zelson (1974) "A New Approach to Setting Vehicle Emission Standards", Air Pollution Control Association Journal, Vol. 24, No. 9, pp 841-848.
      Snee, R. D. and R. W. Hoerl (2003) Leading Six Sigma – A Step by Step Guide Based on Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.
      Snee, R. D. and R. W. Hoerl (2012) "Inquiry on Pedigree – Do You Know the Quality and Origin of Your Data?", Quality Progress, December 2012, pp 66-68.
      Snee, R. D. and J. M. Pierrard (1977) "The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality", Air Pollution Control Association Journal, Vol. 27, No. 2, pp 131-133.
  • 27. Articles on Statistical Engineering by Hoerl and Snee
      Hoerl, R. W. and R. D. Snee (2009) "Post Financial Meltdown: What Do Services Industries Need From Us Now?", Applied Stochastic Models in Business and Industry, December 2009, pp 509-521.
      Hoerl, R. W. and R. D. Snee (2010) "Moving the Statistics Profession Forward to the Next Level", The American Statistician, February 2010, pp 10-14.
      Hoerl, R. W. and R. D. Snee (2010) "Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools", Quality Progress, May 2010, pp 52-53.
      Hoerl, R. W. and R. D. Snee (2010) "Tried and True: Organizations Put Statistical Engineering to the Test and See Real Results", Quality Progress, June 2010, pp 58-60.
      Hoerl, R. W. and R. D. Snee (2010) "Statistical Thinking and Methods in Quality Improvement: A Look to the Future", Quality Engineering, Vol. 22, No. 3, pp 119-139.
      Hoerl, R. W. and R. D. Snee (2011) "Statistical Engineering: Is This Just Another Term for Applied Statistics?", Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, pp 4-6.
      Snee, R. D. and R. W. Hoerl (2010) "Further Explanation; Clarifying Points About Statistical Engineering", Quality Progress, December 2010, pp 68-72.
      Snee, R. D. and R. W. Hoerl (2011) "Engineering an Advantage", Six Sigma Forum Magazine, Guest Editorial, February 2011, pp 6-7.
      Snee, R. D. and R. W. Hoerl (2011) "Proper Blending: Finding the Right Mix of Statistical Engineering and Traditional Applied Statistics", Quality Progress, June 2011.