Denise Esserman MedicReS World Congress 2015

The Future of Frequentist
Hypothesis Testing
Denise Esserman, PhD
Yale Center for Analytical Sciences (YCAS)
Yale School of Public Health
October 19-25 | 2015 New York
www.medicres.org

Outline
• Definition of Big Data
• Expanding into Medicine
• Statistical Concerns
• The Future

“Big data is like teenage sex: everyone talks
about it, nobody really knows how to do it,
everyone thinks that everyone else is doing it,
so everyone claims they are doing it…”
Dan Ariely 2013
[Terry Speed talk, 2014]

What are/is “Big Data”?
• “Buzzword”
• Multiple Definitions
– The V’s: volume, velocity, variability, variety,
veracity, value (and complexity)
• Variability in the quantity and quality of the
data

Origin
• Believed to have originated with Web search
companies
– Querying very large distributed aggregations of
loosely-structured data [1]
• Does not always reference the volume of
data

Big Data Trends
• Google trends

“The hopeful vision of big data is that
organizations will be able to harvest and
harness every byte of relevant data and use it
to make the best decisions. Big data
technologies not only support the ability to
collect large amounts, but more importantly,
the ability to understand and take advantage
of its full value.”
Mark Troester [14]

Big Data in Medicine
• “Potential lies in innovative ways it can be
linked, related, and integrated to provide
more detailed and personalized
information than is possible with data from
a single source” [7]
• For health care providers to offer
personalized medicine

NIH – Big Data to Knowledge [6]
“The ability to harvest the wealth of information
contained in biomedical Big Data will advance our
understanding of human health and disease;
however, lack of appropriate tools, poor data
accessibility, and insufficient training, are major
impediments to rapid translational impact. To meet
this challenge, the National Institutes of Health
(NIH) launched the Big Data to Knowledge (BD2K)
initiative in 2012.”

BD2K Mission Statement
“BD2K is a trans-NIH initiative established
to enable biomedical research as a digital
research enterprise, to facilitate discovery
and support new knowledge, and to
maximize community engagement.”

BD2K Four Major Aims
• To facilitate broad use of biomedical digital assets by making them
discoverable, accessible, and citable.
• To conduct research and develop the methods, software, and tools
needed to analyze biomedical Big Data.
• To enhance training in the development and use of methods and
tools necessary for biomedical Big Data science.
• To support a data ecosystem that accelerates discovery as part of a
digital enterprise.

Biomedical Big Data [6]
• More than just very large data or a large number of data
sources.
– Complexity, challenges, and new opportunities
presented by the combined analysis of data.
• Diverse and complex.
– Imaging, phenotypic, molecular, exposure, health,
behavioral, and many other types of data.

• Faces many challenges.
– Unwieldy amount of information
– Lack of organization and access to data and tools
– Insufficient training in data science methods
• Spectacular opportunities.
– Maximize the potential of existing data and enable new
directions for research.

Quantity of data does not mean one can
ignore foundational issues of measurement
and construct validity and reliability and
dependencies among data [5]

Google Flu Trends [2,3]
• Machine Learning Algorithm
– Predict number of flu cases based on Google
Search Terms
– Theory Free
– Misunderstanding about uncertainties in data
collection and modeling process
• Inaccurate results over time
– Lack of statistics: did not know what linked
search terms to spread of flu

Google Flu Trends (cont)
• Correlation rather than causation
– Cheaper and easier
• Theory-free analysis is fragile
• Intended as a “complementary signal”
rather than a stand-alone forecasting tool [4]

“There are a lot of small data problems that
occur in big data. They don’t disappear
because you’ve got lots of the stuff. They get
worse.”
David Spiegelhalter
“Big data is like a big trash dump. You have to
know how to find the nuggets so it’s
profitable.”
Vin Gupta

Indexing vs. Analyzing Big Data
• Search companies index it
– Make relevant data easy to use
• Statisticians analyze it
– Find structure within the data

Data Scientists [7]
• People who draw insights from large quantities of
data
– Innovative problem solvers
– Expertise in statistical modeling and machine learning
– Specialized programming skills
– Solid grasp of problem domain
• Data science is a blend of statistical, mathematical,
and computational sciences

Statistical Disconnect
• Statisticians should be leaders of the Big Data
and data science movement [8]
– Scope goes beyond traditional activities
• Statisticians need to be more engaged
– Need to develop the skills to handle the sheer
volume of data
• Data scientists need to engage more
statisticians (or more statisticians need to
become data scientists)

“Better data matters because simply having
Big Data does not guarantee reliable answers
for Big Questions.”
Robert Rodriguez

Some of the Statistical Concerns
• Sampling populations
– Sampling error – not representative
• Confounders
• Multiple Testing
• Bias
– Sampling bias – not randomly chosen
• Overfitting

Big n problem
• Myth that, because n is large, the problem is
only computational, not statistical
• Standard errors can be large even with Big
Data
– Issue with large p

• Scale of data requires spreading across
cluster or grid of computers
• Computational work to be distributed with
the data
– Google MapReduce Model for parallel
programming (Apache Hadoop)
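
The idea of sending the computation to the data can be sketched in plain Python. This is only an illustration of the map/reduce pattern, not the Hadoop API: the function names and the pulse readings are hypothetical, and the "nodes" are simulated by a list of partitions.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as (sum, count)."""
    return (sum(values), len(values))

def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(partitions):
    # In a real cluster each map_partition call would run where its data lives;
    # here the partitions are processed one after another (assumption).
    partials = [map_partition(p) for p in partitions]
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count

# Hypothetical pulse readings split across three "nodes"
partitions = [[98, 75], [101, 99, 88], [72]]
print(distributed_mean(partitions))
```

Only the small (sum, count) summaries travel between nodes, which is why the pattern scales to data that cannot fit on one machine.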

Software Alchemy (SA)
• Simple, powerful method to reduce computation
• Partition the data into r groups and average
the estimator across groups
– Requires partitions to be distributed similarly
• May need initial “shuffle” step
• Works well for any asymptotically normal
estimator
– Also works for p growing
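
A minimal sketch of the chunk-and-average idea described above, using a least-squares slope as the estimator. The simulated data, the function names, and the choice of r = 4 are assumptions made for illustration; the talk does not specify an implementation.

```python
import random
from statistics import mean

def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def chunk_average(pairs, r, estimator, seed=0):
    """Shuffle, split into r chunks, and average the per-chunk estimates."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)      # initial "shuffle" step
    chunks = [shuffled[i::r] for i in range(r)]
    # Each estimator call is independent and could run on its own node.
    return mean(estimator(c) for c in chunks)

# Simulated data with a true slope of 2 (hypothetical example)
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(6000)]
data = [(x, 2.0 * x + rng.gauss(0, 1)) for x in xs]

full_estimate = ols_slope(data)
sa_estimate = chunk_average(data, r=4, estimator=ols_slope)
print(full_estimate, sa_estimate)
```

The shuffle is what makes the chunks similarly distributed; without it, data stored in a sorted or time-ordered layout could give chunks that estimate different things.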

Big p problem
• High-dimensional data
• Dimension reduction
– e.g. Principal components analysis, variable
selection
• Issues:
– Multiple comparisons and simultaneous
inference
– Sparsity of data
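
To make the multiple-comparisons issue concrete, the sketch below computes the chance of at least one false positive across m tests, assuming the tests are independent and every null hypothesis is true. The Bonferroni threshold is a standard correction brought in here for illustration; it is not named in the talk.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / m

for m in (1, 10, 100, 1000):
    print(m, round(familywise_error_rate(0.05, m), 3), bonferroni_threshold(0.05, m))
```

With p in the thousands, an uncorrected 0.05 threshold makes at least one spurious finding almost certain, while the corrected threshold becomes very strict, which is one reason large p costs statistical power.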

In the future…
• Can we find methods that allow larger
values of p than “safe” o(√n)?
– Dimension reduction may distort results
• Can we more easily verify technical
assumptions?
– e.g. potential lack of consistency of the LASSO
• Can we find more general methods?

Theoretical Null Distributions
• Null distribution is most often not estimated, but
hypothesized in classic hypothesis testing
• Incorrect null might lead to false inference
• Influences
– Correlation
– Incorrect assumptions
– Unobserved covariates

Empirical Null Distribution
• Empirical null estimated from study’s data
• Does not assume “nice” normal with variance
going to 0
• Need independence assumption, but do not need
identically distributed assumption
– Can have heterogeneous groups
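
A rough sketch of estimating a null distribution from the data rather than assuming N(0, 1). It uses the median and the median absolute deviation of the z-values as a crude stand-in for the null's center and spread; that estimator, the simulated z-values, and the 1.96 cutoff are assumptions of this example, not the procedure from the talk.

```python
import random
from statistics import median

def empirical_null(z_values):
    """Robust center and scale of the bulk of the z-values (median and MAD)."""
    center = median(z_values)
    mad = median(abs(z - center) for z in z_values)
    return center, 1.4826 * mad  # 1.4826 makes the MAD comparable to a normal SD

rng = random.Random(0)
# Simulated, hypothetical z-values: 95% null but shifted and overdispersed
# (as correlation or unobserved covariates might cause), 5% genuine effects.
z_values = ([rng.gauss(0.2, 1.3) for _ in range(4750)]
            + [rng.gauss(4.0, 1.0) for _ in range(250)])

center, scale = empirical_null(z_values)
theoretical_hits = sum(abs(z) > 1.96 for z in z_values)
empirical_hits = sum(abs((z - center) / scale) > 1.96 for z in z_values)
print(center, scale, theoretical_hits, empirical_hits)
```

When the bulk of the z-values is wider than N(0, 1), the theoretical null flags many more tests than the empirical null does, which illustrates how an incorrect null can lead to false inference.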

Big-data Clinical Trials (BCT) [10]
• Neglected problem in RCT – analysis
typically based on different effectiveness of
different interventions provided at
baseline
• Want to be able to analyze association of
baseline treatment and its subsequent
dynamic processes

Example: Blood Pressure
• RCT:
– Effectiveness of Antihypertensive on BP control
– BP measured at specified outcomes
– Long term outcomes (e.g. stroke)
• BCT:
– Maintenance of stable BP every day, hour, minute
– “Dosing” equipment

Epidemiologic Perspective
• Big data is useful to detect rare drug-related
side effects, not likely to be observed in RCT
• Chance to look at rare diseases
• Can contribute to an understanding of the
strengths and limitations of new population
sources [12]

Future of BCT [10]
• How will this be defined?
• What is the “right” data to collect?
• Who is in a position to design this trial?
• How do we handle threats of big data?
• How do we incorporate the non-static
populations?

Moving Beyond Rectangular Data
• Structure of the data may be irregular, non-
structured
– Knowledge will change over time [11]
• Varying types of data
– Pictures, videos, images
– Unstructured and semi-structured data from social
media

Machine Learning
“…a subfield of computer science that evolved from
the study of pattern recognition and computational
learning theory in artificial intelligence. Machine
learning explores the study and construction of
algorithms that can learn from and make predictions
on data. Such algorithms operate by building a
model from example inputs in order to make data-
driven predictions or decisions, rather than
following strictly static program instructions.” [9]

Machine Learning Tasks
• Supervised learning
– Example inputs and output – learn a rule that maps inputs to
outputs
• Semi-supervised learning
– Incomplete training signal (i.e. target outputs missing)
• Unsupervised learning
– Algorithm left on its own to find structure in input
• Reinforcement learning
– Interaction with dynamic environment

Example: Support Vector Machines
• Supervised learning method
• Used for classification and regression
analysis
• Can perform non-linear classification

Challenges with Machine Learning
• Need to choose the appropriate algorithm
• Need to be able to define hyper-parameters
– One or more model parameters
• Data needs to be in appropriate format
– This is not trivial
• Execution time grows with number of attributes
and data instances

Future of Machine Learning
• Automatic searches for optimal algorithm
and hyper-parameters
– Very time consuming, limited usefulness at
present
• More user friendly approaches
– e.g. allow healthcare researcher to efficiently
and independently build predictive model [13]
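
To show what an automatic hyper-parameter search involves, here is a toy grid search that picks k for a nearest-neighbour classifier by leave-one-out cross-validation. The classifier, the data, and the candidate values are all hypothetical choices for illustration; the tools discussed in the talk are not reproduced here.

```python
def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(data)

def grid_search(data, candidates):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical (feature, label) pairs
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (2.2, 1), (3.0, 1), (3.5, 1), (4.0, 1)]
best, scores = grid_search(data, [1, 3, 5])
print(best, scores)
```

Every candidate requires one model evaluation per held-out point, so the cost multiplies quickly as candidates, algorithms, and data grow, which is why such searches are so time consuming.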

Example: Machine Learning for Big
Clinical Data (MLBCD) [13]
• Supports whole process of iterative machine learning on
big clinical data
– Clinical parameter extraction
– Feature construction
– Algorithm and hyper-parameter selection
– Model building
– Model evaluation
• Can be used once the researcher has (1) defined the study population and
research question, (2) obtained the clinical data set, and (3)
prepped the data, including cleaning and filling in missing
data

Data Set Preparation
• Tremendous amount of work goes into
getting a data set together
– Acquire, normalize, clean
• e.g. Pivoting entity-attribute value format EMR to
relational table formats [13]
ID (Entity)  Test # (Attribute)  Pulse (Value)
100100       Test 1              98
989021       Test 2              101
100100       Test 2              75
989021       Test 3              99
989021       Test 4              88

ID      Test_1  Test_2  Test_3
100100  98      75      null
989021  null    101     99
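
The pivot from entity-attribute-value rows to a relational table can be sketched as follows. The function name is hypothetical and the rows are the ones from the example above. Because the sketch pivots every test that appears in the rows, it also produces a Test_4 column, which the second table above does not show.

```python
rows = [
    (100100, "Test 1", 98),
    (989021, "Test 2", 101),
    (100100, "Test 2", 75),
    (989021, "Test 3", 99),
    (989021, "Test 4", 88),
]

def pivot_eav(rows):
    """Pivot (entity, attribute, value) rows into one record per entity."""
    columns = sorted({attr.replace(" ", "_") for _, attr, _ in rows})
    table = {}
    for entity, attr, value in rows:
        # Start each entity with every column set to None (the "null" cells).
        record = table.setdefault(entity, dict.fromkeys(columns))
        record[attr.replace(" ", "_")] = value
    return columns, table

columns, table = pivot_eav(rows)
print("ID", *columns)
for entity, record in table.items():
    print(entity, *(record[c] for c in columns))
```

Real EMR data adds complications this sketch ignores, such as repeated measurements of the same test for one patient, which would need a rule for choosing or aggregating values.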

• Even bigger challenge with large data sets
– Different formats
– Different locations
– Data quality and governance
– Security, privacy and regulatory challenges
• Surprisingly little work done here – and it
should be a priority!

Example: Fitbit
• Study of Cardiac Risk
• Use Fitbit to measure daily step counts for 3
years (4000 participants)
• Participants need to upload to
laptop/phone and then sync to Fitbit server
• Device holds one month of data
• Need to then connect with other study data

Where should the field head?
• Need new techniques for data management
• New tools for data analysis
• New tools for data visualization
• Ways to acquire and analyze unstructured
text data

“Big data is not about the technologies to
store massive amounts of data, it is about
creating a flexible infrastructure with high-
performance computing, high-performance
analytics and governance – in a deployment
model that makes sense for the
organization.” [14]
Mark Troester

References
1. http://www.webopedia.com/TERM/B/big_data.html (accessed October 15, 2015)
2. http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/ (accessed October 15, 2015)
3. http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html (accessed October 15, 2015)
4. http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0 (accessed October 16, 2015)
5. Lazer D, Kennedy R, King G, Vespignani A. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343: 1203-05, (2014).
6. https://datascience.nih.gov/bd2k/about/what (accessed October 15, 2015)
7. http://magazine.amstat.org/blog/2012/06/01/prescorner/ (accessed October 17, 2015)
8. http://magazine.amstat.org/blog/2013/06/01/the-asa-and-big-data/ (accessed October 18, 2015)
9. https://en.wikipedia.org/wiki/Machine_learning (accessed October 18, 2015)
10. Wang SD. Opportunities and challenges of clinical research in the big-data era: from RCT to BCT. J Thorac Dis, 5(6): 721-723, (2013)
11. Wang SD, Shen Y. Redefining big-data clinical trial (BCT). Annals of Translational Medicine. 2(10): 96, (2014)
12. Gange SJ, Golub ET. From smallpox to big data: The Next 100 years of epidemiologic methods. American Journal of Epidemiology. DOI: 10.1093/aje/kwv150
13. Luo G. MLBCD: a machine learning tool for big clinical data. Health Inf Sci Syst. 3:3, (2015)
14. Big Data Meets Big Data Analytics. SAS White Paper
