Nicholas Jewell MedicReS World Congress 2014
1. Teaching Medical Research Methodology
Nicholas P. Jewell
Departments of Statistics &
School of Public Health (Biostatistics)
University of California, Berkeley
October 16, 2014
2. Medical Research Methodology
• All modern medical and public health research now
requires a considerable amount of biostatistics,
computer science, data processing and machine
learning: data science
– Study design, measurement technologies, distributed data
and parallel computing, databases, mHealth
– Data analysis methodology, software, causal models
– Open access, open data availability, role of industry
– Ethics
© Nicholas P. Jewell, 2014
6. What is the Primary Scientific
Question?
• What is the one “number” you want to know
• What will be the “newspaper headline”?
• What is the parameter of interest?
• What would be a meaningful effect? (range of
parameter values of interest)
7. Measurement Issues
• How will exposure be measured?
• How will the outcome be measured?
[Diagram: Parameter of Interest? / Outcome and Exposure Definition / Outcome and Exposure Measurement]
8. Data Collection--Sampling
• What is the target population?
• What is the study population?
• How will individuals be sampled (case-control, cohort,
longitudinal)?
• How is exposure assigned to individuals?
• What is now my parameter of interest?
9. Can I Draw a Prototype DAG?
• Directed acyclic graphs (DAGs) to relate
variables, including exogenous and
selection variables if necessary
• Is the parameter of interest identifiable
from the design and under what
assumptions?
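A prototype DAG can be written down as a small adjacency map before any data are collected. The sketch below is illustrative only: the variable names (`age`, `smoking`, `cancer`, `selected`) are hypothetical and do not come from the talk, and the check only verifies that the graph has no directed cycle. It says nothing about whether the parameter of interest is identifiable.

```python
def is_acyclic(graph):
    """Return True if the directed graph (node -> list of children) has no cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(graph) | {c for children in graph.values() for c in children}
    colour = {node: WHITE for node in nodes}

    def visit(node):
        colour[node] = GREY
        for child in graph.get(node, []):
            if colour[child] == GREY:
                return False  # back edge: a directed cycle exists
            if colour[child] == WHITE and not visit(child):
                return False
        colour[node] = BLACK
        return True

    return all(visit(n) for n in nodes if colour[n] == WHITE)


# Hypothetical prototype: a confounder, an exposure, an outcome, a selection node.
prototype_dag = {
    "age": ["smoking", "cancer"],        # assumed confounder
    "smoking": ["cancer", "selected"],   # assumed exposure
    "cancer": ["selected"],              # assumed outcome
}

print(is_acyclic(prototype_dag))
```

Writing the graph as plain data makes it easy to revise the prototype as the design questions on the neighbouring slides are answered.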
10. Pre-specified Analysis Plan
• Multiple Outcomes (Which one is
primary?)
• Confounders?
• Subgroups of Interest?
• Effect Modification of Interest?
• Mediation?
11. What is the Recipe for
Assessing a Design?
The Four Hs
• What basic questions should be asked?
– How were the data collected? (e.g. design, sample size
issues, loss to follow-up)
– How were the variables measured?
– How were the data analyzed? (e.g. methods/
assumptions, software, pre-specified methods)
– How were the results reported? (e.g. sensitivity,
interpretation, uncertainty, conflicts of interest)
12. Deeper Questions
The Five Ws
• What is the population of interest?
• What is the data-generating mechanism? (e.g.
assumptions, missingness or data filtering, selection)
• What are the parameters of interest that describe the
data-generating mechanism? (e.g. causal inference)
• Which of these parameters are identifiable and how can
they be estimated effectively?
• Where can I find the data and the software?
13. Would I or Should I Change the
Design?
• Need more power (sample size)?
• Need to change parameter of interest?
• Need to change target population?
• Need to change sampling procedure?
• Need to change analysis plan?
• Need to assess measurement errors?
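The link between power and sample size in the first bullet can be made concrete with the textbook normal-approximation formula for comparing two means, n per group = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. This is a minimal sketch under stated assumptions (two equal-sized groups, known common standard deviation `sigma`, two-sided test); the function name and default values are illustrative and not taken from the talk.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample z-test of means.

    delta: smallest difference in means considered meaningful (assumed).
    sigma: common standard deviation of the outcome (assumed known).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Halving the meaningful effect roughly quadruples the required sample size.
print(n_per_group(delta=0.5, sigma=1.0))
print(n_per_group(delta=0.25, sigma=1.0))
```

The dependence on `delta` is why the "meaningful effect" question from the earlier slide has to be settled before the design can be assessed.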
14. Introduction
• 20th century Statistics: Small Data Problems
– Small number of observations
– Small number of variables (outcomes, inputs and confounders)
• 21st century Statistics: Big Data Problems
– Very high number of observations
– Very high dimensional data sets
– Non-standard data
– Non-standard questions
• Theme: Formulating the scientific question and making
relevant inferences (statistics) are more crucial than ever
16. The Age of Big Data
New York Times, February 11, 2012
“It’s a revolution . . . The march of
quantification, made possible by
enormous new sources of data,
will sweep through academia,
business and government. There is
no area that is going to be untouched”
Gary King
17. Really Big Data: The Higgs Boson etc
• Large Hadron Collider produces 600M particle
collisions per second in its detectors
• Each collision yields 1MB of data
• Produces a petabyte (10^15 bytes) per second
• Standard DVD stores 5 gigabytes (5 × 10^9 bytes), so the
collider has been filling about 200K DVDs per second
over the three years in order to pin down the Higgs
Boson
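The arithmetic behind these bullets can be checked directly. The sketch below uses only the figures quoted on the slide, with decimal units (1 MB = 10^6 bytes) as an assumption; the variable names are illustrative. It shows that 600M collisions at 1 MB each is about 0.6 petabytes per second, which the slide rounds up to a petabyte before converting to DVDs.

```python
# Figures quoted on the slide (decimal units assumed).
collisions_per_second = 600_000_000
bytes_per_collision = 1_000_000        # 1 MB
petabyte = 10 ** 15
dvd_bytes = 5 * 10 ** 9                # 5 GB per DVD

bytes_per_second = collisions_per_second * bytes_per_collision

# DVDs per second from the unrounded rate, and from the rounded
# "one petabyte per second" figure used on the slide.
dvds_per_second_unrounded = bytes_per_second // dvd_bytes
dvds_per_second_rounded = petabyte // dvd_bytes

print(bytes_per_second / petabyte)     # fraction of a petabyte per second
print(dvds_per_second_unrounded)
print(dvds_per_second_rounded)
```

Either way the order of magnitude is the same: on the order of 10^5 DVDs per second.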
90% of data stored in the world today has been created in the past 2 years
21st century data is not just numbers, it is YouTube videos, tweets,
crowdsourcing information etc
18. Lots of Discussion
• McKinsey report – Big data: the next frontier for innovation,
competition, and productivity (05-2011)
• Google search on the term “Big Data” – over 2 billion results found
(as a comparison, “linear model” has about 25 million; “Beatles” has
about 53 million; “Super Bowl” has about 821 million)
• media fever reporting on Big Data: examples from the US media
only
– Big Data’s big problem: little talent (WSJ, 04-29-2012)
– Big Data is on the rise, bringing big questions (WSJ, 11-29-2012)
• just to give an idea of how popular this topic is right now ...
20. The End of Theory: The Data Deluge
Makes the Scientific Method Obsolete
By Chris Anderson 06.23.08 (Editor-in-Chief, Wired)
“All models are wrong, but some are useful.” (George Box, 1987)
“All models are wrong, and increasingly you can succeed without them.” (Peter Norvig, Google's research director, 2008)
“We can stop looking for models. We can analyze the data without hypotheses about what it might show.
We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms
find patterns where science cannot.”
“Correlation supersedes causation, and science can advance even without coherent models, unified theories, or
really any mechanistic explanation at all.”
22. What is Big Data?
• “Big Data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks
everyone else is doing it, so everyone claims they are
doing it...”
23. What is Big Data?
• Big Data refers to the idea:
– NSF Big Data Initiative (2012): “that scientists manage, analyze,
visualize, and extract useful information from large, diverse,
distributed and heterogeneous data sets so as to accelerate the
progress of scientific discovery and innovation.”
– WSJ (04-29-2012): “that an enterprise mine all the data it
collects right across its operations to unlock golden nuggets of
business intelligence.”
• Big Data:
– data can be (and very often are) “large, diverse, distributed and
heterogeneous”
– it is definitely helpful to keep “Big” in mind
– but do we always need such big and diverse data? –
subsampling?
– the key is, in many cases, to use data to answer scientific and
public health questions
24. The Big Data Problem
(Mike Jordan)
• Computer science studies the management of
resources, such as time and space and energy
• Data has not been viewed as a resource, but as a
“workload”
• The fundamental issue is that data now needs to be
viewed as a resource
– the data resource combines with other resources to yield timely,
cost-effective, high-quality decisions and inferences
• Just as with time or space, it should be the case (to first
order) that the more of the data resource the better
– is that true in our current state of knowledge?
25. The Answer is Usually No
• query complexity grows faster than number of
data points
• the more rows in a table, the more columns
• the more columns, the more hypotheses that can be
considered
• so, the more data the greater the chance that random
fluctuations look like signal (e.g., more false positives)
• the more data, the less likely a sophisticated
algorithm will run in an acceptable time frame
• and then we have to back off to cheaper algorithms that may
be more error-prone
• or we can subsample, but this requires knowing the statistical
value of each data point, which we generally don’t know
a priori
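The point that more columns mean more hypotheses, and therefore more chance for noise to look like signal, can be illustrated with a small simulation. This is a sketch under simple assumptions that are not from the talk: every column is pure standard-normal noise, each column is tested separately with a two-sided z-test at the 5% level, and the tests are independent. The function names are illustrative.

```python
import random


def count_false_positives(n_rows, n_cols, z_crit=1.96, seed=0):
    """Test each pure-noise column for a non-zero mean; count the rejections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_cols):
        column = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        z = (sum(column) / n_rows) * n_rows ** 0.5  # known variance of 1
        if abs(z) > z_crit:
            hits += 1  # every rejection here is a false positive
    return hits


def familywise_error(alpha, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_cols in (10, 100, 1000):
    print(n_cols, count_false_positives(50, n_cols), familywise_error(0.05, n_cols))
```

With no real signal anywhere, the number of "discoveries" still grows roughly in proportion to the number of columns examined.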
26. Even Simple Questions . . .
(Nicholas Chamandy and others at Google)
• Data streams
– Interact with one record at a time
– Records are not guaranteed to be sorted in a meaningful way
– One record is not necessarily one i.i.d. statistical observation
Suppose $X_i$ is the number of queries (per day, say) for a user
Easy to compute $\sum_{i=1}^{n} X_i$ but not $\sum_{i=1}^{n} X_i^2$
How about the median or mode of the distribution?
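The contrast on this slide between a sum and a sum of squares can be sketched as follows. The sketch assumes, as an illustration only, that each record in the stream is a single query tagged with a user id, arriving in arbitrary order. Under that assumption the total number of queries needs one counter, whereas the sum of squared per-user counts needs a counter for every user seen so far. The function names are hypothetical.

```python
from collections import Counter


def stream_sum(records):
    """Sum of per-user query counts: just count records, constant memory."""
    total = 0
    for _user in records:
        total += 1
    return total


def stream_sum_of_squares(records):
    """Sum of squared per-user counts: needs per-user state while streaming."""
    per_user = Counter()
    for user in records:
        per_user[user] += 1  # memory grows with the number of distinct users
    return sum(count * count for count in per_user.values())


records = ["a", "b", "a", "c", "a", "b"]  # hypothetical stream of user ids
print(stream_sum(records))
print(stream_sum_of_squares(records))
```

The second function is still simple on one machine, but its memory footprint scales with the number of users, which is what makes even this question awkward for a very large unsorted stream.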
27. Curriculum Evolution
cs (then): Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques, 1st edition [2006, 2nd edition]
stat (then): Hastie, T., Tibshirani, R., and Friedman, J. (2001). Elements of Statistical
Learning, 1st edition [2009, 2nd edition]
cs (now): Rajaraman, A., Leskovec, J., and Ullman, J.D. (2012+). Mining of Massive Datasets, manuscript
Then: cs                      | Then: stat              | Now: cs                  | Now: stat
data warehouse                | regression models       | distributed system       |
OLAP (online analytical proc) | lasso, ridge, PCA, PLS  | Map-Reduce, Hadoop       | ?
association rules             | splines, kernel smooth  | association, freq items  | ?
classification                | CART, MARS, GAM         | PageRank, link analysis  | ?
clustering                    | boosting                | boosting                 |
prediction model              | classification, SVM     | SVD, dim reduction       |
text mining                   | neural networks         | machine learning         |
multimedia mining             | clustering              | online advertisement     |
transactional db*             | p >> n                  | recommendation sys       |
social networks               | network models          | social networks          |
28. Curriculum Topics
• statistical skills to build appropriate models given
massive, complicated, and usually very messy
data
– data mining methods for large, high-dimensional
data
– some more recent data mining topics, e.g., networks
and graphical models, adaptive designs, dynamic
treatment regimes, personalized medicine
– also emphasize: general but useful statistical
principles and statistical thinking; e.g., model
interpretation, uncertainty characterization, causal
inference
29. Curriculum Topics
• engineering and computing skills to carry out all
operations
– Hadoop distributed file system and MapReduce
parallel computing system
– programming in a scripting language such as Perl or Python
– not just R or SAS, but not that hard either!
– modern computing algorithms
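For readers new to the MapReduce model named above, the three phases (map, shuffle, reduce) can be imitated in a few lines of plain Python. This is a single-process teaching sketch, not the Hadoop API: the function names are made up for illustration, and a real cluster would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(map_reduce(["big data", "Big questions"]))
```

Because each call to `map_phase` and `reduce_phase` touches only its own record or key, the same logic can be distributed without changing the per-record code.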
30. Curriculum Topics
• new forms of assignments/assessment:
– real world Big Data problems from kaggle.com
– only ask some high level questions; formulate a
specific question of your own – problem formulation is
crucial
– gain some understanding of important medical and
public health problems, and become familiar with
common terminology
– gain some experience processing big and messy
real world data
– push for a very concise presentation – the summary
cannot be more than 3 sentences; the final oral
presentation is no more than 5 minutes, both strictly
enforced