Denise Esserman MedicReS World Congress 2015
1. The Future of Frequentist
Hypothesis Testing
Denise Esserman, PhD
Yale Center for Analytical Sciences (YCAS)
Yale School of Public Health
October 19-25 | 2015 New York
www.medicres.org
2. Outline
• Definition of Big Data
• Expanding into Medicine
• Statistical Concerns
• The Future
3. “Big data is like teenage sex: everyone talks
about it, nobody really knows how to do it,
everyone thinks that everyone else is doing it,
so everyone claims they are doing it…”
Dan Ariely 2013
[Terry Speed talk, 2014]
4. What are/is “Big Data”?
• “Buzzword”
• Multiple Definitions
– The V’s: volume, velocity, variability, variety,
veracity, value (and complexity)
• Variability in the quantity and quality of the
data
5. Origin
• Believed to have originated with Web search
companies
– Querying very large distributed aggregations of
loosely-structured data [1]
• Does not always reference the volume of
data
6. Big Data Trends
• Google trends
7. “The hopeful vision of big data is that
organizations will be able to harvest and
harness every byte of relevant data and use it
to make the best decisions. Big data
technologies not only support the ability to
collect large amounts, but more importantly,
the ability to understand and take advantage
of its full value.”
Mark Troester [14]
8. Big Data in Medicine
• “Potential lies in innovative ways it can be
linked, related, and integrated to provide
more detailed and personalized
information than is possible with data from
a single source” [7]
• For health care providers to offer
personalized medicine
9. NIH – Big Data to Knowledge [6]
“The ability to harvest the wealth of information
contained in biomedical Big Data will advance our
understanding of human health and disease;
however, lack of appropriate tools, poor data
accessibility, and insufficient training, are major
impediments to rapid translational impact. To meet
this challenge, the National Institutes of Health
(NIH) launched the Big Data to Knowledge (BD2K)
initiative in 2012.”
10. BD2K Mission Statement
“BD2K is a trans-NIH initiative established
to enable biomedical research as a digital
research enterprise, to facilitate discovery
and support new knowledge, and to
maximize community engagement.”
11. BD2K Four Major Aims
• To facilitate broad use of biomedical digital assets by making them
discoverable, accessible, and citable.
• To conduct research and develop the methods, software, and tools
needed to analyze biomedical Big Data.
• To enhance training in the development and use of methods and
tools necessary for biomedical Big Data science.
• To support a data ecosystem that accelerates discovery as part of a
digital enterprise.
12. Biomedical Big Data [6]
• More than just very large data or a large number of data
sources.
– Complexity, challenges, and new opportunities
presented by the combined analysis of data.
• Diverse and complex.
– Imaging, phenotypic, molecular, exposure, health,
behavioral, and many other types of data.
13. Biomedical Big Data (cont)
• Faces many challenges.
– Unwieldy amount of information
– Lack of organization and access to data and tools
– Insufficient training in data science methods
• Spectacular opportunities.
– Maximize the potential of existing data and enable new
directions for research.
14. Quantity of data does not mean one can
ignore foundational issues of measurement
and construct validity and reliability and
dependencies among data [5]
15. Google Flu Trends [2,3]
• Machine Learning Algorithm
– Predict number of flu cases based on Google
Search Terms
– Theory Free
– Misunderstanding about uncertainties in data
collection and modeling process
• Inaccurate results over time
– Lack of statistics: did not know what linked
search terms to spread of flu
16. Google Flu Trends (cont)
• Correlation rather than causation
– Cheaper and easier
• Theory-free analysis is fragile
• Intended as a “complementary signal”
rather than a stand-alone forecasting tool [4]
17. “There are a lot of small data problems that
occur in big data. They don’t disappear
because you’ve got lots of the stuff. They get
worse.”
David Spiegelhalter
“Big data is like a big trash dump. You have to
know how to find the nuggets so it’s
profitable.”
Vin Gupta
18. Indexing vs. Analyzing Big Data
• Search companies index it
– Make relevant data easy to use
• Statisticians analyze it
– Find structure within the data
19. Data Scientists [7]
• People who draw insights from large quantities of
data
– Innovative problem solvers
– Expertise in statistical modeling and machine learning
– Specialized programming skills
– Solid grasp of problem domain
• Data science is a blend of statistical, mathematical,
and computational sciences
20. Statistical Disconnect
• Statisticians should be leaders of the Big Data
and data science movement [8]
– Scope goes beyond traditional activities
• Statisticians need to be more engaged
– Need to develop the skills to handle the sheer
volume of data
• Data scientists need to engage more
statisticians (or more statisticians need to
become data scientists)
21. “Better data matters because simply having
Big Data does not guarantee reliable answers
for Big Questions.”
Robert Rodriguez
22. Some of the Statistical Concerns
• Sampling populations
– Sampling error – not representative
• Confounders
• Multiple Testing
• Bias
– Sampling bias – not randomly chosen
• Overfitting
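The sampling-bias concern above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population, the selection rule (only values above 45 are ever recorded), and the sample sizes are hypothetical and do not come from the talk.

```python
import random

rng = random.Random(7)

# Hypothetical population of 100,000 measurements (assumed N(50, 10)).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# "Big" but biased sample: units at or below 45 never enter the data set.
biased = [v for v in population if v > 45.0]

# Small simple random sample of 500 units.
srs = rng.sample(population, 500)

biased_error = abs(sum(biased) / len(biased) - true_mean)
srs_error = abs(sum(srs) / len(srs) - true_mean)

print(f"biased sample: n={len(biased)}, error={biased_error:.2f}")
print(f"random sample: n={len(srs)}, error={srs_error:.2f}")
```

In this setup the biased sample is far larger than the random sample, yet its estimate of the mean is further from the truth, because the selection error does not shrink as n grows.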
23. Big n problem
• Myth: with large n, the problem is only
computational in nature, not statistical
• Standard errors can be large even with Big
Data
– Issue with large p
24. Big n problem (cont)
• Scale of data requires spreading it across a
cluster or grid of computers
• Computational work must be distributed along with
the data
– Google MapReduce Model for parallel
programming (Apache Hadoop)
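A toy version of the map/reduce pattern, run in a single process, may help make the idea concrete. This is a sketch only: it does not use Hadoop or any real cluster API, and the function names and the three example partitions are hypothetical.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarise one partition as (sum, count)."""
    return (sum(values), len(values))


def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(partitions):
    # On a real cluster each map call would run on the node holding
    # that partition; here they simply run one after another.
    partials = [map_partition(p) for p in partitions]
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = distributed_mean(partitions)
print(result)
```

Only the small (sum, count) summaries need to travel between nodes, which is why the computation is moved to the data rather than the other way round.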
25. Software Alchemy (SA)
• Simple, powerful method to reduce computation
• Partition the data into r groups and calculate the
average of the estimator across groups
– Requires partitions to be distributed similarly
• May need initial “shuffle” step
• Works well for any asymptotically normal
estimator
– Also works for p growing
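The partition-and-average idea above can be sketched as follows. The simulated data, the choice of a least-squares slope as the estimator, and r = 8 are assumptions made for illustration; chunks are processed sequentially here, whereas in practice each would go to a separate worker.

```python
import random


def slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


def chunk_average(data, r, estimator, seed=0):
    """Shuffle, split into r near-equal chunks, average the chunk estimates."""
    rows = list(data)
    random.Random(seed).shuffle(rows)        # the initial "shuffle" step
    chunks = [rows[i::r] for i in range(r)]  # r similarly distributed chunks
    return sum(estimator(c) for c in chunks) / r


# Simulated data with a true slope of 2 (an assumption for this example).
rng = random.Random(42)
data = []
for _ in range(20_000):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

full = slope(data)
averaged = chunk_average(data, r=8, estimator=slope)
print(f"full-data slope: {full:.4f}, chunk-averaged slope: {averaged:.4f}")
```

The chunk-averaged estimate should land very close to the full-data estimate, while each chunk needs only a fraction of the memory and can be computed in parallel.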
26. Big p problem
• High-dimensional data
• Dimension reduction
– e.g. Principal components analysis, variable
selection
• Issues:
– Multiple comparisons and simultaneous
inference
– Sparsity of data
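To make the multiple-comparisons issue concrete, here is a sketch of two standard corrections, Bonferroni and Benjamini-Hochberg. The slides do not name a specific correction, so the choice of these two is an assumption, and the p-values are invented.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls family-wise error)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values from six tests.
pvals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

With these inputs, four of the six tests fall below an uncorrected 0.05 threshold; Bonferroni keeps two of them and Benjamini-Hochberg keeps three.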
27. In the future…
• Can we find methods that allow larger
values of p than “safe” o(√n)?
– Dimension reduction may distort results
• Can we more easily verify technical
assumptions?
– e.g. potential lack of consistency of the LASSO
• Can we find more general methods?
28. Theoretical Null Distributions
• Null distribution is most often not estimated, but
hypothesized in classic hypothesis testing
• Incorrect null might lead to false inference
• Influences
– Correlation
– Incorrect assumptions
– Unobserved covariates
29. Empirical Null Distribution
• Empirical null estimated from study’s data
• Does not assume “nice” normal with variance
going to 0
• Needs the independence assumption, but does not need the
identically distributed assumption
– Can have heterogeneous groups
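One crude way to estimate a null distribution from the data is to take the median and the scaled median absolute deviation of the test statistics. This is a simple robust stand-in chosen for illustration, not necessarily the estimator the speaker had in mind, and the simulated z-values (a null that is N(0.3, 1.5) rather than N(0, 1), plus a few real effects) are hypothetical.

```python
import random
import statistics


def empirical_null(z):
    """Robust estimates of the null centre and spread (median, scaled MAD)."""
    centre = statistics.median(z)
    mad = statistics.median([abs(v - centre) for v in z])
    return centre, 1.4826 * mad  # 1.4826 makes the MAD consistent for a normal


rng = random.Random(1)
z = [rng.gauss(0.3, 1.5) for _ in range(9500)]  # null cases, but not N(0, 1)
z += [rng.gauss(5.0, 1.0) for _ in range(500)]  # a small share of real effects

centre, scale = empirical_null(z)

# Number of cases flagged at |z| > 1.96 under each choice of null.
flag_theoretical = sum(abs(v) > 1.96 for v in z)
flag_empirical = sum(abs((v - centre) / scale) > 1.96 for v in z)

print(f"estimated null: centre={centre:.2f}, scale={scale:.2f}")
print(f"flagged with N(0, 1) null: {flag_theoretical}")
print(f"flagged with empirical null: {flag_empirical}")
```

When the true null is wider than N(0, 1), as in this simulation, testing against the theoretical null flags many null cases, while the empirical null flags far fewer.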
30. Big-data Clinical Trials (BCT) [10]
• Neglected problem in RCT – analysis
typically based on different effectiveness of
different interventions provided at
baseline
• Want to be able to analyze association of
baseline treatment and its subsequent
dynamic processes
31. Example: Blood Pressure
• RCT:
– Effectiveness of Antihypertensive on BP control
– BP measured at specified outcomes
– Long term outcomes (e.g. stroke)
• BCT:
– Maintenance of stable BP every day, hour, minute
– “Dosing” equipment
32. Epidemiologic Perspective
• Big data is useful to detect rare drug-related
side effects, not likely to be observed in RCT
• Chance to look at rare diseases
• Can contribute to an understanding of the
strengths and limitations of new population
sources [12]
33. Future of BCT [10]
• How will this be defined?
• What is the “right” data to collect?
• Who is in a position to design this trial?
• How do we handle threats of big data?
• How do we incorporate the non-static
populations?
34. Moving Beyond Rectangular Data
• Structure of the data may be irregular, non-structured
– Knowledge will change over time [11]
• Varying types of data
– Pictures, videos, images
– Unstructured and semi-structured data from social
media
35. Machine Learning
“…a subfield of computer science that evolved from
the study of pattern recognition and computational
learning theory in artificial intelligence. Machine
learning explores the study and construction of
algorithms that can learn from and make predictions
on data. Such algorithms operate by building a
model from example inputs in order to make data-
driven predictions or decisions, rather than
following strictly static program instructions.” [9]
36. Machine Learning Tasks
• Supervised learning
– Example inputs and output – learn a rule that maps inputs to
outputs
• Semi-supervised learning
– Incomplete training signal (i.e. target outputs missing)
• Unsupervised learning
– Algorithm is left on its own to find structure in the input
• Reinforcement learning
– Interaction with dynamic environment
37. Example: Support Vector Machines
• Supervised learning method
• Used for classification and regression
analysis
• Can perform non-linear classification
38. Challenges with Machine Learning
• Need to choose the appropriate algorithm
• Need to be able to define hyper-parameters
– One or more model parameters
• Data needs to be in appropriate format
– This is not trivial
• Execution time grows with number of attributes
and data instances
39. Future of Machine Learning
• Automatic searches for optimal algorithm
and hyper-parameters
– Very time consuming, limited usefulness at
present
• More user friendly approaches
– e.g. allow healthcare researchers to efficiently
and independently build predictive models [13]
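A minimal sketch of an automatic hyper-parameter search: a grid search over k for a one-dimensional k-nearest-neighbour classifier, scored on a held-out validation set. The classifier, the simulated two-class data, and the grid of k values are all assumptions chosen to keep the example small; this is not the MLBCD tool.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Hypothetical data: class 0 centred at 0, class 1 centred at 2.
rng = random.Random(3)
data = []
for _ in range(400):
    label = rng.randint(0, 1)
    data.append((rng.gauss(2.0 * label, 1.0), label))
train, valid = data[:300], data[300:]

# Try each candidate k (odd values avoid tied votes) and keep the best.
grid = [1, 5, 15, 45]
scores = {k: accuracy(train, valid, k) for k in grid}
best_k = max(grid, key=lambda k: scores[k])
print(scores, "best k:", best_k)
```

Every extra hyper-parameter multiplies the size of the grid and every candidate requires a full fit, which is one reason such searches become very time consuming on large data sets.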
40. Example: Machine Learning for Big
Clinical Data (MLBCD) [13]
• Supports whole process of iterative machine learning on
big clinical data
– Clinical parameter extraction
– Feature construction
– Algorithm and hyper-parameter selection
– Model building
– Model evaluation
• Can be used once the researcher has (1) defined the study population and
research question, (2) obtained the clinical data set, and (3)
prepped the data, including cleaning and filling in missing
data
41. Data Set Preparation
• Tremendous amount of work goes into
getting a data set together
– Acquire, normalize, clean
• e.g. Pivoting EMR data from entity-attribute-value format to
relational table format [13]
Entity-attribute-value format:
ID (Entity)   Test # (Attribute)   Pulse (Value)
100100        Test 1               98
989021        Test 2               101
100100        Test 2               75
989021        Test 3               99
989021        Test 4               88

Relational table format:
ID       Test_1   Test_2   Test_3
100100   98       75       null
989021   null     101      99
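The pivot shown in the tables above can be sketched in a few lines. The function name pivot_eav is hypothetical, and the column list is limited to the three tests shown in the relational table, so the "Test 4" row is dropped here as it is on the slide.

```python
def pivot_eav(rows, attributes):
    """Turn (entity, attribute, value) rows into one record per entity."""
    table = {}
    for entity, attribute, value in rows:
        # Start each new entity with every requested column set to None (null).
        record = table.setdefault(entity, {a: None for a in attributes})
        if attribute in record:  # ignore attributes that were not requested
            record[attribute] = value
    return table


eav = [
    (100100, "Test 1", 98),
    (989021, "Test 2", 101),
    (100100, "Test 2", 75),
    (989021, "Test 3", 99),
    (989021, "Test 4", 88),
]

wide = pivot_eav(eav, ["Test 1", "Test 2", "Test 3"])
for entity, record in wide.items():
    print(entity, record)
```

Real EMR extracts add complications this sketch ignores, such as repeated measurements of the same attribute for one entity, which would need an explicit rule (latest, first, mean) before pivoting.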
42. Data Set Preparation (cont)
• Even bigger challenge with large data sets
– Different formats
– Different locations
– Data quality and governance
– Security, privacy and regulatory challenges
• Surprisingly little work done here – and it
should be a priority!
43. Example: Fitbit
• Study of Cardiac Risk
• Use Fitbit to measure daily step counts for 3
years (4000 participants)
• Participants need to upload to
laptop/phone and then sync to Fitbit server
• Device holds one month of data
• Need to then connect with other study data
44. Where should the field head?
• Need new techniques for data management
• New tools for data analysis
• New tools for data visualization
• Ways to acquire and analyze unstructured
text data
45. “Big data is not about the technologies to
store massive amounts of data, it is about
creating a flexible infrastructure with high-
performance computing, high-performance
analytics and governance – in a deployment
model that makes sense for the
organization.” [14]
Mark Troester
46. References
1. http://www.webopedia.com/TERM/B/big_data.html (accessed October 15, 2015)
2. http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/ (accessed October 15, 2015)
3. http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html (accessed October 15, 2015)
4. http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_r=0 (accessed October 16, 2015)
5. Lazer D, Kennedy R, King G, Vespignani A. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343: 1203-05, (2014).
6. https://datascience.nih.gov/bd2k/about/what (accessed October 15, 2015)
7. http://magazine.amstat.org/blog/2012/06/01/prescorner/ (accessed October 17, 2015)
8. http://magazine.amstat.org/blog/2013/06/01/the-asa-and-big-data/ (accessed October 18, 2015)
9. https://en.wikipedia.org/wiki/Machine_learning (accessed October 18, 2015)
10. Wang SD. Opportunities and challenges of clinical research in the big-data era: from RCT to BCT. J Thorac Dis, 5(6): 721-723, (2013)
11. Wang SD, Shen Y. Redefining big-data clinical trial (BCT). Annals of Translational Medicine. 2(10): 96, (2014)
12. Gange SJ, Golub ET. From smallpox to big data: The Next 100 years of epidemiologic methods. American Journal of
Epidemiology. DOI: 10.1093/aje/kwv150
13. Luo G. MLBCD: a machine learning tool for big clinical data. Health Inf Sci Syst. 3:3, (2015)
14. Big Data Meets Big Data Analytics. SAS White Paper