Visualizing High-Dimensional Data with Manifold Learning in R
BY COLLEEN M. FARRELLY, DATA SCIENTIST AT GRAHAM HOLDINGS (KAPLAN HIGHER AND PROFESSIONAL EDUCATION)
My Path to Data Science
Former MD/PhD student who started doing research and attending workshops in geometry, topology, and machine learning
Switched degree programs into biostatistics with a topology-based slant
Have worked in biotechnology, military, education, and the social sciences
Currently on the business side of running a university, with a lot of financial modeling and risk modeling
Mining for Data Relationships
Exploratory analysis
• Important step in data science projects
• Trend/covariance visualization
• Clustering
• Powerful combination for understanding many types of problems
Types of data problems
• Time series analyses
• Predictive analyses
• Network analyses
[Figure: Intelligence and Achievement Dendrogram. Complete-linkage hierarchical clustering, hclust (*, "complete"), on dist(mydata[, 2:4]); vertical axis: Height.]
Unique subgroup identified
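The dendrogram's labels name the base R calls behind it (dist() and hclust() with complete linkage). A minimal sketch of that workflow follows. The data frame is a synthetic stand-in, since the slide's mydata is not shown; the column names and the choice of three groups in cutree() are assumptions made only for illustration.

```r
# Hypothetical stand-in for the slide's `mydata`: an ID column plus three
# numeric score columns (columns 2:4). Names and values are invented.
set.seed(1)
mydata <- data.frame(
  id      = 1:17,
  iq      = rnorm(17, mean = 100, sd = 15),
  reading = rnorm(17, mean = 50, sd = 10),
  math    = rnorm(17, mean = 50, sd = 10)
)

d  <- dist(mydata[, 2:4])             # Euclidean distances between rows
hc <- hclust(d, method = "complete")  # complete-linkage clustering
plot(hc, main = "Intelligence and Achievement Dendrogram")

# Cut the tree into subgroups; k = 3 is an arbitrary illustrative choice.
groups <- cutree(hc, k = 3)
table(groups)
```

Small or isolated branches in the resulting tree are the kind of subgroup the slide highlights.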
Time Series and Financial Data
Key tasks in time series/financial data analyses:
• Forecasting future time points
• Identifying drivers of the dynamic process (e.g., why are sales rising?)
• Identifying tipping points (crashes, spikes…)
• Identifying covarying behavior (sectors that behave similarly, stocks that influence each other, daily rising/falling patterns…)
Dow Jones Industrial Average
Morse-Smale Clustering
Multivariate technique from topology similar to mode clustering
• Find peaks and valleys in data by filtering on a defined function:
  • A watershed on mountains
  • Dribbling a soccer ball across a field of hills
• Separate data based on shared peaks and valleys
• Many nice developments on convergence and theoretical properties
R package has nice dimensionality reduction plots to highlight cluster differences with respect to the filter function and predictor sets
Dimensionality Reduction and Visualization
Helpful in visualizing multivariate trends and group differences, particularly for multivariate time series data
Assume data lies in a lower-dimensional subspace and map full dataset to that subspace (right)
Types of methods:
• Linear (principal component analysis, or PCA)
• Nonlinear (manifold learning)
• Local (preserving neighborhood metrics like distance between points)
• Global (preserving global characteristics like connectedness and limits)
Manifold learning methods related to a branch of mathematics called differential geometry
Manifold Learning Methods
Three main methods considered in this analysis:
• Multidimensional scaling (MDS)
  • Global method based on distance preservation and matrix decomposition
  • Distances can be Euclidean, geodesic, Manhattan...
  • Nice theoretical result relating it to PCA when best subspace is linear
• Locally linear embedding (LLE)
  • Local method based on nearest neighbor graph, weighting, and matrix decomposition
  • Related to ISOMAP and other methods
• t-distributed stochastic neighbor embedding (t-SNE)
  • Local and global method based on mapping of probability distributions and random walks
  • Preserves both local and global characteristics of the original data space
  • Very strong performance on a variety of problems lately
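The link between MDS and PCA mentioned above can be checked numerically: classical MDS on Euclidean distances reproduces the PCA scores, up to the sign of each component. The sketch below uses base R's cmdscale() and princomp() on synthetic data; the matrix X and its dimensions are assumptions chosen only for illustration.

```r
# Synthetic data: 200 observations of 5 variables (illustrative only).
set.seed(42)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

pca_scores <- princomp(X)$scores[, 1:2]  # first two principal component scores
mds_points <- cmdscale(dist(X), k = 2)   # classical MDS, Euclidean distances

# Components are only defined up to sign, so compare absolute values.
max_diff <- max(abs(abs(pca_scores) - abs(mds_points)))
max_diff
```

A max_diff near machine precision indicates that the two embeddings coincide. With a non-Euclidean distance (Manhattan, geodesic) the two would generally differ.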
Breast Cancer Dataset Comparison
Example Stock Market Dataset
Emerging markets
• Important for investors
• Future drivers of global trade
• Global trends
• Daily fluctuations
• Tipping points (crashes and opportunities)
This example:
• Recent Kaggle dataset of daily National Stock Exchange of India prices from July 2003 to February 2018:
  • https://www.kaggle.com/abhishekyana/nse-listed-1384-companies-data/data
• Cleaned (nulls removed, <1%) and daily fluctuation ranges added (7 total time series columns)
• 3616 days included
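The cleaning step described above (drop rows with nulls, then add range columns to reach 7 series) might look roughly like the sketch below. The data are synthetic, not the Kaggle file. The slides do not define the two range columns, so the formulas used here for fluctuation.range and day.range are assumptions for illustration only.

```r
# Synthetic stand-in for the raw daily price table (not the Kaggle data).
set.seed(3)
n <- 500
base <- 1000 + cumsum(rnorm(n))
raw <- data.frame(
  open      = base,
  high      = base + abs(rnorm(n, sd = 5)),
  low       = base - abs(rnorm(n, sd = 5)),
  close     = base + rnorm(n, sd = 2),
  adj_close = base + rnorm(n, sd = 2)
)
raw$close[c(10, 200)] <- NA  # inject a few missing values (<1% of rows)

frac_missing <- mean(!complete.cases(raw))  # share of rows with any NA
clean <- na.omit(raw)                       # drop incomplete rows

# ASSUMED definitions; the slides do not say how these were computed.
clean$fluctuation.range <- clean$high - clean$low
clean$day.range         <- abs(clean$close - clean$open)

ncol(clean)  # five price columns plus two range columns
```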
Clustering Results
R package (msr)
• 10 nearest neighbors
• Persistence level=1
• 5 level splits
• Plot of group trajectories (far left)
4 distinct groups
• 2 represent stable trends (red, blue)
• 2 represent transition points in market behavior (green, aqua)
PCA Plot
R function princomp() with 2 components
Fits quite well and shows spread within each cluster
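One way to judge how well a two-component PCA fits is the share of variance the first two components capture. The sketch below applies princomp() to synthetic price-like series; the data frame and its column names are hypothetical and only stand in for the NSE columns.

```r
# Synthetic price-like series sharing one underlying level (illustrative).
set.seed(7)
n <- 300
price <- 100 + cumsum(rnorm(n, mean = 0.5))
series <- data.frame(
  open  = price + rnorm(n, sd = 0.5),
  high  = price + abs(rnorm(n, sd = 1)),
  low   = price - abs(rnorm(n, sd = 1)),
  close = price + rnorm(n, sd = 0.5)
)

fit <- princomp(series)

# Proportion of total variance captured by each component.
var_explained <- fit$sdev^2 / sum(fit$sdev^2)
two_comp_share <- sum(var_explained[1:2])
two_comp_share

# Scores on the first two components, ready for a scatterplot.
scores <- fit$scores[, 1:2]
plot(scores, xlab = "Component 1", ylab = "Component 2", main = "PCA scores")
```

Coloring the points by cluster label in plot() would show the within-cluster spread the slide refers to.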
MDS Plot
R function cmdscale() with 2 components and a Euclidean distance metric
Relationships very linear and well-separated globally
• Matches PCA well
• Separates into:
  1. Daily price
  2. Daily fluctuation
[Figure: MDS Results. Axes: Dimension 1, Dimension 2.]
LLE Plot
R function lle() with 2 components and 10 nearest neighbors (lle package)
Separation and fit not great
Suggests global behavior more important than local for this time series
[Figure: LLE Results. Axes: Dimension 1, Dimension 2.]
t-SNE Plot
R package dimRed with function getDimRedData(), perplexity (smoothing) at 80, 2 components, and tsne method
Parses out tipping points within growth period and exact moments of transitional events (see green group)
[Figure: tSNE Results. Axes: Dimension 1, Dimension 2.]
Deep Dive into MDS Components
MDS components separate into prices (component 1) and fluctuation ranges (component 2), summarized in correlation table
Fluctuation ranges increasing as the market gains points (left)
Original Time Series    MDS Component 1    MDS Component 2
open                    1.00E+00            3.25E-03
high                    1.00E+00           -6.71E-03
low                     1.00E+00            9.00E-03
fluctuation.range       6.84E-01           -7.06E-01
close                   1.00E+00           -2.56E-03
day.range               5.14E-01           -7.47E-01
adj_close               1.00E+00           -2.41E-03
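A correlation table of this kind can be produced by correlating each original series with each MDS coordinate. The sketch below does so with cmdscale() and cor() on synthetic data; the series and their names are hypothetical, so the numbers will not match the table above.

```r
# Synthetic series: a shared price level plus a daily spread (illustrative).
set.seed(11)
n <- 300
price  <- 100 + cumsum(rnorm(n, mean = 0.5))
spread <- abs(rnorm(n, sd = 2))
series <- data.frame(
  open      = price,
  high      = price + spread,
  low       = price - spread,
  close     = price + rnorm(n, sd = 0.3),
  day.range = 2 * spread
)

# Two-dimensional classical MDS on Euclidean distances between days.
mds <- cmdscale(dist(series), k = 2)
colnames(mds) <- c("MDS1", "MDS2")

# Rows: original series; columns: MDS components.
cor_table <- cor(series, mds)
signif(cor_table, 3)
```

Series that load almost entirely on one component, as the price columns do in the table above, show a correlation near 1 in magnitude on that component and near 0 on the other.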
Transition Periods Deep Dive
Transition periods overlap with long-term trends
Shorter time-to-transition periods in recent years
Results Overview
NSE shows exponential growth in a time period of changes
• New regulations
• Oil price drops
• Fall of inflation
Tipping points of growth
• Includes current period, starting late 2017/early 2018
• In late 2017, the analysis actually predicted the NSE tumble of February 2018
• Crash predicted by several economists for sometime in 2018:
  • https://www.getmoneyrich.com/indian-stock-market-correction-likely-in-2017-2018/
  • https://www.livemint.com/Money/pXdnLHA2r1FJfwJhFEDqjO/Stock-market-crash-Experts-divided-on-whether-theres-more.html
Fluctuations and volatility
• Increasing in past few years
• Can vary a lot during the day while starting and closing with similar values
Conclusions
Clustering and dimensionality reduction for multivariate data exploration
• Helpful for understanding multivariate time series data
• Helpful for understanding other types of data prior to analysis
Performs very well, showing behavior deviations before major events
Can provide an understanding of covariance structure (relationships between stocks, volatility within a market…)
References
Farrelly, C. M. (2017). Dimensionality Reduction Ensembles. arXiv preprint arXiv:1710.04484.
Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse–Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1-27.
Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
ResearchGate profile with folder for talk (data, R code, PPT): https://www.researchgate.net/profile/Colleen_Farrelly2

 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 

High-Dimensional Data Visualization, Geometry, and Stock Market Crashes

  • 1. Visualizing High-Dimensional Data with Manifold Learning in R. By Colleen M. Farrelly, Data Scientist at Graham Holdings (Kaplan Higher and Professional Education).
  • 2. My Path to Data Science. Former MD/PhD student who started doing research and attending workshops in geometry, topology, and machine learning. Switched degree programs into biostatistics with a topology-based slant. Worked in biotechnology, military, education, and the social sciences. Currently on the business side of running a university, with a lot of financial modeling and risk modeling.
  • 3. Mining for Data Relationships. Exploratory analysis is an important step in data science projects: trend/covariance visualization and clustering form a powerful combination for understanding many types of problems. Types of data problems: time series analyses, predictive analyses, and network analyses. [Figure: Intelligence and Achievement Dendrogram, produced by hclust (*, "complete") on dist(mydata[, 2:4]); vertical axis: Height; annotation: unique subgroup identified.]
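The dendrogram labels on this slide point to complete-linkage hierarchical clustering on columns 2 to 4 of a data frame called `mydata`. The talk's data is not reproduced here, so the following is only a minimal sketch on a made-up 17-row table; the column names, the seed, and the choice of cutting the tree into 3 groups are all assumptions for illustration.

```r
# Hypothetical stand-in for the slide's `mydata`: an ID column plus three scores.
set.seed(42)
mydata <- data.frame(
  id          = 1:17,
  iq          = rnorm(17, mean = 100, sd = 15),
  achievement = rnorm(17, mean = 50,  sd = 10),
  motivation  = rnorm(17, mean = 30,  sd = 5)
)

d   <- dist(mydata[, 2:4])             # Euclidean distances on columns 2 to 4
fit <- hclust(d, method = "complete")  # complete-linkage agglomerative clustering

# Draw the dendrogram only in an interactive session.
if (interactive()) {
  plot(fit, main = "Intelligence and Achievement Dendrogram")
}

# Cut the tree into k groups; k = 3 is an arbitrary illustrative choice.
groups <- cutree(fit, k = 3)
table(groups)
```

A small, late-merging branch in the plot is what the slide calls a "unique subgroup"; `cutree()` turns that visual impression into explicit group labels.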
  • 4. Time Series and Financial Data. Key tasks in time series/financial data analyses: forecasting future time points; identifying drivers of the dynamic process (e.g., why are sales rising?); identifying tipping points (crashes, spikes...); identifying covarying behavior (sectors that behave similarly, stocks that influence each other, daily rising/falling patterns...). [Figure: Dow Jones Industrial Average.]
  • 5. Morse-Smale Clustering. A multivariate technique from topology, similar to mode clustering. It finds peaks and valleys in data by filtering on a defined function (think of a watershed on mountains, or dribbling a soccer ball across a field of hills) and separates data based on shared peaks and valleys. There are many nice developments on convergence and theoretical properties. The R package has nice dimensionality reduction plots to highlight cluster differences with respect to the filter function and predictor sets.
  • 6. Dimensionality Reduction and Visualization. Helpful in visualizing multivariate trends and group differences, particularly for multivariate time series data. Assume the data lies in a lower-dimensional subspace and map the full dataset to that subspace (figure at right). Types of methods: linear (principal component analysis, or PCA); nonlinear (manifold learning); local (preserving neighborhood metrics like distance between points); global (preserving global characteristics like connectedness and limits). Manifold learning methods are related to a branch of mathematics called differential geometry.
  • 7. Manifold Learning Methods. Three main methods considered in this analysis. (1) Multidimensional scaling (MDS): a global method based on distance preservation and matrix decomposition; distances can be Euclidean, geodesic, Manhattan...; a nice theoretical result relates it to PCA when the best subspace is linear. (2) Locally linear embedding (LLE): a local method based on a nearest neighbor graph, weighting, and matrix decomposition; related to ISOMAP and other methods. (3) t-distributed stochastic neighbor embedding (t-SNE): a local and global method based on mapping of probability distributions and random walks; preserves both local and global characteristics of the original data space; very strong performance on a variety of problems lately. [Figure: Breast Cancer Dataset Comparison.]
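The link between MDS and PCA mentioned above can be checked numerically: classical MDS on Euclidean distances recovers the same coordinates as the principal component scores, up to a sign flip per axis. The sketch below uses random data as a placeholder (the matrix `X`, its size, and the seed are assumptions) and base R's `princomp()` and `cmdscale()`.

```r
# Hypothetical data: 200 observations of 5 variables.
set.seed(1)
X <- matrix(rnorm(200 * 5), ncol = 5)

# PCA scores on the first two components (covariance-based, unscaled).
pca_scores <- princomp(X)$scores[, 1:2]

# Classical (metric) MDS on Euclidean distances, two dimensions.
mds_coords <- cmdscale(dist(X), k = 2)

# Each axis is only defined up to sign, so compare absolute values.
max_diff <- max(abs(abs(pca_scores) - abs(mds_coords)))
max_diff  # should be numerically negligible
```

With a non-Euclidean distance (for example `dist(X, method = "manhattan")`) the two embeddings generally no longer coincide, which is where MDS starts to offer something PCA does not.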
  • 8. Example Stock Market Dataset. Emerging markets are important for investors and are future drivers of global trade. Features of interest: global trends, daily fluctuations, and tipping points (crashes and opportunities). This example uses a recent Kaggle dataset of daily National Stock Exchange of India prices from July 2003 to February 2018: https://www.kaggle.com/abhishekyana/nse-listed-1384-companies-data/data. The data were cleaned (nulls removed, <1%) and daily fluctuation ranges were added (7 total time series columns); 3616 days are included.
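The cleaning step described above (drop rows with nulls, then add range columns) might look roughly like the following. This is a sketch on a five-row toy table, not the Kaggle data; the column names are borrowed from the correlation table later in the talk, and the definitions of `fluctuation.range` (high minus low) and `day.range` (absolute open-to-close move) are assumptions, since the slides do not spell them out.

```r
# Hypothetical toy price table; values are invented for illustration.
prices <- data.frame(
  open      = c(100, 102,  NA, 105, 107),
  high      = c(103, 104, 106, 108, 109),
  low       = c( 99, 101, 102, 104, 105),
  close     = c(102, 103, 105, 107, 106),
  adj_close = c(101, 102, 104, 106, 105)
)

# Drop any row containing a missing value and report how much was lost.
clean <- na.omit(prices)
pct_removed <- 100 * (nrow(prices) - nrow(clean)) / nrow(prices)

# Assumed definitions of the two added range columns (not confirmed by the slides).
clean$fluctuation.range <- clean$high - clean$low          # intraday high-low spread
clean$day.range         <- abs(clean$close - clean$open)   # open-to-close move

str(clean)  # 7 numeric columns, matching the "7 total time series columns" count
```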
  • 9. Clustering Results. R package (msr) with 10 nearest neighbors, persistence level = 1, and 5 level splits; plot of group trajectories (far left). 4 distinct groups: 2 represent stable trends (red, blue) and 2 represent transition points in market behavior (green, aqua).
  • 10. PCA Plot. R function princomp() with 2 components. Fits quite well and shows spread within each cluster.
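A two-component `princomp()` fit of the kind named on this slide can be sketched with R's built-in `EuStockMarkets` series, used here purely as a stand-in for the NSE data. The slide does not say whether the variables were rescaled, so this example keeps the covariance-based default; that choice is an assumption.

```r
# Built-in daily closing prices of four European indices (stand-in data).
X <- as.data.frame(EuStockMarkets)

# Covariance-based PCA; set cor = TRUE to standardize the variables first.
pca <- princomp(X)

# Keep the first two component scores for plotting.
scores <- pca$scores[, 1:2]

# Share of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

if (interactive()) {
  plot(scores, xlab = "Component 1", ylab = "Component 2", main = "PCA Results")
}
```

Coloring the points by cluster label (for example with `col = groups` in `plot()`) is one way to show the spread within each cluster.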
  • 11. MDS Plot. R function cmdscale() with 2 components and a Euclidean distance metric. Relationships are very linear and well-separated globally. Matches PCA well. Separates into: (1) daily price, (2) daily fluctuation. [Figure: MDS Results; axes: Dimension 1, Dimension 2.]
  • 12. LLE Plot. R function lle() with 2 components and 10 nearest neighbors (lle package). Separation and fit are not great, which suggests global behavior is more important than local behavior for this time series. [Figure: LLE Results; axes: Dimension 1, Dimension 2.]
  • 13. t-SNE Plot. R package dimRed with function getDimRedData(), perplexity (smoothing) at 80, 2 components, and the tsne method. Parses out tipping points within the growth period and the exact moments of transitional events (see green group). [Figure: tSNE Results; axes: Dimension 1, Dimension 2.]
  • 14. Deep Dive into MDS Components. MDS components separate into prices (component 1) and fluctuation ranges (component 2), summarized in the correlation table below. Fluctuation ranges increase as the market gains points (figure at left).

      Original Time Series   MDS Component 1   MDS Component 2
      open                   1.00E+00           3.25E-03
      high                   1.00E+00          -6.71E-03
      low                    1.00E+00           9.00E-03
      fluctuation.range      6.84E-01          -7.06E-01
      close                  1.00E+00          -2.56E-03
      day.range              5.14E-01          -7.47E-01
      adj_close              1.00E+00          -2.41E-03
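A table of this shape can be produced by correlating each original column with the two MDS coordinates. The sketch below simulates a small trending price series (the simulation, the seed, and the definitions of the two range columns are assumptions) and will not reproduce the talk's numbers; it only shows the mechanics with `cmdscale()` and `cor()`.

```r
# Simulated stand-in for the seven NSE columns (hypothetical values).
set.seed(7)
n     <- 300
base  <- 100 + 0.5 * seq_len(n) + cumsum(rnorm(n))  # trending price level
open  <- base
close <- base + rnorm(n, sd = 0.5)
high  <- pmax(open, close) + abs(rnorm(n))
low   <- pmin(open, close) - abs(rnorm(n))

X <- data.frame(
  open, high, low,
  fluctuation.range = high - low,          # assumed definition
  close,
  day.range         = abs(close - open),   # assumed definition
  adj_close         = close                # no adjustments in this toy series
)

# Classical MDS with Euclidean distances, two components.
mds <- cmdscale(dist(X), k = 2)
colnames(mds) <- c("MDS Component 1", "MDS Component 2")

# Correlation of every original series with each MDS component.
cor_table <- cor(X, mds)
signif(cor_table, 3)
```

The sign of each component is arbitrary, so a column of correlations may come out with all signs flipped relative to another run or another dataset.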
  • 15. Transition Periods Deep Dive. Transition periods overlap with long-term trends. Time-to-transition periods have been shorter in recent years.
  • 16. Results Overview. The NSE shows exponential growth in a time period of changes: new regulations, oil price drops, and a fall in inflation. Tipping points of growth include the current period, starting late 2017/early 2018; the analysis actually predicted, in late 2017, the tumble of the NSE during February 2018. A crash was predicted by several economists for sometime in 2018: https://www.getmoneyrich.com/indian-stock-market-correction-likely-in-2017-2018/ and https://www.livemint.com/Money/pXdnLHA2r1FJfwJhFEDqjO/Stock-market-crash-Experts-divided-on-whether-theres-more.html. Fluctuations and volatility have been increasing in the past few years; prices can vary a lot during the day while starting and closing with similar values.
  • 17. Conclusions. Clustering and dimensionality reduction support multivariate data exploration: they are helpful for understanding multivariate time series data and for understanding other types of data prior to analysis. The approach performs very well here, showing behavior deviations before major events. It can also provide an understanding of covariance structure (relationships between stocks, volatility within a market...).
  • 18. References.
      Farrelly, C. M. (2017). Dimensionality Reduction Ensembles. arXiv preprint arXiv:1710.04484.
      Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse-Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
      Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1-27.
      Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
      Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
      Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
      ResearchGate profile with folder for talk (data, R code, PPT): https://www.researchgate.net/profile/Colleen_Farrelly2