Linear Regression on 1 Terabyte of Data?
Some Crazy Observations and Actions
Hesen Peng
Amazon.com
Big Data Exploration with Amazon
Model building procedure for
a major internet company
Planning and
Idea Generation
Data collection
Model building
and offline
evaluation
Implementation
for application
online
Performance
evaluation in
real world
Experiment
Design,
Clinical Trial
Major Machine
Learning/Stat
research
Interesting
weekend project
Unsupervised
Machine Learning,
Survival analysis
Power Point
Linear regression with 1TB of data
Wanna try it out?
• Use Amazon Web Services! (with free tier)
– http://aws.amazon.com/education/
• Write a simple distributed algorithm:
– Python: MRJob (https://github.com/Yelp/mrjob)
– R: RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
– Launch your own Sun/Oracle Grid Engine
environment for parallel computing
(http://star.mit.edu/cluster/)
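
A minimal sketch of what a "simple distributed algorithm" for linear regression could look like, assuming the usual normal-equations approach: each mapper computes the partial sums X'X and X'y for its chunk of rows, a reducer adds them up, and the small p-by-p system is solved once at the end. This is plain Python, not the MRJob or RHadoop API; the function names (map_chunk, reduce_stats, solve) and the toy data are hypothetical.

```python
# Sketch: least squares via per-chunk sufficient statistics (map) and a sum (reduce).
# Function names are hypothetical; this is plain Python, not the MRJob API.

def map_chunk(rows):
    """Mapper: rows are (x_vector, y) pairs; return partial X'X and X'y."""
    p = len(rows[0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def reduce_stats(parts):
    """Reducer: add up the partial statistics from all chunks."""
    parts = list(parts)
    p = len(parts[0][1])
    xtx = [[sum(part[0][i][j] for part in parts) for j in range(p)] for i in range(p)]
    xty = [sum(part[1][i] for part in parts) for i in range(p)]
    return xtx, xty

def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

# Toy data: y = 2 + 3*x exactly, split into two chunks; first column is the intercept.
chunk_a = [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)]
chunk_b = [([1.0, 2.0], 8.0), ([1.0, 3.0], 11.0)]
xtx, xty = reduce_stats(map_chunk(c) for c in (chunk_a, chunk_b))
beta = solve(xtx, xty)  # intercept and slope recovered from the combined statistics
```

Only the p-by-p matrix and the length-p vector travel between workers, so the data volume shuffled does not grow with the number of rows.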
New Challenges
• Association beyond linear
– Make better use of data: (most) factors are statistically
significant in linear models with 1 TB of data
– (Better?) Prediction
• Everything goes to real time
– Build/ update model, analytics, data storage in real
time
– Faster response to new happenings
– Save engineering overhead
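
A small worked example of why (most) factors come out statistically significant at this scale, assuming a single predictor and the standard t statistic for a correlation coefficient. The correlation value and the row counts are hypothetical, chosen only for illustration.

```python
import math

def t_statistic(r, n):
    """t statistic for testing zero correlation, given sample correlation r and n rows."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

r = 0.001  # hypothetical, practically negligible correlation
t_small = t_statistic(r, 1_000)          # about 0.03: nowhere near significant
t_big = t_statistic(r, 1_000_000_000)    # about 31.6: far beyond the usual 1.96 cutoff
```

The effect size is identical in both cases; only the sample size changes, which is why significance alone stops being a useful filter.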
Real time big data analytics workflow
Real time data input
(training + testing data)
Real time analytics front
end
Dashboarding/
monitoring
Model building / update
Prediction server
Outlier detection and
pre-processing
Huge Statistical
Challenge
Tree design rather than
ring design, enabling
parallel construction and
update
Where are we?
Offline model
building and
scheduled updating
Linear regression / GLM
using Mahout etc
Random
Forest, SVM, Hashing, and
beyond
Mutual
information, Brownian
distance covariance, Mira score, and
density estimation!
Batch processing and
near real time
updating
Batch update to the linear
model
Batch update of random
forest, adaptively throw
away trees
?
Real time data
processing / cleaning
and model building
Linear model built and
consumed in real time
?
Real time universal
association discovery !
Timeliness of model build
Complexity of
association
Universal association discovery
• Discover associations between two random
vectors
• Regardless of dimension and association form
(linear / nonlinear/ higher order interaction).
• E.g. mutual information, Brownian distance
covariance, Mira score (1NN edge sum)
Intuition
Hesen Peng, Tianwei Yu. SeMira: Universal Association Discovery and Variable Selection
among Continuous Variables using Functions on the Observation Graph
Mira score: another function on the
distance graph
• Where d(i) is the distance between observation i
and its nearest neighbor.
• O(N²P)
• How to adapt to real time analytics?
– Segment data for batch processing
– Keep partial data in memory and change the
calculation function
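
A brute-force sketch of a nearest-neighbor edge sum, assuming the score is the sum of d(i) over the joint observations (x_i, y_i) with Euclidean distance. The slide's formula is not reproduced here, so the exact normalization used in SeMira may differ; the function names and toy data are hypothetical. The double loop over N observations of dimension P is where the quadratic cost comes from.

```python
import math

def nn_distances(points):
    """Return d(i): Euclidean distance from each point to its nearest other point."""
    out = []
    for i, a in enumerate(points):
        best = math.inf
        for j, b in enumerate(points):
            if i != j:
                best = min(best, math.dist(a, b))
        out.append(best)
    return out

def nn_edge_sum(xs, ys):
    """Sum of d(i) over the joint observations (x_i, y_i). Assumed form of the score."""
    joint = [tuple(x) + tuple(y) for x, y in zip(xs, ys)]
    return sum(nn_distances(joint))

# Toy data: y tracks x exactly in one case, and is scrambled in the other.
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys_linked = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys_scrambled = [(2.0,), (0.0,), (3.0,), (1.0,)]
score_linked = nn_edge_sum(xs, ys_linked)
score_scrambled = nn_edge_sum(xs, ys_scrambled)
```

In this toy case the linked data gives the smaller sum, because associated observations sit closer together in the joint space.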
From O(N²P) to O(NP)
A whole distance
matrix between
observations
Only keep the most up-to-date
few in memory and
calculate NN distance between
observations kept in memory
Yes, loss of power;
assuming association is
independent of
sequence of observation
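
A sketch of the in-memory variant, assuming a fixed-size window of the most recent joint observations: each arriving observation is compared only against the window, so the cost per observation is proportional to the window size times P rather than to N times P. The class name WindowedNNSum and the window parameter are hypothetical, and the running total is an approximation rather than the full-matrix score.

```python
import math
from collections import deque

class WindowedNNSum:
    """Running sum of nearest-neighbor distances against a recent-history window."""

    def __init__(self, window):
        self.recent = deque(maxlen=window)  # oldest observation drops out automatically
        self.total = 0.0
        self.count = 0

    def update(self, x, y):
        """Add one observation (x, y); return the running sum so far."""
        point = tuple(x) + tuple(y)
        if self.recent:
            # Nearest neighbor is searched only among observations kept in memory.
            self.total += min(math.dist(point, q) for q in self.recent)
            self.count += 1
        self.recent.append(point)
        return self.total

# Stream four observations lying on a line, keeping only the last two in memory.
stream = WindowedNNSum(window=2)
for v in (0.0, 1.0, 2.0, 3.0):
    running = stream.update((v,), (v,))
```

Because each observation only looks at its recent predecessors, the result depends on arrival order, which is why the assumption that association is independent of the sequence of observations matters.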
We are still at Day 1
• Mira score: only capable of detecting association
between continuous variables
– SeMira: variable selection
– No prediction yet
• Functions on the distance graph are a gold mine.
• Real time analytics = $$$
– Fraud detection
– Clustering
– Recommendation systems
Join Us!
• Ask Hesen for referral:
hesepeng@amazon.com
• http://www.amazon.com/gp/jobs
• Jobs of all levels:
– Research Scientist
– Business Intelligence Engineer
– Software Development Engineers
– Machine Learning scientist
– Manager in Machine Learning
