Big Data Science
Hype?
Levente Török
Blinkbox Music Ltd ... GE Hungary
Disclaimer
All statements appearing in the slides or in the presentation represent my personal
opinion. They are not connected to any company or person I have or had a
connection with.
I make these statements at the risk of error.
Summary
- Big data? Data Science? Hype?
- Continuous improvement of Online Systems
- A/B testing
Data Science, hype?
Harvard Business Review in 2012
Data Science, hype?
Forbes in 2015
Whether employers know or don’t know what data scientists do, they have been
using—in rapidly-growing numbers—the term “data scientist” in job
descriptions in the past two years as Indeed.com’s data demonstrates.
Developers, developers ...
“Data Science” in media
Yahoo Finance:
“If you take a cue from the Harvard Business Review, the title goes to data
scientists. That’s right. Data scientist, as in, the type of person who can write
computer code and harness the power of big data to come up with innovative
ways to use it for companies like Google (GOOG), Amazon (AMZN), and
Yahoo! (YHOO).”
“Data Science” in media
Nature Jobs:
Data Science, what is this?
Wikipedia
“Data Science is the extraction of knowledge from data, which is a continuation
of the field data mining and predictive analytics”
Data? Science... ?
1) Big Data Engineer
- Hive, Yarn, Spark, Impala
2) Data Miner
- SAS, Knime, Rapid Miner, Weka,
IBM Clementine
3) Big Data & Data Miner
- Apache - Mahout
- Spark - MLlib, Spark - GraphX
- Apache - Giraph
- GraphLab ?
Data Scientist?
Big data - big failure:
If an algorithm doesn’t work on small data, it won’t work on big data.
4) Data Scientist is a real scientist:
Follows scientific principles in data modeling:
- conjectures a hypothesis about the statistical structure of the data
- validates it offline and online
- improves model iteratively
Tools: R / Python / C++
http://bit.ly/1B3bSS1
Tools: verdict
other -> R -> python = 0.44 * 0.26 = 0.11
other -> python -> R = 0.23 * 0.18 = 0.04
Is this correct?
However ... what?
Improving Online Systems
Examples
Recommender systems (i.e. RecSys)
What to listen to next?
What ad to display?
Anomaly detection:
Is this user/system behaviour “normal”?
Is this system going to fail soon?
Data Flow in Online Sys
Online sys -> log -> daily aggregation -> long-term storage -> batch model building
queue -> async online model updates
near-optimal online data model
The major difficulty
[Diagram: datasource, daily aggregates, queue, batch model training, online model training]
1. Batch model training starts at 4:00 and finishes at 4:30.
2. The new online model update starts at 4:30 and would finish at 5:10, covering all events
from 0:00 to 4:30, but new events have arrived in the meantime.
... -> streaming architectures
Offline data modelling
[Diagram: Train and Test data, Parameters, Model, Prediction]
Offline modeling
1. Data splits for train / test / quiz
- time based: e.g. 2 weeks / 1 day / 1 day
- entity based: set of users
- session based: set of user sessions
Test data preparation:
- positive/negative sample data points labelled manually, or injected
2. Train by batch training
Given a data set, we try to fit the model to it; the fit is controlled by the model
parameters.
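The time-based split from step 1 can be sketched as follows. This is a minimal illustration, not the author's pipeline: the function name, the `(timestamp, payload)` event layout, and the sample data are assumptions; only the 2 weeks / 1 day / 1 day window lengths come from the slide.

```python
from datetime import datetime, timedelta

def time_based_split(events, start, train_days=14, test_days=1, quiz_days=1):
    """Split (timestamp, payload) events into train / test / quiz by time.

    Defaults follow the 2 weeks / 1 day / 1 day example; the event layout
    is a hypothetical one chosen for illustration.
    """
    train_end = start + timedelta(days=train_days)
    test_end = train_end + timedelta(days=test_days)
    quiz_end = test_end + timedelta(days=quiz_days)
    train, test, quiz = [], [], []
    for ts, payload in events:
        if start <= ts < train_end:
            train.append((ts, payload))
        elif train_end <= ts < test_end:
            test.append((ts, payload))
        elif test_end <= ts < quiz_end:
            quiz.append((ts, payload))
    return train, test, quiz

# Hypothetical data: one event per day for 16 days.
start = datetime(2015, 1, 1)
events = [(start + timedelta(days=d), f"event-{d}") for d in range(16)]
train, test, quiz = time_based_split(events, start)
print(len(train), len(test), len(quiz))
```

Splitting by time rather than at random keeps future events out of the training set, which is closer to how the model is used online.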
Offline data modelling
3. Prediction phase: given a model
- for each user seen in the training set, we give predictions
- for each event in the test set, we predict its likelihood
4. Evaluation phase: the similarity of the predictions and the test data is measured
- RecSys: NDCG, Recall, Precision, AUC, ... 20 different metrics
- Artificially labelled data set for anomaly detection: C2B (AUC),
weighted AUC ...
- Sanity check! -> Q/A team
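As one concrete instance of the RecSys metrics listed above, here is a small sketch of NDCG. It uses the common log2 discount and takes the ideal ordering from the same relevance list, which is a simplification; the function names and the sample relevance lists are assumptions for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: items further down the list count less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant item at all
    return dcg(ranked_relevances[:k]) / ideal_dcg

# Relevance of each recommended item, in the order the model ranked them.
print(ndcg_at_k([1, 1, 0], 3))  # relevant items ranked first
print(ndcg_at_k([1, 0, 1], 3))  # one relevant item pushed down
```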
Offline data modelling
4. Parameter search in parallel
The output of the search is the parameter vector (+ model id) that
gives the best offline result according to our belief.
NB: usually we are unsure which offline measure will best reflect the
online results, so we keep a number of optimal parameter vectors, one per
offline measure.
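A rough sketch of such a search, keeping one best parameter vector per offline measure. Everything here is hypothetical: `evaluate` stands in for "train and score offline" with made-up formulas, and the grid values are arbitrary. A thread pool is used only to keep the example self-contained; for CPU-bound training a process pool or a cluster would be the usual choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Stand-in for train + offline evaluation; returns {metric: score}.

    The formulas are invented so that the example runs end to end.
    """
    factors, reg = params
    return {
        "ndcg": 1.0 - abs(factors - 32) / 64 - reg,
        "auc": 1.0 - abs(factors - 16) / 64 - reg / 2,
    }

# Hypothetical grid: number of latent factors x regularisation strength.
grid = list(product([8, 16, 32, 64], [0.01, 0.1]))

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid))  # results keep the grid order

# One best parameter vector per offline measure.
best = {
    metric: max(zip(grid, scores), key=lambda pair: pair[1][metric])[0]
    for metric in ("ndcg", "auc")
}
print(best)
```

Each entry of `best` is a candidate for the online test, since the offline measures do not have to agree on a winner.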
A/B testing
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
??
Online performance tuning
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
Parameters
Online traffic split adj.
Train_A -> Model_A -> Online pred_A -> Performance_A
Train_B -> Model_B -> Online pred_B -> Performance_B
Offline-Online matching
         Offline measures         Online measure
Model    NDCG    AUC    ...       Avg Sess Len
A        1       1                1
B        2       3                3
C        3       2                2
Compare with Pearson's correlation coefficient.
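The comparison can be worked through on the ranks in the table above. The sketch below computes Pearson's correlation between each offline ranking and the online ranking (average session length); the helper name `pearson` is an assumption, and the numbers are the ranks from the slide.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Ranks of models A, B, C, taken from the table.
online_rank = [1, 3, 2]  # Avg Sess Len
offline_ranks = {"NDCG": [1, 2, 3], "AUC": [1, 3, 2]}

for metric, ranks in offline_ranks.items():
    print(metric, round(pearson(ranks, online_rank), 2))
```

On these three models AUC orders the models exactly as the online measure does, while NDCG only partly agrees. With so few models this is an illustration rather than evidence.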
On-line testing
5. A/B testing
- control model
- tested model (model with an offline optimal parameter set)
6. Evaluation of online results:
Measures:
- Session length, station length
- Return rate, CLTV
Filter and compare models -> wow!
On-line testing
7. Run many models one by one according to phase 4.
8. Figure out the best offline metrics:
Compare the order statistics of offline and online models
(i.e. Pearson's correlation) to figure out which of the offline metrics matter the
most for online performance.
Model comparisons
Problems:
1. On day 1 A is better, on day 2 B is better
2. The version with the longest session length != the version with the highest
full play ratio of tracks
3. Outliers dominate the session length average:
- A number of users listen to the service “forever”
- Bouncing users pollute the session length average with high noise
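Problem 3 is easy to reproduce on made-up numbers. The session lengths below are hypothetical, and the median is shown only as one common robust alternative to the mean; the slides do not prescribe it.

```python
from statistics import mean, median

# Hypothetical session lengths in minutes:
# two bounces, five typical sessions, one "forever" listener.
sessions = [0.1, 0.2, 12, 15, 18, 20, 25, 1440]

print(f"mean   = {mean(sessions):.1f}")    # pulled up by the single outlier
print(f"median = {median(sessions):.1f}")  # close to a typical session
```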
A/B testing
1. Version A: Control group
2. Version B: Treatment group
With n_A and n_B users, we observe k_A and k_B successes.
Is it enough to compare k_A / n_A with k_B / n_B?
A/B testing?
Questions:
- What if A wins one day and B wins the next?
- How many users should I use for testing?
- How long should I run the test?
- What if we have A, B, C ... versions we want to test?
Classical Statistics
Hypothesis testing:
- Does treatment B have any effect?
- up to probability: (1-alpha)
- given: a sample size of N
Even the most well-known A/B testing platforms can lead you to illusory results.
Command: “Sample size estimator”
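For orientation, a sample size estimator for comparing two success rates can be sketched with the textbook normal-approximation formula. This is an assumption about what such a tool computes, not a description of any particular platform; the function name, the default `alpha` and `power`, and the example rates are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.8):
    """Users needed per group to detect p_a vs p_b with a two-sided test.

    Textbook normal approximation for two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p_a(1-p_a) + p_b(1-p_b)) / (p_a - p_b)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)

# Hypothetical rates: 10% baseline, 12% expected for the treatment.
print(sample_size_per_group(0.10, 0.12))
```

The required sample grows with the inverse square of the difference, so halving the effect you want to detect roughly quadruples the number of users needed.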
Binomial ?
Note that:
Binomial distribution:
Beta distribution: where
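For reference, the standard forms of the two distributions named above are given below. The symbols p (success probability) and a, b (shape parameters) are the usual textbook notation and are assumed here, not taken from the slide.

```latex
\begin{align*}
P(k \mid n, p) &= \binom{n}{k}\, p^{k} (1-p)^{n-k} \\
f(p;\, a, b) &= \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)},
\qquad \text{where } B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}
\end{align*}
```

Read as a function of p, the binomial likelihood has the shape of a Beta density: with a uniform prior, the posterior of p after k successes in n trials is Beta(k+1, n-k+1).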
New statistics
n_A = 150, k_A = 18
n_B = 145, k_B = 14
The major question:
Chance2beat:
[Plot: densities f_A(x;...) and f_B(x;...) over x]
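One way to estimate the chance that B beats A for the counts above is by Monte Carlo sampling. The sketch assumes a uniform prior on each success rate, so each posterior is Beta(k+1, n-k+1); the prior, the function name, the number of draws, and the seed are assumptions, not choices stated in the slides.

```python
import random

def chance_to_beat(k_a, n_a, k_b, n_b, draws=20000, seed=0):
    """Estimate P(p_B > p_A) by sampling both Beta posteriors.

    Assumes a uniform prior, so the posterior after k successes
    in n trials is Beta(k + 1, n - k + 1).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(k_a + 1, n_a - k_a + 1)
        p_b = rng.betavariate(k_b + 1, n_b - k_b + 1)
        if p_b > p_a:
            wins += 1
    return wins / draws

# Counts from the slide: n_A = 150, k_A = 18 and n_B = 145, k_B = 14.
c2b = chance_to_beat(18, 150, 14, 145)
print(f"P(B beats A) ~ {c2b:.2f}")
```

Since B's observed rate (14/145) is below A's (18/150), the estimate comes out below one half, but far enough from zero that the data do not settle the question.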
Chance 2 beat
- This is the probability we want to increase by testing. For example:
- f_A and f_B can be:
- Gaussians,
- distributions with priors,
- empirical distributions, or
- small sample size data sets directly
- Sometimes this is not enough: use bootstrapping!
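A bootstrap version of chance-to-beat, working directly on small empirical samples, might look like the sketch below. The session-length samples are invented, and comparing resampled means is one possible choice of statistic; the function name, the number of rounds, and the seed are likewise assumptions.

```python
import random
from statistics import mean

def bootstrap_chance_to_beat(sample_a, sample_b, rounds=2000, seed=0):
    """Share of bootstrap rounds in which B's resampled mean exceeds A's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(rounds):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(sample_a, k=len(sample_a))
        resample_b = rng.choices(sample_b, k=len(sample_b))
        if mean(resample_b) > mean(resample_a):
            wins += 1
    return wins / rounds

# Hypothetical session lengths (minutes) for the two versions.
sessions_a = [5, 7, 8, 12, 15, 20, 22, 30]
sessions_b = [9, 11, 14, 18, 21, 25, 28, 35]
print(bootstrap_chance_to_beat(sessions_a, sessions_b))
```

No distributional form is assumed here, which is the appeal when the data are too skewed or too few for a Gaussian or Beta model.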
Thanks
