10 Aug 2018

Sciences

The term 'Data Scientist' arose fairly recently to express the specialised recruitment needs of certain well-known data-driven Silicon Valley firms. It signifies a mix of diverse and rare talents, drawing mostly from Computer Science (with an emphasis on Big Data), Statistics and Machine Learning. In this talk, we will briefly survey the state of the art, in terms of both problems and solutions, at the vanguard of Data Science. We will cover novel developments as well as centuries-old best practices, in an attempt to demonstrate that Data Science is indeed a Science, in the full sense of the word. This talk is part of a seminar series that the speaker has given across the world, including at Google (Mountain View), Cisco (San Jose) and Aviva Headquarters (London), and represents joint work with Professor David Hand (OBE).


- Why Data Science is a Science Dr. Christoforos Anagnostopoulos, Founder and Chief Data Scientist, Mentat Innovations; Lecturer in Statistics (on leave), Imperial College London
- Credentials BA Mathematics at Cambridge University. MSc Machine Learning at Edinburgh University. MSc Logic and Computer Science at Athens University. PhD in Machine Learning for Data Streams at Imperial. Postdoc Fellow at the Statistical Laboratory, Cambridge University. Lecturer in Statistics at Imperial College. Founder and Chief Scientist of Mentat Innovations. Numerous consulting projects in real-time data analysis: social media analysis, sensor network telemetry, online RTB advertising, cybersecurity and fraud, retail banking; engaged with data journalism on several occasions (The Independent, The Guardian, BBC, …). Mentat Innovations is pioneering real-time anomaly detection on network, application and telemetry data.
- This talk This talk has been given around the world. Much of the thinking in it comes from colleagues that I have had the privilege to work with over the years: Prof. David Hand, OBE (Chairman of Advisory Board of Mentat), renowned statistician, twice President of the Royal Statistical Society, authority on pattern recognition and data mining for retail finance; Professor Niall Adams, Imperial College London, machine learning expert and pioneer of data mining in cybersecurity; Professor David Leslie, Lancaster University, worldwide expert in machine learning within game theory; George Cotsikis (CEO and co-Founder of Mentat), entrepreneur with 17 years' experience in quantitative finance.
- Data Science: the origins Many rediscoveries of data analysis in the last 20 years: Data Mining, Pattern Recognition, Machine Learning, Statistical Modelling, Analytics, Business Intelligence, Predictive Analytics, Big Data, Search and Information Retrieval, Natural Language Processing, Neural Nets, Deep Learning, Learning from Data, Knowledge Discovery. (Courtesy of Cathy O'Neil and Rachel Schutt)
- Data Science: the origins 1970s: Peter Naur introduces "data science" as a synonym for "computer science". 1997: Jeff Wu claims "statisticians" are "data scientists". 2001: William Cleveland introduces data science as an independent discipline, extending statistics. 2008: DJ Patil (LinkedIn) and Jeff Hammerbacher (Facebook) describe their job role as that of "Data Scientist".
- Data Science: the origins The term has been trending since 2008, 38 years after it was coined.
- What about Big Data? Volume: SQL, HDFS. Velocity: complex events processing, Apache Storm, Apache Spark Streaming. Variety: structured, semi-structured and unstructured data (social graphs, system logs, tweets/blogs, CCTV); many variables, sampling variability (e.g., spatiotemporal).
- What about Big Data? Volume, Velocity, Variety, Veracity, Value. Nobody wants data. Everybody wants data-driven, reliable, actionable insights.
- Big Data in Science CERN: 1 petabyte per day, 10 GB per second. Astrostatistics, biomedical sciences, climatology.
- Big Data in Science Models guided by theory; well-formulated questions. Big Data in the Commercial World Little to no theory; "needle in the haystack".
- Big Data in the Commercial World Example: car loan provider. Online advertising: saw an ad, clicked, browsed, converted, cookie info. Credit scoring data: application data submitted, credit bureau queried, credit score computed, interest rate tailored, loan offered. Behavioural data: timely payments for 3 months, delayed 4th payment, delayed 5th payment. External data: social media data, public info about employer, demographic data, macroeconomic data. Collections: sent letter, no reply; telephoned, non-cooperative; in-person visit. Takeaways: data silos, no substantive theory, often the question is unclear ("fishing"), data quality is low, the data are not necessarily that Big, but there is a great variety of data.
- Statistical Methodology Formulate question, get data → Exploratory Data Analysis (histograms, density plots, xy-plots, summary stats) → Model and Variable Selection (variable selection, dimensionality reduction, model averaging / ensembles) → Model Fitting → Model Diagnostics (cross-validation, bootstrapping, QQ plots, outlier detection, …) → Inference (X, Y, Z have an effect on W) and Prediction (classification, regression, forecasting, anomaly/change detection).
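The cross-validation step of the pipeline above can be sketched in a few lines. This is a minimal illustration, not from the talk: `k_fold_cv` and `make_poly` are hypothetical helpers, and the polynomial toy problem is invented for the example.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, k=5, seed=0):
    """Estimate out-of-sample MSE of a model by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(errors))

def make_poly(degree):
    """Polynomial least-squares model as a (fit, predict) pair."""
    fit = lambda X, y: np.polyfit(X, y, degree)
    predict = lambda coef, X: np.polyval(coef, X)
    return fit, predict

# Usage: the truth is linear, so CV should prefer degree 1 over degree 0.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 200)
y = 1.0 + 2.0 * X + rng.normal(0, 1, 200)
scores = {d: k_fold_cv(X, y, *make_poly(d)) for d in (0, 1, 5)}
```

The point is that the model is always scored on data it did not see during fitting, which is what makes CV an honest diagnostic.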
- Statistical Methodology Bayesian vs Classical. Classical: data are noisy, parameters are fixed but unknown; we use probability distributions to model the noise. Bayesian: we use probability distributions to model our uncertainty about both the data and the parameters. In practice: Bayesians "average" over their uncertainty a lot, which means they use a lot of numerical integration (recently: Monte Carlo); everything has a probability distribution, some of them subjective. Frequentists usually report "their best guess"; they use a lot of classical optimisation (gradient descent etc.), which is faster, and in cases where the variation is simple/physical, less subjective.
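The contrast can be made concrete with the simplest conjugate model. This is an illustrative sketch (a Beta-Bernoulli example invented for this point, not from the talk): the frequentist reports a point estimate, while the Bayesian carries a whole posterior and Monte Carlo averages over it.

```python
import numpy as np

def mle_rate(successes, n):
    """Classical "best guess": the maximum-likelihood estimate."""
    return successes / n

def posterior_beta(successes, n, a=1.0, b=1.0):
    """Conjugate Bayesian update: Beta(a, b) prior -> Beta posterior."""
    return a + successes, b + (n - successes)

def prob_rate_above(a_post, b_post, threshold, n_draws=100_000, seed=0):
    """Average over posterior uncertainty by Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    return float(np.mean(rng.beta(a_post, b_post, n_draws) > threshold))

# 7 successes in 10 trials, uniform prior.
a, b = posterior_beta(7, 10)
point = mle_rate(7, 10)              # frequentist point estimate
mean = a / (a + b)                   # Bayesian posterior mean
p_above_half = prob_rate_above(a, b, 0.5)
```

With a flat prior the two answers nearly coincide; the Bayesian additionally gets quantities like `p_above_half` for free, at the cost of integration.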
- Statistical Methodology Data Mining and Pattern Recognition: • Focus on pattern extraction rather than inference • Often no question formulated in advance. Machine Learning: • Focus on prediction (out-of-sample error) • Largely more automatic; black-box techniques are OK • Huge success stories in stylised worlds • Onus on the user to fit their problem into one of only a few "templates" (classification, regression), which carries big risks. Deep Learning and Cognitive AI: • Aims to replicate human cognition, low- to mid-level faculties such as vision, hearing, natural language understanding • Can share methods with statistics/probabilistic modelling, but is mostly fundamentally different in its approach.
- Statistical Methodology ANALYTICS vs LEARNING. Analytics: retrospective summaries; a matter of resources to compute the exact answer (storage, distributed queries, parallel computation, …); logic and algorithms. Learning: generalisation; mathematics, probability theory, numerical optimisation; no "exact" answer.
- Statistical Methodology Takeaways: • Black boxes aren’t enough • More Data != More Information • Big Data needs Big Models • Quantity vs Quality vs Homogeneity
- Black boxes aren't enough Peter Norvig: his statement was largely driven by the "quantum step" in machine translation offered by black-box (neural net) techniques, compared to explicit grammar models and classical natural language processing tools. Black-box AI is experiencing a second coming. However, it does rely on (nearly commoditised) natural language preprocessing tools for keyword extraction, named entity recognition etc. The claim is almost never true: even if generalisation is not needed, there are always sources of error (measurement, nonresponse), as well as latent factors (e.g., the effect of X on Y; correlation vs causality).
- More Data != More Information 20 years' worth of credit scoring data, but … • Only one snapshot of each applicant's behaviour • Unknown levels of demographic variability • Unknown levels of temporal variability. With more data (usually) comes more heterogeneity: one could say that Big Data = Many Small Datasets. Databases went from flat to relational to NoSQL, but most commodity models are pre-relational! Models are not as re-usable as people think (for example, a decision tree might be a good predictor but a poor customer segmentation tool).
- More Data != More Information The signal sometimes simply isn't there. Substantive theory (and common sense) are still needed: external (unobserved) factors, inherent unpredictability. Biased sampling (observational vs prospective, e.g., A/B testing). The lost art of survey sampling (elections?)
- Big Data needs Big Models With enough data, everything is significant. This assumes the model is right and the data i.i.d. • Bigger data typically means more sources of variation • Model complexity should grow with the data (Kolmogorov) [Plots: "Small Data" vs "Bigger Data", attribute vs response, comparing the truth, a complex model and a simple model]
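The missing plots can be reproduced numerically. This is a hedged reconstruction, not the speaker's code: the cubic "truth" and the polynomial degrees are invented for the illustration. A complex model overfits badly on small data, but with bigger data its held-out error collapses, while the simple model stays biased.

```python
import numpy as np

def heldout_mse(degree, n_train, seed=0):
    """Held-out MSE of a degree-`degree` polynomial fit on n_train points."""
    rng = np.random.default_rng(seed)

    def sample(n):
        x = rng.uniform(-3, 3, n)
        y = 0.5 * x**3 - x + rng.normal(0, 1.0, n)  # hypothetical curved truth
        return x, y

    x_tr, y_tr = sample(n_train)
    x_te, y_te = sample(5000)                       # large held-out set
    coef = np.polyfit(x_tr, y_tr, degree)
    return float(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
```

For example, `heldout_mse(9, 15)` is far worse than `heldout_mse(9, 2000)`: the same complex model goes from useless to excellent purely because the data grew, which is the slide's point that model complexity should be allowed to grow with the data.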
- Big Data needs Big Models Personally a big fan of Bayesian non-parametrics. Zoubin Ghahramani thinks it’s “the rise of the automated statistician”
- Big Data needs Big Models Fat Data vs Tall Data Sometimes bigger means more features for the same examples: the curse of dimensionality. Modern techniques for sparse learning (p >> n) are a great aid (e.g., the Lasso). [Tables: a "fat" table with few rows and many columns (ID, Age, Income, Tweet, Tweet, Tweet, …) vs a "tall" table with many rows and few columns (ID, Age, Income)]
- Big Data needs Big Models Fat Data vs Tall Data Consider recommender systems. As data grows: • more items, more users • each user ranks a fixed number of items: sparser matrices.
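The Lasso mentioned above can be sketched from scratch with proximal gradient descent (ISTA). This is a minimal, assumed implementation written for illustration (not the speaker's code, and not a production solver); the 50-by-200 "fat" design with two truly relevant features is invented.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Lasso via proximal gradient descent (ISTA):
    minimise ||y - X b||^2 / (2n) + lam * ||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

# p >> n: 200 features, 50 examples, only 2 truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
beta = np.zeros(200)
beta[0], beta[1] = 3.0, -2.0
y = X @ beta + rng.normal(0, 0.1, 50)
b_hat = lasso_ista(X, y, lam=0.1)
```

Even with four times as many features as examples, the L1 penalty zeroes out almost all spurious coefficients while recovering the two real ones, which is exactly why sparse learning helps with fat data.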
- Big Data needs Big Models Temporal homogeneity: the hidden bottleneck At one extreme, one could ignore all past data as irrelevant; at the other, one could assume the future is like the past. Solutions in the middle include dynamic modelling (very complicated and computationally expensive) and exponential filters of various specifications (my field of expertise). [Plot: densities of the prior, the posterior, and posteriors under a power prior and a flat prior]
- Big Data needs Big Models Temporal homogeneity: the hidden bottleneck Sometimes there is nothing to do. [Plots: two scatterplots of Class 1 vs Class 2 in (X1, X2)]
- Big Data needs Big Models Temporal homogeneity: the hidden bottleneck What looks like drift for one model might not be for another, especially when the population, not the concept, is drifting. [Plot: old vs new data in an X–y scatterplot]
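The exponential filters mentioned above can be sketched as a forgetting-factor running mean. This is an illustrative toy, not the speaker's specification: the forgetting factor 0.95 and the simulated mean shift at t = 500 are assumptions made for the example.

```python
import numpy as np

def forgetting_mean(stream, lam=0.95):
    """Running mean with exponential forgetting factor lam in (0, 1).
    Old observations decay geometrically; effective window ~ 1 / (1 - lam)."""
    m, w, out = 0.0, 0.0, []
    for x in stream:
        w = lam * w + 1.0          # discounted effective sample size
        m = m + (x - m) / w        # forgetting-factor update of the mean
        out.append(m)
    return np.array(out)

# A mean shift at t = 500: the forgetting mean tracks it, the plain mean lags.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])
tracked = forgetting_mean(stream, lam=0.95)
plain = np.cumsum(stream) / np.arange(1, len(stream) + 1)
```

This sits exactly in the "middle ground" the slide describes: the filter never commits to "the past is irrelevant" or "the future is like the past", and the single parameter `lam` interpolates between the two extremes.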
- Big Data needs Big Models Robustness It is important to have built-in guarantees. Robustness and model diagnostics are the unsung heroes of classical statistics. Complicating the assumption set sometimes leads to overly complex models; robustness is often the expedient solution.
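A minimal sketch of the "built-in guarantees" idea, using standard robust estimators (the 5%-contamination scenario is invented for the example): the median and the MAD barely move under gross corruption, while the mean is dragged far from the truth.

```python
import numpy as np

def robust_location(x):
    """Median: 50% breakdown point, unlike the mean's 0%."""
    return float(np.median(x))

def robust_scale(x):
    """Median absolute deviation, scaled for consistency at the Gaussian."""
    x = np.asarray(x)
    return float(1.4826 * np.median(np.abs(x - np.median(x))))

# 5% of the data is corrupted by a faulty sensor stuck at 100.
rng = np.random.default_rng(0)
clean = rng.normal(0, 1, 1000)
data = np.concatenate([clean, np.full(50, 100.0)])
```

On this data `np.mean(data)` is near 4.8 although the true centre is 0, while the median and scaled MAD stay close to 0 and 1: robustness as an expedient alternative to explicitly modelling the contamination.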
- Do not torture the data The Wall Street Journal: "Big Data Unveils Some Weird Correlations" • orange used cars are more reliable • taller people are better at repaying loans • http://www.tylervigen.com
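Why torturing the data works so reliably can be shown in a few lines. This simulation is an illustration invented for the point: with enough unrelated predictors, the best of them always looks impressively correlated with a purely random target.

```python
import numpy as np

def best_spurious_corr(n=30, p=1000, seed=0):
    """Max |correlation| between a random target and p unrelated predictors."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=n)                 # target: pure noise
    X = rng.normal(size=(n, p))            # predictors: independent noise
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    return float(np.max(np.abs(Xc.T @ yc)) / n)
```

With 30 observations and 1000 candidate predictors, the winning "weird correlation" typically exceeds 0.5 despite every variable being independent noise, which is why hypotheses should be specified before, not after, the search.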
- Streaming data Exact answers are sometimes possible (e.g., running mean), but sometimes they are not (e.g., top-K, median). Streaming approximate algorithms are fast and can be very accurate, but they can be complicated (e.g., HyperLogLog). Keep a constant memory footprint. Keep up (do not queue).
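The "exact answers are sometimes possible" case can be sketched with Welford's classic one-pass algorithm, which maintains the exact mean and variance in constant memory (this sketch is illustrative; it is not claimed to be the speaker's example):

```python
class RunningStats:
    """Exact streaming mean and variance in O(1) memory (Welford's algorithm)."""

    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)   # uses both old and new mean

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

# One pass, constant memory, and the answer is exact (up to float rounding).
rs = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rs.update(x)
```

No such constant-memory exact recursion exists for the median or top-K, which is where the approximate sketches come in.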
- Streaming data However, in Machine Learning, there are no "exact" answers. Will batch always outperform streaming (more resources)? • Temporal heterogeneity (drift) • Simulated annealing • Overfitting (prequential learning) www.ment.at/blog.html Keep a constant memory footprint. Keep up (do not queue).
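The prequential idea mentioned above can be sketched as "test-then-train": score each arriving point before using it to update the model, so the stream doubles as an honest, never-reused test set. The online logistic learner and the two-Gaussian stream below are assumptions made for this illustration.

```python
import numpy as np

def prequential_accuracy(X, y, lr=0.1):
    """Prequential ("test-then-train") evaluation of an online classifier:
    each point is first predicted, then used to update the model."""
    w = np.zeros(X.shape[1])
    correct = 0
    for xi, yi in zip(X, y):                      # labels yi in {-1, +1}
        pred = 1.0 if xi @ w >= 0 else -1.0       # test ...
        correct += (pred == yi)
        w += lr * yi * xi / (1.0 + np.exp(yi * (xi @ w)))   # ... then train
    return correct / len(y)

# Two well-separated Gaussian classes arriving as a stream.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=2000)
X = rng.normal(size=(2000, 2)) + 1.5 * y[:, None]
acc = prequential_accuracy(X, y)
```

Because every prediction is made before the model has seen the label, the prequential accuracy cannot be inflated by overfitting, which is exactly why it is the natural evaluation for streaming learners.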
- Infrastructure I haven't discussed infrastructure much, but it's critical: if you are late, sometimes you might as well give up. Parallelisation (e.g., GPUs), distribution (e.g., HDFS), streaming (e.g., Spark Streaming), λ-architectures … Algorithms often need to be designed from scratch. Great progress in this direction. Keep working on it!
- datastream.io: additional deployment options
- How to manage data scientists Treat negative results like you treat positive results. Encourage lab reports: data analysis is a process. Do not overfit. Do not fish for p-values. Do not torture the data. Specify hypotheses in advance whenever possible; then test. Black box solutions are great for prediction. Only. Do not silo data scientists. Incorporate expert knowledge whenever possible. Explicit prior beliefs are not a bias risk.
- Conclusions • Knowledge is power. Knowledge relies on data. • The process of extracting knowledge from data has become more efficient and more powerful than ever – but it's still far from automatic (we are working on it ...) • Big Data needs Big Models • More Data != More Information • A Data Scientist is a team, not an individual
- Afterthought What about strong Artificial Intelligence? Machines are outperforming humans in an increasingly broad array of cognitive tasks. Last time this happened we had the Industrial Revolution. Data Science is at the cusp of this wave. This is an exciting time, but it also carries a lot of responsibility.
- Afterthought If machines replace us, there will only be one profession left: AI programmers and Data Scientists.
