Why Data Science is a Science

  1. Why Data Science is a Science Dr. Christoforos Anagnostopoulos Founder and Chief Data Scientist, Mentat Innovations Lecturer in Statistics (on leave), Imperial College London Mentat Innovations
  2. Credentials BA Mathematics at Cambridge University MSc Machine Learning at Edinburgh University MSc Logic and Computer Science at Athens University PhD in Machine Learning for Data Streams at Imperial Postdoc Fellow at Statistical Laboratory, Cambridge Uni. Lecturer in Statistics at Imperial College Founder and Chief Scientist of Mentat Innovations
  3. Credentials PhD in Machine Learning for Data Streams at Imperial Postdoc Fellow at Statistical Laboratory, Cambridge Uni. Lecturer in Statistics at Imperial College Founder and Chief Scientist of Mentat Innovations Numerous consulting projects in real-time data analysis: • social media analysis, sensor network telemetry, online RTB advertising, cybersecurity and fraud, retail banking • engaged with data journalism on several occasions (The Independent, The Guardian, BBC, …) Mentat Innovations is pioneering real-time anomaly detection on network, application and telemetry data
  4. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: Prof. David Hand, OBE (Chairman of Advisory Board of Mentat) Renowned statistician, twice president of Royal Statistical Society Authority on pattern recognition and data mining for retail finance
  5. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: Professor Niall Adams, Imperial College London Machine Learning expert Data Mining in CyberSecurity pioneer
  6. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: Professor David Leslie, Lancaster University World-wide expert in machine learning within game theory
  7. This talk This talk has been given around the world Much of the thinking in this talk comes from colleagues that I have had the privilege to work with over the years: George Cotsikis (CEO and co-Founder of Mentat) Entrepreneur, 17 years' experience in quantitative finance
  8. Data Science: the origins
  9. Data Science: the origins Courtesy of Cathy O’Neil and Rachel Schutt
  10. Data Science: the origins Data Mining Pattern Recognition Statistical Modelling Business Intelligence Many rediscoveries of data analysis in the last 20 years Neural Nets Knowledge Discovery
  11. Data Science: the origins Data Mining Pattern Recognition Statistical Modelling Analytics Business Intelligence Predictive Analytics Many rediscoveries of data analysis in the last 20 years Big Data Search and Information Retrieval Neural Nets Knowledge Discovery
  12. Data Science: the origins Data Mining Pattern Recognition Machine Learning Statistical Modelling Analytics Business Intelligence Predictive Analytics Many rediscoveries of data analysis in the last 20 years Big Data Search and Information Retrieval Natural Language Processing Neural Nets Deep Learning Knowledge Discovery
  13. Data Science: the origins Data Mining Pattern Recognition Machine Learning Statistical Modelling Analytics Business Intelligence Predictive Analytics Many rediscoveries of data analysis in the last 20 years Big Data Search and Information Retrieval Natural Language Processing Neural Nets Deep Learning Learning from Data Knowledge Discovery
  14. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym to “computer science”
  15. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym to “computer science” 1997: Jeff Wu claims “statisticians” are “data scientists”.
  16. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym to “computer science” 1997: Jeff Wu claims “statisticians” are “data scientists”. 2001: William Cleveland introduces data science as an independent discipline, extending statistics.
  17. Data Science: the origins Many rediscoveries of data analysis in the last 20 years 1970s: Peter Naur introduces “data science” as a synonym to “computer science” 1997: Jeff Wu claims “statisticians” are “data scientists”. 2001: William Cleveland introduces data science as an independent discipline, extending statistics. 2008: DJ Patil (LinkedIn) and Jeff Hammerbacher (Facebook) describe their job role as that of “Data Scientist”
  18. Data Science: the origins The term has been trending since 2008, 38 years after Naur first used it
  19. What about Big Data? Volume: SQL, HDFS
  20. What about Big Data? Volume: SQL, HDFS Velocity: complex event processing, Apache Storm, Apache Spark Streaming
  21. What about Big Data? Volume: SQL, HDFS Velocity: complex event processing, Apache Storm, Apache Spark Streaming Variety: structured, semi-structured, unstructured (social graphs, system logs, tweets/blogs, CCTV); many variables, sampling variability (e.g., spatiotemporal)
  22. What about Big Data? Volume, Velocity, Variety, Veracity, Value. Nobody wants data. Everybody wants reliable, actionable, data-driven insights.
  23. Big Data in Science CERN 1 Petabyte per day 10 GB per second Astrostatistics Biomedical Climatology
  24. Big Data in Science: models guided by theory, well-formulated questions. Big Data in the Commercial World: little to no theory, “needle in the haystack”
  25. Big Data in the Commercial World Example: car loan provider Online advertising: saw an ad → clicked → browsed → converted (cookie info)
  26. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data: application data submitted → credit bureau queried → credit score computed → interest rate tailored → loan offered
  27. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data: timely payments for 3 months → delayed 4th payment → delayed 5th payment
  28. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data External data: social media data, public info about employer, demographic data, macroeconomic data
  29. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data External data Collections: sent letter, no reply → telephoned, non-cooperative → in-person visit
  30. Big Data in the Commercial World Example: car loan provider Online advertising Credit scoring data Behavioural data External data Collections • Data silos • No substantive theory • Often the question is unclear (“fishing”) • Low data quality • Not necessarily that Big • Variety of data
  31. Statistical Methodology Exploratory Data Analysis Formulate question, get data
  32. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Statistical Methodology Formulate question, get data
  33. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data
  34. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats
  35. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats variable selection, dimensionality reduction, model averaging (ensembles), Cross-Validation, bootstrapping, QQ plots, outlier detection,…
  36. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats variable selection, dimensionality reduction, model averaging (ensembles), Cross-Validation, bootstrapping, QQ plots, outlier detection,… classification regression forecasting X,Y,Z have an effect on W
  37. Exploratory Data Analysis Model and Variable Selection Model Fitting Model Diagnostics Inference Prediction Statistical Methodology Formulate question, get data histograms density plots xy-plots summary stats variable selection, dimensionality reduction, model averaging (ensembles), Cross-Validation, bootstrapping, QQ plots, outlier detection,… classification regression forecasting X,Y,Z have an effect on W Anomaly / Change Detection
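To make the workflow on these slides concrete, here is a minimal sketch in Python, assuming scikit-learn and synthetic data (neither is from the talk): summary statistics for exploration, a cross-validated Lasso for model and variable selection, and held-out residuals as a basic diagnostic before prediction.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # 10 candidate variables
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

# Exploratory data analysis: quick summary statistics for each variable.
print("means:", X.mean(axis=0).round(2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Model and variable selection: LassoCV tunes the penalty by cross-validation
# and shrinks irrelevant coefficients exactly to zero.
model = LassoCV(cv=5).fit(X_tr, y_tr)
print("selected variables:", np.flatnonzero(model.coef_))

# Model diagnostics: held-out residuals should look like centred noise.
residuals = y_te - model.predict(X_te)
print("residual mean %.3f, sd %.3f" % (residuals.mean(), residuals.std()))
```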
  38. Statistical Methodology Bayesian vs Classical Classical: data are noisy, parameters are fixed but unknown. We use probability distributions to model the noise. Bayesian: we use probability distributions to model our uncertainty about both the data and the parameters
  39. Statistical Methodology Bayesian vs Classical Classical: data are noisy, parameters are fixed but unknown. We use probability distributions to model the noise. Bayesian: we use probability distributions to model our uncertainty about both the data and the parameters In practice: Bayesians “average” over their uncertainty a lot. This means they use a lot of numerical integration (recently: Monte Carlo). Everything has a probability distribution. Some are subjective. Frequentists usually report “their best guess”. They use a lot of classical optimisation (gradient descent etc.) - faster. In cases where the variation is simple/physical, less subjective.
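A hedged illustration of the contrast above, for a single Gaussian mean: the classical answer is one optimised point estimate, while the Bayesian answer is a whole posterior, here summarised by Monte Carlo draws. The Normal(0, 100) prior and the known noise variance are illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=20)   # noisy observations
sigma2 = 1.0                                     # assumed known noise variance

# Classical / frequentist: the MLE optimises the likelihood; for a Gaussian
# mean the optimisation has a closed form, the sample mean.
mle = data.mean()

# Bayesian: a Normal prior with a Normal likelihood gives a Normal posterior;
# we "average over uncertainty" by drawing Monte Carlo samples from it.
prior_mean, prior_var = 0.0, 100.0
post_var = 1.0 / (1.0 / prior_var + len(data) / sigma2)
post_mean = post_var * (prior_mean / prior_var + data.sum() / sigma2)
draws = rng.normal(post_mean, np.sqrt(post_var), size=10_000)

print("MLE (best guess):    %.3f" % mle)
print("posterior mean ± sd: %.3f ± %.3f" % (draws.mean(), draws.std()))
```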
  40. Statistical Methodology Data Mining and Pattern Recognition • Focus on pattern extraction rather than inference • Often no question formulated in advance Machine Learning • Focus on prediction (out-of-sample error) • Largely more automatic, black-box techniques are OK • Huge success stories in stylised worlds • Onus on the user to fit their problem into one of only a few “templates” (classification, regression) - carries big risks. Deep Learning and Cognitive AI • Aims to replicate human cognition, low to mid-level faculties such as vision, hearing, natural language understanding. • Can share methods with statistics/probabilistic modelling, but is mostly fundamentally different in its approach.
  41. Statistical Methodology ANALYTICS vs LEARNING
  42. Statistical Methodology ANALYTICS vs LEARNING retrospective summaries vs generalisation
  43. Statistical Methodology ANALYTICS vs LEARNING Analytics: retrospective summaries; a matter of resources to compute the exact answer (storage, distributed queries, parallel computation, …). Learning: generalisation; mathematics, probability theory, numerical optimisation, logic and algorithms; no “exact” answer
  44. Statistical Methodology Takeaways: • Black boxes aren’t enough • More Data != More Information • Big Data needs Big Models • Quantity vs Quality vs Homogeneity 

  45. Black boxes aren't enough Peter Norvig: statement largely driven by the “quantum step” in machine translation offered by black-box (neural net) techniques, compared to explicit grammar models and classical natural language processing tools. Black-box AI is experiencing a second coming. However, it does rely on (nearly commoditised) natural language preprocessing tools for keyword extraction, named entity recognition, etc. Almost never true. Even if generalisation is not needed, there are always sources of error (measurement, nonresponse), as well as latent factors (e.g., for the effect of X on Y: correlation vs. causality).
  46. More Data != More Information 20 years' worth of credit scoring data, but … • Only one snapshot of each applicant's behaviour • Unknown levels of demographic variability • Unknown levels of temporal variability With more data (usually) comes more heterogeneity: one could say that Big Data = Many Small Datasets. Databases went from flat to relational to NoSQL, but most commodity models are pre-relational! Models are not as re-usable as people think (for example, a decision tree might be a good predictor but a poor customer segmentation tool)
  47. More Data != More Information The signal sometimes simply isn't there. Substantive theory (and common sense) are still needed. External (unobserved) factors, inherent unpredictability. Biased sampling (observational vs prospective, e.g., A/B testing). The lost art of survey sampling (elections?)
  48. Big Data needs Big Models With enough data, everything is significant. This assumes the model is right and the data i.i.d. • Bigger data typically means more sources of variation • Model complexity should grow with the data (Kolmogorov) [Figure: two panels, “Small Data” and “Bigger Data”, plotting Response against Attribute, each overlaying the truth, a complex model, and a simple model.]
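A minimal sketch of the picture this slide describes, under assumed synthetic data (the cubic truth, noise level, and polynomial degrees are all illustrative): a flexible model loses to a straight line on ten points, but wins once there are hundreds, i.e., the complexity the data can support grows with the data.

```python
import numpy as np

rng = np.random.default_rng(2)

def truth(x):                                      # hidden curved relationship
    return 0.5 * x**3 - 2 * x**2 + x

def test_mse(n, degree):
    # Fit a polynomial of the given degree to n noisy points, then
    # measure error on 1000 fresh points from the same truth.
    x = rng.uniform(-5, 15, size=n)
    y = truth(x) + rng.normal(scale=50, size=n)
    coefs = np.polyfit(x, y, degree)
    x_new = rng.uniform(-5, 15, size=1000)
    y_new = truth(x_new) + rng.normal(scale=50, size=1000)
    return np.mean((np.polyval(coefs, x_new) - y_new) ** 2)

for n in (10, 500):
    print("%4d points: simple (deg 1) MSE %.0f, complex (deg 8) MSE %.0f"
          % (n, test_mse(n, 1), test_mse(n, 8)))
```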
  49. Big Data needs Big Models
  50. Big Data needs Big Models Personally a big fan of Bayesian non-parametrics. Zoubin Ghahramani thinks it’s “the rise of the automated statistician”
  51. Big Data needs Big Models Fat Data vs Tall Data Sometimes bigger means more features for the same examples: curse of dimensionality. Modern techniques for sparse learning (p >> n) are a great aid (e.g., Lasso) [Figure: two schematic tables: a “fat” one with few rows and many columns (ID, Age, Income, Tweet, Tweet, Tweet, …) and a “tall” one with many rows but few columns (ID, Age, Income).]
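A short sketch of sparse learning in the p >> n regime, using scikit-learn's Lasso on assumed synthetic data: 50 examples, 500 features, only 3 of them real signals. The L1 penalty recovers a sparse subset where ordinary least squares cannot even be fit uniquely.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 50, 500                                   # far more features than examples
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[10, 200, 450]] = [3.0, -2.0, 1.5]          # only 3 real signals
y = X @ beta + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.2).fit(X, y)               # L1 penalty enforces sparsity
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```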
  52. Big Data needs Big Models Fat Data vs Tall Data Consider recommender systems. As data grows: • more items, more users • each user ranks a fixed number of items: sparser matrices
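A back-of-the-envelope sketch of that sparsity arithmetic, with assumed illustrative numbers: if each user rates a fixed 20 items, the rating matrix gets emptier as the catalogue grows, even as the raw data volume explodes.

```python
# Each user fills ratings_per_user cells out of the `items` cells in their row,
# so the matrix density is ratings_per_user / items.
ratings_per_user = 20
for users, items in [(1_000, 500), (100_000, 50_000), (10_000_000, 1_000_000)]:
    density = ratings_per_user / items
    print(f"{users:>10,} users x {items:>9,} items -> "
          f"{users * ratings_per_user:>11,} ratings, density {density:.4%}")
```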
  53. Big Data needs Big Models Temporal homogeneity: the hidden bottleneck At one extreme, one could ignore all past data as irrelevant. At the other, one could assume the future is like the past. Solutions in the middle include dynamic modelling (very complicated and computationally expensive), and exponential filters of various specifications (my field of expertise) [Figure: density plot over X showing the prior, the posterior, the posterior with a power prior, and the posterior with a flat prior.]
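A minimal sketch of one such middle-ground solution, an exponential filter: past observations are down-weighted geometrically by a forgetting factor. The factor 0.9 and the simulated level shift are illustrative assumptions, not the speaker's specification.

```python
import numpy as np

rng = np.random.default_rng(4)
stream = np.concatenate([rng.normal(0, 1, 200),    # old regime
                         rng.normal(5, 1, 200)])   # level shift (drift)

lam = 0.9                          # forgetting factor: weight on the past
est = 0.0
for x in stream:
    est = lam * est + (1 - lam) * x   # exponentially weighted running mean
print("estimate after drift: %.2f (true new level is 5)" % est)
```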
  54. Big Data needs Big Models Temporal homogeneity: the hidden bottleneck Sometimes there is nothing to do [Figure: two scatterplots of X2 against X1, each showing points from Class 1 and Class 2.]
  55. Big Data needs Big Models Temporal homogeneity: the hidden bottleneck What looks like drift for one model might not be for another, especially when the population, not the concept, is drifting [Figure: scatterplot of y against X, distinguishing old data from new data.]
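A hedged sketch of that distinction: below, the input distribution P(X) shifts (population drift) while the relationship P(y|X) stays fixed, so an input monitor fires even though a model that captured the true relationship keeps predicting well. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def concept(x, noise):                   # fixed relationship: y = 2x + noise
    return 2.0 * x + noise

x_old = rng.normal(0, 1, 500)            # old population of inputs
x_new = rng.normal(3, 1, 500)            # inputs have drifted
y_new = concept(x_new, rng.normal(0, 1, 500))

# An input monitor sees a big shift ...
print("input mean: %.2f -> %.2f" % (x_old.mean(), x_new.mean()))
# ... but a model with the true slope still predicts the new data well:
mse = np.mean((y_new - 2.0 * x_new) ** 2)
print("MSE of the old model on new data: %.2f (noise variance is 1)" % mse)
```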
  56. Big Data needs Big Models Robustness Important to have built-in guarantees. Robustness and model diagnostics are the unsung heroes of classical statistics. Complicating the assumption set sometimes leads to overly complex models. Robustness is often the expedient solution.
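A tiny sketch of the robustness point on assumed data: a single corrupted reading wrecks the sample mean, while the median barely moves; a built-in guarantee rather than a more elaborate model of how the outlier arose.

```python
import numpy as np

rng = np.random.default_rng(6)
clean = rng.normal(10, 1, 100)
dirty = np.append(clean, 10_000)          # one corrupted reading

# The mean has no built-in guarantee; the median tolerates gross errors.
print("mean:   %8.2f -> %8.2f" % (clean.mean(), dirty.mean()))
print("median: %8.2f -> %8.2f" % (np.median(clean), np.median(dirty)))
```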
  57. Do not torture the data The Wall Street Journal: “Big Data Unveils Some Weird Correlations” • orange used cars are more reliable • taller people are better at repaying loans • http://www.tylervigen.com
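A hedged sketch of why such weird correlations appear when you torture the data: among enough unrelated series, the strongest pairwise correlation is large by chance alone. The 50 random series are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
series = rng.normal(size=(50, 30))   # 50 unrelated series, 30 points each

corr = np.corrcoef(series)           # all pairwise correlations
np.fill_diagonal(corr, 0)            # ignore each series' self-correlation
print("strongest 'discovered' correlation: %.2f" % np.abs(corr).max())
```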

  58. Streaming data Exact answers are sometimes possible (e.g., the running mean). But sometimes they are not (e.g., top-K, the median). Streaming approximate algorithms are fast, and can be very accurate, but they can be complicated (e.g., HyperLogLog). Keep a constant memory footprint. Keep up (do not queue)
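A minimal sketch of an exact answer in constant memory: the running mean (and variance, via Welford's algorithm) needs only three numbers no matter how long the stream gets, whereas the median or top-K admit no such exact summary.

```python
def welford(stream):
    # Exact one-pass mean and variance in O(1) memory.
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n                 # exact running mean
        m2 += delta * (x - mean)          # running sum of squared deviations
    return mean, m2 / (n - 1)             # mean and sample variance

print(welford(iter(range(1_000_000))))    # constant memory, single pass
```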
  59. Streaming data However, in Machine Learning, there are no “exact” answers. Will batch always outperform streaming (more resources)? • Temporal heterogeneity (drift) • Simulated annealing • Overfitting (prequential learning) www.ment.at/blog.html Keep a constant memory footprint. Keep up (do not queue)
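A hedged sketch of prequential ("test then train") evaluation, mentioned above: each point is first predicted, then used to update the model, so the error estimate never reuses training data. The online mean predictor is an illustrative stand-in for a real model.

```python
import numpy as np

rng = np.random.default_rng(8)
stream = rng.normal(loc=2.0, scale=1.0, size=10_000)

pred, n, total_sq_err = 0.0, 0, 0.0
for x in stream:
    total_sq_err += (x - pred) ** 2   # 1. test: predict before seeing x
    n += 1
    pred += (x - pred) / n            # 2. train: update the model on x
print("prequential MSE: %.3f" % (total_sq_err / n))
```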
  61. Infrastructure I haven’t discussed infrastructure as much. It’s critical. If you are late, sometimes you might as well give up. Parallelisation (e.g., GPUs), distribution (e.g., HDFS), streaming (e.g., Spark Streaming), λ-architectures … Algorithms often need to be designed from scratch. Great progress in this direction. Keep working on it!
  62. datastream.io
  63. datastream.io additional deployment options
  64. How to manage data scientists • Treat negative results like you treat positive results • Encourage lab reports: data analysis is a process • Do not overfit. Do not fish for p-values. Do not torture the data • Specify hypotheses in advance whenever possible. Then test • Black box solutions are great for prediction. Only. • Do not silo data scientists • Incorporate expert knowledge whenever possible. Explicit prior beliefs are not a bias risk.
  65. Conclusions • Knowledge is power. Knowledge relies on data. 
 • The process of extracting knowledge from data has become more efficient and more powerful than ever – but it’s still far from automatic (we are working on it ...) 
 • Big Data needs Big Models 
 • More Data != More Information 
 • A Data Scientist is a team, not an individual 

  66. Afterthought What about strong Artificial Intelligence? Machines are outperforming humans in an increasingly broad array of cognitive tasks. Last time this happened we had the Industrial Revolution. Data Science is at the cusp of this wave. This is an exciting time, but it also carries a lot of responsibility.
  67. Afterthought If machines replace us, there will be only one profession left: AI programmers and Data Scientists