Knowledge Discovery

Knowledge Discovery
& Data Mining
22nd ACM SIGKDD 2016
André Karpištšenko

~80 sessions for 2,700 participants
• Business Applications and Frameworks at Scale
• Data Streams Mining
• DashOpt features
• Outlier Detection
• Bayesian Optimization
• Deep Learning
• Investing into AI and Data
• Bonus keywords
88 countries, 35% YoY, 15-20% acceptance

Business Application Examples
• Consumer Internet focus: Content Ranking,
Recommendation, User Intent and Context Prediction
• Industrial Internet focus: Autonomy, Predictive
Maintenance, Operational Intelligence, Production Planning
• B2B focus: Targeting, Lead Generation, Sales Development,
Opportunity Management, Account Management
• Web content analytics: Image, Video, Text Classification for
Relevance, Products Categorization, Sentiments
• Other: Cyber Security, Fraud/Spam Detection, NLP, Speech
Recognition, Image/Video Recognition

Predictive Modeling Flow
DashOpt
Feature
Engineering
Raw
Data
Raw
Features
Labels
Feature
Integration
Features
with Labels
Data
Partitioning
Training
Data
Validation
Data
Testing
Data
Model Training
Evaluate for
model selection
Compute offline
evaluation metrics
Best model
Offline scoring
and indexing
Online/offline
systems
Online A/B test
Label
preparation Log data
Scoring
features
Raw features
Feature
integration
Model
Performance
Test Results

Data Technologies
• Most common: HDFS, MapReduce, Spark, Hive
• Decision spectrum: build, assemble, buy
• Factors: network, open-source, maturity, needs
Exploratory Production

Classic ML Platform at Scale
Workflows
HDFS Feature Mart Ground Truth Models Scores
MapReduce / Yarn
Workflow Scheduler & Manager
Workflows Workflows
Intelligence Engine
Metadata
Store
Pig, Hive,
Python, Scala,
Shell script, …
Feature Engineering
Libraries
Machine Learning
Libraries
Drivers
Drivers
Application 1 Application 2 Application N
…

http://github.com/szilard/benchm-ml
Baseline Performance Benchmark

https://sites.google.com/site/iotminingtutorial/
IoT Data Streams Mining
• 10 billion devices in 2015 -> 34 billion devices by 2020
• Continuous data, dynamic models, distributed, few seconds

Streams Mining: Actors Model
Data processing pipeline Distributed processing
Kappa Architecture
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Outlier Detection
• Single point anomaly detection: likelihood over distribution
• Finding anomalous groups: divergence estimation
• Methods: percentage change, T-test, Chi-square test, Generalized ESD (Extreme
Studentized Deviate) test, Seasonal Hybrid ESD, etc.
• Goal: move from detection to automated response

Outlier Detection in Practice
• Too many detections of too little value
• Use methods for thresholds
• Breakout detection and Concept Drift
• For changing distributions move baselines over time
• Risk of overfitting to known anomalies, not finding unknown anomalies

Bayesian aka Active Optimization
• Examples: Design of Experiments, hyper-parameters of
supervised learning, algorithms tested with simulations
f is an unknown expensive black-box function with the goal to
approximately optimize f with as few experiments as possible
• No free lunch theorem
• Other bio-inspired
algorithms for
optimization exploitation
and exploration: neural
networks, genetic
algorithms, swarm
intelligence, ant colony
optimisation, etc.

Bayesian Optimization in Practice
• SigOpt experience: 20 dimensions, above human capacity.
• Uber ATC experience: scaling active optimization to high
dimensions as the default works reliably for 5-7 dim.
• Variables are added during optimization.
• Choose fidelity using heuristics.

Deep Learning
• Compute power, GPU, learning architectures and a lot of
labeled data are what drive DL
• Applied for Vision and Speech: matches human performance
• Not possible where experiments are costly: biotech
• Kaggle winners are not DL models: tree ensembles, SVMs
• Common technologies: TensorFlow, Caffe, Theano, Keras
• Thousands of pieces of software: modules and layers
• Explainability and interpretability are the next big things
• EU regulation. Tradeoff: accuracy vs explainability.

Deep Learning Trends
• Vision nets are deeper and structured (Larsson 2016)
• Language nets have also dynamics, memory and attention
(Rocktaschel 2016, Miller 2016)
• Probabilistic programming (Lake, Tenenbaum)
• Programs as networks (Riedel)
• The Neural Programmer and Interpreter for learning programs
(Reed et al 2016)
• Computation graphs interacting with memory
• Loop for reasoning for nested questions (Miller 2016)
• Generative adversarial networks (Reed 2016). Models capable of
imagining images, videos and text.

Investing into AI and Data
• Data acquisition, real-time detection and visualization not solved yet.
• Empower more people to do data science. Automate routines.
• Unsolved problems are learning from unlabeled data, planning,
reasoning, problem solving, concept formulation, 1/10k compute.
• Key decisions: Timing, accuracy in what is hard, find verticals and
focus, identify differentiation & size of the prize & people & partners
Business of outliers: 1% capital returns 526x, 48% returns 0
0
450
900
Q2'15 Q3'15 Q4'15 Q1'16 Q2'16
Peak Data
Peak ML
VC assessment:

Bonus Keywords
• Lifelong Machine Learning: systems approach, transfer learning,
never-ending learners. Useful for knowledge build.
• Graphons: graph convergence and limits through infinite number of
vertices. Useful for privacy preserving mining.
• Computational Social Science: how individuals interact to produce
collective behaviour. Individuals exert more effort by themselves than
groups.
• Information Security: trusted key management is most sensitive.
Secret must be changed frequently. Confidentiality easier to violate
than authenticity. Integrity. Offence more lucrative than defence.
• Enterprise Data: in reality “random data salad” prone to constant
change due to M&A, politics, dynamic schema DBs (e.g. Mongo), legacy
burden, restructuring, leadership changes, data hoarding. Machine
driven, human guided processes required.

Thinking about Value from Data Science

Knowledge Discovery

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Knowledge Discovery

Similaire à Knowledge Discovery (20)

Plus de André Karpištšenko

Plus de André Karpištšenko (7)

Dernier

Dernier (20)

Knowledge Discovery