Design Challenges for Real Predictive Platforms
Max Gasner
@gasnerpants
What I’ve Been Doing
- Navia Systems (2010-2011)
  - probabilistic programming
- Prior Knowledge (2011-2012)
  - the Veritable API (went live in July 2012)
- Acquired by salesforce.com (2012–2014)
  - predictive intelligence team

APIs and cloud services to expose nonparametric Bayes (!?) … to a general audience (?!)
Predictive Platforms?
- Methods have advanced to support flexible use
- Market is getting there: lots of data, many frustrated business users
- Let’s not mistake specialist problems for the wider need
- For most business problems, it’s a cold start and the competition is no predictive solution
The Database Analogy
“Just” deterministic storage, collation, query, sorting,
aggregation
Early database systems were purpose-built by consultants and
tied tightly to implementation details
RETRIEVE FORCE STATUS WITH RUNWAY LENGTH > 8000,
GCD(DENVA)>GCD(DENVA,BEVENS) THEN LIST AFLD NAME, GCD,
RUNWAY LENGTH (System 473L, 1966)

Relational database revolution (largely) decoupled schema
from storage, interface from implementation
The Database Analogy
[Diagram: the database stages “Indexing” and “Query” set beside their predictive counterparts “Modeling” and “Prediction”, with these annotations]
- Ingest from many sources; data is typed
- Sensible defaults, but highly configurable and extensible by experts
- Flexible query; some queries will fail; sensible defaults, but highly configurable and extensible by experts
- Many clients; databases outlive initial applications
Database Lessons
- “Decouple” implementation so users can be productive at different levels of abstraction
- Extensive gains (more applications are possible) and intensive gains (applications are easier to develop and maintain)
- Quantity >> quality: more is much more than better
Database Lessons
“Everyone” writes SQL:
SELECT * FROM Patients WHERE Icd9 LIKE '250%' AND DischargeDate = '2014-02-11'

It needs to be this easy to deploy and query models:
INFER WillReadmit FROM (SELECT * FROM Patients WHERE Icd9 LIKE '250%' AND DischargeDate = '2014-02-11')
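
A minimal sketch of the idea above, in Python with the standard-library sqlite3 module. The table layout, the toy rows, and the helper names (build_db, select_cohort, infer_will_readmit) are assumptions for illustration only. It assumes ISO-formatted date strings and a prefix match on the ICD-9 code, and the INFER stand-in is a plain observed frequency, not the model behind any real system.

```python
import sqlite3

# Hypothetical toy rows: (PatientId, Icd9, DischargeDate, WillReadmit).
# WillReadmit is None (NULL) for patients whose outcome is not yet known.
ROWS = [
    (1, "250.00", "2014-01-15", 1),
    (2, "250.00", "2014-01-20", 0),
    (3, "250.00", "2014-01-28", 1),
    (4, "401.9", "2014-02-03", 0),
    (5, "250.00", "2014-02-11", None),
    (6, "401.9", "2014-02-11", None),
]


def build_db():
    """Create an in-memory Patients table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Patients "
        "(PatientId INTEGER, Icd9 TEXT, DischargeDate TEXT, WillReadmit INTEGER)"
    )
    conn.executemany("INSERT INTO Patients VALUES (?, ?, ?, ?)", ROWS)
    return conn


def select_cohort(conn, icd9_prefix, discharge_date):
    """The ordinary SQL step: which patients match the filter?"""
    return conn.execute(
        "SELECT PatientId, Icd9 FROM Patients "
        "WHERE Icd9 LIKE ? AND DischargeDate = ? ORDER BY PatientId",
        (icd9_prefix + "%", discharge_date),
    ).fetchall()


def infer_will_readmit(conn, icd9_prefix, discharge_date):
    """Stand-in for INFER: readmission rate among labeled rows with the same code."""
    results = []
    for patient_id, icd9 in select_cohort(conn, icd9_prefix, discharge_date):
        n, k = conn.execute(
            "SELECT COUNT(*), SUM(WillReadmit) FROM Patients "
            "WHERE Icd9 = ? AND WillReadmit IS NOT NULL",
            (icd9,),
        ).fetchone()
        # With no labeled history there is nothing to report, so return None.
        results.append((patient_id, None if n == 0 else k / n))
    return results
```

The point of the sketch is the shape of the call: the caller writes the same filter either way, and only the verb changes from "select" to "infer".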
Desiderata for Real Platforms
- Be robust to the real world: data is messy, users are inexperienced, and problems are underspecified
- Be honest about limitations: fail gracefully, but always fail when to do otherwise would be misleading
- Be flexible to changes in the structure of data and the questions that matter
- Be simple to use and provide the basic building blocks for complex applications (but don’t try to solve language, vision, and dating)
Robust
- Far more data is usually available than is understood
- Every dataset has missing values
- Every value is noisy

- Systems shouldn’t fail in the presence of irrelevant or partially observed data
- Systems should be conservative in the face of uncertainty
Honest
- No system is adequate to every problem or dataset
- Some mistakes are expensive and some are cheap
- Black boxes are easy to use and hard to trust

- Systems should provide measures of uncertainty
- Systems should explain their reasoning (in the sense of EXPLAIN)
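
One way to picture "provide measures of uncertainty" and "fail when to do otherwise would be misleading" is a rate estimate that refuses to answer on thin evidence. The sketch below is an assumption-laden illustration in Python: it uses a Beta posterior with a uniform prior, and the function names and the 0.1 threshold are hypothetical choices, not anything taken from the talk.

```python
from math import sqrt


def beta_posterior(successes, trials):
    """Posterior mean and standard deviation of a rate under a uniform Beta(1, 1) prior."""
    a = 1 + successes
    b = 1 + trials - successes
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


def predict_or_refuse(successes, trials, max_std=0.1):
    """Return an estimate with its uncertainty, or refuse when the evidence is too thin.

    max_std is an arbitrary illustrative threshold.
    """
    mean, std = beta_posterior(successes, trials)
    if std > max_std:
        return {"answer": None, "std": std, "reason": "too little evidence"}
    reason = f"{successes} of {trials} observed, uniform prior"
    return {"answer": mean, "std": std, "reason": reason}
```

With 80 successes in 100 trials this returns an answer with a small spread; with 2 in 3 it declines, and the "reason" field is a crude nod to EXPLAIN.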
Flexible
- The world isn’t a real-valued matrix
- Modeling choices shouldn’t mean we fake our datatypes
- The world is nonstationary and every predictive problem is streaming

- Systems should handle heterogeneous data natively
- Systems should retrain (and validate) continuously
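
"Retrain (and validate) continuously" can be sketched as test-then-train (prequential) evaluation: every incoming record is first used to score the current model and only then used to update it. The class and function names below are hypothetical, and the majority-label model is a deliberately trivial stand-in that keeps categorical keys as they are rather than encoding them as numbers.

```python
from collections import Counter, defaultdict


class StreamingMajorityModel:
    """Predicts the most frequent label seen so far for a categorical key."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, key):
        seen = self.counts.get(key)  # .get does not create an entry
        if not seen:
            return None  # nothing observed yet for this key
        return seen.most_common(1)[0][0]

    def update(self, key, label):
        self.counts[key][label] += 1


def prequential_accuracy(stream):
    """Score each record before learning from it; return (correct, scored)."""
    model = StreamingMajorityModel()
    correct = 0
    scored = 0
    for key, label in stream:
        guess = model.predict(key)  # test first
        if guess is not None:
            scored += 1
            correct += int(guess == label)
        model.update(key, label)  # then train
    return correct, scored
```

Because validation happens on data the model has not yet seen, the running score doubles as a monitor for drift, without a separate held-out set.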
Simple
- Predictive systems need to be easy to engineer with
- And they need to be easy to engineer
- The business user and the modeler both have valid interests in a predictive system and both need to be able to use it

- Systems should be decoupled into pieces
- Systems should expose a small set of operations that can compose to form complex predictive systems
Case Study: BayesDB
- Built on a flexible general model for denormalized flat data tables*
- Separates index backend(s) from query frontend
- Exposes query interface through an SQL-like language, “BQL”
- Open-source project (looking for hackers)

@vmansinghka
http://probcomp.csail.mit.edu/bayesdb

*Exercise for the reader: what about relational, graph, free text, and time series data?
Building Blocks
- ANALYZE
  Construct models (like table views) from the dataset (table)
- SIMULATE
  Generate new (unobserved) rows like those in the table
- INFER
  Fill in “missing” (or target) values for partially-observed rows
- ESTIMATE PAIRWISE DEPENDENCE PROBABILITIES (!!)
  Exposes the structure of the learned model
Separate Concerns
- ANALYZE abstracts “what is doing the analysis”, decouples model choice, inference strategy, validation from query
- Enables heterogeneous backends, on-the-fly model selection, incremental model updates, cost-based modeling
- Analyses of the same data might treat it differently for different purposes
- Challenge: training the predictive DBA?
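
The decoupling described above can be sketched as a small backend interface that the query side calls without knowing which model sits behind it. Everything here is hypothetical (the Backend protocol, both toy backends, and run_query); it is not BayesDB's actual architecture, only an illustration in Python of swapping backends under an unchanged query.

```python
from collections import Counter
from typing import Protocol


class Backend(Protocol):
    """What the query frontend needs from any analysis backend (assumed interface)."""

    def analyze(self, rows: list[dict]) -> None: ...
    def infer(self, target: str, given: dict) -> dict: ...


class FrequencyBackend:
    """Answers with observed frequencies among rows matching the conditions."""

    def __init__(self):
        self.rows = []

    def analyze(self, rows):
        self.rows = list(rows)

    def infer(self, target, given):
        hits = [r for r in self.rows if all(r.get(c) == v for c, v in given.items())]
        counts = Counter(r[target] for r in hits)
        total = sum(counts.values())
        return {value: n / total for value, n in counts.items()}


class UniformBackend:
    """No-information baseline: uniform over the values seen for the target column."""

    def __init__(self):
        self.values = {}

    def analyze(self, rows):
        self.values = {}
        for row in rows:
            for column, value in row.items():
                self.values.setdefault(column, set()).add(value)

    def infer(self, target, given):
        seen = sorted(self.values.get(target, ()))
        return {value: 1 / len(seen) for value in seen}


def run_query(backend: Backend, rows, target, given):
    """Frontend code: identical regardless of which backend is plugged in."""
    backend.analyze(rows)
    return backend.infer(target, given)
```

Swapping FrequencyBackend for UniformBackend changes the answer but not a single line of run_query, which is the property the slide is after.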
Flexible Query
- SIMULATE exposes the joint distribution but no actual values (anonymization, synthetic data generation)
- INFER supports traditional single-valued prediction, but also joint prediction, conditioned on any combination of values
- Flexibility goes hand in hand with consistency: expect that the results will be consistent in distribution
- Challenge: exposing query to the interactive end user?
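
To make "joint prediction, conditioned on any combination of values" concrete, here is a toy Python sketch in which the empirical distribution of a small table stands in for a learned model. The table, column names, and functions are hypothetical. Note the limitation: a real SIMULATE would sample from a model rather than resample observed rows, so this sketch does not provide the anonymization property mentioned above.

```python
import random
from collections import Counter

# Hypothetical toy table; every column is categorical.
TABLE = [
    {"smoker": "yes", "age": "old", "readmit": "yes"},
    {"smoker": "yes", "age": "young", "readmit": "no"},
    {"smoker": "no", "age": "old", "readmit": "no"},
    {"smoker": "no", "age": "young", "readmit": "no"},
    {"smoker": "yes", "age": "old", "readmit": "yes"},
    {"smoker": "no", "age": "old", "readmit": "yes"},
]


def infer(table, targets, given):
    """Joint conditional distribution over `targets`, given any column/value constraints."""
    rows = [r for r in table if all(r[c] == v for c, v in given.items())]
    if not rows:
        # Refuse rather than return a misleading answer.
        raise ValueError("no rows match the given conditions")
    counts = Counter(tuple(r[c] for c in targets) for r in rows)
    total = sum(counts.values())
    return {values: n / total for values, n in counts.items()}


def simulate(table, columns, given, n, seed=0):
    """Draw n value tuples for `columns` from the same conditional distribution."""
    dist = infer(table, columns, given)
    values, weights = zip(*dist.items())
    return random.Random(seed).choices(values, weights=weights, k=n)
```

Because simulate draws from exactly the distribution infer reports, the two operations agree in distribution, which is the consistency the slide asks for.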
Structure Discovery
- ESTIMATE PAIRWISE DEPENDENCE PROBABILITIES (eppdepp?) exposes the structure of the model
- Moving to broader measures of dependence than correlation
- Structure is key for iterative, exploratory workflows
- Structure feeds into optimization of query and inference strategies
- Challenge: representing structural uncertainty?
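
A small worked example of why "broader measures of dependence than correlation" matter. The Python sketch below uses mutual information as a stand-in dependence measure; this is a swapped-in technique for illustration and is not how dependence probabilities are computed in BayesDB. The data are made up: y is fully determined by x, yet the linear correlation is zero.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient (linear dependence only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# y = x squared: completely dependent on x, but not linearly.
XS = [-2, -1, 0, 1, 2]
YS = [x * x for x in XS]
```

Here pearson(XS, YS) is zero while mutual_information(XS, YS) is well above zero, so a correlation-only view would report "no relationship" between two columns that are in fact tightly linked.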
Expose Uncertainty
- Values should come with error bars, and explanations of how they were derived
- Automated systems can use uncertainty to make cost-benefit tradeoffs (do show this ad, but don’t let this patient be discharged without this test being reviewed)
- Uncertainty lets us move beyond anecdotes
- Challenge: uncertainty is hard to understand and explain
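
The ad-versus-discharge contrast can be written as an expected-cost calculation: the same probability leads to different actions when the costs of being wrong differ. The Python sketch below is illustrative only; the function name and every cost figure are invented for the example.

```python
def choose_action(p_event, costs):
    """Pick the action with the lowest expected cost.

    costs maps each action to (cost if the event happens, cost if it does not).
    Negative costs are gains.
    """
    expected = {
        action: p_event * cost_yes + (1 - p_event) * cost_no
        for action, (cost_yes, cost_no) in costs.items()
    }
    best = min(expected, key=expected.get)
    return best, expected


# Hypothetical numbers. Event = "user clicks the ad".
AD_COSTS = {
    "show": (-1.0, 0.01),  # a click earns 1.0; an ignored impression costs 0.01
    "skip": (0.0, 0.0),
}

# Hypothetical numbers. Event = "the unreviewed test hides a problem".
DISCHARGE_COSTS = {
    "discharge_now": (10000.0, 0.0),  # a missed problem is very expensive
    "review_test": (50.0, 50.0),  # a review costs the same either way
}
```

With p_event = 0.02 in both cases, choose_action picks "show" for the ad and "review_test" for the discharge: a cheap mistake is worth risking, an expensive one is not.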
Hard Problems: Getting Data In
- Source vs. Dataset (vs. Transformation, vs. Multi-Dataset, …)
- Can we add more semantics to data definitions and schemas, to lever our prior knowledge?
- Can we use cloud services/crowdsource to better transform and interpret inputs?
- We need to design the entire data collection and storage pipeline to better support analytic consumers of data
Hard Problems: Getting Results Out
- Not clear what the right default presentation is
- Much work to be done in exposing and explaining predicted values and uncertainty
- As predictive systems start to support UIs (beyond News Feed), we need to design new paradigms for interaction with imputed and uncertain values
- It’s hard to form mental models of reactive/adaptive systems
Hard Problems: PL/BQL?
- The holy grail of “custom data types”: columns with custom models written in probabilistic programming languages
- Think ICD9: we have a really strong prior (medicine + biology) but no way to express it, let alone do inference
- Put domain-specific modeling in the hands of “anyone”
- How many people have written some PL/SQL? How many people have written production database internals?
Predictive in the Ecosystem
- Today: many specialized views of data (extending the basic OLAP/OLTP distinction for a new era of bigger data and new demands)
- Tomorrow: predictive services as true services inter pares, with many clients of their own, deriving data from the same sources of truth as other services
- Lots of work to do flowing provenance, prior knowledge, and schemata through the entire pipeline
Ecosystem?
- Let a hundred flowers blossom, let a hundred general-purpose predictive platforms contend
- Lots of uncertainty about the right (combination of) models to support the interface
- Lots of room to innovate on API and presentation
- Many problems in business very eager for credible solutions
- Database analogy: we are waiting for Codd
Thanks!
Max Gasner
@gasnerpants

Strata 2014: Design Challenges for Real Predictive Platforms
