Enhance Predictive Modeling with Better Data Preparation
Ritu Jain, Director of Industry & Solutions Marketing, Alteryx Inc.
Dr. Dan Putler, Chief Scientist, Alteryx Inc.
April 6, 2016
Speakers
• Ritu Jain, Director, Industry & Solutions Marketing, Alteryx
• Dr. Dan Putler, Chief Scientist, Alteryx
Download a FREE Trial: alteryx.com/trial | © 2016 Alteryx, Inc. | Confidential
Agenda
• Alteryx Overview
• Thinking Through Predictive Model Use Case
• Starting with the Right Data
• Choosing the Right Modeling Technique
The Leading Platform for Self-Service Data Analytics
Corporate Info.
• Customer Success: 1500+ customers across the world
• 95%+ renewal rate
• Strong Foundation & Investment for Innovation
• 300+ associates across North America, Europe & Australia
Designed for Analyst Enablement
[Diagram labels: Input (All Relevant Data), Prep & Blend, Enrich, Analyze, Share, Output (All Popular Formats)]
Understanding the Business Issue
• What decision needs to be made?
• What information is needed to inform that decision?
  • Developing a mental model of the process typically helps a great deal in identifying all the potentially relevant information
• What type of analysis can provide the exact information needed to inform the decision?
Two Specific Use Cases to Illustrate Business Issue Understanding
• How much electricity does a utility need to have the capacity to supply for any given hour tomorrow?
• To which of its customers should an outdoor sports retailer send a paddling sports catalog?
The Electricity Supply Use Case
• The question “How much electricity does a utility need to have the capacity to supply for any given hour tomorrow?” actually involves two underlying decisions:
  • Which of our existing power plants should we start to bring online or take offline?
  • Should we purchase electricity from the spot market and, if so, how much?
• The critical information needed is how much electricity will be demanded in each hour of the day tomorrow
  • Unfortunately, this information is not known at the time the decisions need to be made, but it can be predicted using a predictive model
The Electricity Supply Use Case
• What factors are likely to drive the demand for electricity in a given hour tomorrow?
  • This is where having a mental model of the process can be very handy
• Some factors that are likely to be important:
  • Day of the week
  • Hour of the day
  • The temperature in that hour and in the preceding hour
  • The month of the year
• One issue is that, like electricity demand, the temperature in that hour (or even the preceding hour) tomorrow will not be known at the time decisions are made (today). However, the temperature in each hour tomorrow can itself be predicted using a model
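As a rough illustration of how the factors above might be turned into model inputs, here is a minimal Python sketch. The function name `demand_features`, the field names, and the choice to store hour and month as string labels are assumptions made for this example, not something taken from the presentation.

```python
from datetime import datetime, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def demand_features(hour_start, predicted_temps):
    """Build one predictor record for the hour beginning at `hour_start`.

    `predicted_temps` is a hypothetical dict mapping the start of each hour
    to a predicted temperature, since tomorrow's actual temperatures are
    not known when the decision is made.
    """
    previous_hour = hour_start - timedelta(hours=1)
    return {
        # Calendar factors are stored as labels so that a modeling tool
        # treats them as categories rather than as quantities.
        "day_of_week": DAY_NAMES[hour_start.weekday()],
        "hour_of_day": str(hour_start.hour),
        "month": str(hour_start.month),
        # Temperature in this hour and in the preceding hour.
        "temp_this_hour": predicted_temps[hour_start],
        "temp_previous_hour": predicted_temps[previous_hour],
    }


# Usage with made-up temperatures for two consecutive hours.
temps = {
    datetime(2016, 4, 7, 13): 61.0,
    datetime(2016, 4, 7, 14): 63.5,
}
row = demand_features(datetime(2016, 4, 7, 14), temps)
```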
The Electricity Supply Use Case
• Available factors to predict hourly temperatures
  • The forecast high and low for the day from the National Weather Service or other organization
  • The number of minutes since sunrise or sunset at the start of each hour
  • The temperature for the same hour on the previous day
• In this case two different predictive models are needed:
  • Predict hourly temperatures for the next day
  • Predict hourly electricity use given temperatures and other factors
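The chaining of the two models can be sketched as follows. This is only an outline under stated assumptions: `forecast_hourly_demand` is a hypothetical helper, the two "toy" models are placeholders with made-up coefficients standing in for fitted models, and the minutes-since-sunrise input is omitted to keep the example short.

```python
def forecast_hourly_demand(hours, temp_inputs, temp_model, demand_model):
    """Feed predicted temperatures (stage 1) into the demand model (stage 2)."""
    predicted_temp = {}
    predicted_demand = {}
    for hour in hours:  # assumed to be integers in chronological order
        # Stage 1: predict the temperature for this hour.
        predicted_temp[hour] = temp_model(temp_inputs[hour])
        # Assumption: when no prediction exists for the preceding hour,
        # reuse this hour's predicted temperature.
        previous = predicted_temp.get(hour - 1, predicted_temp[hour])
        # Stage 2: predict demand from the predicted temperatures.
        predicted_demand[hour] = demand_model({
            "hour": hour,
            "temp": predicted_temp[hour],
            "temp_previous": previous,
        })
    return predicted_temp, predicted_demand


def toy_temp_model(x):
    # Placeholder coefficients, not estimates from real data.
    return 0.5 * x["same_hour_yesterday"] + 0.25 * (x["forecast_high"] + x["forecast_low"])


def toy_demand_model(f):
    # Placeholder coefficients, not estimates from real data.
    return 100.0 + 2.0 * f["temp"] + 1.0 * f["temp_previous"]


inputs = {
    13: {"forecast_high": 70.0, "forecast_low": 50.0, "same_hour_yesterday": 60.0},
    14: {"forecast_high": 70.0, "forecast_low": 50.0, "same_hour_yesterday": 64.0},
}
temps, demand = forecast_hourly_demand([13, 14], inputs, toy_temp_model, toy_demand_model)
```

One consequence of this design is that errors in the temperature predictions carry through to the demand predictions, so both models need to be checked against new data.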
The Paddle Sports Catalog Use Case
• The question “To which of its customers should an outdoor sports retailer send a paddling sports catalog?” has a definitive answer: send it to any customer for whom the full cost of sending the catalog is less than the expected margin dollars (item price less item cost) from the items that customer would purchase from the catalog
• While the criterion for deciding whether a specific customer should be sent a catalog is definitive, knowing whether that customer meets the criterion is another matter
• Predictive models can help to provide the information needed on whether a particular customer is expected to meet the criterion
• Two models would typically be used
  • A model that predicts whether a customer will purchase anything from the catalog at all
  • A model of the margin dollars a customer will generate conditional on using the catalog
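One way to combine the two model outputs into the mailing rule is sketched below. It assumes the expected margin is the purchase probability multiplied by the margin predicted for a purchasing customer; the function name, the customer records, and the cost figure are all made up for illustration.

```python
def should_send_catalog(p_purchase, margin_if_purchase, mailing_cost):
    """Apply the mailing rule: send when expected margin exceeds the full cost.

    p_purchase: output of the first model (probability of any purchase).
    margin_if_purchase: output of the second model (margin dollars,
        conditional on the customer purchasing).
    """
    expected_margin = p_purchase * margin_if_purchase
    return expected_margin > mailing_cost


# Hypothetical scored customers and a hypothetical full cost per catalog.
customers = [
    {"id": "A", "p_purchase": 0.10, "margin_if_purchase": 40.0},
    {"id": "B", "p_purchase": 0.02, "margin_if_purchase": 30.0},
]
MAILING_COST = 1.50

mailing_list = [
    c["id"] for c in customers
    if should_send_catalog(c["p_purchase"], c["margin_if_purchase"], MAILING_COST)
]
```

Customer A has an expected margin of about 4 dollars against a cost of 1.50 and is mailed; customer B has an expected margin of about 0.60 and is not.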
The Paddle Sports Catalog Use Case
• In selecting variables for the two models we have identified, we need to use information that is known prior to sending a catalog to a customer. There are a number of ready candidates:
  • Demographic and socioeconomic information: age, income, family status
  • Location information: state; travel time to a store; proximity to the sea, lakes, or rivers
  • Past purchase behavior, typically measured using the concept of Recency, Frequency, and Monetary Value (or RFM)
• We also need to have observations on an appropriate target variable. There are two ways to do this:
  • Use appropriate historical data (e.g., the response to last year’s paddle sports catalog)
  • Use a “test” approach, where we send the catalog to a sample of our customers and then use this data to predict the behavior of all our customers
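The RFM measures mentioned above could be derived from a transaction history roughly as follows. The data layout and the helper name `rfm` are assumptions, and "monetary value" is taken here to mean total spend, although other definitions (such as average order value) are also in use.

```python
from datetime import date


def rfm(transactions, as_of):
    """Summarize (customer_id, purchase_date, amount) tuples per customer.

    Only purchases made before `as_of` are used, so the predictors contain
    nothing that would be unknown when the catalog is mailed.
    """
    summary = {}
    for customer_id, purchase_date, amount in transactions:
        if purchase_date >= as_of:
            continue
        record = summary.setdefault(
            customer_id, {"last": purchase_date, "frequency": 0, "monetary": 0.0}
        )
        record["last"] = max(record["last"], purchase_date)
        record["frequency"] += 1
        record["monetary"] += amount
    return {
        customer_id: {
            "recency_days": (as_of - record["last"]).days,  # days since last purchase
            "frequency": record["frequency"],               # number of purchases
            "monetary": record["monetary"],                 # total spend (assumed definition)
        }
        for customer_id, record in summary.items()
    }


# Made-up transactions for two customers.
history = [
    ("A", date(2015, 6, 1), 120.0),
    ("A", date(2016, 3, 1), 80.0),
    ("B", date(2015, 1, 15), 45.0),
]
scores = rfm(history, as_of=date(2016, 4, 1))
```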
The Nitty-Gritty of Developing a Predictive Model: Modeling Method
• There is a large (overwhelming?) number of different modeling methods available
• There are two criteria for selecting the final modeling method to use:
  • Selecting an appropriate modeling method, which is largely driven by the data type of the target variable (categorical or numeric)
  • Selecting the model (hence the method) with the greatest predictive efficacy for predicting new data among a set of appropriate models
• Basic model types
  • Classification models, which predict the category into which a case (e.g., a customer) falls
  • Regression models, which predict numeric quantities
• Linking back to the use cases
  • Classification: Whether a customer will respond to the paddling sports catalog
  • Regression: The margin dollars from a customer who receives the catalog; hourly temperature and electricity demand
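The two selection criteria can be illustrated with a small sketch. The helper names are hypothetical, candidate models are represented as plain functions, and root mean squared error on held-out records is used as one possible measure of predictive efficacy for regression; the presentation does not name a specific measure.

```python
import math


def model_family(target_values):
    """Criterion 1: pick the family from the target's data type."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in target_values
    )
    return "regression" if numeric else "classification"


def rmse(predict, holdout):
    """Root mean squared error of `predict` on (x, y) pairs not used for fitting."""
    squared = [(predict(x) - y) ** 2 for x, y in holdout]
    return math.sqrt(sum(squared) / len(squared))


def pick_best(candidates, holdout):
    """Criterion 2: keep the candidate that predicts new data best."""
    return min(candidates, key=lambda name: rmse(candidates[name], holdout))


# Made-up holdout data and two stand-in "fitted" regression models.
holdout = [(1, 2.0), (2, 4.0), (3, 6.0)]
candidates = {
    "mean_only": lambda x: 4.0,
    "linear": lambda x: 2.0 * x,
}
best = pick_best(candidates, holdout)
```

Note that `model_family` would label a categorical target stored as integer codes as a regression problem, which is why the data type of each field needs to be checked before this step.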
The Nitty-Gritty of Developing a Predictive Model: Data Hygiene
• The data hygiene requirements for developing predictive models are more exacting than for reporting and building BI dashboards
• The common data hygiene “gotchas” are:
  • Fields with missing values. Some modeling methods can address missing values for predictor variables; others cannot, and typically drop records that contain one or more missing values in the selected set of predictor variables. No method can address records with a missing target variable
  • Categorical variables that have little variability (e.g., 99% of all records are in the same category) or have categories with a small number of records (leading to reliability problems and/or the possibility that new data cannot be predicted due to “unknown” categories)
  • Categorical variables that are disguised as integers. For target variables, this can mean that an inappropriate modeling method is used; for predictors, it can mean the variable is used in an inappropriate way
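A simple per-field audit for these gotchas might look like the sketch below. The function name and the thresholds are assumptions: 99% dominance echoes the example above, 20 records per category is borrowed from the guidance on dominated variables later in the deck, and treating an integer field with at most 10 distinct values as a possible category code is purely a heuristic invented for this example.

```python
from collections import Counter


def audit_field(values, dominance=0.99, min_count=20, max_codes=10):
    """Return a list of potential data hygiene issues for one field."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    issues = []
    if len(present) < len(values):
        issues.append("missing values")
    if not counts:
        return issues
    is_text = all(isinstance(v, str) for v in present)
    is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in present)
    if is_text:
        top_share = counts.most_common(1)[0][1] / len(present)
        if top_share >= dominance:
            issues.append("dominated by one category")
        if min(counts.values()) < min_count:
            issues.append("sparse categories")
    if is_int and len(counts) <= max_codes:
        issues.append("possible category codes stored as integers")
    return issues


# Made-up fields: a dominated categorical field and an integer-coded field.
boat_type = ["kayak"] * 199 + ["canoe"]
store_code = [1, 2, 3, None, 2]
boat_issues = audit_field(boat_type)
store_issues = audit_field(store_code)
```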
The Nitty-Gritty of Developing a Predictive Model: Data Hygiene
• Addressing fields with missing values
  • For predictor variables it makes sense to impute missing values
    • For numeric variables, a fixed value such as the mean, median, or zero is commonly used to replace missing values. In addition, a categorical variable can be created for each predictor to indicate whether its value has been imputed. My recommendation is to use zero along with a categorical variable indicating whether the value has been imputed
    • Missing values of categorical variables can be replaced with a new category indicating the value is missing (my recommendation) or with the mode of the variable
    • There are model-based methods that replace missing values with predicted values based on other available data
  • Records with missing values for the target variable should be filtered out of the data
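The recommended treatment (zero plus an indicator for numeric predictors, a "missing" category for categorical predictors, and dropping records with no target) can be sketched in plain Python. The record layout, the `_imputed` suffix, and the "Missing" label are naming choices made for this example.

```python
def impute(records, target, numeric_fields, categorical_fields):
    """Apply the imputation rules to a list of dict records."""
    cleaned = []
    for record in records:
        # Records with a missing target cannot be used and are filtered out.
        if record.get(target) is None:
            continue
        row = dict(record)
        for field in numeric_fields:
            missing = row.get(field) is None
            # Categorical flag recording whether the value was imputed.
            row[field + "_imputed"] = "yes" if missing else "no"
            if missing:
                row[field] = 0
        for field in categorical_fields:
            if row.get(field) is None:
                row[field] = "Missing"  # new category for missing values
        cleaned.append(row)
    return cleaned


# Made-up customer records.
raw = [
    {"responded": "yes", "income": None, "state": "CA"},
    {"responded": "no", "income": 52000, "state": None},
    {"responded": None, "income": 61000, "state": "WA"},
]
prepared = impute(raw, "responded", ["income"], ["state"])
```

Keeping the indicator lets a model estimate a separate effect for records whose value was imputed, so the zero is not read as a real measurement.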
The Nitty-Gritty of Developing a Predictive Model: Data Hygiene
• Addressing problematic categorical variables
  • How to address categorical variables that are dominated by a single category (i.e., have little variability) depends on the amount of data available for creating a model. If there is a lot of data, and there is a reasonable number of records (at least 20) in each of the non-dominant categories, then including the field in the model is a viable choice. Otherwise, it makes sense not to include these fields as predictors
  • For categorical variables that have categories with few records, it makes sense to combine categories. The combination should have a sound logical basis, rather than being made because the categories have a similar relationship with the target field
  • Fields that use integer values to identify different categories should have their data type changed to a string type to indicate that the values are actually category labels
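The last two fixes can be sketched as small helpers. The names are hypothetical, the grouping of categories is supplied by the analyst (so that it rests on a logical basis rather than on the target), and reusing the threshold of 20 records to decide which categories count as sparse is an assumption of this example.

```python
from collections import Counter


def combine_sparse(values, groups, min_count=20):
    """Replace sparse categories with an analyst-defined broader category.

    `groups` maps an original category to the broader category it logically
    belongs to; categories without a mapping are left unchanged.
    """
    counts = Counter(values)
    return [
        groups.get(v, v) if counts[v] < min_count else v
        for v in values
    ]


def as_category_labels(values):
    """Convert integer codes to strings so they are treated as labels."""
    return [None if v is None else str(v) for v in values]


# Made-up data: two sparse boat types merged on a logical basis.
boat_type = ["kayak"] * 25 + ["canoe"] * 3 + ["raft"] * 2
groups = {"canoe": "other paddle craft", "raft": "other paddle craft"}
combined = Counter(combine_sparse(boat_type, groups))
store_labels = as_category_labels([1, 2, None])
```

After combining, the category counts should be checked again, since a merged category can still be too small to be reliable.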
Key Takeaways
• Clearly define the business issue: create a mental model
• Starting with the right data is critical to the accuracy of predictive models
• Data hygiene requirements for predictive modeling are more stringent than for BI/reporting
• Data variable type (“numeric” or “categorical”) matters:
  • For selecting an appropriate modeling method
  • When imputing missing values
• The volume of data can be critical when addressing problematic categorical variables
@alteryx
See what Alteryx can do for you!
Download a free trial of Alteryx
alteryx.com/trial
or visit alteryx.com for more information
Thank you
@DrDan
Advanced Analytics Forum, Alteryx Community
community.alteryx.com

Enhance Predictive Modeling with Better Data Preparation

  • 1. Enhance Predictive Modeling with Better Data Preparation Ritu Jain, Director of Industry & Solutions Marketing, Alteryx Inc. Dr. Dan Putler, Chief Scientist, Alteryx Inc. April 6, 2016
  • 2. Speakers Ritu Jain Director, Industry & Solutions Marketing Alteryx Dr. Dan Putler Chief Scientist Alteryx
  • 3. Download a FREETrial: alteryx.com/trial© 2016 Alteryx, Inc. | Confidential Agenda • Alteryx Overview • ThinkingThrough Predictive Model Use Case • Starting with the Right Data • Choosing the Right ModelingTechnique 3
  • 4. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial Customer Success Customers across the world 1500+ Strong Foundation 95%+ Renewal rate & Investment for Innovation Associates across North America, Europe & Australia Corporate Info. The Leading Platform for Self-Service DataAnalytics 300+
  • 5. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial Designed for Analyst Enablement 5 Enrich Prep & Blend Analyze Input All Relevant Data Share OutputAll Popular Formats
  • 6. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial • What decision needs to be made? • What information is needed to inform that decision? • Typically developing a mental model of the process helps a great deal in terms of determining all the potentially relevant information • What type of analysis is going to be able to provide the exact information needed to inform the decision? 6 Understanding the Business Issue
  • 7. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial • How much electricity does a utility need to have the capacity to supply for any given hour tomorrow? • To which of its customers should an outdoor sports retailer send a paddling sports catalog? 7 Two Specific Use Cases to Illustrate Business Issue Understanding
  • 8. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial • The question “How much electricity does a utility need to have the capacity to supply for any given hour tomorrow?” actually has two underlying decisions: • Which of our existing power plants should we start to bring online or start to take offline? • Should we purchase electricity from the spot market, and, if yes, how much? • The critical information that needs to be known is how much electricity will be demanded in each hour of the day tomorrow • Unfortunately, this information is not known at the time decisions need to be made, but it can be predicted using a predictive model 8 The Electricity Supply Use Case
  • 9. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial • What factors are likely to drive the demand for electricity in a given hour tomorrow? • This is where having a mental model of the process can be very handy • Some factors that are likely to be important: • Day of the week • Hour of the day • The temperature that hour and the preceding hour • The month of the year • One issue is that, like electricity demand, the temperature in that hour (or even the preceding hour) tomorrow will not be known at the time decisions are made (today), but the temperature in each hour tomorrow can be predicted using a model 9 The Electricity Supply Use Case
  • 10. © 2016 Alteryx, Inc. | Confidential Download a FREETrial: alteryx.com/trial • Available factors to predict hourly temperatures • The forecast high and low for the day from the National Weather Service or other organization • The number of minutes since sunrise or sunset at the start of each hour • The temperature for the same hour on the previous day • In this case two different predictive models are needed: • Predict hourly temperatures for the next day • Predict hourly electricity use given temperatures and other factors 10 The Electricity Supply Use Case
  • 11. The Paddle Sports Catalog Use Case
      • The question “To which of its customers should an outdoor sports retailer send a paddling sports catalog?” has a definitive answer: send it to any customer for whom the full cost of sending the catalog is less than the expected margin dollars (item price less item cost) from the items that customer would purchase from the catalog
      • While the criterion for deciding whether a specific customer should be sent a catalog is definitive, knowing whether that customer meets the criterion is another matter
      • Predictive models can help to provide the information needed on whether a particular customer is expected to meet the criterion
      • Two models would typically be used:
          • A model that predicts whether a customer will purchase anything from the catalog at all
          • A model of the margin dollars a customer will generate, conditional on the customer purchasing from the catalog
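The mailing rule on slide 11 can be written down directly once the two models have produced their outputs. The sketch below assumes the two outputs are combined as probability of purchase times margin given a purchase, which is the usual way to form an expected value when a non-purchaser contributes zero margin; the function and parameter names are illustrative, not from the slides.

```python
def expected_margin(p_purchase: float, margin_if_purchase: float) -> float:
    """Expected margin dollars from one customer.

    p_purchase: output of the purchase (classification) model, in [0, 1].
    margin_if_purchase: output of the conditional margin (regression) model.
    """
    return p_purchase * margin_if_purchase


def should_send_catalog(p_purchase: float,
                        margin_if_purchase: float,
                        full_cost: float) -> bool:
    """Send only when the full cost is less than the expected margin."""
    return full_cost < expected_margin(p_purchase, margin_if_purchase)


# Illustrative numbers only (not from the presentation).
print(should_send_catalog(0.05, 40.0, 1.50))  # expected margin about 2.00
print(should_send_catalog(0.02, 40.0, 1.50))  # expected margin about 0.80
```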
  • 12. The Paddle Sports Catalog Use Case
      • In terms of selecting variables for the two models we have identified, we need to make use of information that is known prior to sending a catalog to a customer. There are a number of ready candidates to use:
          • Demographic and socioeconomic information: age, income, family status
          • Location information: state; travel time to a store; proximity to the sea, lakes, or rivers
          • Past purchase behavior, typically measured using the concept of Recency, Frequency, and Monetary Value (or RFM)
      • We also need to have observations on an appropriate target variable. There are two ways to do this:
          • Use appropriate historical data (i.e., the response to last year’s paddle sports catalog)
          • Use a “test” approach, where we send the catalog to a sample of our customers, and then use this data to predict the behavior of all our customers
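As a rough illustration of the RFM idea on slide 12, the sketch below derives recency, frequency, and monetary value per customer from a transaction list, using only purchases made before a cutoff date so that the predictors reflect what was known prior to the mailing. The record layout `(customer_id, purchase_date, amount)` and the choice of total spend as the monetary measure are assumptions; average spend is an equally common definition.

```python
from collections import defaultdict
from datetime import date


def rfm_features(transactions, as_of):
    """Build RFM predictors per customer.

    transactions: iterable of (customer_id, purchase_date, amount) tuples
                  (hypothetical layout).
    as_of: cutoff date; only earlier purchases are used.
    """
    by_customer = defaultdict(list)
    for customer_id, purchase_date, amount in transactions:
        if purchase_date < as_of:  # ignore anything not yet known
            by_customer[customer_id].append((purchase_date, amount))

    features = {}
    for customer_id, rows in by_customer.items():
        last_purchase = max(d for d, _ in rows)
        features[customer_id] = {
            "recency_days": (as_of - last_purchase).days,
            "frequency": len(rows),
            "monetary": sum(a for _, a in rows),  # total spend (assumption)
        }
    return features


# Made-up transactions, for illustration only.
sample = [
    ("a", date(2016, 1, 1), 50.0),
    ("a", date(2016, 3, 1), 25.0),
    ("b", date(2015, 12, 1), 10.0),
    ("a", date(2016, 5, 1), 99.0),  # after the cutoff, so excluded
]
print(rfm_features(sample, as_of=date(2016, 4, 1)))
```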
  • 13. The Nitty-Gritty of Developing a Predictive Model: Modeling Method
      • There is a large (overwhelming?) number of different modeling methods available
      • There are two criteria for selecting the final modeling method to use:
          • Selecting an appropriate modeling method, which is largely driven by the data type of the target variable (categorical or numeric)
          • Selecting the model (hence the method) with the greatest predictive efficacy for predicting new data among a set of appropriate models
      • Basic model types:
          • Classification models, which predict the category into which a case (e.g., a customer) falls
          • Regression models, which predict numeric quantities
      • Linking back to the use cases:
          • Classification: whether a customer will respond to the paddling sports catalog
          • Regression: the margin dollars from a customer who receives the catalog; hourly temperature; hourly electricity demand
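The two selection criteria on slide 13 can be sketched in a few lines. The first helper picks the model family from the target's data type; the second compares candidate regression models on data held out from fitting and keeps the one with the lowest error. Both are simplified illustrations with hypothetical names: mean squared error is just one possible measure of predictive efficacy, and the candidates are assumed to be already-fitted callables.

```python
from typing import Callable, Dict, Sequence, Tuple


def model_family(target: Sequence) -> str:
    """Return 'regression' for a numeric target, else 'classification'.

    Caveat: categories stored as integers look numeric to this check,
    so integer-coded targets need to be relabeled first.
    """
    present = [v for v in target if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in present
    )
    return "regression" if numeric else "classification"


def mean_squared_error(actual: Sequence[float],
                       predicted: Sequence[float]) -> float:
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pick_best_model(
    candidates: Dict[str, Callable[[float], float]],
    holdout_x: Sequence[float],
    holdout_y: Sequence[float],
) -> Tuple[str, Dict[str, float]]:
    """Score each fitted candidate on held-out data; lowest error wins."""
    scores = {
        name: mean_squared_error(holdout_y, [model(x) for x in holdout_x])
        for name, model in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

Scoring on records that were not used to fit the candidates is what makes the comparison about new data rather than about how closely each model matches the data it has already seen.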
  • 14. The Nitty-Gritty of Developing a Predictive Model: Data Hygiene
      • The data hygiene requirements for developing predictive models are more exacting than for reporting and building BI dashboards
      • The common data hygiene “gotchas” are:
          • Fields with missing values. Some modeling methods can address missing values for predictor variables; others cannot, and typically drop records that contain one or more missing values in the selected set of predictor variables. No method can address records with a missing target variable
          • Categorical variables that have little variability (e.g., 99% of all records are in the same category) or have categories with a small number of records (leading to reliability problems and/or the possibility that new data cannot be predicted due to “unknown” categories)
          • Categorical variables that are disguised as integers. For target variables, this can mean that an inappropriate modeling method is used; for predictors, it can mean the variable is used in an inappropriate way
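The three gotchas on slide 14 lend themselves to a quick audit before modeling. The sketch below is one possible set of checks in plain Python, with `None` standing in for a missing value. The 99% share comes from the slide's example and the minimum of 20 records from slide 16; the `max_levels` cutoff used to flag integer-coded categories is purely an assumption, and all function names are hypothetical.

```python
from collections import Counter


def count_missing(values):
    """Number of missing entries (None) in a field."""
    return sum(1 for v in values if v is None)


def categorical_issues(values, dominant_share=0.99, min_count=20):
    """Flag a dominating category and any sparsely populated categories."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    issues = {"dominant_category": None, "rare_categories": []}
    if not counts:
        return issues
    top, top_n = counts.most_common(1)[0]
    if top_n / len(present) >= dominant_share:
        issues["dominant_category"] = top
    issues["rare_categories"] = sorted(
        (c for c, n in counts.items() if n < min_count), key=str
    )
    return issues


def looks_integer_coded(values, max_levels=10):
    """Heuristic: all integers with only a few distinct values.

    max_levels is an arbitrary illustrative cutoff; a flagged field still
    needs a human to confirm that the integers are category labels.
    """
    present = [v for v in values if v is not None]
    if not present:
        return False
    all_ints = all(
        isinstance(v, int) and not isinstance(v, bool) for v in present
    )
    return all_ints and len(set(present)) <= max_levels
```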
  • 15. The Nitty-Gritty of Developing a Predictive Model: Data Hygiene
      • Addressing fields with missing values
          • For predictor variables it makes sense to impute missing values
              • For numeric variables, a fixed value such as the mean, median, or zero is commonly used to replace missing values. In addition, a categorical variable can be created for each predictor to indicate whether its value has been imputed. My recommendation is to use zero along with a categorical variable that indicates whether the value has been imputed
              • Missing values of categorical variables can be replaced with a new category indicating the value is missing (my recommendation) or with the mode value for the variable
              • There are model-based methods that replace missing values with predicted values based on other available data
          • Records with missing values for the target variable should be filtered out of the data
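The recommendations on slide 15 translate into three small steps, sketched here in plain Python with `None` representing a missing value. The function names, the flag labels, and the `"Missing"` category label are illustrative choices, not names taken from the presentation or from any particular tool.

```python
def impute_numeric_with_flag(values, fill=0.0):
    """Replace missing numeric values with `fill` and record where.

    Returns the filled values plus a companion categorical field that
    marks each record as 'imputed' or 'observed'.
    """
    filled = [fill if v is None else v for v in values]
    flags = ["imputed" if v is None else "observed" for v in values]
    return filled, flags


def impute_categorical(values, missing_label="Missing"):
    """Put missing categorical values in their own category."""
    return [missing_label if v is None else v for v in values]


def drop_missing_target(records, target):
    """Keep only records (dicts) with a non-missing target value."""
    return [r for r in records if r.get(target) is not None]


income, income_flag = impute_numeric_with_flag([52000.0, None, 61000.0])
print(income)       # [52000.0, 0.0, 61000.0]
print(income_flag)  # ['observed', 'imputed', 'observed']
```

Pairing the zero fill with the indicator field lets a model estimate a separate effect for records whose value was missing, so the zero itself is not read as a real measurement.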
  • 16. The Nitty-Gritty of Developing a Predictive Model: Data Hygiene
      • Addressing problematic categorical variables
          • Addressing categorical variables that are dominated by a single category (i.e., have little variability) depends on the amount of data available for creating a model. If there is a lot of data, and there is a reasonable number of records (at least 20) in each of the non-dominant categories, then including the field in the model is a viable choice. Otherwise, it makes sense to not include these fields as predictors
          • In the case of categorical variables with categories that have few records, it makes sense to combine categories. The combination of categories should have a sound logical basis, as opposed to being combined due to having a similar relationship with the target field
          • Fields that use integer values to identify different categories should have their data type changed to a string type to indicate that the values are actually category labels
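The three remedies on slide 16 can be sketched as small helpers. The threshold of 20 records is the figure given on the slide; everything else, including the function names and the example grouping of product categories, is hypothetical. The grouping is passed in by the analyst on purpose, since the slide asks for combinations with a logical basis rather than ones derived from the target.

```python
from collections import Counter


def keep_dominated_field(values, min_count=20):
    """True if every non-dominant category has at least min_count records."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) < 2:
        return False  # a single category carries no information
    non_dominant = counts.most_common()[1:]
    return all(n >= min_count for _, n in non_dominant)


def combine_categories(values, grouping):
    """Relabel categories using an analyst-supplied mapping.

    Categories that are not in the mapping are left unchanged.
    """
    return [grouping.get(v, v) for v in values]


def as_category_labels(values):
    """Convert integer codes to strings so they are treated as labels."""
    return [None if v is None else str(v) for v in values]


# Hypothetical grouping with a logical (not target-driven) basis.
grouping = {"Kayak": "Paddle craft", "Canoe": "Paddle craft"}
print(combine_categories(["Kayak", "Canoe", "Tent"], grouping))
print(as_category_labels([1, 2, None, 3]))
```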
  • 17. Key Takeaways
      • Clearly define the business issue – create a mental model
      • Starting with the right data is critical to the accuracy of predictive models
      • Data hygiene requirements for predictive modeling are more stringent than for BI/Reporting
      • Data variable type – “numeric” or “categorical” – matters:
          • For selecting an appropriate modeling method
          • When imputing missing values
      • The volume of data can be critical when addressing problematic categorical variables
  • 18. Thank you
      • See what Alteryx can do for you! Download a free trial of Alteryx at alteryx.com/trial or visit alteryx.com for more information
      • @alteryx, @DrDan
      • Advanced Analytics Forum, Alteryx Community: community.alteryx.com