Daqing Zhao, PhD
Founder and Principal, Eureka Analytics
Business Intelligence Innovation Summit,
Chicago
5/23/2013
©Daqing Zhao All rights reserved
Frontiers of Big Data Business
Analytics, Patterns and Cases in
Online Marketing
Agenda
• Overview of big data analytics
• Insights of big data and analysis
• BI process on big data
• Lessons of model building
• Cases for behavioral profiles for predictive models
– Yahoo network segmentation
– Tribal Fusion display ads impression optimization
– University of Phoenix student retention and lead
optimization
• Case of Ask.com SEM algorithms
2
Daqing Zhao, PhD
• Big Data scientist with deep domain knowledge
• Academic training
– Analyzed molecular spectra on Cray supercomputers
– Determined, modeled, simulated molecular motions in 3D space
• Enjoy working with large data and large scale computing
• Worked on computational Internet marketing since 1999
3
New Book on Big Data Analytics
• In the book:
• Daqing Zhao:
• Frontiers of Big Data
Business Analytics:
Patterns and Cases in
Online Marketing
4
Big data, Big Opportunities
• Thanks to Moore’s law, on CPU, storage, network connections
• Too much data, too little knowledge
• Data and analytics have changed every field many times over
• From science, government, to commerce
5
Big data characteristics
• Amount of data too big to handle using normal technology,
most data collected are dormant
• Raw data are stored, appended but not updated
• Formatted or free format data
• No aggregation for purpose of data reduction
• Individual customer level and individual event level data
• Sensor data
• Complete 360 degree view
• Process from raw data to get insights and build models
• Some business uses of big data: customer profile, event
prediction, automated decision machine, risk management,
wisdom of crowd
6
Things computers are good at
• Computers have perfect memory
– Every page view, click, transaction, every event,…
• Good at finding a needle in a haystack
– E.g., target abandoned shopping carts with promotions
– Clickers of this page in the last week
• Good at trade-offs among a large number of factors
– Female, 25-34, with child < 5, Asian, earning $30K, rent,
divorced, live in Calif., some college, Walmart,
Coupons.com, Monster.com, drive Camry, …
– Buyer of X or not?
7
Things computers on Internet are good at
• Platforms for crowdsourcing
– Google PageRank, Adwords, Picasa, Translate, …
• Data not previously looked at in aggregate
– Google PageRank/Translate, Amazon Find Product
• Data not previously created, or accumulated
– Social network data at LinkedIn, Facebook
– Amazon Customer Review, Yelp
– Twitter, Flickr
– Wikipedia, Youtube/Khan Academy, eHow, Udemy,
Yahoo/Answers
8
Computers make it possible
• Given data, find models and parameters
– Identify reproducible patterns in the data
– Provide simple picture of a large number of events
– Predict events in the future
• Simulations generate future events, given
assumptions, and current state
– Given a set of models, what future scenarios will look like
under a given set of conditions (“what ifs”)
• Robots, and agents
– Make decisions based on environment and goals, self
driving cars
9
Computers can’t do everything
• Data often have issues before being well analyzed
• Data often have no taxonomy and context
• In free-format data, relevant information needs to be
extracted
• Analyst has to define targets, construct predictors
• Analyst has to include critical predictive factors
• Analyst needs to add common sense
10
Every wrong data is wrong in its own way
• Some data are not collected, “too big” or “useless”, as in flood
control, purged log data
• Some data feeds to warehouse are incomplete
• Multiple definitions and inconsistent business rules, no
documentation
• Data incomplete due to business nature
– Sparse data
– Separate log in and log out data
– Credit card purchases versus cash
• Some flaws are easy to catch, such as missing or constant values
• Some flaws are hard to find, such as partially missing or incorrect values
11
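A minimal sketch of the "easy to catch" checks on this slide, assuming records arrive as a list of dictionaries. The function name `profile_columns` and the sample fields are hypothetical, not from the talk. Partially missing or incorrect values still need domain knowledge and will not be caught by a check like this.

```python
from typing import Any


def profile_columns(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Report the missing-value rate and whether each column is constant."""
    if not rows:
        return {}
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat absent keys, None, and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "missing_rate": 1 - len(present) / len(rows),
            "constant": len(set(present)) <= 1,
        }
    return report


# Hypothetical sample records for illustration only.
rows = [
    {"user": "a", "country": "US", "age": 31},
    {"user": "b", "country": "US", "age": None},
    {"user": "c", "country": "US"},
    {"user": "d", "country": "US", "age": 45},
]
report = profile_columns(rows)
for column, stats in report.items():
    print(column, stats)
```

Here `country` is flagged as constant and `age` as half missing, which are the kinds of flaws worth reviewing before any modeling starts.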
Best practices of analyst
• Understand how the data are collected, what data
can and cannot be collected
• Balance cost of collecting data and optimize
modeling
• Use feedback loop to test hypotheses
• Do simulations to see if changes are reasonable
• Good ideas are not necessarily complicated ideas
• Focus on domain knowledge, not just data mining
tools
12
Best Practices of Analytics Managers
• Be well versed in analytics; understand analysts, their
behavior, the tests, their work and value
• Focus on domain knowledge, not just data mining
tools
• Focus on impact, not elegance in modeling
• Big Data analytics is different from small-sample
statistics and has to be learned on the job
• As activities become more technical, it is hard to
recognize values and identify issues
– 2008: Financial crisis and credit derivatives
– Principal-agent problem
13
New Information Explosions
• Before ~1450, only the nobility had a few books
• After Gutenberg, information was limited by paper
and printing capacities
– People cried out loud there was too much information
– Then we had libraries, index, abstract, book reviews,…
• Now information is limited by disks & cloud storage
– A person’s lifetime spoken words stored in a thumb drive
– Soon everything can be stored
• Now: how do we make use of all the information?
– Search, crowdsourcing, Twitter, Wikipedia, YouTube,
big data and analysis algorithms, …
14
Paradigm Shift in Data Organization
• Mathematics is a way to efficiently use brain resources
– With pen and paper, only simple problems solvable
– Crude approximations, and samples for complicated ones
– Unreasonable effectiveness of mathematics – E. Wigner
• Now, algorithms are ways to efficiently use computing
resources
– Numerical solutions of complex equations
– Large scale simulations, full population databases
– Unreasonable effectiveness of data – P. Norvig
• Elegant, over simplified models are less useful
15
Paradigm Shift in Knowledge
• Knowledge is power, by Francis Bacon
• Past: Drowning in information, starving for
knowledge, by John Naisbitt
• Now: knowing how to extract knowledge is power
• Soon: there will be an abundance of knowledge, and we will seek
relevance
– Incl. personal finance, medical, political decisions
• Innovations are about connecting the dots
– Distances between the dots are getting smaller
– Leverage knowledge to make decisions, manage risks
16
Big Data problem
• Data size larger than what databases can handle
• Terabytes of data may take hours just to scan
• Solution requires a cloud of servers with local
storage
– Read, process and write intermediate results in
parallel
– Aggregate at the end
• Cloud computing can build models at scale
• Cloud capacity often scales linearly with the number of servers
17
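The "process in parallel, aggregate at the end" pattern can be sketched in a few lines. This is only an illustration under stated assumptions: threads stand in for servers with local storage, and the function names and the page/click records are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_clicks(partition):
    """Process one partition locally and return a small intermediate result."""
    counts = Counter()
    for page, clicked in partition:
        if clicked:
            counts[page] += 1
    return counts


def aggregate(partitions, workers=4):
    """Run the per-partition step in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_clicks, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical partitions; in a real cluster each would live on a different server.
partitions = [
    [("home", True), ("search", False), ("home", True)],
    [("search", True), ("home", False)],
    [("cart", True), ("search", True)],
]
totals = aggregate(partitions)
print(dict(totals))
```

Because each partition produces only a small summary, the final merge stays cheap even when the raw data are far too large for one machine.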
Modeling needs to scale
• Traditional predictive models take long time to build
– Small data sets, samples expensive to collect
• Now data are cheap and models may degrade in weeks
– Dimension of predictors is very large
– Number of categories is large
• Human interactive model building is not scalable
• Reasons for target events are complex
• Without detailed analysis, it is unclear what drives the
event
• We need to rely on “out of sample testing” and “off the
shelf” modeling
18
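As a rough illustration of "out of sample testing", the sketch below holds out part of the data, fits a deliberately simple model on the rest, and scores it on the held-out part. Everything here is an assumption for illustration: the segment names, the synthetic conversion rates, the per-segment rate model, and the choice of the Brier score as the metric.

```python
import random


def split(records, holdout_fraction=0.3, seed=0):
    """Shuffle deterministically and set aside a holdout sample."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_rates(train):
    """Simple model: conversion rate per segment, with a global fallback."""
    totals, hits = {}, {}
    for segment, converted in train:
        totals[segment] = totals.get(segment, 0) + 1
        hits[segment] = hits.get(segment, 0) + converted
    overall = sum(hits.values()) / max(len(train), 1)
    rates = {s: hits[s] / totals[s] for s in totals}
    return rates, overall


def brier_score(model, data):
    """Mean squared error between predicted probability and outcome."""
    rates, overall = model
    errors = [(rates.get(s, overall) - y) ** 2 for s, y in data]
    return sum(errors) / len(errors)


# Synthetic data: two hypothetical segments with different conversion rates.
rng = random.Random(42)
records = [
    (seg, int(rng.random() < p))
    for seg, p in [("search", 0.6), ("display", 0.1)]
    for _ in range(500)
]
train, holdout = split(records)
model = fit_rates(train)
in_sample = brier_score(model, train)
out_of_sample = brier_score(model, holdout)
print(round(in_sample, 3), round(out_of_sample, 3))
```

Only the holdout score says anything about how the model will behave on new data; a large gap between the two scores is a warning sign.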
Cloud computing
• We built a SAS cloud at University of Phoenix
– I have an invited SAS talk available at SAS web site
– Can process billions of impressions in minutes
• Hadoop clouds are used widely
– Open source software, Hive, Impala, Mahout
– Commodity servers and storage
• Clouds may have 100Ks of servers
– Find needle in a haystack in milliseconds
– Model computations that would otherwise take years now
finish in minutes
19
Big Data Centers
• Facebook and Google data centers use commodity servers
• Google uses 260 million watts, enough to power 200K homes (NY Times)
• Data centers near the Columbia River at The Dalles, Oregon
20
Traditional BI pyramid
• Defines a sequence of efforts
• Most companies never get
beyond reporting and simple
analysis
• No full analysis and predictive
modeling ever done
• Some data issues may not be
caught
• Limited insights hinder
optimal extraction of
knowledge
(Figure: BI pyramid with layers labeled Multidimensional Report, Standard Report, Segmentation, Predictive Modeling, and Knowledge Discovery; a data maturity axis; annotations Baseline Pyramid and Hadoop)
21
Analysis leads to better data quality
(Figure: diagram with boxes labeled Raw data, Algorithms, Analysis, Reports, Business Rules, Algorithms, and Predictive Models)
22
More analysis leads to better quality
(Figure: diagram with stages labeled Data Collection, Exploratory Analysis, Predictive Modeling, and Decision Algorithms; annotation Better data quality)
23
Data are most important
• In modeling, finding the key data is most important
– Identify the smoking gun
• Data transformations
– PageRank is a game changing data transformation
– Wine.com case, wineRank
– Social graph is a key data transformation for credit
card fraud detection
24
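To show what a data transformation like PageRank does, here is a small power-iteration sketch. It is a textbook-style illustration, not Google's production method. The link graph, the damping factor of 0.85, and the iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

The raw link lists say little on their own; the transformed score turns them into a single predictor that ranks pages by how much the rest of the graph points to them.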
Modeling can go wrong
• Leakage in lead scoring model
– For example, use lead source to predict
conversion, when certain values of the field were
populated only for converters
• Display ads conversion model
– Construct data set by taking all converters and a
sample of non-converters
– Predict on page view profiles
– Problem: sample of non-converters included
customers who had no impressions of the ad
25
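One simple screen for the lead-scoring leakage described above is to compare how often a field is populated for converters versus non-converters. This is a sketch under assumptions: the field names, the sample leads, and the 0.5 gap threshold are hypothetical, and a large gap is only a prompt to ask how the field gets populated, not proof of leakage.

```python
def fill_rate_by_outcome(rows, field, target="converted"):
    """Share of rows with the field populated, split by outcome (0 or 1)."""
    filled = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for row in rows:
        y = int(bool(row[target]))
        totals[y] += 1
        if row.get(field) not in (None, ""):
            filled[y] += 1
    return {y: (filled[y] / totals[y] if totals[y] else 0.0) for y in (0, 1)}


def looks_leaky(rows, field, gap=0.5):
    """Flag a field whose fill rate differs sharply between outcomes."""
    rates = fill_rate_by_outcome(rows, field)
    return abs(rates[1] - rates[0]) >= gap


# Hypothetical leads: lead_source is only filled in for converters.
leads = [
    {"lead_source": "partner_form", "region": "west", "converted": 1},
    {"lead_source": "partner_form", "region": "east", "converted": 1},
    {"lead_source": None, "region": "west", "converted": 0},
    {"lead_source": None, "region": "east", "converted": 0},
    {"lead_source": None, "region": "west", "converted": 0},
    {"lead_source": "web", "region": "east", "converted": 1},
]
print(fill_rate_by_outcome(leads, "lead_source"))
print(looks_leaky(leads, "lead_source"), looks_leaky(leads, "region"))
```

A field like this would give a model near-perfect accuracy in training and none in production, because the value is written after the outcome is already known.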
Modeling lessons
• Yahoo DSL subscribers, one year contract
• If you try to model month to month retention, you
find high retention rate
– Because of contracts and penalties
• The correct way is to model retention at contract
expiry, only on 1/12 of the customers
• For Yahoo email, if you look at quarter by quarter
retention, you find that those acquired early in the
first quarter have lower retention rate
– Because those customers have more time to churn
• A correct way is to use survival analysis
26
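Survival analysis handles the "more time to churn" problem by counting each customer only while still under observation. Below is a minimal Kaplan-Meier sketch; the tenures and churn flags are made-up illustration data, and customers who are still active are treated as censored.

```python
def kaplan_meier(durations, churned):
    """Return [(time, survival probability)] at each time where churn occurred."""
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0   # customers who churned at time t
        leaving = 0  # everyone whose observation ends at time t
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Months observed per customer; churn flag 0 means still active (censored).
durations = [1, 2, 2, 3, 3, 3]
churned = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, churned)
for month, prob in curve:
    print(month, round(prob, 3))
```

Customers acquired at different times can be compared on the same tenure axis this way, so early joiners are not penalized for having been observed longer.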
Conclusions
• For optimal modeling, domain knowledge is most important
• May require Big Data solutions to scale
• Identify key data and transformations
• Data are not reliable until they have been seriously analyzed
• Conduct deep analysis before developing BI reports
• Testing and optimizing in the real market is crucial
• Focus on customer experience not model complexity or
predictive accuracy
• “The best way to get good ideas is to have a lot of them”
– Linus Pauling
• Use a lot of common sense
27

Big Data Analysis and Business Intelligence

  • 1. Daqing Zhao, PhD Founder and Principal, Eureka Analytics Business Intelligence Innovation Summit, Chicago 5/23/2013 ©Daqing Zhao All rights reserved Frontiers of Big Data Business Analytics, Patterns and Cases in Online Marketing
  • 2. Agenda • Overview of big data analytics • Insights of big data and analysis • BI process on big data • Lessons of model building • Cases for behavioral profiles for predictive models – Yahoo network segmentation – Tribal Fusion display ads impression optimization – University of Phoenix student retention and lead optimization • Case of Ask.com SEM algorithms 2
  • 3. Daqing Zhao, PhD • Big Data scientist with deep domain knowledge • Academic training – Analyzed molecular spectra on Cray supercomputers – Determined, modeled, simulated molecular motions in 3D space • Enjoy working with large data and large scale computing • Worked on computational Internet marketing since 1999 3
  • 4. New Book on Big Data Analytics • In the book: • Daqing Zhao: • Frontiers of Big Data Business Analytics: Patterns and Cases in Online Marketing 4
  • 5. Big data, Big Opportunities • Thanks to Moore’s law, on CPU, storage, network connections • Too much data, too little knowledge • Data, analytics changed every field many times over • From science, government, to commerce 5
  • 6. Big data characteristics • Amount of data too big to handle using normal technology, most data collected are dormant • Raw data are stored, appended but not updated • Formatted or free format data • No aggregation for purpose of data reduction • Individual customer level and individual event level data • Sensor data • Complete 360 degree view • Process from raw data to get insights and build models • Some business uses of big data: customer profile, event prediction, automated decision machine, risk management, wisdom of crowd 6
  • 7. Things computers good at • Computers have perfect memory – Every page view, click, transaction, every event,… • Good at finding a needle in a haystack – E.g., target abandoned shopping carts with promotions – Clickers of this page in the last week • Good at trade offs among large number of factors – Female, 25-34, with child < 5, Asian, earning $30K, rent, divorced, live in Calif., some college, Walmart, Coupons.com, Monster.com, drive Camry, … – Buyer of X or not? 7
  • 8. Things computers on Internet are good at • Platforms of cloud sourcing – Google PageRank, Adwords, Picasa, Translate, … • Data not previously looked at in aggregate – Google PageRank/Translate, Amazon Find Product • Data not previously created, or accumulated – Social network data at LinkedIn, Facebook – Amazon Customer Review, Yelp – Twitter, Flickr – Wikipedia, Youtube/Khan Academy, eHow, Udemy, Yahoo/Answers 8
  • 9. Computers make it possible • Given data, find models and parameters – Identify reproducible patterns in the data – Provide simple picture of a large number of events – Predict events in the future • Simulations generate future events, given assumptions, and current state – Given a set of models, how future scenario will look like, under given set of conditions, “what ifs” • Robots, and agents – Make decisions based on environment and goals, self driving cars 9
  • 10. Computers can’t do everything • Data often have issues before being well analyzed • Data often have no taxonomy and context • Free format data, relevant information need to be extracted • Analyst has to define targets, construct predictors • Analyst has to include critical predictive factors • Analyst need to add common sense 10
  • 11. Every wrong data is wrong in its own way • Some data are not collected, “too big” or “useless”, as in flood control, purged log data • Some data feeds to warehouse are incomplete • Multiple definitions and inconsistent business rules, no documentation • Data incomplete due to business nature – Sparse data – Separate log in and log out data – Credit card purchases versus cash • Some flaws are easy to catch, such as missing, constant • Some flaws hard to find, partially missing or incorrect 11
  • 12. Best practices of analysts • Understand how the data are collected, what data can and cannot be collected • Balance the cost of collecting data against the gain in modeling • Use feedback loops to test hypotheses • Do simulations to see if changes are reasonable • Good ideas are not necessarily complicated ideas • Focus on domain knowledge, not just data mining tools
  • 13. Best Practices of Analytics Managers • Be well versed in analytics; understand analysts, their behavior, their tests, their work and value • Focus on domain knowledge, not just data mining tools • Focus on impact, not elegance in modeling • Big Data analytics differs from small sample statistics, and has to be learned on the job • As activities become more technical, it gets harder to recognize value and identify issues – 2008: Financial crisis and credit derivatives – Principal-agent problem
  • 14. New Information Explosions • Before ~1450, only the nobility had a few books • After Gutenberg, information was limited by paper and printing capacities – People cried out loud there was too much information – Then we had libraries, indexes, abstracts, book reviews,… • Now information is limited by disks & cloud storage – A person’s lifetime spoken words stored in a thumb drive – Soon everything can be stored • Now: how do we make use of all the information? – Search, crowd sourcing, Twitter, Wikipedia, YouTube, big data and analysis algorithms, …
  • 15. Paradigm Shift in Data Organization • Mathematics is a way to efficiently use brain resources – With pen and paper, only simple problems are solvable – Crude approximations, and samples for complicated ones – Unreasonable effectiveness of mathematics – E. Wigner • Now, algorithms are ways to efficiently use computing resources – Numerical solutions of complex equations – Large scale simulations, full population databases – Unreasonable effectiveness of data – P. Norvig • Elegant, oversimplified models are less useful
  • 16. Paradigm Shift in Knowledge • Knowledge is power, by Francis Bacon • Past: Drowning in information, starving for knowledge, by John Naisbitt • Now: Knowing how to extract knowledge is power • Soon: There is an abundance of knowledge, and we seek relevance – Incl. personal finance, medical, political decisions • Innovations are about connecting the dots – Distances between the dots are getting smaller – Leverage knowledge to make decisions, manage risks
  • 17. Big Data problem • Data size larger than what databases can handle • Terabytes of data may take hours just to scan • Solution requires a cloud of servers with local storage – Read, process and write intermediate results in parallel – Aggregate at the end • Cloud computing can build models at scale • Cloud throughput often scales linearly with the number of servers
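The "process in parallel, aggregate at the end" pattern on this slide can be sketched on a single machine. This is a toy stand-in, not a Hadoop job: a thread pool plays the role of the servers, each in-memory list plays the role of a locally stored partition, and the names `count_clicks` and `parallel_click_counts` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_clicks(partition):
    """Per-partition step: count clicks per page for one chunk of events."""
    counts = Counter()
    for page, clicked in partition:
        if clicked:
            counts[page] += 1
    return counts


def parallel_click_counts(partitions, workers=4):
    """Process partitions independently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_clicks, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # final aggregation step
    return total
```

Because each partition is processed without reference to the others, adding workers adds capacity, which is the intuition behind the near-linear scaling the slide mentions.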
  • 18. Modeling needs to scale • Traditional predictive models take a long time to build – Small data sets, samples expensive to collect • Now data are cheap and models may degrade in weeks – The dimension of predictors is very large – The number of categories is large • Human interactive model building is not scalable • Reasons for target events are complex • Without detailed analysis, it is unclear what drives the event • We need to rely on “out of sample testing” and “off the shelf” modeling
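A small illustration of why out of sample testing matters when predictors have many categories. The lookup "model" below simply memorizes the majority label per key, so it looks perfect on the data it was built from. The model, the time-ordered split, and all names here are assumptions made for the sketch, not the speaker's method.

```python
from collections import Counter, defaultdict


def fit_lookup_model(rows):
    """Memorize the majority label per key; unseen keys get the overall majority."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for key, label in rows:
        by_key[key][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in by_key.items()}
    return lambda key: table.get(key, default)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(key) == label for key, label in rows) / len(rows)


def time_split(rows, train_fraction=0.8):
    """Hold out the most recent rows, assuming rows are in time order."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

With a unique key per row (for example a user ID), in-sample accuracy is 100% by construction, while the held-out rows fall back to the base rate. Only the held-out number says anything about future performance.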
  • 19. Cloud computing • We built a SAS cloud at University of Phoenix – I have an invited SAS talk available on the SAS web site – Can process billions of impressions in minutes • Hadoop clouds are used widely – Open source software, Hive, Impala, Mahout – Commodity servers and storage • Clouds may have 100Ks of servers – Find a needle in a haystack in milliseconds – Model computations that would usually take years now finish in minutes
  • 20. Big Data Centers • Facebook and Google data centers use commodity servers • Google uses 260 million watts, which could power 200K homes – NY Times • Data centers near the Columbia River at The Dalles, Oregon
  • 21. Traditional BI pyramid • Defines a sequence of efforts • Most companies never get beyond reporting and simple analysis • No full analysis and predictive modeling ever done • Some data issues may not be caught • Limited insights hinder optimal extraction of knowledge • [Figure: baseline pyramid of data maturity, with levels labeled Multidimensional Report, Standard Report, Segmentation, Predictive Modeling, Knowledge Discovery]
  • 22. Hadoop Analysis leads to better data quality • [Diagram labels: Raw data, Algorithms, Analysis, Reports, Business Rules, Algorithms, Predictive Models]
  • 23. More analysis leads to better quality • [Diagram labels: Data Collection, Exploratory Analysis, Predictive Modeling, Decision Algorithms, Better data quality]
  • 24. Data are most important • In modeling, finding the key data is most important – Identify the smoking gun • Data transformations – PageRank is a game-changing data transformation – Wine.com case, wineRank – Social graph is a key data transformation for credit card fraud detection
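To make "PageRank is a data transformation" concrete: it turns a raw link graph into one score per node that a downstream model can use as a predictor. The sketch below is the textbook power iteration with the commonly used damping factor of 0.85, written for a small in-memory graph. It is not Google's production algorithm, and the dict-of-lists input layout is an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [nodes it links to]}.

    Nodes with no outgoing links spread their rank evenly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The same idea, scoring items by the structure of a graph rather than by their own attributes, is what the wineRank and social graph examples on the slide point to.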
  • 25. Modeling can go wrong • Leakage in lead scoring model – For example, using lead source to predict conversion, when certain values of the field were populated only for converters • Display ads conversion model – Construct data set by taking all converters and a sample of non-converters – Predict on page view profiles – Problem: the sample of non-converters included customers who had no impressions of the ad
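Both failure modes on this slide can be screened for before modeling. The sketch below compares how often a field is populated for converters versus non-converters, and restricts the non-converter sample to users who actually saw the ad. Field names such as `converted` and `impressions`, and the 0.5 gap threshold, are hypothetical choices for illustration.

```python
def fill_rates_by_outcome(leads, field, outcome="converted"):
    """Share of records with `field` populated, split by outcome (True/False)."""
    rates = {}
    for flag in (True, False):
        group = [lead for lead in leads if bool(lead[outcome]) is flag]
        filled = sum(1 for lead in group if lead.get(field) not in (None, ""))
        rates[flag] = filled / len(group) if group else 0.0
    return rates


def looks_leaky(leads, field, outcome="converted", gap=0.5):
    """Flag a field whose fill rate differs sharply between outcomes."""
    rates = fill_rates_by_outcome(leads, field, outcome)
    return abs(rates[True] - rates[False]) >= gap


def eligible_non_converters(users):
    """Keep only non-converters who had at least one ad impression."""
    return [u for u in users if not u["converted"] and u["impressions"] > 0]
```

A flagged field is not automatically unusable, but it should be traced back to how and when it gets populated before it goes into a model.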
  • 26. Modeling lessons • Yahoo DSL subscribers, one year contract • If you try to model month to month retention, you find a high retention rate – Because of contracts and penalties • The correct way is to model retention at contract expiry, on only the 1/12 of the customers whose contracts expire that month • For Yahoo email, if you look at quarter by quarter retention, you find that those acquired early in the first quarter have a lower retention rate – Because those customers have more time to churn • A correct way is to use survival analysis
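Survival analysis handles the "more time to churn" problem by comparing customers at the same tenure and treating still-active customers as censored rather than retained. A minimal sketch of the standard Kaplan-Meier estimator follows. The input layout (a tenure per customer plus a churn flag) is an assumption, and a real analysis would more likely use a statistics package.

```python
def kaplan_meier(durations, churned):
    """Kaplan-Meier survival curve.

    durations: observed tenure per customer (e.g., months since signup).
    churned:   True if the customer churned at that tenure,
               False if still active when observed (censored).
    Returns [(time, survival_probability)] at each time with a churn event.
    """
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        churn_count = 0
        removed = 0
        # Group all customers observed at the same tenure t.
        while i < len(events) and events[i][0] == t:
            churn_count += 1 if events[i][1] else 0
            removed += 1
            i += 1
        if churn_count:
            survival *= 1 - churn_count / at_risk
            curve.append((t, survival))
        at_risk -= removed  # churned and censored customers both leave the risk set
    return curve
```

Customers acquired early and late in a quarter then contribute to the same curve, each only for the tenure actually observed, so cohorts are compared on equal footing.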
  • 27. Conclusions • For optimal modeling, domain knowledge is most important • May require Big Data solutions to scale • Identify key data and transformations • Data are not reliable until they have been seriously analyzed • Conduct deep analysis before developing BI reports • Testing and optimizing in the real market is crucial • Focus on customer experience, not model complexity or predictive accuracy • “The best way to get good ideas is to have a lot of them” – Linus Pauling • Use a lot of common sense