Data Workshop Overview and Challenges for Consumer Electronics
1. Workshop Overview
• Module 1: Data
• Module 2: Model and Estimation
• Module 3: Sample Output and Empirical Generalization
2. Outline
• Ideal Data for Promotion/Pricing Analytics: Scanner
Data (in CPG)
• Data and Inference: What Can Go Wrong?
• Challenges and Common Mistakes in Consumer
Electronics
• Data Requirement and Potential Data Source
3. Scanner Data (Store Audit Data)
How is Data Collected?
• Syndicated data providers: IRI and A.C. Nielsen
• Sample of stores (Grocery, Drug, Convenience, Mass
Merchandiser, Warehouse stores)
• Scanner data
– UPC info (product features), (Retail) price, Quantity (Volume)
all recorded
• Features
– Centrally collected and coded (daily)
• Displays
– Collected by store auditors (1X/week)
4. 4 Data Dimensions
• The Data Cube
– Geography (Market) x Product x Time x Variable (Measure)
– G x P x T x V > 1,000,000 even for one category
• Aggregation (chain/regions, SKU groups, temporal)
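A quick count shows how fast the cube grows and how much aggregation shrinks it. The dimension sizes below are hypothetical, chosen only for illustration; they are not taken from any workshop dataset.

```python
from math import prod

# Hypothetical dimension sizes for a single category (illustrative assumptions).
detailed = {"geography": 50, "product": 200, "time": 104, "variable": 5}

# The same category after aggregating to regions, SKU groups, and months.
aggregated = {"geography": 8, "product": 20, "time": 24, "variable": 5}

def cube_cells(dims):
    """Number of cells in a G x P x T x V data cube."""
    return prod(dims.values())

print(f"detailed cube:   {cube_cells(detailed):,} cells")
print(f"aggregated cube: {cube_cells(aggregated):,} cells")
```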
5. Scanner Data: Advantages
• Completeness
– Linking aggregate sales movements to marketing instruments
(price, feature, display, etc)
– Obtaining a richer set of performance measures beyond market
share and factory shipments
• Timeliness
– Getting the data within a window that allows for meaningful
managerial action (i.e., shorter than the old lag time of 8 weeks
or more)
• Accuracy
6. Scanner Data: Limitations
• Not a complete sampling frame: excluded stores
– Small shops, Walmart!
• Hard to make causal statements without careful modeling: non-random
assignment
• No information on consumer behaviors before purchases (e.g.
search, consideration) and consumption after purchases
• No information on psychographics
8. Promotions: Actual data
[Figure: market share and price by week (weeks 1 to 30), with promotion weeks marked by letter codes]
F = Feature, D = Display, C = Store Coupon
10. 1. Size of Data ≠ Information in Data
• Consider the following two options:
(1) Wal-Mart with 4,000 stores, 52 weeks of data, 500 SKUs (104 million observations!)
(2) Best Buy with 1,500 stores, 52 weeks of data, 500 SKUs (39 million observations)
• Which dataset would be more useful to measure price responses?
[Figure: weekly prices of three products (P1, P2, P3) over weeks 1 to 52, in two panels: Wal-Mart (EDLP) and Best Buy (Hi-Lo); vertical axis from 0 to 25]
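The comparison can be made concrete with a rough sketch. It assumes a log-log demand model, log(q) = a_i + b*log(p) + e, with a known noise level, in which the standard error of the price coefficient b depends on the total within-series variation in log price rather than on the raw observation count. The two price paths and the noise level are hypothetical; only the store, week, and SKU counts come from the slide.

```python
import math

def log_price_ssd(prices):
    """Sum of squared deviations of log price around its mean."""
    logs = [math.log(p) for p in prices]
    mean = sum(logs) / len(logs)
    return sum((x - mean) ** 2 for x in logs)

def elasticity_se(prices, n_series, noise_sd=0.5):
    """Standard error of b in log(q) = a_i + b*log(p) + e, assuming every
    store-SKU series i follows the same price path and noise_sd is known."""
    return noise_sd / math.sqrt(n_series * log_price_ssd(prices))

weeks = range(1, 53)
# Hypothetical price paths (assumptions, not data from the slide).
edlp_prices = [10.0 if w <= 26 else 9.9 for w in weeks]      # one 1% price change
hilo_prices = [8.0 if w % 4 == 0 else 10.0 for w in weeks]   # 20% off every 4th week

edlp_obs = 4000 * 52 * 500   # Wal-Mart: stores x weeks x SKUs
hilo_obs = 1500 * 52 * 500   # Best Buy: stores x weeks x SKUs
se_edlp = elasticity_se(edlp_prices, n_series=4000 * 500)
se_hilo = elasticity_se(hilo_prices, n_series=1500 * 500)

print(f"EDLP:  {edlp_obs:,} observations, SE of price coefficient {se_edlp:.5f}")
print(f"Hi-Lo: {hilo_obs:,} observations, SE of price coefficient {se_hilo:.5f}")
```

Under these assumptions the smaller Hi-Lo dataset pins down the price response far more precisely, because its prices actually move.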
11. 2. Pay Attention to Signal-to-Noise Ratio
• Consider the following measurement. Is there a significant impact from the marketing event?
Revenue    Before Event    After Event    % Change
Average    10              13             30
• Well, it depends on the signal-to-noise ratio!
[Figure: two panels, each titled "Revenue before/after event", with a horizontal axis from 0 to 120; the vertical axis runs from 0 to 16 in the first panel and from 0 to 120 in the second]
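A minimal sketch of the point, using two hypothetical revenue samples that share the same averages (10 before, 13 after) but differ in spread. The Welch t-statistic is one common way to compare the 30% change against the noise; the sample values are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(before, after):
    """Welch t-statistic for the difference in means (after minus before)."""
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    return (mean(after) - mean(before)) / se

# Hypothetical revenue samples; both pairs average 10 before and 13 after.
low_noise_before = [9.5, 10.5, 10.0, 9.8, 10.2]
low_noise_after = [12.5, 13.5, 13.0, 12.8, 13.2]
high_noise_before = [2.0, 18.0, 10.0, 1.0, 19.0]
high_noise_after = [3.0, 23.0, 13.0, 2.0, 24.0]

t_low = welch_t(low_noise_before, low_noise_after)
t_high = welch_t(high_noise_before, high_noise_after)

print(f"low-noise data:  t = {t_low:.2f}")
print(f"high-noise data: t = {t_high:.2f}")
```

The same 30% change is many standard errors away from zero in the low-noise sample and well within one standard error in the high-noise sample.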
12. 3. Be Careful about Reverse Causality
• Imagine the following data generating process.
– $\text{Sales}_{m,t} = \text{Intercept}_m + 0 \cdot \text{Adv}_{m,t} + \varepsilon_{m,t}$
– $\text{Adv}_{m,t} = 0.1 \cdot \text{Sales}_{m,t-1} + \omega_{m,t}$
• If you analyze the data while ignoring the reverse causality, you may conclude the following.
[Figure: scatter plot titled "Sales (m,t) vs. Adv (m,t)", with Sales (m,t) ranging from 0 to 60 and Adv (m,t) ranging from 0 to 6]
                   Coefficients    Standard Error    P-value
Intercept          -3.357941562    1.121241507       0.004875
Advertising (t)     9.716546286    0.354635984       3.58E-26
Significant impact of advertising?
R square: 0.95 -> Good fit!
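The data generating process above can be simulated in a few lines. The number of markets and periods, the spread of market intercepts, and the noise levels are all hypothetical choices. The within-market regression at the end is one possible check and is an addition for illustration, not something this slide prescribes.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)           # fixed seed so the sketch is repeatable
n_markets, n_periods = 20, 50    # hypothetical panel size
adv, sales, adv_within, sales_within = [], [], [], []

for m in range(n_markets):
    intercept = 10.0 + 2.0 * m                 # market size differs across markets
    prev_sales = intercept + rng.gauss(0, 1)
    a_m, s_m = [], []
    for t in range(1, n_periods):
        a = 0.1 * prev_sales + rng.gauss(0, 0.3)    # ad budget follows last period's sales
        s = intercept + 0.0 * a + rng.gauss(0, 1)   # true advertising effect is zero
        a_m.append(a)
        s_m.append(s)
        prev_sales = s
    adv += a_m
    sales += s_m
    # Demean within market to remove the market-specific intercept.
    a_bar, s_bar = sum(a_m) / len(a_m), sum(s_m) / len(s_m)
    adv_within += [a - a_bar for a in a_m]
    sales_within += [s - s_bar for s in s_m]

pooled_slope = ols_slope(adv, sales)
within_slope = ols_slope(adv_within, sales_within)
print(f"pooled slope: {pooled_slope:.2f}")
print(f"within-market slope: {within_slope:.2f}")
```

The pooled regression picks up the fact that large markets both sell more and advertise more, so it reports a large advertising coefficient even though the true effect is zero by construction.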
13. 4. Omitted Variables Can Be Dangerous
• Oftentimes, we don’t have data on some important variables, which can impact sales,
revenue, or profits.
– Doing analytics ignoring these “omitted variables” can lead to “biased” estimates of
marketing mix effects.
• Think about the graph below (from NYT). Is family income really responsible for better
academic achievement? Which omitted variables could be biasing this comparison?
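A small marketing version of the same problem, with made-up and noise-free numbers: sales are built as 10 + 2*promo + 5*holiday, and promotions cluster in holiday weeks. The holiday indicator plays the role of the omitted variable. All values are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical weekly data: sales = 10 + 2*promo + 5*holiday (no noise).
holiday = [0, 0, 0, 0, 1, 1, 1, 1]   # the omitted variable
promo = [0, 0, 0, 1, 0, 1, 1, 1]     # promotions cluster in holiday weeks
sales = [10 + 2 * p + 5 * h for p, h in zip(promo, holiday)]

def promo_lift(rows):
    """Mean sales in promoted weeks minus mean sales in non-promoted weeks."""
    on = [s for p, s in rows if p == 1]
    off = [s for p, s in rows if p == 0]
    return mean(on) - mean(off)

# Ignoring the holiday variable: compare all promoted vs. non-promoted weeks.
naive = promo_lift(list(zip(promo, sales)))

# Controlling for it: compare within holiday and non-holiday weeks, then average.
adjusted = mean([
    promo_lift([(p, s) for p, s, h in zip(promo, sales, holiday) if h == level])
    for level in (0, 1)
])

print(f"naive promotion effect: {naive}")
print(f"holiday-adjusted promotion effect: {adjusted}")
```

The naive comparison credits the promotion with the holiday lift as well, so it more than doubles the effect that was built into the data.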
14. 5. Selection by Outcome: Bad Idea!
• Problem: Two groups that are defined by an outcome variable are often
compared to infer the causal impact of the marketing mix
• Example
– To calculate the ROI of a paid search campaign, advertisers compare the “conversion
rates” of each “search” keyword. Usually, branded keywords are shown to have high
conversion rates (> 6%) compared to generic keywords (~ 1%).
• How to fix the problem?
– Use proper “control” condition!
– In the paid search example, all sales and profit from consumers who click on
branded-keyword ads are attributed to paid search. An implicit
assumption here is that all of these sales/profits would be lost without paid search. Really?
– It’s possible that consumers who use branded keywords are already quite committed to
purchase, and they may simply substitute to unpaid (organic) search links if paid
searches are turned off.
– A proper control in this case is “halting selected search engine marketing keywords”
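The arithmetic behind the control condition can be sketched as follows. The conversion counts are hypothetical and stand in for one branded keyword measured over two comparable periods, one with paid ads running and one with them halted.

```python
# Hypothetical conversion counts for one branded keyword (illustrative only).
paid_on = {"paid_conversions": 600, "organic_conversions": 400}   # ads running
paid_off = {"paid_conversions": 0, "organic_conversions": 950}    # ads halted (control)

def total_conversions(period):
    """Conversions from paid and organic links combined."""
    return period["paid_conversions"] + period["organic_conversions"]

# Naive attribution: every conversion that came through a paid click.
naive_credit = paid_on["paid_conversions"]

# Incremental effect: what is actually lost when the ads are turned off.
incremental = total_conversions(paid_on) - total_conversions(paid_off)

# Conversions that simply move to unpaid (organic) links.
substituted = naive_credit - incremental

print(f"naive credit: {naive_credit}")
print(f"incremental conversions: {incremental}")
print(f"substituted to organic: {substituted}")
```

With these made-up numbers, most of the conversions credited to paid search would have happened anyway through organic links.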
15. Key Challenges
• There are no syndicated data providers such as IRI and Nielsen in Consumer
Electronics
• Slightly better situation in North America or Europe
– NPD (U.S.) and GfK (Europe) provide market (or retail channel) level unit sales
and price data by SKU
– However, they do not provide promotion details
– Even with promotion data, the use of market (or channel) level data can cause
aggregation bias (i.e. overestimation of promotion effects)
• You have to assemble multiple datasets on your own
– At least 2 ~ 3 datasets need to be merged
– SKU-level unit sales data from ERP + External tracking service data (on price
and promotion): half-blind (no sales info for competitors)
– Better data access if you are a category captain
– Most painful and time-consuming step: working across organizational silos
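A minimal sketch of the merge step, showing why the result is "half-blind". The SKU names, field names, and values are hypothetical, and a real ERP or tracking-service extract will have its own schema.

```python
# Hypothetical ERP extract: own-SKU unit sales keyed by (sku, week).
erp_sales = {("OWN-TV-55", 1): 120, ("OWN-TV-55", 2): 310}

# Hypothetical tracking-service extract: price and promotion for own and competitor SKUs.
tracking = [
    {"sku": "OWN-TV-55", "week": 1, "price": 799.0, "promo": False},
    {"sku": "OWN-TV-55", "week": 2, "price": 699.0, "promo": True},
    {"sku": "RIVAL-TV-55", "week": 1, "price": 749.0, "promo": False},
    {"sku": "RIVAL-TV-55", "week": 2, "price": 749.0, "promo": False},
]

# Left join on (sku, week): keep every tracked row, attach unit sales where available.
merged = [
    {**row, "units": erp_sales.get((row["sku"], row["week"]))}
    for row in tracking
]

# Rows with no unit sales are the "blind" half: competitor SKUs.
missing_sales = [row for row in merged if row["units"] is None]

for row in merged:
    print(row)
```

Competitor prices and promotions are still usable as causal variables for the focal company's own sales, even though competitor unit sales are missing.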
16. Common Mistakes: For Discussion
• Use factory shipment data instead of retail sales data
– Biased promotion effect estimates due to forward buying by retailers
• Use cross-sectional data to measure price/promotion effects
– Biased price or promotion effect estimates due to omitted variable bias
– Better to use panel data and identify effects from within-store (or within chain) variation
• Use market (or channel) level data
– Promotion effects are not homogeneous within a market (or channel)
– Due to aggregation bias, promotion effects will be overstated
– Better to use store, account, or chain-level data, where promotion activities are uniform
within each unit
• Use data with short history (1 year or less)
– At least 2 to 3 years of data are required to properly measure seasonality
• Ignore price changes and promotions by competitors
– Biased estimates of baseline sales and price/promotion effects
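One stylized way to see the aggregation bias mentioned above: assume identical stores with a multiplicative promotion response, and suppose only a fraction of stores in the market promote in any given week. Regressing log market sales on the share of stores promoting then overstates the store-level effect. The lift, store count, baseline, and promotion shares are hypothetical, and this is only one mechanism that can produce the bias.

```python
import math

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

true_lift = 3.0                   # hypothetical: a promoted store sells 3x its baseline
true_beta = math.log(true_lift)   # store-level effect in log(sales) = a + beta * promo
n_stores, base_units = 100, 10.0  # hypothetical market

# Share of stores on promotion in each week (never the whole market at once).
share_promoted = [0.0, 0.1, 0.2, 0.3, 0.0, 0.1, 0.2, 0.3]

# Market-level data: log of total sales summed over promoted and non-promoted stores.
market_log_sales = [
    math.log(n_stores * base_units * ((1 - f) + f * true_lift))
    for f in share_promoted
]

market_beta = ols_slope(share_promoted, market_log_sales)
print(f"store-level promotion effect: {true_beta:.3f}")
print(f"market-level estimate: {market_beta:.3f}")
```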
18. Data Requirement
• Key elements of data
– Unit sales by SKU (outcome): ideally for the entire category (including competitors), but
still feasible with data for only the focal company's own SKUs
– Price measures by SKU (causal): focal company + competitors
– Promotion measures by SKU/product line/brand (causal): focal company + competitors
• Duration
– Ideally 3 years (of weekly data); At least 2 years of data
– To properly control for seasonality
• Level of aggregation
– Ideally store-level data; chain or account (chain-market combination) data can be used
as long as promotion/price policies are uniform (within chain or account)
– Using market or channel-level data can overstate promotion effects due to
aggregation bias
• Type of response data: Retail sales data (Do not use factory shipment data)
– Factory shipments are distorted by forward buying by retailers
19. Potential Data Source: For Discussion
• Key elements of data
– Unit sales by SKUs (outcome)
– Price measures by SKU (causal)
– Promotion measures by SKU/product line/brand (causal)