End-to-End Predictive Analytics for Digital Advertising
The University of Dundee
School of Computing
MSc Business Intelligence Project
Title: End To End Predictive Analytics Implementation in the Digital Advertising
Industry
Supervisor: Dr. Iain Martin
Date: 16th January 2013
I declare that the special study described in this thesis has been carried out and the
thesis composed by me, and that the thesis has not been accepted in fulfilment of
the requirements of any other degree or professional qualification
Rafael Garcia-Navarro
TABLE OF CONTENTS
Executive Summary
Acknowledgements
Certificate
Confidentiality Agreement
Table of Figures
Table of Tables
1. Overview
Introduction
The State of the Art in the Digital Marketing Agency Industry
Aim and Objectives
2. Methodology
Background
Implemented Data Mining Methodology
3. Business Understanding
Determine Business Objectives
Assess the Situation
Determine Data Mining Goals
Produce Project Plan
4. Data Understanding
Collect Initial Data
Describe Data
Explore Data
Verify Data Quality
5. Data Preparation
Select Data
Clean and Construct Data
Data Warehouse Data Modelling
Extract Transform and Load (ETL)
Integrate and Format Data
6. Modelling
Select Modelling Techniques
Generating Test Design
Build Model
Assess Model
7. Evaluation
Evaluate Results
Review Process
Determine Next Steps
8. Deployment
Plan Deployment
9. Conclusions
10. Future Work
11. References
Appendices
Appendix A – Project Plan
Appendix B – Doubleclick Data Definition Tables
Executive Summary
The recognition of data as an organisational strategic asset continues to increase across
industries, though digital marketing agencies have not traditionally embraced existing
technologies to address some of the key areas where more robust approaches are expected
and demanded by clients – channel contribution analysis and campaign response prediction.
Traditional technologies such as Microsoft SQL Server, Visual Basic, T-SQL and R can be
effectively used by digital marketing agencies looking to begin the journey to make data a
central component of their business proposition. However, the nature of digital data with its
high volume, variety and velocity lends itself to the application of emerging technologies
specifically developed from the ground up to operate in this new paradigm. The assessment
of those is beyond the scope of this research.
The aim of this project was to combine the robustness of long-established data warehousing
development principles with the benefits of statistical analysis to address business
challenges.
The Extract, Transform and Load (ETL) component of this project proved to be the most
challenging as well as intellectually rewarding. The ability to integrate, transform and
structure data for statistical analysis was the key focus and deliverable of this stage.
The methodology developed throughout this thesis will form the basis of the business
intelligence and statistical function to be implemented across Neo@Ogilvy UK, and should
be seen as an initial step in a complex and ever evolving discipline.
Areas of interest and of significant importance have been left untouched due to time
constraints, but will be further developed by the researcher beyond the academic
requirements to fulfil the MSc in Business Intelligence.
Table of Figures
Figure 1: Last Click Digital Marketing Measurement Model
Figure 2: The Data Mining Process - Mundy, Thornthwaite and Kimball (2011)
Figure 3: Generic Tasks (bold) and Outputs (italic) CRISP-DM Model
Figure 4: Doubleclick Digital Marketing Campaign Setup Process Flow
Figure 5: Doubleclick Dimension Log Files
Figure 6: Doubleclick AdvertiserActivity/Click/Impression Log Files
Figure 7: Platform Performance Issues
Figure 8: DW/BI System Architecture Model
Figure 9: User Model
Figure 10: Sun Model
Figure 11: Star Schema
Figure 12: OLAP Schema
Figure 13: ETL Steps
Figure 14: Full Decision Tree Model Output SSAS TotalRevenue
Figure 15: Full Decision Tree Model Output SSAS NumberOfProducts
Figure 16: Partial Decision Tree Model Output SSAS TotalRevenue
Figure 17: Decision Tree Model Output SSAS NumberOfProducts
Figure 18: Dependency Network Output SSAS TotalRevenue
Figure 19: Dependency Network Output SSAS NumberOfProducts
Figure 20: Model Evaluation Graph TotalRevenue
Figure 21: Model Evaluation Graph NumberOfProducts
Figure 22: Solution Deployment
Table of Tables
Table 1: Project Product Description
Table 2: AdvertiserActivity Other-Data Variables
Table 3: dimension description
Table 4: dimensional design specification
Table 5: ETL Design Specification
Table 6: Analytical Dataset Details
Table 7: Validation Results
Table 8: GLM Model Output Rattle (R) NumberOfProducts Vs TotalRevenue
Table 9: GLM Scoring File Output Dependent Variable NumberOfProducts
Table 10: Product Categories
1. Overview
Introduction
Gone are the days when organisations could happily rely on mass reach advertising
channels such as television to get access to the desired audiences through sixty-second
advertisements. Media fragmentation, social changes and consumer lifestyle demands have
rendered traditional marketing channels ineffective in fulfilling the commercial role
traditionally associated with the marketing industry: generating demand from either newly
created or existing consumer needs.
The nature of the digital marketing industry, where customer interactions can be tracked and
measured to an unprecedented level, presents both challenges and opportunities for those
organisations able to harness the power of data whilst managing and addressing the privacy
concerns of users, businesses and national governments.
The proliferation of data sources and their associated volume presents significant
challenges to the digital marketing industry, as a result of the exponential growth driven by
the social shift in media consumption experienced over the last decade.
Businesses are searching for marketing agency partners who can help them navigate
the complexity of this new data paradigm while maximising the
return on investment across ever growing digital marketing budgets. The World Advertising
Research Center (WARC), in its “Adstats: Global adspend forecast” study (2012), projects an
overall 12.3% growth for online marketing expenditure across the 12 key markets (Australia,
Brazil, Canada, China, France, Germany, India, Italy, Japan, Russia, UK and US). This is
also corroborated by the Internet Advertising Bureau (2012), the trade association for online
and mobile advertising in the UK, which reported a 12.6% growth in digital advertising
expenditure in the first half of 2012 compared to the first six months of 2011.
Not only can data provide a competitive advantage to organisations looking to build a data
driven culture for marketing investment decision making but, through the application of
robust statistical techniques, it can also deliver more meaningful and relevant messages to
consumers to improve both the user experience and the brand perception in the market
place.
The central areas of focus for this project are the development of the technical data
infrastructure and the implementation of the statistical capability needed to improve digital
marketing campaign targeting decisions, and to quantify the actual contribution of digital
marketing channels towards a product purchase and/or the revenue associated with that
purchase.
In line with commonly accepted estimates around the development of business intelligence
systems (Williams, 2011, p.57), the effort required to deliver this project was approximately
split as follows: 80-85% of the resources were allocated to the development of the Extract,
Transform and Load (ETL from hereon) processes, with the remaining 15-20% dedicated to
the statistical analysis and academic writing.
The State of the Art in the Digital Marketing Agency Industry
Digital marketing is often referred to with a myriad of different terms – amongst the most
popular are e-marketing, interactive marketing and online marketing. Brodie, Winklhofer,
Coviello and Johnston (2007) formally defined e-marketing as “using the Internet and other
interactive technologies to create and mediate dialogue between the firm and identified
customers”.
This discipline was added by the authors in 2001 to the Contemporary Marketing Practices
(CMP) classification originally developed by Coviello, Brodie and Munro in 1997 (Brodie,
Winklhofer, Coviello and Johnston, 2007) which included:
• Transaction marketing (TM) defined as “using the traditional ‘4P’ approach to attract
customers in broad market or specific segment”
• Database marketing (DM) defined as “using database tools to target customers in a
specific segment or microsegment of the market”
• Interaction marketing (IM) defined as “developing personal interactions between
employees and individual customers”
• Network marketing (NM) defined as “developing relationships with customers and
firms within the network”
Whilst some of the core marketing concepts from the aforementioned traditional disciplines
remain relevant to digital marketing, the digital marketing agency industry has historically
struggled to benefit from a closer integration with the natural synergies identified by Coviello
et al. (2001). Given its relevance to this thesis, it is worth highlighting the close relationship
between database marketing and e-marketing identified by Coviello et al., who stated that “eM
focuses on real-time dialogue that is enabled and mediated by information technology, and
so builds on and enhances DM. Rather than being a one-way relationship ‘to’ the customer
where databases are used to personalise communication, the interactive, technology-
enabled communication of eM is ‘with’ and ‘among’ many parties”.
The adoption of database technologies and advanced statistical tools across the digital
marketing agency industry to drive targeting decisions and to measure the commercial
contribution (e.g. products purchased/revenue) of various digital marketing channels remains
relatively low. The state of the art of the latter is illustrated by the current methodology used
across the digital marketing industry to measure the commercial contribution of digital
channels, commonly referred to as the last click model. This flawed concept is presented in
Figure 1:
[Figure 1 diagram, labels recovered from the original image]
Interaction 1: User clicks on a link ad in Google. The User is tracked based on the tracking code associated to Google’s ad campaign.
Interaction 2: User sees a banner ad in FT.com. The User is tracked based on the tracking code associated to the FT’s ad campaign.
Interaction 3: User clicks on a link ad in Yahoo. The User is tracked based on the tracking code associated to the Yahoo’s ad campaign.
Earlier visit: User is directed to Client site but does not purchase.
Final visit (after Interaction 3): User is directed to Client site and purchases the product.
Result: the Javascript tag in the Client website assigns the product purchase to the last click, i.e. the last interaction directing the user to the client site (the Yahoo user purchases the product).
Figure 1: Last click digital marketing measurement model
Within the above model, the contribution that interactions 1 and 2 (i.e. Google and FT.com
ads) might have made towards influencing the user to purchase the product is simply
ignored by the current industry standard method used to measure the performance of digital
marketing campaigns.
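The limitation can be made concrete with a minimal sketch that applies a last click rule to the three interactions of Figure 1. The record layout (`cookie_id`, `channel`, `kind`, `timestamp`) and the function name `last_click_credit` are illustrative assumptions, not the Doubleclick log schema or the actual tag logic.

```python
from collections import defaultdict
from typing import NamedTuple


class Interaction(NamedTuple):
    cookie_id: str   # hypothetical user cookie identifier
    channel: str     # e.g. "Google", "FT.com", "Yahoo"
    kind: str        # "click" or "impression"
    timestamp: int   # ordering key; any sortable value works


def last_click_credit(interactions, purchasers):
    """Give 100% of each purchase to the channel of the user's latest click."""
    clicks = defaultdict(list)
    for item in interactions:
        if item.kind == "click":
            clicks[item.cookie_id].append(item)
    credit = defaultdict(float)
    for cookie_id in purchasers:
        if clicks[cookie_id]:
            last = max(clicks[cookie_id], key=lambda item: item.timestamp)
            credit[last.channel] += 1.0
    return dict(credit)


# The journey of Figure 1: Google click, FT.com banner view, Yahoo click.
journey = [
    Interaction("u1", "Google", "click", 1),
    Interaction("u1", "FT.com", "impression", 2),
    Interaction("u1", "Yahoo", "click", 3),
]
credit = last_click_credit(journey, purchasers={"u1"})
print(credit)  # {'Yahoo': 1.0}
```

Google and FT.com receive no credit at all in this sketch, which is the behaviour the statistical attribution work in this thesis seeks to improve upon.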
With regard to targeting decisions, data mining of the ever increasing datasets generated
by the digital marketing platforms is not widely adopted across the industry and presents a
significant opportunity to digital marketing agencies looking to improve the quality and
relevancy of their campaign targeting decisions. Montgomery and Smith (2009) refer to this
notion as personalisation which “is meant to eliminate tedious tasks for the customer, and
allow the marketer to better identify the user’s needs and goals from past behaviour” (p.130).
They also recognised that “clickstream is underutilised and it is likely to take years before its
potential is fully leveraged” (p.133).
Aim and Objectives
The aim of this dissertation is to provide a practical business intelligence framework that
allows digital marketing agencies to implement a predictive analytics solution to better
address the current business challenges experienced across the industry highlighted in the
section above. To this end, a business-based project was deemed the most suitable
approach to meet the aforementioned objective. An agreement was reached with
Neo@Ogilvy, the performance marketing digital agency, to allow access to Client’s digital
marketing activity data for this purpose. The University of Dundee agreed to and supported
this proposal.
This project concentrates on three key areas:
• Developing the end data warehouse and associated ETL processes to
productionalise this framework in a high data volume, variety and velocity
environment in order to enable the statistical analysis of the areas below
• Targeting decisions: utilising user cookie level data (leaf level) to develop statistical
models to predict future marketing campaign response
• Channels commercial contribution: using user cookie level data (leaf level) to
statistically attribute the contribution of each of the digital channels towards a product
purchase and/or revenue associated to the purchase
Leaf data at user cookie level generated by the digital marketing platform Doubleclick,
owned by Google, was used as source data for the project. This platform records every user
interaction associated with any digital marketing campaign run by Neo@Ogilvy. This amounts
to millions of records on a daily basis across multiple markets and products.
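As an illustration of what working from leaf-level data implies, the sketch below collapses event-level records into one row per cookie, the general shape a statistical model expects. The tuple layout, the column names and the function `build_analytical_rows` are hypothetical and far simpler than the actual Doubleclick log files.

```python
from collections import Counter, defaultdict


def build_analytical_rows(events, purchases):
    """Collapse event-level records into one row per cookie.

    events: list of (cookie_id, channel) tuples, one per interaction.
    purchases: dict mapping cookie_id -> number of products purchased.
    """
    per_cookie = defaultdict(Counter)
    for cookie_id, channel in events:
        per_cookie[cookie_id][channel] += 1
    # Every row carries the same channel columns, zero-filled where needed.
    channels = sorted({channel for _, channel in events})
    rows = []
    for cookie_id in sorted(per_cookie):
        row = {"cookie_id": cookie_id}
        for channel in channels:
            row[channel] = per_cookie[cookie_id][channel]
        row["number_of_products"] = purchases.get(cookie_id, 0)
        rows.append(row)
    return rows


events = [("u1", "Google"), ("u1", "Yahoo"), ("u1", "Google"), ("u2", "FT.com")]
rows = build_analytical_rows(events, purchases={"u1": 1})
for row in rows:
    print(row)
```

At the volumes described above this aggregation would be performed in the data warehouse rather than in memory; the sketch only shows the target structure.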
The technical development of the above solution is underpinned by exhaustive secondary
research of the leading academic and business literature on the field of business intelligence
and analytics.
9. Conclusions
The digital marketing industry is quickly becoming a dominant commercial channel in the
UK, though it can be argued that its revenue growth is not matched by the use made of
advanced statistical techniques to address the two key areas this thesis focuses on.
The ETL component of this project was underestimated by the researcher during the initial
project planning phase, both in terms of data complexity and processing power required.
However, the solution developed is deemed to be robust and should be seen as an initial
step towards addressing some of the current data utilisation shortcomings across the digital
marketing industry. This component of the project was developed using Microsoft SSIS,
VB.NET and T-SQL, though the design logic of each step can be adapted to other platforms.
The CRISP-DM methodology followed throughout this thesis has provided a structured
framework to progress through the different stages of the project in a logical manner to
deliver against the agreed objectives. One criticism, however, concerns areas of
perceived duplication of tasks, as highlighted throughout the thesis.
The technology chosen for data storage and processing presented challenges in its ability to
deal with the volume of data generated by the digital marketing platform. With hindsight, a
potential alternative might have been a Hadoop platform which “allows for the distributed
processing of large datasets across clusters of computers using simple programming
models” (Hadoop, undated), combined with the MapReduce programming model, which
excels at parallel data processing across very large data sets. MapReduce
was developed as a system “for efficient large-scale data processing presented by Google in
2004 to cope with the challenge of processing very large input data generated by Internet-
based applications” (Marozzo, Talia & Trunfio, 2012, p.1382).
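The map and reduce idea can be illustrated without a cluster. The sketch below is plain single-process Python, not Hadoop code: it counts interactions per channel through explicit map, shuffle and reduce phases. The record layout and function names are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per log record: (channel, 1)."""
    _cookie_id, channel = record
    yield channel, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


records = [("u1", "Google"), ("u1", "Yahoo"), ("u2", "Google"), ("u3", "FT.com")]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)  # {'Google': 2, 'Yahoo': 1, 'FT.com': 1}
```

Because each call to `map_phase` depends on a single record and each call to `reduce_phase` on a single key, both phases can in principle be spread across many machines, which is what makes the model attractive for log files of this size.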
The implemented statistical analysis framework is a suitable option, albeit not the only one,
to deal with the way in which the log file data was distributed. It is important for anyone
looking to apply the methodology proposed in this thesis to fully understand how to match
the statistical technique chosen to the data available.
From a resource allocation perspective, within the context of the time available to the
researcher, the aim to provide an end to end predictive analytics solution might have been
too broad a scope for the project. The development of the data warehouse consumed 80-85%
of the time dedicated to the project, leaving limited margin to offer an in depth assessment of the
pros and cons of the different statistical techniques available to address the key objectives.
This is an area of keen interest that will be further explored and investigated beyond this
thesis.
10. Future Work
Traditional data warehousing technologies can be deployed to deliver a practical solution to
the challenges outlined at the start of the project. However, three key areas could be of interest
to further advance the depth and efficiency of such a solution. These have been outlined
throughout the different stages of the CRISP-DM methodology, but for the purpose of
clarity will be summarised below:
• Parallel processing technologies: research into how technologies such as Hadoop,
MapReduce, Hive, Pig, etc. can improve the processing of extremely large digital
marketing datasets
• Statistical techniques assessment: a thorough understanding of the different
statistical methods is of critical importance to fully unlock the potential of predictive
analytics. This thesis has not carried out such an in-depth assessment due to time
constraints, so it presents an area of great research potential for future work
• Recency, Frequency and Intensity: the time dimension is available in the dataset but
has not been utilised in the statistical modelling stage of this project. Given the
expected impact that time decay has on the effectiveness of marketing
communications, this subject is regarded as a significant opportunity to further
expand the predictive power of the statistical models
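As a hedged sketch of how the time dimension might enter a model, the function below derives a recency value, a frequency count and an exponentially decayed exposure weight from the age of each interaction. The function name, the feature names and the seven-day half-life are arbitrary assumptions for illustration; none of them was estimated or used in this project.

```python
import math


def time_features(days_before_purchase, half_life_days=7.0):
    """Summarise one user's interactions from their age in days.

    days_before_purchase: days between each interaction and the purchase
    (or scoring date). half_life_days is an assumed, untuned parameter.
    """
    if not days_before_purchase:
        return {"recency": None, "frequency": 0, "decayed_weight": 0.0}
    decay = math.log(2) / half_life_days
    return {
        "recency": min(days_before_purchase),      # age of the latest interaction
        "frequency": len(days_before_purchase),    # number of interactions
        # Each interaction counts for half as much per elapsed half-life.
        "decayed_weight": sum(math.exp(-decay * d) for d in days_before_purchase),
    }


features = time_features([0, 7, 14])
print(features)
```

With a seven-day half-life, interactions aged 0, 7 and 14 days contribute 1, 0.5 and 0.25 respectively, so older exposures still count but progressively less.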
11. References
Borle, S., Singh, S. & Jain, D. (2008) Customer Lifetime Value Measurement. Management
Science. [Online]. Available from:
http://web.ebscohost.com/ehost/detail?sid=9a2e89fb-989d-4e50-b86f-
d931b5d0b6fc%40sessionmgr15&vid=1&hid=18&bdata=JnNpdGU9ZWhvc3QtbGl2ZSZzY2
9wZT1zaXRl#db=buh&AN=29984784 [Accessed 28 May 2011]
Brodie, R.J., Winklhofer, H., Coviello, N.E. & Johnston, W.J. (2007) Is eMarketing coming of
age? An examination of the penetration of e-marketing and firm performance. [Online].
Available from:
http://www.sciencedirect.com/science/article/pii/S1094996807700191 [Accessed 8
September 2012]
Bucklin, R.E. & Sismeiro, C. (2009) Click Here for Internet Insight Advances in Clickstream
Data. [Online] Available from:
http://www.sciencedirect.com/science/article/pii/S1094996808000054 [Accessed 25 July
2012]
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. & Wirth, R.
(2000) CRISP-DM 1.0. [Online]. Available from:
ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserM
anual/CRISP-DM.pdf [Accessed 26 May 2012]
Delen, D., Cogdell, D., & Kasap, N. (2012) A Comparative analysis of data mining methods
in predicting NCAA Bowl Outcomes. [Online]. Available from:
http://www.sciencedirect.com/science/article/pii/S0169207011000914 [Accessed 5
December 2012]
Duke University (undated) Interpretation in Multiple Regression. [Online]. Available from:
http://www.stat.duke.edu/courses/Spring00/sta242/handouts/beesIII.pdf [Accessed 5
December 2012]
Enke, D. & Thawornwong, S. (2005) The use of data mining and neural networks for
forecasting stock market returns. [Online]. Available from:
http://www.sciencedirect.com/science/article/pii/S0957417405001156 [Accessed 5
December 2012]
Faraway, J. (2002) Practical Regression and Anova using R. [Online]. Available from:
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf [Accessed 6 August 2012]
Garcia-Navarro, R. (2011) Investigating Requirements Analysis. MSc Business Intelligence
module AC52035. Dundee: University of Dundee
Google (undated) The foundation for managing online ads. [Online]. Available from:
http://www.google.co.uk/doubleclick/advertisers/solutions/ad-serving.html [Accessed 16
November 2012]
Hadoop (undated) Welcome to Apache Hadoop. [Online]. Available from:
http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F [Accessed 16 December 2012]
Illinois State University. (undated) SPSS: Descriptive Statistics. [Online]. Available from:
http://psychology.illinoisstate.edu/jccutti/138web/spss/spss3.html [Accessed 5 December
2012]
Inmon, W.H. (2005) Building the Data Warehouse. 4th Edition. Wiley Publishing Inc.
Internet Advertising Bureau. (undated) H1 2012 Internet Advertising worth £2.6 billion.
[Online]. Available from:
http://www.iabuk.net/research/library/2012-h1-digital-adspend-factsheet-0 [Accessed 24
November 2012]
Jackman, S. (undated) Generalised Linear Models. [Online]. Available from:
http://jackman.stanford.edu/papers/glm.pdf [Accessed 5 December 2012]
Kimball, R. (2002) The Data Warehouse Toolkit. 2nd Edition. Wiley Publishing Inc.
Kimball, R. (2008) The Data Warehouse Lifecycle Toolkit. 2nd Edition. Wiley Computer
Publishing
Marozzo, F., Talia, D. & Trunfio, P. (2012) P2P-MapReduce: Parallel data processing in
dynamic Cloud environments. [Online]. Available from:
http://www.sciencedirect.com/science/article/pii/S0022000011001668 [Accessed 16
December 2012]
Microsoft. (2012) SQL Server 2012 Tutorials: Analysis Services - Data Mining. [Online].
Available from:
https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CFQQFjAA
&url=http%3A%2F%2Fdownload.microsoft.com%2Fdownload%2F0%2FF%2FB%2F0FBFA
A46-2BFD-478F-8E56-
7BF3C672DF9D%2FSQL%2520Server%25202012%2520Tutorials%2520-
%2520Analysis%2520Services%2520Data%2520Mining.pdf&ei=UkzcUPS7Aeml0QXpx4HI
CQ&usg=AFQjCNHNfl2s6R1-
UZFvTjCmASHNPlgTAw&sig2=n4TX2YaDd_1vVrYpMNvyVg&bvm=bv.1355534169,d.d2k
[Accessed 29 August 2012]
Montgomery, A.L. & Smith, M.D. (2009) Prospects of Personalisation on the Internet.
[Online]. Available from:
http://www.sciencedirect.com/science/article/pii/S1094996809000322 [Accessed 8
September 2012]
Mundy, J., Thornthwaite, W., & Kimball, R. (2011) The Microsoft Data Warehouse Toolkit.
Wiley Publishing, Inc.
Office of Government Commerce (2009) Managing Successful Projects with PRINCE2. 5th Edition.
TSO@Blackwell
Olson, D. & Chae, B.K. (2012) Direct marketing decision support through predictive
customer response modelling. [Online]. Available from:
http://www.sciencedirect.com/science/article/pii/S0167923612001881 [Accessed 5
December 2012]
Rattle. Graphical user interface for data mining in R. [Online]. Available from:
http://cran.r-project.org/web/packages/rattle/index.html [Accessed 12 October 2012]
Sharma, S., Osei-Bryson, K.M. & Kasper, G. (2012) Evaluation of an integrated Knowledge
Discovery and Data Mining process model. [Online]. Available from:
http://www.sciencedirect.com/science/article/pii/S0957417412002886# [Accessed 5
December 2012]
Whitehorn, M. (2011) Modelling a BI System – MSc Business Intelligence Lecture Notes.
Available from: https://my.dundee.ac.uk [Accessed 26 February 2011]
Williams, G. (2011) Data Mining with Rattle and R. Springer
Williams, G. (undated) Predicted versus Observed. [Online]. Available from:
http://datamining.togaware.com/survivor/Predicted_versus.html [Accessed 13 September
2012]
World Advertising Research Center (WARC). (2012) Adstats: Global adspend forecast. [Online].
Available from:
http://www.warc.com/Content/ContentViewer.aspx?MasterContentRef=d1e68f3e-c4da-
48ee-90cd-b37e48763f50 [Accessed 5 December 2012]