Soyez le premier à aimer ceci
Ad Networks act as the middleman between advertisers and publishers on the Internet. The advertiser is the agent that wants to allocate a particular ad in different medias. The publisher is the agent who owns the medias. These medias are usually web pages or mobile applications.
Each time an ad is shown in a web page or in a mobile application an impression event is generated. These impressions and other events are the source of analytical panels that are used by the agents (advertiser and publisher) to analyze the performance of its campaigns or its web pages.
Presenting these panels to the agents is a technical challenge because Ad Networks have to deal with billions of events each day, and have to present interactive panels to thousands of agents. The scale of the problem requires the usage of distributed tools. Obviously, Hadoop may come to the rescue with its storage and computing capacity. It can be used to precompute several statistics that are later presented in the panels.
But that is not enough for the agents. In order to perform exploratory analytics they need an interactive panel that allows to filter down by a particular web page, country and device in a particular time-frame, or whichever other ad-hoc filter.
Therefore, something more than Hadoop is needed in order to store the data and to perform some statical precomputations. At Datasalt, we have addressed this problem for some clients and we have found a solution than will be presented in the talk.
The solution includes two modules: the off-line and the on-line.
The off-line module is in charge of storing the received events and preforming the most costly operations: cleaning the dataset; performing some aggregations in order to reduce the size of the data; and create some file structures that will be used later to serve the on-line analytics. All these tasks are handled properly by Hadoop. The most innovative part on this process is the last step where some file-structures are created for being exported to the on-line part in order to serve the analytical panels.
The on-line module is in charge of serving the analytical queries received from the agents' panel webapp. The queries are basic statistics (count, count distinct, stdev, sum, etc) run over a subset of the input dataset represented by an ad-hoc filter. The challenge here is that the system has to serve statistics for filters “on the fly”. That makes it impossible to precalculate everything on the off-line side. Therefore, part of the calculations must be done on-demand. That would not be a problem if the scale of the data wouldn't be that big. Some kind of scalable database is needed for this task.