This document discusses building a recommendation system for IPTV as part of a content analytics system. It describes collecting user interaction data from IPTV, OTT, and other content delivery services. The data is ingested into a scalable analytics platform using Kafka and stored in Vertica. A collaborative filtering recommender model is built using Spark ML to generate recommendations without explicit ratings by estimating implicit ratings from usage data. Recommendations are stored back in Vertica and integrated into the IPTV platform via API to suggest new content to users.
2. About us
Focus on customer satisfaction
• LEADING DW/BI IMPLEMENTER IN SEE
• 200+ REALISED PROJECTS
• 90+ USERS IN 20 COUNTRIES
• 110+ EMPLOYEES
• 90+ CONSULTANTS/IMPLEMENTERS
• 70 TECHNICAL CONSULTANTS
• 20 BUSINESS CONSULTANTS
• 5 PROJECT MANAGERS
• OVER 600 MAN-YEARS OF EXPERIENCE IN LEADING TECHNOLOGIES
3. Our business expertise fields
Innovative approach in the business decision support area
Strategic ICT consulting
• Analysis
• Design
• Development
• Implementation
• Support
• Education
4. Introduction – Content delivery services & CX for content delivery services
PI Content Analytics System – Business requirements & Technical solution
Implementation of recommendation system as part of our Content Analytics System
Conclusion
Agenda
5. Communication operators offer their consumers many
services that enable them to consume video content, using
different fixed and mobile technologies through different devices,
either at home or on the move.
A big part of the offer is understanding the user's needs. In the
operators' case the data is available, but there are systems
built for this specific purpose.
Introduction
6. Analysis of consumers' behaviour in real time and over a
longer period of time gives operators the possibility to
• maximize revenue
• minimize costs
• Serve their customers better
• Reach the highest possible level of customer experience
For that reason, operators are using sophisticated
recommendation engines to propose to their consumers
content that may be interesting for them.
Introduction
7. • Enable consumers to watch broadcasts only in real time
• Include following types of services:
• Digital Terrestrial TV
• Digital Satellite TV
• Digital Cable TV
• Usually the data cannot be bound to a particular customer
Broadcasting services
8. • Using Internet and IP protocol for distribution of the content.
• Consumers can watch real-time broadcasts and historical broadcasts that are
stored on providers’ infrastructure.
• Includes the following types of services:
• Internet Protocol television (IPTV)
• Over the Top content services (OTT)
• Mobile TV (TV on the go)
Streaming services
9. IPTV vs. OTT
 | IPTV | Over-the-top technology
Content provider | Local telecom | Studio, channel, or independent service
Transmission network | Local telecom - dedicated network | Public Internet + local telecom
Receiver | Local telecom provides (set-top box) | Purchased by consumer (TV, computer or mobile)
Display device | Screen provided by consumer | Screen provided by consumer
10. Analytical systems for content delivery services shall enable operators to:
• Understand consumers’ behavior, which content they consume, through which
channel at what time and on what device
• Analyze the performance of offered service packages and package options
• Use rating information to negotiate content with content providers
• Segment consumers based on their behavior
• Approach consumers with appropriate offers and recommendations
Business requirements
11. Architecture of the solution
Scalable architecture that can process up to hundreds of millions of records daily
IPTV, Digital TV
Video on demand
OTT providers
Channels
Devices
Recommendations / Open API
Insights
ML / Predictive
models
12. • The solution supports both batch processing and real-time streaming
• Data can be loaded to Vertica by running a series of COPY statements, each of which
loads a small amount of data into the Vertica database
• For real-time streaming, the Kafka integration feature can be used to automatically load data
into the Vertica database as it streams through a Kafka channel
Technical solution – Flexible data ingestion
13. • Data enters Kafka as a message, typically in JSON or AVRO format
• A custom parser can be built for other data formats.
• Feeds of messages in a common category come together to form topics.
• Kafka divides the topics up into partitions that can be fed in parallel to
Vertica target tables for further analysis.
Technical solution – Kafka integration feature
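The idea of splitting a topic into partitions can be illustrated with a minimal Python sketch. It is not the Vertica or Kafka integration itself: the field names (`subscriber_id`, `item_id`, `seconds`) are hypothetical, and `zlib.crc32` is only a stand-in for the hash used by Kafka's own partitioner. It shows JSON messages being routed by key, so that all events for one subscriber land in the same partition and the partitions can be loaded in parallel.

```python
import json
import zlib
from collections import defaultdict


def assign_partition(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration; Kafka's partitioner uses a different function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route_messages(messages, num_partitions):
    """Group JSON-encoded usage events by partition, keyed on subscriber_id."""
    partitions = defaultdict(list)
    for raw in messages:
        event = json.loads(raw)
        partition = assign_partition(str(event["subscriber_id"]), num_partitions)
        partitions[partition].append(event)
    return dict(partitions)


# Hypothetical usage events, as they might arrive on one topic.
messages = [
    json.dumps({"subscriber_id": "s1", "item_id": "m10", "seconds": 1800}),
    json.dumps({"subscriber_id": "s2", "item_id": "m11", "seconds": 45}),
    json.dumps({"subscriber_id": "s1", "item_id": "m12", "seconds": 600}),
    json.dumps({"subscriber_id": "s3", "item_id": "m10", "seconds": 3000}),
]

routed = route_messages(messages, num_partitions=3)
for partition, events in sorted(routed.items()):
    print(partition, [e["item_id"] for e in events])
```

Keying on the subscriber is one possible design choice; keying on the content item would instead group all views of one show together.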
14. Real-time usage reports and
dashboards
• Calculate different metrics, Apply
filters
• Drill-down to fine granularity data
• Time series (trend) analysis
• Map view
Business Solution
Analytics
15. Detailed usage and behavior analytics
• Consumer
• Channel
• Content item
• Device
• Operating system
• Delivery type (live, catchup, VOD...)
• Action that was taken by user to view specific
content item
Predictive models for segmentation,
recommendations and cross & up-sell
Business Solution
Analytics
17. • Information overload problem
• Improve customer experience
• Increase revenue (cross-sell /
upsell)
Approach consumers with
appropriate offers and
recommendations
Recommender systems for content delivery
18. Two main approaches:
• Content-based recommender systems – use profile information filtering
• information from customer profile (demographic data, answers from a suitable questionnaire), and
• information about the content and its attributes (e.g. for movies: the genre, director, starring
actors, box office popularity)
• Collaborative filtering recommender systems – use interaction information
• explicit user ratings, or user interactions with content delivery platform
• make predictions (filtering) about the interests of a user by analysing preferences from many users
(collaborating)
• CF methods usually produce better results, but have one important disadvantage – the cold start problem
(they cannot make predictions for new users)
Hybrid recommender systems
Approaches
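A minimal sketch of the collaborative filtering idea, assuming a tiny hand-made ratings dictionary (all user names, item names and ratings are invented for illustration). It uses a simple neighbourhood method with cosine similarity, not the model-based approach chosen later in the deck, and it also shows the cold start problem: a user with no history gets no prediction.

```python
from math import sqrt
from typing import Optional


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict(ratings: dict, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += similarity
    # No similar user has rated the item: no prediction (cold start).
    return numerator / denominator if denominator > 0 else None


# Invented example data.
ratings = {
    "anna": {"film_a": 5, "film_b": 4},
    "boris": {"film_a": 5, "film_b": 4, "film_c": 5},
    "clara": {"film_b": 1, "film_c": 1, "film_d": 5},
    "new_user": {},
}

print(predict(ratings, "anna", "film_c"))      # dominated by boris, who is most similar
print(predict(ratings, "new_user", "film_c"))  # None: nothing to compare against
```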
19. Important issues for recommender system for content delivery are:
• Absence of rating results from users for delivered content items
• Same user account is used by one or more people in the same household, thus resulting in a
user behavior that’s the union of the behaviors of all household members
• Available items for recommendations are constantly changing
• Prices for content change over time
• Some additional rules shall be applied for particular users (e.g. filtering of adult-only content)
• Recommendations must be created very often to be able to have current recommendations for
a large number of users
Issues
20. Input data:
• Source data for the recommendations engine is stored
in the Vertica platform
• It is automatically refreshed and maintained
on a daily basis
• Two main tables:
o user-item interaction table
o items metadata
Recommender system as a part of PI Content
Analytics system
21. • For video-on-demand type of content on the IPTV platform we used
model-based collaborative filtering, where users and items are both
represented by a set of latent factors.
• Factors, or features, are inferred from the rating patterns and characterize
users and items. They do not necessarily have to be human-interpretable;
they are implicitly present, computer-calculated dimensions
used as characterizations of users/items in the calculations.
• Matrix factorization techniques are
used to learn these factors
Chosen approach
22. • Spark ML, Python
• Vertica Connector for Apache Spark
• Spark uses ALS (alternating least squares) to minimize the squared error
on the set of known ratings in order to learn the latent factors.
• ALS works by alternating between fixing one side and solving a least squares
problem on the other, and vice versa. These steps are performed
iteratively until convergence.
• Each user or item factor vector is computed independently of the others, so Spark can leverage parallelization
Tools and techniques
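The alternation described above can be sketched in plain Python. This is a small illustrative stand-in, not Spark ML's ALS implementation: the ratings, the rank of 2, the regularisation weight `reg` and all function names are assumptions made for the example. With item factors fixed, each user's factor vector is the solution of a small regularised least squares problem, and vice versa.

```python
import random


def solve(a, b):
    """Solve the small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def update_side(fixed, observed, rank, reg):
    """Hold `fixed` factors constant and solve least squares for every row of the other side."""
    updated = {}
    for row_id, pairs in observed.items():  # rows are independent of each other
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for fixed_id, rating in pairs:
            f = fixed[fixed_id]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        updated[row_id] = solve(a, b)
    return updated


def als(ratings, rank=2, reg=0.1, iterations=20, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for (user, item), rating in ratings.items():
        by_user.setdefault(user, []).append((item, rating))
        by_item.setdefault(item, []).append((user, rating))
    items = {item: [rng.random() for _ in range(rank)] for item in by_item}
    users = {}
    for _ in range(iterations):
        users = update_side(items, by_user, rank, reg)  # fix items, solve users
        items = update_side(users, by_item, rank, reg)  # fix users, solve items
    return users, items


def predict(users, items, user, item):
    return sum(p * q for p, q in zip(users[user], items[item]))


def regularised_loss(ratings, users, items, reg=0.1):
    error = sum((r - predict(users, items, u, i)) ** 2 for (u, i), r in ratings.items())
    penalty = sum(x * x for vec in list(users.values()) + list(items.values()) for x in vec)
    return error + reg * penalty


# Invented known ratings: (user, item) -> rating.
ratings = {
    ("u1", "a"): 5, ("u1", "b"): 4, ("u1", "c"): 2,
    ("u2", "a"): 5, ("u2", "b"): 5, ("u2", "d"): 2,
    ("u3", "c"): 5, ("u3", "d"): 4, ("u3", "a"): 2,
    ("u4", "c"): 4, ("u4", "d"): 5, ("u4", "b"): 2,
}
users, items = als(ratings)
print("loss:", regularised_loss(ratings, users, items))
print("predicted u1 -> d:", predict(users, items, "u1", "d"))
```

Because every row in `update_side` is solved on its own, the loop over rows is the part a parallel engine can distribute.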
23. • IPTV platform does not provide explicit user ratings, so we used the usage data to estimate users'
preferences, i.e. to assign implicit ratings, based on the percentage of the show duration they
watched, by the following formula:
$$
r_u(i) =
\begin{cases}
5, & \mathrm{pct}_u(i) \ge 1 \\
2 + 3 \cdot \mathrm{pct}_u(i), & 0 < \mathrm{pct}_u(i) < 1
\end{cases}
$$
$r_u(i)$ = estimated rating of user u to item i (minimum rating = 1, maximum rating = 5)
$\mathrm{pct}_u(i)$ = percentage of the show i the user u watched (0.8 means he watched 80% of the show
duration)
The more he watches, the greater the level of confidence in his preference estimation
Implicit ratings are inferred from the users' activities
Implicit ratings
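The formula above translates directly into a small Python function. The function name is hypothetical, and returning `None` for a show that was not watched at all is an assumption, since the formula is only defined for a watched fraction greater than zero.

```python
from typing import Optional


def implicit_rating(pct_watched: float) -> Optional[float]:
    """Estimate a rating from the watched fraction of a show (0.8 = 80%)."""
    if pct_watched <= 0:
        # Assumption: no viewing means no rating, as the formula does not cover this case.
        return None
    if pct_watched >= 1:
        return 5.0
    return 2.0 + 3.0 * pct_watched


for pct in (0.1, 0.5, 0.8, 1.0):
    print(pct, implicit_rating(pct))
```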
24. • The recommender engine processing is scheduled and performed on a daily basis, during
the low activity periods
• The output of the processing is top 10 recommendations for every user
• The results are stored in Vertica and fetched to display to the user as his top 10
recommendations while he is looking for something to watch (integration with IPTV
platform through API)
• To support real-time broadcasts, the recommendation table shall be extended
with the show start timestamp, so that the user can be offered the show that has not started
yet and will start soonest.
• Cold start problem – most popular items recommendations
Storing results and generating recommendations
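A minimal sketch of how stored results might be served, assuming the recommendation table and the most-popular list have already been read into plain Python structures (all names and data below are hypothetical). It shows the cold start fallback to the most popular items and the ordering of upcoming broadcasts by start timestamp.

```python
from datetime import datetime


def top_n(user_id, stored, most_popular, n=10):
    """Return the user's stored recommendations, or the most popular items for unknown users."""
    recs = stored.get(user_id)
    if not recs:
        return most_popular[:n]  # cold start: no personal recommendations yet
    return recs[:n]


def upcoming(recs_with_start, now, n=10):
    """Keep only shows that have not started yet, soonest first."""
    future = [(start, item) for item, start in recs_with_start if start > now]
    future.sort()
    return [item for _, item in future[:n]]


# Hypothetical data standing in for the tables kept in Vertica.
stored = {"s1": ["m12", "m7", "m3"]}
most_popular = ["m1", "m2", "m3"]

print(top_n("s1", stored, most_popular))     # personal list
print(top_n("s999", stored, most_popular))   # fallback list

now = datetime(2020, 1, 1, 20, 0)
broadcasts = [
    ("news", datetime(2020, 1, 1, 19, 30)),   # already started
    ("film", datetime(2020, 1, 1, 21, 0)),
    ("match", datetime(2020, 1, 1, 20, 15)),
]
print(upcoming(broadcasts, now))
```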
25. • Recommender systems are playing a very important role in improving
customer experience in many digital industries
• We designed and implemented a Content Analytics System with two main
features:
o capability of fast processing of semi-structured and unstructured
heterogeneous data originating from digital content delivery services
o recommendation engine that continuously gathers data and generates
recommendations
Conclusion
Our playground is the whole world, and these are just some of the areas we work in:
• data warehouse implementation (DWH)
• big data analytics
• data integration (data lake)
• business intelligence (BI)
• data mining
• risk management
• master data management
overwhelming for customers
aid in the discovery process
unpersonalized recommendations (most popular products)
Recommender systems have been used for many years in e-commerce.
Amazon, Netflix, YouTube
The Netflix Prize contest in 2006 had a big impact on recommender systems for content delivery
2 types of information - 2 main approaches:
CONTENT
-> profiles for users and products to characterize them;
user profile: age, gender, education, interests, etc. + product profile
-> provide recommendation by matching user’s interests with description and attributes of items (content)
CF
-> predicting what users will like based on their similarity to other users; crowd wisdom
-> The underlying assumption is that if user A has the same opinion as user B on an issue, A is more likely to have B's opinion on a different issue.
-> generally more accurate, captures data aspects that are difficult to profile using content filtering
-> neighborhood methods and latent factor models
Detailed historical data on customer activities is used
The user-item table describes customer activities, i.e. which shows the user watched, when, for how long, etc.
The data is aggregated on the level of subscriber id and item id.
The data is loaded using defined rules that filter out records not related to actual content consumption (under 30 seconds: browsing through the channels; a couple of hours: TV left on)
Content item table - info about TV shows (title, type, length, genre, description, director, year, etc.)
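The filtering and aggregation described in these notes can be sketched as follows. The 30-second lower bound comes from the notes; the 3-hour upper bound is an assumed value for "couple hours", and the record layout and names are hypothetical. The output is the watched fraction per subscriber and item.

```python
from collections import defaultdict

MIN_SECONDS = 30            # from the notes: shorter views count as channel browsing
MAX_SECONDS = 3 * 60 * 60   # assumption: longer views count as "TV left on"


def aggregate(events, item_length):
    """events: iterable of (subscriber_id, item_id, seconds_watched).

    Returns {(subscriber_id, item_id): fraction of the item watched}.
    """
    watched = defaultdict(int)
    for subscriber_id, item_id, seconds in events:
        if seconds < MIN_SECONDS or seconds > MAX_SECONDS:
            continue  # not actual content consumption
        watched[(subscriber_id, item_id)] += seconds
    return {key: secs / item_length[key[1]] for key, secs in watched.items()}


# Hypothetical raw viewing records and item lengths in seconds.
events = [
    ("s1", "m10", 10),      # dropped: browsing
    ("s1", "m10", 900),
    ("s1", "m10", 900),
    ("s2", "m11", 14400),   # dropped: TV left on
    ("s2", "m10", 3600),
]
item_length = {"m10": 3600, "m11": 5400}

print(aggregate(events, item_length))
```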
Factors not necessarily human-interpretable; they are implicitly present computer-calculated dimensions used as characterizations of users/items in the calculations.
V2S & S2V
- Spark DataFrames to Vertica tables
- data from Vertica to Spark RDDs or DataFrames for use with Python, R, Scala and Java
ALS - one of the methods used in matrix factorization models
- since there are two unknowns in the optimization process – item features vector and user features vector – ALS alternates between ..
- user and item factors are computed independently of other user/item factors, and Spark is a parallel data processing engine -> the ALS technique can be leveraged for better performance on large datasets
The logic behind it is that, if the user at least started to watch the show, that means he showed interest for it, and his rating will be at least two.
Implicit ratings are inferred from the users' activities; recommendations are based only on the users' past behavior.
- 10 items that the algorithm predicted the user will most likely want to watch
The list of most popular items is also kept in a table and regularly refreshed.
one of the most important applications is for content delivery