Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. To enable this, we built a time machine that computes features for any arbitrary time in the recent past for offline experimentation. We also built a real-time stream processing system to capture the interests of members during different times of the day and to quickly adapt to changes in the collective interests of members as they happen during real-world events.
Building the time machine for offline experimentation and the real-time infrastructure for online recommendations with Apache Spark (Streaming) and Apache Cassandra empowered us to both scale up the data size by an order of magnitude and train and validate the models in less time. We will delve into the architecture and use case details, cover the data models used for Cassandra, and share our learnings.
About the Speakers
Prasanna Padmanabhan Engineering Manager, Netflix
Prasanna leads the Data Systems for Personalization team at Netflix. His primary focus is on building various big data infrastructure components that help their algorithmic engineers innovate faster and improve personalization for Netflix members. In the past, he has built distributed data systems that leverage both batch and stream processing.
Roopa Tangirala Engineering Manager, Netflix
Roopa Tangirala is an experienced engineering leader with an extensive background in databases, be they distributed or relational. She manages the database engineering team at Netflix, responsible for operating the cloud persistent and semi-persistent runtime stores for Netflix, which include Cassandra, Elasticsearch, Dynomite, and MySQL databases, ensuring data availability, durability, and scalability to meet the growing business needs.
5. Ranking
Everything is a Recommendation
Over 80% of what members watch comes from our recommendations. Recommendations are driven by Machine Learning Algorithms.
6. Data Driven
[Flow diagram] Offline Experiment using Historical Data → (Success) → Online A/B Testing → (Success) → Rollout Feature to ALL Members; (Fail → start over)
Algorithmic Page Generation
Trending Now
9. Algorithmic Page Generation
Without Algorithmic Page Generation vs. With Algorithmic Page Generation
Drawbacks of the rule-based approach: it ignores the diversity of the page and members' affinity for specific rows
16. Algorithmic Page Generation
Production, Variant 1, Variant 2: evaluate the best variant based on the actual plays
17. Offline Experiment Architecture
[Architecture diagram] Components: Member Selection (runs once a day); Ratings Service, Viewing History Service, and MyList Service → Snapshot → S3 Snapshot Store; Snapshot Forklift → Data Snapshots; Generate Pages → Evaluate Metrics → A/B Test
18. Data Model - Requirements
• Need for historical service data
• Optimize for Batch Writes and Point Reads
28. Trending Now - Data Infrastructure
[Architecture diagram] Components: Impression Service (captures videos shown in the view port); Viewing History Service (captures videos played by members); UI and Online Services; Compute Trends → Trends Store; Model Training (inputs: trends, Viewing History, Ratings) → Publish Models
29. State Management in Cassandra
Video                      Number of Plays
Stranger Things            100
Narcos                     200
Orange Is the New Black    300
30. State Management in Cassandra
[Flow diagram] Read Events → State Present? (Yes → Load State; No → Init State from Cassandra) → Compute Trends → Update State → Trends Store
31. Data Model - Requirements
• Trending data is for a specific interval of time
• Optimize for Batch Writes and Batch Reads
Good afternoon everyone. My name is Prasanna. I lead the Data Systems for Personalization team at Netflix. Our team builds the Machine Learning infrastructure that powers Netflix recommendations and I have Roopa with me who leads the Cloud Database Engineering team at Netflix. Today, we are going to talk about a few use cases where we use Spark + Cassandra in our data pipelines and share some of the learnings from it.
At Netflix, we aspire to a day when our members can turn on Netflix and the absolute best content for them has already started playing. While we know we are far from realizing this dream, it sets a vision for us to improve the recommendations that span our service. So, where do we use recommendations in our service?
Our journey of building recommendation systems started with predicting the rating our members would give a video and, based on that, recommending appropriate videos.
That later evolved into creating meaningful grouping of videos and being able to personalize the videos within each group.
Today, we have a multitude of algorithms for doing recommendations. Not only are the videos within a row personalized for you, but the rows themselves are personalized for you. Over 80% of what our members watch comes from the videos that are recommended to them, which are driven by machine learning algorithms. So how do we improve these algorithms to realize our grand vision?
Just like everything else at Netflix, we follow a data-driven approach to improving our recommendations. Once we have an idea, we run an offline experiment using historical data to see if the new idea would have made better recommendations. If it did, we deploy it to an online A/B test to see if it performs well in production too.
We look at various metrics such as viewing hours, member retention, and member satisfaction to evaluate the success of an A/B test. If the A/B test is a success, we roll out that feature to ALL members. If not, we go back to the whiteboard, come up with a better idea, and start the offline experiment over.
For the rest of my talk, I'm going to take one use case of offline experimentation and one use case of an online A/B test and walk through how we use Spark and Cassandra to help improve recommendations.
As we saw earlier, Offline Experiment is a step prior to doing an A/B test. It helps us decide if an idea is even worth doing an A/B test.
Let's take the use case of algorithmic page generation for offline experimentation. How can we personalize the ordering of rows on the homepage for each member?
We initially had a rule-based approach to page generation. For example, the rules could specify that the 1st row be Continue Watching, the 2nd row be Top Picks, and so on. The drawback of this approach is that it takes into account neither the diversity of the page nor our members' affinity for specific rows.
Algorithmic page generation addresses these issues by personalizing the rows and their ordering on the home page based on our members' viewing patterns, the diversity of the page, and many more attributes.
Let’s take an example to see how we evaluate different pages algorithmically. Say this is a page that a member sees based on the current Production algorithm.
Variant 1 is a new page that was generated with a new algorithm
And Variant 2 is another page that was generated with yet another new algorithm. We first look at some basic things, like the row distribution (for example, how many members see a Continue Watching row) and the TV/movie ratio (does one variant over-index on, say, TV shows?).
More importantly, we look at the actual videos that were played by the member and find the best page that could have made those videos easily discoverable. In this case, say the member played Hot Rod, The Short Game and Family Guy.
We can see that Hot Rod was recommended in all three versions of the page, but Production and Variant 2 recommended it much higher on the page.
Similarly, The Short Game was also recommended in all three versions of the page, but Variant 2 again recommended it much higher on the page.
We also look at negative samples. Family Guy was a video that was played but not recommended in any version of the page (probably our members searched for it). We typically consider this a failure in our recommendations. Given this data, we would choose Variant 2 as the winning page algorithm, as it surfaces the videos our members played much higher on the page.
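The rank-based comparison just described can be sketched in a few lines. This is a toy illustration: the page layouts, video lists, and scoring rule are made up for the example, not Netflix's actual evaluation metric.

```python
# Toy sketch of offline page evaluation: reward pages that surface the
# videos a member actually played as high up the page as possible.
# Page contents and the scoring rule are illustrative assumptions.

def page_score(page, played, miss_penalty=100):
    """Lower is better: sum of the ranks at which played videos appear.
    A played video missing from the page costs a fixed penalty."""
    rank = {video: i for i, video in enumerate(page, start=1)}
    return sum(rank.get(video, miss_penalty) for video in played)

production = ["Narcos", "Hot Rod", "Bojack", "The Short Game"]
variant_1 = ["Bojack", "Narcos", "The Short Game", "Hot Rod"]
variant_2 = ["Hot Rod", "The Short Game", "Narcos", "Bojack"]

# Family Guy was played but is not on any page: a miss for all variants.
played = ["Hot Rod", "The Short Game", "Family Guy"]

scores = {
    "production": page_score(production, played),
    "variant_1": page_score(variant_1, played),
    "variant_2": page_score(variant_2, played),
}
best = min(scores, key=scores.get)
```

With these toy layouts, Variant 2 wins because it places both played videos at the top of the page while all variants pay the same penalty for the missing Family Guy.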
Now let's look at the offline experimentation architecture that made this possible. The most critical requirement for building an offline experiment is the ability to travel back in time and generate the page our members would have seen if they had used our service at a given time in the past. We built this ability to time travel by snapshotting data from our various online services and using that snapshot data to generate the experimental page.
The first step in building the snapshot infrastructure is to select the set of members for whom we need to snapshot data. Snapshotting data for all our members would be an expensive operation. Instead, we select a stratified set of members based on tenure, viewing patterns, the devices they use, etc.
Once we have the set of members, the next step is to snapshot the data of various online services, such as Ratings, Viewing History, and MyList, that help improve personalization. As you folks might be aware, Netflix embraces a fine-grained service-oriented architecture for our cloud-based deployment model.
The snapshot data is then stored in S3 in nested Parquet format for both space and time efficiency. Many of our offline experiments run inside Spark and can directly consume the snapshot data from S3. However, for algorithmic page generation, we need to consume this snapshot data one member at a time. This is because we reuse our existing online systems, which generate the page for a live user request, to also generate the experimental page given the snapshot data. S3 is not suited for random seeks into the data stored in it.
We know Cassandra is well suited for this use case. To that end, we used Spark to read the snapshot data from S3 and write it into Cassandra. We used the Spark Cassandra connector, which took care of the nitty-gritty details of connecting to all the Cassandra nodes in the ring, maintaining the connection pool, doing retries, and optimizing the reads and writes to Cassandra.
Once the data is available in Cassandra, we can get the state of the Netflix data services for any given member and timestamp in the past. We can then generate the experimental pages for this member based on the new algorithm, evaluate the metrics needed to see which of the page algorithms could have made better recommendations, and, if there is a clear winner, deploy it to an A/B test.
Before we look into the data models we used in Cassandra, there were two requirements we needed to address when building the data model:
• The need for storing historical data from various data services such as Ratings and Viewing History; this is the core of building the time machine.
• Optimize the data model for the batch writes that happen from Spark and for point reads from the online systems during page generation.
Here is the data model that we used for storing our members' MyList data for the offline experiment. So yes, the obvious thing is to have different column families for different data services.
Date and MemberId concatenated together formed the row key. The column name was a static string, and its value was a blob of the MyList data for that member. With this data model, a query for a member's MyList data at a given timestamp in the past translates to a point read, which is very efficient in Cassandra.
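A minimal sketch of this key scheme, with a plain dict standing in for the Cassandra column family (all names and data shapes are illustrative, not the production schema):

```python
# Sketch of the MyList snapshot data model: (date, member_id) forms the
# row key, and the value is an opaque blob of that member's MyList data.
# An in-memory dict stands in for the Cassandra column family.

mylist_cf = {}

def row_key(date, member_id):
    # Date and MemberId concatenated together form the row key.
    return f"{date}:{member_id}"

def write_snapshot(date, member_id, blob):
    mylist_cf[row_key(date, member_id)] = blob

def point_read(date, member_id):
    # Equivalent of a single-partition point query in Cassandra.
    return mylist_cf.get(row_key(date, member_id))

write_snapshot("2017-06-05", 42, b"[Narcos, Bojack]")
```

The point is that a time-travel lookup (member + date) maps to exactly one key, so the online page-generation path pays only a single point read.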
However, a similar data model would not work for storing Viewing History. This is because the viewing history data for a member could be very big and would become a wide row, causing heap pressure, which in turn would affect latencies.
To avoid the issues of having a wide row, we divided the rows into a predefined set of shards. In this case, Date, MemberId, and the shard index become the row key, and the viewing data blob is the column value.
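The sharding scheme can be sketched like this; the shard count and the hash choice are illustrative assumptions, not the production values:

```python
# Sketch of sharding a wide Viewing History row across a fixed number of
# shards: (date, member_id, shard_index) forms the row key, so no single
# Cassandra partition holds a member's entire viewing history.
NUM_SHARDS = 10  # illustrative; the real shard count is a tuning choice

def shard_key(date, member_id, view_event_id):
    # A real system would use a stable hash; Python's hash() of an int
    # is deterministic, which is enough for this sketch.
    shard = hash(view_event_id) % NUM_SHARDS
    return (date, member_id, shard)

# One member's events spread over at most NUM_SHARDS row keys.
keys = {shard_key("2017-06-05", 42, event_id) for event_id in range(1000)}
```

Bounding each partition to a slice of the history keeps row sizes predictable, at the cost of fanning a full-history read out to NUM_SHARDS point reads.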
Now let's focus on the next use case: how we used Spark (Spark Streaming, to be precise) + Cassandra for an online A/B test.
The Trending Now row captures the videos that are trending, personalized for you.
Here is a screenshot of my Trending Now row at 7 pm on a Monday, when my daughter takes control of our remote.
Here is a screenshot of my Trending Now row at 10 pm on a Saturday. It's ME time, and the ACTION finally begins.
Oh yeah, Pokemon’s impact was seen on Netflix too
The key to building a Trending Now row is a fast feedback loop. Netflix is supported on thousands of devices, each sending various types of data that help improve personalization, such as a play event, or the fact that a video was recommended to a member and not played, and so on. We built various data systems that capture this data in real time. Once we have the required data for personalization, we built several Spark Streaming applications that read it and compute the trends data, all in real time. The trends data is fed as input to our recommender systems, which then look at a member's taste and personalize the Trending Now row.
Let's dive a little deeper into the architecture. We capture all user interactions within our service. For example, the videos that are recommended to our members and shown in their view port are captured by the Impression Service. Similarly, the videos that are played by our members are captured by the Viewing History Service. Both these data services send those events into Kafka.
Spark Streaming jobs consume these events from Kafka and compute the Trends data required for building the Trending Now row. This trends data is persisted into Cassandra. Again, we use the Spark Cassandra connector in the Spark Streaming job to batch write all the Trends data into Cassandra. The one thing that we had to configure in the connector was the connection timeout, which is different for a Streaming job.
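A toy, non-Spark simulation of what one micro-batch of that computation does. Event shapes and field names are made up for illustration; the real job consumes Kafka via Spark Streaming and batch-writes the result to Cassandra through the connector.

```python
# Toy simulation of the trend computation for a single micro-batch:
# aggregate play and impression events into per-video counts, the shape
# of the record that would be batch-written to the Trends store.
from collections import Counter

def compute_trends(micro_batch):
    plays = Counter(e["video"] for e in micro_batch if e["type"] == "play")
    impressions = Counter(
        e["video"] for e in micro_batch if e["type"] == "impression")
    return {v: {"plays": plays[v], "impressions": impressions[v]}
            for v in plays.keys() | impressions.keys()}

batch = [
    {"type": "play", "video": "Narcos"},
    {"type": "impression", "video": "Narcos"},
    {"type": "impression", "video": "Bojack"},
]
trends = compute_trends(batch)
```

Keeping impressions alongside plays is what lets the downstream model reason about take rate (plays per impression) rather than raw play counts alone.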
The trends data is then combined with data from services such as Viewing History and Ratings and fed as input to the model training job, whose output is a model consumed by online services to produce the personalized Trending Now recommendations for the next time interval.
We also use Cassandra for managing the state of our Spark Streaming jobs. So what is state in Spark Streaming? Think of it as a simple key-value pair that gets updated continuously as events happen in real time. Let's say we simply want to count the number of times a video is played. In that case, the videoId and the count form the state.
Spark provides a way to bootstrap the state when your streaming job is restarted or started for the first time.
For Trending Now, as we read events from Kafka, we first check if the state is present. If it is, we combine the existing state with the new event read from Kafka, perform the computations, and update the new state back into Cassandra. If not, we load the state data from Cassandra into Spark as part of bootstrapping.
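That bootstrap-or-update flow can be sketched with plain dicts standing in for the Spark Streaming state and the Cassandra trends store. This is a simplified illustration of the logic, not the actual streaming job.

```python
# Sketch of state management: on each event, use in-memory state if
# present; otherwise bootstrap it from Cassandra first. Dicts stand in
# for Spark Streaming state and the Cassandra store.
cassandra_state = {"Narcos": 200}  # counts persisted by a prior run
spark_state = {}                   # in-memory streaming state

def on_play_event(video):
    if video not in spark_state:
        # State not present: init from Cassandra (0 if never seen).
        spark_state[video] = cassandra_state.get(video, 0)
    spark_state[video] += 1
    # Write the updated state back so a restarted job can bootstrap.
    cassandra_state[video] = spark_state[video]

on_play_event("Narcos")
on_play_event("Stranger Things")
```

Persisting the state externally is what lets a restarted (or newly deployed) streaming job resume counting where the previous run left off.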
Let's look into the data model we used for Trending Now. The two main requirements to address were: 1) trends data is applicable only for a specific interval of time, and 2) we should optimize for both batch writes and batch reads, as model training happens inside Spark.
It's kind of obvious that we had to create separate column families for separate intervals of time, primarily because, given a time interval, we only need the data from that interval. VideoId, along with some metadata such as country and timezone, formed the row key. We had two columns that contained the plays and impressions data for each video.
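A dict-based sketch of that layout, with one table per interval; the interval format, key fields, and values are illustrative assumptions, not the production schema:

```python
# Sketch of the Trending Now data model: one table (column family) per
# time interval; (video_id, country, timezone) forms the row key, with
# plays and impressions as the two columns.
trends_by_interval = {}

def write_trend(interval, video_id, country, tz, plays, impressions):
    table = trends_by_interval.setdefault(interval, {})
    table[(video_id, country, tz)] = {
        "plays": plays, "impressions": impressions}

def read_interval(interval):
    # A batch read for model training touches only the table for the
    # interval of interest; other intervals are never scanned.
    return trends_by_interval.get(interval, {})

write_trend("2017-06-05T19", "Narcos", "US", "PST",
            plays=200, impressions=500)
```

Partitioning by interval also makes expiry cheap: an old interval's table can be truncated or dropped wholesale instead of deleting rows one by one.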
With that, I would like to introduce Roopa, who will walk us through a few more use cases for Spark + Cassandra and our learnings from them.
You just saw two main data-driven Spark + Cassandra use cases. Once Prasanna's A/B test is a success, the feature needs to be rolled out to ALL members, and the resulting huge growth in the dataset needs a bigger Cassandra cluster.
So you want to move a bulk dataset from one cluster to another. How would you go about doing it quickly?
Meet Forklift. Why is it called Forklift? Because we are moving data across clusters! Let's look into the architecture in detail now.
Meson is a Netflix-built, general-purpose workflow orchestration and scheduling framework. It manages the lifecycle of several ML pipelines that execute workloads across heterogeneous systems. This same framework was used for Forklift too.
Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.
Mesos provides task isolation and excellent abstraction of CPU, memory, storage, and other compute resources. Meson leverages these features to achieve scale and fault tolerance for its tasks.
Spark jobs submitted from Meson share the same Mesos slaves to run the tasks.
What makes this expansion different? The nodes are not being doubled, which Cassandra can do well for you; instead they are being increased by a few percent. We don't use vnodes, so the only options for adding capacity to a cluster are doubling it or creating a new cluster and populating the data. We are talking about clusters with hundreds of nodes, and doubling does not always work. Forklift comes in very handy for this type of use case.
We were very early adopters of Cassandra and started using it in production at version 0.5. So of course we were using Thrift, and all our streaming microservices built over the years were based on Thrift's schemaless design for accessing Cassandra. With the advent of CQL, there are apps that want to use the richer data model of CQL and migrate to it for better performance. Forklift does a great job with the migration, since you can map the data model and transform from source to destination in the relevant format.
This is another use case we use Forklift for: for certain big clusters, instead of replacing nodes one at a time, which would take weeks, we create a new cluster in trusty and forklift the data after dual writes are enabled.
Performance
We get good support from DataStax, and one of the options was using DSE Spark instead of running the DataStax Spark connector talking to Cassandra. But the performance of our Cassandra clusters was a concern, since these clusters are in the streaming path, serving all the members watching great streaming content, and have very strict SLAs. Running Spark alongside Cassandra would constrain the limited resources we have in AWS, which was a big concern.
Cassandra is stateful, and if we had Spark and Cassandra running together, it would not be easy to scale up the cluster when running into resource constraints. With Spark running separately, we can scale the Spark cluster up and down without affecting Cassandra.
Running separate Spark and Cassandra clusters is cost effective too, since we can use instances from the shared pool and release them when the job is complete.
This can lead to an NPE: a TTL of 0, when written, becomes a null in C*; when read, this TTL is null, and the null cannot be written back to C* as a TTL. This was fixed in 3.1; we used a workaround of translating the data when reading from the source and writing to the destination.
--------
In Thrift you could define column-level TTLs, and different columns could have different TTLs. In CQL there is a row TTL, and there is no way to define a column TTL in a single mutation. So when you are copying data from Thrift to CQL, you need to split the writes into multiple mutations by batching.
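One way to sketch that splitting step: group the source columns by TTL so each group becomes its own mutation. The data shapes and the tuple representation of a mutation are illustrative assumptions, not the actual migration code.

```python
# Sketch of migrating Thrift column-level TTLs to CQL: since a single
# CQL mutation carries one TTL, columns with different TTLs must be
# split into separate batched mutations.
from collections import defaultdict

def split_by_ttl(row_key, columns):
    """columns: list of (name, value, ttl_seconds) tuples. Returns one
    mutation per distinct TTL as a (row_key, ttl, {name: value}) tuple."""
    by_ttl = defaultdict(dict)
    for name, value, ttl in columns:
        by_ttl[ttl][name] = value
    return [(row_key, ttl, cols) for ttl, cols in sorted(by_ttl.items())]

mutations = split_by_ttl(
    "member:42", [("a", 1, 86400), ("b", 2, 86400), ("c", 3, 3600)])
```

Columns sharing a TTL still travel together in one mutation, so the extra write amplification is bounded by the number of distinct TTLs per row, not the number of columns.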
input.split.size_in_mb uses an internal system table in C* (>= 2.1.5) to determine the size of the data in C*. That table, system.size_estimates, is not meant to be absolutely accurate, so there will be some inaccuracy with smaller tables and split sizes. When you use the Spark Cassandra connector's cassandraTable() function to load data from Cassandra into Spark, it automatically creates Spark partitions aligned to the Cassandra partition key. It tries to create an appropriate number of partitions by estimating the size of the table and dividing it by the parameter spark.cassandra.input.split.size_in_mb (64 MB by default). (One instance where the default needs changing is a small source table; in that case, use withReadConf() to override the parameter.)
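A back-of-the-envelope sketch of that partition-count estimate. The connector's exact rounding and grouping logic is internal; this shows only the rough arithmetic of size divided by split size.

```python
# Rough arithmetic behind the connector's partitioning: estimated table
# size divided by input.split.size_in_mb, with at least one partition.
import math

def estimated_spark_partitions(table_size_mb, split_size_in_mb=64):
    return max(1, math.ceil(table_size_mb / split_size_in_mb))

# A 6400 MB table with the default 64 MB split yields ~100 partitions;
# a tiny table still gets at least one, which is why small source tables
# may need a smaller split size to get any parallelism.
```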
--conf spark.cassandra.connection.keep_alive_ms (default 5000; we set 900000): period of time to keep unused connections open.
--conf spark.cassandra.connection.timeout_ms (default 5000; we set 50): maximum period of time to attempt connecting to a node.
--conf spark.driver.maxResultSize (default 1g; we used 4g): limit of the total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size exceeds this limit. A high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM); setting a proper limit can protect the driver from out-of-memory errors.
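Collected into a config fragment, the connection and driver settings above might look like this. It is PySpark-style and not runnable on its own; the values are the ones discussed in this talk, not general recommendations.

```python
# Config fragment (illustrative, not runnable standalone): connection
# and driver settings as they might be applied to a SparkConf.
conn_overrides = {
    # Keep unused connections open far longer than the 5000 ms default;
    # this matters for long-running streaming jobs.
    "spark.cassandra.connection.keep_alive_ms": "900000",
    # Cap the serialized results collected to the driver (default 1g).
    "spark.driver.maxResultSize": "4g",
}
# e.g. SparkConf().setAll(conn_overrides.items())
```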
This usually means that the size of the partitions you are attempting to create are larger than the executor's heap can handle. Remember that all of the executors run in the same JVM so the size of the data is multiplied by the number of executor slots.
Either increase the heap size of the executors (spark.executor.memory) or shrink the size of the partitions by decreasing spark.cassandra.input.split.size_in_mb.
spark.cassandra.output.batch.size.bytes (default 1024): maximum total size of the batch in bytes; overridden by spark.cassandra.output.batch.size.rows.
spark.cassandra.output.batch.size.rows (default auto, meaning batch size is determined by batch.size.bytes): number of rows per single batch; 'auto' means the connector adjusts the number of rows based on the amount of data in each row.
spark.cassandra.output.concurrent.writes (default 5): maximum number of batches executed in parallel by a single Spark task.
spark.cassandra.output.throughput_mb_per_sec (default unlimited): maximum write throughput allowed per single core in MB/s.
Limit this on long (8+ hour) runs to 70% of your max throughput, as measured on a smaller job, for stability.
Spark is able to issue write requests much more quickly than Cassandra can handle them. This can lead to GC issues and a buildup of hints. In connector version 1.2 and higher, spark.cassandra.output.throughput_mb_per_sec allows you to control the amount of data written to C* per Spark core per second.
If this is the case with your application on older versions, try lowering the number of concurrent writes and the batch size using the following options: spark.cassandra.output.batch.size.rows and spark.cassandra.output.concurrent.writes.
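The write-path knobs above, collected as a config fragment; the throughput cap shown is an illustrative value for demonstration, not a recommendation.

```python
# Config fragment (illustrative): throttle the connector's write path
# when Spark outruns Cassandra. The throughput cap is an example value.
write_tuning = {
    "spark.cassandra.output.batch.size.rows": "auto",      # default
    "spark.cassandra.output.concurrent.writes": "5",       # default
    "spark.cassandra.output.throughput_mb_per_sec": "10",  # per-core cap
}
```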
I would like to leave you all with this thought: the Spark Cassandra Connector library makes it very easy to create Spark applications that need access to Cassandra! We have used it and seen it work, and you all should too. If you are excited about ML algorithms and realizing that very first dream, a member turning on Netflix and their favorite movie or TV show just starting to play, or if you are excited about the scale and challenges of providing persistence as a service, do talk to Prasanna or me, as we are always looking for great talent to join our teams!