For this upcoming meetup, Juan Valencia, Principal Engineer at ShareThis, will present ShareThis's real-world use of Apache Cassandra for high-throughput, mission-critical applications.
This meetup will cover how to set up your projects successfully by having a good data model, running Cassandra, and using the Hector Java client. We will have a Q&A session at the end of Juan's presentation, to ensure everyone's questions are answered.
Hope you can make it!
What You Will Learn at this Meetup:
• Real-World Use Case on ShareThis + Apache Cassandra
• Data Modeling with Apache Cassandra
• Using the Java Hector Client Library with Cassandra
Abstract
Juan Valencia, Principal Engineer at ShareThis, will be presenting on the use of Cassandra for high throughput applications. ShareThis has been running on Cassandra since version 0.6 and currently runs 4 Cassandra clusters, powering batch analytics, real-time analytics, a counter service, and a data lookup service.
2. ShareThis + Our Customers: Keys to Unlocking Social
1. DEPLOY SOCIAL TOOLS ACROSS BRANDS (AND DEVICES)
2. TAKE YOUR SOCIAL INVENTORY TO MARKET
3. LEVERAGE SHARETHIS: FOR DIRECT SALES, RESEARCH AND UN-RESERVED INVENTORY
3. Largest Ecosystem For Sharing and Engagement Across The Web
120 SOCIAL CHANNELS
SHARETHIS ECOSYSTEM
211 MILLION PEOPLE
(95.1% of the web)
2.4 MILLION PUBLISHERS
Source: ComScore U.S. January 2013; internal numbers, January 2013
11. Use Case: Count Service for URLs
● 1 billion pageviews per day ≈ 12k pageviews per second
● 60 million social referrals per day ≈ 700 social referrals per second
● 1 million shares per day ≈ 12 shares per second
● No expiration of data* (3 billion rows)
● Requires the lowest possible latency
● Multiple read requests per page on blogs
● Normalize and hash the URL for a row key
● Each social channel is a column
● Retrieve the whole row for counts
● Fix it by cheating ^_^ *
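The "normalize and hash the URL for a row key" step above could look roughly like the following. This is a minimal sketch, not the ShareThis implementation: the class name UrlRowKey, the normalization rules (lowercase scheme and host, drop the fragment and trailing slash, keep the query string), and the choice of MD5 are all assumptions made for illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class UrlRowKey {

    // Assumed normalization rules (not from the talk): lowercase the scheme
    // and host, drop the fragment and any trailing slash, keep the query.
    static String normalize(String url) {
        URI u = URI.create(url.trim());
        String scheme = u.getScheme() == null ? "http" : u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase(Locale.ROOT);
        String path = u.getRawPath() == null ? "" : u.getRawPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String query = u.getRawQuery() == null ? "" : "?" + u.getRawQuery();
        return scheme + "://" + host + path + query;
    }

    // Hash the normalized URL so every row key has the same fixed length.
    // MD5 is an assumption here; the talk does not name a hash function.
    static String rowKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(url).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.com/Story/#top"));
        System.out.println(rowKey("HTTP://Example.com/Story/#top"));
    }
}
```

With a key like this, two spellings of the same page land on the same row, and one read of that row returns the count for every social channel.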
13. Insights that Matter – Your Social Analytics Dashboard
Timely Social Analytics
Benchmark your social engagement with SQI
Identify popular articles
Dive deeper into your most social content
Measure social activity on an hourly, daily, weekly & monthly basis.
Uncover which social channels are driving the most social traffic
12 - x1.large
14. Use Case: Loading Processed Batch Data
● Backend Hadoop stack for processing analytics
● 58 JSON schemas map tabular data to key/value storage for slicing
● MongoDB* did not scale for frequent row-level writes on the same table
● Needed to maintain read throughput during write spikes when analytics were finished
● No TTL* - Works daily, doesn't work hourly
● Switching from Astyanax to Hector
● Using a Hector client through Java APIs
15. Use Case: Loading Processed Batch Data (continued)
{
  "schema":
  [
    {"column_name":"publisher","column_type":"UTF8Type","column_level":"common","column_master":""},
    {"column_name":"domain","column_type":"UTF8Type","column_level":"common","column_master":""},
    {"column_name":"percenta","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"percentb","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"sqi","column_type":"FloatType","column_level":"composite_slave","column_master":"category"},
    {"column_name":"month","column_type":"UTF8Type","column_level":"partition","column_master":""},
    {"column_name":"category","column_type":"UTF8Type","column_level":"composite_master","column_master":""}
  ],
  "row_key_format": "publisher:domain:month",
  "column_family_name": "sqi_table"
}
CF -> Data Type
Row -> Publisher:domain:timestamp
Columns -> master:slave = value (topics, categories, urls, timestamps, etc)
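One way to read the schema and the mapping above is sketched below. It is an interpretation rather than the loader ShareThis runs: the class SqiRowMapper and its methods are hypothetical, and it assumes a composite column name is the master field's value joined to the slave field's name with a colon (for example "sports:sqi").

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SqiRowMapper {

    // Build the row key by substituting record values into a format such as
    // "publisher:domain:month" (taken from "row_key_format" in the schema).
    static String rowKey(String rowKeyFormat, Map<String, String> record) {
        StringBuilder key = new StringBuilder();
        for (String field : rowKeyFormat.split(":")) {
            if (key.length() > 0) {
                key.append(':');
            }
            key.append(record.get(field));
        }
        return key.toString();
    }

    // Assumed layout: column name = "<master value>:<slave field name>",
    // column value = the slave field's value in the record.
    static Map<String, String> compositeColumns(String masterField,
                                                List<String> slaveFields,
                                                Map<String, String> record) {
        Map<String, String> columns = new LinkedHashMap<>();
        String master = record.get(masterField);
        for (String slave : slaveFields) {
            columns.put(master + ":" + slave, record.get(slave));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        Map<String, String> record = new HashMap<>();
        record.put("publisher", "acme");
        record.put("domain", "example.com");
        record.put("month", "2013-01");
        record.put("category", "sports");
        record.put("percenta", "0.10");
        record.put("percentb", "0.25");
        record.put("sqi", "0.82");

        System.out.println(rowKey("publisher:domain:month", record));
        System.out.println(compositeColumns("category",
                Arrays.asList("percenta", "percentb", "sqi"), record));
    }
}
```

Because the master value leads the column name, all columns for one category sit next to each other in the row and can be read with a single column slice.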
17. Insights that Matter – Your Social Analytics Dashboard
Real Time Social Analytics
Benchmark your social engagement with SQI
Identify trending articles in real-time
Dive deeper into your most social content
Measure social activity on an hourly, daily, weekly & monthly basis.
Uncover which social channels are driving the most social traffic
12 cc1.4xlarge
21. Insights that Matter – And aren't accessible
● Too many columns – unbounded URL / channel sets
● Cascading failure
● Solutions:
– Bigger boxes – meh...
– Split up the columns – split the rowkeys
  ● Hash URLs and keep stats separate
– Split up the columns – split the CF
  ● Move URLs to their own space
– Split up the columns – split the Keyspace
  ● Keyspace is a timestamp
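As an illustration of the "split the rowkeys" option, the sketch below spreads one unbounded row across a fixed number of bucketed rows. The talk does not show the actual key layout, so the class RowKeySplitter, the ":<bucket>" suffix, and the use of String.hashCode are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class RowKeySplitter {

    // Pick the bucketed row key that a given column belongs to. The bucket
    // is derived from the column name, so a write and a later read of the
    // same column always go to the same row.
    static String bucketedRowKey(String baseKey, String columnName, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        int bucket = Math.floorMod(columnName.hashCode(), buckets);
        return baseKey + ":" + bucket;
    }

    // List every bucketed row key, for reads that need the whole logical row.
    static List<String> allBucketKeys(String baseKey, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        List<String> keys = new ArrayList<>(buckets);
        for (int i = 0; i < buckets; i++) {
            keys.add(baseKey + ":" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        String base = "acme:example.com:2013-01";
        System.out.println(bucketedRowKey(base, "http://example.com/story", 16));
        System.out.println(allBucketKeys(base, 4));
    }
}
```

The trade-off is that a full read of the logical row becomes a multi-row read over every bucket, in exchange for a bound on how wide any single row can grow.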
23.
● How many rows will there be?
● How many columns per row will you need?
● How will you slice your data?
● What is the maximum number of rows?
● What is the maximum number of columns?
● Is your data relational?
● How long will your data live?
36. Conclusions
● Data Modeling is Important
● Use Cassandra for write throughput
● Keep your ring even and your data slice-able
● Wrap your libraries and switch when you need to
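"Wrap your libraries" can be sketched as a small application-facing interface with the client library hidden behind it. Everything below is hypothetical: CountStore and InMemoryCountStore are illustrative names, and no Hector or Astyanax calls are shown. A Hector-backed or Astyanax-backed class would implement the same interface, so callers would not change when the client library does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class CountStores {

    // Hypothetical interface the application codes against; it exposes no
    // client-library types.
    interface CountStore {
        void increment(String rowKey, String channel, long delta);

        Map<String, Long> readRow(String rowKey);
    }

    // Stand-in implementation for tests and local runs. A Cassandra-backed
    // implementation would wrap the chosen client library instead.
    static final class InMemoryCountStore implements CountStore {
        private final Map<String, Map<String, Long>> rows = new HashMap<>();

        @Override
        public void increment(String rowKey, String channel, long delta) {
            rows.computeIfAbsent(rowKey, k -> new HashMap<>())
                .merge(channel, delta, Long::sum);
        }

        @Override
        public Map<String, Long> readRow(String rowKey) {
            Map<String, Long> row = rows.get(rowKey);
            return row == null
                    ? Collections.<String, Long>emptyMap()
                    : Collections.unmodifiableMap(row);
        }
    }

    public static void main(String[] args) {
        CountStore store = new InMemoryCountStore();
        store.increment("row-1", "twitter", 1);
        store.increment("row-1", "twitter", 2);
        store.increment("row-1", "facebook", 5);
        System.out.println(store.readRow("row-1"));
    }
}
```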
37. We're hiring: http://www.sharethis.com/about/careers
● Work with REAL big data, billions of requests per day
● Work on products that millions of people see and interact with on a daily basis
● Work with a real-time pipeline, machine learning, complex user models
● #1 fastest growing company in San Francisco
● Free lunches
● ... and of course work with a bunch of fun, smart people and PhDs
ShareThis empowers publishers with solutions to improve and drive value from the social engagement of their site. People share content that's most relevant to them, with people who they believe will also enjoy the content. More than 2.5 million publishers increase eyeballs, engagement, and advertising revenue through the ShareThis sharing platform.