Accidental scaling issues
From a hobby project to one of the
largest online fashion communities
About Me
• Thierry Schellenbach
• Founder/CTO Fashiolista
• Github/tschellenbach
• Feedly & Django Facebook
• Blog: mellowmorning.com
• @tschellenbach
Today
• Fashiolista’s growth
• Pre Cassandra feed systems
• Github/tschellenbach/Feedly
– Cassandra learnings
– Remaining challenges
A long time ago

Rick, Joost, Thierry & Thijs
Launched Fashiolista at TNW
Got a few hundred users
And went back to work
Brazil?!
• Blogs
• Twitter
• Capricho (teen magazine with 1.8M followers)
Growth
2nd largest fashion community
• 1.5M members
• 17M loves/month
• 94M pageviews (google analytics)
5.000.000+
14.000.000+
The team
Global Fashion Discovery
Our Stack
• Django/Python
• PostgreSQL/Pgbouncer
• Cassandra
• Redis
• Solr
• Celery/RabbitMQ
• AWS/Ubuntu
• Nginx/Gunicorn/Supervisor
• Newrelic, Datadog & Sentry
Feed History
1. PostgreSQL
2. Redis – Feedly 0.1
3. Cassandra – Feedly 0.9
More details in this highscalability post:
http://bit.ly/hsfeedly
PostgreSQL - Pull
1. Smooth till we reached ~100M activities
2. Spikes in performance due to the query planner
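
A minimal sketch of the pull approach, assuming a simple follow table and activity table. It uses Python's built-in sqlite3 in place of PostgreSQL so it runs anywhere, and the table and column names are hypothetical rather than Fashiolista's actual schema. The feed is computed at read time by joining follows against activities.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny follow graph and a few loves."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE follow (user_id INTEGER, target_id INTEGER);
        CREATE TABLE activity (
            id INTEGER PRIMARY KEY, actor_id INTEGER, object_id INTEGER
        );
        """
    )
    # User 1 follows users 2 and 3.
    conn.executemany("INSERT INTO follow VALUES (?, ?)", [(1, 2), (1, 3)])
    # Four loves; the one by user 4 should not show up in user 1's feed.
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?)",
        [(1, 2, 100), (2, 4, 101), (3, 3, 102), (4, 2, 103)],
    )
    return conn


def pull_feed(conn, user_id, limit=20):
    """Build the feed at read time: newest activities by followed users."""
    rows = conn.execute(
        """
        SELECT a.id, a.actor_id, a.object_id
        FROM activity a
        JOIN follow f ON f.target_id = a.actor_id
        WHERE f.user_id = ?
        ORDER BY a.id DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()
```

Nothing is stored per reader, so writes are cheap, but every feed view pays for the join and sort.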
Redis - Push
1. Fast, Easy to setup and maintain
2. Becomes expensive really quickly

115K Followers
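
A minimal sketch of the push approach, assuming one capped list per follower. A plain in-memory dict stands in for Redis here so the example runs without a server; the follow graph, user names other than kayture, and the cap are hypothetical. With Redis, the same idea is usually expressed as an LPUSH followed by an LTRIM on each follower's list.

```python
from collections import defaultdict, deque

MAX_FEED_LENGTH = 3  # hypothetical cap, kept tiny for the demo

# Hypothetical follow graph: actor -> list of followers.
followers = {"kayture": ["anna", "bob"]}

# One bounded list per user; the oldest entries fall off the right end.
feeds = defaultdict(lambda: deque(maxlen=MAX_FEED_LENGTH))


def fanout_love(actor, activity_id):
    """Push the activity onto the feed of every follower (fan-out on write)."""
    for follower in followers.get(actor, []):
        feeds[follower].appendleft(activity_id)


def read_feed(user):
    """Reading is a plain list lookup, newest first."""
    return list(feeds[user])
```

Reads are cheap, but each love costs one write per follower, so a single love by a user with 115K followers triggers 115K writes, and every copy lives in memory.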
Cassandra - Feedly 0.9
1. Few moving components
2. Supported by Datastax
3. Instagram
4. Easy to add capacity
5. Cost effective
We open sourced Feedly!
• Github/tschellenbach/Feedly
• Python library, which allows you to build
newsfeed and notification systems using
Cassandra and/or Redis
Feedly – What can you build?
Newsfeeds

Notification systems
Cassandra Challenges
1. Which Python library to choose?
• Pycassa
• CQLEngine (using the old CQL module)
• Python-Driver (beta)
• Fork CQLEngine to support Python-Driver
– Github/tbarbugli/cqlengine
Cassandra Challenges
2. Importing data
(300M loves * 1000 followers = 300 billion activities)

• High CPU load
• Nodes going down
• Start with many nodes, scale down afterwards
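
The size of the import follows directly from the numbers on the slide. The sketch below reproduces that arithmetic and adds a rough duration estimate; the write throughput passed to it is a purely hypothetical figure, not a measurement from Fashiolista's cluster.

```python
LOVES = 300_000_000          # from the slide
FOLLOWERS_PER_LOVE = 1_000   # from the slide

# Every love is copied into the feed of every follower.
ACTIVITIES = LOVES * FOLLOWERS_PER_LOVE

SECONDS_PER_DAY = 86_400


def import_days(total_rows, rows_per_second):
    """Days needed to write total_rows at a sustained cluster-wide rate."""
    return total_rows / rows_per_second / SECONDS_PER_DAY
```

At an assumed 100,000 writes per second across the cluster, `import_days(ACTIVITIES, 100_000)` comes out at roughly 35 days, which is why extra nodes during the import are attractive.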
Cassandra Challenges
3. Optimizing import speed
(300M loves * 1000 followers = 300 billion activities)

• Python-Driver
• Batch queries
• Non-Atomic (unlogged) batch queries
• Prepared statements
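
The three techniques above can be combined roughly as follows. This is a sketch under stated assumptions, not Feedly's actual import code: it uses the `BatchStatement`, `BatchType.UNLOGGED` and `session.prepare` calls from the DataStax Python driver as they exist in current releases, the `import_rows` and `chunked` helpers are hypothetical names, and the batch size of 100 is an arbitrary starting point.

```python
from itertools import islice

# Column names taken from the timeline_flat table in this deck.
INSERT_CQL = (
    "INSERT INTO timeline_flat "
    "(feed_id, activity_id, actor, verb, object, time) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def import_rows(session, rows, batch_size=100):
    """Write rows (6-tuples matching INSERT_CQL) in unlogged batches."""
    # Imported here so the chunking helper works without the driver installed.
    from cassandra.query import BatchStatement, BatchType

    insert = session.prepare(INSERT_CQL)  # parsed by the server once
    for chunk in chunked(rows, batch_size):
        # UNLOGGED skips the batch log, so the batch is not atomic.
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for row in chunk:
            batch.add(insert, row)
        session.execute(batch)
```

A session would typically come from `Cluster([...]).connect("fashiolista_feedly")` in `cassandra.cluster`; the contact points depend on your own setup.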
Cassandra Challenges
4. Data model denormalization
CREATE TABLE fashiolista_feedly.timeline_flat (
  feed_id ascii,
  activity_id varint,
  actor int,
  extra_context blob,
  object int,
  target int,
  time timestamp,
  verb int,
  PRIMARY KEY (feed_id, activity_id)
) WITH CLUSTERING ORDER BY (activity_id ASC)
  AND bloom_filter_fp_chance=0.010000
  AND caching='KEYS_ONLY'
  AND dclocal_read_repair_chance=0.000000
  AND gc_grace_seconds=864000
  AND read_repair_chance=0.100000
  AND replicate_on_write='true'
  AND populate_io_cache_on_flush='false'
  AND compaction={'class': 'SizeTieredCompactionStrategy'}
  AND compression={'sstable_compression': 'LZ4Compressor'};
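
To illustrate what denormalization means for this table, here is a small sketch that turns one activity into one full row per follower feed. The `denormalize` helper and the `user:<id>` format for `feed_id` are assumptions made for the example; the column names come from the table definition above.

```python
from datetime import datetime, timezone


def denormalize(activity, follower_ids):
    """Return one complete timeline_flat row (as a dict) per follower feed."""
    return [
        {"feed_id": f"user:{follower_id}", **activity}
        for follower_id in follower_ids
    ]


# Hypothetical activity: user 7 (actor) loves (verb 1) item 555 (object).
example_activity = {
    "activity_id": 1001,
    "actor": 7,
    "verb": 1,
    "object": 555,
    "target": None,
    "time": datetime(2013, 10, 1, tzinfo=timezone.utc),
}
```

Because every row carries the whole activity, reading a feed only touches the single partition identified by `feed_id`, at the price of storing the same activity once per follower.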
Opscenter is great
Opscenter & Datastax AMI are great
For startups Enterprise is also Free
Evaluation
7 instances, m1.xlarge, 2.59 TB
Cassandra 2.0.0, CQL3, Python-driver
(Would have been one expensive Redis cluster)
Current challenges
Average load times are good, but 99th percentile
sometimes spikes
Current Challenges
How do we limit the storage for feeds?
Trimming?

(Not supported)
DELETE from timeline_flat WHERE activity_id < 5000

Use a TTL on the rows?
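
As a sketch of the TTL option: CQL accepts a `USING TTL <seconds>` clause on INSERT, so rows can expire on their own. The helper name and the 30-day lifetime below are hypothetical. Note that a TTL bounds a feed by age, not by number of items, so it is not a drop-in replacement for trimming.

```python
from datetime import timedelta


def ttl_insert_cql(table, columns, ttl):
    """Build a parameterized INSERT statement whose rows expire after `ttl`."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    seconds = int(ttl.total_seconds())
    return (
        f"INSERT INTO {table} ({column_list}) "
        f"VALUES ({placeholders}) USING TTL {seconds}"
    )
```

For example, `ttl_insert_cql("timeline_flat", ["feed_id", "activity_id"], timedelta(days=30))` yields a statement ending in `USING TTL 2592000`.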
Fork Feedly
This is our first time using Cassandra, so let us
know how we can further speed up our
implementation:

http://bit.ly/feedlycassandra
Check out Feedly at
Github.com/tschellenbach/Feedly
Ask questions, Give tips to these guys:

Thierry Schellenbach

Tommaso Barbugli

Guyon Morée

Feedly & Cassandra at Fashiolista

Editor's Notes

  1. Follow me on Twitter and Github
  2. Today I’ll give a quick introduction to Fashiolista and our growth over the past years. Afterwards I’ll explain how our feed systems worked prior to Cassandra. Most importantly, we’ve open sourced all the code which we’ll be discussing during this talk. I’ll start by explaining some of our Cassandra learnings. There are many people in this room with Cassandra expertise, so we definitely encourage you to have a look on Github. It’s quite possible you’ll find something which can be improved.
  3. Fashiolista started out as a hobby project. We were 4 guys working on product comparison. We were doing OK, but growth wasn’t spectacular. We noticed the rapidly growing fashion segment and tried to incorporate it. The first iteration on YouTellMe was a massive fail. Fortunately a few girls from the Amsterdam fashion institute helped us cover up our lack of fashion sense. We started with an empty sheet and designed a product around inspiration instead of search. At this point Fashiolista was just a hobby project, which we spent a few weeks on before launching it at TNW.
  4. So we launched with a bang at TNW. We organized a mini fashion show on stage and clearly stood out from the other startups. But at this point Fashiolista was just a side project. We got a few hundred users and went back to work on our product comparison site.
  5. The next week though, my co-founder Thijs called while I was shopping at the AH. All the graphs looked off, and the growth over the past days had completely disappeared. All that remained on the graphs was a spike showing the current day. It turned out that several Brazilian blogs and the teen magazine Capricho had posted about Fashiolista. Within a few hours tens of thousands of users signed up for Fashiolista.
  6. Over the past 2 years things have moved along rapidly. Currently we’re the second largest fashion community worldwide, with close to 1.5M members and massive monthly engagement.
  7. And the team has also grown considerably
  8. Users of Fashiolista install the so called “love button”. While browsing around the web they can use this button to add their favourite fashion finds to Fashiolista.
  9. Once they click the button, we figure out the relevant image on the page and allow you to add it to your profile.
  10. The find is added to your profile and other people can follow the items you love.
  11. So a quick interlude about what we run. We’re a pretty standard Python/Django stack, similar to sites like Instagram and Pinterest.
  12. This talk will focus on this page, the feed page. It shows the content by people you follow. When scaling a social site this is quite a tough problem to solve, since there is no easy way to shard the data.
  13. Our feed setup went through 3 generations. We started out with PostgreSQL, moved to Redis and eventually settled on Cassandra. The topic of scaling feed systems is something which we can talk about for days. Today I won’t go into much detail, but definitely have a look at my post on Highscalability if you are building something similar.
  14. Our first setup with PostgreSQL was really easy. It took 5 minutes to develop and kept on running smoothly till we had about 100M activities in the database.
  15. We were using Redis for our caching needs. Building a push based feed system with Redis was really easy. It took only a few weeks to develop. It was fast, easy to set up and maintain. The push approach works by storing a small list for every user. When kayture loves something, this love is stored on the feed of all the people who follow her. The Redis approach worked really well, but storing everything in memory can become expensive really quickly.
  16. We evaluated several options for replacing our Redis based approach. We looked at Cassandra, HBase and DynamoDB. We chose Cassandra because it has fewer moving parts, is supported by Datastax and is used by at least one other large startup for their feed system. In addition it’s trivial to add more capacity and the storage is very cost effective.
  17. We’ve open sourced Feedly, which you can find on Github. This is great, because solving the scalability of your feed system is a lot of work, and it’s better to share this across multiple companies.
  18. You can build newsfeed systems. Examples are your Facebook news feed, Twitter stream, Pinterest content etc. Alternatively you can also build notification systems, which are basically a simpler version of the newsfeed problem.
  19. Which language are you guys using? Java? Python? Ruby? Node? PHP? Pycassa is reliable and almost all examples still use it, but it uses the old Thrift API and doesn’t support CQL, so it’s not very future compatible. CQLEngine is an ORM for writing CQL. It’s a great piece of code, but it relies on the old CQL adapter module. Python-Driver is where all the development effort of the Datastax guys is. They say it’s not ready for production, but it’s already a really good beta: it uses the native binary protocol; the client is smart, saving you a few roundtrips; you can use prepared statements; and you can run your queries async. We forked CQLEngine and added support for Python-Driver, have a look at Github: https://github.com/tbarbugli/cqlengine
  20. Another thing we didn’t expect was the high CPU load Cassandra generates when importing data. When we tried the import with only a few nodes, they would often go down. The solution was to run a huge number of nodes during import and subsequently scale back down.
  21. When importing the 300M loves we used 4 techniques to import as fast as possible. First of all, we’re using Python-Driver, which has excellent performance. Secondly, we used batch queries. Batch queries on their own can actually be slower than regular queries, due to their atomic-by-default behaviour, so to further improve speeds you want to use UNLOGGED batch queries. Lastly, we used prepared statements to remove a bit of query parsing overhead.
  22. We use a completely denormalized approach. We evaluated a more normalized approach, but the performance is worse as you’ll often hit many nodes, and it doesn’t fit as naturally with Cassandra as there are no transactions.
  23. https://github.com/tbarbugli/cqlengine