SlideShare une entreprise Scribd logo
1  sur  55
Télécharger pour lire hors ligne
Wikimedia Content API:
A Cassandra Use-case
Eric Evans <eevans@wikimedia.org>
@jericevans
Apache Bigdata | May 10, 2016
Our Vision:
A world in which every single human can freely
share in the sum of all knowledge.
About:
● Global movement
● Largest collection of free, collaborative knowledge in human history
● 16 projects
● 16.5 billion total page views per month
● 58.9 million unique devices per day
● More than 13k new editors each month
● More than 75k active editors month-to-month
About: Wikipedia
● More than 38 million articles in 290 languages
● Over 10k new articles added per day
● 13 million edits per month
● Ranked #6 globally in web traffic
Wikimedia Architecture
LAMP
THE ARCHITECTURE
ALL OF IT
Wikitext
= Star Wars: The Force Awakens =
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by [[J. J. Abrams]].
HTML
<h1>
Star Wars: The Force Awakens
</h1>
<p>
Star Wars: The Force Awakens is a 2015 American epic space opera
film directed, co-produced, and co-written by
<a href="/wiki/J._J._Abrams" title="J. J. Abrams">
J. J. Abrams
</a>
</p>
Wikitext
HTML
Conversion
wikitext html
WYSIWYG
Conversion
wikitext html
Character-based diffs
Metadata
[[Foo|bar]]
<a rel="mw:WikiLink" href="./Foo">bar</a>
Metadata
[[Foo|{{echo|bar}}]]
<a rel="mw:WikiLink" href="./Foo">
<span about="#mwt1" typeof="mw:Object/Template"
data-parsoid="{...}" >bar</span>
</a>
Parsoid
● Node.js service
● Converts wikitext to HTML/RDFa
● Converts HTML/RDFa to wikitext
● Semantics, and syntax (avoid dirty diffs)!
● Expensive (slow)
● Resulting output is large
RESTBase
● Services aggregator / proxy (REST)
● Durable cache (Cassandra)
● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
Cassandra
RESTBase
RESTBase
Parsoid ... ...
Other use-cases
● Mobile content service
● Math formula rendering service
● Dumps
● ...
Cassandra
Environment
● 2 datacenters
● 3 racks per datacenter
● 18 hosts (16 core, 128G, SSDs)
● 36 nodes
● Deflate compression (~14-18%)
● 31T storage (~206T uncompressed)
● Cassandra 2.1.13 (moving to 2.2.6)
● Read-heavy workload (5:1)
Data model
Data model
CREATE TABLE data (
domain text,
title text,
rev int,
tid timeuuid,
value blob,
PRIMARY KEY ((domain, title), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
Data model
en.wikipedia.org + Star_Wars:_The_Force_Awakens
717862573 717873822
...97466b12...7c7a913d3d8a1f2dd66c...7c7a913d3d8a
...
09877568...7c7a913d3d8a
bdebc9a6...7c7a913d3d8a827e2ec2...7c7a913d3d8a
Compression
Compression
chunk_length_kb
Compression
chunk_length_kb
Brotli compression
● Brought to you by the folks at Google; Successor to deflate
● Cassandra implementation (https://github.com/eevans/cassandra-brotli)
● Initial results very promising
● Better compression, lower cost (apples-apples)
● And, wider windows are possible (apples-oranges)
○ GC/memory permitting
○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size!
○ https://phabricator.wikimedia.org/T122028
● Stay tuned!
Compaction
Compaction
● The cost of having log-structured storage
● Asynchronously (post-write) optimize data on disk for reads
● At a minimum, reorganize into fewer files
○ Dropping what is obsolete
○ Expiring TTLs
○ Removing deleted (aka tombstoned) data (after a fashion)
● Reorganize data so results are nearer each other
Compaction strategies
● Size-tiered
○ Combines tables of similar size
○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes
○ Needs working space for largest possible compaction (100%)
○ Minimal IO
● Leveled
○ Small, fixed size files in levels of exponentially increasing size
○ Files have non-overlapping ranges within a level
○ Amortized; Lots of continuous compaction
○ Very efficient reads, but also quite IO intensive
● Date-tiered
Date-tiered compaction
● Newest of the bunch; Introduced in 2.1
● For append-only workloads (no overwrites, no deletes)
● Where data is ordered sort-ascending, (and not written out-of-order)
● Windows of time, arranged in tiers
● Avoids mixing “old” data with “new”
● Cold data eventually ceases to be compacted
1
1 2
2 31
2-51
2 3 41
1 2
min_threshold (4)
base_time_seconds
3-6
Date-tiered compaction
● Newest of the bunch; Introduced in 2.1
● For append-only workloads (no overwrites, no deletes)
● Where data is ordered sort-ascending, (and not written out-of-order)
● Windows of time, arranged in tiers
● Avoids mixing “old” data with “new”
● Cold data eventually ceases to be compacted
● Hard to reason about
● Optimizations are easily defeated
● https://phabricator.wikimedia.org/T126221
DTCS: So now what?
● Size-tiered compaction? Might as well.
● TimeWindowCompactionStrategy (https://github.com/jeffjirsa/twcs)?
Maybe...
● Reduce node density?
Garbage Collection
GC
● Early adopters of G1 (Garbage 1st)
● Successor to Concurrent Mark-sweep (CMS)
● Incremental parallel compacting collector
● More predictable than CMS
● Configurable pause-time target
Concurrent Mark-sweep
Heap
Old generation
Young generation
Eden Survivor
G1
Survivor Old-Gen Eden Eden
Survivor
Old-Gen Old-Gen
Eden Survivor
Old-Gen
Humongous objects
● Anything >= ½ region size is classified as Humongous
● Humongous objects are allocated into Humongous Regions
● Only one object for a region (wastes space, creates fragmentation)
● Until 1.8u40, humongous regions collected only during full collections (Bad)
● Since 1.8u40, end of the marking cycle, during the cleanup phase (Better)
● Treated as exceptions, so should be exceptional
○ For us, that means 8MB regions
● Enable GC logging and have a look!
Multi-instance
“Many smaller-sized Cassandra nodes is
always better than fewer, dense ones.”
— Everyone
Motivation
● Compaction
● GC
What we do
● Processes (yup)
● Puppetized configuration
○ /etc/cassandra-a/
○ /etc/cassandra-b/
○ systemd units
○ Etc
● Shared RAID-0
What we should have done
● Virtualization
● Containers
● Blades
The Good
● Fault-tolerance
● Availability
● Datacenter / rack awareness
● Nice, helpful people (tickets, IRC, etc)
The Bad
● Usability
○ Compaction
○ Streaming
○ JMX
● Vertical scaling
● JVM
The Ugly
● Release process
● Upgrades
Wikimedia Content API: A Cassandra Use-case

Contenu connexe

Tendances

Tendances (20)

Cassandra meetup slides - Oct 15 Santa Monica Coloft
Cassandra meetup slides - Oct 15 Santa Monica ColoftCassandra meetup slides - Oct 15 Santa Monica Coloft
Cassandra meetup slides - Oct 15 Santa Monica Coloft
 
Google Drive vs. Dropbox
Google Drive vs. DropboxGoogle Drive vs. Dropbox
Google Drive vs. Dropbox
 
SqliteToRealm
SqliteToRealmSqliteToRealm
SqliteToRealm
 
Object Storage in a Cloud-Native Container Envirnoment
Object Storage in a Cloud-Native Container EnvirnomentObject Storage in a Cloud-Native Container Envirnoment
Object Storage in a Cloud-Native Container Envirnoment
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan Rossi
 
My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...
My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...
My Learnings on Setting up a Kubernetes Cluster on AWS using Kubernetes Opera...
 
Unleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platformUnleashing k8 s to reduce complexities of an entire middleware platform
Unleashing k8 s to reduce complexities of an entire middleware platform
 
Game DDOS Prevention
Game DDOS PreventionGame DDOS Prevention
Game DDOS Prevention
 
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax AstraApache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
Apache Cassandra Lunch #67: Moving Data from Cassandra to Datastax Astra
 
Introduction to Bizur
Introduction to BizurIntroduction to Bizur
Introduction to Bizur
 
Cassandra Lunch #59 Functions in Cassandra
Cassandra Lunch #59  Functions in CassandraCassandra Lunch #59  Functions in Cassandra
Cassandra Lunch #59 Functions in Cassandra
 
PTD and beyond
PTD and beyondPTD and beyond
PTD and beyond
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 
Corwin on containers
Corwin on containersCorwin on containers
Corwin on containers
 
Marble talk at akademy 2008
Marble talk  at akademy 2008Marble talk  at akademy 2008
Marble talk at akademy 2008
 
Memcached
MemcachedMemcached
Memcached
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Cloud storage: the right way OSS EU 2018
Cloud storage: the right way OSS EU 2018Cloud storage: the right way OSS EU 2018
Cloud storage: the right way OSS EU 2018
 
Scaling Islandora
Scaling IslandoraScaling Islandora
Scaling Islandora
 
The Rise of Cloud Computing Systems
The Rise of Cloud Computing SystemsThe Rise of Cloud Computing Systems
The Rise of Cloud Computing Systems
 

En vedette

Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Eric Evans
 
Cassandra by Example: Data Modelling with CQL3
Cassandra by Example:  Data Modelling with CQL3Cassandra by Example:  Data Modelling with CQL3
Cassandra by Example: Data Modelling with CQL3
Eric Evans
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
Eric Evans
 
Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)
Eric Evans
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
Eric Evans
 

En vedette (20)

Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-case
 
Castle enhanced Cassandra
Castle enhanced CassandraCastle enhanced Cassandra
Castle enhanced Cassandra
 
Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)
 
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
 
Webinaire Business&Decision - Trifacta
Webinaire  Business&Decision - TrifactaWebinaire  Business&Decision - Trifacta
Webinaire Business&Decision - Trifacta
 
Webinar Degetel DataStax
Webinar Degetel DataStaxWebinar Degetel DataStax
Webinar Degetel DataStax
 
DataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTDataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoT
 
DataStax Enterprise BBL
DataStax Enterprise BBLDataStax Enterprise BBL
DataStax Enterprise BBL
 
CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)
 
Cassandra by Example: Data Modelling with CQL3
Cassandra by Example:  Data Modelling with CQL3Cassandra by Example:  Data Modelling with CQL3
Cassandra by Example: Data Modelling with CQL3
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
CQL: SQL In Cassandra
CQL: SQL In CassandraCQL: SQL In Cassandra
CQL: SQL In Cassandra
 
It's not you, it's me: Ending a 15 year relationship with RRD
It's not you, it's me: Ending a 15 year relationship with RRDIt's not you, it's me: Ending a 15 year relationship with RRD
It's not you, it's me: Ending a 15 year relationship with RRD
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
 
Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
 
Time series storage in Cassandra
Time series storage in CassandraTime series storage in Cassandra
Time series storage in Cassandra
 
Cassandra 2.2 & 3.0
Cassandra 2.2 & 3.0Cassandra 2.2 & 3.0
Cassandra 2.2 & 3.0
 

Similaire à Wikimedia Content API: A Cassandra Use-case

Similaire à Wikimedia Content API: A Cassandra Use-case (20)

Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)Next Generation Indexes For Big Data Engineering (ODSC East 2018)
Next Generation Indexes For Big Data Engineering (ODSC East 2018)
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
OpenEBS hangout #4
OpenEBS hangout #4OpenEBS hangout #4
OpenEBS hangout #4
 
Softshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with CouchbaseSoftshake 2013: Introduction to NoSQL with Couchbase
Softshake 2013: Introduction to NoSQL with Couchbase
 
It's always sunny with OpenJ9
It's always sunny with OpenJ9It's always sunny with OpenJ9
It's always sunny with OpenJ9
 
6 Months Sailing with Docker in Production
6 Months Sailing with Docker in Production 6 Months Sailing with Docker in Production
6 Months Sailing with Docker in Production
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
#VirtualDesignMaster 3 Challenge 3 - Dennis George
#VirtualDesignMaster 3 Challenge 3 - Dennis George#VirtualDesignMaster 3 Challenge 3 - Dennis George
#VirtualDesignMaster 3 Challenge 3 - Dennis George
 
Using Time Window Compaction Strategy For Time Series Workloads
Using Time Window Compaction Strategy For Time Series WorkloadsUsing Time Window Compaction Strategy For Time Series Workloads
Using Time Window Compaction Strategy For Time Series Workloads
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Re-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series DatabaseRe-Engineering PostgreSQL as a Time-Series Database
Re-Engineering PostgreSQL as a Time-Series Database
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 
The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2The 2nd half. Scaling to the next^2
The 2nd half. Scaling to the next^2
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
No SQL Technologies
No SQL TechnologiesNo SQL Technologies
No SQL Technologies
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 

Plus de Eric Evans (9)

Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQL
 
NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Outside The Box With Apache Cassnadra
Outside The Box With Apache CassnadraOutside The Box With Apache Cassnadra
Outside The Box With Apache Cassnadra
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
 
An Introduction To Cassandra
An Introduction To CassandraAn Introduction To Cassandra
An Introduction To Cassandra
 
Cassandra In A Nutshell
Cassandra In A NutshellCassandra In A Nutshell
Cassandra In A Nutshell
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Dernier (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Wikimedia Content API: A Cassandra Use-case

  • 1. Wikimedia Content API: A Cassandra Use-case Eric Evans <eevans@wikimedia.org> @jericevans Apache Bigdata | May 10, 2016
  • 2.
  • 3. Our Vision: A world in which every single human can freely share in the sum of all knowledge.
  • 4. About: ● Global movement ● Largest collection of free, collaborative knowledge in human history ● 16 projects ● 16.5 billion total page views per month ● 58.9 million unique devices per day ● More than 13k new editors each month ● More than 75k active editors month-to-month
  • 5. About: Wikipedia ● More than 38 million articles in 290 languages ● Over 10k new articles added per day ● 13 million edits per month ● Ranked #6 globally in web traffic
  • 9.
  • 10.
  • 11. Wikitext = Star Wars: The Force Awakens = Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by [[J. J. Abrams]].
  • 12. HTML <h1> Star Wars: The Force Awakens </h1> <p> Star Wars: The Force Awakens is a 2015 American epic space opera film directed, co-produced, and co-written by <a href="/wiki/J._J._Abrams" title="J. J. Abrams"> J. J. Abrams </a> </p>
  • 14. HTML
  • 20. Metadata [[Foo|{{echo|bar}}]] <a rel="mw:WikiLink" href="./Foo"> <span about="#mwt1" typeof="mw:Object/Template" data-parsoid="{...}" >bar</span> </a>
  • 21. Parsoid ● Node.js service ● Converts wikitext to HTML/RDFa ● Converts HTML/RDFa to wikitext ● Semantics, and syntax (avoid dirty diffs)! ● Expensive (slow) ● Resulting output is large
  • 22. RESTBase ● Services aggregator / proxy (REST) ● Durable cache (Cassandra) ● Wikimedia’s content API (e.g. https://en.wikipedia.org/api/rest_v1?doc)
  • 24. Other use-cases ● Mobile content service ● Math formula rendering service ● Dumps ● ...
  • 26. Environment ● 2 datacenters ● 3 racks per datacenter ● 18 hosts (16 core, 128G, SSDs) ● 36 nodes ● Deflate compression (~14-18%) ● 31T storage (~206T uncompressed) ● Cassandra 2.1.13 (moving to 2.2.6) ● Read-heavy workload (5:1)
  • 28. Data model CREATE TABLE data ( domain text, title text, rev int, tid timeuuid, value blob, PRIMARY KEY ((domain, title), rev, tid) ) WITH CLUSTERING ORDER BY (rev DESC, tid DESC)
  • 29. Data model en.wikipedia.org + Star_Wars:_The_Force_Awakens 717862573 717873822 ...97466b12...7c7a913d3d8a1f2dd66c...7c7a913d3d8a ... 09877568...7c7a913d3d8a bdebc9a6...7c7a913d3d8a827e2ec2...7c7a913d3d8a
  • 33. Brotli compression ● Brought to you by the folks at Google; Successor to deflate ● Cassandra implementation (https://github.com/eevans/cassandra-brotli) ● Initial results very promising ● Better compression, lower cost (apples-apples) ● And, wider windows are possible (apples-oranges) ○ GC/memory permitting ○ Example: level=1, lgblock=4096, chunk_length_kb=4096, yields 1.73% compressed size! ○ https://phabricator.wikimedia.org/T122028 ● Stay tuned!
  • 35. Compaction ● The cost of having log-structured storage ● Asynchronously (post-write) optimize data on disk for reads ● At a minimum, reorganize into fewer files ○ Dropping what is obsolete ○ Expiring TTLs ○ Removing deleted (aka tombstoned) data (after a fashion) ● Reorganize data so results are nearer each other
  • 36. Compaction strategies ● Size-tiered ○ Combines tables of similar size ○ Oblivious to column distribution; Works best for workloads with no overwrites/deletes ○ Needs working space for largest possible compaction (100%) ○ Minimal IO ● Leveled ○ Small, fixed size files in levels of exponentially increasing size ○ Files have non-overlapping ranges within a level ○ Amortized; Lots of continuous compaction ○ Very efficient reads, but also quite IO intensive ● Date-tiered
  • 37. Date-tiered compaction ● Newest of the bunch; Introduced in 2.1 ● For append-only workloads (no overwrites, no deletes) ● Where data is ordered sort-ascending, (and not written out-of-order) ● Windows of time, arranged in tiers ● Avoids mixing “old” data with “new” ● Cold data eventually ceases to be compacted
  • 38. 1 1 2 2 31 2-51 2 3 41 1 2 min_threshold (4) base_time_seconds 3-6
  • 39. Date-tiered compaction ● Newest of the bunch; Introduced in 2.1 ● For append-only workloads (no overwrites, no deletes) ● Where data is ordered sort-ascending, (and not written out-of-order) ● Windows of time, arranged in tiers ● Avoids mixing “old” data with “new” ● Cold data eventually ceases to be compacted ● Hard to reason about ● Optimizations are easily defeated ● https://phabricator.wikimedia.org/T126221
  • 40.
  • 41. DTCS: So now what? ● Size-tiered compaction? Might as well. ● TimeWindowCompactionStrategy (https://github.com/jeffjirsa/twcs)? Maybe... ● Reduce node density?
  • 43. GC ● Early adopters of G1 (Garbage 1st) ● Successor to Concurrent Mark-sweep (CMS) ● Incremental parallel compacting collector ● More predictable than CMS ● Configurable pause-time target
  • 45. G1 Survivor Old-Gen Eden Eden Survivor Old-Gen Old-Gen Eden Survivor Old-Gen
  • 46. Humongous objects ● Anything >= ½ region size is classified as Humongous ● Humongous objects are allocated into Humongous Regions ● Only one object for a region (wastes space, creates fragmentation) ● Until 1.8u40, humongous regions collected only during full collections (Bad) ● Since 1.8u40, end of the marking cycle, during the cleanup phase (Better) ● Treated as exceptions, so should be exceptional ○ For us, that means 8MB regions ● Enable GC logging and have a look!
  • 48. “Many smaller-sized Cassandra nodes is always better than fewer, dense ones.” — Everyone
  • 50. What we do ● Processes (yup) ● Puppetized configuration ○ /etc/cassandra-a/ ○ /etc/cassandra-b/ ○ systemd units ○ Etc ● Shared RAID-0
  • 51. What we should have done ● Virtualization ● Containers ● Blades
  • 52. The Good ● Fault-tolerance ● Availability ● Datacenter / rack awareness ● Nice, helpful people (tickets, IRC, etc)
  • 53. The Bad ● Usability ○ Compaction ○ Streaming ○ JMX ● Vertical scaling ● JVM
  • 54. The Ugly ● Release process ● Upgrades