Valhalla at Pantheon
A Distributed File System Built on Cassandra, Twisted Python, and FUSE
Pantheon's Requirements
● Density
  ○ Over 50K volumes in a single cluster
  ○ Over 1000 clients on a single application server
● Storage volume
  ○ Over 10TB in a single cluster
  ○ De-duplication of redundant data
● Throughput
  ○ Peaks during the U.S. business day and during site
    imports and backups
● Performance
  ○ Back-end for Drupal web applications; access
    has to be fast enough to not burden a web request
  ○ The applications won't be adapted from running on
    local disk to running on Valhalla
Why not off-the-shelf?
● NFS
  ○ UID mapping requires trusted clients and networks
  ○ Standard Kerberos implementations have no HA
  ○ No cloud HA for client/server communication
● GlusterFS
  ○ Cannot scale volume density (though HekaFS can)
  ○ Cannot de-duplicate data
● Ceph
  ○ Security model relies on trusted clients
● MooseFS
  ○ Only primitive security
Valhalla's Design Manifesto
● Drupal applications read and write whole
  files between 10KB and 10MB
   ○ And most reads hit the edge proxy cache
● Drupal tracks files in its database and has
  little need for fstat() or directory listings
● POSIX compliance for locking and
  permissions is unimportant
   ○ But volume-level access control is critical
● Volumes may contain up to 1 million files
● Availability and performance trump
  consistency
Valhalla 1.0
● volumes
  ○ vol1
    ■ /d1/
    ■ /d1/f1.txt → ade12...
    ■ /d1/d3/
    ■ /d1/d3/f2.txt → c12bea...
    ■ ...
  ○ vol2
    ■ /dir1/
    ■ /dir1/file.txt → ade12...
    ■ /dir1/f2.txt → c12bea...
    ■ ...
  ○ vol3
    ■ /dir3/
    ■ /dir3/f3.txt → 13a8cd...
    ■ /dir3/f2.txt → c12bea...
    ■ ...
  ○ ...
● content_by_file
  ○ ade12...
    ■ content → binary
  ○ c12bea...
    ■ content → binary
  ○ 13a8cd...
    ■ content → binary
  ○ ...
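
The 1.0 layout above can be read as content-addressed storage: each volume row maps paths to content hashes, and each unique piece of content is stored once under its hash. Below is a minimal Python sketch of that idea, with in-memory dictionaries standing in for the two Cassandra column families. The function names and the use of SHA-256 are assumptions made for illustration, not Valhalla's actual code.

```python
import hashlib

volumes = {}          # volume name -> {path: content hash}
content_by_file = {}  # content hash -> file bytes


def write_file(volume, path, data):
    """Store data once per unique hash and point the path at it."""
    digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    content_by_file.setdefault(digest, data)
    volumes.setdefault(volume, {})[path] = digest
    return digest


def read_file(volume, path):
    """Resolve the path to a hash, then fetch the content by hash."""
    return content_by_file[volumes[volume][path]]


def clone_volume(source, target):
    """Cloning copies only the path-to-hash map, never the content."""
    volumes[target] = dict(volumes[source])
```

Under this reading, de-duplication falls out of the keying scheme, and a volume clone touches only one row, which would explain why cloning was efficient.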
Valhalla 1.0 Retrospective
● What worked
  ○ Efficient volume cloning
● What didn't
  ○ Slow computation of directory content when a
    directory is small but contains a large subdirectory
    ■ Fix: Depth prefix for entries
  ○ Slow computation of file size
    ■ Fix: Denormalize metadata into directory entries
  ○ Problems replicating large files
    ■ Fix: Split files into chunks
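
The depth-prefix fix can be illustrated with a short Python sketch. Without a prefix, a range scan over "/d1/" walks every descendant, including the contents of a large subdirectory. With the depth in front, the direct children of a directory sort together. The exact depth convention is not spelled out in the slides, so this sketch assumes depth is the number of slashes in the path with any trailing slash removed. A sorted list stands in for the sorted columns of a Cassandra row, and the function names are hypothetical.

```python
import bisect


def depth_of(path):
    """Assumed convention: slashes in the path, ignoring a trailing one."""
    return path.rstrip("/").count("/")


def entry_key(path):
    """Column name for an entry: '<depth>:<path>'."""
    return f"{depth_of(path)}:{path}"


def list_directory(sorted_keys, directory):
    """Return direct children of a directory using one prefix scan."""
    prefix = f"{depth_of(directory) + 1}:{directory}"
    start = bisect.bisect_left(sorted_keys, prefix)
    children = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys sharing the prefix are contiguous when sorted
        children.append(key.split(":", 1)[1])
    return children
```

Listing "/d1/" then never visits entries under "/d1/d3/", because those carry a deeper prefix and sort elsewhere.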
Valhalla 2.0
● volumes
  ○ vol1
    ■ 1:/d1/
    ■ 1:/d1/f1.txt → {"size": 1243, "hash": ade12...
    ■ 1:/d1/d3/
    ■ 2:/d1/d3/f2.txt → {"size": 111, "hash": c12bea...
    ■ ...
  ○ vol2
    ■ 1:/dir1/
    ■ 1:/dir1/file.txt → {"size": 1243, "hash": ade12...
    ■ 1:/dir1/f2.txt → {"size": 111, "hash": c12bea...
    ■ ...
  ○ vol3
    ■ 1:/dir3/
    ■ 1:/dir3/f3.txt → {"size": 5243, "hash": 13a8cd...
    ■ 1:/dir3/f2.txt → {"size": 111, "hash": c12bea...
    ■ ...
  ○ ...
● content_by_file
  ○ ade12...
    ■ 0 → binary
    ■ 1 → binary
  ○ c12bea...
    ■ 0 → binary
  ○ 13a8cd...
    ■ 0 → binary
    ■ 1 → binary
    ■ 2 → binary
  ○ ...
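
The 2.0 layout combines two of the fixes: file content is split into numbered chunks under its hash, and the size and hash are denormalized into the directory entry. Below is a minimal Python sketch of both, again with dictionaries in place of Cassandra. The chunk size, the SHA-256 hash, and the function names are assumptions for illustration.

```python
import hashlib
import json


def store_file(volumes, content_by_file, volume, entry_key, data, chunk_size):
    """Write chunks under the content hash and metadata into the entry."""
    digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    if digest not in content_by_file:
        content_by_file[digest] = {
            offset // chunk_size: data[offset:offset + chunk_size]
            for offset in range(0, len(data), chunk_size)
        }
    metadata = json.dumps({"size": len(data), "hash": digest})
    volumes.setdefault(volume, {})[entry_key] = metadata
    return digest


def file_size(volumes, volume, entry_key):
    """Size comes from the directory entry; no content is read."""
    return json.loads(volumes[volume][entry_key])["size"]


def read_file(volumes, content_by_file, volume, entry_key):
    """Reassemble the file by concatenating chunks in index order."""
    digest = json.loads(volumes[volume][entry_key])["hash"]
    chunks = content_by_file[digest]
    return b"".join(chunks[index] for index in sorted(chunks))
```

Because `file_size` reads only the volume row, a stat-like call no longer needs to load file content.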
Valhalla 2.0 Retrospective
● What worked
  ○ Version 1.0 issues fixed
● Problems to solve
  ○ Directory listings iterate over many columns
    ■ Fix: Cache complete PROPFIND responses
  ○ Single-threaded client bottlenecks
    ■ Fix: "Fast path" with direct HTTP from PHP and
        proxied by Nginx
  ○ File content compaction eats up too much disk
    ■ Fix: "Offloading" cold and large content to S3
        using iterative scripts and real-time decisions
Valhalla 3.0
● listing_cache
  ○ vol1
    ■ /dir1/ → binary
    ■ /dir2/ → binary
  ○ vol2
    ■ /dir1/ → binary
  ○ vol3
    ■ /d1/ → binary
    ■ /d1/d2/ → binary
    ■ /d3/ → binary
  ○ ...
● Unchanged
  ○ content_by_file
  ○ volumes
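
The listing_cache above stores one opaque, fully rendered response per directory. The Python sketch below shows that pattern: build the complete PROPFIND body once, serve it from the cache afterwards, and drop it when the directory changes. The class name and the `build_listing` callable are hypothetical, and a dictionary stands in for the Cassandra column family.

```python
class ListingCache:
    """Caches complete directory listing responses per (volume, directory)."""

    def __init__(self, build_listing):
        # build_listing(volume, directory) -> bytes is a hypothetical callable
        # that renders the full response by iterating over the volume row.
        self._build_listing = build_listing
        self._responses = {}

    def get(self, volume, directory):
        """Return the cached response, rendering it on the first request."""
        key = (volume, directory)
        if key not in self._responses:
            self._responses[key] = self._build_listing(volume, directory)
        return self._responses[key]

    def invalidate(self, volume, directory):
        """Drop the cached response after any change in the directory."""
        self._responses.pop((volume, directory), None)
```

The trade-off is that every change forces the next reader to rebuild the whole listing.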
Valhalla 3.0 Retrospective
● What worked
  ○ Version 2.0 issues fixed
● Problems to solve
  ○ Changes invalidate cached PROPFINDs, and then
    clients must issue a fresh PROPFIND
    ■ Fix: Extend schema and API to support volume
        and directory event propagation
  ○ Single-threaded client still bottlenecks
    ■ Fix: New, multithreaded client
  ○ Client uses a write-invalidate cache
    ■ Fix: Move to a write-through/write-back model
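
The difference between the write-invalidate cache and the write-through model can be sketched in Python. A plain dictionary stands in for the server, and the class names are hypothetical. Write-back, which defers the server write, is not shown.

```python
class WriteInvalidateCache:
    """On write, send to the server and drop the local copy."""

    def __init__(self, backend):
        self.backend = backend  # dict standing in for the remote server
        self.cache = {}
        self.fetches = 0        # counts reads that had to go to the server

    def read(self, path):
        if path not in self.cache:
            self.fetches += 1
            self.cache[path] = self.backend[path]
        return self.cache[path]

    def write(self, path, data):
        self.backend[path] = data
        self.cache.pop(path, None)  # next read must fetch again


class WriteThroughCache(WriteInvalidateCache):
    """On write, send to the server and keep the local copy current."""

    def write(self, path, data):
        self.backend[path] = data
        self.cache[path] = data     # next read is served locally
```

With write-invalidate, a client that writes a file and then reads it pays for a round trip it could have avoided. Write-through removes that fetch.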
Meanwhile, in backups
● Stopped using davfs2 file mounts
● New backup preparation algorithm
  a. Backup builder downloads the volume manifest
  b. Iterates through each file, streaming it directly from S3
     into the tarball
  c. Any files not yet on S3 get pushed there by
     requesting an "offload"
● Lower client overhead
● Lower server overhead
● Longer backup preparation time
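
The backup preparation steps above can be sketched in Python with the standard `tarfile` module. The manifest format (pairs of path and content hash), the dictionary standing in for S3, and the `request_offload` callable are all assumptions for illustration.

```python
import io
import tarfile


def build_backup(manifest, s3, request_offload):
    """Build a tar archive from a volume manifest.

    manifest: iterable of (path, content_hash) pairs (assumed format).
    s3: dict of content_hash -> bytes, standing in for the S3 bucket.
    request_offload: callable(content_hash) that makes the content
        available in s3 (hypothetical stand-in for the offload request).
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path, content_hash in manifest:
            if content_hash not in s3:
                request_offload(content_hash)  # step c: push to S3 first
            data = s3[content_hash]
            info = tarfile.TarInfo(name=path.lstrip("/"))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))  # step b: S3 to tarball
    return buffer.getvalue()
```

The builder never mounts the volume, which fits the lower client and server overhead noted above. Waiting on offloads would plausibly account for the longer preparation time.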
Valhalla 4.0
● events
  ○ vol1:/dir1/
    ■ t=1 → {"path": "/dir2/","event": "CREATED"...
    ■ t=2 → {"path": "/dir2/f2.txt","event": "CREATED"...
  ○ vol1:/dir2/
    ■ t=5 → {"path": "/dir5/","event": "CREATED"...
    ■ t=6 → {"path": "/dir6/","event": "CREATED"...
  ○ vol3:/d1/d2/
    ■ t=5 → {"path": "f3.txt","event": "CREATED"...
    ■ t=6 → {"path": "f3.txt","event": "DESTROYED"...
  ○ ...
● Unchanged
  ○ content_by_file
  ○ volumes
  ○ listing_cache
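
The events layout above keys each row by volume and directory and each column by timestamp. A client that remembers the last timestamp it saw can therefore ask only for newer events instead of repeating a full PROPFIND. Below is a minimal Python sketch of that polling pattern. The function names are hypothetical, and nested dictionaries stand in for the Cassandra row and its time-ordered columns.

```python
from collections import defaultdict

# "volume:directory" -> {timestamp: event dict}
events = defaultdict(dict)


def publish(volume, directory, t, path, event):
    """Record one event in the row for the affected directory."""
    events[f"{volume}:{directory}"][t] = {"path": path, "event": event}


def poll(volume, directory, since):
    """Return (timestamp, event) pairs newer than `since`, oldest first."""
    row = events.get(f"{volume}:{directory}", {})
    return [(t, row[t]) for t in sorted(row) if t > since]
```

For example, after publishing the two `vol3:/d1/d2/` events shown above, `poll("vol3", "/d1/d2/", 5)` returns only the `DESTROYED` event at `t=6`.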
Valhalla 4.0 Retrospective
● What worked
  ○ Version 3.0 issues fixed
● Problems to solve
  ○ Cloning volumes breaks the event stream
     ■ Fix: Invalidate events from before the volume
        clone request
  ○ Clients receiving earlier copies of their own events
     ■ Fix: Only send clients events published by other
        clients
  ○ Clients write a file and then have to re-download it
    because of ETag limitations
     ■ Fix: Extend PUT to send ETag on response
  ○ Iteration through file content items times out
     ■ Fix: Iterate through local sstable keys
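
The "ETag on PUT" fix can be illustrated with a small Python sketch. If the server returns the new ETag in the PUT response, the client can keep the bytes it just wrote in its cache and later revalidate them with a conditional request instead of downloading them again. `FakeServer` and `Client` are hypothetical stand-ins, and deriving the ETag from a SHA-256 of the content is an assumption.

```python
import hashlib


class FakeServer:
    """In-memory stand-in for the WebDAV server."""

    def __init__(self):
        self.files = {}
        self.downloads = 0  # counts full-body responses

    def _etag(self, data):
        return hashlib.sha256(data).hexdigest()  # ETag scheme is assumed

    def put(self, path, data):
        self.files[path] = data
        return self._etag(data)  # the extension: ETag on the PUT response

    def get(self, path, if_none_match=None):
        data = self.files[path]
        etag = self._etag(data)
        if etag == if_none_match:
            return 304, etag, None  # not modified, no body sent
        self.downloads += 1
        return 200, etag, data


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # path -> (etag, data)

    def write(self, path, data):
        etag = self.server.put(path, data)
        self.cache[path] = (etag, data)  # keep the written bytes as valid

    def read(self, path):
        etag, data = self.cache.get(path, (None, None))
        status, new_etag, body = self.server.get(path, if_none_match=etag)
        if status == 200:
            self.cache[path] = (new_etag, body)
            return body
        return data
```

Without the ETag in the PUT response, the client has no validator for the bytes it just wrote and has to download them again.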
Valhalla 4.5
● volume_metadata
  ○ vol1
    ■ rewritten → t=3
  ○ vol2
  ○ vol3
    ■ rewritten → t=2
  ○ ...
● Unchanged
  ○ content_by_file
  ○ volumes
  ○ listing_cache
  ○ events
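
One way to read the `rewritten` timestamp in volume_metadata is as a cutoff: events older than the most recent clone of a volume are no longer valid. Combined with the fix of sending clients only events published by other clients, event delivery could be filtered roughly as in the Python sketch below. The `client` field on each event and the function name are hypothetical, since the slides truncate the event payload.

```python
def visible_events(volume_events, rewritten_at, client_id):
    """Filter a volume's events for delivery to one client.

    volume_events: list of dicts with 't', 'client', 'path', 'event'
        (the 'client' field is an assumption).
    rewritten_at: timestamp of the last clone into this volume, or None.
    client_id: the client asking for events.
    """
    cutoff = rewritten_at if rewritten_at is not None else float("-inf")
    return [
        event
        for event in volume_events
        if event["t"] > cutoff            # drop events from before the clone
        and event["client"] != client_id  # drop the client's own events
    ]
```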
Implementing the Client Side
● Ditched davfs2
  ○ Single-threaded with only experimental patches to
    multi-thread
  ○ Crufty code base designed to abstract FUSE versus
    Coda
● Based the new client on fusedav
  ○ Already multithreaded
  ○ Uses proven Neon WebDAV client
● Gutted cache
  ○ Needed fine-grained update capability for
    write-through and write-back
  ○ Replaced with LevelDB
● Added in high-level FUSE operations
  ○ Atomic open+truncate, atomic create+open, etc.
Caching model
● LevelDB
  ○   Embeddable with low overhead
  ○   Iteration without allocation management
  ○   Data model identical to single Cassandra row
  ○   Storage model similar to Cassandra sstables
  ○   Similar atomicity to row changes in Cassandra 1.1+
● Mirrored volume row locally
  ○ Including prefixes and metadata
  ○ May move to Merkle-based replication later
Benchmarks versus Local and Older Models
What's Next at Pantheon
● Move more toward a pNFS model
  ○ No file content storage in Cassandra (all in S3)
  ○ Peer-to-peer or other non-Cassandra file content
    coordination between clients
● Peer-to-peer cache advisories between
  clients
  ○ Less chatty server communication to poll events
  ○ Smaller window of incoherence (3s to <1s)
● Dropping the "fast path"
  ○ Client is already multithreaded
  ○ Client cache is smarter than direct Valhalla access
  ○ Minimizes incompatibility with Drupal
What's Next for the Community
● Finalize GPL-licensed FuseDAV client
  ○ Already public on GitHub
  ○ Public test suite with bundled server
  ○ Coordinate with existing FuseDAV users to make the
    Pantheon version the official successor
● Publish WebDAV extensions and seek
  standards acceptance
  ○ Progressive PROPFIND
  ○ ETag on PUT
David Strauss
● My groups
  ○ Drupal Association
  ○ Pantheon Systems
  ○ systemd/udev
● Get in touch
  ○ david@davidstrauss.net
  ○ @davidstrauss
  ○ facebook.com/warpforge
● Learn more about Pantheon
  ○   Developer Open House
  ○   Presented by Kyle Mathews and Josh Koenig
  ○   Thursday, February 14th, 12PM PST
  ○   Sign up: http://tinyurl.com/a3ofpc2

Valhalla at Pantheon

  • 1. Valhalla at Pantheon A Distributed File System Built on Cassandra, Twisted Python, and FUSE
  • 2. Pantheon's Requirements ● Density ○ Over 50K volumes in a single cluster ○ Over 1000 clients on a single application server ● Storage volume ○ Over 10TB in a single cluster ○ De-duplication of redundant data ● Throughput ○ Peaks during the U.S. business day and during site imports and backups ● Performance ○ Back-end for Drupal web applications; access has to be fast enough to not burden a web request ○ The applications won't be adapted from running on local disk to running on Valhalla
  • 3. Why not off-the-shelf? ● NFS ○ UID mapping requires trusted clients and networks ○ Standard Kerberos implementations have no HA ○ No cloud HA for client/server communication ● GlusterFS ○ Cannot scale volume density (though HekaFS can) ○ Cannot de-duplicate data ● Ceph ○ Security model relies on trusted clients ● MooseFS ○ Only primitive security
  • 4. Valhalla's Design Manifesto ● Drupal applications read and write whole files between 10KB and 10MB ○ And most reads hit the edge proxy cache ● Drupal tracks files in its database and has little need for fstat() or directory listings ● POSIX compliance for locking and permissions is unimportant ○ But volume-level access control is critical ● Volumes may contain up to 1MM files ● Availability and performance trump consistency
  • 5. volumes content_by_file /d1/ /d1/f1.txt /d1/d3/ /d1/d3/f2.txt content vol1 ... ade12... ade12... c12bea... binary /dir1/ /dir1/file.txt /dir1/f2.txt content vol2 ... c12bea... ade12... c12bea... binary /dir3/ /dir3/f3.txt /dir3/f2.txt content vol3 ... 13a8cd... 13a8cd... c12bea... binary ... ... Valhalla 1.0
  • 6. Valhalla 1.0 Retrospective ● What worked ○ Efficient volume cloning ● What didn't ○ Slow computation of directory content when a directory is small but contains a large subdirectory ■ Fix: Depth prefix for entries ○ Slow computation of file size ■ Fix: Denormalize metadata into directory entries ○ Problems replicating large files ■ Fix: Split files into chunks
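The "depth prefix" fix can be illustrated with a small sketch. With plain paths as column names, a slice over `/d1/` also walks every entry of a large subdirectory such as `/d1/d3/`; prefixing each name with a depth keeps a directory's direct children contiguous in sorted order. The depth convention used here (number of components in the parent directory) is an assumption and may not match the numbering in the slides exactly; `depth_key` and `list_directory` are hypothetical names, and a sorted Python list stands in for a Cassandra row.

```python
import bisect


def _components(path):
    """Split a path into its non-empty components."""
    return [part for part in path.strip("/").split("/") if part]


def depth_key(path):
    """Prefix a path with the depth of its parent directory (assumed convention)."""
    depth = max(len(_components(path)) - 1, 0)
    return "%d:%s" % (depth, path)


def list_directory(sorted_keys, directory):
    """Return the direct children of `directory` using one contiguous scan."""
    prefix = "%d:%s" % (len(_components(directory)), directory)
    start = bisect.bisect_left(sorted_keys, prefix)
    children = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # left the contiguous range; deeper entries are never visited
        children.append(key.split(":", 1)[1])
    return children
```

Listing `/d1/` then scans only keys beginning with `1:/d1/`, no matter how many entries live under `/d1/d3/`.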
  • 7. Valhalla 2.0 (schema diagram) ● volumes: column names gain a depth prefix; file columns hold JSON metadata ○ vol1: 1:/d1/, 1:/d1/f1.txt = {"size": 1243, "hash": ade12...}, 1:/d1/d3/, 2:/d1/d3/f2.txt = {"size": 111, "hash": c12bea...}, ... ○ vol2: 1:/dir1/, 1:/dir1/file.txt = {"size": 1243, "hash": ade12...}, 1:/dir1/f2.txt = {"size": 111, "hash": c12bea...}, ... ○ vol3: 1:/dir3/, 1:/dir3/f3.txt = {"size": 5243, "hash": 13a8cd...}, 1:/dir3/f2.txt = {"size": 111, "hash": c12bea...}, ... ● content_by_file: one row per content hash, one numbered column per chunk ○ ade12...: 0 = binary, 1 = binary ○ c12bea...: 0 = binary ○ 13a8cd...: 0 = binary, 1 = binary, 2 = binary
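A minimal sketch of the two other 2.0 changes, chunked content and metadata denormalized into the directory entry, might look like this. The chunk size, the SHA-1 hash, and the function names are assumptions for illustration; the slides state neither a chunk size nor a hash algorithm.

```python
import hashlib
import json

CHUNK_SIZE = 1024 * 1024  # assumption: the slides do not give a chunk size


def split_chunks(data, chunk_size=CHUNK_SIZE):
    """Map chunk index -> bytes, mirroring the numbered content columns."""
    return {
        index: data[offset:offset + chunk_size]
        for index, offset in enumerate(range(0, len(data), chunk_size))
    }


def join_chunks(columns):
    """Reassemble a file from its numbered chunk columns."""
    return b"".join(columns[index] for index in sorted(columns))


def directory_entry(data):
    """Build the JSON value stored under the file's path column."""
    return json.dumps(
        {"size": len(data), "hash": hashlib.sha1(data).hexdigest()},
        sort_keys=True,
    )
```

Because the size sits in the directory entry, a stat-like lookup reads one column from the volume row and never touches `content_by_file`.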
  • 8. Valhalla 2.0 Retrospective ● What worked ○ Version 1.0 issues fixed ● Problems to solve ○ Directory listings iterate over many columns ■ Fix: Cache complete PROPFIND responses ○ Single-threaded client bottlenecks ■ Fix: "Fast path" with direct HTTP from PHP and proxied by Nginx ○ File content compaction eats up too much disk ■ Fix: "Offloading" cold and large content to S3 using iterative scripts and real-time decisions
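  • 9. Valhalla 3.0 (schema diagram) ● New column family listing_cache: one row per volume, one column per directory holding a binary cached listing ○ vol1: /dir1/ = binary, /dir2/ = binary, ... ○ vol2: /dir1/ = binary, ... ○ vol3: /d1/ = binary, /d1/d2/ = binary, /d3/ = binary, ... ● Unchanged: content_by_file, volumes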
  • 10. Valhalla 3.0 Retrospective ● What worked ○ Version 2.0 issues fixed ● Problems to solve ○ Changes invalidate cached PROPFINDs, and then clients do a PROPFIND ■ Fix: Extend schema and API to support volume and directory event propagation ○ Single-threaded client still bottlenecks ■ Fix: New, multithreaded client ○ Client uses a write-invalidate cache ■ Fix: Move to a write-through/write-back model
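The difference between the write-invalidate cache and the write-through model named on this slide can be sketched as below. This is a toy model, not the client's actual code: a dict stands in for the server, the class names are hypothetical, and write-back (deferring the server write) is left out.

```python
class WriteInvalidateCache:
    """On write, update the server and drop the local copy."""

    def __init__(self, backend):
        self.backend = backend   # assumption: a dict stands in for the server
        self.cache = {}
        self.backend_reads = 0   # counts round trips caused by cache misses

    def read(self, path):
        if path not in self.cache:
            self.backend_reads += 1
            self.cache[path] = self.backend[path]
        return self.cache[path]

    def write(self, path, data):
        self.backend[path] = data
        self.cache.pop(path, None)  # the next read must re-download the file


class WriteThroughCache(WriteInvalidateCache):
    """On write, update the server and keep the new data cached locally."""

    def write(self, path, data):
        self.backend[path] = data
        self.cache[path] = data     # the next read is served locally
```

A write followed by a read costs one extra server round trip under write-invalidate and none under write-through.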
  • 11. Meanwhile, in backups ● Stopped using davfs2 file mounts ● New backup preparation algorithm a. Backup builder downloads volume manifest b. Iterates through each file and goes directly from S3 to the tarball c. Any files not yet on S3 get pushed there by requesting an "offload" ● Lower client overhead ● Lower server overhead ● Longer backup preparation time
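Steps a through c of the backup preparation algorithm can be sketched as follows. The manifest format (a list of path and hash pairs), the dict standing in for S3, and the synchronous `request_offload` callback are all assumptions for illustration; the slides do not describe the real interfaces.

```python
import io
import tarfile


def build_backup(manifest, s3, request_offload):
    """Build a gzip-compressed tarball from a volume manifest.

    manifest:        iterable of (path, content_hash) pairs (assumed format)
    s3:              dict-like store keyed by content hash (stand-in for S3)
    request_offload: callable that pushes a hash to S3 (assumed synchronous)
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, digest in manifest:
            if digest not in s3:
                request_offload(digest)  # step c: content not yet on S3
            data = s3[digest]            # step b: read straight from S3
            info = tarfile.TarInfo(name=path.lstrip("/"))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

No file mount is involved, which matches the slide's point about lower client and server overhead; waiting on offloads is one plausible reason preparation takes longer.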
  • 12. Valhalla 4.0 (schema diagram) ● New column family events: one row per volume and directory, one column per timestamp holding a JSON event ○ vol1:/dir1/: t=1 = {"path": "/dir2/", "event": "CREATED"...}, t=2 = {"path": "/dir2/f2.txt", "event": "CREATED"...}, ... ○ vol1:/dir2/: t=5 = {"path": "/dir5/", "event": "CREATED"...}, t=6 = {"path": "/dir6/", "event": "CREATED"...}, ... ○ vol3:/d1/d2/: t=5 = {"path": "f3.txt", "event": "CREATED"...}, t=6 = {"path": "f3.txt", "event": "DESTROYED"...}, ... ● Unchanged: content_by_file, volumes, listing_cache
  • 13. Valhalla 4.0 Retrospective ● What worked ○ Version 3.0 issues fixed ● Problems to solve ○ Cloning volumes breaks the event stream ■ Fix: Invalidate events from before the volume clone request ○ Clients receiving earlier copies of their own events ■ Fix: Only send clients events published by other clients ○ Clients write a file and then have to re-download it because of ETag limitations ■ Fix: Extend PUT to send ETag on response ○ Iteration through file content items times out ■ Fix: Iterate through local sstable keys
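Two of the event-stream fixes on this slide, ignoring events from before a volume clone and sending clients only events published by other clients, can be sketched as a single filter. The `publisher` field and the `cloned_at` timestamp are assumptions; the slides show only `path` and `event` in each record and do not say how the clone time is tracked.

```python
from collections import namedtuple

# Assumption: `publisher` identifies the client that produced the event.
Event = namedtuple("Event", ["t", "path", "event", "publisher"])


def events_for_client(events, client_id, since, cloned_at=None):
    """Return the events a polling client should receive.

    Events at or before `since` were already delivered, events at or before
    `cloned_at` predate the volume clone, and a client's own events are
    skipped because it already knows about its own writes.
    """
    floor = since if cloned_at is None else max(since, cloned_at)
    return [
        event for event in events
        if event.t > floor and event.publisher != client_id
    ]
```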
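  • 14. Valhalla 4.5 (schema diagram) ● New column family volume_metadata: one row per volume with a "rewritten" timestamp column ○ vol1: rewritten = t=3, ... ○ vol2: ... ○ vol3: rewritten = t=2, ... ● Unchanged: content_by_file, volumes, listing_cache, events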
  • 15. Implementing the Client Side ● Ditched davfs2 ○ Single-threaded with only experimental patches to multi-thread ○ Crufty code base designed to abstract FUSE versus Coda ● Based code off of fusedav ○ Already multithreaded ○ Uses proven Neon WebDAV client ● Gutted cache ○ Needed fine-grained update capability for write-through and write-back ○ Replaced with LevelDB ● Added in high-level FUSE operations ○ Atomic open+truncate, atomic create+open, etc.
  • 16. Caching model ● LevelDB ○ Embeddable with low overhead ○ Iteration without allocation management ○ Data model identical to single Cassandra row ○ Storage model similar to Cassandra sstables ○ Similar atomicity to row changes in Cassandra 1.1+ ● Mirrored volume row locally ○ Including prefixes and metadata ○ May move to Merkle-based replication later
  • 17. Benchmarks versus Local and Older Models
  • 18. Benchmarks versus Local and Older Models
  • 19. What's Next at Pantheon ● Move more toward a pNFS model ○ No file content storage in Cassandra (all in S3) ○ Peer-to-peer or other non-Cassandra file content coordination between clients ● Peer-to-peer cache advisories between clients ○ Less chatty server communication to poll events ○ Smaller window of incoherence (3s to <1s) ● Dropping the "fast path" ○ Client is already multithreaded ○ Client cache is smarter than direct Valhalla access ○ Minimizes incompatibility with Drupal
  • 20. What's Next for the Community ● Finalize GPL-licensed FuseDAV client ○ Already public on GitHub ○ Public test suite with bundled server ○ Coordinate with existing FuseDAV users to make the Pantheon version the official successor ● Publish WebDAV extensions and seek standards acceptance ○ Progressive PROPFIND ○ ETag on PUT
  • 21. David Strauss ● My groups ○ Drupal Association ○ Pantheon Systems ○ systemd/udev ● Get in touch ○ david@davidstrauss.net ○ @davidstrauss ○ facebook.com/warpforge ● Learn more about Pantheon ○ Developer Open House ○ Presented by Kyle Mathews and Josh Koenig ○ Thursday, February 14th, 12PM PST ○ Sign up: http://tinyurl.com/a3ofpc2