SeaweedFS Intro
2019.3
chris.lu@gmail.com
SeaweedFS Intro
● Overview
● Internal Architecture
○ Object/Blob store
○ Filer Store
○ S3/Hadoop
○ Notification/Cross-Region Replication
Overview: What is special?
● Distributed
● Handles large and small files
● Optimized for large numbers of small files
● Random access to any file
● Low-latency access to any file
● Parallel processing
Overview: APIs
● REST API for object storage
● REST/gRPC API for file system storage
● Hadoop compatible
● FUSE client to mount the file system locally
● S3 API
Architecture
● Object Storage
● File Storage
● Interface/Client Layer
Volume Store
● Based on Facebook's Haystack paper
Object Storage: Client Write
(Diagram: a client, the master, and three volume servers.)
1. Client sends an HTTP request to the master for a file id.
2. Master returns the file id.
3. Client uploads the file over HTTP to a volume server, using the file id.
Example file id: 3,01637037d6
● 3: volume id
● 01: file key
● 637037d6: file cookie
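The file id packs three fields into one string. The sketch below parses it, assuming (as in the example) a decimal volume id before the comma and a hex string after it whose last 8 hex digits are the cookie. `FileID` and `ParseFileID` are hypothetical names, not SeaweedFS's own code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FileID is a hypothetical struct mirroring the three parts of a file id.
type FileID struct {
	VolumeID uint32
	Key      uint64
	Cookie   uint32
}

// ParseFileID splits "3,01637037d6" into volume id, file key and cookie.
// Assumption: after the comma everything is hex; the last 8 digits are the cookie.
func ParseFileID(fid string) (FileID, error) {
	volPart, rest, ok := strings.Cut(fid, ",")
	if !ok {
		return FileID{}, fmt.Errorf("missing comma in %q", fid)
	}
	vid, err := strconv.ParseUint(volPart, 10, 32)
	if err != nil {
		return FileID{}, fmt.Errorf("bad volume id: %w", err)
	}
	if len(rest) <= 8 {
		return FileID{}, fmt.Errorf("key+cookie too short in %q", fid)
	}
	keyHex, cookieHex := rest[:len(rest)-8], rest[len(rest)-8:]
	key, err := strconv.ParseUint(keyHex, 16, 64)
	if err != nil {
		return FileID{}, fmt.Errorf("bad file key: %w", err)
	}
	cookie, err := strconv.ParseUint(cookieHex, 16, 32)
	if err != nil {
		return FileID{}, fmt.Errorf("bad cookie: %w", err)
	}
	return FileID{VolumeID: uint32(vid), Key: key, Cookie: uint32(cookie)}, nil
}

func main() {
	id, err := ParseFileID("3,01637037d6")
	if err != nil {
		panic(err)
	}
	fmt.Printf("volume=%d key=%x cookie=%08x\n", id.VolumeID, id.Key, id.Cookie)
}
```

For the example this prints `volume=3 key=1 cookie=637037d6`. In the Haystack design, the cookie is what stops a client from reading a file by guessing its key.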
Object Storage: Client Read
(Diagram: a client, the master, and three volume servers.)
1. Client asks the master to look up the volume id.
2. Master returns the volume location.
3. Client fetches the file over HTTP from the volume server, using the file id.
● Volume locations can be cached.
● Clients can also subscribe to volume location changes.
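A sketch of a client-side volume location cache for the read path. `LookupFunc` is a hypothetical stand-in for the master query, `Update` models a pushed location change from a subscription, and the read URL is assumed to be `http://<volume server>/<fid>`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// LookupFunc stands in for asking the master where a volume lives.
// Hypothetical: a real client would query the master over HTTP or gRPC.
type LookupFunc func(volumeID uint32) ([]string, error)

// LocationCache maps volume id -> volume server addresses.
type LocationCache struct {
	mu      sync.Mutex
	lookup  LookupFunc
	entries map[uint32][]string
}

func NewLocationCache(lookup LookupFunc) *LocationCache {
	return &LocationCache{lookup: lookup, entries: make(map[uint32][]string)}
}

// Locations returns cached addresses and asks the master only on a miss.
func (c *LocationCache) Locations(vid uint32) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if locs, ok := c.entries[vid]; ok {
		return locs, nil
	}
	locs, err := c.lookup(vid)
	if err != nil {
		return nil, err
	}
	c.entries[vid] = locs
	return locs, nil
}

// Update applies a pushed location change, e.g. from a subscription.
func (c *LocationCache) Update(vid uint32, locs []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[vid] = locs
}

// FileURL builds the read URL for a file id such as "3,01637037d6".
func (c *LocationCache) FileURL(fid string) (string, error) {
	volPart, _, ok := strings.Cut(fid, ",")
	if !ok {
		return "", fmt.Errorf("bad file id %q", fid)
	}
	vid, err := strconv.ParseUint(volPart, 10, 32)
	if err != nil {
		return "", err
	}
	locs, err := c.Locations(uint32(vid))
	if err != nil {
		return "", err
	}
	if len(locs) == 0 {
		return "", fmt.Errorf("no location for volume %d", vid)
	}
	return "http://" + locs[0] + "/" + fid, nil
}

func main() {
	calls := 0
	cache := NewLocationCache(func(vid uint32) ([]string, error) {
		calls++
		return []string{"127.0.0.1:8080"}, nil // pretend master reply
	})
	for i := 0; i < 3; i++ {
		u, err := cache.FileURL("3,01637037d6")
		if err != nil {
			panic(err)
		}
		fmt.Println(u)
	}
	fmt.Println("master lookups:", calls) // master lookups: 1
}
```

Holding the lock during a lookup keeps the sketch simple but serializes cache misses. A production client might use per-volume locks or request coalescing instead.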
Object Storage and File Storage
(Diagram: a Filer Client uploads a file to a directory through the Filer. The Filer keeps metadata in the Filer Store (Local, MySql, Postgres, Redis, or Cassandra) and stores blobs on the master/volume servers of Object Storage. An S3 API Gateway serves S3 Clients.)
Filer Store Data Layout
Key | Value
/a/b/c/ | Attr
/a/b/c/def.txt | Attr, FileChunks
Volume-Aware Clients
(Diagram: a Hadoop Client, a Mounted FUSE Client, and other SeaweedFS volume-aware clients get metadata from the Filer, which is backed by the Filer Store (Local, MySql, Postgres, Redis, Cassandra). They access blobs directly on the volume servers of Object Storage.)
Volume-based data placement
● Volumes are organized with different settings:
○ Collection
■ TTL
■ Replication
● The master randomly assigns each write request to one of the writable volumes.
● Writes are strongly consistent across all replicas.
● If one replica fails its heartbeat, the master marks the volume id as read-only.
● New writes should then be assigned to other writable volumes.
Security: per-object access control with JWT
(Diagram: a client, the master, and three volume servers in Object Storage.)
1. Client requests a FileId from the master.
2. Master returns the FileId plus a JWT.
3. Client uploads the file to a volume server with the FileId + JWT.
● A JSON Web Token (JWT) grants permission to create/update/delete a file.
● The token expires after 10 seconds.
Secure Volume Server
● Mutual TLS
○ Secures admin operations from the master to volume servers
● JWT
○ Secures object changes
Volume servers can be placed anywhere: any server with some free space can be a volume server.
(Diagram: the master reaches the volume servers through mutual-TLS gRPC calls; changes on the volume servers are JWT-authorized.)
High Availability: Master Server
(Diagram: three masters in front of three volume servers in Object Storage.)
● Multi-master cluster
● Leader election with the Raft consensus algorithm
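Raft elects a leader only when a candidate collects votes from a strict majority of the masters. That rule fixes how many master failures a cluster can survive, and this small calculation makes the arithmetic concrete.

```go
package main

import "fmt"

// quorum is the number of votes a Raft candidate needs: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// tolerated is how many masters can fail while a leader can still be elected.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("%d master(s): quorum %d, survives %d failure(s)\n", n, quorum(n), tolerated(n))
	}
}
```

An even number of masters adds no fault tolerance over the next smaller odd number (4 survives 1 failure, the same as 3), which is why odd cluster sizes such as 3 are the usual choice.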
High Availability: Filer Server
● Multiple stateless filer servers
● The shared filer store can be any HA storage solution.
(Diagram: several filers in File Storage sharing one filer store: MySql, Postgres, Redis, or Cassandra.)
Scalability: Filer
● Direct blob access.
● The filer store can be any proven store, and adding a new store is simple:
○ Redis
○ MySql/Postgres
○ Cassandra
○ Interface for any key-value store
● Unlimited files under one directory.
● Blob storage supports multiple filers.
File Change Notification
● All filer change notifications can be sent to a message queue.
● Notifications are encoded with Protobuf.
● Cross-region replication is built on top of this.
(Diagram: the filer writes metadata to the filer store (Local, MySql, Postgres, Redis, Cassandra) and publishes notifications to a message queue: Kafka, AWS SNS/SQS, Azure Service Bus, Google Pub/Sub, NATS, or RabbitMQ.)
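A sketch of how a remote region could replay filer notifications to keep its metadata in sync. The `Event` shape is hypothetical: a nil `New` means delete, otherwise create or update. A buffered channel stands in for the message queue. Real cross-region replication would also copy the blobs, which this sketch omits.

```go
package main

import "fmt"

// Event is a hypothetical change notification; entries are simplified to strings.
type Event struct {
	Path     string
	Old, New *string
}

// apply replays one event onto a remote replica's metadata map.
func apply(replica map[string]string, ev Event) {
	if ev.New == nil {
		delete(replica, ev.Path) // delete event
		return
	}
	replica[ev.Path] = *ev.New // create or update event
}

func main() {
	queue := make(chan Event, 8) // stands in for Kafka, SQS, NATS, ...
	v1, v2 := "v1", "v2"
	queue <- Event{Path: "/a/x.txt", New: &v1}
	queue <- Event{Path: "/a/x.txt", Old: &v1, New: &v2}
	queue <- Event{Path: "/a/y.txt", New: &v1}
	queue <- Event{Path: "/a/y.txt", Old: &v1}
	close(queue)

	replica := map[string]string{}
	for ev := range queue {
		apply(replica, ev)
	}
	fmt.Println(replica) // map[/a/x.txt:v2]
}
```

The replay is only correct if events for the same path arrive in order, so the choice of queue and its partitioning matters for replication.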
Atomicity
Operation | Atomic? | Note
Creating a file | Yes |
Deleting a file | Yes |
Renaming a file | Yes with MySql/Postgres; No with Redis/LevelDB/Cassandra | Implemented via database transactions.
Renaming a directory | Yes with MySql/Postgres; No with Redis/LevelDB/Cassandra | Implemented via database transactions.
Creating a single directory with mkdir() | Yes |
Recursive directory deletion | No |
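Renaming a directory touches every key under it, so it is atomic only if all those key moves commit together. The toy Go model below shows the all-or-nothing behavior a database transaction provides. `Store`, `Txn`, and `RenameDir` are hypothetical, and copying the whole map per transaction is for illustration only. Without such transactions, as with Redis/LevelDB/Cassandra, a failure midway could leave a partial rename.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// Store is a toy metadata store with all-or-nothing transactions,
// standing in for MySql/Postgres. Hypothetical; not SeaweedFS code.
type Store struct {
	mu sync.Mutex
	kv map[string]string
}

// Txn runs fn against a private copy and publishes it only if fn succeeds.
func (s *Store) Txn(fn func(kv map[string]string) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	work := make(map[string]string, len(s.kv))
	for k, v := range s.kv {
		work[k] = v
	}
	if err := fn(work); err != nil {
		return err // s.kv untouched: nothing from the failed rename is visible
	}
	s.kv = work
	return nil
}

// RenameDir moves every key under oldDir to newDir in one transaction.
func (s *Store) RenameDir(oldDir, newDir string) error {
	return s.Txn(func(kv map[string]string) error {
		var keys []string
		for k := range kv {
			if strings.HasPrefix(k, oldDir) {
				keys = append(keys, k)
			}
		}
		if len(keys) == 0 {
			return errors.New("no such directory: " + oldDir)
		}
		for _, k := range keys {
			nk := newDir + strings.TrimPrefix(k, oldDir)
			if _, exists := kv[nk]; exists {
				return errors.New("target exists: " + nk)
			}
			kv[nk] = kv[k]
			delete(kv, k)
		}
		return nil
	})
}

func main() {
	s := &Store{kv: map[string]string{
		"/a/b/c/":        "dir",
		"/a/b/c/def.txt": "3,01637037d6",
		"/x/":            "dir",
	}}
	if err := s.RenameDir("/a/b/c/", "/x/c/"); err != nil {
		panic(err)
	}
	fmt.Println(s.kv) // map[/x/:dir /x/c/:dir /x/c/def.txt:3,01637037d6]
}
```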
Compared to HDFS
Aspect | HDFS | SeaweedFS
File metadata storage | Single namenode | Multiple stateless filers with a proven, scalable filer store (Redis, Cassandra, etc.)
Storing small files | Not recommended | Optimized for small files
Parallel data access | Yes | Yes
Hadoop compatible | Yes | Yes (atomic rename via database transactions)
Compared to CEPH
Aspect | CEPH | SeaweedFS
Data placement | CRUSH maps of the whole cluster; rather complicated, especially when adding storage. Calculated for each object. | Volume-level placement, amortized across the objects in each volume.
Storing small files | Not optimized | Optimized for small files
Scaling file system metadata | MDS dynamically partitions subtrees | Flat and linearly scalable
Easy to set up | Mixed reviews | Yes
Design Philosophy
● Scale up each layer independently.
● Batch small files
○ Data placement (CEPH: file-level; SeaweedFS: volume-level)
○ Tracking (HDFS namenode tracks blocks; SeaweedFS tracks volume locations)
○ Easy move/delete/replicate operations
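A worked example of why batching small files into volumes shrinks what must be placed and tracked. All three inputs are assumptions chosen for illustration: one billion files, an average of 100 KB each, and 30 GB volumes.

```go
package main

import "fmt"

// volumesNeeded returns how many fixed-size volumes hold the given bytes (ceiling division).
func volumesNeeded(totalBytes, volumeBytes int64) int64 {
	return (totalBytes + volumeBytes - 1) / volumeBytes
}

func main() {
	const (
		files       int64 = 1_000_000_000  // assumed workload: one billion files
		avgFileSize int64 = 100_000        // assumed average size: 100 KB
		volumeSize  int64 = 30_000_000_000 // assumed volume size: 30 GB
	)
	vols := volumesNeeded(files*avgFileSize, volumeSize)
	fmt.Printf("per-file placement entries: %d\n", files)
	fmt.Printf("volume locations to track:  %d\n", vols)
	fmt.Printf("files per tracked entry:    ~%d\n", files/vols)
}
```

With these numbers, 100 TB fits in 3,334 volumes. The master then tracks a few thousand volume locations instead of a billion per-file placements, and moving, deleting, or replicating a volume acts on hundreds of thousands of files at once.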
Open APIs
● gRPC APIs for admin operations
● HTTP APIs for uploading and serving blobs
● gRPC for filer metadata operations
● Metadata defined with Protocol Buffers
Future Plan
● Volume Server
○ Async Replica
○ Erasure Coding
○ Tiered Storage
● Integration
○ CSI, Docker volume plugin
○ Kerberos
● Tools
○ Auto Balance
Open APIs for possible extensions
● Build a different filer with striping.
● Build a different replication scheme
● Admin tools
● Custom Encryption
● Async Operations
○ Search
○ Secondary index
● Local cache for cloud files
● CDN
