Gluster FS
A distributed file system for
today and tomorrow's
BigData
Roberto “FRANK” Franchini
@robfrankie @CELI_NLP
CELI 2015
Roma, 27–28 Marzo 2015
15 years of experience, proud to be a programmer
Writes software for information extraction, NLP, opinion mining
(@scale), and a lot of other buzzwords
Implements scalable architectures
Plays with servers (don't say that to my sysadmin)
Member of the JUG-Torino coordination team
whoami(1)
CELI 2014 2
Company
CELI 2015 3
Identify a distributed and scalable file
system
for today and tomorrow's
Big Data
CELI 2015 4
The problem
Once upon a time
CELI 2015 5
Once upon a time
CELI 2015 6
Once upon a time
2008: One nfs share
1.5TB ought to be enough for anybody
Once upon a time
CELI 2015 7
Once upon a time
CELI 2015 8
Once upon a time
2010: Herd of shares
(1.5TB x N) ought to be enough for anybody
Once upon a time
CELI 2015 9
Once upon a time
DIY Hand Made Distributed File System:
HMDFS (tm)
Nobody could stop the data flood
It was the time for
something new
Once upon a time
CELI 2015 10
What we do with GFS
Can be enlarged on demand
No dedicated HW
OpenSource is preferred and trusted
No specialized API
No specialized Kernel
POSIX compliance
Zillions of big and small files
No NAS or SAN (€€€€€)
Requirements
CELI 2015 11
What we do with GFS
More than 10GB of Lucene inverted indexes to be stored
on a daily basis (more than 200GB/month)
Search stored indexes to extract different sets of
documents for different customers
Analytics on large sets of documents (years of data)
To do what?
CELI 2015 12
Features
GlusterFS
CELI 2015 13
Features
Clustered Scale-out General Purpose Storage Platform
POSIX-y Distributed File System
Built on commodity systems
x86_64 Linux ++
POSIX filesystems underneath (XFS, EXT4)
No central metadata Server (NO SPOF)
Modular architecture for scale and functionality
GlusterFs
CELI 2015 14
Features
Large Scale File Server
Media/Content Distribution Network (CDN)
Backup / Archive / Disaster Recovery (DR)
HDFS replacement in Hadoop ecosystem
High Performance Computing (HPC)
IaaS storage layer
Database offload (blobs)
Unified Object Store + File Access
Common use cases
CELI 2015 15
Features
ACL and Quota support
Fault-tolerance
Peer to peer
Self-healing
Fast setup
Enlarge/Shrink on demand
Snapshot
On cloud, on premise (physical or virtual)
Features
CELI 2015 16
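Quota and snapshots from the list above are driven entirely from the gluster CLI. A minimal sketch, assuming a volume named testvol, a release recent enough to ship snapshots, and thin-provisioned LVM bricks (a snapshot prerequisite); names and limits are illustrative:

sudo gluster volume quota testvol enable
sudo gluster volume quota testvol limit-usage /projects 10GB
sudo gluster snapshot create snap1 testvol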
Architecture
CELI 2015 17
Architecture
Peer / Node
Cluster servers (glusterfs server)
Runs the gluster daemons and participates in volumes
Node
CELI 2015 18
Architecture
Brick
A filesystem mountpoint on GlusterFS nodes
A unit of storage used as a capacity building block
Brick
CELI 2015 19
Architecture
From disks to bricks
CELI 2015 20
3TB
3TB
3TB
3TB
3TB
Gluster Server
3TB
3TB
3TB
3TB
3TB
Gluster Server
LVM
volume
Gluster Server
LVM
volume
LVM
volume
LVM
volume
Architecture
From disks to bricks
CELI 2015 21
3TB
3TB
3TB
3TB
3TB
Gluster Server
3TB
3TB
3TB
3TB
3TB
RAID 6
Physical
volume
Gluster Server
Single
Brick
Gluster Server
Architecture
From disks to bricks
CELI 2015 22
3TB
3TB
3TB
3TB
3TB
Gluster Server
3TB
3TB
3TB
3TB
3TB
/b1
/b3
/b5
/bN
/b7
Gluster Server
/b2
/b4
/b6
/bM
/b8
Bricks on a node
CELI 2015 23
Architecture
Logic between bricks or subvolumes that generates a new
subvolume with certain characteristics
Distribute, replicate and stripe are special translators that generate
RAID-like configurations
Performance translators: read-ahead, write-behind
Translators
CELI 2015 24
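In day-to-day use the translator stack is tuned indirectly through volume options rather than by editing volfiles; a sketch, assuming a volume named testvol, toggling the two performance translators mentioned above:

sudo gluster volume set testvol performance.read-ahead on
sudo gluster volume set testvol performance.write-behind on
sudo gluster volume info testvol   # reconfigured options show up here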
Architecture
Bricks combined and passed through translators
Ultimately, what's presented to the end user
Volume
CELI 2015 25
Architecture
brick →
translator → subvolume →
translator → subvolume →
translator → subvolume →
translator → volume
Volume
CELI 2015 26
Volume
CELI 2015 27
Gluster Server 1
/brick 14
/brick 13
/brick 12
/brick 11
Gluster Server 2
/brick 24
/brick 23
/brick 22
/brick 21
Gluster Server 3
/brick 34
/brick 33
/brick 32
/brick 31
Client
/bigdata
volume
replicate
distribute
Volume
CELI 2015 28
Volume types
CELI 2015 29
Distributed
The default configuration
Files “evenly” spread across bricks
Similar to file-level RAID 0
Server/Disk failure could be catastrophic
Distributed
CELI 2014 30
Distributed
CELI 2015 31
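A sketch of creating a purely distributed volume (node and brick names are hypothetical): with no replica count given, files are only hashed across the listed bricks, so losing one brick loses the files it holds:

sudo gluster volume create dist-vol node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start dist-vol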
Replicated
Files written synchronously to replica peers
Files read synchronously,
but ultimately serviced by the first responder
Similar to file-level RAID 1
Replicated
CELI 2015 32
Replicated
CELI 2015 33
Distributed + replicated
Distributed + replicated
Similar to file-level RAID 10
Most used layout
Distributed + replicated
CELI 2015 34
Distributed + replicated
CELI 2015 35
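A sketch of a distributed + replicated volume with four bricks and replica 2 (hypothetical names). Bricks are grouped into replica sets in the order they are listed, so consecutive bricks should live on different servers:

sudo gluster volume create dr-vol replica 2 \
  node01:/srv/sdb1/brick node02:/srv/sdb1/brick \
  node01:/srv/sdc1/brick node02:/srv/sdc1/brick
sudo gluster volume start dr-vol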
Striped
Individual files split among bricks (sparse files)
Similar to block-level RAID 0
Limited Use Cases
HPC Pre/Post Processing
File size exceeds brick size
Striped
CELI 2015 36
Striped
CELI 2015 37
Moving parts
CELI 2015 38
Components
glusterd
Management daemon
One instance on each GlusterFS server
Interfaced through gluster CLI
glusterfsd
GlusterFS brick daemon
One process for each brick on each server
Managed by glusterd
Components
CELI 2015 39
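A quick way to see these daemons at work on a running cluster (assuming a volume named testvol): peer status talks to glusterd, while volume status lists the glusterfsd brick processes with their PIDs and ports:

sudo gluster peer status
sudo gluster volume status testvol
ps aux | grep -E 'glusterd|glusterfsd'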
Components
glusterfs
Volume service daemon
One process for each volume service
NFS server, FUSE client, Self-Heal, Quota, ...
mount.glusterfs
FUSE native client mount extension
gluster
Gluster Console Manager (CLI)
Components
CELI 2015 40
Components
sudo fdisk /dev/sdb
sudo mkfs.xfs -i size=512 /dev/sdb1
sudo mkdir -p /srv/sdb1
sudo mount /dev/sdb1 /srv/sdb1
sudo mkdir -p /srv/sdb1/brick
echo "/dev/sdb1 /srv/sdb1 xfs defaults 0 0" | sudo
tee -a /etc/fstab
sudo apt-get install glusterfs-server
Quick start
CELI 2015 41
Components
sudo gluster peer probe node02
sudo gluster volume create testvol replica 2 node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start testvol
sudo mkdir /mnt/gluster
sudo mount -t glusterfs node01:/testvol /mnt/gluster
Quick start
CELI 2015 42
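A few optional follow-ups to the quick start above (same hypothetical node and volume names): verify the volume, then make the client mount persistent across reboots:

sudo gluster volume info testvol
sudo gluster volume status testvol
echo "node01:/testvol /mnt/gluster glusterfs defaults,_netdev 0 0" | sudo tee -a /etc/fstab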
Clients
CELI 2015 43
Clients: native
FUSE kernel module allows the filesystem to be built and
operated entirely in userspace
Specify mount to any GlusterFS server
Native Client fetches volfile from mount server, then
communicates directly with all nodes to access data
Recommended for high concurrency and high write
performance
Load is inherently balanced across distributed volumes
Clients: native
CELI 2015 44
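Because the mount server is only contacted to fetch the volfile, it is common to list fallback volfile servers at mount time; a sketch using the option name of recent 3.x native clients (node names are hypothetical):

sudo mount -t glusterfs -o backup-volfile-servers=node02:node03 node01:/testvol /mnt/gluster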
Clients: NFS
Standard NFS v3 clients
Standard automounter is supported
Mount to any server, or use a load balancer
GlusterFS NFS server includes Network Lock Manager
(NLM) to synchronize locks across clients
Better performance for reading many small files from a
single client
Load balancing must be managed externally
Clients: NFS
CELI 2015 45
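The built-in Gluster NFS server speaks NFSv3 over TCP, so the client has to ask for that explicitly; a sketch with hypothetical names:

sudo mount -t nfs -o vers=3,mountproto=tcp node01:/testvol /mnt/gluster-nfs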
Clients: libgfapi
Introduced with GlusterFS 3.4
User-space library for accessing data in GlusterFS
Filesystem-like API
Runs in application process
no FUSE, no copies, no context switches
...but same volfiles, translators, etc.
Clients: libgfapi
CELI 2015 46
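One everyday consumer of libgfapi is QEMU, which (when built with Gluster support) can address images on a volume directly through a gluster:// URI, with no FUSE mount in the path; a sketch with hypothetical names:

qemu-img create -f qcow2 gluster://node01/testvol/vm-disk.qcow2 20G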
Clients: SMB/CIFS
In GlusterFS 3.4 – Samba + libgfapi
No need for local native client mount & re-export
Significant performance improvements with FUSE removed
from the equation
Must be set up on each server you wish to connect to via
CIFS
CTDB is required for Samba clustering
Clients: SMB/CIFS
CELI 2014 47
Clients: HDFS
Access data within and outside of Hadoop
No HDFS name node single point of failure / bottleneck
Seamless replacement for HDFS
Scales with the massive growth of big data
Clients: HDFS
CELI 2015 48
Scalability
CELI 2015 49
Under the hood
Elastic Hash Algorithm
No central metadata
No Performance Bottleneck
Eliminates risk scenarios
Location hashed intelligently on filename
Unique identifiers (GFID), similar to md5sum
Under the hood
CELI 2015 50
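Both the hash layout and the GFID end up as extended attributes on the brick filesystems; a read-only sketch of inspecting them on a server (brick paths are hypothetical):

sudo getfattr -n trusted.glusterfs.dht -e hex /srv/sdb1/brick/somedir
sudo getfattr -n trusted.gfid -e hex /srv/sdb1/brick/somedir/somefile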
Scalability
3TB
3TB
3TB
3TB
3TB
Gluster Server
3TB
3TB
3TB
3TB
3TB
3TB
3TB
3TB
3TB
3TB
Gluster Server
3TB
3TB
3TB
3TB
3TB
3TB
3TB
3TB
3TB
3TB
Gluster Server
3TB
3TB
3TB
3TB
3TB
Scale out performance and availability
Scale out capacity
Scalability
CELI 2014 51
Scalability
Add disks to servers to increase storage size
Add servers to increase bandwidth and storage size
Add servers to increase availability (replica factor)
Scalability
CELI 2015 52
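Growing a distributed + replicated volume boils down to probing the new peers, adding bricks in multiples of the replica count, and rebalancing; a sketch with hypothetical node names and the testvol volume:

sudo gluster peer probe node03
sudo gluster peer probe node04
sudo gluster volume add-brick testvol node03:/srv/sdb1/brick node04:/srv/sdb1/brick
sudo gluster volume rebalance testvol start
sudo gluster volume rebalance testvol status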
What we do with
glusterFS
CELI 2015 53
What we do with GFS
Daily production of more than 10GB of Lucene inverted
indexes stored on glusterFS (more than 200GB/month)
Search stored indexes to extract different sets of
documents for every customer
Analytics over large sets of documents
YES: we open indexes directly on storage
(it's POSIX!!!)
What we do with GlusterFS
CELI 2015 54
What we do with GFS
Zillions of very small files: this kind of FS is not designed
to support this use case
What we can't do
CELI 2015 55
2010: first installation
Version 3.0.x
8 (not dedicated) servers
Distributed replicated
No bound on brick size (!!!!)
Ca. 4TB available
NOTE: stuck to 3.0.x until 2012 due to problems on 3.1 and
3.2 series, then Red Hat acquired Gluster (Red Hat Storage)
2010: first installation
CELI 2015 56
2012: (little) cluster
New installation, version 3.3.2
4TB available on 8 servers (DELL c5000)
still not dedicated
1 brick per server limited to 1TB
2TB RAID 1 on each server
Still in production
2012: a new (little) cluster
CELI 2015 57
2012: enlarge
New installation, upgrade to 3.3.x
6TB available on 12 servers (still not dedicated)
Enlarged to 9TB on 18 servers
Brick sizes bounded AND unbounded
2012: enlarge
CELI 2015 58
2013: fail
18 non-dedicated servers: too many
18 bricks of different sizes
2 major outages due to bricks running out of space
2013: fail
CELI 2014 59
2014: BIG fail
Didn’t restart after a move
but…
All data were recovered
(files are scattered on bricks, read from them!)
2014: BIG fail
CELI 2014 60
2014: consolidate
2 dedicated servers (DELL 720xd)
12 x 3TB SAS, RAID 6
4 bricks per server
28 TB available
distributed replicated
4x1Gb bonded NIC
ca 40 clients (FUSE) (other servers)
2014: consolidate
CELI 2014 61
Consolidate
Gluster Server 1
brick 4
brick 3
brick 2
brick 1
Gluster Server 2
brick 4
brick 3
brick 2
brick 1
Consolidate
CELI 2015 62
Consolidate
One year trend
CELI 2015 63
2015: scale up
Buy a new server
Put it in the rack
Reconfigure bricks
Wait until Gluster replicates data
Maybe rebalance the cluster
2015: scale up
CELI 2014 64
Scale up
Gluster Server 1
brick 31
brick 13
brick 12
brick 11
Gluster Server 2
brick 24
brick 32
brick 22
brick 21
Gluster Server 3
brick 14
brick 23
brick 32
brick 31
Scale up
CELI 2015 65
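The scale-up recipe above maps to a handful of CLI steps; a sketch, assuming a freshly racked node03 takes over a brick from node01 and self-heal then refills it:

sudo gluster peer probe node03
sudo gluster volume replace-brick testvol node01:/srv/sdb1/brick node03:/srv/sdb1/brick commit force
sudo gluster volume heal testvol full
sudo gluster volume heal testvol info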
Do
Dedicated servers (physical or virtual)
RAID 6 or RAID 10 (with small files)
Multiple bricks of same size
Plan to scale
Do
CELI 2015 66
Do not
Multi purpose server
Bricks of different size
Very small files (<10KB)
Write directly to bricks
Do not
CELI 2015 67
Some raw tests
read
Total transferred file size: 23.10G bytes
43.46M bytes/sec
write
Total transferred file size: 23.10G bytes
38.53M bytes/sec
Some raw tests
CELI 2015 68
Some raw tests
Transfer rate
CELI 2015 69
Raw tests
NOTE: ran in production under heavy load, no clean test environment
Raw tests
CELI 2015 70
Resources
http://www.gluster.org/
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/
https://github.com/gluster
http://www.redhat.com/products/storage-server/
http://joejulian.name/blog/category/glusterfs/
http://jread.us/2013/06/one-petabyte-red-hat-storage-and-glusterfs-project-
overview/
Resources
CELI 2015 71
CELI 2014 72
franchini@celi.it
www.celi.it
Torino | Milano | Trento
Thank You
@robfrankie