Gluster Tutorial
Jeff Darcy, Red Hat
LISA 2016 (Boston)

Agenda
▸ Alternating info-dump and hands-on
▹ This is part of the info-dump ;)
▸ Gluster basics
▸ Initial setup
▸ Extra features
▸ Maintenance and troubleshooting

Who Am I?
▸ One of three project-wide architects
▸ First Red Hat employee to be seriously involved with Gluster (before acquisition)
▸ Previously worked on NFS (v2..v4), Lustre, PVFS2, others
▸ General distributed-storage blatherer
▹ http://pl.atyp.us / @Obdurodon

TEMPLATE CREDITS
Special thanks to all the people who made and released these awesome resources for free:
▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)

Some Terminology
▸ A brick is simply a directory on a server
▸ We use translators to combine bricks into more complex subvolumes
▹ For scale, replication, sharding, ...
▸ This forms a translator graph, contained in a volfile (sketch below)
▸ Internal daemons (e.g. self heal) use the same bricks arranged into slightly different volfiles
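For a feel of what a translator graph looks like on disk, here is a hand-trimmed volfile sketch using the usual client-volfile conventions; the graph GlusterD actually generates for a real volume is longer and includes performance translators:

volume fubar-client-0
    type protocol/client
    option remote-host serverA
    option remote-subvolume /brick1
end-volume

volume fubar-client-1
    type protocol/client
    option remote-host serverB
    option remote-subvolume /brick2
end-volume

volume fubar-replicate-0
    type cluster/replicate
    subvolumes fubar-client-0 fubar-client-1
end-volume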
Hands On: Getting Started
1. Use the RHGS test drive
▹ http://bit.ly/glustertestdrive
2. Start a Fedora/CentOS VM
▹ Use yum/dnf to install gluster (sketch below)
▹ base, libs, server, fuse, client-xlators, cli
3. Docker Docker Docker
▹ https://github.com/gluster/gluster-containers
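If you take the VM route, the package list above maps to roughly the following; exact package names can vary by distro and release, so treat this as a sketch:

testvm# dnf install glusterfs glusterfs-libs glusterfs-server glusterfs-fuse glusterfs-client-xlators glusterfs-cli
testvm# systemctl start glusterd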
Brick / Translator Example
(diagram: four bricks, Server A /brick1, Server B /brick2, Server C /brick3, Server D /brick4)

Brick / Translator Example
(diagram: Server A /brick1 and Server B /brick2 form Replica Set 1, Server C /brick3 and Server D /brick4 form Replica Set 2; each replica set is a subvolume)

Brick / Translator Example
(diagram: the two replica sets are combined into volume "fubar")

Translator Patterns
(diagram: fan-out or "cluster" translators, e.g. AFR, EC, DHT, sit over several subvolumes, such as Replica Set 1 over Server A /brick1 and Server B /brick2; pass-through translators, e.g. performance translators like md-cache, sit over a single subvolume such as AFR)

Access Methods
(diagram: access methods FUSE, Samba, Ganesha, TCMU, and GFAPI alongside internal daemons: self heal, rebalance, quota, snapshot, bitrot)

GlusterD
▸ Management daemon
▸ Maintains membership, detects server failures
▸ Stages configuration changes
▸ Starts and monitors other daemons
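Since everything in the following steps goes through GlusterD, a quick sanity check is worth doing before debugging anything else. A minimal sketch, assuming the glusterd systemd unit name used by the Fedora/CentOS packages:

testvm# systemctl status glusterd
testvm# gluster peer status
testvm# gluster pool list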
Simple Configuration Example
serverA# gluster peer probe serverB
serverA# gluster volume create fubar replica 2 serverA:/brick1 serverB:/brick2
serverA# gluster volume start fubar
clientX# mount -t glusterfs serverA:fubar /mnt/gluster_fubar

Hands On: Connect Servers
[root@vagrant-testVM glusterfs]# gluster peer probe 192.168.121.66
peer probe: success.
[root@vagrant-testVM glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.121.66
Uuid: 95aee0b5-c816-445b-8dbc-f88da7e95660
State: Accepted peer request (Connected)

Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume create fubar replica 2 testvm:/d/backends/fubar{0,1} force
volume create: fubar: success: please start the volume to access data
[root@vagrant-testVM glusterfs]# gluster volume info fubar
... (see for yourself)
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Volume fubar is not started

Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume start fubar
volume start: fubar: success
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Status of volume: fubar
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick testvm:/d/backends/fubar0             49152     0          Y       13104
Brick testvm:/d/backends/fubar1             49153     0          Y       13133
Self-heal Daemon on localhost               N/A       N/A        Y       13163
Task Status of Volume fubar
------------------------------------------------------------------------------
There are no active volume tasks

Hands On: Client Volume Setup
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:fubar /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# df /mnt/glusterfs/0
Filesystem     1K-blocks   Used  Available  Use%  Mounted on
testvm:fubar     5232640  33280    5199360    1%  /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# ls -a /mnt/glusterfs/0
.  ..
[root@vagrant-testVM glusterfs]# ls -a /d/backends/fubar0
.  ..  .glusterfs
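To make the client mount persistent across reboots, the usual approach is an fstab entry; a hedged sketch reusing the mount point above (_netdev just delays the mount until the network is up):

testvm:fubar  /mnt/glusterfs/0  glusterfs  defaults,_netdev  0 0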
Hands On: It’s a Filesystem!
▸ Create some files
▸ Create directories, symlinks, ...
▸ Rename, delete, ...
▸ Test performance
▹ OK, not yet
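If you want something concrete to type, a minimal sketch along the lines of the bullets above; the file and directory names are arbitrary and just reuse the mount from the previous step:

[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file-a
[root@vagrant-testVM glusterfs]# mkdir /mnt/glusterfs/0/dir1
[root@vagrant-testVM glusterfs]# ln -s ../file-a /mnt/glusterfs/0/dir1/link-a
[root@vagrant-testVM glusterfs]# mv /mnt/glusterfs/0/dir1/link-a /mnt/glusterfs/0/dir1/link-b
[root@vagrant-testVM glusterfs]# rm -r /mnt/glusterfs/0/dir1
# The same names show up inside both replica bricks:
[root@vagrant-testVM glusterfs]# ls /d/backends/fubar0 /d/backends/fubar1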
Distribution and Rebalancing
(diagram: Server X’s range covers 0 to 0x7fffffff, Server Y’s range covers 0x7fffffff to 0xffffffff, with files shown as dots along the hash line)
● Each brick “claims” a range of hash values
○ Collection of claims is called a layout
● Files (dots) are hashed, placed on brick claiming that range
● When bricks are added, claims are adjusted to minimize data motion

Distribution and Rebalancing
(diagram: before, Server X’s range is 0 to 0x80000000 and Server Y’s is 0x80000000 to 0xffffffff; after adding Server Z the hash line is re-split at 0x55555555 and 0xaaaaaaaa, with some files moving X->Z and some moving Y->Z)

Sharding
▸ Divides files into chunks (enabled per volume; sketch below)
▸ Each chunk is placed separately according to hash
▸ High probability (not certainty) of chunks being on different subvolumes
▸ Spreads capacity and I/O across subvolumes
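Sharding is off by default and is switched on per volume; a hedged sketch, with option names as in recent 3.x releases and a block size that is just an example value:

[root@vagrant-testVM glusterfs]# gluster volume set xyzzy features.shard on
[root@vagrant-testVM glusterfs]# gluster volume set xyzzy features.shard-block-size 64MB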
Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# gluster volume create xyzzy testvm:/d/backends/xyzzy{0,1}
[root@vagrant-testVM glusterfs]# getfattr -d -e hex -m trusted.glusterfs.dht /d/backends/xyzzy{0,1}
# file: d/backends/xyzzy0
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
# file: d/backends/xyzzy1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# gluster volume add-brick xyzzy testvm:/d/backends/xyzzy2
volume add-brick: success
[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy fix-layout start
volume rebalance: xyzzy: success: Rebalance on xyzzy has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 88782248-7c12-4ba8-97f6-f5ce6815963

Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# getfattr -d -e hex -m trusted.glusterfs.dht /d/backends/xyzzy{0,1,2}
# file: d/backends/xyzzy0
trusted.glusterfs.dht=0x00000001000000000000000055555554
# file: d/backends/xyzzy1
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
# file: d/backends/xyzzy2
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
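Note that fix-layout only rewrites the hash ranges; files that now hash to the new brick stay where they are until a data rebalance runs. A hedged follow-up to the exercise above (the status output columns vary slightly between releases):

[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy start
[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy status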
Split Brain (problem definition)
▸ “Split brain” is when we don’t have enough information to determine correct recovery action
▸ Can be caused by node failure or network partition
▸ Every distributed data store has to prevent and/or deal with it

How Replication Works
▸ Client sends operation (e.g. write) to all replicas directly
▸ Coordination: pre-op, post-op, locking
▹ enables recovery in case of failure
▸ Self-heal (repair) usually done by internal daemon
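The pre-op/post-op bookkeeping lives in trusted.afr.* extended attributes on the bricks, and the self-heal daemon reports what it still owes. A hedged way to peek at both (some-file is just a stand-in for whichever file you want to inspect):

[root@vagrant-testVM glusterfs]# gluster volume heal fubar info
[root@vagrant-testVM glusterfs]# getfattr -d -e hex -m trusted.afr /d/backends/fubar0/some-file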
Split Brain (how it happens)
(diagram: a network partition separates Client X and Server A from Client Y and Server B, so the two copies can diverge)

Split Brain (what it looks like)
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
ls: cannot access /mnt/glusterfs/0/best-sf: Input/output error
best-sf
[root@vagrant-testVM glusterfs]# cat /mnt/glusterfs/0/best-sf
cat: /mnt/glusterfs/0/best-sf: Input/output error
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar0/best-sf
star trek
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar1/best-sf
star wars
What the...?

Split Brain (dealing with it)
▸ Primary mechanism: quorum
▹ server side, client side, or both
▹ arbiters
▸ Secondary: rule-based resolution (CLI sketch below)
▹ e.g. largest, latest timestamp
▹ Thanks, Facebook!
▸ Last choice: manual repair
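For files that do end up split-brained, there is CLI support for the rule-based and manual options; a hedged sketch using the file from the earlier example, with subcommand names as in recent 3.x releases:

[root@vagrant-testVM glusterfs]# gluster volume heal fubar info split-brain
[root@vagrant-testVM glusterfs]# gluster volume heal fubar split-brain latest-mtime /best-sf
[root@vagrant-testVM glusterfs]# gluster volume heal fubar split-brain source-brick testvm:/d/backends/fubar1 /best-sf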
Server Side Quorum
(diagram: with Bricks A, B, C, the side of the partition holding a majority of servers keeps its bricks up and Client X’s writes succeed; the minority brick is forced down, leaving Client Y with no servers)

Client Side Quorum
(diagram: all three bricks stay up; Client X can reach a majority so its writes succeed, while Client Y can only reach a minority and its writes are rejected locally with EROFS)
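Both flavors of quorum are driven by volume options; a hedged sketch of the usual knobs, with option names as in the 3.x CLI (server-quorum-ratio is a cluster-wide setting, hence "all"):

[root@vagrant-testVM glusterfs]# gluster volume set fubar cluster.server-quorum-type server
[root@vagrant-testVM glusterfs]# gluster volume set all cluster.server-quorum-ratio 51%
[root@vagrant-testVM glusterfs]# gluster volume set fubar cluster.quorum-type auto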
Erasure Coding
▸ Encode N input blocks into N+K output blocks, so that original can be recovered from any N.
▸ RAID is erasure coding with K=1 (RAID 5) or K=2 (RAID 6)
▸ Our implementation mostly has the same flow as replication

Erasure Coding
(two diagram-only slides)
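For reference, an erasure-coded ("dispersed") volume is created with the disperse and redundancy keywords; a hedged sketch with N=4, K=2, where the volume name and brick paths are made up for illustration and force is only needed because everything lands on one test host:

[root@vagrant-testVM glusterfs]# gluster volume create ecvol disperse 6 redundancy 2 testvm:/d/backends/ec{0,1,2,3,4,5} force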
BREAK

Quota
▸ Gluster supports directory-level quota
▸ For nested directories, lowest applicable limit applies
▸ Soft and hard limits
▹ Exceeding soft limit gets logged
▹ Exceeding hard limit gets EDQUOT

Quota
▸ Problem: global vs. local limits
▹ quota is global (per volume)
▹ files are pseudo-randomly distributed across bricks
▸ How do we enforce this?
▸ Quota daemon exists to handle this coordination

Hands On: Quota
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy enable
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy soft-timeout 0
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy hard-timeout 0
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy limit-usage /john 100MB
volume quota : success

Hands On: Quota
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list
                  Path                   Hard-limit  Soft-limit
-----------------------------------------------------------------
/john                                       100.0MB  80%(80.0MB)
      Used  Available  Soft-limit exceeded?  Hard-limit exceeded?
--------------------------------------------------------------
    0Bytes    100.0MB                    No                    No

Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero of=/mnt/glusterfs/0/john/bigfile bs=1048576 count=85 conv=sync
85+0 records in
85+0 records out
89128960 bytes (89 MB) copied, 1.83037 s, 48.7 MB/s
[root@vagrant-testVM glusterfs]# grep -i john /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/d-backends-xyzzy0.log:[2016-11-29 14:31:44.581934] A [MSGID: 120004] [quota.c:4973:quota_log_usage] 0-xyzzy-quota: Usage crossed soft limit: 80.0MB used by /john

Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero of=/mnt/glusterfs/0/john/bigfile2 bs=1048576 count=85 conv=sync
dd: error writing '/mnt/glusterfs/0/john/bigfile2': Disk quota exceeded
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list | cut -c 66-
      Used  Available  Soft-limit exceeded?  Hard-limit exceeded?
--------------------------------------------------------------
   101.9MB     0Bytes                   Yes                   Yes
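To unwind the exercise, limits can be raised, removed, or quota turned off entirely; a hedged sketch:

[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy limit-usage /john 1GB
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy remove /john
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy disable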
Snapshots
▸ Gluster supports read-only snapshots and writable clones of snapshots
▸ Also, snapshot restores
▸ Support is based on / tied to LVM thin provisioning
▹ originally supposed to be more platform-agnostic
▹ maybe some day it really will be

Hands On: Snapshots
[root@vagrant-testVM glusterfs]# fallocate -l $((100*1024*1024)) /tmp/snap-brick0
[root@vagrant-testVM glusterfs]# losetup --show -f /tmp/snap-brick0
/dev/loop3
[root@vagrant-testVM glusterfs]# vgcreate snap-vg0 /dev/loop3
Volume group "snap-vg0" successfully created

Hands On: Snapshots
[root@vagrant-testVM glusterfs]# lvcreate -L 50MB -T /dev/snap-vg0/thinpool
Rounding up size to full physical extent 52.00 MiB
Logical volume "thinpool" created.
[root@vagrant-testVM glusterfs]# lvcreate -V 200MB -T /dev/snap-vg0/thinpool -n snap-lv0
Logical volume "snap-lv0" created.
[root@vagrant-testVM glusterfs]# mkfs.xfs /dev/snap-vg0/snap-lv0
...
[root@vagrant-testVM glusterfs]# mount /dev/snap-vg0/snap-lv0 /d/backends/xyzzy0
...

Hands On: Snapshots
[root@vagrant-testVM glusterfs]# gluster volume create xyzzy testvm:/d/backends/xyzzy{0,1} force
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file1
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file2
[root@vagrant-testVM glusterfs]# gluster snapshot create snap1 xyzzy
snapshot create: success: Snap snap1_GMT-2016.11.29-14.57.11 created successfully
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file3

Hands On: Snapshots
[root@vagrant-testVM glusterfs]# gluster snapshot activate snap1_GMT-2016.11.29-14.57.11
Snapshot activate: snap1_GMT-2016.11.29-14.57.11: Snap activated successfully
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:/snaps/snap1_GMT-2016.11.29-14.57.11/xyzzy /mnt/glusterfs/1
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/1
file1  file2
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/1/file3
-bash: /mnt/glusterfs/1/file3: Read-only file system

Hands On: Snapshots
[root@vagrant-testVM glusterfs]# gluster snapshot clone clone1 snap1_GMT-2016.11.29-14.57.11
snapshot clone: success: Clone clone1 created successfully
[root@vagrant-testVM glusterfs]# gluster volume start clone1
volume start: clone1: success
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:/clone1 /mnt/glusterfs/2
[root@vagrant-testVM glusterfs]# echo goodbye > /mnt/glusterfs/2/file3

Hands On: Snapshots
# Unmount and stop clone.
# Stop original volume - but leave snapshot activated!
[root@vagrant-testVM glusterfs]# gluster snapshot restore snap1_GMT-2016.11.29-14.57.11
Restore operation will replace the original volume with the snapshotted volume. Do you still want to continue? (y/n) y
Snapshot restore: snap1_GMT-2016.11.29-14.57.11: Snap restored successfully
[root@vagrant-testVM glusterfs]# gluster volume start xyzzy
volume start: xyzzy: success
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
file1  file2
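Once the restore is done, a little housekeeping helps; a hedged cleanup sketch (snapshot names are generated with a timestamp, so list first and substitute whatever is actually there for <snapname>):

[root@vagrant-testVM glusterfs]# gluster snapshot list
[root@vagrant-testVM glusterfs]# gluster snapshot info
[root@vagrant-testVM glusterfs]# gluster snapshot delete <snapname>
[root@vagrant-testVM glusterfs]# gluster volume stop clone1
[root@vagrant-testVM glusterfs]# gluster volume delete clone1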
BREAK

Other Features
▸ Geo-replication
▸ Bitrot detection
▸ Transport security
▸ Encryption, compression/dedup etc. can be done locally on bricks
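Each of these is driven from the same CLI; a hedged taste of what turning them on looks like (the geo-replication slave host and volume are placeholders and need their own setup, and TLS additionally requires certificates to be in place on every node):

[root@vagrant-testVM glusterfs]# gluster volume bitrot xyzzy enable
[root@vagrant-testVM glusterfs]# gluster volume set xyzzy client.ssl on
[root@vagrant-testVM glusterfs]# gluster volume set xyzzy server.ssl on
[root@vagrant-testVM glusterfs]# gluster volume geo-replication xyzzy slavehost::slavevol create push-pem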
Gluster 4.x
▸ GlusterD 2
▹ higher scale + interfaces + smarts
▸ Server-side replication
▸ DHT improvements for scale
▸ More multitenancy
▹ subvolume mounts, throttling/QoS
Thank You!
http://gluster.org
jdarcy@redhat.com