Ceph is unstable and vSAN delivers extremely poor performance, yet data centers need a genuinely high-end distributed storage system to replace traditional disk arrays behind mission-critical applications. PhegData X rises up to answer...
2. • The most successful database machine vendor in China
• Market share ~20%, lower than Oracle Exadata, higher than Huawei FusionCube
• Focused on performance optimization for real applications
• PhDX (PhegData X) inherits the core of the database machine
• Resource pooling, strong consistency, low-latency I/O, etc.
• Highly efficient cache engine for mixed-media environments
• Adding more for virtualization and container systems
• RESTful API, supports OpenStack Cinder
• VMware VAAI/vVol, Docker graph driver ready
History & Background
3. • Replacing high-end disk arrays to support mission-critical applications
• Scale-out architecture
• Proven data center level reliability, serviceability and performance
• Traditional as well as new applications
• Oracle RAC, DB2 PureScale/DPF, Sybase, MySQL, PostgreSQL …
• Hadoop, Spark, Storm, Kafka, Druid …
• VMware, KVM, XEN, Docker, rkt ...
Targeting …
4. • PhDX = Generic x86 hardware + S2EBS (SmartScaleEBS) software
• Hardware: nothing special, just commodity metal boxes
• CPU: Intel Xeon E5/E7 series, v2/v3/v4
• Flash: SATA/NVMe/PCIe SSD, NVDIMM releasing soon
• Network: GbE/10GbE/InfiniBand, Intel Omni-Path ready
• S2EBS (SmartScaleEBS) software
• DHT-based distributed system, no central metadata node
• Block-level interfaces: iSCSI, SRP, iSER and the S2EBS native protocol
• RESTful API to support object interfaces (Cinder, S3 compatible, etc.)
Inside PhDX (PhegData X)
6. BAC maintains logical volumes
[Diagram: a Disk Pool of OSDs, each holding chunks; a Logical Volume is mapped onto chunks spread across the OSDs]
A logical volume is a set of chunks. The mappings are maintained by the BAC module.
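The mapping above can be pictured as a table from chunk index to chunk placement. Below is a minimal Python sketch of that idea; the chunk size and the shape of the placement record are assumptions for illustration, not S2EBS internals.

# Minimal sketch of a BAC-style volume-to-chunk mapping.
# CHUNK_SIZE is an assumption; the real S2EBS chunk size is not stated in this deck.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, hypothetical

class LogicalVolume:
    def __init__(self, name, size_bytes):
        self.name = name
        # One entry per chunk: which OSD holds it and its chunk id there,
        # e.g. ("osd-3", 17), filled in when the chunk is allocated.
        num_chunks = (size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.chunk_map = [None] * num_chunks

    def locate(self, logical_offset):
        """Translate a logical byte offset into (chunk placement, offset within chunk)."""
        index = logical_offset // CHUNK_SIZE
        return self.chunk_map[index], logical_offset % CHUNK_SIZE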
8. • Metadata Area
• Super Block: 64 KB
• Space Bitmap: 2 MB
• Key Space (Mapping B+ Tree): 512 MB
• Data Area
OSD maintains physical disks
[Diagram: on-disk layout of an OSD showing the Super Block, Space Bitmap, Key Space and Data Area]
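For reference, the sizes above can be turned into byte offsets. The sketch assumes the areas are laid out in the order the diagram shows (Super Block, then Space Bitmap, then Key Space, then the Data Area); only the sizes come from the slide.

# On-disk layout of an OSD expressed as byte offsets (ordering assumed).
KIB, MIB = 1024, 1024 * 1024

SUPER_BLOCK_OFF,  SUPER_BLOCK_SIZE  = 0, 64 * KIB
SPACE_BITMAP_OFF, SPACE_BITMAP_SIZE = SUPER_BLOCK_OFF + SUPER_BLOCK_SIZE, 2 * MIB
KEY_SPACE_OFF,    KEY_SPACE_SIZE    = SPACE_BITMAP_OFF + SPACE_BITMAP_SIZE, 512 * MIB
DATA_AREA_OFF = KEY_SPACE_OFF + KEY_SPACE_SIZE  # everything after the metadata is data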
9. Keep different disks at the same usage ratio
Each physical disk is cut into vOSDs (4 GB by default); the vOSD is the actual unit of the DHT ring, so all vOSDs in a pool are used equally.
[Diagram: 3 TB, 6 TB and 8 TB OSDs each carved into 4 GB vOSDs that join one pool]
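A quick sketch of the carving step, assuming only the 4 GB default named on the slide: each disk contributes a number of vOSDs proportional to its capacity, so spreading data uniformly over vOSDs keeps the usage ratio equal across 3 TB, 6 TB and 8 TB disks.

# Carve physical disks into 4 GiB vOSDs (the slide's default unit of the DHT ring).
VOSD_SIZE = 4 * 1024**3

def carve(disks_gb):
    """disks_gb: {osd_name: capacity in GiB} -> list of vOSD identifiers."""
    vosds = []
    for name, capacity_gb in disks_gb.items():
        count = (capacity_gb * 1024**3) // VOSD_SIZE
        vosds += [f"{name}/vosd-{i}" for i in range(count)]
    return vosds

# 3 TB, 6 TB and 8 TB disks contribute roughly 768, 1536 and 2048 vOSDs.
print(len(carve({"osd-3tb": 3072, "osd-6tb": 6144, "osd-8tb": 8192})))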
10. Router in the middle of I/O process
[Diagram: the application sees /dev/sd* through the BAC Driver; I/O travels over 10GbE/IB with the S2EBS Native Protocol to Router instances, which use the DHT to reach the vOSDs and the chunks behind them]
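The routing step can be illustrated with a generic consistent-hashing ring. The actual S2EBS DHT is not described in this deck, so the hash function and ring layout below are assumptions, shown only to make the point that placement needs no central metadata node.

# Generic consistent-hashing sketch of the router step: hash a chunk key onto a
# ring of vOSDs and pick the successor. Purely illustrative, not the S2EBS DHT.
import bisect, hashlib

def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, vosds):
        self._points = sorted((_h(v), v) for v in vosds)
        self._keys = [p for p, _ in self._points]

    def route(self, volume, chunk_index):
        """Map (volume, chunk index) to the vOSD that should hold that chunk."""
        i = bisect.bisect(self._keys, _h(f"{volume}:{chunk_index}")) % len(self._points)
        return self._points[i][1]

ring = Ring([f"vosd-{n}" for n in range(8)])
print(ring.route("vol-a", 42))  # any router computes the same placement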
13. Agile redundancy control
[Diagram: a common ServerSAN fixes one level per pool (Pool-a: 2-rep, Pool-b: 3-rep); a single S2EBS pool holds 2-rep, 3-rep, 4-rep and 5-rep volumes side by side]
A common ServerSAN controls redundancy per pool; S2EBS controls redundancy per volume.
14. Benefit of volume redundancy control
With per-pool control, capacity has to be preserved for each protection level. With per-volume control, an application can ask for just 500 GB with 3-rep protection while the rest stays 2-rep, or later raise all of its 2-rep protected data to 3-rep.
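A toy calculation of the capacity point, with hypothetical numbers (10 TB raw, a 50/50 pool split): per-pool control strands raw space in whichever pool demand does not match, while per-volume control leaves the whole pool usable at any protection level.

RAW_GB = 10_000

# Per-pool control: raw space is split up front (assume 50/50 here).
raw_2rep_pool, raw_3rep_pool = RAW_GB / 2, RAW_GB / 2
# If demand turns out to be all 3-rep, only the 3-rep pool can serve it.
usable_per_pool = raw_3rep_pool / 3        # ~1667 GB

# Per-volume control: the whole pool backs 3-rep volumes when asked to.
usable_per_volume = RAW_GB / 3             # ~3333 GB

print(f"per-pool:   {usable_per_pool:.0f} GB usable, {raw_2rep_pool:.0f} GB raw stranded")
print(f"per-volume: {usable_per_volume:.0f} GB usable, nothing stranded")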
15. Concept of safe boundary
Fact A: multiple concurrent disk failures can cause data loss.
Fact B: the more disks there are, the more often multiple concurrent failures happen.
So replicas spread over too many disks hurt reliability, while data centers require 99.999% availability.
By calculation: 2-rep protection should spread replicas across fewer than 100 disks; 3-rep protection across fewer than 500 disks.
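The 100-disk and 500-disk figures come from the deck; the sketch below only shows the shape of such a calculation. The annual failure rate, the rebuild window and the simple binomial model are assumptions, not the vendor's actual math.

# Back-of-the-envelope safe-boundary model: if a volume's replicas are spread
# over n disks, any `replicas` concurrent failures inside one rebuild window
# can lose data. AFR and rebuild window below are assumed values.
from math import comb

AFR = 0.02                                # assumed 2% annual failure rate per disk
REBUILD_HOURS = 4                         # assumed rebuild window
p = AFR * REBUILD_HOURS / (365 * 24)      # chance a disk is down within one window

def p_loss(n_disks, replicas):
    """P(at least `replicas` of n_disks are failed within the same window)."""
    return sum(comb(n_disks, k) * p**k * (1 - p)**(n_disks - k)
               for k in range(replicas, n_disks + 1))

for n in (100, 500):
    print(n, f"2-rep loss/window ~ {p_loss(n, 2):.1e}",
             f"3-rep loss/window ~ {p_loss(n, 3):.1e}")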
16. Safe boundary related to performance
With per-pool redundancy control, the pool's safe boundary limits how many disks can process I/O simultaneously. With per-volume redundancy control, each volume has its own safe boundary, so the simultaneous processing range of all volumes together is bigger than any single safe boundary.
[Diagram: Vol A, Vol B and Vol C constrained by one pool safe boundary versus each volume having its own safe boundary]
17. • EMC ScaleIO
• Still needs a central metadata server; scalability is questionable.
• Ceph
• Poor performance and poor stability.
• VMware vSAN
• Extremely poor performance
• Only works with VMware vSphere
• Nutanix NDFS
• Poor performance, especially high latency
• Not block-level storage
Comparison with Equivalents
18. • Performance! Performance! Performance!
• Low latency - 2ms via 10GbE or 0.2ms via InfiniBand
• Parallel processing - Up to 128 nodes serving one volume; IOps & MBps easily hit the physical limits on the host side
• Tiny overhead - 24 bits per I/O, leaving over 99.4% of physical bandwidth for real data (see the arithmetic sketch after this slide)
• Small footprint on host side - 8MB would be enough in most cases
• Little CPU consumption – one core can stably provide 4k~5k IOps
• Agile redundancy control per volume
• Volumes requesting different redundancy levels can be created from the same pool
• No data migration or downtime when changing the protection level
• Erasure coding will be supported in the same way in the next release
PhDX key features and differences
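Rough arithmetic behind the "tiny overhead" bullet, assuming 512-byte and 4 KiB request sizes (the deck does not state the reference I/O size): 24 bits is 3 bytes, and 512 / (512 + 3) is about 99.4%.

# Per-I/O overhead of 24 bits (3 bytes) against assumed request sizes.
OVERHEAD_BYTES = 24 // 8
for io_size in (512, 4096):
    useful = io_size / (io_size + OVERHEAD_BYTES)
    print(f"{io_size}-byte I/O: {useful:.1%} of the wire carries real data")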