4. What is OrangeFS?
• OrangeFS is a next-generation parallel file system
• Based on PVFS
• Distributes file data across multiple file servers, leveraging any block-level file system
• Distributes metadata across 1 to all storage servers
• Supports simultaneous access by multiple clients, including Windows, using the PVFS protocol directly
• Works with standard kernel releases and does not require custom kernel patches
• Easy to install and maintain
5. Why a Parallel File System?
HPC – Data-Intensive Parallel (PVFS) Protocol
• Large datasets
• Checkpointing
• Visualization
• Video
• Big Data – unstructured data silos
Interfaces to Match Problems
• Unify dispersed file systems
• Simplify storage leveling
§ Multidimensional arrays
§ Typed data
§ Portable formats
6. Original PVFS Design Goals
§ Scalable
§ Configurable file striping
§ Non-contiguous I/O patterns
§ Eliminates bottlenecks in the I/O path
§ Does not need locks for metadata ops
§ Does not need locks for non-conflicting applications
§ Usability
§ Very easy to install, small VFS kernel driver
§ Modular design for disk, network, etc.
§ Easy to extend -> hundreds of research projects have used it, including dissertations, theses, etc.
7. OrangeFS Philosophy
• Focus on a Broader Set of Applications
• Customer & Community Focused (>300-Member-Strong Community & Growing)
• Open Source
• Commercially Viable
• Enable Research
9. System Architecture
• OrangeFS servers manage objects
• Objects map to a specific server
• Objects store data or metadata
• Request protocol specifies operations on one or more objects
• OrangeFS object implementation:
• DB for indexing key/value data
• Local block file system for the data stream of bytes
11. Timeline
• 1994–2004: PVFS – design and development at CU (Dr. Ligon) + ANL (CU graduates)
• 2004–2010: PVFS2 – primary maintenance & development by ANL (CU graduates) + community
• 2007–2010: new PVFS branch – improved MD, stability, server-side operations, newer kernels, testing
• SC10 (fall 2010): announced with the community; now the mainline of future development as of 2.8.4
• SC11 (fall 2011): 2.8.5 – Windows client, stability, replicate on immutable; Windows support and targeted development services initially offered by Omnibond
• Spring 2012: 2.8.6 + Webpack – new development focused on a broader set of problems; performance improvements, Direct Lib + cache stability, WebDAV, S3
• Winter 2013: 2.8.7 + Webpack – performance improvements, stability
• Spring 2014: 2.8.8 + Webpack – performance improvements, stability, shared mmap, multi TCP/IP server homing, Hadoop MapReduce, user lib fixes, new spec file for RPMs + DKMS; available in the AWS Marketplace
• Summer 2014: 2.9.0 – distributed directory MD, capability-based security
• 2015: OrangeFS 3.0 – replicated MD and file data, 128-bit UUIDs for file handles, parallel background processes, web-based management UI, self-healing processes, data balancing
13. Server-to-Server Communications (2.8.5)
• Traditional metadata operation: a create request causes the client to communicate with all servers – O(p)
• Scalable metadata operation: a create request goes to a single server, which in turn communicates with the other servers using a tree-based protocol – O(log p)
(Diagram: application, client middleware, and servers exchanging messages over the network in both cases)
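The complexity difference can be illustrated with a small counting sketch in Python. This only shows why a binomial-tree broadcast finishes in O(log p) rounds; it is not the actual PVFS request protocol, and both functions are invented for the example.

```python
def direct_rounds(p):
    """Client contacts all p servers itself: p messages from one sender."""
    return p

def tree_rounds(p):
    """Client contacts one server; in each round, every informed server
    forwards to one more uninformed server (binomial-tree broadcast)."""
    informed, rounds = 1, 0
    while informed < p:
        informed *= 2   # the set of informed servers doubles each round
        rounds += 1
    return rounds

for p in (8, 64, 1024):
    print(p, direct_rounds(p), tree_rounds(p))
# e.g. with 1024 servers: 1024 direct messages vs. 10 tree rounds
```

Because the informed set doubles each round, p servers are reached in ceil(log2 p) rounds, which is the O(log p) the slide claims.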
14. Recent Additions (2.8.5)
• SSD metadata storage
• Replicate on immutable (file based)
• Windows client: supports 32/64-bit Windows – Server 2008, R2, Vista, 7
15. Direct Access Interface (2.8.6)
• Implements:
• POSIX system calls
• Stdio library calls
• Parallel extensions
• Noncontiguous I/O
• Non-blocking I/O
• MPI-IO library
• Found more boundary conditions, fixed in the upcoming 2.8.7
(Diagram: kernel path – app, kernel, PVFS lib, client core; direct path – app, direct lib, PVFS lib; both over IB or TCP)
16. Direct Interface Client Caching (2.8.6)
• The Direct Interface enables multi-process coherent client caching for a single client
(Diagram: client application, direct interface, and client cache in front of the file systems)
17. WebDAV (2.8.6 webpack)
(Apache speaks WebDAV to clients and the PVFS protocol to OrangeFS)
• Supports the DAV protocol; tested with the Litmus DAV test suite
• Supports DAV cooperative locking in metadata
18. S3 (2.8.6 webpack)
(Apache speaks S3 to clients and the PVFS protocol to OrangeFS)
• Tested using the s3cmd client
• Files accessible via other access methods
• Containers are directories
• Accounting pieces not implemented
19. Summary – Recently Added to OrangeFS
• In 2.8.3:
• Server-to-server communication
• SSD metadata storage
• Replicate on immutable
• 2.8.4, 2.8.5 (fixes, support for newer kernels):
• Windows client
• 2.8.6 – performance, fixes, IB updates:
• Direct access libraries (initial release)
• Preload library for applications, including optional client cache
• Webpack:
• WebDAV (with file locking), S3
20. OrangeFS on AWS Marketplace
Available on the Amazon AWS Marketplace, brought to you by Omnibond.
(Diagram: OrangeFS instance – "Unified High Performance File System" – backed by DynamoDB and EBS volumes)
22. Hadoop JNI Interface (2.8.8)
• OrangeFS Java Native Interface
• Extension of the Hadoop FileSystem class -> JNI
• Buffering
• Distribution
• Fast PVFS protocol for remote configuration
23. Additional Items (2.8.8)
• Updated user lib
• Shared mmap support in the kernel module
• Support for kernels up to 3.11
• Multi-homing servers over IP
• Clients can access a server over multiple interfaces (say, clients on IPoIB + clients on IPoEthernet + clients on IPoMX)
• Enterprise installers (coming shortly)
• Client (with DKMS for the kernel module)
• Server
• Devel
25. Scaling Tests
16 storage servers, each with 2 LVM'd 5+1 RAID sets, were tested with up to 32 clients; read performance reached nearly 12 GB/s and write performance nearly 8 GB/s.
26. MapReduce over OrangeFS
• 8 Dell R720 servers connected with 10 Gb/s Ethernet
• The remote case adds 8 additional identical servers and does all OrangeFS work remotely; only local work is done on the compute nodes (the traditional HPC model)
• *25% improvement with OrangeFS running remotely
27. MapReduce over OrangeFS
• 8 Dell R720 servers connected with 10 Gb/s Ethernet
• Remote clients are R720s with single SAS disks for local data (vs. 12-disk arrays in the previous test)
31. Distributed Directory Metadata (2.9.0)
(Diagram: directory entries DirEnt1–DirEnt6 spread by extensible hashing across Server0–Server3)
• State management based on GIGA+ (Garth Gibson, CMU)
• Improves access times for directories with a very large number of entries
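A minimal sketch of the extensible-hashing idea behind this placement, in Python. The hash function, the fixed depth, and the modulo placement are all example choices; this is not the GIGA+ or OrangeFS algorithm.

```python
import hashlib

NUM_SERVERS = 4

def bucket(name, depth):
    """Extensible hashing: use the low `depth` bits of the name's hash.
    Increasing `depth` splits buckets without rehashing every entry."""
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h & ((1 << depth) - 1)

def server_for(name, depth=2):
    # With depth=2 there are 4 buckets, one per server (illustration only)
    return f"Server{bucket(name, depth) % NUM_SERVERS}"

entries = [f"DirEnt{i}" for i in range(1, 7)]
placement = {e: server_for(e) for e in entries}
print(placement)
```

Because each entry hashes independently, lookups and inserts in a huge directory spread across all servers instead of serializing on one, which is the access-time win the slide describes.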
32. Capability-Based Security (2.9.0)
(Diagram: a client presents a cert or credential, OpenSSL PKI issues a signed capability, and the signed capability accompanies each I/O to the servers)
• 3 security modes:
• Basic – OrangeFS/PVFS classic mode
• Key-based – keys are used to authorize clients for use with the FS
• User-certificate-based with LDAP – user certs are used for access to the file system and are generated based on LDAP uid/gid info
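The signed-capability flow can be sketched as follows. For brevity this toy uses an HMAC with a shared secret where OrangeFS uses OpenSSL public-key signatures, and the field names and helper functions are invented for the example.

```python
import hashlib
import hmac
import json

SERVER_KEY = b"shared-secret"  # stand-in: the real system signs with OpenSSL PKI

def issue_capability(handle, ops, key=SERVER_KEY):
    """Metadata server grants a capability: which object, which operations."""
    cap = {"handle": handle, "ops": sorted(ops)}
    blob = json.dumps(cap, sort_keys=True).encode()
    return cap, hmac.new(key, blob, hashlib.sha256).hexdigest()

def verify(cap, sig, op, key=SERVER_KEY):
    """I/O server checks the signature, then the requested operation."""
    blob = json.dumps(cap, sort_keys=True).encode()
    ok = hmac.compare_digest(sig, hmac.new(key, blob, hashlib.sha256).hexdigest())
    return ok and op in cap["ops"]

cap, sig = issue_capability("oid-1234", {"read", "write"})
print(verify(cap, sig, "read"))    # True: signed and granted
print(verify(cap, sig, "remove"))  # False: op was never granted
```

The design point mirrors the diagram: the I/O server never consults the metadata server per request; the signature carried with each I/O is the proof of authorization.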
34. Replication / Redundancy (OrangeFS 3.0)
• Redundant metadata
• Seamless recovery after a failure
• Redundant objects from the root directory down
• Configurable
• Redundant data
• Update mode (real time, on close, on immutable, none)
• Configurable number of replicas
• Real-time "forked flow" work shows little overhead
• Replicate on close
• Replicate to external (like LTFS)
• Looking at supporting an HSM option to external (no local replica)
• Emphasis on continuous operation
35. Handles -> UUIDs (OrangeFS 3.0)
• An OID (object identifier) is a 128-bit UUID that is unique to the data-space
• An SID (server identifier) is a 128-bit UUID that is unique to each server
• No more than one copy of a given data-space can exist on any server
• The (OID, SID) tuple is unique within the file system
• (OID, SID1), (OID, SID2), (OID, SID3) are copies of the object on different servers
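Python's `uuid` module generates exactly this kind of 128-bit identifier, so the naming rules above can be sketched directly (an illustration, not OrangeFS code):

```python
import uuid

# OIDs and SIDs are 128-bit UUIDs; a replica is addressed by the (OID, SID) pair.
sids = [uuid.uuid4() for _ in range(3)]   # three servers
oid = uuid.uuid4()                        # one data-space

replicas = {(oid, sid) for sid in sids}   # copies of the object on different servers

assert len(replicas) == 3                 # each (OID, SID) tuple is unique
assert all(r[0] == oid for r in replicas) # same object identifier everywhere
```

Because the OID is shared across replicas while the SID differs, "at most one copy per server" falls out of the tuple's uniqueness for free.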
36. Server Location / SID Mgt (OrangeFS 3.0)
• In an Exascale environment with the potential for thousands of I/O servers, it will no longer be feasible for each server to know about all other servers.
• Server discovery:
• Servers will know a subset of their neighbors at startup (or these may be cached from previous startups) – similar to DNS domains.
• Servers will learn about unknown servers on an as-needed basis and cache them – similar to DNS query mechanisms (root servers, authoritative domain servers).
• SID cache: an in-memory DB to store server attributes
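The cache-on-miss discovery pattern can be sketched like this. Everything here (the class, the `directory_lookup` callback, the attribute fields) is invented for illustration; it only mirrors the DNS-like behavior described above.

```python
import uuid

class SIDCache:
    """Toy in-memory SID cache: resolve unknown servers on demand
    and remember them. `directory_lookup` stands in for asking a
    known neighbor, as a DNS resolver asks upstream servers."""
    def __init__(self, directory_lookup):
        self._known = {}
        self._lookup = directory_lookup

    def resolve(self, sid):
        if sid not in self._known:                 # cache miss: query the network
            self._known[sid] = self._lookup(sid)   # then cache for next time
        return self._known[sid]

# A stand-in "network" of server attributes keyed by SID
network = {uuid.uuid4(): {"addr": f"10.0.0.{i}", "rack": i % 2} for i in range(4)}
cache = SIDCache(network.__getitem__)
some_sid = next(iter(network))
print(cache.resolve(some_sid))   # first call queries; later calls hit the cache
```

The point is the scaling behavior: a server's working knowledge grows only with the peers it actually talks to, not with the total server count.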
37. Policy-Based Location (OrangeFS 3.0)
• User-defined attributes for servers and clients
• Stored in the SID cache
• Policy is used for data location, replication location, and multi-tenant support
• Completely flexible:
• Rack
• Row
• App
• Region
38. Background Parallel Processing Infrastructure (3.0)
• Modular infrastructure to easily build background parallel processes for the file system
Used for:
• Gathering stats for monitoring
• Usage calculation (can be leveraged for directory space restrictions, chargebacks)
• Background safe FSCK processing (can mark bad items in MD)
• Background checksum comparisons
• Etc.
40. Data Migration / Mgt (OrangeFS 3.x)
• Built on redundancy & DBG processes
• Migrate objects between servers:
• De-populate a server going out of service
• Populate a newly activated server (HW lifecycle)
• Moving computation to data
• Hierarchical storage
• Use existing metadata services
• Possible – directory hierarchy cloning:
• Copy on write (Dev, QA, Prod environments with high % data overlap)
42. Attribute-Based Metadata Search (OrangeFS 3.x)
• Client tags files with keys/values
• Keys/values are indexed on metadata servers
• Clients query for files based on keys/values
• Returns file handles, with options for filename and path
(Diagram: key/value parallel query against the metadata servers, followed by file access to the data)
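The tag-then-query flow can be sketched with an in-memory index. This toy collapses the parallel per-server indexes into one dict, and the `tag`/`query` helpers are invented for the example; it only illustrates the flow described above.

```python
from collections import defaultdict
import uuid

index = defaultdict(set)   # (key, value) -> set of file handles
names = {}                 # handle -> filename, for the optional name lookup

def tag(handle, name, **kv):
    """Client tags a file with key/value pairs; the pairs get indexed."""
    names[handle] = name
    for k, v in kv.items():
        index[(k, v)].add(handle)

def query(**kv):
    """Return the handles matching ALL given key/value pairs."""
    sets = [index[(k, v)] for k, v in kv.items()]
    return set.intersection(*sets) if sets else set()

h1, h2 = uuid.uuid4(), uuid.uuid4()
tag(h1, "/sim/run1.dat", project="climate", year="2014")
tag(h2, "/sim/run2.dat", project="climate", year="2015")
hits = query(project="climate", year="2014")
print([names[h] for h in hits])   # ['/sim/run1.dat']
```

As on the slide, the query returns handles first; resolving them to filenames and paths is the optional second step.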
44. Extend Capability-Based Security
• Enable certificate-level access (in process)
• Federated access capable
• Can be integrated with rules-based access control:
• Department x in company y can share with department q in company z
• Rules and roles establish the relationship
• Each company manages its own control of who is in the company and in the department
45. SDN – OpenFlow
• Working with the OpenFlow (OF) research team at CU
• OF separates the control plane from delivery, giving the ability to control the network with software
• Looking at bandwidth optimization leveraging OF and OrangeFS
46. ParalleX
ParalleX is a new parallel execution model.
• Key components are:
• Asynchronous Global Address Space (AGAS)
• Threads
• Parcels (message-driven instead of message passing)
• Locality
• Percolation
• Synchronization primitives
• High Performance ParalleX (HPX): library implementation written in C++
47. PXFS
• Parallel I/O for ParalleX, based on PVFS
• Common themes with OrangeFS Next
• Primary objective: unification of the ParalleX and storage name spaces
• Integration of the AGAS and storage metadata subsystems
• Persistent object model
• Extends ParalleX with a number of I/O concepts:
• Replication
• Metadata
• Extending I/O with ParalleX concepts:
• Moving work to data
• Local synchronization
• Effort with LSU, Clemson, and Indiana U.
• Walt Ligon, Thomas Sterling
49. Johns Hopkins OrangeFS Selection
• JHU HLTCOE selected OrangeFS
• After evaluating: Ceph, GlusterFS, Lustre, and OrangeFS
"Leveraging OrangeFS for the parallel filesystem, the system as a whole is capable of delivering 30GB/s write, 46GB/s read, and between 37,260-237,180 IOPS of performance. The variation in IOPS performance is dependent on the file size and number of bytes written per commit as documented in the Test Results section."*
"The final system design represents a 2,775% increase in read performance and a 1,763-11,759% increase in IOPS"*
* http://hltcoe.jhu.edu/uploads/publications/papers/14662_slides.pdf
50. Learning More
• www.orangefs.org web site:
• Releases
• Documentation
• Wiki
• pvfs2-users@beowulf-underground.org – support for users
• pvfs2-developers@beowulf-underground.org – support for developers
51. Support & Development Services
• www.orangefs.com & www.omnibond.com
• Professional support & development team
• Buy into the project
52. Omnibond Info
Solution areas: Computer Vision, Enterprise, Personal
• Intelligent Transportation Solutions
• Identity Manager Drivers & Sentinel Connectors
• Parallel Scale-Out Storage Software
• Social Media Interaction System