SPONSORED WORKSHOP by Amplidata from Structure:Data 2012:
1. Big “Unstructured” Data
A Case for Optimized Object Storage
Paul Speciale
Friday, July 27, 2012
2. Storage facts and trends
Recent studies estimate that data storage capacities will likely increase by over
30X in the coming decade to over 35 Zettabytes
[Chart: Storage Consumption vs. Time, growing 30X to 35ZB by 2020; labels: High-capacity drives, Less Staff / TB, Unstructured Data]
3. Storage facts and trends
But the number of qualified people available to manage this huge volume
of data will stay nearly flat (~1.5X)
Each administrator will be expected to manage 20X more data
Efficiency: automate & reduce overhead
[Chart: Capacity / Budget vs. Time, with curves for Storage Requirements and Storage Budget]
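The 20X figure follows from dividing the projected data growth by the projected staff growth. A minimal sketch of that arithmetic, using only the two growth factors quoted on these slides (the function name is illustrative, not from the talk):

```python
def data_per_admin_growth(data_growth: float, staff_growth: float) -> float:
    """Growth factor in the amount of data each administrator must manage."""
    return data_growth / staff_growth


# Figures from the slides: 30X more data, ~1.5X more qualified staff.
print(data_per_admin_growth(30.0, 1.5))  # 20.0
```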
4. Storage facts and trends
• Much of that growth (80%) is driven by unstructured data
• Billions of large objects and files
• Examples: Media Archives, Online Images, Large Files, Medical Images, Online Storage, Online Movies
5. Storage facts and trends:
Media & Entertainment Industry Example
M&E is driving huge capacity requirements: file sizes, file counts and storage
capacities in use are all growing, driven by HD and 3D video formats:
“Petabytes are peanuts”
3TB per hour for 4K video
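To put the 3TB-per-hour figure in perspective, the sketch below estimates how many hours of 4K video fit in one petabyte. It assumes decimal units (1 PB = 1000 TB), which the slide does not specify, and the function name is illustrative:

```python
TB_PER_PB = 1000.0  # assumption: decimal units, 1 PB = 1000 TB


def hours_per_petabyte(tb_per_hour: float) -> float:
    """Hours of video that fit in one petabyte at the given data rate."""
    return TB_PER_PB / tb_per_hour


hours = hours_per_petabyte(3.0)  # 3 TB per hour, the 4K figure from the slide
print(f"{hours:.0f} hours, about {hours / 24:.0f} days of footage per PB")
```

Under that assumption a petabyte holds roughly 333 hours of 4K material, which is why a single production can consume capacity on this scale.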
6. Big Data for Analytics vs.
Big “Unstructured” Data
7. Big Data for Analytics
• In the 1990s, we experienced an explosion of
data captured for analytics purposes:
• Academic Research
• Chemical R&D facilities
• Travel industry
• Geo-industry, oil & gas
• Financial / Trading
• Agriculture
• In the 2000s, online applications &
social media triggered a flood of trend
data
8. Big Data for Analytics
• Data is captured as many small log files
& concatenated as “Big Data”
• Relational databases were not optimal:
• Too much data, too big
• Insufficient performance for analytics
• This stimulated innovations:
• Hadoop, MapReduce, GFS
• XML databases
• => This is Big Data for Analytics
9. Big Data Evolution
• Today, the Big Data trend refers to both Big Data for
Analytics & Big Unstructured Data:
• Media
• Streaming
• Business
• Scientific
• Fundamentally different data but with lots of
similarities
• Immense capacities
• Number of transactions or objects
• Unstructured data is traditionally stored on host
file systems but:
• Host file systems impose fixed limits and do not
scale up to the size we need
• File systems do not meet performance
requirements because the host limits access
10. Big Unstructured Data
• Most unstructured data is archived, often to tape (for cost reasons),
and is then difficult to access
• Volumes are increasing exponentially
• Data archives are an organization & management burden
(Grandma’s Attic)
11. Big Unstructured Data
• Companies are starting to see the value of the
data in their archives:
• Documents of individuals can be valuable for others
• Some companies have legal reasons to keep data available
• Unexplored analytics opportunities
• This data can be mined and monetized
12. Big Unstructured Data
But how do we store all this data in a
cost-efficient way?
“Building cost-efficient Live Archives”
13. Big Unstructured Data
What are the requirements?
• Tape is a difficult option: access latency is key (online, low-latency access)
• Data has to be always available online
• Direct interface to the applications
• Petabyte scalability
• Extreme reliability, integrity
• Cost-efficient
• Security
Disk Storage
+ Open application APIs (App & Cloud-enabled)
+ Ultra-high data durability (Erasure Coding)
= Optimized Object Storage
14. Disk vs. Tape
Tape has several obvious advantages over disk
& there will always be use cases for tape
But disks enable live archives with instant data
accessibility
More arguments for disk-based archives
• Disks can be powered down
• Tape requires replication to protect against media errors
• Data integrity checking
• Massive migration projects
• …
15. Object Storage Simplifies this Problem
• File System organization of data becomes a burden
• File systems impose limitations on numbers of files & directories
• Very time-consuming to organize data
• Object Storage simplifies this problem
• Flat “Namespaces” (not file systems) - without storage limits
• Lets the applications talk directly to the Storage
• Use “Object” application APIs to let applications directly manage objects & metadata
• File Gateways can be used as a transition bridge
• Bring legacy data and apps into Object Storage
[Diagram: Applications connect to the storage through an Object API]
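The flat-namespace idea above can be sketched in a few lines. This is a hypothetical in-memory stand-in, not the interface of any real product: the class `FlatObjectStore` and its `put`, `get`, `head` and `list_keys` methods are names invented for illustration. The point is that the application addresses each object by a single key and attaches its own metadata, with no directory tree to maintain.

```python
from typing import Dict, List, Optional, Tuple


class FlatObjectStore:
    """Hypothetical in-memory object store with one flat namespace."""

    def __init__(self) -> None:
        # key -> (object bytes, metadata); there are no directories.
        self._objects: Dict[str, Tuple[bytes, Dict[str, str]]] = {}

    def put(self, key: str, data: bytes,
            metadata: Optional[Dict[str, str]] = None) -> None:
        """Store an object and its application-defined metadata."""
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key: str) -> bytes:
        """Return the object's data."""
        return self._objects[key][0]

    def head(self, key: str) -> Dict[str, str]:
        """Return only the object's metadata."""
        return dict(self._objects[key][1])

    def list_keys(self, prefix: str = "") -> List[str]:
        """List keys; a prefix filters names but implies no directories."""
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("scan-0001", b"...image bytes...", {"type": "medical-image"})
store.put("movie-0001", b"...video bytes...", {"type": "online-movie"})
print(store.head("scan-0001"))    # {'type': 'medical-image'}
print(store.list_keys("movie-"))  # ['movie-0001']
```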
16. Petabyte Scalability and Beyond
Systems should scale BIG
• Beyond petabytes of data – no built-in limits
• Beyond billions of data objects
Systems should scale uniformly
• Add resources incrementally and grow as a Single System View
• Manage from a “Single Pane of Glass”
• Scale performance and capacity separately
• Migration and seamless growth across newer generations of component
technologies (processors, disk densities)
17. Ultra-High Levels of Data Integrity
• Data needs to be archived for lifetimes
• Expect “bit perfect” integrity to store the gold copy of critical assets
• Consolidate multiple copies of data into a single highly-durable tier
• Ensuring the integrity of long-term unstructured data archives requires
new data protection algorithms, to:
• Address the increasing capacity of disk drives
• Solve issues related to long RAID rebuild windows
“Object storage systems based on erasure-coding can not only protect data from
higher numbers of drive failures, but also against the failure of entire storage
modules.”
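The principle behind erasure coding can be illustrated with the simplest possible code: split an object into k data fragments and add one XOR parity fragment, so that any single lost fragment can be rebuilt from the survivors. This is a sketch for illustration only, not the algorithm used by any particular product; production systems use stronger codes (for example Reed-Solomon) that tolerate several simultaneous failures. All function names here are invented for the example.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]


def recover(frags: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing fragment (marked None) from the others."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    repaired = list(frags)
    if missing:
        survivors = [f for f in frags if f is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def decode(frags: List[Optional[bytes]], k: int, length: int) -> bytes:
    """Reassemble the original data, repairing one lost fragment if needed."""
    return b"".join(recover(frags)[:k])[:length]


data = b"gold copy of a critical asset"
fragments = encode(data, k=4)   # 4 data fragments + 1 parity fragment
fragments[2] = None             # simulate one failed drive
assert decode(fragments, k=4, length=len(data)) == data
```

If each fragment is placed on a different drive or storage module, losing one of them does not lose the object, and repair reads only the surviving fragments of the affected objects.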
18. Big Unstructured Data
What are the requirements?
• Tape is a difficult option: access latency is key (online, low-latency access)
• Data has to be always available online
• Direct interface to the applications
• Petabyte scalability
• Extreme reliability, integrity
• Cost-efficient
• Security
Disk Storage
+ Open application APIs (App & Cloud-enabled)
+ Ultra-high data durability (Erasure Coding)
= Optimized Object Storage
19. Thank You!
Paul Speciale, VP Products, Amplidata Inc.
www.amplidata.com