Parallel Computing with HDF Server
John Readey
The HDF5 data format
• Established 20 years ago, the HDF5 file format is the most commonly used format in Earth Science
  • Note: NetCDF4 files are actually HDF5 “under the hood”
• HDF5 was designed with the (somewhat contradictory) goals of:
  • Archival format: data that can be stored for decades
  • Analysis ready: data that can be used directly for analytics (no conversion needed)
• There’s a rich set of tools and language SDKs:
  • C/C++/Fortran
  • Python
  • Java, etc.
HDF5 File Format meets the Cloud
• Storing large HDF5 collections on AWS almost always means using S3:
  • Cost effective
  • Redundant
  • Sharable
• It’s easy enough to store HDF5 files as S3 objects, but these files can’t be read using the HDF5 library (which expects a POSIX filesystem)
• Using FUSE to read from S3 with the HDF5 library has tended not to work well
• In practice, users have had to copy files to local disk first
• This has led to interest in alternative formats such as Zarr, TileDB, and our own HSDS S3 Storage Schema (more on that later)
• Our HSDS server provides a means of efficiently accessing HDF5 data on S3
HDF Server
• HSDS is an open-source, REST-based service for HDF data
• Think of it as HDF gone cloud native.
• HSDS Features:
  • Runs as a set of containers on Kubernetes, so it can scale beyond one machine
  • Requests can be parallelized across multiple containers
  • Feature-compatible with the HDF library, but an independent code base
  • Supports multiple readers/writers
  • Uses S3 as data store
• Existing HDF APIs (h5py, h5netcdf, xarray, etc.) work seamlessly with HSDS
• Available now as part of HDF Kita Lab (our hosted Jupyter environment): https://hdflab.hdfgroup.org
• Available on AWS Marketplace as “Kita Server”
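The point that requests can be parallelized across containers can be illustrated from the client side. The following is a minimal sketch under stated assumptions, not part of HSDS: the helper names `block_ranges` and `parallel_read` are hypothetical, and a plain Python list stands in for a remote dataset so the example runs without a server. With h5pyd, the `read_block` callable might be something like `lambda start, stop: dset[start:stop]`.

```python
from concurrent.futures import ThreadPoolExecutor


def block_ranges(num_rows, block_rows):
    """Split [0, num_rows) into consecutive (start, stop) ranges."""
    return [(start, min(start + block_rows, num_rows))
            for start in range(0, num_rows, block_rows)]


def parallel_read(read_block, num_rows, block_rows, max_workers=4):
    """Fetch every block concurrently and return the rows in order."""
    ranges = block_ranges(num_rows, block_rows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the submitted ranges,
        # even though the reads themselves may complete out of order.
        blocks = list(pool.map(lambda r: read_block(*r), ranges))
    return [row for block in blocks for row in block]


# Stand-in for a remote dataset (hypothetical data, no server needed).
fake_dataset = list(range(10))
rows = parallel_read(lambda start, stop: fake_dataset[start:stop],
                     num_rows=len(fake_dataset), block_rows=4)
print(rows)
```

Each block becomes an independent request, which is what would let a service that spreads work over several containers handle them at the same time.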
HDF Cloud Schema
Big Idea: Map individual HDF5 objects (datasets, groups, chunks) as Object Storage Objects
• Limit maximum storage object size
• Support parallelism for read/write
• Only data that is modified needs to be updated
• Multiple clients can be reading/updating the same “file”
Legend:
• Dataset is partitioned into chunks
• Each chunk is stored as an S3 object
• Dataset metadata (type, shape, attributes, etc.) is stored in a separate object (as JSON text)
How to store HDF5 content in S3?
Each chunk (heavy outlines) gets persisted as a separate object
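Because each chunk is a separate object, a read only needs the objects whose chunks overlap the requested region. The sketch below shows that bookkeeping under simple assumptions (regular chunking, half-open `(start, stop)` ranges); the function name `chunks_for_selection` is hypothetical and not taken from HSDS.

```python
from itertools import product


def chunks_for_selection(selection, chunk_dims):
    """Return the chunk indices touched by a rectangular selection.

    selection  -- one (start, stop) pair per dimension, stop exclusive
    chunk_dims -- chunk extent per dimension
    """
    per_dim = []
    for (start, stop), chunk in zip(selection, chunk_dims):
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // chunk        # chunk holding the first element
        last = (stop - 1) // chunk    # chunk holding the last element
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


# Rows 5..24 and columns 0..9 of a dataset stored in 10 x 10 chunks.
touched = chunks_for_selection([(5, 25), (0, 10)], chunk_dims=(10, 10))
print(touched)
```

Only the listed chunks would need to be fetched or rewritten; every other object stays untouched.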
Dataset JSON Example
• creationProperties contains HDF5 dataset creation property list settings.
• id is the object's UUID.
• layout describes the HDF5 storage layout (here, chunked).
• shape represents the HDF5 dataspace.
• root points back to the root group.
• created and lastModified are timestamps.
• type represents the HDF5 datatype.
• attributes holds the dataset's HDF5 attributes as JSON objects.
{
  "creationProperties": {},
  "id": "d-9a097486-58dd-11e8-a964-0242ac110009",
  "layout": {"dims": [10], "class": "H5D_CHUNKED"},
  "root": "g-952b0bfa-58dd-11e8-a964-0242ac110009",
  "created": 1526456944,
  "lastModified": 1526456944,
  "shape": {"dims": [10], "class": "H5S_SIMPLE"},
  "type": {"base": "H5T_STD_I32LE", "class": "H5T_INTEGER"},
  "attributes": {}
}
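As a small illustration of how a client could use this metadata object, the sketch below parses the example and derives the element count, the number of chunks, and the size of one chunk in bytes. The `ITEM_SIZE` table is an assumption made for this example only (it is not an HSDS API); it relies on `H5T_STD_I32LE` being a 32-bit integer, that is, 4 bytes per element.

```python
import json
from math import prod

dataset_json = """
{
  "creationProperties": {},
  "id": "d-9a097486-58dd-11e8-a964-0242ac110009",
  "layout": {"dims": [10], "class": "H5D_CHUNKED"},
  "root": "g-952b0bfa-58dd-11e8-a964-0242ac110009",
  "created": 1526456944,
  "lastModified": 1526456944,
  "shape": {"dims": [10], "class": "H5S_SIMPLE"},
  "type": {"base": "H5T_STD_I32LE", "class": "H5T_INTEGER"},
  "attributes": {}
}
"""

# Illustrative lookup only: bytes per element for the one type used here.
ITEM_SIZE = {"H5T_STD_I32LE": 4}

meta = json.loads(dataset_json)
shape = meta["shape"]["dims"]        # dataspace extent
chunk_dims = meta["layout"]["dims"]  # chunk extent

num_elements = prod(shape)
# Ceiling division per dimension gives the chunk grid size.
num_chunks = prod(-(-s // c) for s, c in zip(shape, chunk_dims))
chunk_bytes = prod(chunk_dims) * ITEM_SIZE[meta["type"]["base"]]

print(num_elements, num_chunks, chunk_bytes)
```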
Schema Details
• Key Organization
  • Objects are stored by root_id
  • All non-root objects are stored as sub-keys of root_id
  • “flat” organization to support non-cycle links
• Each storage node is limited to about 5,500 req/s (previously about 300 req/s)
  • Note: rate limit raised last year by Amazon
• Chunks are stored in the same folder as dataset metadata
• Chunk key is determined by the chunk's position in the dataspace
  • E.g. c-<uuid>_0_0_0 is the corner chunk of a 3-dimensional dataset
• Chunk objects get created as needed on first write
• Schema is currently used just by HDF Server, but could just as easily be used directly by clients (assuming that writes don’t conflict)
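The chunk naming rule can be sketched as follows. This is an illustration under two assumptions that the slide does not state explicitly: that the `<uuid>` in the chunk key is the UUID of the owning dataset (whose id has the form `d-<uuid>`), and that the trailing numbers are chunk indices rather than element offsets. The function name `chunk_key` is hypothetical.

```python
def chunk_key(dataset_id, coord, chunk_dims):
    """Build a key of the form c-<uuid>_i_j_k for the chunk holding coord."""
    if not dataset_id.startswith("d-"):
        raise ValueError("expected a dataset id of the form d-<uuid>")
    uuid = dataset_id[2:]
    # Integer division maps an element coordinate to its chunk index.
    indices = [c // n for c, n in zip(coord, chunk_dims)]
    return "c-" + uuid + "_" + "_".join(str(i) for i in indices)


dset_id = "d-9a097486-58dd-11e8-a964-0242ac110009"
corner = chunk_key(dset_id, coord=(0, 0, 0), chunk_dims=(10, 10, 10))
other = chunk_key(dset_id, coord=(25, 3, 10), chunk_dims=(10, 10, 10))
print(corner)
print(other)
```

Since the key is computed from the position alone, a client can locate a chunk object without first listing the bucket.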
New Features
• Several improvements have been made over the last year
  • Read access for traditional HDF5 files stored in S3
    • More on this in the next slide
  • Shuffle filter support
    • Along with deflate
  • Fast metadata loading
    • Optionally load all metadata in one request
  • Support for multiple buckets
    • HSDS can access data stored in different buckets
  • h5netcdf and xarray support
    • Support for the REST API is built into these packages
  • Additional CLI tools (hsmv, hscp, hsdiff)
  • Variable-length data support with compression
  • Schema V2
Supporting traditional HDF5 files
• The downside of the HDF S3 Schema is that data needs to be transmogrified (converted to the schema)
• Since the bulk of the data is usually the chunk data, it makes sense to leave the data in place and save pointers to the original file(s):
  • Convert just the metadata of the source HDF5 file to the S3 Schema
  • Store the source file as an S3 object
  • For data reads, the metadata provides the offset and length into the HDF5 file
  • An S3 Range GET returns the needed data
• This approach can be used either directly or with HDF Server
• Compared with accessing S3 directly, you reduce the number of S3 requests needed
• Performance is comparable to the sharded data model
• Only read access is supported
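The read path described above (offset and length from the metadata, then one S3 Range GET) could look roughly like this. It is a sketch, not the HSDS implementation: `read_chunk` and `range_header` are hypothetical helpers, the bucket and key names are made up, and `FakeS3Client` is an in-memory stand-in so the example runs without AWS. The `get_object(Bucket=..., Key=..., Range=...)` call mirrors the boto3 S3 client, so a real client could be passed in instead.

```python
import io


def range_header(offset, num_bytes):
    """HTTP Range value for num_bytes starting at offset.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return f"bytes={offset}-{offset + num_bytes - 1}"


def read_chunk(s3_client, bucket, key, offset, num_bytes):
    """Fetch one chunk's bytes with a single S3 Range GET."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=range_header(offset, num_bytes))
    return response["Body"].read()


class FakeS3Client:
    """In-memory stand-in for an S3 client (demonstration only)."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, Bucket, Key, Range):
        first, last = (int(x) for x in Range[len("bytes="):].split("-"))
        data = self._objects[(Bucket, Key)][first:last + 1]
        return {"Body": io.BytesIO(data)}


# Hypothetical bucket, key, and file contents.
client = FakeS3Client({("my-bucket", "an_hdf5_file.h5"): bytes(range(100))})
chunk = read_chunk(client, "my-bucket", "an_hdf5_file.h5",
                   offset=40, num_bytes=8)
print(list(chunk))
```

The whole file is never downloaded; each chunk read costs one request for exactly the bytes recorded in the metadata.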
Hybrid Approach: Metadata + HDF5 Files
• HDF5 file stored as an S3 object: S3://BIG_REPO/…/AN_HDF5_FILE.h5
• Imported metadata (JSON) points to each chunk in the file: Dset /dset1: chunk 0, chunk 1, … chunk n
• S3 Range GET(S3 key, offset, number of bytes)
Kita Lab – playground for HDF Server
• HDF Kita Lab runs on AWS in a Kubernetes cluster
• Cluster can scale to handle different numbers of users
• Each user gets:
  • 1 CPU core (2.5 GHz Xeon)
  • 8 GB RAM
  • 10 GB disk
  • 100 GB S3 storage
  • Access to HDF Kita Server
  • Ability to read/write HDF data stored on S3
• User environment configured with commonly used Python packages for HDF users:
  • h5py(d), pandas, h5netcdf, xarray, bokeh, dask
• HDF Kita command line tools:
  • hsinfo, hsls, hsget, hsload, etc.
Kubernetes Platform
• JupyterLab and Kita Server both run as sets of Docker containers
• Kubernetes transparently manages running these containers across multiple machines
[Diagram: on AWS, Kubernetes runs the JupyterHub and HDF Kita Server (HSDS) containers]
Architecture
[Diagram labels: User; Kita Server (HSDS) with four SN and four DN containers; AWS S3; user containers and EBS volumes (spawned)]
References
• HSDS: https://github.com/HDFGroup/hsds
• h5pyd: https://github.com/HDFGroup/h5pyd
• Kita Lab: https://www.hdfgroup.org/hdfkitalab/
• SciPy2017 talk: https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pdf
• AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/
• Spark and HDF Blog article: https://www.hdfgroup.org/2015/04/putting-some-spark-into-hdf-eos
• Notebook from this talk: https://gist.github.com/jreadey/d1c67aee07451985397f48a50be2cdaa

Editor's notes

  1. Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage are fundamentally different from those of traditional data storage systems because 1) the data are accessed over HTTP and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal is to protect data providers and users from disruption as their data and applications are migrated to the cloud.
  2. This idea has been kicking around for a while, but storing potentially millions of files on a Linux filesystem would be problematic. Using S3 as the storage vehicle is a natural fit since there’s no limit to the number of objects in a bucket. With NREL, we’ve validated this approach on 50 TB of data spread over 27 million objects (see the AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/).