2. Self introduction
• Ryo Matsumiya
• Twitter: @mattn_
• https://sites.google.com/site/ryomatsumiya0101/
• Ph.D. student (D2)
• Oyama lab. (UEC, B4-M2)
• Endo lab. (Titech, D1-)
• Major topic: Distributed and parallel processing and its software
architecture considering memory (storage) hierarchy
• Memory Hierarchy, Memory-centric Computing,
Data-intensive Computing, Big Data, Task Parallelism,
Programming System, System Software, GPGPU, Storage System
3. About SC (1/3)
• ACM/IEEE International Conference for High Performance
Computing, Networking, Storage and Analysis
• DO NOT confuse it with similar conferences!
• International Conference on Supercomputing (ICS)
• International Supercomputing Conference (ISC)
• Top conf. in the field of HPC
• About 13,000 attendees
• Including 3,500 international (non-US) attendees in SC ’17
4. About SC (2/3)
• Technical session
• Doctoral forum
• Poster session
• Tutorial session
• Panel session
• Invited talks + Keynote talks
• Workshops
• 38 official workshops
• BoF session
• TOP500 is announced
• Exhibition
• 250+ organizations
5. About SC (3/3)
• SC ’17 was held at the Colorado Convention Center, Denver
• SC ’15: Austin, SC ’16: Salt Lake City
• SC ’18: Dallas, SC ’19: Denver
• Acceptance Rate: 61/327 = 19 %
• Best paper: Extreme Scale Multi-Physics Simulations of the
Tsunamigenic 2004 Sumatra Megathrust Earthquake
• Technical University of Munich + Ludwig-Maximilians-Universität München
• Best poster: AI with Super-Computed Data for Monte Carlo Earthquake
Hazard Classification
• RIKEN + UT
• Gordon Bell Prize: 18.9-Pflops Nonlinear Earthquake Simulation on
Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter
Scenarios
6. PapyrusKV: A High-Performance Parallel Key-
Value Store for Distributed NVM Architectures
• Distributed KVS developed by ORNL
• No system-level daemons or servers
• C++ library using Papyrus
• Design and Implementation of Papyrus: Parallel Aggregate Persistent
Storage (IPDPS ’17)
• Open source
• https://code.ornl.gov/eck/papyrus
• Considering memory hierarchy
• Private SSDs + Private DRAMs
• TSUBAME (Titech), Stampede (TACC)
• Shared SSDs (burst buffers) + Private DRAMs
• Oakforest-PACS (JCAHPC), Cori (LBNL)
9. Structure overview
• Each process has four memtables and an SSTable
• Memtable
• Used as caches
• Local memtable, Remote memtable, Local immutable memtable,
Remote immutable memtable
• Stored in DRAM
• SSTable
• Sorted String Table
• Stored in NVRAM
10. Data placement
• DBs are divided into files
• Each process has its own file
• In local SSD architectures, the file is stored in the SSD of its process
• In shared SSD architectures, all files are stored in the Burst
Buffer(s)
• Each KV-pair is assigned to a process
• The owner process is determined by hash(key) % (# of processes)
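The placement rule above can be sketched in a few lines. This is an illustrative model, not PapyrusKV's actual code: the slide only specifies `hash(key) % nprocs`, so `crc32` is an assumed stand-in for whatever hash function the library really uses, and the key string is made up.

```python
from zlib import crc32

def owner_rank(key: bytes, nprocs: int) -> int:
    """Hash-partition a key across processes, per the slide's rule:
    owner = hash(key) % (# of processes).
    crc32 is an illustrative hash; PapyrusKV's actual hash may differ."""
    return crc32(key) % nprocs

# Every process computes the same owner for the same key,
# so a put/get can be routed without any central directory.
rank = owner_rank(b"chr14:read:42", 8)
```

Because the mapping is deterministic and shared, no metadata server is needed to locate a pair, which matches the "no system-level daemons or servers" design point.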
11. Local cache policy
• LRU + FIFO
• A new cache entry is first pushed to the LRU queue
• Mutable memtable(s)
• When an entry is evicted from the LRU queue, it is pushed to the FIFO queue
• Immutable memtable(s)
• Entries evicted from the FIFO queue are written back to the SSDs
(Diagram: KV-pairs flow from the mutable memtable (LRU queue, DRAM) through the immutable memtable (FIFO queue, DRAM) to the SSTable (SSD))
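The eviction chain above can be modeled compactly. This is an assumption-laden sketch of the policy as described on the slide, not the real implementation; capacities, the `ssd` dict, and the class name are all invented for illustration.

```python
from collections import OrderedDict

class LruFifoCache:
    """Toy model of the local cache policy:
    - new entries enter an LRU queue (mutable memtable),
    - LRU evictions move to a FIFO queue (immutable memtable),
    - FIFO evictions are written back to the SSD (here: a plain dict)."""
    def __init__(self, lru_cap, fifo_cap):
        self.lru = OrderedDict()   # key -> value, most recently used last
        self.fifo = OrderedDict()  # insertion order only, never reordered
        self.lru_cap, self.fifo_cap = lru_cap, fifo_cap
        self.ssd = {}              # stand-in for the SSTable on SSD

    def put(self, key, value):
        self.lru[key] = value
        self.lru.move_to_end(key)                    # freshest entry
        if len(self.lru) > self.lru_cap:
            k, v = self.lru.popitem(last=False)      # evict least recent
            self.fifo[k] = v
            if len(self.fifo) > self.fifo_cap:
                k2, v2 = self.fifo.popitem(last=False)  # oldest first
                self.ssd[k2] = v2                    # write-back to SSD

    def get(self, key):
        if key in self.lru:
            self.lru.move_to_end(key)                # refresh recency
            return self.lru[key]
        if key in self.fifo:
            return self.fifo[key]                    # FIFO: no reordering
        return self.ssd.get(key)                     # fall through to SSD
```

Note the asymmetry: only the mutable (LRU) side tracks recency; once an entry becomes immutable it simply ages out in insertion order.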
12. Data structure of tables
• LSM-Tree
• Used by HBase, LevelDB, etc.
• In PapyrusKV, the memtables are red-black trees
• The trees in the SSDs are binary trees
O'Neil et al., The log-structured merge-tree (LSM-tree), Acta Informatica, Vol. 33, pp. 351–385
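The core LSM-tree write path on this slide can be sketched as follows. This is a toy model: the flush threshold and class name are invented, and a sorted list stands in for the red-black tree the slide says PapyrusKV uses for its memtables.

```python
import bisect

class MiniLSM:
    """Toy LSM-tree sketch: writes land in a sorted in-memory
    memtable; when it fills, it is frozen and flushed as one
    sorted run (an SSTable). Reads search newest-first."""
    def __init__(self, memtable_limit=3):
        self.memtable = []   # sorted list of (key, value) pairs
        self.sstables = []   # flushed sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        i = bisect.bisect_left(self.memtable, (key,))
        if i < len(self.memtable) and self.memtable[i][0] == key:
            self.memtable[i] = (key, value)          # in-place update
        else:
            self.memtable.insert(i, (key, value))    # keep run sorted
        if len(self.memtable) >= self.limit:
            self.sstables.insert(0, self.memtable)   # flush sorted run
            self.memtable = []

    def get(self, key):
        # Newest data shadows older data, so scan memtable first,
        # then SSTables from newest to oldest.
        for run in [self.memtable] + self.sstables:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

The key LSM property visible here: writes never touch old runs, so flushes are sequential, which is exactly what makes the structure a good fit for SSDs.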
13. Remote cache policy
• Can be changed with papyruskv_consistency()
• Two consistency modes
• Sequential consistency
• Relaxed consistency
• papyruskv_protect() under relaxed consistency can make remote caches available
• With PAPYRUSKV_RDONLY, remote read caches are enabled
• With PAPYRUSKV_WRONLY, asynchronous write-back is enabled
• Consistency can be guaranteed by calling papyruskv_barrier()
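The staleness/barrier behavior described above can be shown with a toy model. This is not the PapyrusKV API: the class, its methods, and the single-cache simplification are all illustrative assumptions; the point is only that relaxed-consistency read caches may serve stale values until a barrier invalidates them.

```python
class RemoteCacheModel:
    """Toy model of relaxed consistency with remote read caches:
    reads of remote KV-pairs may be cached locally (as under
    PAPYRUSKV_RDONLY), and a barrier() drops those caches so
    subsequent reads see the owner's latest value again."""
    def __init__(self):
        self.owner_store = {}   # authoritative copy on the owner rank
        self.remote_cache = {}  # reader-side cache of remote pairs

    def put(self, key, value):
        self.owner_store[key] = value        # only the owner is updated

    def get(self, key):
        if key in self.remote_cache:
            return self.remote_cache[key]    # may be stale!
        value = self.owner_store.get(key)
        self.remote_cache[key] = value       # fill the read cache
        return value

    def barrier(self):
        self.remote_cache.clear()  # consistency point: drop cached copies

db = RemoteCacheModel()
db.put("k", 1)
first = db.get("k")    # 1, and now cached on the reader side
db.put("k", 2)
stale = db.get("k")    # still 1: relaxed consistency allows staleness
db.barrier()
fresh = db.get("k")    # 2: the barrier restored consistency
```

This is why the slide pairs the RDONLY/WRONLY hints with papyruskv_barrier(): the hints buy performance by tolerating staleness, and the barrier is where consistency is re-established.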
14. Storage group (1/2)
• An extra memory copy occurs when a process gets a KV-pair owned by another process on the same node
(Diagram: a KV-pair of Process A is copied via both processes' DRAMs to Process B)
15. Storage group (2/2)
• Solution: copy directly when running under relaxed consistency
(Diagram: Process B copies the KV-pair of Process A directly from Process A's DRAM)
25. Real HPC application:
De-novo genome assembly
Evangelos Georganas, Scalable Parallel Algorithms for Genome Analysis, Ph.D. Thesis, UC Berkeley
26. Application benchmarking
• Compared with a Unified Parallel C (UPC) implementation
• SSDs are not used
• Dataset: human chr14
• Executed on Cori
27. Summary
• PapyrusKV is a KVS for HPC Clusters
• Provided as a C++ library
• PapyrusKV supports both private and shared SSD
architectures
• SSDs are used as persistent memory
• DRAMs are used as caches
• LSM-Tree based cache mechanism
• Users can specify consistency policies
28. Other notable papers in SC ’17
• Why Is MPI So Slow? Analyzing the Fundamental Limits in
Implementing MPI-3.1
• 28 authors! (including three Japanese)
• Analyzes overheads imposed by the MPI standard itself
• Gravel: Fine-Grain GPU-Initiated Network Messages
• UW-Madison + AMD Research
• Network interface for GPU kernel
• Related: GPUnet [OSDI ’14], GPUrdma [ROSS ’16]
• Reducing GPU overheads
• Topology-Aware GPU Scheduling for Learning Workloads in Cloud
Environments
• Barcelona Supercomputing Center + IBM
29. Call for Jobs
• Hire me!
• Interested in large-scale parallel and/or distributed software
• System software as well as applications
• Not only research; development and business roles are also welcome
• I have the best record of (LOC × # of parallel nodes ÷ # of developers) among active Japanese system-software students... maybe :-D
Speaker notes
Each rank performs 10K (Summitdev and Cori) or 1K (Stampede) put operations with 16B keys and 128KB values.
The first application performs 10K (Summitdev and Cori) or 1K (Stampede) put operations with 16B keys and 128KB values, and then it calls a checkpoint operation that generates a snapshot of the database in Lustre. The second application reverts the database from Lustre using a restart operation. The last application reverts the database from Lustre through the restart with redistribution technique. All three applications run with the same number of ranks. Even though the last application does not need a redistribution, we forced it for the evaluation.