6. Supercomputing Today
- Clusters of commodity (Intel/AMD x86) hardware & IBM Blue Gene
- High-speed, low-latency interconnects (InfiniBand)
- Coprocessors (GPUs, Xeon Phi)
- Storage area networks (SANs) & local disks for temporary files
Hardware ecosystem
7. High-end Computing & Industry
HPC
ENIAC (1946); first business computer (1951)
Communication system at CERN (1989)
FTP, Gopher, NNTP (1969~)
NCSA Mosaic (1993)
World Wide Web
dot-com bubble (1997~)
8. High-end Computing & Industry
HPC
Active Data Repository (1998)
BigData/Hadoop (2005)
Grid Computing (1998)
Amazon EC2 (2006)
14. Hadoop’s Uncomfortable Fit in HPC
Why?
1. Hadoop is an invader!
2. Running Java applications (Hadoop) on a supercomputer looks funny
3. Hadoop reinvents HPC technologies poorly
4. HDFS is very slow and very obtuse
Source: G. K. Lockwood, HPCwire. Image source: solar-citrus.tumblr.com
15. HPC Ecosystem
Application level: Applications and Community Codes; FORTRAN, C, C++, and IDEs
Middleware: Domain-specific libraries; MPI/OpenMP/OpenCL/CUDA programming models; Lustre (parallel file system); batch scheduler (SLURM, SGE)
System software: Linux OS
Cluster hardware: x86 servers + coprocessors; InfiniBand/Ethernet; SAN/local storage
16. BigData Ecosystem
Application level: Mahout, R, etc.; Hive, Pig, etc.
Middleware: MapReduce programming model; Spark; HBase (Big Table); Hadoop File System (HDFS)
System software: Linux OS; ZooKeeper; virtual machines and cloud
Cluster hardware: x86 servers; Ethernet; local storage
17. ADR vs MapReduce
- Data-intensive computing framework
- Processing of scientific datasets
1998: Active Data Repository (ADR)
Processing Remotely Sensed Data
NOAA TIROS-N w/ AVHRR sensor
AVHRR Level 1 Data
• As the TIROS-N satellite orbits, the Advanced Very High Resolution Radiometer (AVHRR) sensor scans perpendicular to the satellite's track.
• At regular intervals along a scan line, measurements are gathered to form an instantaneous field of view (IFOV).
• Scan lines are aggregated into Level 1 data sets.
A single file of Global Area Coverage (GAC) data represents:
• ~one full earth orbit
• ~110 minutes
• ~40 megabytes
• ~15,000 scan lines
One scan line is 409 IFOVs (a quick arithmetic check follows below).
Source: Chaos project: U.Maryland, College Park
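As a sanity check on those numbers: a minimal sketch, assuming the standard AVHRR layout of 5 spectral channels sampled at 10-bit precision (an assumption, not stated on the slide), recovers roughly the quoted file size from the scan-line and IFOV counts.

#include <cstdio>

int main() {
    const long scan_lines = 15000; // ~one full earth orbit of GAC data
    const long ifovs      = 409;   // IFOVs per scan line
    const long channels   = 5;     // AVHRR spectral channels (assumed)
    const double bits     = 10.0;  // bits per sample (assumed)

    double bytes = scan_lines * ifovs * channels * bits / 8.0;
    printf("~%.1f MB per orbit\n", bytes / 1e6); // ~38.3 MB, close to the ~40 MB quoted
    return 0;
}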
18. DataCutter vs Dryad, Spark
2000: DataCutter
- Component framework with workflow support
- Pipelined components (filters)
- Stream-based communication
class MyFilter : public Filter_Base {
public:
    // Called once before any data flows, with the filter's arguments.
    int init(int argc, char *argv[]) { ... }
    // Called repeatedly to consume and produce buffers on the streams.
    int process(stream_t st[]) { ... }
    // Called once after all input streams are closed.
    int finalize(void) { ... }
};
Source: Chaos project: U.Maryland, College Park
19. HPC & BigData
Challenges common to both HPC and BigData
1. High-performance interconnect technologies
2. Energy efficient circuit, power, and cooling technologies
3. Power and Failure-aware resilient scalable system software
4. Advanced memory technologies
5. Data management: volume, velocity, and diversity of data
6. Programming models
7. Scale-up? Scale-out?
20. HPC for BigData
InfiniBand/RDMA for Hadoop
- Wide adoption of RDMA in HPC:
  - Message Passing Interface (MPI)
  - Parallel file systems: Lustre, GPFS
- Delivering excellent performance (latency, bandwidth, and CPU utilization)
- Delivering excellent scalability
21. HPC for BigData
InfiniBand/RDMA for Hadoop
- Hadoop uses socket-based Java RPC
- HPC uses high-speed, lossless RDMA communication (see the sketch below)
[Figure: sender and receiver network stacks (Application → Socket → Transport → NIC Driver → NIC); on the socket path, data is copied through multiple intermediate buffers at each layer]
Source: D.K.Panda, ISCA15 tutorial
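To make the contrast concrete, here is a minimal libibverbs sketch of the step that enables the RDMA path: registering a buffer with the NIC. Once registered, the NIC can DMA directly into or out of that memory with no intermediate kernel-buffer copies, unlike the socket path above. Queue-pair creation and connection setup are elided; the 1 MB buffer size is an arbitrary example.

#include <infiniband/verbs.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    // Pin and register the buffer once; afterwards the NIC can read/write
    // it directly, so transfers bypass the kernel's socket buffers.
    const size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
    if (!mr) { fprintf(stderr, "registration failed\n"); return 1; }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}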
22. HPC for BigData
Lustre File System for Hadoop
- Intel HPC distribution for Hadoop
- Stores intermediate map-task output on the Lustre file system (see the sketch below)
Source: Intel
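A minimal sketch of the idea (not Intel's actual adapter code): when intermediate output lives on a shared parallel file system, a reducer can open a map task's partition file directly instead of pulling it over HTTP from the mapper's local disk. The mount point and file-naming scheme below are hypothetical.

#include <fstream>
#include <iterator>
#include <string>

const std::string kShuffleDir = "/mnt/lustre/job42/shuffle"; // hypothetical Lustre mount

// Map side: write partition r of map task m to the shared file system.
void write_partition(int m, int r, const std::string &data) {
    std::ofstream out(kShuffleDir + "/map" + std::to_string(m) +
                      ".part" + std::to_string(r));
    out << data;
}

// Reduce side: read partition r of map task m directly; no shuffle server needed.
std::string read_partition(int m, int r) {
    std::ifstream in(kShuffleDir + "/map" + std::to_string(m) +
                     ".part" + std::to_string(r));
    return std::string(std::istreambuf_iterator<char>(in), {});
}

int main() {
    write_partition(1, 0, "apple\t1\nbanana\t2\n");
    return read_partition(1, 0).empty(); // non-zero exit if the shared read failed
}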
23. HPC for BigData
RDMA-based In-Memory Merge
Source: D.K. Panda, ISCA 15 Tutorial
[Figure: sorted outputs of Map 1 and Map 2 being shuffled to a reducer, in whole-output form vs. small sorted pieces]
1. Sorted map output is divided into small pieces based on the shuffle packet size and the size of the key-value pairs.
2. As each small portion of data is shuffled, the merge can take place in memory.
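A minimal sketch of such an incremental merge (illustrative, not the actual RDMA-Hadoop code): each sorted piece gets a cursor, and a min-heap repeatedly yields the globally smallest key, so the merge runs entirely in memory as pieces arrive. Integer keys and the piece contents are made-up examples.

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Cursor {
    const std::vector<int> *piece; // one small sorted piece of a map output
    size_t pos;                    // next unconsumed element in that piece
    bool operator>(const Cursor &o) const {
        return (*piece)[pos] > (*o.piece)[o.pos];
    }
};

int main() {
    // Small sorted pieces, as they would arrive from the shuffle.
    std::vector<std::vector<int>> pieces = {{1, 4, 9}, {2, 3, 8}, {5, 6, 7}};

    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> heap;
    for (auto &p : pieces) heap.push({&p, 0});

    // Pop the global minimum, emit it, and advance that piece's cursor.
    while (!heap.empty()) {
        Cursor c = heap.top(); heap.pop();
        printf("%d ", (*c.piece)[c.pos]);
        if (++c.pos < c.piece->size()) heap.push(c);
    }
    printf("\n"); // prints: 1 2 3 4 5 6 7 8 9
    return 0;
}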
24. HPC for BigData
Scale-out vs. scale-up
1. Scale-out
2. Scale-up
[Figure: performance curves comparing scale-out and scale-up]
Scale-up is more cost-effective
(Microsoft – SOCC 2013)
48. Locality-aware Fair Scheduler
In-Memory Hash Key Boundary Management
- Splits hash-key boundaries using the recent data access distribution (sketched below)
[Figure: hash keys 0–15 arranged on a ring]
49. Locality-aware Fair Scheduler
In-Memory Hash Key Boundary Management
[Figure: CDF of recent accesses over the hash-key space [0,140), with server boundaries placed where the CDF crosses 0.2, 0.4, 0.6, and 0.8]
server 1: [0,35)
server 2: [35,47)
server 3: [47,91)
server 4: [91,102)
server 5: [102,140)
new task T1 (HK=43) → server 2
new task T2 (HK=69) → server 3
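A minimal sketch of equal-frequency boundary splitting (the scheduler's actual policy may differ): walk the access histogram's CDF and cut a new boundary each time another 1/N of the total load has accumulated, so hot key ranges get narrower partitions. The key space [0,140) and five servers follow the figure; the access counts are made-up.

#include <cstdio>
#include <vector>

int main() {
    const int keyspace = 140, servers = 5;
    std::vector<double> accesses(keyspace, 1.0); // recent per-key access counts
    for (int k = 40; k < 90; ++k) accesses[k] = 5.0; // pretend keys 40-89 are hot

    double total = 0;
    for (double a : accesses) total += a;

    // Cut a boundary whenever the running CDF passes another 1/servers of load.
    std::vector<int> boundary{0};
    double cum = 0, step = total / servers, next = step;
    for (int k = 0; k < keyspace && (int)boundary.size() < servers; ++k) {
        cum += accesses[k];
        if (cum >= next) { boundary.push_back(k + 1); next += step; }
    }
    boundary.push_back(keyspace);

    for (int s = 0; s < servers; ++s) // hot keys end up in narrower ranges
        printf("server %d: [%d,%d)\n", s + 1, boundary[s], boundary[s + 1]);
    return 0;
}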
50. MapReduce on EclipseMR
[Figure: EclipseMR job scheduler mapping an MR application's map and reduce tasks onto servers A–F placed on the hash ring]
Server  Hash Key Range
A       [57,5)   (wraps around)
B       [5,11)
C       [11,18)
D       [18,39)
E       [39,48)
F       [48,57)
- Accessing a file (hash key=38) → data blocks 5 & 56
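As the table shows, server A's range wraps around the ring. A minimal lookup sketch for routing a hash key to its owner under that table (the wraparound test is the only subtlety):

#include <cstdio>

struct Range { char server; int lo, hi; }; // owns keys in [lo, hi) on the ring

bool owns(const Range &r, int key) {
    if (r.lo < r.hi) return key >= r.lo && key < r.hi; // ordinary range
    return key >= r.lo || key < r.hi;                  // range that wraps past 0
}

int main() {
    const Range ring[] = {
        {'A', 57, 5}, {'B', 5, 11}, {'C', 11, 18},
        {'D', 18, 39}, {'E', 39, 48}, {'F', 48, 57},
    };
    for (int key : {38, 43, 58}) // e.g. hash key 38 -> server D, as on the slide
        for (const Range &r : ring)
            if (owns(r, key)) { printf("key %d -> server %c\n", key, r.server); break; }
    return 0;
}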
53. Ongoing Works
Machine Learning on EclipseMR
- Integration of GPU ML packages with EclipseMR
"Using GPUs in Hadoop is just too hard."
WORLD'S LARGEST ARTIFICIAL NEURAL NETWORKS WITH GPU
Source: Stanford AI Lab
54. Ongoing Works
Cassandra/NoSQL
- Integration of Cassandra with EclipseMR
- NewSQL on EclipseMR
Developing a fast NewSQL engine that leverages DHT in-memory caching and RDMA
[Figure: DHT ring of hash keys 0–15, partitioned across nodes as {13,14,15}{0,1}, {2,3,4}, {5,6,7}, {8,9,10,11,12}]