Slides from our recent workshop for hedge funds, reviewing the options for grid computing on public cloud. The workshop included live demos tackling 2 TB of full-depth market data using MATLAB on AWS, and Google BigQuery with Datalab.
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
1. 変通 [hen-tsoo]
noun
1. Resourcefulness – the quality of being able to cope with a difficult situation
2. Adaptability – the ability to change (or be changed) to fit changed circumstances
3. Agility – the power of moving quickly and easily; nimbleness
INFINITELY SCALABLE CLUSTERS
Grid computing on public cloud
6. WHAT IS PUBLIC CLOUD?
“A service provider makes resources, such as virtual machines, applications and
storage, available to the general public.”
• Utility model
• No contracts
• Shared hardware / multi tenant
• Self managed
7. WHAT IS GRID COMPUTING?
Traditional resource limitations:
• Data store performance
• PC Processor / Memory / Storage
• Network bandwidth
The researcher may wait a long time for results.
• Grid computing moves the computational work from the PC to a cluster of servers
• The cluster processes the data on behalf of the researcher and returns the results
• Processing time is reduced
• Larger datasets can be tackled
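The idea on this slide can be sketched in a few lines of Python. This is only an illustration: a local thread pool stands in for the cluster of servers, and `simulate_task` and `run_on_grid` are hypothetical names, not part of any grid product.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_task(n: int) -> int:
    """Stand-in for one unit of research work (hypothetical)."""
    return sum(i * i for i in range(n))


def run_on_grid(task_sizes, workers: int = 4):
    """Send every task to the worker pool and return results in submission order."""
    # Assumption: a thread pool plays the role of the cluster here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate_task, task_sizes))


results = run_on_grid([10, 100, 1000])
```

The researcher's code only submits tasks and collects results; where the workers actually run is hidden behind the pool, which is the property a real grid scheduler provides across many machines.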
8. KEY CONCEPTS
The challenges (chart of number of tasks against size of data):
• Big Data
• High Throughput Computing
• MapReduce
• High Performance Computing
The workflows:
• Ingest
• Process
• Analyse
• Visualise
• Store
11. HARDWARE INFLEXIBILITY
• Buy 22-core processors at 2.2GHz or 6-core processors at 3.6GHz?
• Buy 8GB, 16GB or 32GB memory modules (RAM per core ratio)?
• Graphics Processing Units (GPUs)?
• How much local storage per server?
• What network devices between servers (32- or 48-port switches)?
• What size file server?
Grid usage varies depending on research priorities:
[Chart: jobs per day (0 to 120) by date, Monday to Sunday]
12. PROFILING MATLAB RESOURCE USAGE
• MATLAB uses one processor core at a time (50% on a 2 vCPU machine). Use the Parallel Computing Toolbox for multicore PCs.
• MATLAB stores all data in RAM, so there is very little I/O while processing
• I/O spike when writing out results
SysInternals Process Explorer
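The 50% figure is simple arithmetic: one busy core out of two vCPUs. A minimal sketch, assuming the process is strictly single-threaded (the function name is hypothetical):

```python
def single_thread_cpu_percent(vcpus: int) -> float:
    """Highest whole-machine CPU % a single-threaded process can show."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    # One fully busy core out of `vcpus` cores.
    return 100.0 / vcpus
```

On a 4 vCPU machine the same job would show as 25%, so a low overall CPU reading in a process monitor does not mean the job is idle.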
13. MATLAB GRID WITH PUBLIC CLOUD
- Pay only for what you use
- Scale compute resource up AND down
- Minimal capital outlay on hardware
- Experiment with grid computing platforms quickly, cheaply and with no commitment
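A rough way to see the "pay only for what you use" point is to compare node-hours. The sketch below is illustrative only: the hourly usage profile and the price are made-up values, and it assumes an always-on cluster would cost the same per node-hour, which will not hold exactly in practice.

```python
def elastic_cost(nodes_per_hour, price_per_node_hour):
    """Cost when paying only for the nodes running in each hour."""
    return sum(nodes_per_hour) * price_per_node_hour


def always_on_cost(peak_nodes, hours, price_per_node_hour):
    """Cost of keeping peak capacity running for the whole period."""
    return peak_nodes * hours * price_per_node_hour


# Hypothetical day: idle overnight, a 4-hour peak, then a lighter afternoon.
usage = [0] * 8 + [32] * 4 + [8] * 6 + [0] * 6
PRICE = 0.5  # hypothetical price per node-hour, not a real quote

elastic = elastic_cost(usage, PRICE)
fixed = always_on_cost(max(usage), len(usage), PRICE)
```

With this invented profile the elastic cluster consumes 176 node-hours against 768 for the always-on one; the gap depends entirely on how bursty the real workload is.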
14. A DAY IN A PUBLIC CLOUD CLUSTER
[Chart: workers and tasks in queue (0 to 180) against time of day, 00:30 to 23:10]
- Cluster consisting of 32 x 4-core nodes
- Max 128 workers (one per core)
- Ramps up as jobs get submitted
- Tears down nodes when jobs have finished
- Minimises costs when not in use
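The ramp-up and tear-down behaviour can be sketched as a simple sizing rule. The slides do not describe the actual scaling policy, so this is an assumption: one worker per core, and just enough nodes to cover queued plus running tasks, capped at the cluster maximum.

```python
import math

CORES_PER_NODE = 4  # from the slide: 4 cores per node
MAX_NODES = 32      # from the slide: 32 nodes, so at most 128 workers


def desired_nodes(tasks_in_queue: int, running_tasks: int = 0) -> int:
    """Number of nodes to run right now (hypothetical policy)."""
    demand = tasks_in_queue + running_tasks
    needed = math.ceil(demand / CORES_PER_NODE)
    # Never exceed the cluster maximum; zero demand means zero nodes.
    return min(MAX_NODES, needed)
```

An empty queue with nothing running returns 0, which is the "minimise costs when not in use" case; a queue of 1,000 tasks is capped at 32 nodes.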
16. RUNNING MATLAB CLUSTER ON IAAS
AWS vCPUs are hyper-threaded
Each vCPU is a hyper-thread of an Intel Xeon core for 2nd generation instance types
(M4, M3, C4, C3, R3, HS1, G2, I2, and D2)
https://aws.amazon.com/ec2/instance-types/
Azure does not overcommit memory or
cores. vCPUs are physical cores.
Azure does not use hyper-threading.
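When sizing MATLAB workers, the difference between the two statements above comes down to threads per core. A minimal sketch, taking the slide's figures at face value (2 hyper-threads per core on the listed AWS instance types, 1 on Azure); the function name is hypothetical:

```python
def physical_cores(vcpus: int, threads_per_core: int) -> float:
    """Physical cores behind a given vCPU count."""
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    return vcpus / threads_per_core


aws_cores = physical_cores(8, threads_per_core=2)    # hyper-threaded vCPUs
azure_cores = physical_cores(8, threads_per_core=1)  # per the slide: vCPU = core
```

Two instances advertised with the same vCPU count can therefore offer different numbers of physical cores, which matters for CPU-bound workers.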
17. GRID DEPLOYMENT OPTIONS
1. Infrastructure as a Service (IaaS) DIY
Spin up a compute cluster on VMs for additional capacity and new workloads
2. Burst
Use the existing on-premises compute cluster and burst to cloud as required
3. Software as a Service (SaaS)
Software vendors and Managed Service Providers provide their own SaaS solutions. Pay for compute and application software per hour
4. Platform as a Service (PaaS)
Cloud providers’ data analytics platforms as a service: Google BigQuery & Datalab, Microsoft HDInsight, Amazon EMR
20. WHAT IS BIGQUERY?
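A Google “service that enables interactive analysis of massively large datasets”. BigQuery runs on Google's own Dremel engine rather than on Hadoop, but it scales on the same two principles that Hadoop popularised:
• Distributed File System – Stores data that’s larger than can fit on a single machine
• Map Reduce – Distributes processing across multiple systems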
http://blogs.forrester.com/mike_gualtieri/13-06-07-what_is_hadoop
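The MapReduce idea can be shown on a toy scale. This is a single-machine sketch of the pattern, not how any particular platform implements it; the symbols and volumes are invented for illustration, and each inner list stands in for a block of data held on a different machine.

```python
from collections import Counter
from functools import reduce


def map_partition(trades):
    """Map step: total traded volume per symbol within one partition."""
    totals = Counter()
    for symbol, volume in trades:
        totals[symbol] += volume
    return totals


def reduce_totals(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# Hypothetical partitions of (symbol, volume) records.
partitions = [
    [("ABC", 100), ("XYZ", 50)],
    [("ABC", 25)],
    [("XYZ", 10), ("ABC", 5)],
]

total_volume = reduce(reduce_totals, map(map_partition, partitions), Counter())
```

Because each partition is summarised independently, the map step can run wherever the data lives, and only the small partial totals need to travel over the network for the reduce step.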
22. DON’T FORGET SECURITY
Security considerations:
• Secure transfer and storage of data and code
• Secure remote access to cloud hosted environment
• Secure authentication
  - Windows AD credentials
  - AWS IAM credentials
  - Google accounts
  - Microsoft accounts
• Auditing (who accessed what, who changed what)
23. SUMMARY
• Traditional grid and HPC tools can benefit from moving into cloud
• Vast landscape of available tools
• Off-the-shelf PaaS offerings
• Integrations and ecosystems
• Cheap and very quick to experiment
24. MORE INFORMATION?
Hentsu Ltd
1 Fore Street
London EC2Y 9DT
hello@hentsu.com
https://hentsu.com