Big Data in a Public Cloud
1. Big Data in a public cloud
Maximize RAM without paying for
overhead CPU and other price/
performance tricks
2. Big Data
• Big Data: just another way of saying large data sets
• They are so large that it's difficult to manage
them with traditional tools
• Distributed computing is an approach to solve that
problem
• First the data needs to be Mapped
• Then it can be analyzed – or Reduced
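The two steps above can be sketched in plain Python, with a toy word count standing in for real analysis (not tied to any particular framework):

```python
from functools import reduce
from collections import Counter

def map_phase(documents):
    # Map: turn each document into its own (word -> count) table
    return [Counter(doc.split()) for doc in documents]

def reduce_phase(counters):
    # Reduce: merge the per-document counts into one total
    return reduce(lambda a, b: a + b, counters, Counter())

docs = ["big data is big", "data needs tools"]
totals = reduce_phase(map_phase(docs))
print(totals["big"])  # → 2
```

The point of splitting the work this way is that the map phase runs independently per document, so it can be spread across many machines.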
6. CPU / RAM
• Mapping is not CPU intensive
• Reducing is (usually) not CPU intensive
• Speed: load data in RAM, or it hits the HDD
and creates iowait
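On Linux you can see whether a worker is actually hitting the disk from the iowait field of the first `cpu` line in `/proc/stat`; a minimal parser (field layout per the proc(5) man page):

```python
def cpu_iowait_jiffies(stat_line):
    # /proc/stat "cpu" line: user nice system idle iowait irq softirq ...
    fields = stat_line.split()
    assert fields[0] == "cpu"
    return int(fields[5])  # iowait is the 5th value after the label

# Example line as it might appear on a Linux host (values illustrative):
sample = "cpu 10132153 290696 3084719 46828483 16683 0 25195 0 0 0"
print(cpu_iowait_jiffies(sample))  # → 16683
```

A rising iowait counter between two samples means the CPUs are idling while waiting on disk, which is exactly the symptom of not fitting the data in RAM.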
10. What about MapReduce PaaS?
• Be aware of data lock-in
• Be aware of the forced tool set – it limits your
workflow
• …and as a developer I just don't like it that
much; it's a control thing
18. Who runs these workloads, and how?
• CERN – the LHC experiments to find the Higgs
boson
• 200 petabytes per year
• 200,000 CPU days per day on hundreds of partner
grids and clouds
• Monte Carlo simulations (CPU intensive)
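For context, a Monte Carlo simulation in its simplest form (estimating π by random sampling; nothing like CERN's physics code, just an illustration of why this workload is CPU-bound with a near-zero memory footprint):

```python
import random

def estimate_pi(samples, seed=42):
    # Draw random points in the unit square; the fraction landing
    # inside the quarter circle approaches pi/4. Almost all the cost
    # is CPU time spent sampling, not memory or disk.
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / samples

print(estimate_pi(100_000))  # ≈ 3.14; converges as samples grow
```

Because every sample is independent, the work parallelizes trivially across worker nodes, which is why grids and clouds suit it so well.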
19. The CERN case
• Golden VM images with tools, from which they clone
• A set of coordinator servers, which scale the worker nodes
up and down via a provisioning API (clone, Puppet config)
• Coordinator servers manage the workload on each worker
node doing Monte Carlo simulations
• Federated cloud brokers such as Enstratus and Slipstream
• Self healing architecture – uptime of a worker node is not
critical; just spin up a new worker node instance
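The self-healing loop above can be sketched as follows; `FakeCloud` and its `clone`/`destroy` methods are hypothetical stand-ins for whatever provisioning API (clone plus Puppet config) the coordinator actually calls:

```python
class FakeCloud:
    # Hypothetical stand-in for a real provisioning API; a real
    # coordinator would call the IaaS supplier's API here.
    def __init__(self):
        self.counter = 0
    def destroy(self, node):
        pass  # real API call: terminate the failed instance
    def clone(self, golden_image):
        self.counter += 1
        return f"worker-{self.counter}"

def heal(cloud, workers, is_healthy, golden_image):
    # Self-healing: replace any dead worker with a fresh clone of the
    # golden image instead of repairing it; per-node uptime is not critical.
    fresh = []
    for node in workers:
        if is_healthy(node):
            fresh.append(node)
        else:
            cloud.destroy(node)
            fresh.append(cloud.clone(golden_image))
    return fresh

cloud = FakeCloud()
print(heal(cloud, ["w1", "w2-dead"], lambda n: "dead" not in n, "golden-vm"))
# → ['w1', 'worker-1']
```

The design choice is "replace, don't repair": because workers are clones of a golden image and hold no unique state, the cheapest recovery is always a new instance.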
20. The loyalty program case
• The customer runs various loyalty programs
• Plain vanilla Hadoop Map/Reduce:
• Chef and Puppet to deploy and config worker nodes
• Lots of RAM, little CPU
• Self healing architecture – uptime of a worker node is
not critical; just spin up a new worker node instance
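With plain-vanilla Hadoop Map/Reduce, the two phases are often just small scripts; a minimal pair in the Hadoop Streaming style, expressed as functions over iterables so it is self-contained (the `customer_id,points` input format is an assumption for illustration, since the slides don't show the actual loyalty-program data):

```python
from itertools import groupby

def mapper(lines):
    # Map: emit (customer_id, points) pairs, one per transaction line.
    for line in lines:
        customer, points = line.strip().split(",")
        yield customer, int(points)

def reducer(pairs):
    # Reduce: Hadoop delivers pairs grouped by key;
    # sum the loyalty points per customer.
    for customer, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield customer, sum(points for _, points in group)

txns = ["c1,10", "c2,5", "c1,7"]
print(dict(reducer(mapper(txns))))  # → {'c1': 17, 'c2': 5}
```

Note the RAM/CPU profile the slide describes: the heavy part is holding and shuffling the data, not the trivial per-record arithmetic.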
21. Find IaaS suppliers with
• Unbundled resources
• Short billing cycles
• New equipment: what’s your life cycle?
• SSD as compute storage
• Cross-connects allowed from your equipment in the
DC
• CloudSigma, ElasticHost, ProfitBricks to mention a
few
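Unbundled resources matter because a RAM-heavy Map/Reduce node wastes money on bundled CPU it never uses; a back-of-the-envelope comparison (all prices hypothetical, real suppliers vary widely):

```python
def monthly_cost(ram_gb, cpu_cores, ram_price, cpu_price):
    # Hypothetical per-unit monthly prices, for illustration only.
    return ram_gb * ram_price + cpu_cores * cpu_price

# Bundled instance: to get 64 GB RAM you must also buy 16 cores.
bundled = monthly_cost(64, 16, ram_price=5.0, cpu_price=20.0)
# Unbundled: buy 64 GB RAM with only the 2 cores the job needs.
unbundled = monthly_cost(64, 2, ram_price=5.0, cpu_price=20.0)
print(bundled, unbundled)  # → 640.0 360.0
```

The same logic applies to short billing cycles: a self-healing cluster that scales workers up and down hourly should not pay for month-long instance commitments.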