This document provides an introduction to cloud computing using Amazon Web Services (AWS) and MongoDB. It defines cloud computing and describes the various service models including Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). It outlines some of the key AWS computing, storage, database, and other services like EC2, S3, DynamoDB, and ElastiCache. It also introduces MongoDB as a scalable and natural document-oriented NoSQL database and compares some of its features to SQL databases. Finally, it provides two examples of using AWS and MongoDB for DNA sequencing and genome analysis.
An introduction to cloud computing with Amazon Web Services and MongoDB
1. An introduction to cloud computing with Amazon Web Services and MongoDB
Samuel Demharter
DTC, 10 March 2016
2. Cloud Computing
“Everybody's in it and nobody's in it. It's like
a cloud that everybody has given a little puff
of mist to, and then the cloud does all the
heavy thinking for everybody. I don't mean
there's really a cloud. I just mean it's
something like that.”
The Sirens of Titan, Kurt Vonnegut, 1959
3. Definition
• Gartner Group: “A style of computing in
which massively scalable and
elastic IT-enabled capabilities are
delivered as a service using Internet
technologies.”
4. Cloud Computing Service Models
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
5. Amazon Web Services
• Development started in 2002
• In 2006, Amazon launched its Elastic
Compute Cloud (EC2) and S3 storage
service
• Amazon EC2/S3 was the first widely
accessible cloud computing infrastructure
service
7. AWS Computing
• Elastic Compute Cloud (EC2)
– Access to individual instances as you would
with any other machine
– Customisable configuration
– Auto Scaling
• Amazon Elastic MapReduce
– Process vast amounts of data
– Utilise Hadoop framework
8. AWS Storage
• Simple Storage Service (S3)
– Scalable cloud storage
– HTTP access
– Object store not a file system
– Cheap
• Elastic Block Store (EBS)
– Local storage
– For use with EC2 instances
– Take snapshot backups
– Fast
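To make "object store, not a file system" concrete, here is a minimal sketch of the idea in plain Python. It is a toy in-memory model, not the S3 API: the class `ToyObjectStore`, its methods, and the example keys are all hypothetical. The point it illustrates is that objects live under flat string keys, and "folders" are only a shared key prefix.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store: flat key -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Objects are written whole; there is no partial in-place update.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Directories" are just a naming convention: listing filters by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("runs/2016-03-10/reads.fastq", b"@read1\nACGT\n")
store.put("runs/2016-03-10/summary.txt", b"1 read\n")
store.put("runs/2016-03-11/reads.fastq", b"@read2\nTTGA\n")

march_10_keys = store.list_keys(prefix="runs/2016-03-10/")
print(march_10_keys)
```

A real object store adds HTTP access, durability, and access control on top of this key-to-object mapping.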
9. AWS Databases
• Amazon SimpleDB (NoSQL)
– Ease of administration
• Amazon DynamoDB (NoSQL)
– Scalability & durability
• Amazon Relational Database Service
(SQL)
– Efficient indexing & querying
• Amazon ElastiCache
– Fast data access
15. What is a database?
A database is a collection of information that
is organized so that it can easily be
accessed, managed, and updated.
16. Why use a database?
• Reusability : You need a single, public,
interface for your data storage that all parts of
your application can use.
• Availability : You need to be sure that your
application will always be able to read and
write data.
• Durability : You need to be sure that your
data will stick around.
• Scalability : You need your data storage to
be able to grow with your application.
17. Typical SQL and NoSQL databases
SQL (Structured Query Language)
• Oracle
• MySQL
• Microsoft SQL Server
NoSQL (Not Only SQL)
• Types: key-value, column, document, graph-based
• Examples: MongoDB, CouchDB, Riak
19. MongoDB
• Distributed
• Document-oriented
• Schema-less storage solution
• Uses JSON-style documents
• Supports Python, PHP, Java, Ruby, C++, etc.
• Replica sets for failovers and speeding up
reads
• Sharding for high performance
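As a rough illustration of "schema-less" and "JSON-style documents", the sketch below models a collection as a list of Python dictionaries. It does not use a MongoDB driver; the `find` helper, the collection name `samples`, and all field values are hypothetical and only mimic the idea of querying by field.

```python
import json

# Two hypothetical documents in one collection; note that their fields differ.
samples = [
    {"_id": 1, "name": "sample-A", "organism": "E. coli"},
    {"_id": 2, "name": "sample-B",
     "reads": [{"id": "r1", "length": 5200}], "tags": ["pilot"]},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(samples, name="sample-B")
as_json = json.dumps(matches[0], sort_keys=True)
print(as_json)
```

Neither document had to declare its fields in advance, and the nested `reads` list is stored inside the document rather than in a separate table.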
20. SQL vs MongoDB (NoSQL)
SQL | MongoDB (NoSQL)
Requires structured data / well-designed schema | Semi-structured, unstructured & polymorphic data
Table based | Document based
Database atomicity | Document atomicity / eventual consistency
Rules enforced by database | Rules enforced by user
Scale-up | Scale-out (suitable for distributed computing)
Flexible & fast
24. Table - Who is the account holder
for account ID 3?
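The slide's table is not preserved in this transcript, so the following is only a sketch of how such a question might be answered in each model. The table names, column names, and account holders are invented for illustration. The SQL half uses Python's built-in `sqlite3` module; the document half uses plain dictionaries rather than a MongoDB driver.

```python
import sqlite3

# Relational model: two tables joined on holder_id (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holders  (holder_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY,
                           holder_id INTEGER REFERENCES holders(holder_id));
    INSERT INTO holders  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO accounts VALUES (1, 1), (2, 2), (3, 1);
""")
row = conn.execute(
    "SELECT h.name FROM accounts a "
    "JOIN holders h ON h.holder_id = a.holder_id "
    "WHERE a.account_id = ?",
    (3,),
).fetchone()
sql_holder = row[0]
conn.close()

# Document model: the holder is embedded in each account document.
accounts = [
    {"_id": 1, "holder": {"name": "Alice"}},
    {"_id": 2, "holder": {"name": "Bob"}},
    {"_id": 3, "holder": {"name": "Alice"}},
]
doc_holder = next(doc["holder"]["name"] for doc in accounts if doc["_id"] == 3)

print(sql_holder, doc_holder)
```

The relational version needs a join but stores each holder once; the document version answers the question from a single document at the cost of repeating holder data.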
30. Usage Example 1: DNA Sequencing
• Real-time DNA sequencing
• Raw data (PC) → Basecalling (AWS) → Basecalled data (PC)
31. Usage Example 1: DNA Sequencing
• Use AWS EC2 computing and S3 storage
• Spot market – auction of unused EC2
instances
• Pay-per-use is an important economic
factor for Nanopore
• Use a combination of MongoDB and SQL
32. Usage Example 2: Genome Analysis
Genetic Variant Calling
Peter White et al., Ohio State University in collaboration with Genome Next
https://youtu.be/upAtK_SOtsY
37. Definitions
• Instance: A copy of an Amazon Machine
Image running as a virtual server in the
AWS cloud
• Instance type: A specification that defines
the memory, CPU, storage capacity, and
hourly cost for an instance.
• Amazon Machine Image: AMIs are like a
template of a computer's root drive.
38. • Pixar accidentally wipes out nearly every
file of "Toy Story 2" about 10 months into
production. Fortunately, supervising
technical director Galyn Susman had just
become a new mom and had an entire
copy of the movie on her home computer
so that she could work from home. Woody
and Buzz live to see another day, and
movie.
Editor's notes
In 2006, Amazon launched its Elastic Compute Cloud (EC2) as a commercial web service that allows small companies and individuals to rent computers on which to run their own computer applications.
Other key factors that have enabled cloud computing to evolve include the maturing of virtualisation technology, the development of universal high-speed bandwidth, and universal software interoperability standards.
a collection of cloud computing services e.g.
Amazon markets AWS as a service to provide large computing capacity more quickly and more cheaply than a client company building an actual physical server farm.[3]
Hadoop is a framework for distributing data and processing across a resizable cluster of EC2 instances.
EMR: A web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses Hadoop processing combined with several AWS products to do such tasks as web indexing, data mining, log file analysis, machine learning, scientific simulation, and data warehousing.
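The map/reduce pattern behind Hadoop can be sketched in a few lines of plain Python. This is a single-process toy for log file analysis, not Hadoop or EMR code: the log lines and the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for illustration. A real framework would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical web server log lines: method, path, HTTP status.
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /upload 200",
    "GET /index.html 200",
]


def map_phase(line):
    """Emit (key, 1) pairs; here the key is the HTTP status code."""
    status = line.split()[-1]
    yield status, 1


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


mapped = [pair for line in log_lines for pair in map_phase(line)]
status_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(status_counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines.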
Open source
Popular with start-ups
Simple application that stores data in file
Want to read data later
Another programme wants to read data
What if not same language?
Multiple programmes at same time use data?
Overloaded. Scale up or scale out?
Scale up – improve hardware – eventually runs out
Scale out – distribute data – manage data across multiple hosts
NoSQL: term popularised in 2009
uses JSON-style documents to represent, query and modify data
Similar to CouchBase and CouchDB
MongoDB success is largely due to having easy-to-use, familiar tools.
MongoDB uses memory-mapped files for its storage engine
(data is structured per record)
A shard is a replica set that contains a subset of the data for the sharded cluster.
Together, the cluster’s shards hold the entire data set for the cluster.
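The idea that each shard holds a subset and the shards together hold everything can be sketched as follows. This is a toy hash-modulo scheme in plain Python, not MongoDB's actual routing; the shard names, the `shard_for` helper, and the documents are hypothetical, and replication within each shard is left out.

```python
import hashlib

SHARD_NAMES = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shard_for(key):
    """Choose a shard from a stable hash of the key.

    hashlib is used instead of the built-in hash(), which can differ
    between Python processes for strings.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return SHARD_NAMES[int(digest, 16) % len(SHARD_NAMES)]


documents = [{"_id": i, "value": i * i} for i in range(12)]

# Route every document to exactly one shard.
cluster = {name: [] for name in SHARD_NAMES}
for doc in documents:
    cluster[shard_for(doc["_id"])].append(doc)

# Reading every shard back recovers the complete data set.
all_ids = sorted(doc["_id"] for shard in cluster.values() for doc in shard)
print({name: len(docs) for name, docs in cluster.items()})
```

A lookup by `_id` only needs to contact the shard returned by `shard_for`, which is what lets the cluster scale out as hosts are added.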
A virtual machine is a software computer that, like a physical computer, runs an operating system and applications. The virtual machine is comprised of a set of specification and configuration files and is backed by the physical resources of a host.
Some instance types are designed for standard applications, whereas others are designed for CPU-intensive, memory-intensive applications, and so on.
AMI contains the operating system and can also include software and layers of your application, such as database servers, middleware, web servers, and so on.