Architecting for the cloud storage misc topics
- 1. © Matthew Bass 2013
Architecting for the Cloud
Len and Matt Bass
Storage in the Cloud
- 2. © Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
- 3. © Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
- 4. © Matthew Bass 2013
History
• The relational data model was created in the
late 1960s
• In the 1980s relational databases became
commercially successful
– Replacing hierarchical and network databases
• Relational databases continue to be the
dominant database model today
- 5. © Matthew Bass 2013
Relational Databases
• The relational model is a mathematical model
for describing the structure of data
– We will not go into this model
• Let’s quickly review first and second
normal form, however
- 6. © Matthew Bass 2013
Example
Imagine you sell car parts
– You have warehouses
– You have part inventories
– You have orders
What’s the problem?
Warehouse | Warehouse Address | Part
- 7. © Matthew Bass 2013
What Happens Here?
Warehouse 1 | 123 Main Street | Transmission, Steering wheel, Brake pads, …
What about here?
Warehouse 1 | 123 Main Street | Transmission
Warehouse 1 | 123 Main Street | Steering wheel
Warehouse 1 | 123 Main Street | Brake pads
- 8. © Matthew Bass 2013
The Solution …
Warehouse Table: Warehouse ID | Warehouse Address
Parts Table: Part ID | Part Description
Relations Table: Warehouse ID | Part ID
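To make this concrete, here is a hedged sketch of the normalized schema and the join that reassembles the flat view, written in Python with the built-in sqlite3 module (table and column names are illustrative, not from the slides):

    import sqlite3

    # In-memory database; a minimal sketch of the normalized schema above.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE warehouse (warehouse_id INTEGER PRIMARY KEY,
                                warehouse_address TEXT);
        CREATE TABLE part (part_id INTEGER PRIMARY KEY,
                           part_description TEXT);
        CREATE TABLE warehouse_part (warehouse_id INTEGER REFERENCES warehouse(warehouse_id),
                                     part_id INTEGER REFERENCES part(part_id));
    """)
    con.execute("INSERT INTO warehouse VALUES (1, '123 Main Street')")
    con.executemany("INSERT INTO part VALUES (?, ?)",
                    [(1, 'Transmission'), (2, 'Steering wheel'), (3, 'Brake pads')])
    con.executemany("INSERT INTO warehouse_part VALUES (1, ?)", [(1,), (2,), (3,)])

    # Reassembling the original flat view takes two joins.
    for row in con.execute("""SELECT w.warehouse_id, w.warehouse_address, p.part_description
                              FROM warehouse w
                              JOIN warehouse_part wp ON wp.warehouse_id = w.warehouse_id
                              JOIN part p ON p.part_id = wp.part_id"""):
        print(row)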
- 9. © Matthew Bass 2013
This Works
• We have a standard language for querying the
data (SQL)
• We can now extract data in a very flexible way
• We can read, write, update, and delete data
pretty efficiently
– Joins add some overhead
- 10. © Matthew Bass 2013
Moreover We Have RDBMS
• We have robust software systems that manage the
data
• These systems provide many advanced features
including:
– Behavior
– Concurrency control
– Transactions
– Referential integrity
– Optimization
- 11. © Matthew Bass 2013
Behavior
• DBMSs provide mechanisms for building in
behavior
• These are mechanisms like
– Stored procedures
– PL/SQL
• This allows you to simplify the application
logic
- 12. © Matthew Bass 2013
Concurrency Control
• DBMSs support multiple user access
• They will lock tables during updates to ensure
that writes are complete prior to reads
• They will manage multiple updates to ensure
integrity and consistency of data
- 13. © Matthew Bass 2013
Transactions
• Transactions are supported
• This ensures that updates either happen
completely or not at all
– Often an atomic update is a set of updates to
individual records across multiple tables
– If only some of these updates happen the integrity
of the overall database is compromised
- 14. © Matthew Bass 2013
Referential Integrity
• Ensures that references from one table refer
to a valid entry in another table
- 15. © Matthew Bass 2013
Optimization
• Database systems will perform a variety of actions
to optimize based on usage patterns
• They will
– Create indexes
– Create virtual tables
– Cache values
– …
- 16. © Matthew Bass 2013
Impedance Mismatch
• There is, however, a mismatch
– We need to translate between the relational structure and the
organizational needs
• Think about the reports needed for the warehouse
– Purchase orders
– History of orders for customer
– Parts inventory per warehouse
– …
• This means we will need lots of Joins
– This isn’t too much of an issue until we scale …
- 17. © Matthew Bass 2013
Speaking of Scaling …
Do relational databases scale?
- 18. © Matthew Bass 2013
Internet Scale Is Difficult
• We can “shard” the data
– Split the data across the machines
• This is very difficult to do efficiently
• This makes joins more costly
– Remember joins are common
• This also has a practical limit
– At some point you will need to replicate the data
• The database becomes slow …
- 19. © Matthew Bass 2013
Change is Needed
• For this reason internet scale applications moved to
distributed file systems
– Google was the first
– Many others followed
• This allowed the data to be partitioned across nodes
more efficiently
– We’ll talk about this in a minute
- 20. © Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
- 21. © Matthew Bass 2013
Needs
• Let’s explore the needs in a bit more detail
• The file system needed to:
– Be fault-tolerant
– Handle large files
– Accommodate extremely large data sets
– Accommodate many concurrent clients
– Be flexible enough to handle multiple kinds of applications
- 22. © Matthew Bass 2013
Fault-Tolerance
• Due to the scale of these systems, they were
deployed on hundreds or thousands of servers
• This meant that at any given time some of these
nodes would not be operational
• Problems from application bugs, operating system
bugs, human error, hardware failures, and network
failures are common
- 23. © Matthew Bass 2013
Large Files/Large Data Sets
• It’s common for files in these systems to be
multiple GBs
• Each file could have millions of objects
– E.g. many individual web pages
• The data sets grow quickly
• The data sets can be multiple terabytes or
petabytes
- 24. © Matthew Bass 2013
Many Concurrent Clients
• The system needs to efficiently handle
multiple clients
• These clients could be reading or writing
- 25. © Matthew Bass 2013
Multiple Applications
• Additionally the system needs to be flexible
enough to handle multiple applications
• Applications have a variety of needs
– Long streaming reads
– Throughput oriented operations
– Low latency reads
– …
- 26. © Matthew Bass 2013
Addressing Needs
• There were a number of things that were done to address the
needs
• One primary decision was the de-normalization of the data
– We’ll talk about this more in the next slides
• Other decisions include (we’ll talk about these in a bit)
– Block size
– Replication strategy
– Data consistency checks
– API and capability of the system
- 27. © Matthew Bass 2013
De-Normalizing Data
• Remember what was difficult with relational models?
– Joins across nodes are expensive
– As is synchronization for replicated data
• If the data is de-normalized it can be “localized”
– Data that will likely be accessed together can be collocated
– In other words store it as you will use it
- 28. © Matthew Bass 2013
Example
• Imagine a Purchase Order
• Typically this would contain
– Customer information
– Product information
– Pricing
- 29. © Matthew Bass 2013
Relational Purchase Order
• The data would be split across multiple
tables such as
– Customer
– Product Catalog
– Inventory
– …
• If the data set is large enough the data would
be distributed
- 30. © Matthew Bass 2013
De-Normalized Purchase Order
• In a file system without a relational model the data
doesn’t need to be split up
• The purchase order data would be co-located
• If the data set was very large purchase orders would
still be co-located
– Different purchase orders could be distributed
– A single purchase order, however, would not be
- 31. © Matthew Bass 2013
Relational vs NoSQL
Relational Model: Customers | Product Catalog | Inventory
NoSQL: Orders 1 - 100 | Orders 101 - 200 | Orders 201 - 300
- 32. © Matthew Bass 2013
What Does This Mean?
• Data has no explicit structure (not entirely
true … but we’ll talk about this)
– Data is largely treated as a blob
• This has several implications
– You can change the nature of the data as needed
– You can collocate the data as desired
– The application now has increased burden
- 33. © Matthew Bass 2013
Back to Purchase Order
Key (PO Number) | Value (PO)
1 | Contents of PO1 …
2 | Contents of PO2 …
3 | Contents of PO3 …
4 | Contents of PO4 …
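As a contrast with the relational version above, here is a minimal sketch in pure Python (a dict standing in for the key/value store; field names are illustrative): the entire denormalized purchase order is one opaque value, fetched with a single key lookup.

    import json

    # Each value is an opaque, denormalized blob; the store sees no structure.
    store = {
        1: json.dumps({"customer": {"id": 8790, "address": "123 Main Street"},
                       "line_items": [{"part": "Brake pads", "quantity": 2}]}),
    }

    # One key lookup returns the entire purchase order -- no joins needed.
    po = json.loads(store[1])
    print(po["customer"]["address"])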
- 34. © Matthew Bass 2013
Retrieving Data
• To retrieve the purchase order data you provide the reference
key
• The file system routes you to the appropriate node (more
later)
• The single node returns the entire purchase order
• This can happen quickly … regardless of how many purchase
orders you have
Do you see any potential issues?
- 35. © Matthew Bass 2013
Data Locality
• First, being able to retrieve the data quickly depends on the
location of the data
• If the data is distributed it’s difficult to retrieve quickly
– Imagine you want to get the number of times a customer ordered
product X
– More on this later
• While there is not an explicit structure there is an implicit
structure
– Design of this structure is important
- 36. © Matthew Bass 2013
Data Processing
• As the file system treats the data as
unstructured it’s not able to preprocess the
data
• Getting an ordered list, for example, has to be
done in the application
• The validity of the data needs to be checked
by the application
- 37. © Matthew Bass 2013
Updating Data
• What happens if you want to change the data?
– Imagine trying to update the customer’s address
• Updates tend to be difficult
• In this environment you tend to not update data
– Instead you will append the new data
– You can establish rules for the lifetime of the data
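A minimal sketch of this append-instead-of-update idiom in Python (the timestamps and the lifetime rule here are illustrative assumptions, not from any particular store):

    import time
    from collections import defaultdict

    versions = defaultdict(list)   # key -> list of (timestamp, value)

    def append(key, value):
        # Never overwrite: add a new timestamped version instead.
        versions[key].append((time.time(), value))

    def read_latest(key):
        return max(versions[key])[1]

    def expire(key, max_age_seconds):
        # An illustrative lifetime rule: drop versions older than max_age_seconds.
        cutoff = time.time() - max_age_seconds
        versions[key] = [v for v in versions[key] if v[0] >= cutoff]

    append("customer:8790:address", "123 Main Street")
    append("customer:8790:address", "456 Oak Avenue")    # the "update"
    print(read_latest("customer:8790:address"))          # -> 456 Oak Avenue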
- 38. © Matthew Bass 2013
Other Issues
• Things like data integrity are not managed by the file
system
• You don’t (typically) have full support for transactions
• There is no notion of referential integrity
• There is support for some concurrent access, but with
built-in assumptions
• Consistency is not typically guaranteed (more later)
- 39. © Matthew Bass 2013
A New Tool in Your Toolbox
• You’ve been given a new kind of hammer
– Remember that everything is not a nail
– In other words these kinds of data stores are good
for some things … and not others
• Today there are many different flavors of
these data stores
– Both in terms of structures and features
- 40. © Matthew Bass 2013
Multiple Data Structures
• Today many options exist
– Key value stores
– Document centric data stores
– Column databases
• We’ve also started to see old models
reemerge e.g.
– Hierarchical data stores
- 41. © Matthew Bass 2013
Key Value Databases
• Basically you have a key that maps to some “value”
• This value is just a blob
– The database doesn’t care about the content or structure of this value
• The operations are quite simple e.g.
– Read (get the value given a key)
– Insert (inserts a key/value pair)
– Remove (removes the value associated with a given key)
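These three operations map directly onto real key/value stores. A hedged sketch with the redis-py client (assuming a Redis server on its default local port):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.set("session:42", b"serialized shopping cart")   # insert a key/value pair
    cart = r.get("session:42")                         # read the value for a key
    r.delete("session:42")                             # remove the key's value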
- 42. © Matthew Bass 2013
Key Value Databases II
• There is no real schema
– Basically you query a key and get the value
– This can be useful when accessing things like user sessions, shopping
carts, …
• Concurrency
– Concurrency only makes sense at the level of a single key
– Can have either optimistic write or eventual consistency – we’ll talk
about this more later
• Replication
– Can be handled by the client or the data store – more about this later
- 43. © Matthew Bass 2013
Uses
• Very fast reads
• Scales well
• Good for quick access of data without complex
querying needs
– The classic example is for session management
• Not good for
– Situations where data integrity is critical
– Data with complex querying needs
- 44. © Matthew Bass 2013
Document Centric Databases
• Stores a “document”
ID : 123
Customer : 8790
Line Items : [{product id: 2, quantity: 2}
{product id: 34, quantity: 1}]
…
- 45. © Matthew Bass 2013
Document Centric
• No schema
• You can query the data store
– Can return all or part of the document
– Typically query the store by using the id (or key)
• As with key value, discussing concurrency only makes
sense at the level of a single document
- 46. © Matthew Bass 2013
Advantages
• A document centric data store is similar in
many ways to a key/value data store
• It does, however, allow for more complex
queries
– For example you can query using a non-primary
key
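As a hedged illustration with MongoDB's Python client (the connection string, database, and collection names are assumptions), the purchase-order document from the earlier slide can be fetched by customer rather than by its id:

    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    orders.insert_one({"_id": 123, "customer": 8790,
                       "line_items": [{"product_id": 2, "quantity": 2},
                                      {"product_id": 34, "quantity": 1}]})

    # Query on a non-primary-key field, returning only part of each document.
    for doc in orders.find({"customer": 8790}, {"line_items": 1}):
        print(doc)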
- 47. © Matthew Bass 2013
Column Databases
• Row key maps to “column families”
Row key 1234:
Profile column family: Name: Matt | Billing Address: 123 Main st | Phone: 412 770-4145
Orders column family: Order Data … | Order Data … | Order Data …
- 48. © Matthew Bass 2013
Column Databases - Rows
• Rows are grouped together to form units of load balancing
– Row keys are ordered and grouped together by locality
– In this example consecutive rows would be from the same domain
(CNN)
• Concurrency makes sense at the level of a row
Key | Contents | Anchor:cnnsi.com | Anchor:my.look.ca
com.cnn.www | Html page … | … | …
- 49. © Matthew Bass 2013
Column Databases – Columns
• Columns are grouped into “column families”
• Column families form the unit of access
control
– Clients may or may not have access to all column
families
• Column keys can be used to query data
- 50. © Matthew Bass 2013
Column Databases – Timestamps
• The cells in a column database can be versioned with
a timestamp
• The cells can contain multiple versions
– The application can typically specify how many versions to
keep or when a version times out
• You can use either a client-generated timestamp
or one generated by the storage node
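A minimal sketch of such versioned cells in Python (the keep-three-versions policy and the in-memory layout are illustrative assumptions):

    import time

    MAX_VERSIONS = 3          # illustrative retention policy
    cells = {}                # (row_key, column_key) -> [(timestamp, value), ...]

    def put(row, column, value, timestamp=None):
        # The timestamp may come from the client or be generated here
        # (playing the role of the storage node).
        ts = timestamp if timestamp is not None else time.time()
        history = cells.setdefault((row, column), [])
        history.append((ts, value))
        history.sort()
        del history[:-MAX_VERSIONS]     # keep only the newest versions

    def get(row, column, timestamp=None):
        history = cells[(row, column)]
        if timestamp is None:
            return history[-1][1]       # newest version
        # Newest version at or before the requested timestamp.
        return max(v for v in history if v[0] <= timestamp)[1]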
- 51. © Matthew Bass 2013
Examples
Document Centric
• MongoDB
• CouchDB
• RavenDB
Key Value
• DynamoDB
• Azure Table
• Redis
• Riak
Column
• HBase
• Cassandra
• Hypertable
• SimpleDB
- 52. © Matthew Bass 2013
NoSQL vs RDBMS
• Explicit vs Implicit Schema
– NoSQL databases do have an implicit schema – at
least in most cases
• Distribution of data
• Consistency
• Efficiency of storage
• Additional capabilities
- 53. © Matthew Bass 2013
Schema
• Clearly with Relational DB there is an explicit schema
• You do have an implicit schema with NoSQL db as well
– You typically want to do something with the data
• With a relational schema, distributing data has a big
performance impact
• The data model of a NoSQL store impacts performance as well
– It is easier to distribute data so that related data is co-located
- 54. © Matthew Bass 2013
Consistency - CAP Theorem
• When data becomes distributed you need to
worry about a network partition
– Essentially this means that instances of your data
store can’t communicate
• When this happens you need to choose
between availability and consistency
- 55. © Matthew Bass 2013
Let’s Demonstrate
• Imagine we start a store that takes orders
– Who wants to work at this store?
• The operators need to be able to:
– Take orders
– Give order history
– Modify orders
• We will start with one operator until business grows
…
- 56. © Matthew Bass 2013
Consistency in the Cloud
• Many NoSQL databases give you options
– Eventual consistency
– Optimistic consistency
– …
• They all come with different trade offs
• You must understand the needs of your system to
ensure appropriate behavior
– We’ll talk more about this later
- 57. © Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
- 58. © Matthew Bass 2013
Fault Tolerance
• As we said earlier fault tolerance was a prime
motivator for many of the decisions
• These systems are built with commodity components
that are prone to failure
• They also need to deal with other issues (previously
mentioned) that arise
• We’ll look at a representative example of such a system
to understand what decisions have been made
- 59. © Matthew Bass 2013
Google File System
• Grew out of “BigFiles”
• Distributed, scalable and portable file system
• Written in C++
• Supports the kinds of applications we discussed
earlier
– Search
– Large data retrieval
- 60. © Matthew Bass 2013
Leads to the following requirements
1. High reliability through commodity hardware
– Even with RAID, disks at this scale will still fail about once per day. If the system has to deal
with failure smoothly in any case, it is much more economical to use commodity hardware.
– Even if disks do not fail, data blocks may get corrupted.
2. Minimal synchronization on writes
– Require each application process to write to a distinct file. File merge can take place
after files are written.
– This means minimal locking during the write process (or read process).
3. Data blocks are all the same size
– Streaming data. ALL blocks are 64MBytes.
– GFS is unaware of any internal logic of the data; the internal logic of the data must be
managed by the application
- 61. © Matthew Bass 2013
GFS Interfaces
• Supports the following commands
– Open
– Create
– Read
– Write
– Close
– Append
– Snapshot
- 62. © Matthew Bass 2013
Organization of GFS
• Organized into clusters
• Each cluster might have thousands of machines
• Within each cluster you have the following kinds of
entities
– Clients
– Master servers
– Chunk servers
- 63. © Matthew Bass 2013
GFS Clients
• Clients are any entity that makes a file request
• Requests are often to retrieve existing files
• They might also include manipulating or creating files
• Clients are other computers or applications
– Think of the web server that serves your search
engine as a client
- 64. © Matthew Bass 2013
Chunk Servers
• Responsible for storing the data “chunks”
– These chunks are all 64 MB blocks
• These chunk servers are the workhorses of the file system
• They receive requests for data and send the chunks directly to the
client
• The client also writes the files directly to the appropriate chunk
servers
– The locations of the replicas come from the master as well
• The chunk server is responsible for determining the correctness of
the write (more later)
- 65. © Matthew Bass 2013
Master Servers
• Acts as a coordinator for the cluster
• Keeps track of the metadata
– This is data that describes the data blocks (or chunks)
– It tells the Master which chunks belong to which file
• Master tells the client where the chunk is located
• Master keeps an operations log
– Logs the activities of the cluster
– One of the mechanisms used to keep service outages to a
minimum (more later)
- 66. © Matthew Bass 2013
Two Additional Concepts
Lease:
• Lease is the minimal locking that is performed. Client receives lease on a file
when it is opened and, until file is closed or lease expires, no other process
can write to that file. This prevents accidentally using the same file name twice.
• Client must renew lease periodically (~ 1 minute) or lease is expired.
Block:
• Every file managed by GFS is divided into 64MByte blocks. Each read/write is
in terms of <file, block #>
• Each block is replicated – three is the default number of replicas.
• As far as GFS is concerned there is no internal structure to a block. The
application must perform any parsing of the data that is necessary.
- 67. © Matthew Bass 2013
Basic Read Operation
[Figure: Client, Master, and three Chunk Servers]
• Client requests the location of the file from the Master
• Master returns the location
• Client sends the read request to a Chunk Server
• The Chunk Server returns the file content
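In code form, a hedged sketch of that read path (the master and chunk-server interfaces are hypothetical; only the fixed 64 MB chunk size comes from the slides):

    CHUNK_SIZE = 64 * 1024 * 1024   # every chunk is 64 MB

    def read(master, filename, offset, length):
        # Step 1 (metadata): ask the master which chunk holds this offset and
        # where its replicas live; requests are in terms of <file, chunk #>.
        chunk_index = offset // CHUNK_SIZE
        chunk_handle, replicas = master.lookup(filename, chunk_index)
        # Step 2 (data): read directly from a chunk server; file contents
        # never flow through the master.
        return replicas[0].read(chunk_handle, offset % CHUNK_SIZE, length)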
- 68. © Matthew Bass 2013
Basic Write Operation
[Figure: Client, Master, and three Chunk Servers]
• Client requests the locations of the primary and secondary replicas from the Master
• Master returns the locations; the Client caches them
• Client sends the data to write directly to the Chunk Servers
• The Chunk Servers apply the mutations
- 69. © Matthew Bass 2013
Reliability Mechanisms
• Master and chunk replication
• Rebalancing
• Stale replication detection
• Checksumming
• Garbage removal
- 70. © Matthew Bass 2013
Master Replication
• One active Master per cluster
• “Shadow” masters exist on other machines
– These shadows may perform limited functions (e.g. reads)
• The shadow monitors the operations of the active master
– Through the operations log
• Maintains contact with the Chunk Servers by polling
– Does this to keep track of data
• If the Master fails the shadow takes over
- 71. © Matthew Bass 2013
Data Replication/Rebalancing
• File system replicates chunks of data
• It stores data on different machines across different racks
– That way if a machine or rack fails another replica exists
• Master also monitors cluster as a whole
• It periodically rebalances the load across the cluster
– All chunk servers run at near capacity but never at full capacity
• Master also monitors each chunk to ensure data is current
– If not it’s designated as a stale replica
– The stale replica becomes garbage
- 72. © Matthew Bass 2013
Checksum
• In order to detect data corruption checksumming is used
• The system breaks each 64 MB chunk into 64 KB blocks
• Each block has its own 32-bit checksum
• The Master monitors the checksums for each block
• If a block’s checksum doesn’t match what the Master has on record,
the block is deleted and a new replica is created
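A hedged sketch of this scheme using Python's zlib.crc32 (a real 32-bit checksum, though the slides don't name GFS's exact algorithm):

    import zlib

    BLOCK_SIZE = 64 * 1024   # each 64 MB chunk is checksummed in 64 KB blocks

    def block_checksums(chunk):
        # One 32-bit checksum per 64 KB block of the chunk.
        return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk), BLOCK_SIZE)]

    def is_corrupt(chunk, recorded):
        # A mismatch means corruption: the block is deleted and re-replicated.
        return block_checksums(chunk) != recorded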
- 73. © Matthew Bass 2013
Failure Scenarios
• Let’s look at the following failure scenarios to
see what happens
– Client failure
– Corrupt disk
– Chunk server failure
– Master failure
- 74. © Matthew Bass 2013
Client Failure
• Client fails while file open
• Master recognizes this because lease expires
• File is placed in intermediate state where client can re-
activate lease
• After the intermediate state expires (~1 hour), the Master informs the
Chunk Servers that have blocks for that file to delete them
• Master removes all entries associated with file
• Chunk Server deletes blocks
- 75. © Matthew Bass 2013
Corrupt Disk
This is the case where a block becomes corrupted after writing.
Replica1 writes a checksum for every 64 KB in a parallel file.
Replica1 returns checksums along with the block during a read.
Client checks checksum when block returned
If there is an error then the Client:
• Retries the read from a different replica (Replica2)
• Informs the Master of the corrupt block on Replica1
Master:
• Allocates new replica for that block on Replica3
• Informs Replica2 with an existing replica to copy it to Replica3.
• Informs Replica1 with corrupted block to delete that block.
- 76. © Matthew Bass 2013
Chunk Server Failure
Master sends Heartbeat request to Chunk Server
• Active Replica responds with a list of block #, replica #s it has.
• Failed Replica does not respond
Master recognizes Replica’s failure.
Master maintains block #, replica # -> Chunk Server mapping from last Heartbeat.
Master queues all of the blocks replicated on the failed Chunk Server to generate an
additional replica of each.
The generation of an additional replica of Block A:
• Allocate a new replica on an active Chunk Server, say Replica1
• Instruct one of the Chunk Servers with a valid replica of Block A to copy it to Replica1.
- 77. © Matthew Bass 2013
Master Failure
• Backup Master maintains a copy of the log
• Responsible for creating checkpoint image
and trimming EditLog
• BackupNode takes over in case of Master
failure
• BackupNode may also fail
[Figure: Master and BackupNode share the EditLog and checkpoint image]
- 78. © Matthew Bass 2013
More about Master Structure
Four Threads:
• Main – perform file management operations.
• Ping/Echo – check on status of Chunk Servers and receive responses from Chunk
Servers
• Replica Management – manage new replica creation and replica deletion
• Lease Management – cancel leases when they expire. Queues replicas for
deletion for files whose client has failed.
Three Modes
• Normal operations
• Safe mode – when the Master is restarted, no new requests are accepted until a
percentage of Chunk Servers have reported their block allocations
• Backup – act as Master backup
- 79. © Matthew Bass 2013
Summary
• Relational databases are difficult to distribute efficiently
– Scalability can be problematic
• NoSQL databases offer an alternative
– Data is typically schema-less
• Aggregates of data that mirror primary use cases are
considered a unit of data
• Queries across nodes require an efficient mechanism for
aggregation
- 81. © Matthew Bass 2013
Topics
These are topics that have architectural implications
and do not fit neatly into one of the other lectures.
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
- 82. © Matthew Bass 2013
Zookeeper
• Zookeeper is intended to manage distributed
coordination
– Synchronization
– data
- 83. © Matthew Bass 2013
Distributed applications
• Zookeeper provides a guaranteed (mostly)
consistent data structure to every instance of a
distributed application.
– Definition of “mostly” is within eventual consistency
lag (but this is small). More on eventual consistency
later.
• Zookeeper deals with managing failure as well as
consistency.
– Done using Zab, a Paxos-like atomic broadcast algorithm.
• Zookeeper guarantees that service requests are
linearly ordered and processed in a FIFO order
- 84. © Matthew Bass 2013
Model
• Zookeeper maintains a file type data structure
– Hierarchical
– Data in every node (called znode)
– Amount of data in each node assumed small
(< 1 MB)
– Intended for metadata
• Configuration
• Location
• Group
- 85. © Matthew Bass 2013
Zookeeper znode structure
/        <data>
/b1      <data>
/b1/c1   <data>
/b1/c2   <data>
/b2      <data>
/b2/c1   <data>
- 86. © Matthew Bass 2013
API
Function     | Type
create       | write
delete       | write
exists       | read
get children | read
get data     | read
set data     | write
+ others
• All calls return atomic views of state – they either
succeed or fail; no partial state is returned.
Writes are also atomic: they either succeed or fail.
If they fail, there are no side effects.
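These calls map almost one-to-one onto a real client library. A hedged sketch with the kazoo Python client (the server address is an assumption):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    zk.create("/b1", b"config data", makepath=True)   # create       (write)
    print(zk.exists("/b1"))                           # exists       (read)
    print(zk.get("/b1"))                              # get data     (read)
    zk.set("/b1", b"new config data")                 # set data     (write)
    print(zk.get_children("/"))                       # get children (read)
    zk.delete("/b1")                                  # delete       (write)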
- 87. © Matthew Bass 2013
Example - Group membership
• Remember the load balancer. It has a list of
registered servers.
• The load balancer wants to know which of its
servers are
– Alive
– Providing service
• The list must be
– highly available
– Reflect failure of individual servers
• Strict performance requirements on list manager
- 88. © Matthew Bass 2013
Using Zookeeper to manage group
membership
• Load balancer on initialization
– connects to zookeeper
– Gets list of zookeeper servers
– Create session (if a zookeeper server fails – automatic failover)
• Load balancer issues a create("/Servers") call
– If it already exists the call returns a failure
• Servers register by creating /Servers/<my_IP>
• Load balancer can list the children of /Servers and get their IPs.
• A watcher will inform the load balancer if a server fails or leaves.
• Latency is low (order of microseconds) since
Zookeeper keeps its data structures in memory.
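A hedged sketch of this membership pattern with the kazoo Python client (the server addresses and example IP are assumptions; the ephemeral znode is what makes a failed server drop off the list automatically):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Load balancer: create the group node (the call fails if it already exists).
    if not zk.exists("/Servers"):
        zk.create("/Servers")

    # A server registers with an ephemeral znode: it disappears automatically
    # when the server's session dies, so failures update the list by themselves.
    zk.create("/Servers/10.0.0.7", ephemeral=True)

    # Load balancer watches the membership list; the callback fires on changes.
    @zk.ChildrenWatch("/Servers")
    def on_membership_change(server_ips):
        print("live servers:", server_ips)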
- 89. © Matthew Bass 2013
Other use cases
• Leader election
• Distributed locks
• Synchronization
• Configuration
- 90. © Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
- 91. © Matthew Bass 2013
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
- 93. © Matthew Bass 2013
Selected Cloud Outages - 2013
• July 10, Google down for 10 minutes
• June 18, Facebook down for 30 minutes
• Aug 14-17 Outlook.com offline for three days
• Aug 19, Amazon.com down for 40-45 minutes
• Aug 22, Apple iCloud down for 11 hours
• Aug 16, Google down for 5 minutes
• Sept 13, AWS down for ~two hours
• Nov 21, Microsoft services intermittent for ~2
hours
- 95. © Matthew Bass 2013
A year in the life of a Google
datacenter
• Typical first year for a new cluster:
– ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to
recover)
– ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come
back)
– ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
– ~5 racks go wonky (40-80 machines see 50% packetloss)
– ~8 network maintenances (4 might cause ~30-minute random connectivity
losses)
– ~12 router reloads (takes out DNS and external vips for a couple minutes)
– ~3 router failures (have to immediately pull traffic for an hour)
– ~dozens of minor 30-second blips for dns
– ~1000 individual machine failures
– ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, dead
horses, etc.
- 96. © Matthew Bass 2013
Amazon failure statistics
• In a data center with ~64,000 servers with 2
disks each
~5 servers and ~17 disks fail every day.
- 97. © Matthew Bass 2013
What does this mean for a
consumer of the cloud?
• You need to be concerned about “long tail”
distribution for requests due to piecewise
failure
• You need to be concerned about business
continuity due to overall failure.
- 98. © Matthew Bass 2013
Short digression into probability
• A distribution describes the probability that any given
reading will have a particular value.
• Many phenomena in nature are “normally distributed”.
• Most values will cluster
around the mean with
progressively smaller
numbers of values going
toward the edges.
• In a normal distribution
the mean is equal to the
median
- 99. © Matthew Bass 2013
Long Tail
• In a long tail distribution, there are some values
far from the mean.
• These values are sufficient to influence the mean.
• The mean and the
median are
dramatically
different in a long
tail distribution.
- 100. © Matthew Bass 2013
What does this mean?
• If there is a partial failure of the cloud some
activities will take a long time to complete and
exhibit a long tail.
• The figure shows
distribution of
1000 AWS
“launch instance”
calls.
• 4.5% of calls were
“long tail”
Operation | Mean | Median | STD | Max (sec)
EC2 launch instance | 27.81 | 23.10 | 25.12 | 202.3
- 101. © Matthew Bass 2013
What can you do to prevent long
tail problems?
• “Hedged” request. Suppose you wish to launch
10 instances. Issue 11 requests. Terminate the
request that has not completed when 10 are
completed.
• “Alternative” request. In the above scenario, issue
10 requests. When 8 requests have completed
issue 2 more. Cancel the last 2 to respond.
• Using these techniques reduces the time of the
longest of the 1000 launch instance requests
from 202 sec to 51 sec.
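A hedged sketch of the first technique with Python's concurrent.futures (launch_instance stands in for the real cloud API call): issue n+1 requests, keep the first n to complete, and abandon the straggler.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def launch_hedged(launch_instance, n):
        # Hedged request: issue n+1 requests and keep the first n to complete.
        pool = ThreadPoolExecutor(max_workers=n + 1)
        futures = [pool.submit(launch_instance) for _ in range(n + 1)]
        first_n = [f.result() for _, f in zip(range(n), as_completed(futures))]
        # Abandon the straggler; a real implementation would also terminate
        # the extra instance if it does eventually launch.
        pool.shutdown(wait=False)
        return first_n

The "alternative" variant has the same shape: submit n requests up front, submit 2 more once n-2 have completed, and keep whichever finish first.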
- 102. © Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
- 103. © Matthew Bass 2013
Business continuity
• Business continuity means that the business should
continue to provide service even if a disaster such as a
fire, flood, or cloud outage occurs.
• Two numbers characterize a business continuity
strategy
– RTO is the Recovery Time Objective – how long before the
service is available again
– RPO is the Recovery Point Objective – the point in
time that the system rolls back to, i.e. how much data can
potentially be lost
• Allows for cost/benefit trade offs.
• Many industries such as banks have compliance rules
that require business continuity policies and practices.
- 104. © Matthew Bass 2013
How does business continuity work?
• Replicate site in physically distant location.
• Recall DNS server with multiple sites
• If first site does not respond promptly, client
will try second site.
[Figure: DNS for Website.com lists Site 1 (123.45.67.89) and Site 2
(456.77.88.99); when Site 1 fails, the client tries Site 2.]
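On the client side the failover logic is small; a hedged sketch using the requests library (the site addresses come from the figure, the timeout is an illustrative choice):

    import requests

    SITES = ["http://123.45.67.89", "http://456.77.88.99"]  # Site 1, Site 2

    def fetch(path):
        for site in SITES:
            try:
                # If a site does not respond promptly, try the next one.
                return requests.get(site + path, timeout=2)
            except requests.RequestException:
                continue
        raise RuntimeError("all sites are down")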
- 105. © Matthew Bass 2013
What does it mean to “replicate
site”?
• Must have a parallel datacenter
• Data must be replicated within RPO
– If RPO is small or zero this implies DB replication
– If RPO is larger then can use other means to replicate data
• Software must also be replicated.
– Versions must be identical in both sites
• Using different versions in different sites may result in different
results.
• Configurations in two sites will be different but must yield the
same results.
• Replication of a site incurs costs. You may wish to
increase the RPO and just copy (back up) data to
another site.
- 106. © Matthew Bass 2013
Recall discussion about DNS servers
• There is a hierarchy of DNS servers.
• Local DNS servers are under the control of the
local organization.
• When a disaster happens, the new data center
can be made operative by changing the IP
address in the local DNS server.
- 107. © Matthew Bass 2013
What are the architectural
implications?
• State maintained in servers will be lost if a
disaster happens
• Dependencies other than configuration
parameters must be identical in a replicated
site.
• Applications must be architected to be
movable from one environment to another.
- 108. © Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
- 109. © Matthew Bass 2013
Dependencies
• There exist many different types of dependencies within a
system. E.g.
– Inter component
– Version
– Configuration parameters
– Hardware
– Location
– Names
– DB schemas
– Platform
– Libraries
• Inconsistency among these dependencies is a common
source of production time errors.
- 110. © Matthew Bass 2013
For example
• You develop some code on your desktop.
– You have installed the latest Java update
– You configure your code to use a Python script to do
some data cleansing
– You depend on a component that your colleagues are
simultaneously developing.
• You deploy your code into production.
– The latest Java version has not been installed.
– Python has not been installed in the production
environment.
– Your colleagues are delayed in their development.
- 111. © Matthew Bass 2013
You finally get your code into
production
• A user has a problem and calls the help desk.
• The help desk doesn’t know how to solve the
problem and escalates it back to you.
• You have gone on vacation.
- 112. © Matthew Bass 2013
Problems lead to a requirement for
a formal “release plan”
1. Define and agree release and deployment plans with customers/stakeholders.
2. Ensure that each release package consists of a set of related assets and service
components that are compatible with each other.
3. Ensure that integrity of a release package and its constituent components is maintained
throughout the transition activities and recorded accurately in the configuration
management system.
4. Ensure that all release and deployment packages can be tracked, installed, tested,
verified, and/or uninstalled or backed out, if appropriate.
5. Ensure that change is managed during the release and deployment activities.
6. Record and manage deviations, risks, issues related to the new or changed service, and
take necessary corrective action.
7. Ensure that there is knowledge transfer to enable the customers and users to optimise
their use of the service to support their business activities.
8. Ensure that skills and knowledge are transferred to operations and support staff to
enable them to effectively and efficiently deliver, support and maintain the service,
according to required warranties and service levels
*http://en.wikipedia.org/wiki/Deployment_Plan
- 113. © Matthew Bass 2013
Release planning is labor intensive
• Note the requirements for coordination in the release plan
• Each item requires multiple people and time consuming
activities.
– Time consuming activities delay introducing features included in
the release.
• Open questions
– Which items are dealt with through process?
– Which items are dealt with through tool support?
– Which items are dealt with through architecture design?
– Which items are dealt with through a combination of the
above?
• We will see an architecture designed to reduce team
coordination in a subsequent lecture.
- 114. © Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
- 115. © Matthew Bass 2013
What is a configuration parameter?
• A configuration parameter or environment
variable is a parameter for an application that
either controls the behavior of the application
or specifies a connection of the application to
its environment
– Thread pool or database connection pool size
control the behavior of the application.
– Database url specifies a connection of the app to
a database.
- 116. © Matthew Bass 2013
When are configuration parameters
bound?
• Recommended practice is to bind these at
initialization time for the app.
– App is loaded into an execution environment
– App is told where to find configuration parameters
through language, OS, or environment specific means.
E.g. the arguments passed to main in C
– App reads configuration parameters from the
specified location.
• The virtue of this approach is that an app can be
loaded into different execution environments and
doesn’t need to be aware of which environment
it is in.
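A minimal sketch of initialization-time binding in Python (the variable names and the in-memory test double are illustrative assumptions): the same code runs in unit test, integration test, or production because only the environment differs.

    import os
    import sqlite3

    class InMemoryFakeDb:
        """Unit-test double: fake data in memory, without full-DB overhead."""
        def __init__(self):
            self.rows = {}

    def make_db():
        # Read the parameter once, at initialization, from the execution
        # environment; the app never knows which environment it is in.
        url = os.environ.get("DATABASE_URL", "fake://")
        if url.startswith("fake://"):
            return InMemoryFakeDb()       # unit test
        return sqlite3.connect(url)       # sqlite3 stands in for a real driver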
- 117. © Matthew Bass 2013
Use DB as an example – Unit test
• App is given URL for database access
component.
• In the case of unit test, the database access
component is a component that maintains
some fake data in memory for fast access
without the overhead of the full DB.
- 118. © Matthew Bass 2013
Integration Test
• Test database is maintained for integration
testing.
• Test database has a subset of the full database.
• URL of test database is provided to App
• App can read or write test database
- 119. © Matthew Bass 2013
Performance testing
• Special database access component exists for
performance testing
– Passes through reads to production database
– Writes to mirror database
• App is given URL of special database access
component
• Allows testing with real data but blocks writes
to the real database
• Mirror database is checked at end of test for
correctness.
- 120. © Matthew Bass 2013
Other configuration parameters
• Other configuration parameters should be
identical from integration test through to
production.
• Reduces possibility of incorrect specification
of configuration parameters.
– Incorrect specification of configuration
parameters is a major source of deployment
errors.
- 121. © Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
- 122. © Matthew Bass 2013
Monitoring
• When is it done?
• Why is it done?
• What can you get from monitoring?
• Data sources – monitors/logs
- 123. © Matthew Bass 2013
What is monitoring?
• Monitoring is the collection of data from
individual or collections of systems during the
runtime of these systems.
• Isn’t this an operations problem and not an
architectural problem?
– No.
• Operators are first class stakeholders and their needs should
be considered when designing the system.
• In the modern world, difficult run time problems are solved
by the architect, so it’s to your advantage that the correct
information is available.
• Other reasons are implicit in the uses of monitoring
information which we are about to go into.
- 124. © Matthew Bass 2013
Why monitor?
1. Identifying failures and the associated faults both at
runtime and during post-mortems held after a failure has
occurred.
2. Identifying performance problems both of individual
systems and collections of interacting systems.
3. Characterizing workload for both short term and long
term billing and capacity planning purposes.
4. Measuring user reactions to various types of interfaces
or business offerings. We will discuss A/B testing later.
5. Detecting intruders who are attempting to break into
the system. (outside of our scope).
- 125. © Matthew Bass 2013
Basic metrics
• Per VM instance the provider will collect
– CPU utilization
– Disk read/writes
– Messages in/out
• These metrics are used for
– Charging
– Scaling
– Mapping utilization to workload
• Similar type of metrics for storage and utilities
• Can aggregate these metrics over autoscaling groups,
regions, accounts, etc.
- 126. © Matthew Bass 2013
Other metrics
• The problem with the basic metrics is that they are not
related to particular activities whether business or
internal.
• Other things to monitor
– Transactions – transactions per second gives the business
an idea of how many customers are utilizing the system.
– Transactions by type.
– Messages from one portion of the system to another.
– Error conditions detected by different portions of the
system
– … anything you want
- 127. © Matthew Bass 2013
How do I decide what to
monitor?
• Look at reasons for monitoring
– Failure detection
– Performance degradation
– Workload characterization
– User reactions
• For each reason,
– decide what symptoms you would like reported.
– Place responsibilities to detect symptoms in various modules.
– Decide on active/passive monitoring (discussed soon)
– Decide what constitutes an alarm (discussed soon)
– Logic should be under configuration control – levels of reporting
- 128. © Matthew Bass 2013
Metadata is crucial
• Data by itself is not that useful.
• It must be tagged with identifying information, including
a timestamp.
• For example
– VM CPU usage divided among which processes
– I/O requests to which disks triggered from which VM process
– Messages from which component to which other component in
response to what user requests.
• Ideal – each user request is given a tag and all monitoring
information generated as a consequence of satisfying that request is
tagged with the request ID.
• Other monitoring activities are tagged with an ID that
identifies why the activity was triggered.
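A hedged sketch of such tagging with Python's standard logging module (the field name and format are illustrative): every record produced while satisfying a request carries that request's ID.

    import logging
    import uuid

    logging.basicConfig(
        format="%(asctime)s request=%(request_id)s %(message)s",
        level=logging.INFO)
    log = logging.getLogger("app")

    def handle_request(user_request):
        # Tag the request once; every record it causes carries the same ID,
        # so effects can later be associated with their cause.
        tagged = logging.LoggerAdapter(log, {"request_id": uuid.uuid4().hex})
        tagged.info("request started: %s", user_request)
        tagged.info("db read triggered")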
- 129. © Matthew Bass 2013
Why this emphasis on metadata?
• Any of the uses enumerated for monitoring
data require associating effect with its cause.
• The monitoring data represents the effect.
• The metadata enables determining the cause.
- 130. © Matthew Bass 2013
Active/Passive
• Active data collection involves the component
that generates the data. It emits it periodically or
based on a triggering event
– To a key-value store
– To a file
– A message to a known location
• Passive data collection involves the component
that generates the data making it available to an
agent in the same address space. The agent emits
the data either periodically or based on events.
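A minimal sketch of active collection in Python (the metric names, interval, and file destination are illustrative assumptions): a background thread in the component periodically emits a timestamped snapshot to a known location.

    import json
    import threading
    import time

    METRICS = {"transactions": 0, "errors": 0}  # updated by the application

    def emit_forever(period_seconds=60):
        # Active collection: the component pushes a timestamped snapshot to a
        # known location (a local file here; a message or key-value store works too).
        while True:
            snapshot = {"ts": time.time(), **METRICS}
            with open("metrics.log", "a") as f:
                f.write(json.dumps(snapshot) + "\n")
            time.sleep(period_seconds)

    threading.Thread(target=emit_forever, daemon=True).start()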
- 131. © Matthew Bass 2013
Data Collection
• Whether active or passive data, the data is
emitted from a component to a known
location periodically or based on events.
[Figure: an agent in each system/application emits data to the Monitoring System]
- 132. © Matthew Bass 2013
Monitoring Systems
• Data collecting tools
– Nagios
– Sensu
– Icinga
– CloudWatch – AWS specific
- 133. © Matthew Bass 2013
Volumes of data
• It is possible to generate huge amounts of data.
• That is the purpose of data collating tools
– Logstash
– Splunk
• Features of such tools
– Collating data from different instances
– Visualization
– Filtering
– Organizing data
– Reports
- 134. © Matthew Bass 2013
Alarms
• An alarm is a specific message about some
condition needing attention.
– Can be e-mail, text, or on screen for operators.
• Problems with alarms
– False positives – an alarm is raised without
justification
– False negatives – justification exists but no alarm
is raised.
- 135. © Matthew Bass 2013
Summary
• Distributed coordination problems are simplified when
using a tool such as Zookeeper
• You must expect failure in the cloud and prepare for it.
• A disaster is when everything has failed and you need
to have business continuity plans
• Flexibility in the cloud is managed by setting
configuration parameters and they need to be
managed.
• Monitoring lets you know what is going on with your
system from whatever perspective you wish. But, you
must choose your perspective.