2. Link to Yesterday’s lectures
http://www.slideshare.net/lenbass/architecting-for-the-cloud-scabilityavailability
3. Topics
Scalability is about acquiring resources but once
they are acquired, they still must be used.
Elasticity is about how to use the resources.
This requires understanding
• Concurrency
• State
and their interactions
4. What is concurrency?
• Concurrency means
performing several
activities
simultaneously
• Concurrency is used
to improve
performance.
5. How do concurrent activities come to
be?
• Explicitly through your code creating a new
thread or process.
• Implicitly through some support system creating
a new thread or process
– Operating system
– Web server
– Database management system
• Implicitly through the infrastructure creating a
new virtual machine
– Elasticity in the cloud
– During deployment of your system
6. Key concepts
• Atomicity
– An atomic operation cannot be divided. It is all or nothing.
• Time
– It takes time to perform an operation.
• Computation
• Messages transferred over a network
• Reading/writing information from a disk (rotating or solid state)
• Dependency
– Coordination among concurrent activities is necessary if
they are sharing resources or results
• Problems arise because operations take time and can
be interrupted, i.e. they are not atomic.
7. Synchronous vs asynchronous
• Synchronous coordination between two
concurrent processes means that process A sends
a message to process B and waits for a response.
• Asynchronous coordination means that process A
does not wait for a response.
– It can poll for a response
– A response from process B can be sent as an event.
• In either case, coordination takes time and so
coordination is not an atomic operation.
8. Some problems with concurrent
activities
• Time stamps
– Many protocols put a time stamp on messages
for error detection and ordering purposes.
– Time stamps are often used to identify log messages used
for debugging problems.
– In some environments, e.g. the stock market, trades must be
satisfied in the sequence in which they arrive.
• Race conditions – two processes are simultaneously
accessing the same resource.
• Inconsistency – If two activities are being performed
simultaneously, data may become inconsistent.
9. Clock synchronization
• Suppose two different computers are connected via a
network. How do they synchronize their clocks?
• If one computer sends its time reading to another, it
takes time for the message to arrive.
• NTP (Network Time Protocol) can be used to
synchronize time on a collection of computers.
– Accurate to around 1 millisecond in local area networks
– Accurate to around 10 milliseconds over public internet
– Congestion can cause errors of 100 milliseconds or more.
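The difficulty noted above, that a time reading is already out of date when it arrives, is what NTP's exchange of four timestamps compensates for. The following is a minimal sketch in C of the standard offset and round-trip delay calculation, assuming the outbound and return trips take about the same time. The struct and function names are illustrative and do not come from any NTP library.

```c
#include <assert.h>

/* The four timestamps of one request/response exchange, in any
   consistent unit. t0 and t3 are read from the client's clock,
   t1 and t2 from the server's clock. */
struct ntp_sample {
    double t0;  /* client sends request    */
    double t1;  /* server receives request */
    double t2;  /* server sends reply      */
    double t3;  /* client receives reply   */
};

/* Time spent on the network: total elapsed time on the client
   minus the processing time on the server. */
double round_trip_delay(struct ntp_sample s)
{
    assert(s.t3 >= s.t0 && s.t2 >= s.t1);
    return (s.t3 - s.t0) - (s.t2 - s.t1);
}

/* Estimated amount to add to the client's clock to match the server's.
   Exact only if both directions take the same time. */
double clock_offset(struct ntp_sample s)
{
    return ((s.t1 - s.t0) + (s.t2 - s.t3)) / 2.0;
}
```

For example, with a client clock that is 50 ms behind the server and 10 ms of network time in each direction, the timestamps {0, 60, 61, 21} (in ms) give an offset of 50 and a round-trip delay of 20. When the two directions are not symmetric, the offset estimate can be wrong by up to half the round-trip delay, which is one reason congestion degrades accuracy.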
10. Suppose NTP is insufficiently accurate
• Financial industry is spending 100s of millions of dollars to
reduce latency between Chicago and New York by 3
milliseconds.
– Well within error range of NTP
• GPS time is accurate within
– 14 nanoseconds (theoretically)
– 100 nanoseconds (mostly)
• Timestamp messages with GPS time
– Used by electric companies to measure phase angle
– Used by Google to coordinate time across all of their distributed
systems.
– Requires specialized hardware and installation not yet cheaply
available.
11. Example of a race condition
• Suppose withdrawals are being made from a bank
account. If there are two users simultaneously
withdrawing, the following sequence can occur.
User 1                 | User 2                 | Acct amount
                       |                        | 1000
Read account (1000)    |                        | 1000
                       | Read account (1000)    | 1000
Withdraw 100 (900)     |                        | 1000
Write new amount (900) |                        | 900
                       | Withdraw 100 (900)     | 900
                       | Write new amount (900) | 900
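The interleaving above can be replayed deterministically to see the lost update: two withdrawals of 100 are made, yet the balance drops by only 100. This is a minimal single-threaded sketch in C; the `account` variable and the helper names are hypothetical stand-ins for the bank's stored balance and its read/write operations.

```c
#include <assert.h>

/* Hypothetical in-memory account; stands in for the stored balance. */
static int account = 1000;

static int read_account(void) { return account; }

static void write_account(int amount)
{
    assert(amount >= 0);
    account = amount;
}

/* Replays the interleaving from the table: both users read the
   balance before either of them writes. Returns the final balance. */
int replay_lost_update(void)
{
    account = 1000;
    int user1_copy = read_account();   /* User 1 reads 1000 */
    int user2_copy = read_account();   /* User 2 reads 1000 */
    user1_copy -= 100;                 /* User 1 computes 900 */
    write_account(user1_copy);         /* account is now 900 */
    user2_copy -= 100;                 /* User 2 also computes 900 */
    write_account(user2_copy);         /* account is still 900 */
    return account;
}

/* The same two withdrawals, but each read-modify-write sequence
   runs to completion before the next one starts. */
int replay_serialized(void)
{
    account = 1000;
    write_account(read_account() - 100);   /* User 1: 1000 -> 900 */
    write_account(read_account() - 100);   /* User 2: 900 -> 800 */
    return account;
}
```

`replay_lost_update()` ends with 900 while `replay_serialized()` ends with 800. The only difference is whether the read, the computation, and the write of one user are allowed to interleave with those of the other.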
12. Example of inconsistency
• A cache is frequently used to keep data locally rather
than requiring it to be fetched for each request. Web
browsers, for example, cache web pages.
• For every request, the sequence is
1) Look in the cache to see if the request can be satisfied with the
contents of the cache
2) If not, then retrieve the information, return it to the
requester, and place it in the cache.
• Now suppose the web page is changed at its source
• Retrievals of the web page from the cache will retrieve
an out of date version of the web page.
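The stale read described above can be sketched in a few lines of C. The `origin` and `cache` buffers are hypothetical stand-ins for the page at its source and the locally cached copy. The explicit invalidation function is an assumption added for illustration; it is one possible remedy and is not something the slides prescribe.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins: `origin` is the page at its source,
   `cache` is the local copy kept by the requester. */
static char origin[64] = "version 1";
static char cache[64];
static int cache_valid = 0;

/* Steps 1 and 2 of the sequence: serve from the cache if possible,
   otherwise retrieve from the source and remember the result. */
const char *fetch_page(void)
{
    if (!cache_valid) {
        strcpy(cache, origin);
        cache_valid = 1;
    }
    return cache;
}

/* The page changes at its source; the cache is not told. */
void update_origin(const char *new_content)
{
    assert(strlen(new_content) < sizeof origin);
    strcpy(origin, new_content);
}

/* One possible remedy: explicitly discard the cached copy. */
void invalidate_cache(void)
{
    cache_valid = 0;
}
```

After `update_origin("version 2")`, a call to `fetch_page()` still returns "version 1" until `invalidate_cache()` is called. Deciding when and how to invalidate is where the consistency problem actually lies.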
13. Solutions bring new problems
• One technique to prevent race conditions is to
lock critical resources.
• Can lead to deadlock – two processes waiting for
each other to release critical resources
– Process one gets a lock on row 1 of a database
– Process two gets a lock on row 2.
– Process one waits for process 2 to release its lock on
row 2
– Process two waits for process 1 to release its lock on
row 1
– No progress.
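The deadlock sequence above can be replayed with a small lock table. This is a single-threaded sketch in C with a hypothetical non-blocking `try_lock`, so that "would have to wait" shows up as a return value instead of a hung program. The second replay uses a fixed lock order, a common way to avoid this particular deadlock that is added here as an illustration.

```c
#include <assert.h>

#define NO_OWNER 0
#define NUM_ROWS 3              /* rows 1 and 2 are used; index 0 is unused */

/* Hypothetical lock table: lock_owner[r] is the process holding row r. */
static int lock_owner[NUM_ROWS];

void reset_locks(void)
{
    for (int r = 0; r < NUM_ROWS; r++)
        lock_owner[r] = NO_OWNER;
}

/* Non-blocking acquire: returns 1 on success,
   0 if the caller would have to wait. */
int try_lock(int row, int process)
{
    assert(row > 0 && row < NUM_ROWS && process != NO_OWNER);
    if (lock_owner[row] != NO_OWNER && lock_owner[row] != process)
        return 0;
    lock_owner[row] = process;
    return 1;
}

/* Replays the sequence from the slide. Returns how many of the two
   processes obtained both rows (0 means neither can proceed). */
int replay_opposite_order(void)
{
    reset_locks();
    int p1_has_row1 = try_lock(1, 1);
    int p2_has_row2 = try_lock(2, 2);
    int p1_has_row2 = try_lock(2, 1);   /* blocked by process 2 */
    int p2_has_row1 = try_lock(1, 2);   /* blocked by process 1 */
    return (p1_has_row1 && p1_has_row2) + (p2_has_row2 && p2_has_row1);
}

/* Same workload, but both processes lock row 1 before row 2. */
int replay_same_order(void)
{
    reset_locks();
    int p1_ok = try_lock(1, 1) && try_lock(2, 1);   /* process 1 gets both */
    int p2_ok = try_lock(1, 2) && try_lock(2, 2);   /* process 2 must wait */
    return p1_ok + p2_ok;
}
```

With opposite lock orders neither process obtains both rows. With a shared order, process 1 obtains both, can finish, and can release them, after which process 2 can proceed.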
14. Yet more problems
• Locks are logical structures maintained in
software or in persistent storage.
• Getting a lock across distributed systems is not an
atomic operation.
– It is possible that while requesting a lock another
process can acquire the lock. This can go on for a long
time (it is called livelock if there is no possibility of
ever acquiring a lock)
• Suppose the virtual machine holding the lock
fails. Then the owner of the lock can never
release it.
15. Is there a solution?
• The general problem is that you want to manage
synchronization of data across a distributed set of
servers while tolerating the failure of fewer than half of them.
• Paxos is a family of algorithms that use consensus to
manage state concurrency. Complicated and difficult to
implement.
• An example of the problems
– Choose one server as the master that keeps the
“authoritative” state.
– Now the master server fails. Need to
• Find a new master
• Make sure it is up to date with the authoritative state.
16. Luckily
• Several open source systems are now available
that
– Implement Paxos or an alternative consensus
algorithm
– Are reasonably easy to use.
• Two such systems are
– Memcached – discussed at the end of this lecture
– Zookeeper – discussed in tomorrow’s lecture.
17. In general
• Introducing concurrency will improve
performance but also introduces problems.
• Concurrency is a constant consideration when
architecting for the cloud.
– Coordinating activities across concurrent processes is
difficult and prone to many errors.
– Allowing for failure complicates coordination of
activities.
• Systems are available to provide concurrency for
small amounts of data without your having to
worry about the details.
18. Topics
In order to understand how to achieve elasticity
you must understand
• Concurrency
• State
and their interactions
19. Recall Load Balancer
• Client makes a request that is routed to a
server through a load balancer
22. Message sequence – request is sent to
one server
[Figure: Clients, Load Balancer, Servers]
23. Message sequence – reply goes back
to client
[Figure: Clients, Load Balancer, Servers]
24. Message sequence – now client makes second
request – does it matter which server it goes to?
[Figure: Clients, Load Balancer, Servers; the destination of the second request is marked ???]
25. “Sticky” http requests
• Normally load balancer will route requests
depending on load of servers attached to it.
• This is why it is called “load balancer”
• Client can request to be always routed to same
server. This is done by making a “sticky”
http request.
• Dangerous for two reasons:
– Server may be overloaded and response delayed
– Server may have failed and no response is
forthcoming.
• We assume non sticky http requests.
26. Suppose message is routed to an
arbitrary instance.
• Understanding what happens requires a
digression into state.
• A computation has two inputs
– Instructions
– Data
• The data input of a computation is called the
state.
27. How does this work with functions?
• Consider a function that counts how many times it is called.
• Option 1:
int countv1()
{
    static int i = 0; //declare i once; its value persists across calls
    i = i + 1; //add 1 to the last value of i
    return i;
}
• The function countv1 remembers i from one call to the next.
• State is maintained inside the function – it is stateful
28. Option 2
int countv2(int i)
{
int a;
a = i + 1; //add 1 to the last value of i
return a;
}
• The function countv2 does not remember the value of i
from one call to the next.
• The client must pass the last value returned.
• State is passed into the function. The function is stateless
29. Option 3
int countv3()
{
    int a;
    a = dbase_get("count"); //retrieve current value
    a = a + 1; //add 1 to the last value of a
    dbase_write("count", a); //save current value
    return a;
}
• The count is stored in a database.
• Neither the client nor the function remembers the value.
• The function is stateless.
30. What is the difference?
• In option 1, the function kept track of the
count value.
• In option 2, the client must keep track of the
count value.
• In option 3, the count value is kept in an
external database.
• In each case, the state (count value) must be
kept somewhere.
31. Suppose the functions are packaged as
processes in virtual machines
[Figure: three processes, Countv1 (Option 1), Countv2 (Option 2), and Countv3 (Option 3), with a Client and a DB]
32. Processes communicate via messages
• Message from client to process is a call
• Message from process back to client is the return
of a value
33. Now suppose each process has two
clients – what is computed by option
1?
[Figure: two clients calling Countv1]
36. Where state is kept matters
• Option 1 – counts number of times called by
either client. Process remembers value
• Option 2 – counts number of times called by
each client. Client remembers value
• Option 3 – counts number of times called by
either client. Database remembers value.
Options 1 & 3 calculate different things than
option 2.
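The difference between the options can be checked by letting two clients each make two calls. This is a sketch in C that assumes countv1 keeps its counter in a static variable, and that uses a plain variable, `db_count`, as a stand-in for the external database of option 3. The `results` struct and the driver function are illustrative names.

```c
#include <assert.h>

/* Option 1: state lives inside the process (static variable). */
int countv1(void)
{
    static int i = 0;
    i = i + 1;
    return i;
}

/* Option 2: the caller passes in the last value it received. */
int countv2(int i)
{
    return i + 1;
}

/* Option 3: state lives in an external store. `db_count` stands in
   for the database; a real system would go over the network. */
static int db_count = 0;

int countv3(void)
{
    db_count = db_count + 1;
    return db_count;
}

/* What the two clients see after each has made two calls. */
struct results {
    int v1_last;        /* last value returned by option 1 */
    int v2_client_a;    /* value held by client A under option 2 */
    int v2_client_b;    /* value held by client B under option 2 */
    int v3_last;        /* last value returned by option 3 */
};

/* Intended to be run once: options 1 and 3 keep their state
   between invocations of this driver. */
struct results two_clients_two_calls_each(void)
{
    struct results r = {0, 0, 0, 0};
    for (int call = 0; call < 2; call++) {
        r.v1_last = countv1();                     /* client A */
        r.v1_last = countv1();                     /* client B */
        r.v2_client_a = countv2(r.v2_client_a);    /* client A */
        r.v2_client_b = countv2(r.v2_client_b);    /* client B */
        r.v3_last = countv3();                     /* client A */
        r.v3_last = countv3();                     /* client B */
    }
    /* Both clients made the same number of calls under option 2. */
    assert(r.v2_client_a == r.v2_client_b);
    return r;
}
```

Options 1 and 3 both end at 4, the total number of calls by either client, while under option 2 each client holds 2, its own number of calls.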
37. Now suppose each process has two
instances – remember the load balancer
[Figure: two instances of Countv1 behind a load balancer]
Load balancer distributes messages to servers
41. Now what do the options compute?
• Option 1 – each instance of the function
countv1 computes how many times it was
invoked
• Option 2 – each instance of the function
countv2 computes how many times each
client invoked either instance
• Option 3 – the database contains the number
of times either instance was invoked by either
client.
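With two instances behind a load balancer, the reply of a stateful server depends on which instance happens to receive the request. The sketch below, in C, models this with a round-robin load balancer. Round-robin is an assumption made to keep the example deterministic (a real load balancer routes by load), and all names are hypothetical.

```c
#include <assert.h>

#define NUM_INSTANCES 2

/* Each server instance has its own private memory. */
struct instance {
    int local_count;            /* option 1: state kept in the server */
};

static struct instance servers[NUM_INSTANCES];
static int shared_db_count;     /* option 3: stand-in for the database */
static int next_server;         /* round-robin position of the balancer */

void reset_system(void)
{
    for (int s = 0; s < NUM_INSTANCES; s++)
        servers[s].local_count = 0;
    shared_db_count = 0;
    next_server = 0;
}

/* Hypothetical load balancer: hands requests to the instances in turn. */
static struct instance *route(void)
{
    assert(next_server >= 0 && next_server < NUM_INSTANCES);
    struct instance *chosen = &servers[next_server];
    next_server = (next_server + 1) % NUM_INSTANCES;
    return chosen;
}

/* Option 1 behind the load balancer: the reply depends on
   which instance answers. */
int request_stateful(void)
{
    struct instance *s = route();
    s->local_count += 1;
    return s->local_count;
}

/* Option 3 behind the load balancer: the chosen instance holds
   no state, so any instance gives the same answer. */
int request_with_database(void)
{
    (void)route();
    shared_db_count += 1;
    return shared_db_count;
}
```

Four consecutive calls to `request_stateful()` return 1, 1, 2, 2, because each instance counts only the requests it received. Four calls to `request_with_database()` return 1, 2, 3, 4 regardless of routing.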
42. What have we seen?
• When there was one instance of a client and
one instance of the count process, all three
versions were identical
• When there were two clients and one instance
of the count process, two versions were the
same, one was different
• When there were two clients and two
instances of the count process, all three
versions produced different results.
43. Message so far
• How state is managed is important and will
lead to different results when there are
multiple instances of clients or functions.
• Now we return to elasticity
• Remember the sequence?
46. Message sequence – request is sent to
one server
[Figure: Clients, Load Balancer, Servers]
47. Message sequence – reply goes back
to client
[Figure: Clients, Load Balancer, Servers]
48. Message sequence – now client makes second
request – does it matter which server it goes to?
[Figure: Clients, Load Balancer, Servers; the destination of the second request is marked ???]
49. It depends where state is kept
• If state is kept in the client, then it does not
matter since the client keeps track of the calls
• If state is kept in a database then it does not
matter since the results are kept external to
the servers
• If state is kept in the server then it does
matter since sending message back to server 1
will give different result than sending it to
server 2.
50. Keeping servers stateless enables
elasticity
• A new instance of a server can be
– Created/stopped
– Registered/unregistered with the load balancer
– Placed in/removed from service
• This can be done without
– Requiring the client to be aware of which server
instance it is interacting with
– Requiring that clients be notified if a server is taken
out of service
51. Types of State
• Session state
• Client side state
• Server side
• Persistent
52. What is a session?
• A session typically refers to a series of
interactions between one client login to a
system and the termination of that login –
whether through logging out or through
timing out.
• A session can also span multiple logins. E.g.
Netflix keeps track of where you are in a
movie and returns you to that location the
next time you log in.
53. Session State
• Session state is information that persists for a
session. We are considering a single login here.
The multiple login case is a special case of
persistent state.
• What happens when you log in
– When you successfully log in to a service, the service
returns a code that identifies you. This is the session
ID.
– Other information can also be included such as MAC
address (to prevent man in the middle attacks).
– It is typically managed on the client side. Your
browser does all of this.
54. Client Side State
• This can be difficult if there is significant state
to save, however
– This means you’ll need to pass all of this state with
each request
– This requires more network overhead
• This also means you’ll need to store data on
the client machine
– This can have security implications
55. Stateful Services
• If your services are stateful that makes
scalability more difficult
• If you’re able to design your system such that
the services are stateless you’ll make scaling
much easier
• If an operation is dependent on the results of a
previous operation it’s more difficult to make
services stateless
56. Management of state between
services and persistent tier
• Non client side state can be either kept in the
services or in a persistent store.
• The choice depends on the volume of data,
the latency involved, the synchronization
needs for the servers and the time the state is
expected to persist.
57. Important latency numbers
• Main memory reference                  100 ns
• Send 1K bytes over 1 Gbps network      0.01 ms
• Read 4K randomly from SSD              0.15 ms
• Read 1 MB sequentially from memory     0.25 ms
• Round trip within same datacenter      0.5 ms
• Read 1 MB sequentially from SSD        1 ms    (4x memory)
• Disk seek                              10 ms   (20x datacenter round trip)
• Read 1 MB sequentially from disk       20 ms   (80x memory, 20x SSD)
• Send packet CA->Netherlands->CA        150 ms
* dean-keynote-ladis2009_scalable_distributed_google_system
58. Implications of latency numbers
• State stored in persistent storage (disk or SSD) will
take longer to fetch than state stored in memory.
• State stored in a different datacenter will take longer
to access than state stored locally, especially across
continents.
• Persistent store is typically replicated both for
performance (latency) reasons and for availability
(failure) reasons.
• => keeping data consistent across different
occurrences of it is important but difficult.
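A back-of-the-envelope calculation makes these implications concrete. The sketch below, in C, takes a few figures from the latency table (converted to microseconds) and adds one network round trip to the time needed to read the state. The function name and the simple additive model are assumptions; the model ignores queuing, seek time, and transfer time, so the results are rough estimates only.

```c
#include <assert.h>

/* Figures from the latency table, in microseconds. */
enum {
    READ_1MB_MEMORY_US    = 250,
    DATACENTER_RTT_US     = 500,
    READ_1MB_SSD_US       = 1000,
    READ_1MB_DISK_US      = 20000,
    INTERCONTINENT_RTT_US = 150000
};

/* Rough time to fetch `megabytes` of state that sits behind one
   network round trip: round-trip time plus sequential read time.
   Pass rtt_us = 0 for state held in the local process. */
long fetch_time_us(long rtt_us, long read_1mb_us, long megabytes)
{
    assert(rtt_us >= 0 && read_1mb_us >= 0 && megabytes >= 0);
    return rtt_us + read_1mb_us * megabytes;
}
```

Under these assumptions, 1 MB of state costs about 250 µs from local memory, 750 µs from the memory of another server in the same datacenter, 20,500 µs from a disk in the same datacenter, and 150,250 µs from memory on another continent. Across continents the location of the state dominates; within a datacenter the storage medium does.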
59. Topics
In order to understand how to achieve elasticity
you must understand
• Concurrency
• State
and their interactions
60. Keeping data consistent
• We will discuss persistent data consistency
when we discuss databases.
• Memcached is an open source tool that
provides in-memory synchronization of data
across different instances of a service.
61. Layers of a service
[Figure: two layers, business logic for the service and Memcached]
• Now consider these layers deployed onto
multiple servers.
62. Memcached in multiple servers
• Memcached keeps a small amount of state in all
servers consistent.
• At a small cost in latency as long as they are in
the same physical location.
[Figure: two servers, each with business logic and Memcached]
63. When to use Memcached
• Data must be synchronized among servers.
• Memcached takes care of concurrency issues
• Data is relatively small
– One object < 1MB
– Total memory used per server depends on how much
you are willing to give it per server since it is stored in
memory, not on a persistent store
• Lifetime of the data should not exceed time any
of the servers are alive. I.e. if all the servers die,
then the data disappears.
64. Summary
• The cloud doesn’t guarantee elasticity
• You’ll need to design your system to be elastic
• State management, your storage solution, and
consistency, are all factors that you’ll need to
consider
67. Agenda
• What is security?
• Understanding the threat
• Architectural approaches to security
• Designing for security
• Summary
69. Your Experience
• Think about your past experience
– How have you thought about security?
– What steps have you (or your organization) taken
to protect the system?
• Do you remember Assignment 2?
– Security was equivalent to having a login feature
or encryption
70. Security … What is it?
• What do we mean when we say security?
• In your experience what does this mean?
72. Fort Knox
• Fort Knox is a US Army post in Kentucky
• In addition to housing various US Army
functions it is also the home to a gold bullion
depository
– 5000+ tons of gold housed there
73. Security
• What is the business asset that needs
protection in this case?
• What does protect mean here?
74. What About the CIA?
• The Central Intelligence Agency (CIA) is a US civilian
intelligence organization
• Primary purpose is to collect information about
foreign governments, corporations, and individuals
• It uses this information to influence public
policymakers
– It does at times engage in tactical operations as well
75. Security
• What is the business asset that needs
protection?
• What does protect mean in this case?
77. Business Context
• The business need differs from one context to another
• Organizations have assets they need to protect
• They need to protect these assets for different reasons
– Business continuity
– Liability reasons
– Regulation
– Protection of IP
– …
78. Security – A Set of Concerns
• The related concerns are typically classified as
“security” concerns
• In software these concerns are typically:
– Confidentiality
– Data integrity
– Non repudiation
– Availability
79. Confidentiality
• The property that reflects the extent to which:
– Data and services are only available to those that
are authorized to access them
• Is this a concern for a Museum? How about a
Financial Institution?
80. Integrity
• This property can also refer to data or services
• It reflects the extent to which data or services can
be delivered as intended
• E.g. hopefully the grade that we have recorded
for you in this course is correct …
81. Non Repudiation
• Nonrepudiation refers to the ability to guarantee
that the sender cannot later repudiate or deny
having sent the message
• It can also refer to the guarantee that the recipient
cannot later deny having received the message
• When might this be important?
82. Availability
• This is the property that reflects the extent to
which the system will be available for
legitimate use
• A denial of service attack is meant to disrupt
the availability of a system
83. Protection Against What?
• Now that we understand the business asset,
what are we protecting against?
• In order to appropriately protect our system
we need to understand the threat
• Let’s look at example exploits …
84. Agenda
• What is security?
• Understanding the threat
• Architectural approaches to security
• Summary
86. Who is Leveraging These Techniques?
• The art of hacking has gone from an individual
activity to a highly coordinated and sophisticated
effort
– It can now be quite lucrative as well
• Today many legitimate and illegitimate organizations
routinely launch attacks
– Just run a port scan detector on your system
• Let’s look at the progression of exploits
87. Progression of Exploits
• Mischievous individuals:
– The first generation of hackers were technical youth performing mischievous acts
• Revenue generation: a proof of concept
– These were the first examples of hacking for money
– Still small scale
• Organized crime
– These were criminal organizations involved in larger scale criminal activity
• Widespread adoption
– The infrastructure needed to launch Cyber attacks is now widespread
– The barrier to entry has been lowered
– Legitimate entities enter the game
• Advanced persistent threats
88. Hackers – First Generation
• In the 1990s hackers were by and large not
malicious
• They were in it for the challenge
• Notable hackers
– Kevin Mitnick
– Chen Ing-Hau
– Jeffrey Lee Parson
– Sven Jaschan
89. Kevin Mitnick
• Broke into dozens of computer networks
– Pac Bell
– DEC
– MCI
– Digital
– …
• Wasn’t in it for financial gain
• Largely used “social engineering” techniques
• Arrested twice: in 1988 and again in 1995
91. Mitnick’s Techniques
• Largely used “social engineering” to gain
access to passwords and insider information
• Used this information to gain access to target
system
• Mitnick claims that he never “hacked” a
system (still a point of controversy)
92. Chen Ing-Hau
• University student who created and released the CIH
virus in 1998
– Wrote the virus to “make a fool of the software vendors”
• Virus that would render the computer essentially
inoperable on a specified date
• Became one of the most widespread viruses
• Some versions of this virus have shown up multiple
times
93. CIH Virus
• Exploited vulnerability in Windows 95, 98, &
ME
– Along with an issue in various BIOS chipsets
• Would overwrite the first megabyte of the
hard drive and attempt to overwrite the flash BIOS
• Result rendered the PC inoperable
94. Jeffrey Lee Parson
• Was 18 when he confessed to being the creator of
the Blaster worm
• A Chinese “cracking” collective reverse engineered a
MS patch
• Parson created a worm to exploit a buffer overflow
issue
• Affected DCOM’s RPC service
– Worm could spread without users opening an attachment
95. Blaster Worm
• In addition to changing RPC service it would
– Change registry to launch msblast.exe
• Worm would launch a distributed denial of
service attack from infected computers
– Attack was against windowsupdate.com
• Sent messages to Bill Gates
96. Sven Jaschan
• Authored Sasser and Netsky worms
• Claims to have written them to remove
Mydoom and Bagle worms
• Worms were responsible for 70% of the
infections in 2004
97. Netsky
• Sent out as an email attachment
• Contained insults aimed at the author of
Mydoom and Bagle
• Other symptoms included “beeping” in the
early morning hours of specific dates
98. Sasser
• Would connect to computers through a
particular port that was often open by default
• Exploited a buffer overflow
• Would shut the computer down after
displaying a shutdown timer
99. Cyber Criminals – Proof of Concept
• After the turn of the century a new breed emerged
• They took the techniques employed by the
mischievous youth and used them for monetary gain
• These were the first real “cyber criminals”
– Farid Essebar
– Atilla Ekici
– Jeanson James Ancheta
100. Farid Essebar & Atilla Ekici
• The two people behind the Zotob computer worm
• Worm affected CNN, ABC News, NY Times, US
Dept of Homeland Security, …
• Intention was to facilitate credit card forgery
scams
101. Zotob
• Exploited vulnerability in Windows 2000
• Caused the computer to restart continuously
• Files would be created with every reboot
• Spyware was installed on the system
– The spyware remained after the virus was removed
• The goal was to facilitate scams (for money)
102. Jeanson James Ancheta
• First person to be arrested for controlling a
large number of hijacked computers
• Created a large Botnet
– Network of bots or “software robots”
• Offered his collection of bots for hire
• Leveraged rxbot to increase his network
103. Rxbot
• Contained a proxy server
• Server can be spawned by a remote attacker
• Typically used for denial of service attacks
104. Cyber Gangs
• “Organized” crime gets involved
• Coordinated attacks against high value targets
• Often involve groups and large sums of money
• Examples
– Yaron Bolondi
– Maria Zarubina
– Albert Gonzalez
105. Yaron Bolondi
• Part of a gang that attempted to steal £220
million from a Japanese bank
• Used keylogging to gain access to the bank’s
computers
• Software was installed on employees’ computers
– Via malware or other virus
106. Maria Zarubina
• Part of a gang that used cyber attacks as a means for
extortion
• Attacked British “bookmakers”
– Agreed to stop attacks if ransom was paid
• Used denial of service attacks to shut down gambling
sites
• Would then threaten additional attacks unless
payment was made
107. Albert Gonzalez
• Responsible for largest credit card theft in history
• Stole and resold more than 170 million cards
• Used SQL injection to introduce “malware
backdoors”
– These allowed packet sniffing attacks
• Targets included Target, TJ Maxx, Dave & Buster’s, 7-Eleven,
JC Penney, …
108. ARP Spoofing
• Used to attack an Ethernet network
• Allows attacker to “sniff” data on a LAN and modify
or stop the traffic
• Attacker sends a spoofed ARP message to Ethernet
LAN
• “Man in the middle” attack
– Attacker’s computer masquerades as destination computer
and gets intended traffic
109. Advanced Persistent Threat
• Today we’ve started to see a new class of threat
emerge
• These threats are against specific high value targets
• They are characterized by coordinated activity taking
place over a long period of time
– The individual actions may seem isolated
• The perpetrator doesn’t act on the exploit until
sufficient penetration has been achieved
• Has anyone heard of Stuxnet?
• How about Gauss or Flame?
110. Software as a Weapon
• In 2010 Iran announced they put their nuclear
program on hold
– No one was sure why
• It turns out the reason was that more than 1000
centrifuges in their uranium enrichment facilities
were destroyed
• How were these centrifuges destroyed?
– By the first known weapon that was 100% software
111. Stuxnet
• Stuxnet was a worm that infected SCADA systems
made by Siemens
– Think power plant and power distribution control systems
• It was capable of
– Increasing the pressure inside nuclear reactors
– Switching off oil pipelines
• Additionally it would report that the systems were operating
normally
112. Sophisticated Attack
• Why is Stuxnet special?
• First, it didn’t use forged digital certificates
– It used genuine certificates that were stolen
• Second, it had a specific target
– It infected many systems worldwide but remained dormant until
it found the systems controlling the intended target
• Third, it exploited not 1, but 4 zero day vulnerabilities
113. Response
• Iran responded to the attack with an open call
for hackers to join the Iranian Revolutionary
Guard
• Iran now has reportedly amassed the 2nd
largest online army in the world
114. Side Note
• Stuxnet is now open source
• This is code that is capable of crashing power
plants and disrupting oil pipelines
• Go to YouTube and search for Stuxnet
– You’ll get many videos of people dissecting
Stuxnet …
115. Advanced Persistent Threats
• Stuxnet is an example of what we call “Advanced
Persistent Threats”
• In some cases exploits are not opportunistic
reactions to discovering a vulnerability
• They are coordinated multipronged attacks that can
take place over an extended period of time
116. Coordinated Attack
• Intruders will look for some way to find access
to a system
• They will then try to move laterally until they
are able to access the intended target
• This can take days, weeks, months, or even
years
118. What’s the Point?
• Almost all of these incidents exploited
vulnerabilities
• These vulnerabilities came along with the
commercially available software used in the
attacked systems
• Vulnerabilities continue to exist in the
software that we use
119. Vulnerabilities
• Many organizations (legitimate and illegitimate) try to find these
vulnerabilities
– CERT is an example of such an organization
• Organizations like CERT would inform the developers of the
software of the vulnerability
• Historically companies were slow to react
• CERT didn’t want to release it publicly without a fix being available
• So CERT would notify the organization and then release the
vulnerability publicly after a given time elapsed
120. X Day Vulnerabilities
• Vulnerabilities are characterized by the time since
they were made public
– 1 day vulnerabilities were released 1 day ago
• The newer the vulnerability the less likely it is to be
patched
• Zero day vulnerabilities are those that the
manufacturer doesn’t yet know about
– Clearly these are the most attractive to attackers
121. Vulnerability Market
• A market has emerged for these vulnerabilities
• If you discover a vulnerability you can sell it
• The value of the vulnerability is determined by:
– The “day” of the vulnerability
– The number of instances of the software containing the
vulnerability
122. Selling The Vulnerability
• Many entities buy these vulnerabilities
– Governments (including the US)
– Organized crime syndicates
– Individuals
• Prices range from $10 - $250,000 or more
– Depending on the exclusivity of the sale as well as the value of the exploit
• Check out:
– http://www.forbes.com/sites/andygreenberg/2012/03/23/shopping-for-zero-days-an-price-list-for-hackers-secret-software-exploits/
– http://www.zdnet.com/blog/security/black-market-for-zero-day-vulnerabilities-still-thriving/2108
123. Exploit Auction Houses
• There are now auction houses that sell
vulnerabilities (or exploits)
– Like the eBay of exploits
– In fact exploits were originally sold on eBay
• It’s actually legal to sell these exploits
– Even though the attacks themselves may be
illegal
124. Exploit as a Service (EaaS)
• Believe it or not you can now get a service to
manage your attacks
• One issue if you’re going to launch an attack is
finding a “bulletproof” provider
– A provider willing to host a malware server
• These services will provide “exploit kits” and
manage the hosting
• In some cases they even offer analytics for the
consumer’s campaigns (think Google Analytics)
125. Widespread Adoption
• All of this has lowered the barrier to entry for
exploiting vulnerabilities
• There are large numbers of people with the
means and motive to attack any system online
• Furthermore secure practices are often not
followed
– See next slide
126. Many Systems Remain Vulnerable
• Remember the issues with OpenSSL that surfaced in early 2014?
– Despite widespread news reports, many systems continue to be vulnerable
• June 2014 survey of TLS vulnerabilities
127. Cloud Related Issues
• In many respects security in the cloud is not
different from security for a traditional system
• Some threats are magnified, and some
additional threats exist
• We’ll look at:
– VM sprawl
– Insecure interfaces or API
– Malicious insiders
– Shared resources
128. VM Sprawl
• VM creation is quick and easy
– It can be done in seconds without procuring hardware,
administrative knowledge, or securing permissions
• As a result it’s done often
– Sometimes for transient needs
• Once created the VM is often forgotten about
– It might still exist even if it is no longer doing any work
• Keeping track of the existing VMs is difficult to do
– It requires different processes than tracking physical assets
• This results in something called VM Sprawl
129. Consequences of VM Sprawl
• VM Sprawl is bad for many reasons
• First, it imposes additional overhead on the
overall solution
– The VM still costs money even if it is offline
• Second, it is less likely to be included in the
normal maintenance efforts
– Updates and patches might not be applied
• As a result the VM can remain vulnerable
130. Insecure Interfaces or API
• IaaS and PaaS providers expose a set of API
• These API are used by customers to:
– Provision
– Manage
– Orchestrate
– Monitor
– …
• The security of the cloud is dependent on the security
of these API
• These API must be designed in a way to resist
accidental and malicious attempts to circumvent policy
131. 3rd Party API
• We not only need to trust the expertise and
procedures of the cloud providers but 3rd party vendors
as well
• Organizations often layer capability on top of the
provided API in order to add value to the consumer e.g.
– Deployment tools
– Monitoring aggregation tools
– Data management services
– …
• The security of these providers also needs to be trusted
132. How Does This Work?
User 3rd Party Service Cloud Provider
133. Malicious Insiders
• Malicious insiders are a known and significant
threat to corporate security
– E.g. former and disgruntled employees
• When deploying your application on the cloud
you need to worry about employees of the
cloud provider as well
135. Shared Resources
• When software running in a process within
a VM can elevate its privileges sufficiently, it
can “escape” the bounds of the VM
• This is called “guest to host VM escape”
• Once this happens the software is able to
control all of the instances within that
hypervisor
136. Hypervisor Vulnerabilities
• The most commonly used hypervisors have all
been exploited
• Vulnerabilities continue to be discovered in all of
the major hypervisor software
– Discovered by both the good guys and bad guys
• Do a Google search on VM Escape for the latest
vulnerabilities …
137. Addressing Security Issues
• The strategies for dealing with security issues
typically fall into one of three categories
– Secure coding practices
– Processes and policy
– Architectural approaches
138. Secure Coding Practices
• Looking at the source of the vulnerabilities, it may seem that secure
coding practices will solve the problem
• While this is true to some extent, as we said, these vulnerabilities
exist in most commercially available software
• We must therefore assume that our software is to some extent
insecure
• It’s also the case that we will miss issues
• Inevitably the software will have defects, will be used in a context
other than what was intended, or will be used with software that it
wasn’t intended to work with
139. Processes and Policy
• A large aspect of dealing with security includes
processes and procedures
• The security of the system is impacted by things
like:
– Physical security
– IT policy governing computers on the network
– Updating and patching procedures
– Organizational structure and access policies
• Defining appropriate practices is a key
component to security
140. Agenda
• What is security?
• Understanding the threat
• Architectural approaches to security
• Designing for security
• Summary
141. Security Strategies
• Security strategies fall into one of several categories
– Policy/process
– Secure coding practices
– Architectural
• We will now look at some architectural strategies
• The thing to keep in mind is that you cannot easily eliminate
all vulnerabilities
– Some of the approaches are aimed at minimizing vulnerabilities
– Some are aimed at reducing the impact if the vulnerabilities are
exploited
142. Resisting Attacks
• Resisting attacks is analogous to securing the
perimeter
• Strategies for resisting attacks include:
– Encryption
– Checking data integrity
– Limiting exposure
– Limiting access
143. Encryption
• Encryption applied to data and communications can help
maintain confidentiality
• Can be symmetric
– Both parties use the same key
• Or asymmetric
– Public/private key
144. Encryption
• What kind of attack would encryption protect
against?
• What kind of attack would it not protect
against?
• What kind of security concern would it
address?
145. Data Integrity
• Attaching a checksum or hash to the data
can help ensure the data has not been
tampered with
• This additional data can be encrypted along
with or independently from the original data
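As one illustration of the idea, a keyed hash (HMAC) can be attached to a message so that the receiver can detect tampering. This is a minimal sketch, not a prescribed design: it substitutes an HMAC for "a hash that is then encrypted", and the key, message, and function names are assumptions made for the example.

```python
import hashlib
import hmac

# Assumption: both parties already share this key through some secure channel.
SECRET_KEY = b"example-key-for-illustration-only"


def sign(data: bytes) -> str:
    """Compute an HMAC-SHA256 tag to send alongside the data."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()


def verify(data: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(data), tag)


message = b"transfer 100 to account 42"
tag = sign(message)
```

A plain checksum or unkeyed hash detects accidental corruption only, because an attacker who changes the data can recompute it. The key is what makes the tag hard to forge.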
146. Data Integrity
• Think about data integrity concerns in the context of
some of the recent attacks
– Stuxnet
– Gauss
– …
• These techniques can be important for detecting an
attack
– Additional techniques might be needed to recover
147. Limiting Exposure
• Attacks depend on exploiting weaknesses to
gain access to data and services
• Limiting access to the attack surface limits
risk*
• The following are approaches to limiting
exposure
* Manadhata 2006
148. Client Data Storage
• Problem: many applications store data at
potentially untrusted clients.
– These clients could tamper with the data
• Solution: this pattern uses encryption to store
security-critical data client-side
149. Client Data Storage II
• Manual inspection of this data could reveal
details of the application that could be used to
compromise the site
150. Client Input Filters
• Problem: in many cases clients execute
outside the control of the system developer.
– These clients can be tampered with to behave in
an untrustworthy manner
• Solution: treat all data provided by clients as
suspect
151. Client Input Filters II
• Perform (or re-execute) data validity checks on the server
• Examine headers and URLs for malicious code
• Text input should be checked for scripts
• Calculated fields should be re-computed on the server
• Considerations:
– Should use a symmetric key as it’s less computationally expensive
– The key should not be stored in a file
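The server-side checks listed above can be sketched as follows. Everything here is hypothetical: the order fields, the price catalog, the quantity limit, and the deliberately simple script check. A production system would rely on a vetted validation and output-encoding library instead of a single regular expression.

```python
import re

PRICES = {"widget": 250, "gadget": 1000}  # hypothetical server-side catalog, in cents
SCRIPT_PATTERN = re.compile(r"<\s*script", re.IGNORECASE)  # naive illustrative check


def validate_order(order: dict) -> dict:
    """Re-check a client-submitted order; raise ValueError if it looks suspect."""
    item = order.get("item")
    quantity = order.get("quantity")
    note = order.get("note", "")
    if item not in PRICES:
        raise ValueError("unknown item")
    if type(quantity) is not int or not 1 <= quantity <= 100:
        raise ValueError("quantity out of range")
    if not isinstance(note, str) or SCRIPT_PATTERN.search(note):
        raise ValueError("note is malformed or contains script markup")
    # Ignore any client-supplied total and recompute it on the server.
    return {"item": item, "quantity": quantity, "note": note,
            "total": PRICES[item] * quantity}
```

The returned order never carries the client's own "total" field, so a tampered client cannot set its own price.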
152. Trusted Proxy
• Problem: it may be necessary to expose
inadequately protected aspects of the system
to untrusted users
• Solution: create a trusted proxy that acts as a
buffer between the component and the users
153. Trusted Proxy II
• This proxy intercepts and filters all
communication
• In that way it can compensate for the lack
of protections
• Typically two options
– Filter requests for bad input
– Recreate a new request with only the essential
parts of the old request
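The second option, recreating the request, can be sketched as a function that copies only known fields into a fresh request. The field names, types, and permitted actions are assumptions chosen for illustration.

```python
ALLOWED_FIELDS = {"account_id": int, "action": str}  # hypothetical request schema
ALLOWED_ACTIONS = {"balance", "history"}             # hypothetical permitted actions


def rebuild_request(raw: dict) -> dict:
    """Build a fresh request holding only the essential, validated fields."""
    clean = {}
    for field, expected_type in ALLOWED_FIELDS.items():
        value = raw.get(field)
        if type(value) is not expected_type:
            raise ValueError(f"missing or malformed field: {field}")
        clean[field] = value
    if clean["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action not permitted")
    # Anything not named in ALLOWED_FIELDS is dropped, never forwarded.
    return clean
```

Because the proxy forwards only what it copied, unexpected fields never reach the protected component, even if the proxy did not recognize them as malicious.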
154. Single Access Point
• Problem: a system is more difficult to secure if it has multiple
entry points
• With multiple entry points:
– You may need to separately secure multiple applications
– You may have duplicate authentication logic to maintain
– Unix is an example with multiple entry points
– Different services can be set up on different machines
155. Single Access Point II
• The solution is to create a single point of entry
• A session is then created
• This allows global tracking of session state and
authorization information
• There is a single “gateway” or “check point” through
which each user’s login is validated
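A single point of entry can be sketched as one login function plus one gateway function that every request must pass through. The user store, service names, and in-memory session table are hypothetical, and a real system would store password hashes, not plaintext passwords.

```python
import hmac
import secrets

USERS = {"alice": "correct horse"}  # hypothetical store; use password hashes in practice
SESSIONS = {}                       # session token -> user name


def login(user: str, password: str) -> str:
    """The only place where credentials are checked and a session is created."""
    expected = USERS.get(user)
    if expected is None or not hmac.compare_digest(expected.encode(), password.encode()):
        raise PermissionError("invalid credentials")
    token = secrets.token_hex(16)
    SESSIONS[token] = user
    return token


def gateway(token: str, service: str) -> str:
    """Every request passes through here before reaching any service."""
    user = SESSIONS.get(token)
    if user is None:
        raise PermissionError("not logged in")
    handlers = {
        "reports": lambda u: f"reports for {u}",
        "billing": lambda u: f"billing for {u}",
    }
    if service not in handlers:
        raise LookupError("unknown service")
    return handlers[service](user)
```

The individual services contain no authentication logic of their own, which is what removes the duplicated checks that multiple entry points require.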
156. Single Access Point III
• Which aspects of security does this pattern
address?
• What are some of the implications of using
this pattern?
157. Partitioned Application
• Problem: large complex applications often
require root privileges in some portions of the
application
– If these elements are compromised the entire
system is at risk
• Solution: partition the large application into
smaller elements each adhering to least
privilege principle
158. Partitioned Application II
• This becomes more difficult to manage
• Additionally performance can suffer as
interprocess communication increases
• Additional points of entry are introduced
– Even though the impact of being compromised is
diminished
159. Password Propagation
• Problem: most applications manage user data
under a single database account
– Thus if the single account is compromised all user
data can be accessed
• Solution: the user’s password is required with
each backend database request
160. Password Propagation II
• This is essentially an instance of application
partitioning
• The front end will cache the password and
provide it with each back end request
161. Limiting Access
• You can think of this as “securing the
perimeter”
• This is a widely used approach of limiting
access to data and services
• The following are examples of techniques for
limiting access
162. Session
• Background: Systems need to keep track of
user’s login status, level of authorization, and
so forth
– The Singleton pattern is often used for this
– This pattern can be difficult to use when the
system supports concurrent logins
• The solution is to create a “session” object to
hold these global variables
163. Session II
• This session object is accessible by all
components of the application
• This facilitates having a common interface for
accessing this information
– Easier to implement and maintain than having a
number of variables passed around
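A session object can be sketched as a small data class created once per login. The field names and the authorization threshold are assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Per-login state that would otherwise live in global variables."""
    user: str
    logged_in: bool = False
    authorization_level: int = 0
    attributes: dict = field(default_factory=dict)


def can_view_admin_page(session: Session) -> bool:
    # Components read login state through the session, not through globals.
    # The required level of 2 is a hypothetical policy.
    return session.logged_in and session.authorization_level >= 2


# Two concurrent logins each get their own object, so their state cannot collide.
alice = Session("alice", logged_in=True, authorization_level=2)
bob = Session("bob", logged_in=True, authorization_level=1)
```

This is the difference from a Singleton: there is one object per login, not one per application.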
164. Roles
• Background: when an application supports
many types of users security becomes more
complicated
– It can be difficult to track and maintain all of the
things that every user has access to
• It eases implementation issues if a smaller
number of “roles” are created
• Each role has a given set of rights
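A role-based check can be sketched with two small tables: one maps roles to rights and the other maps users to roles. The role names, rights, and users are hypothetical.

```python
ROLE_RIGHTS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "manage_users"},
}
USER_ROLES = {"alice": "admin", "bob": "viewer"}  # hypothetical assignments


def is_allowed(user: str, right: str) -> bool:
    """A user holds a right only if the user's role grants it."""
    role = USER_ROLES.get(user)
    return right in ROLE_RIGHTS.get(role, set())
```

Changing what editors may do is then a one-line change to ROLE_RIGHTS, with no need to touch each user's record.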
165. Roles II
• What kinds of security does this address?
• Implications?
166. Account Lockout
• Problem: a growing number of password guessing tools can be
used to compromise systems requiring user authentication
• Solution: lock the user account after some number of
incorrect attempts
• How it works:
– The system records each incorrect login attempt
– When a predetermined number of attempts is reached the account is
locked
– Each time there is a correct login the count of incorrect attempts is reset
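The lockout steps above can be sketched as follows. The class name, the limit of three attempts, and the plaintext password comparison are simplifications for illustration only.

```python
MAX_ATTEMPTS = 3  # assumption: the predetermined number of attempts


class Account:
    def __init__(self, password: str):
        self._password = password  # illustration only; store a hash in practice
        self.failed_attempts = 0
        self.locked = False

    def login(self, password: str) -> bool:
        """Return True on a successful login; lock after too many failures."""
        if self.locked:
            return False
        if password == self._password:
            self.failed_attempts = 0  # a correct login resets the count
            return True
        self.failed_attempts += 1     # record the incorrect attempt
        if self.failed_attempts >= MAX_ATTEMPTS:
            self.locked = True
        return False
```

Once locked, even the correct password is refused, which is the usability and availability cost of this pattern.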
167. Account Lockout II
• Issues:
– Doesn’t address the situation where different user
IDs are used
– Usability can be adversely affected
– Availability can be adversely affected
• Can facilitate denial of service
169. Minefield
• Problem: hackers are likely familiar with the
vulnerabilities of various configurations
– Once they figure out your setup they’ll know how
to get in
• Solution: change your setup to a non-standard
configuration
170. Minefield II
• Even small changes can increase the effort enough to
discourage hackers
• You can do things like:
– Alter file structure
– Rename common administrative commands
– Instrument commands to alert administrators
– Add booby traps that will recognize tampering
171. Secure Assertion
• Problem: the activities performed by a
malicious intruder may look legitimate at the
local level
– E.g. transferring money from an account
• Solution: create a framework for reporting
specific activities that violate assertions
172. Secure Assertion II
• The application developer is in a position to determine
activities that may be suspicious
– They can create assertions
• If the application is being developed in an environment
that supports exceptions, assertion violations could be
reported in a similar fashion
• The violations could be collected globally to provide
additional insight on the current activities
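In a language with exceptions, assertion violations can be raised like any other error and also appended to a shared log. This sketch assumes a hypothetical money-transfer function and a daily limit. The names secure_assert, AssertionViolation, and VIOLATION_LOG are invented for the example.

```python
class AssertionViolation(Exception):
    """Raised when an application-level security assertion fails."""


VIOLATION_LOG = []  # collected globally so that patterns across requests are visible


def secure_assert(condition: bool, description: str, **context) -> None:
    """Record and report a violated assertion."""
    if not condition:
        VIOLATION_LOG.append({"description": description, **context})
        raise AssertionViolation(description)


def transfer(balance: int, amount: int, daily_total: int, daily_limit: int = 1000) -> int:
    """Hypothetical transfer that looks legitimate locally but is checked against policy."""
    secure_assert(amount > 0, "non-positive transfer amount", amount=amount)
    secure_assert(daily_total + amount <= daily_limit,
                  "daily transfer limit exceeded", amount=amount)
    return balance - amount
```

A single transfer of 900 is unremarkable on its own. The assertion catches it only because the developer knows the account has already moved 200 that day.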
173. Recovering From Attacks
• Availability tactics
– We will discuss these in a future class
• Auditing
– Keeps a trail of the users and their actions
– Helps to maintain a record of the attack
174. Network Address Blacklist
• Problem: all systems with an online presence
are subject to attack
– Locking individual accounts doesn’t address
systemic attacks
• Solution: block network addresses that are the
source of attack
175. Network Address Blacklist II
• The server will monitor requests from clients
– Any suspicious requests will be logged
– If there are repeated suspicious requests the address is
blocked
• One question is where to implement the check
– Network (e.g. firewall) or application
• Performance as list grows can be an issue
• Can still be subject to denial of service attack
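An application-level version of the check can be sketched as a counter per address plus a set of blocked addresses. The threshold of three and the boolean "suspicious" flag are assumptions. Deciding what counts as suspicious is outside this sketch.

```python
from collections import Counter

BLOCK_THRESHOLD = 3            # assumption: repeated suspicious requests before blocking
suspicious_counts = Counter()  # stands in for the log of suspicious requests
blocked = set()


def handle_request(address: str, suspicious: bool) -> bool:
    """Return True if the request is served, False if it is rejected."""
    if address in blocked:
        return False
    if suspicious:
        suspicious_counts[address] += 1
        if suspicious_counts[address] >= BLOCK_THRESHOLD:
            blocked.add(address)
        return False
    return True
```

A set keeps each lookup cheap as the list grows, but memory use still grows with the number of blocked addresses, and an attacker who rotates addresses is not stopped.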
176. Agenda
• What is security?
• Understanding the threat
• Architectural approaches to security
• Designing for security
• Summary
177. So How Do We Decide?
• There are many options, which ones are
required?
• What are the side effects of selecting these
security mechanisms?
178. Fit for Purpose
• It is (hopefully) clear that each of these techniques
addresses a different concern
• What concerns does your organization have?
– This depends on the business assets that need
protection
– And the ways in which these assets could be
compromised given the system
179. Threat Modeling
Threat Modeling and Analysis in a nutshell:
– Identify the business asset to protect
– Brainstorm the known threats to the system
– Rank the threats by decreasing risk
– Choose techniques to mitigate the threats
– Choose appropriate technologies from the identified
techniques
180. Business Asset
• The reason for security is to protect some
aspect of the business
• You need to identify those aspects of the
business that need protection
• You also need to determine what “protection”
means
181. Brainstorm Threats
• Given a particular design what might happen to
compromise the business asset?
• You should think about these from two perspectives
– Likelihood
– Impact
• At this point you don’t worry about whether they need
mitigation
182. Rank the Threats
• Based on the likelihood and the impact you
can determine the “risk exposure”
– Look at risk management techniques
• Prioritize the risks according to the exposure
• Determine the threshold above which risks require
mitigation
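One common convention in risk management is to compute exposure as likelihood multiplied by impact. The sketch below assumes that convention. The threats, the 1 to 5 scales, and the threshold of 6 are hypothetical values chosen for illustration.

```python
THRESHOLD = 6  # assumption: exposure at or above this value requires mitigation

# (name, likelihood 1-5, impact 1-5); hypothetical values
threats = [
    ("defaced help page", 2, 1),
    ("stolen credentials", 3, 4),
    ("VM escape", 1, 5),
]


def rank(threat_list):
    """Return (name, exposure) pairs sorted by decreasing risk exposure."""
    scored = [(name, likelihood * impact) for name, likelihood, impact in threat_list]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank(threats)
needs_mitigation = [name for name, exposure in ranked if exposure >= THRESHOLD]
```

With these numbers only the stolen-credentials threat crosses the threshold. A VM escape scores high on impact, but its low assumed likelihood keeps its exposure below the cutoff.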
183. Mitigation Techniques
• Look for generic patterns that will mitigate the
risks
• Mitigate means lower the risk exposure to a
tolerable level
– You lower the exposure by reducing the likelihood
or reducing the impact
– A tolerable level means below the threshold
defined previously
184. Choose Technologies
• Basically you need to map the generic pattern to
some concrete solution
• This is where you factor in the costs
• Costs could come in terms of level of effort to
implement
• Costs could also come in terms of tradeoffs
– You might need to iterate these steps
185. Consider Trade Offs
• Most of these mechanisms adversely impact
performance
– Blindly selecting these capabilities can bring the
system to a standstill
• They also have an impact on the flexibility of
the system
• Balancing concerns is key
186. References
• STRIDE: http://msdn2.microsoft.com/en-us/library/aa302419.aspx
• Hinton, Hondo, Hutchison: Security Patterns within a Service Oriented Architecture,
IBM, 2005
• Hafiz, Johnson: Security Patterns and their Classification Schemes
• Thomas Erl: Service Oriented Architecture, Chapters 4 and 11
• SEI/CERT OCTAVE: Operationally Critical Threat, Asset, and Vulnerability
Evaluation: http://www.cert.org/octave
• Manadhata et al.: Measuring the Attack Surfaces of Two FTP Daemons, 2006