1. Big Data Analysis using
Hadoop on a Eucalyptus
Cloud
How secure is our cloud?
PRESENTED BY: ABHISHEK DE
STUDENT, CSE 2ND YEAR, BPPIMT
2. Contents:
The Big Data Crisis
Let’s embrace Cloud Computing
Benefits of cloud
Establishing an IaaS using Eucalyptus
A word on Virtualization
Hadoop as a Platform
MapReduce and HDFS
Typical algorithms
Benefits we achieve
How secure is the system?
PREPARED BY: ABHISHEK DE
2
06-Apr-13
3. The drifting era: BIG DATA and crisis
• YouTube users upload 48 hours of
new video every minute of the
day.
• 100 terabytes of data are uploaded
daily to Facebook.
• Twitter sees roughly 175 million
tweets every day, and has more
than 465 million accounts.
• Walmart handles more than 1
million customer transactions
every hour, feeding databases that
hold more than 2.5 petabytes of data.
5. Solution: Cloud Computing
Conventional Computing:
Your data gets processed on your own
computer.
Cloud computing:
You send your data to some other
computer. It gets processed there and it
comes back to you.
“Cloud computing is the use
of computing resources (hardware and
software) that are delivered as a service
over a network (typically the Internet)”
--WIKIPEDIA
6. Benefits of Cloud Computing:
• High reliability.
• Highly scalable and fault tolerant.
• Reduced cost: only pay for what you need.
• Efficient management of resources.
• Improved security.
• Achieved out of commodity hardware.
7. Why Eucalyptus?
“Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems”
Eucalyptus is the world's most widely deployed software platform for on-premise
(private) Infrastructure as a Service (IaaS) clouds.
It uses existing infrastructure to create a scalable, secure web services layer that
abstracts compute, network and storage to offer IaaS.
Eucalyptus can be dynamically scaled up or down depending on application
workloads.
8. Architecture of Eucalyptus:
FRONT END:
• Users log in to the cloud using their credentials.
• The user is redirected to the back end of the cloud, i.e., the storage and the resource pool.
BACK END:
• Runs the Node Controller.
• Mounts images as virtual machines (instances) using Xen or KVM.
• Hosts the resource pool.
[Figure: FRONT END and BACK END, with user1 connecting as user1@nc1]
9. XEN: Virtualize your resources
Xen is the underlying virtualization technology used by
Eucalyptus. The Xen hypervisor allows several guest
operating systems to run concurrently on the same
computer hardware.
Xen partitions a single physical machine into
multiple virtual machines, to provide server
consolidation and utility computing. Existing
applications and binaries run unmodified.
The hypervisor controls the MMU, CPU
scheduling, and interrupt controller, presenting a
virtual machine to guests.
10. HADOOP: Solution to BIG DATA
Roughly how long does it take to read 1 TB from a commodity hard disk?
Around 4 hours.
With Hadoop it takes around:
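The 4-hour figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a sustained sequential read rate of about 70 MB/s for one commodity disk and a hypothetical cluster of 100 disks read in parallel; both numbers are illustrative assumptions, not figures from this deck.

```python
TB = 10**12  # one decimal terabyte, in bytes


def read_time_hours(total_bytes, mb_per_sec, disks=1):
    """Hours needed to scan total_bytes spread evenly over `disks` drives read in parallel."""
    bytes_per_sec = mb_per_sec * 10**6 * disks
    return total_bytes / bytes_per_sec / 3600


# Assumed throughput: ~70 MB/s per disk (illustrative, not measured).
single_disk_hours = read_time_hours(TB, mb_per_sec=70)

# Assumed cluster: 100 disks, each holding 1/100 of the data (hypothetical).
parallel_minutes = read_time_hours(TB, mb_per_sec=70, disks=100) * 60

print(f"1 disk:    {single_disk_hours:.1f} hours")
print(f"100 disks: {parallel_minutes:.1f} minutes")
```

The point is only that scan time falls roughly in proportion to the number of disks reading in parallel, which is the effect Hadoop relies on.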
11. Birth of HADOOP: Open-source
alternative to GFS
Pre-2004: Cutting and Cafarella develop open-source projects for web-scale
indexing, crawling and search.
2004: Jeffrey Dean and Sanjay Ghemawat publish the MapReduce model used internally
at Google.
2006: Hadoop becomes an official Apache project; Cutting joins Yahoo!, and Yahoo! adopts
Hadoop.
12. HDFS: Hadoop Distributed File System
Files are split into 128 MB (or 64 MB) blocks
Blocks are replicated across several datanodes (usually 3)
A single namenode stores metadata (file names, block
locations, etc.)
Optimized for large files and sequential reads
Clients read from the closest available replica (note:
locality of reference)
If the replication for a block drops below target, it is
automatically re-replicated.
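The block and replica bookkeeping described above can be sketched in a few lines. This is a toy model, not HDFS code: the round-robin placement, the datanode names, and the helper functions are all assumptions made for illustration (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as on the slide
REPLICATION = 3                 # usual replication target


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block a file of file_size is split into."""
    count = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(count)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: block b goes to `replication` consecutive datanodes."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def under_replicated(placement, live_nodes, replication=REPLICATION):
    """Blocks whose number of live replicas has dropped below the target."""
    return [
        b for b, nodes in placement.items()
        if sum(node in live_nodes for node in nodes) < replication
    ]


blocks = split_into_blocks(500 * 1024 * 1024)       # a 500 MB file
datanodes = ["dn1", "dn2", "dn3", "dn4"]             # hypothetical node names
placement = place_replicas(len(blocks), datanodes)   # what the namenode would track
needs_copy = under_replicated(placement, live_nodes={"dn1", "dn2", "dn3"})  # dn4 has failed

print(len(blocks), "blocks; re-replicate blocks", needs_copy)
```

In this model the namenode's job is the `placement` dictionary: it never holds file data, only the mapping from blocks to datanodes, and it schedules new copies for the blocks listed in `needs_copy`.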
[Figure: a namenode and several datanodes, each datanode holding replicas of blocks 1 to 4]
13. Data Flow
[Figure: data flow between Web Servers, Scribe Servers, Network Storage, the Hadoop Cluster, Oracle RAC and MySQL]
15. Word Count: A typical Example
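A minimal single-process sketch of word count in the MapReduce style follows. It imitates the map, shuffle and reduce phases with plain Python functions; it does not use the Hadoop API, and the function names and sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

On a real cluster each map call runs on the node that stores its input block, and the shuffle moves data over the network; here everything runs in one process to keep the data flow visible.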
16. Implementation: Hardware
Move code to data (local
computation)
Allow programs to scale
transparently w.r.t size of input
Abstract away fault tolerance,
synchronization, etc.
17. HADOOP in
action!
SOCIAL NETWORKING
ANALYSIS
PAGE RANKING ANALYSIS
ANALYTICS ENGINE WITH
MAP/REDUCE
IMAGE PROCESSING
18. Social Networking Analysis:
Problem: recommend new friends (friend-of-a-friend, FOAF)
Map task:
– U (the target user) is fixed and its friends list is copied to all cluster nodes (“copy join”); each cluster node
stores part of the social graph
– In: (X, <friendsX>), i.e. the local data for the cluster node
– Out:
if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already
friends of U
nil otherwise
Reduce task:
– In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, … >), i.e. the FOAF lists for all users A, B, etc. who
are friends with U
– Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of
occurrences in all FOAF lists (sort/rank the result!)
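The map and reduce tasks above can be sketched as two small Python functions run in a single process. The sample graph and user names are made up for illustration, and removing U itself from each FOAF list is an added assumption (the slide only excludes existing friends of U).

```python
from collections import Counter


def map_task(u, friends_u, x, friends_x):
    """Map: if X is a friend of U, emit the friends of X that U does not already have."""
    if x in friends_u:
        # Assumption: also drop U itself, since U always appears in friends_x here.
        return u, friends_x - friends_u - {u}
    return None  # "nil otherwise"


def reduce_task(u, foaf_lists):
    """Reduce: count how often each candidate appears across all FOAF lists, ranked."""
    counts = Counter()
    for candidates in foaf_lists:
        counts.update(candidates)
    return u, counts.most_common()


# Hypothetical social graph: user -> set of friends.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave", "erin"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
    "erin": {"bob"},
}

target = "alice"
mapped = [map_task(target, graph[target], x, fx) for x, fx in graph.items() if x != target]
foaf_lists = [out[1] for out in mapped if out is not None]
user, ranked = reduce_task(target, foaf_lists)
print(user, ranked)
```

Here dave is suggested first because two of alice's friends know him, while erin is reachable through only one.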
19. Pros and Cons
Batch, offline jobs
Write-once, read-many across full
data set
Usually, though not always, simple
computations
I/O bound by disk/network
bandwidth
What it’s not:
High-performance
parallel computing, e.g.
MPI
Low-latency random
access relational
database
Always the right solution
20. Cloud Security: Threats unveiled
XML SIGNATURE ATTACK:
The original SOAP body element is moved to a newly
added bogus wrapper element in the SOAP security
header. Note that the moved body is still referenced
by the signature using its identifier attribute Id="body".
The signature is still cryptographically valid, as the
body element in question has not been modified (but
simply relocated). Subsequently, in order to make the
SOAP message XML schema compliant, the attacker
changes the identifier of the new SOAP body placed
in the original position (in this example he uses
Id="attack"). The attacker can now fill the empty SOAP
body with bogus content, as any of the operations
defined by the attacker can be effectively executed
due to the successful signature verification.
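The mismatch at the heart of this attack can be shown with Python's standard XML parser. The sketch below performs no cryptography: it only shows that a lookup by Id (what a naive signature check does) and a lookup by position (what the application logic does) return different elements in a wrapped message. The simplified Security, Signature and Wrapper elements and the two operation names are illustrative assumptions, not a real WS-Security message.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
NS = {"soap": SOAP_NS}


def find_by_id(root, id_value):
    """Naive verifier: locate the signed element by its Id, anywhere in the tree."""
    for element in root.iter():
        if element.get("Id") == id_value:
            return element
    return None


def executed_body(root):
    """Application logic: process the Body that is a direct child of the Envelope."""
    return root.find("soap:Body", NS)


# Simplified, hypothetical wrapped message: the signed body sits in a bogus
# wrapper inside the header, and the attacker's body takes the original position.
wrapped_message = f"""
<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Header>
    <Security>
      <Signature><Reference URI="#body"/></Signature>
      <Wrapper>
        <soap:Body Id="body"><DescribeInstances/></soap:Body>
      </Wrapper>
    </Security>
  </soap:Header>
  <soap:Body Id="attack"><TerminateInstances/></soap:Body>
</soap:Envelope>
""".strip()

root = ET.fromstring(wrapped_message)
signed_element = find_by_id(root, "body")   # what the signature covers
run_element = executed_body(root)           # what actually gets executed

signed_operation = signed_element[0].tag
executed_operation = run_element[0].tag
print("signed:", signed_operation, "| executed:", executed_operation)
```

A common defence is to check that the element referenced by the signature is the same element the application goes on to process, for example by verifying its position in the document rather than only its Id.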
21. Script Injection Attack
Targets only AWS Management Console users.
Exploits the shared credentials between the Amazon shop interface and AWS.
The first attack exploits the GET parameters in the download link that users
follow to download their X.509 certificates issued by Amazon. However, the
preconditions for the attack are rather demanding: the injected script must be
UTF-7 encoded to bypass the server logic that encodes standard HTML characters,
and the attack relies on features of specific IE versions.
The second attack is a persistent cross-site scripting attack that exploits the
login session initiated with AWS the first time a user logs into the Amazon
shop interface.
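Why UTF-7 helps the attacker can be seen in a few lines of Python. The sketch below uses the standard library's `html.escape` as a stand-in for the server-side encoding step (an assumption; the actual AWS logic is not shown in this deck) and a harmless `alert(1)` payload chosen for illustration.

```python
import html

# "<" and ">" written in UTF-7 are "+ADw-" and "+AD4-", so the payload
# contains none of the characters an HTML escaper looks for.
utf7_payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Stand-in for the server's encoding of standard HTML characters.
escaped = html.escape(utf7_payload)
passes_unchanged = escaped == utf7_payload

# What a browser that interprets the page as UTF-7 would see.
decoded = utf7_payload.encode("ascii").decode("utf-7")

print("escaper changed the payload:", not passes_unchanged)
print("decoded as UTF-7:", decoded)

# The same script in plain form is neutralised by the escaper.
plain_escaped = html.escape("<script>alert(1)</script>")
print("plain payload after escaping:", plain_escaped)
```

This only works when the browser can be made to treat the response as UTF-7, which is why the attack depends on specific IE versions; declaring an explicit character set such as UTF-8 in the response removes that ambiguity.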
22. Who uses it? Applications and
Innovations
Projects under Hadoop:
HBase
ZooKeeper
Pig
Oozie
Hive
Sqoop
24. That’s the end..
But the beginning of a new
horizon..
Special thanks to the entire
team that helped me in this
endeavor.
ALL QUERIES, PLEASE CONTACT ME AT: abhishekde@hotmail.com
QUESTIONS?