2. Overview
• The Big Data Challenge
• Big Data tools and what we can do with them
• Packetloop – Big Data Security Analytics
• Intel technology for Big Data
3. An engineer’s definition
When your data sets become so large that you have to start
innovating in how you collect, store, organize, analyze, and
share them
7. Generated data
[Chart: data volume, generated data vs. data available for analysis]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
12. What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
• Easy to provision and scale
• No upfront costs, pay as you go
• High performance at a low price
• Open and flexible with support for popular BI tools
14. How does EMR work?
[Diagram: S3, EMR, EMR cluster]
• Put the data into S3
• Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS
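The choices listed on this slide map roughly onto the request accepted by boto3's EMR `run_job_flow` call. The following is a minimal sketch under stated assumptions, not a definitive setup: the helper `build_emr_request`, the cluster name, the bucket, the release label, and the instance type are all hypothetical, and the request is only built here, never submitted.

```python
def build_emr_request(bucket, node_count, applications):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "example-cluster",          # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",       # assumed release; pick your own
        "LogUri": f"s3://{bucket}/logs/",   # logs go to S3
        # "Hive/Pig/etc." from the slide
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            # "# of nodes, types of nodes" from the slide
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": node_count - 1,
                },
            ],
            # Keep the cluster up so jobs can be submitted after launch
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role
    }


request = build_emr_request("my-bucket", 10, ["Hive", "Pig"])
```

With credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`; the same launch is also possible from the EMR console or CLI.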
19. Resize Nodes with Spot Instances
Cost without Spot:
10-node cluster running for 14 hours
Cost = 1.2 * 10 * 14 = $168
Add 10 nodes on Spot:
20-node cluster running for 7 hours
On-demand cost = 1.2 * 10 * 7 = $84
Spot cost = 0.6 * 10 * 7 = $42
Total = $84 + $42 = $126
25% reduction in price
50% reduction in time
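The arithmetic on this slide can be checked with a short script. This is only a sketch: the hourly rates ($1.20 on-demand, $0.60 Spot) are the ones implied by the slide, the function name `cluster_cost` is hypothetical, and the halved runtime assumes the job scales linearly with node count, as the slide does.

```python
from decimal import Decimal

ON_DEMAND_RATE = Decimal("1.2")  # $ per node-hour, from the slide
SPOT_RATE = Decimal("0.6")       # $ per node-hour, from the slide


def cluster_cost(on_demand_nodes, spot_nodes, hours):
    """Total cost of a cluster mixing on-demand and Spot nodes."""
    hourly = ON_DEMAND_RATE * on_demand_nodes + SPOT_RATE * spot_nodes
    return hourly * hours


# 10 on-demand nodes for 14 hours
baseline = cluster_cost(on_demand_nodes=10, spot_nodes=0, hours=14)

# Same 10 on-demand nodes plus 10 Spot nodes, finishing in 7 hours
with_spot = cluster_cost(on_demand_nodes=10, spot_nodes=10, hours=7)

saving_pct = (baseline - with_spot) / baseline * 100

print(f"baseline=${baseline}, with spot=${with_spot}, saving={saving_pct}%")
```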
20. Ad-Hoc Clusters – What are they?
[Diagram: EMR cluster, S3]
When processing is complete, you can terminate the cluster (and stop paying)
21. Ad-Hoc Clusters – When to use
[Diagram: EMR cluster, S3]
• Not using HDFS
• Not using the cluster 24/7
• Transient jobs
22. “Alive” Clusters – What are they?
[Diagram: EMR, EMR cluster, S3]
If you run your jobs 24x7, you can also run a persistent cluster and use Reserved Instance (RI) pricing to save costs
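In API terms, the difference between an ad-hoc cluster and an "alive" cluster largely comes down to whether the cluster stays up once its steps finish. The sketch below is a simplification under stated assumptions: `lifecycle_settings` and the `pattern` label are hypothetical, while `KeepJobFlowAliveWhenNoSteps` is the field used under `Instances` in boto3's `run_job_flow` request.

```python
def lifecycle_settings(runs_24x7: bool) -> dict:
    """Map the workload pattern to the EMR keep-alive flag."""
    return {
        # Descriptive label only; not an EMR API field
        "pattern": "alive" if runs_24x7 else "ad-hoc",
        # False: terminate when the steps finish (and stop paying)
        # True: keep the cluster running for further jobs
        "KeepJobFlowAliveWhenNoSteps": runs_24x7,
    }


nightly_batch = lifecycle_settings(runs_24x7=False)
always_on = lifecycle_settings(runs_24x7=True)
```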
24. S3 instead of HDFS
[Diagram: S3, EMR, EMR cluster]
• S3 is designed for 99.999999999% (11 nines) durability
• Elastic
• Versioning to protect against failure
• Run multiple clusters with a single source of truth
• Quick recovery from failure
• Continuously resize clusters
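To put the durability figure in perspective, the expected number of objects lost per year can be worked out directly. This sketch assumes AWS's published design target of 99.999999999% (11 nines) annual durability; the function name and the 10 million object count are illustrative choices, not figures from the slide.

```python
from decimal import Decimal

# Assumption: AWS's published S3 design target, per object per year
DURABILITY_PCT = Decimal("99.999999999")


def expected_annual_losses(object_count):
    """Expected number of objects lost in one year."""
    loss_probability = 1 - DURABILITY_PCT / 100
    return loss_probability * object_count


losses = expected_annual_losses(10_000_000)
years_per_lost_object = 1 / losses

print(f"expected losses per year: {losses}")
print(f"roughly one object lost every {years_per_lost_object} years")
```

Under this assumption, storing 10 million objects means losing on average one object every 10,000 years.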
25. S3 and HDFS
[Diagram: S3, S3DistCp, HDFS on the EMR cluster]
• Load data from S3 into HDFS using S3DistCp
• Benefits of HDFS
• Master copy of the data in S3
• Get all the benefits of S3
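On EMR, an S3DistCp copy is typically submitted as a cluster step that runs `s3-dist-cp` through `command-runner.jar`. The following is a minimal sketch, not a definitive configuration: the helper `s3distcp_step`, the step name, and both paths are hypothetical.

```python
def s3distcp_step(src, dest):
    """Build an EMR step definition that copies data with S3DistCp."""
    return {
        "Name": "Copy S3 data into HDFS",      # hypothetical step name
        "ActionOnFailure": "CANCEL_AND_WAIT",  # leave the cluster up on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


# Hypothetical paths: the master copy stays in S3, the working copy goes to HDFS
step = s3distcp_step("s3://my-bucket/input/", "hdfs:///data/input/")
```

The resulting dictionary could be passed to boto3's `add_job_flow_steps(JobFlowId=..., Steps=[step])` on a running cluster. Swapping `src` and `dest` copies results from HDFS back to S3.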