This document discusses big data analytics tools and technologies. It begins with an overview of big data challenges and available tools. It then discusses Packetloop, a company that provides big data security analytics using tools like Amazon EMR, Cassandra, and PostgreSQL on AWS. Next, it discusses how EMR and Redshift from AWS can be used as big data tools for tasks like batch processing, data warehousing, and live analytics. It concludes by discussing how Intel technologies can help power big data platforms by providing optimized processors, networking, and storage to enable analytics at scale.
2. Overview
• The Big Data Challenge
• Big Data tools and what we can do with them
• Packetloop – Big Data Security Analytics
• Intel technology for Big Data
3. An engineer’s definition
When your data sets become so large that you have to start
innovating how to collect, store, organize, analyze and
share them
7. [Chart: data generated vs. data available for analysis, both growing over time]
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
12. What is Amazon Redshift?
Amazon Redshift is a fast, powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
• Easy to provision and scale
• No upfront costs, pay as you go
• High performance at a low price
• Open and flexible, with support for popular BI tools
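A common way to get data into Redshift is the COPY command, which loads files from S3 in parallel across the cluster. A minimal sketch in Python (the table, bucket, and IAM role names are placeholders, and IAM_ROLE authorization reflects current Redshift docs rather than this deck):

```python
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement that bulk-loads CSV data
    from S3; Redshift reads the files under s3_path in parallel."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS CSV;"
    )

# Hypothetical names for illustration only.
sql = build_copy_statement(
    table="clickstream",
    s3_path="s3://my-bucket/clicks/2013/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftLoad",
)
```

Since Redshift speaks the PostgreSQL wire protocol, the resulting statement can be executed through any Postgres driver or the standard BI tools the slide mentions.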
14. How does EMR work?
1. Put the data into S3.
2. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
3. Launch the cluster using the EMR console, CLI, SDK, or APIs.
4. Get the output from S3.
You can also store everything in HDFS instead.
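The same steps can be sketched programmatically. The dict below follows the shape of boto3's EMR `run_job_flow` request; the instance type, release label, and bucket names are illustrative assumptions, not values from this deck:

```python
def build_cluster_request(name, log_bucket, node_count, node_type="m5.xlarge"):
    """Assemble an EMR cluster request: pick node count/types,
    point logs at S3, and terminate when all steps finish."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/logs/",
        "ReleaseLabel": "emr-6.10.0",  # selects the Hadoop distribution
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            # Ad-hoc pattern: shut the cluster down once steps complete.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request("wordcount", "my-emr-bucket", node_count=10)
```

A real launch would pass this to `boto3.client("emr").run_job_flow(**request)`; building the request separately makes the "choose, launch, collect output" steps easy to test without touching AWS.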
19. Resize Nodes with Spot Instances
Without Spot: 10-node cluster running for 14 hours
Cost = $1.20 × 10 nodes × 14 hrs = $168
Add 10 nodes on Spot: 20-node cluster running for 7 hours
On-demand: $1.20 × 10 × 7 = $84
Spot: $0.60 × 10 × 7 = $42
Total = $126
25% reduction in price, 50% reduction in time
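The arithmetic above generalizes to any mix of on-demand and Spot nodes. A small sketch (the $1.20 and $0.60 hourly rates are the slide's example prices, not current AWS pricing):

```python
def cluster_cost(on_demand_nodes, spot_nodes, hours,
                 on_demand_rate=1.20, spot_rate=0.60):
    """Total cost of a mixed on-demand/Spot cluster run."""
    return (on_demand_nodes * on_demand_rate + spot_nodes * spot_rate) * hours

baseline = cluster_cost(10, 0, 14)   # $168, the slide's baseline
with_spot = cluster_cost(10, 10, 7)  # $126: same work in half the time
savings = 1 - with_spot / baseline   # 25% cheaper
```

Because Hadoop work divides across nodes, doubling the cluster roughly halves the runtime, so the Spot nodes buy time savings on top of the price cut.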
20. Ad-Hoc Clusters – What are they? (Pattern 1)
Load data from S3 into an EMR cluster; when processing is complete, you can terminate the cluster (and stop paying).
21. Ad-Hoc Clusters – When to use (Pattern 1)
• Not using HDFS
• Not using the cluster 24/7
• Transient jobs
22. “Alive” Clusters – What are they? (Pattern 2)
If you run your jobs 24×7, you can also run a persistent EMR cluster and use Reserved Instance (RI) pricing models to save costs.
24. S3 instead of HDFS (Pattern 3)
• S3 provides 99.999999999% (eleven nines) durability
• Elastic
• Version control against failure
• Run multiple clusters with a single source of truth
• Quick recovery from failure
• Continuously resize clusters
25. S3 and HDFS (Pattern 4)
Keep the master copy of the data in S3 and load it into the EMR cluster's HDFS using S3DistCp. You get the benefits of HDFS while keeping all the benefits of S3.
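On current EMR releases, an S3DistCp copy is typically expressed as a cluster step invoking `command-runner.jar`; this is an assumption about the modern tooling rather than the deck's 2013-era syntax, and the bucket paths are placeholders:

```python
def s3distcp_step(src, dest):
    """EMR step definition that runs s3-dist-cp, e.g. to pull the
    S3 master copy of the data into the cluster's HDFS."""
    return {
        "Name": "Copy S3 master copy into HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

step = s3distcp_step("s3://my-bucket/input/", "hdfs:///input/")
```

The same step shape with the arguments reversed copies results back out, which is how the "master copy in S3, working copy in HDFS" pattern stays consistent across cluster restarts.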
33. Disclaimer and Urban Myth
Customers must make the decision to upload data to Packetloop.
We do not transparently intercept customer traffic, nor is it possible within
AWS to do this.
AWS does not give us access to any other AWS customer traffic.
34. What is Packetloop?
• Big Data Security Analytics
• Uses complete data set from the network flow via packet capture
• 100% delivered in the Cloud
• Instantly available, always up to date
• Powerful visualizations
• Intuitive to use
• Reduces security analysis to minutes
36. What business problems are we solving?
• Security related information is growing exponentially
• The current generation of technology is struggling to deliver the intelligence
organizations need, and these technologies create friction due to:
– Solution complexity
– Amount of integration and customization required
– Lack of context and fidelity
• Threats are becoming more complex, including blended attacks and long
running attacks (spanning months and potentially terabytes of flow data)
• Analysts have less time and are forced to be more reactive
37. Who are we targeting?
• Any organization that wants to know definitively what is happening on its
networks, using information determined in real time as well as information
added over time.
• Customers that are currently not receiving what was promised by SIEM
solutions in terms of analytics, size and scale, fidelity and drill-down capabilities.
• Organizations that are already leveraging Cloud providers such as Amazon
AWS.
• Security consultants, Analysts, Penetration Testers who want to take packet
captures and quickly analyze them by uploading to the cloud.
38. What business challenges did we face?
The Vision:
• Fastest processing possible
• Infinite scale and storage
• Global presence
• Always be available and up to date
• Commodity affordability
The Reality:
• Small team of people
• Limited capital
• Based only in Sydney
• Current databases don’t scale the way we needed
39. Why choose AWS?
• Brand – number 1 in Cloud market
• Presence - everywhere we need to be
• Availability options – allows us to build in the resilience we need
• Flexibility and elasticity – only use what we need and when we need it, whilst
supporting limitless horizontal growth
• Feature sets - always expanding, allows us to constantly refine our offering
• Support – AWS supports our business growth
• Cost – low to start with, always improving, easy to understand and predict
40. What do we use?
PgSQLCASS CASSLOOP IPS
WEB WEB
Subnet A/24
Subnet B/24
ZONE: US-WEST-2a ZONE: US-WEST-2b
NAT to Elastic IP's NAT to Elastic IP's
www.packetloop.com?
Loop Network
PgSQLCASS CASSLOOP IPS
WEB WEB
Subnet C/24
Subnet D/24
Loop Network
VPC
ROUTER
Cassandra Replicates between availability zones
Postgres is Active/Active between availability zones
Elastic Load Balancer
EMR-1 EMR-N EMR-1 EMR-N
41. What do we use?
• Elastic MapReduce (EMR) – Hadoop to process jobs to extract security
analytics
• Cassandra – Patented insertion method for storing security metrics data
• PgSQL – user databases, customer settings
• IPS – 2 open source and 2 commercial to obtain indicators and warnings
• S3 – Packet capture storage, both long term and temporary
• VPC – handles replication and active/active traffic between Availability Zones
• Elastic Load Balancer – allows us to scale out Web instances as needed
• Cloudflare (not shown) – cache and acceleration
42. What has AWS allowed us to achieve?
• Global presence and big company performance
• To be the first truly Cloud centric Security Analytics tool
• Deliver a revolutionary security analytics tool to any user/analyst on the Internet
as a commodity service (charged per GB/per month)
• To dynamically change development and architecture direction without worrying
about any capital investment we may have already made, and while maintaining
a full production instance
• Determine exactly what we spend and 100% link it to customer demand
• To remain a self funded startup
43. What’s next?
• Shift from batch processing and post-hoc analysis to real-time processing
• Addition of on-premise appliances, virtual machines and AMIs to perform local
capture, preprocessing and transmission of security metrics to the Cloud
• Additional modules for analyzing Sessions, Protocols and Files
• Move to Probabilistic Threat Analysis using machine learning
44. Do your own Big Data Security Analytics…
• Packetpig is an open source version of our Network Security Analytics toolset
available at github.com/packetloop/packetpig
• Optimised in October 2012 to use AWS Elastic MapReduce – configuration how-to at
blog.packetloop.com/2012/10/packetpig-on-amazon-elastic-map-reduce.html
• Configurable scripts to specify what size AWS instances are used for Hadoop,
and how many instances are to be spawned to run the mappers and reducers
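The "how many instances" knob usually follows input size. A purely hypothetical sizing helper, not from Packetpig's scripts (the 5 GB-per-node heuristic, the cap, and all names are illustrative assumptions):

```python
import math

def instances_for_capture(capture_gb, gb_per_node=5, max_nodes=50):
    """Pick a Hadoop instance count proportional to the packet
    capture size, capped to keep cluster costs bounded."""
    return min(max_nodes, max(1, math.ceil(capture_gb / gb_per_node)))
```

For example, a 12 GB capture under this heuristic would spawn 3 nodes, while any capture over 250 GB hits the 50-node cap; the real scripts would tune these thresholds to the chosen instance size.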
47. Analysis of Data Can Transform Society
• Create new business models and improve organizational processes
• Enhance scientific understanding, drive innovation, and accelerate …
• Increase public safety and improve energy efficiency with smart grids
49. Intel at the Intersection of Big Data
• HPC – Enabling exascale computing on massive data sets
• Cloud – Helping enterprises build open, interoperable clouds
• Open Source – Contributing code and fostering the ecosystem
50. Intel at the Heart of the Cloud
Server
Storage
Network
51. Scale-Out Platform Optimizations for Big Data
Cost-effective performance:
• Intel® Advanced Vector Extensions Technology
• Intel® Turbo Boost Technology 2.0
• Intel® Advanced Encryption Standard New Instructions Technology
52. Intel® Advanced Vector Extensions Technology
• Newest in a long line of processor instruction innovations
• Increases floating point operations per clock by up to 2x¹
¹ Performance comparison using Linpack benchmark. See backup for configuration details.
For more legal information on performance forecasts go to http://www.intel.com/performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
53. Intel® Turbo Boost Technology 2.0
More performance: higher turbo speeds maximize performance for single- and multi-threaded applications.
55. Power of the Platform Built by Intel
Richer user experiences: TeraSort for a 1 TB sort
• Previous Intel® Xeon® processor: 4 hrs baseline
• Intel® Xeon® processor E5-2600: 50% reduction
• Solid-State Drive: 80% reduction
• 10G Ethernet: 50% reduction
• Intel® Apache Hadoop: 40% reduction
• End result: ~10 min
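Compounding the per-component reductions the slide lists reproduces its order-of-magnitude claim (4 hours down to roughly 10 minutes). A small check; note that naive compounding of the rounded percentages lands near 7 minutes, the same ballpark as the slide's ~10-minute headline:

```python
from functools import reduce

# Each entry: (upgrade, fraction of runtime removed), per the slide.
reductions = [
    ("Intel Xeon E5-2600 (vs previous Xeon)", 0.50),
    ("Solid-State Drive", 0.80),
    ("10G Ethernet", 0.50),
    ("Intel Apache Hadoop", 0.40),
]

# Start from the 4-hour (240-minute) baseline and apply each cut.
minutes = reduce(lambda t, r: t * (1 - r[1]), reductions, 4 * 60)
```

The listed percentages are rounded marketing figures, so the computed value and the headline differ slightly; the point of the slide is that the gains multiply rather than add.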
The key messages that we want to deliver with this slide are:
1. Elastic MapReduce is a hosted Hadoop service. We use the most stable version of Apache Hadoop, provide a hosted service, and build integration points with other services in the AWS ecosystem such as S3, CloudWatch, DynamoDB, etc. We make other improvements to Hadoop so that it becomes easier to scale and manage on AWS.
2. We will keep iterating on the different versions of Hadoop as they become stable. When you use the console you launch the latest version of Hadoop, but you also have the choice of launching an older version of Hadoop via the CLI or the SDK.
3. So what can you do with EMR? You can build applications on Amazon EMR, just like you would with Hadoop. In order to develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, only paying for the computational time you used. Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
EMR supports multiple instance types, including the latest HS1 instance type. EMR now supports High Storage Instances (hs1.8xlarge) in US East. These new instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop, and they significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
10 nodes × 10 hours = 100 node-hours = 100 nodes × 1 hour
And the concept of adding nodes works well with hadoop – especially on the cloud since 10 nodes running for 10 hours costs the same as 100 nodes running for 1 hour.
Speaker Notes: Often the question about Big Data is, “What can it do for me?” That's a very important question, because without the value proposition Big Data would just be an exercise. But I'm here to tell you Big Data services, provided by AWS and supported by Intel, are a game changer. For example: yes, Big Data offers insights into how we conduct business. But it also enables scientific discovery, opens up the possibility to treat and cure diseases, and enhances our communities with intelligent power grids and highways. These are just a handful of ideas; the frontier of Big Data is so much more. The technology provided means no limits on how you use the information. People are innovating new uses for Big Data every day.
Speaker notes: Intel's vision of Big Data is more than just the possibility of streamlined business. We see entire cities and communities connected, using the data we generate in every aspect – business and personal – to inform us and enable us to make better decisions about our lives. And all of this is made possible by the innovations developed in partnership between Intel and Amazon Web Services: a Big Data infrastructure vast enough to handle the data we produce, and cost-effective enough for us to use. Big Data really is about the future – challenges and great opportunities that AWS and Intel are ready and eager to tackle.
Speaker notes: As you can see, Intel is at the intersection of enabling Big Data:
- Exascale-level high-performance computing and cloud environments based on Intel® Xeon® processors.
- Plus, Intel is encouraging the growth of the open source ecosystem to foster innovation among developers, and to keep cloud services, like AWS, affordable for all.
Speaker Notes: And to be at that intersection, to ensure the proverbial traffic of Big Data flows smoothly, we've built the technological backbone for Big Data. The challenges of scale and the capabilities we've built into the Intel® Xeon® processor are needed across the entire data center – servers, storage devices and network solutions. It should be noted, Intel is #1 in servers, storage and networks.
- These industry-standard, modular building blocks allow efficient and cost-effective scaling of compute, storage and network systems to match user needs.
- Traditionally, storage devices used lower-performance, proprietary ASICs, but today the demand for performance has increased to tackle challenges like data de-duplication and improved archiving. This, in addition to distributed file systems for cloud-based storage and a desire for improved analytics, drives a need for more processing power, and vendors are increasingly turning to Intel® Xeon® processors. Plus, the improvements that Intel offers in our latest processors can benefit every aspect of what your infrastructure does. And these building blocks are what make amazing software like Hadoop work.
Speaker Notes: Key points: The Intel® Xeon® Processor E5 family provides cost-effective performance via Intel® Advanced Vector Extensions Technology, Intel® Turbo Boost Technology 2.0, and Intel® Advanced Encryption Standard New Instructions Technology.
Significant performance gains are delivered by features such as the new Intel® Advanced Vector Extensions and improved Intel® Turbo Boost Technology 2.0, providing performance when you need it. Dramatically reduce compute time with Intel® Advanced Vector Extensions, accelerating floating point calculation for scientific simulations and financial analytics. Intel® Turbo Boost Technology 2.0 delivers up to an 80% performance boost vs. the prior generation. To improve flexibility and operational efficiency, new Intel® Integrated I/O reduces latency ~30% while adding more lanes and higher bandwidth with support for PCI Express 3.0. Add cost-effective performance for standardizing scale-out nodes for Hadoop, Intel® AES-NI to accelerate security encryption workloads, optimized core-to-memory footprint ratios, and top memory channels and frequency for shared-nothing scaling.
Story: To meet the growing demands of IT – readiness for cloud computing, growth in users, and the ability to tackle the most complex technical problems – Intel has focused on increasing the capabilities of the processor that lies at the heart of a next-generation data center. The Intel® Xeon® processor E5-2600 product family is the next-generation Xeon® processor that replaces platforms based on the Intel® Xeon® processor 5600 and 5500 series. Continuing to build on the success of the Intel® Xeon® 5600, the E5-2600 product family has increased core count and cache size, in addition to supporting more efficient instructions with Intel® Advanced Vector Extensions, to deliver up to an average of 80% more performance across a range of workloads.
These processors will offer better-than-ever performance no matter what your constraint is – floor space, power or budget – and on workloads that range from the most complicated scientific exploration to simple, yet crucial, web serving and infrastructure applications. In addition to the raw performance gains, we've invested in improved I/O with Intel Integrated I/O, which reduces latency ~30% while adding more lanes and higher bandwidth with support for PCIe 3.0. This helps to reduce network and storage bottlenecks to unleash the performance capabilities of the latest Xeon processor. The Intel® Xeon® processor E5-2600 product family – versatile processors at the heart of today's data center.
Key points: Intel® Advanced Vector Extensions Technology is a collection of CPU instructions that increase floating point performance by doubling the length of the FP registers to 256 bits and reducing the number of operations required to execute large FP tasks. Applications include science/engineering, data mining, visual processing, and HPC.
Story: Another avenue Intel has taken to add more flexible performance is adding instructions that make the processor do more work every clock cycle. Intel® Advanced Vector Extensions can offer up to double the floating point operations per clock cycle by doubling the length of registers. This is used when you need to address very complex problems or deal with large-number calculations, integral to many technical, financial and scientific computing problems. Workloads that can see improvements from AVX range from manufacturing optimizations, to the analysis of competing options, to content creation and engineering simulations. Intel® AVX is the newest in a long line of instruction innovations going back to the mid-90s with MMX and SSE1, which are all now standard software practices. Intel AVX is supported by Intel and third-party compilers that take advantage of the latest instructions to optimize code, significantly reducing compute time and enabling faster time to results. With the Xeon processor E5-2600 family you can be confident that you'll benefit from those optimizations as new applications are introduced and updates to existing software packages are released.
Legal Info: (AVX Performance) Source: Performance comparison using Linpack benchmark. Baseline score of 159.4 based on Intel internal measurements as of 5 December 2011 using a Supermicro* X8DTN+ system with two Intel® Xeon® processor X5690, Turbo Enabled, EIST Enabled, Hyper-Threading Enabled, 48 GB RAM, Red Hat* Enterprise Linux Server 6.1.
New score of 347.7 based on Intel internal measurements as of 5 December 2011 using an Intel® Rose City platform with two Intel® Xeon® processor E5-2690, Turbo Enabled or Disabled, EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Red Hat* Enterprise Linux Server 6.1. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
Key points: Get more computing power when you need it, with performance that adapts to spikes in your workload via Intel® Turbo Boost Technology 2.0. New Intel® Turbo Boost Technology 2.0 delivers up to 2x more performance upside than the previous-generation turbo technology.
Story: Beyond simply making the processor more capable with more cores, cache, and memory, we've also focused on making the processor more adaptive and intelligent. Starting with the Intel® Xeon® processor 5500 series (formerly codenamed Nehalem-EP), we introduced a feature called Intel Turbo Boost Technology, which allowed the processor to increase frequency at the OS' request to handle workload spikes, as well as shift power across the processor: if you had one core working hard and one core idle, the processor could "turbo up" by redirecting power from the idle core to the active one. With the Xeon processor E5-2600 product family we have refined this technology to enable even higher turbo speeds – for example, the top Xeon processor 5690 with only one core active could turbo up ~266 MHz, while the top Xeon processor E5-2690 can turbo up as much as 900 MHz. This greater ability to turbo up is due to improved power and thermal management data across the platform – the processor keeps track of how hard it has been running and modulates how far it will push itself in a turbo situation, providing the maximum frequency while meeting Intel's stringent reliability standards. In addition, we've improved the turbo algorithm to assess whether core speed is the limiter, or whether the processor is waiting for data from memory or I/O, before it commits power to the burst of speed. The goal of turbo is to deal with workload spikes as quickly as possible and get back to a lower power state, which reduces average power draw and cost of operation.
Legal Info: Source: Performance comparison using SPECint*_rate_base2006 benchmark with turbo enabled and disabled.
Estimated scores of 393 (turbo enabled) and 376 (turbo disabled) based on Intel internal estimates as of 6 March 2012 using a Supermicro* X8DTN+ system with two Intel® Xeon® processor X5690, Turbo Enabled (or Disabled), EIST Enabled, Hyper-Threading Enabled, 48 GB RAM, Intel® Compiler 12.0, Red Hat* Enterprise Linux Server 6.1 for x86_64. Estimated scores of 659 (turbo enabled) and 594 (turbo disabled) based on Intel internal estimates using an Intel® Rose City platform with two Intel® Xeon® processor E5-2680, Turbo Enabled (or Disabled), EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Intel® Compiler 12.1, Red Hat* Enterprise Linux Server 6.1 for x86_64.
Intel AES-NI: What is it? Key point: data encryption shows a 10x speedup¹ in AES encryption. Intel AES-NI is a set of new instructions for enhancing the performance of cryptography using the widely accepted Advanced Encryption Standard (AES) algorithm. There are 7 new instructions in the processor that target some of the more complex and compute-expensive encryption, decryption, key expansion and multiplication steps (and there are multiple steps in every instance of working with encrypted data), increasing the performance and efficiency of these operations. Note that the instructions do not implement the entire AES algorithm in silicon – only the most processor-intensive elements have been targeted. This provides more flexibility and balance between HW performance and SW extensibility.
Another benefit of the new instructions is that they actually help protect the data better as well. The more efficient steps enabled by AES-NI make "side channel" snooping attacks harder. These attacks use SW agents to analyze how a system processes data, searching for cache and memory access patterns in order to gather patterns or other system data that help deduce elements of the cryptographic processing – and therefore make it easier to "crack". AES-NI helps hide critical elements such as table lookups, making it harder to determine what elements of crypto processing are happening. Taking down the performance tax frees IT managers to use encryption more broadly without sacrificing performance.
Speaker Notes: So let's see the rubber meet the road and look at how the technology enables high-performance computing. Right here you're seeing the Intel-based ecosystem at work:
- Start with a 4-hour process time to sort 1 terabyte of data.
- Upgrade the processor to the latest Intel® Xeon® processor to cut compute time in half.
- Add an SSD to reduce it by another 80%.
- Upgrade to 10 Gigabit Ethernet for additional reductions.
The end result is a fraction of the original compute time: 10 minutes to sort 1 terabyte of data. These datacenter innovations streamline the process and make affordable Big Data analytics possible. As this testing shows, as important as the processor is in improving the customer experience, it's not the entire solution. By understanding the benefits of SSDs, 10GbE and Intel SW tools, we can give an even better experience with Intel-optimized platforms, and boost business results.
Speaker Notes: If you wanted to see this process of transforming Big Data in action, it would look something like this:
- Big Data provides rich, personalized, immersive experiences for clients.
- This in turn creates more rich interactions, and generates more data into the cloud.
- Which leads to higher volumes of data to analyze through intelligent systems,
- Which leads to even more rich, personalized, and immersive experiences.
As you can see, the cycle feeds into itself. And this brings users into the fold. We're not just talking businesses anymore; we're looking at how Big Data affects us all on a day-to-day basis.