This document discusses big data analytics tools and technologies. It begins with an overview of big data challenges and available tools. It then discusses Packetloop, a company that provides big data security analytics using tools like Amazon EMR, Cassandra, and PostgreSQL on AWS. Next, it discusses how EMR and Redshift from AWS can be used as big data tools for tasks like batch processing, data warehousing, and live analytics. It concludes by discussing how Intel technologies can help power big data platforms by providing optimized processors, networking, and storage to enable analytics at scale.
2. Overview
• The Big Data Challenge
• Big Data tools and what we can do with them
• Packetloop – Big Data Security Analytics
• Intel technology for Big Data
3. An engineer’s definition
When your data sets become so large that you have to start
innovating how to collect, store, organize, analyze and
share them
7. [Chart: data generated vs. data available for analysis, both growing over time]
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
12. What is Amazon Redshift?
Amazon Redshift is a fast, powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
• Easy to provision and scale
• No upfront costs, pay as you go
• High performance at a low price
• Open and flexible, with support for popular BI tools
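A common way to get data into Redshift is the COPY command, which loads files from S3 in parallel across the cluster. A minimal sketch in Python (the table, bucket, and IAM role names are placeholders, and IAM_ROLE authorization reflects current Redshift docs rather than this deck):

```python
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement that bulk-loads CSV data
    from S3; Redshift reads the files under s3_path in parallel."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS CSV;"
    )

# Hypothetical names for illustration only.
sql = build_copy_statement(
    table="clickstream",
    s3_path="s3://my-bucket/clicks/2013/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftLoad",
)
```

Since Redshift speaks the PostgreSQL wire protocol, the resulting statement can be executed through any Postgres driver or the standard BI tools the slide mentions.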
14. How does EMR work?
1. Put the data into S3.
2. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
3. Launch the cluster using the EMR console, CLI, SDK, or APIs.
4. Get the output from S3.
You can also store everything in HDFS instead.
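The same steps can be sketched programmatically. The dict below follows the shape of boto3's EMR `run_job_flow` request; the instance type, release label, and bucket names are illustrative assumptions, not values from this deck:

```python
def build_cluster_request(name, log_bucket, node_count, node_type="m5.xlarge"):
    """Assemble an EMR cluster request: pick node count/types,
    point logs at S3, and terminate when all steps finish."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/logs/",
        "ReleaseLabel": "emr-6.10.0",  # selects the Hadoop distribution
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            # Ad-hoc pattern: shut the cluster down once steps complete.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request("wordcount", "my-emr-bucket", node_count=10)
```

A real launch would pass this to `boto3.client("emr").run_job_flow(**request)`; building the request separately makes the "choose, launch, collect output" steps easy to test without touching AWS.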
19. Resize Nodes with Spot Instances
Without Spot: 10-node cluster running for 14 hours
Cost = $1.20 × 10 nodes × 14 hrs = $168
Add 10 nodes on Spot: 20-node cluster running for 7 hours
On-demand: $1.20 × 10 × 7 = $84
Spot: $0.60 × 10 × 7 = $42
Total = $126
25% reduction in price, 50% reduction in time
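The arithmetic above generalizes to any mix of on-demand and Spot nodes. A small sketch (the $1.20 and $0.60 hourly rates are the slide's example prices, not current AWS pricing):

```python
def cluster_cost(on_demand_nodes, spot_nodes, hours,
                 on_demand_rate=1.20, spot_rate=0.60):
    """Total cost of a mixed on-demand/Spot cluster run."""
    return (on_demand_nodes * on_demand_rate + spot_nodes * spot_rate) * hours

baseline = cluster_cost(10, 0, 14)   # $168, the slide's baseline
with_spot = cluster_cost(10, 10, 7)  # $126: same work in half the time
savings = 1 - with_spot / baseline   # 25% cheaper
```

Because Hadoop work divides across nodes, doubling the cluster roughly halves the runtime, so the Spot nodes buy time savings on top of the price cut.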
20. Ad-Hoc Clusters – What are they? (Pattern 1)
Load data from S3 into an EMR cluster; when processing is complete, you can terminate the cluster (and stop paying).
21. Ad-Hoc Clusters – When to use (Pattern 1)
• Not using HDFS
• Not using the cluster 24/7
• Transient jobs
22. “Alive” Clusters – What are they? (Pattern 2)
If you run your jobs 24×7, you can also run a persistent EMR cluster and use Reserved Instance (RI) pricing models to save costs.
24. S3 instead of HDFS (Pattern 3)
• S3 provides 99.999999999% (eleven nines) durability
• Elastic
• Version control against failure
• Run multiple clusters with a single source of truth
• Quick recovery from failure
• Continuously resize clusters
25. S3 and HDFS (Pattern 4)
Keep the master copy of the data in S3 and load it into the EMR cluster's HDFS using S3DistCp. You get the benefits of HDFS while keeping all the benefits of S3.
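On current EMR releases, an S3DistCp copy is typically expressed as a cluster step invoking `command-runner.jar`; this is an assumption about the modern tooling rather than the deck's 2013-era syntax, and the bucket paths are placeholders:

```python
def s3distcp_step(src, dest):
    """EMR step definition that runs s3-dist-cp, e.g. to pull the
    S3 master copy of the data into the cluster's HDFS."""
    return {
        "Name": "Copy S3 master copy into HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

step = s3distcp_step("s3://my-bucket/input/", "hdfs:///input/")
```

The same step shape with the arguments reversed copies results back out, which is how the "master copy in S3, working copy in HDFS" pattern stays consistent across cluster restarts.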
33. Disclaimer and Urban Myth
Customers must make the decision to upload data to Packetloop.
We do not transparently intercept customer traffic, nor is it possible within
AWS to do this.
AWS does not give us access to any other AWS customer traffic.
34. What is Packetloop?
• Big Data Security Analytics
• Uses complete data set from the network flow via packet capture
• 100% delivered in the Cloud
• Instantly available, always up to date
• Powerful visualizations
• Intuitive to use
• Reduces security analysis to minutes
36. What business problems are we solving?
• Security related information is growing exponentially
• The current generation of technology is struggling to deliver the intelligence
organizations need, and these technologies create friction due to:
– Solution complexity
– Amount of integration and customization required
– Lack of context and fidelity
• Threats are becoming more complex, including blended attacks and long
running attacks (spanning months and potentially terabytes of flow data)
• Analysts have less time and are forced to be more reactive
37. Who are we targeting?
• Any organization that wants to know definitively what is happening on its
networks, using information determined in real time as well as information
added over time.
• Customers that are currently not receiving what was promised by SIEM
solutions in terms of analytics, size and scale, fidelity and drill-down capabilities.
• Organizations that are already leveraging Cloud providers such as Amazon
AWS.
• Security consultants, Analysts, Penetration Testers who want to take packet
captures and quickly analyze them by uploading to the cloud.
38. What business challenges did we face?
The Vision:
• Fastest processing possible
• Infinite scale and storage
• Global presence
• Always be available and up to date
• Commodity affordability
The Reality:
• Small team of people
• Limited capital
• Based only in Sydney
• Current databases don’t scale the way we needed
39. Why choose AWS?
• Brand – number 1 in Cloud market
• Presence - everywhere we need to be
• Availability options – allows us to build in the resilience we need
• Flexibility and elasticity – only use what we need and when we need it, whilst
supporting limitless horizontal growth
• Feature sets - always expanding, allows us to constantly refine our offering
• Support – AWS supports our business growth
• Cost – low to start with, always improving, easy to understand and predict
40. What do we use?
PgSQLCASS CASSLOOP IPS
WEB WEB
Subnet A/24
Subnet B/24
ZONE: US-WEST-2a ZONE: US-WEST-2b
NAT to Elastic IP's NAT to Elastic IP's
www.packetloop.com?
Loop Network
PgSQLCASS CASSLOOP IPS
WEB WEB
Subnet C/24
Subnet D/24
Loop Network
VPC
ROUTER
Cassandra Replicates between availability zones
Postgres is Active/Active between availability zones
Elastic Load Balancer
EMR-1 EMR-N EMR-1 EMR-N
41. What do we use?
• Elastic MapReduce (EMR) – Hadoop to process jobs to extract security
analytics
• Cassandra – Patented insertion method for storing security metrics data
• PgSQL – user databases, customer settings
• IPS – 2 open source and 2 commercial to obtain indicators and warnings
• S3 – Packet capture storage, both long term and temporary
• VPC – handles replication and active/active traffic between Availability Zones
• Elastic Load Balancer – allows us to scale out Web instances as needed
• Cloudflare (not shown) – cache and acceleration
42. What has AWS allowed us to achieve?
• Global presence and big company performance
• To be the first truly Cloud centric Security Analytics tool
• Deliver a revolutionary security analytics tool to any user/analyst on the Internet
as a commodity service (charged per GB/per month)
• To dynamically change development and architecture direction without worrying
about any capital investment we may have already made, and while maintaining
a full production instance
• Determine exactly what we spend and 100% link it to customer demand
• To remain a self funded startup
43. What’s next?
• Shift from batch processing and post-hoc analysis to real-time processing
• Addition of on-premise appliances, virtual machines and AMIs to perform local
capture, preprocessing and transmission of security metrics to the Cloud
• Additional modules for analyzing Sessions, Protocols and Files
• Move to Probabilistic Threat Analysis using machine learning
44. Do your own Big Data Security Analytics…
• Packetpig is an open source version of our Network Security Analytics toolset
available at github.com/packetloop/packetpig
• Optimised in October 2012 to use AWS Elastic MapReduce – configuration how-to at
blog.packetloop.com/2012/10/packetpig-on-amazon-elastic-map-reduce.html
• Configurable scripts to specify what size AWS instances are used for Hadoop,
and how many instances are to be spawned to run the mappers and reducers
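The "how many instances" knob usually follows input size. A purely hypothetical sizing helper, not from Packetpig's scripts (the 5 GB-per-node heuristic, the cap, and all names are illustrative assumptions):

```python
import math

def instances_for_capture(capture_gb, gb_per_node=5, max_nodes=50):
    """Pick a Hadoop instance count proportional to the packet
    capture size, capped to keep cluster costs bounded."""
    return min(max_nodes, max(1, math.ceil(capture_gb / gb_per_node)))
```

For example, a 12 GB capture under this heuristic would spawn 3 nodes, while any capture over 250 GB hits the 50-node cap; the real scripts would tune these thresholds to the chosen instance size.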
47. Analysis of Data Can Transform Society
• Create new business models and improve organizational processes
• Enhance scientific understanding, drive innovation, and accelerate …
• Increase public safety and improve energy efficiency with smart grids
49. Intel at the Intersection of Big Data
• HPC – Enabling exascale computing on massive data sets
• Cloud – Helping enterprises build open, interoperable clouds
• Open Source – Contributing code and fostering the ecosystem
50. Intel at the Heart of the Cloud
Server
Storage
Network
51. Scale-Out Platform Optimizations for Big Data
Cost-effective performance:
• Intel® Advanced Vector Extensions Technology
• Intel® Turbo Boost Technology 2.0
• Intel® Advanced Encryption Standard New Instructions Technology
52. Intel® Advanced Vector Extensions Technology
• Newest in a long line of processor instruction innovations
• Increases floating point operations per clock by up to 2x¹
¹ Performance comparison using Linpack benchmark. See backup for configuration details.
For more legal information on performance forecasts go to http://www.intel.com/performance
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
53. Intel® Turbo Boost Technology 2.0
More performance: higher turbo speeds maximize performance for single- and multi-threaded applications.
55. Power of the Platform Built by Intel
Richer user experiences: TeraSort for a 1 TB sort
• Previous Intel® Xeon® processor: 4 hrs baseline
• Intel® Xeon® processor E5-2600: 50% reduction
• Solid-State Drive: 80% reduction
• 10G Ethernet: 50% reduction
• Intel® Apache Hadoop: 40% reduction
• End result: ~10 min
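Compounding the per-component reductions the slide lists reproduces its order-of-magnitude claim (4 hours down to roughly 10 minutes). A small check; note that naive compounding of the rounded percentages lands near 7 minutes, the same ballpark as the slide's ~10-minute headline:

```python
from functools import reduce

# Each entry: (upgrade, fraction of runtime removed), per the slide.
reductions = [
    ("Intel Xeon E5-2600 (vs previous Xeon)", 0.50),
    ("Solid-State Drive", 0.80),
    ("10G Ethernet", 0.50),
    ("Intel Apache Hadoop", 0.40),
]

# Start from the 4-hour (240-minute) baseline and apply each cut.
minutes = reduce(lambda t, r: t * (1 - r[1]), reductions, 4 * 60)
```

The listed percentages are rounded marketing figures, so the computed value and the headline differ slightly; the point of the slide is that the gains multiply rather than add.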
The key messages that we want to deliver with this slide are:
1. Elastic MapReduce is a hosted Hadoop service. We use the most stable version of Apache Hadoop, provide a hosted service, and build integration points with other services in the AWS ecosystem such as S3, CloudWatch, DynamoDB, etc. We make other improvements to Hadoop so that it becomes easier to scale and manage on AWS.
2. We will keep iterating on the different versions of Hadoop as they become stable. When you use the console you launch the latest version of Hadoop, but you also have the choice of launching an older version of Hadoop via the CLI or the SDK.
3. So what can you do with EMR? You can build applications on Amazon EMR, just like you would with Hadoop. In order to develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, only paying for the computational time you used. Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
EMR supports multiple instance types, including the latest HS1 instance type. EMR now supports High Storage Instances (hs1.8xlarge) in US East. These new instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop, and they significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
10 nodes × 10 hours = 100 node-hours = 100 nodes × 1 hour
And the concept of adding nodes works well with hadoop – especially on the cloud since 10 nodes running for 10 hours costs the same as 100 nodes running for 1 hour.
Speaker Notes: Often the question about Big Data is, “What can it do for me?” That's a very important question, because without the value proposition Big Data would just be an exercise. But I'm here to tell you Big Data services, provided by AWS and supported by Intel, are a game changer. For example: yes, Big Data offers insights into how we conduct business. But it also enables scientific discovery, opens up the possibility to treat and cure diseases, and enhances our communities with intelligent power grids and highways. These are just a handful of ideas; the frontier of Big Data is so much more. The technology provided means no limits on how you use the information. People are innovating new uses for Big Data every day.
Speaker notes: Intel's vision of Big Data is more than just the possibility of streamlined business. We see entire cities and communities connected, using the data we generate in every aspect – business and personal – to inform us and enable us to make better decisions about our lives. And all of this is made possible by the innovations developed in partnership between Intel and Amazon Web Services: a Big Data infrastructure vast enough to handle the data we produce, and cost-effective enough for us to use. Big Data really is about the future – challenges and great opportunities that AWS and Intel are ready and eager to tackle.
Speaker notes: As you can see, Intel is at the intersection of enabling Big Data:
- Exascale-level high-performance computing and cloud environments based on Intel® Xeon® processors.
- Plus, Intel is encouraging the growth of the open source ecosystem to foster innovation among developers, and to keep cloud services, like AWS, affordable for all.
Speaker Notes: And to be at that intersection, to ensure the proverbial traffic of Big Data flows smoothly, we've built the technological backbone for Big Data. The challenges of scale and the capabilities we've built into the Intel® Xeon® processor are needed across the entire data center – servers, storage devices and network solutions. It should be noted, Intel is #1 in servers, storage and networks.
- These industry-standard, modular building blocks allow efficient and cost-effective scaling of compute, storage and network systems to match user needs.
- Traditionally, storage devices used lower-performance, proprietary ASICs, but today the demand for performance has increased to tackle challenges like data de-duplication and improved archiving. This, in addition to distributed file systems for cloud-based storage and a desire for improved analytics, drives a need for more processing power, and vendors are increasingly turning to Intel® Xeon® processors. Plus, the improvements that Intel offers in our latest processors can benefit every aspect of what your infrastructure does. And these building blocks are what make amazing software like Hadoop work.
Speaker Notes: Key points: The Intel® Xeon® Processor E5 family provides cost-effective performance via Intel® Advanced Vector Extensions Technology, Intel® Turbo Boost Technology 2.0, and Intel® Advanced Encryption Standard New Instructions Technology.
Significant performance gains are delivered by features such as the new Intel® Advanced Vector Extensions and improved Intel® Turbo Boost Technology 2.0, providing performance when you need it. Dramatically reduce compute time with Intel® Advanced Vector Extensions, accelerating floating point calculation for scientific simulations and financial analytics. Intel® Turbo Boost Technology 2.0 delivers up to an 80% performance boost vs. the prior generation. To improve flexibility and operational efficiency, new Intel® Integrated I/O reduces latency ~30% while adding more lanes and higher bandwidth with support for PCI Express 3.0. Add cost-effective performance for standardizing scale-out nodes for Hadoop, Intel® AES-NI to accelerate security encryption workloads, optimized core-to-memory footprint ratios, and top memory channels and frequency for shared-nothing scaling.
Story: To meet the growing demands of IT – readiness for cloud computing, growth in users, and the ability to tackle the most complex technical problems – Intel has focused on increasing the capabilities of the processor that lies at the heart of a next-generation data center. The Intel® Xeon® processor E5-2600 product family is the next-generation Xeon® processor that replaces platforms based on the Intel® Xeon® processor 5600 and 5500 series. Continuing to build on the success of the Intel® Xeon® 5600, the E5-2600 product family has increased core count and cache size, in addition to supporting more efficient instructions with Intel® Advanced Vector Extensions, to deliver up to an average of 80% more performance across a range of workloads.
These processors will offer better-than-ever performance no matter what your constraint is – floor space, power or budget – and on workloads that range from the most complicated scientific exploration to simple, yet crucial, web serving and infrastructure applications. In addition to the raw performance gains, we've invested in improved I/O with Intel Integrated I/O, which reduces latency ~30% while adding more lanes and higher bandwidth with support for PCIe 3.0. This helps to reduce network and storage bottlenecks to unleash the performance capabilities of the latest Xeon processor. The Intel® Xeon® processor E5-2600 product family – versatile processors at the heart of today's data center.
Key points: Intel® Advanced Vector Extensions Technology is a collection of CPU instructions that increase floating point performance by doubling the length of the FP registers to 256 bits and reducing the number of operations required to execute large FP tasks. Applications include science/engineering, data mining, visual processing, and HPC.
Story: Another avenue Intel has taken to add more flexible performance is adding instructions that make the processor do more work every clock cycle. Intel® Advanced Vector Extensions can offer up to double the floating point operations per clock cycle by doubling the length of registers. This is used when you need to address very complex problems or deal with large-number calculations, integral to many technical, financial and scientific computing problems. Workloads that can see improvements from AVX range from manufacturing optimizations, to the analysis of competing options, to content creation and engineering simulations. Intel® AVX is the newest in a long line of instruction innovations going back to the mid-90s with MMX and SSE1, which are all now standard software practices. Intel AVX is supported by Intel and third-party compilers that take advantage of the latest instructions to optimize code, significantly reducing compute time and enabling faster time to results. With the Xeon processor E5-2600 family you can be confident that you'll benefit from those optimizations as new applications are introduced and updates to existing software packages are released.
Legal Info: (AVX Performance) Source: Performance comparison using Linpack benchmark. Baseline score of 159.4 based on Intel internal measurements as of 5 December 2011 using a Supermicro* X8DTN+ system with two Intel® Xeon® processor X5690, Turbo Enabled, EIST Enabled, Hyper-Threading Enabled, 48 GB RAM, Red Hat* Enterprise Linux Server 6.1.
New score of 347.7 based on Intel internal measurements as of 5 December 2011 using an Intel® Rose City platform with two Intel® Xeon® processor E5-2690, Turbo Enabled or Disabled, EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Red Hat* Enterprise Linux Server 6.1. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
Key points: Get more computing power when you need it, with performance that adapts to spikes in your workload via Intel® Turbo Boost Technology 2.0. New Intel® Turbo Boost Technology 2.0 delivers up to 2x more performance upside than the previous-generation turbo technology.
Story: Beyond simply making the processor more capable with more cores, cache, and memory, we've also focused on making the processor more adaptive and intelligent. Starting with the Intel® Xeon® processor 5500 series (formerly codenamed Nehalem-EP), we introduced a feature called Intel Turbo Boost Technology, which allowed the processor to increase frequency at the OS' request to handle workload spikes, as well as shift power across the processor: if you had one core working hard and one core idle, the processor could "turbo up" by redirecting power from the idle core to the active one. With the Xeon processor E5-2600 product family we have refined this technology to enable even higher turbo speeds – for example, the top Xeon processor 5690 with only one core active could turbo up ~266 MHz, while the top Xeon processor E5-2690 can turbo up as much as 900 MHz. This greater ability to turbo up is due to improved power and thermal management data across the platform – the processor keeps track of how hard it has been running and modulates how far it will push itself in a turbo situation, providing the maximum frequency while meeting Intel's stringent reliability standards. In addition, we've improved the turbo algorithm to assess whether core speed is the limiter, or whether the processor is waiting for data from memory or I/O, before it commits power to the burst of speed. The goal of turbo is to deal with workload spikes as quickly as possible and get back to a lower power state, which reduces average power draw and cost of operation.
Legal Info: Source: Performance comparison using SPECint*_rate_base2006 benchmark with turbo enabled and disabled.
Estimated scores of 393 (turbo enabled) and 376 (turbo disabled) based on Intel internal estimates as of 6 March 2012 using a Supermicro* X8DTN+ system with two Intel® Xeon® processor X5690, Turbo Enabled (or Disabled), EIST Enabled, Hyper-Threading Enabled, 48 GB RAM, Intel® Compiler 12.0, Red Hat* Enterprise Linux Server 6.1 for x86_64. Estimated scores of 659 (turbo enabled) and 594 (turbo disabled) based on Intel internal estimates using an Intel® Rose City platform with two Intel® Xeon® processor E5-2680, Turbo Enabled (or Disabled), EIST Enabled, Hyper-Threading Enabled, 64 GB RAM, Intel® Compiler 12.1, Red Hat* Enterprise Linux Server 6.1 for x86_64.
Intel AES-NI: What is it? Key point: data encryption shows a 10x speedup¹ in AES encryption. Intel AES-NI is a set of new instructions for enhancing the performance of cryptography using the widely accepted Advanced Encryption Standard (AES) algorithm. There are 7 new instructions in the processor that target some of the more complex and compute-expensive encryption, decryption, key expansion and multiplication steps (and there are multiple steps in every instance of working with encrypted data), increasing the performance and efficiency of these operations. Note that the instructions do not implement the entire AES algorithm in silicon – only the most processor-intensive elements have been targeted. This provides more flexibility and balance between HW performance and SW extensibility.
Another benefit of the new instructions is that they actually help protect the data better as well. The more efficient steps enabled by AES-NI make "side channel" snooping attacks harder. These attacks use SW agents to analyze how a system processes data, searching for cache and memory access patterns in order to gather patterns or other system data that help deduce elements of the cryptographic processing – and therefore make it easier to "crack". AES-NI helps hide critical elements such as table lookups, making it harder to determine what elements of crypto processing are happening. Taking down the performance tax frees IT managers to use encryption more broadly without sacrificing performance.
Speaker Notes: So let's see the rubber meet the road and look at how the technology enables high-performance computing. Right here you're seeing the Intel-based ecosystem at work:
- Start with a 4-hour process time to sort 1 terabyte of data.
- Upgrade the processor to the latest Intel® Xeon® processor to cut compute time in half.
- Add an SSD to reduce it by another 80%.
- Upgrade to 10 Gigabit Ethernet for additional reductions.
The end result is a fraction of the original compute time: 10 minutes to sort 1 terabyte of data. These datacenter innovations streamline the process and make affordable Big Data analytics possible. As this testing shows, as important as the processor is in improving the customer experience, it's not the entire solution. By understanding the benefits of SSDs, 10GbE and Intel SW tools, we can give an even better experience with Intel-optimized platforms, and boost business results.
Speaker Notes: If you wanted to see this process of transforming Big Data in action, it would look something like this:
- Big Data provides rich, personalized, immersive experiences for clients.
- This in turn creates more rich interactions, and generates more data into the cloud.
- Which leads to higher volumes of data to analyze through intelligent systems,
- Which leads to even more rich, personalized, and immersive experiences.
As you can see, the cycle feeds into itself. And this brings users into the fold. We're not just talking businesses anymore; we're looking at how Big Data affects us all on a day-to-day basis.