AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AWS Summit 2013 Tel Aviv
Oct 16 – Tel Aviv, Israel

Data Analytics on Big Data
Jan Borch | AWS Solutions Architect

GENERATE  STORE  ANALYZE  SHARE

THE COST OF DATA
GENERATION IS FALLING

Progress is not evenly distributed

              1980              Today       Change
Cost          $14,000,000/TB    $30/TB      ÷ 450,000
Capacity      100 MB            3 TB        × 30,000
Throughput    4 MB/s            200 MB/s    × 50

THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT

Lower cost,
higher throughput


Highly
constrained

DATA VOLUME

Generated data

Available for analysis

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

GENERATE

STORE  ANALYZE  SHARE

ACCELERATE

+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND

= REMOVE CONSTRAINTS

AWS EC2
AWS CloudFront



Fluentd
Flume
Scribe
Chukwa
LogStash

output {
  s3 {
    bucket => "myBucket"
    aws_credentials_file => "~/cred.json"
    size_file => 125829120   # 120 MB, expressed in bytes
  }
}

Embed poor-man pixel
http://www.poor-mananalytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=enus&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-70197651&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analyticsarchitecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
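
On the collecting side, each pixel request is just a URL whose query string carries the measurements. A minimal sketch of decoding such a hit, assuming the request line has already been pulled out of a web or CloudFront log; the function name `parse_pixel_hit` and the shortened URL are illustrative, not from the talk:

```python
from urllib.parse import urlsplit, parse_qs


def parse_pixel_hit(url):
    """Return the tracking parameters carried in a pixel request URL."""
    query = urlsplit(url).query
    # parse_qs percent-decodes values and returns a list per key;
    # keep only the first value of each parameter.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Shortened version of the request shown on the slide
# (same host, only a few of its parameters).
hit = parse_pixel_hit(
    "http://www.poor-mananalytics.com/__track.gif"
    "?utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmp=%2F"
)
print(hit["utmsr"], hit["utmp"])
```

Each decoded hit can then be written out as one log record for the STORE and ANALYZE stages.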

AWS Import / Export
AWS Direct Connect
AWS Elastic Map Reduce


Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
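
Multipart upload splits one large object into independently uploaded parts. The sketch below only plans the byte ranges and performs no upload. The 64 MiB default part size and the name `plan_parts` are assumptions for illustration; the 5 MiB minimum part size and the 10,000-part limit are S3's documented constraints.

```python
import math

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=64 * 1024 * 1024):
    """Return (part_number, start_byte, end_byte_exclusive) for each part."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size if the object would otherwise need too many parts.
    part_size = max(part_size, math.ceil(total_size / MAX_PARTS))
    parts = []
    start = 0
    number = 1
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts


parts = plan_parts(200 * 1024 * 1024)   # a 200 MiB object
print(len(parts), parts[-1])
```

Because the parts are independent, they can be uploaded in parallel and retried individually.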

S3DistCp on EMR: sample job
./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,--dest,s3://myoutputbucket/aggregate,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'

Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2


AMAZON S3
SIMPLE STORAGE SERVICE

AMAZON
DYNAMODB
HIGH-PERFORMANCE, FULLY MANAGED
NoSQL DATABASE SERVICE

DURABLE &
AVAILABLE
CONSISTENT, DISK-ONLY
WRITES (SSD)

LOW LATENCY
AVERAGE READS < 5MS,
WRITES < 10MS

Very general table structure

Ads (not many rows; frequent, near-realtime updates)

ad-id   advertiser   max-price   imps to deliver   imps delivered
1       AAA          100         50000             1200
2       BBB          150         30000             2500

Profiles (so many rows; updated in batch manner)

user-id   attribute1   attribute2   attribute3   attribute4
A         XXX          XXX          XXX          XXX
B         YYY          YYY          YYY          YYY

500,000 WRITES PER SECOND
DURING SUPER BOWL

AMAZON
GLACIER
reliable long term archiving

S3 Lifecycle policies
AMAZON S3

If object older than 5 months: archive to Amazon Glacier
If object older than 1 year: delete object from S3 (/dev/null)
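
A minimal sketch of the two rules above as a lifecycle configuration, written as a plain dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. Treating "5 months" as 150 days, the rule ID, and the helper name `build_lifecycle_configuration` are assumptions; nothing here calls AWS.

```python
def build_lifecycle_configuration(archive_after_days=150, delete_after_days=365):
    """Build a lifecycle configuration: archive to Glacier, then expire."""
    if archive_after_days >= delete_after_days:
        raise ValueError("objects must be archived before they are deleted")
    return {
        "Rules": [
            {
                "ID": "archive-then-delete",   # hypothetical rule name
                "Filter": {"Prefix": ""},      # empty prefix: every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


config = build_lifecycle_configuration()
print(config["Rules"][0]["Transitions"], config["Rules"][0]["Expiration"])
```

The resulting dictionary could be passed as the `LifecycleConfiguration` argument when applying the policy to a bucket.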

AMAZON
REDSHIFT
FULLY MANAGED, PETABYTE-SCALE
DATA WAREHOUSE ON AWS

DESIGN OBJECTIVES:
A petabyte-scale data warehouse service that was…

A Lot Faster
A Lot Cheaper
A Whole Lot Simpler

AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate

HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage

30 MINUTES
DOWN TO

12 SECONDS

AMAZON REDSHIFT LETS YOU
START SMALL AND GROW BIG
Extra Large Node
(HS1.XL)

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL)
Cluster 2-100 Nodes (32 TB – 1.6 PB)

                     Price per hour,       Effective hourly   Effective annual
                     HS1.XL single node    price per TB       price per TB
On-Demand            $0.850                $0.425             $3,723
1 Year Reservation   $0.500                $0.250             $2,190
3 Year Reservation   $0.228                $0.114             $999
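
The effective prices follow from the node price: an HS1.XL node holds 2 TB, so the hourly price per TB is half the node price, and the annual price is that figure times 8,760 hours. A small sketch that reproduces the table, with the helper name `effective_prices` chosen for illustration:

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
NODE_STORAGE_TB = 2         # HS1.XL: 2 TB of compressed storage per node

node_price_per_hour = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(price_per_node_hour):
    """Return (hourly price per TB, annual price per TB rounded to dollars)."""
    per_tb_hour = price_per_node_hour / NODE_STORAGE_TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return round(per_tb_hour, 3), round(per_tb_year)


for plan, price in node_price_per_hour.items():
    hourly, annual = effective_prices(price)
    print(f"{plan}: ${hourly:.3f} per TB-hour, ${annual:,} per TB-year")
```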

DATA WAREHOUSING DONE THE AWS WAY
Easy to provision and scale up massively

No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools

Reporting Warehouse

OLTP
ERP

RDBMS

Redshift

Reporting
and BI

Accelerated operational reporting
Support for short-time use cases
Data compression, index redundancy

On-Premises Integration

OLTP
ERP

RDBMS

Data
Integration
Partners*

Redshift

Reporting
and BI

Live Archive for (Structured) Big Data

OLTP
Web Apps

DynamoDB

Redshift

Reporting
and BI

Direct integration with copy command
High velocity data
Data ages into Redshift
Low cost, high scale option for new apps

Cloud ETL for Big Data

S3

Elastic MapReduce

Redshift

Reporting
and BI

Maintain online SQL access to historical logs
Transformation and enrichment with EMR
Longer history ensures better insight

COPY into Amazon Redshift
create table cf_logs
(
  d date,
  t char(8),
  edge char(4),
  bytes int,
  cip varchar(15),
  verb char(3),
  distro varchar(MAX),
  object varchar(MAX),
  status int,
  referer varchar(MAX),
  agent varchar(MAX),
  qs varchar(MAX)
);

COPY into Amazon Redshift

copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials
'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';

Amazon EC2
Amazon Elastic
MapReduce

AMAZON EC2
ELASTIC COMPUTE CLOUD

EC2 instance families – General purpose

m1.small
Virtual core: 1
Memory: 1.7 GiB
I/O performance: Moderate

EC2 instance families – Compute optimized

cc2.8xlarge
Virtual core: 32 - 2 x Intel Xeon
Memory: 60.5 GiB
I/O performance: 10 Gbit

EC2 instance families – Memory optimized

cr1.8xlarge
Virtual core: 32 - 2 x Intel Xeon
Memory: 240 GiB
SSD Instance store: 240 GB

EC2 instance families – Storage optimized

hi1.4xlarge
Virtual core: 16
Memory: 60.5 GiB
SSD Instance store: 2 x 1TB

hs1.8xlarge
Virtual core: 16
Memory: 117 GiB
Instance store: 24 x 2TB

ON A SINGLE INSTANCE

COMPUTE TIME: 4h
COST: 4h x $2.1 = $8.4

ON MULTIPLE INSTANCES

COMPUTE TIME: 1h
COST: 1h x 4 x $2.1 = $8.4
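
The two slides make the same point from both sides: with hourly billing, cost depends on total instance-hours, not on how they are spread across machines. A minimal sketch of that arithmetic, assuming (as the slides do) that the job parallelizes perfectly; `job_cost` is an illustrative name:

```python
from decimal import Decimal


def job_cost(hours_each, instance_count, price_per_hour):
    """Cost of running `instance_count` instances for `hours_each` hours."""
    return hours_each * instance_count * price_per_hour


PRICE = Decimal("2.10")            # hourly instance price used on the slides

single = job_cost(4, 1, PRICE)     # one instance for 4 hours
parallel = job_cost(1, 4, PRICE)   # four instances for 1 hour
print(single, parallel)
```

Same bill, but the answer arrives in a quarter of the time.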

Instead of
$20+ MILLIONS
in infrastructure


A FRAMEWORK
SPLITS DATA INTO PIECES
LETS PROCESSING OCCUR
GATHERS THE RESULTS
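
The three steps can be sketched on a single machine as a word count. This is a toy illustration of the split / process / gather pattern using a thread pool, not Hadoop itself; all function names are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, pieces):
    """Split the input into `pieces` roughly equal slices."""
    return [records[i::pieces] for i in range(pieces)]


def process(piece):
    """Map step: count the words in one slice, independently of the others."""
    counts = Counter()
    for line in piece:
        counts.update(line.split())
    return counts


def gather(partials):
    """Reduce step: merge the per-slice counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(records, pieces=2):
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partials = list(pool.map(process, split(records, pieces)))
    return gather(partials)


result = run(["big data", "big clusters", "data data"])
print(result)
```

A real framework adds what this sketch leaves out: distributing the slices across machines, moving data between the steps, and rerunning failed tasks.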

AMAZON ELASTIC
MAPREDUCE
HADOOP AS A SERVICE

Corporate Data Center and Elastic Data Center

1. Application data and logs for analysis are pushed to S3
2. An Amazon Elastic MapReduce master node (M) controls the analysis
3. A Hadoop cluster is started by Elastic MapReduce
4. Many hundreds or thousands of nodes can be added
5. The cluster is disposed of when the job completes
6. Results of the analysis are pulled back into your systems

Your Spreadsheet does not
scale …

A real Pig script
(used at Twitter)

Run on a sample dataset on your laptop

Run the same script on a 50-node Hadoop cluster

$ ./elastic-mapreduce --create \
    --name "$USER's Pig JobFlow" \
    --pig-script \
    --args s3://myawsbucket/mypigquery.q \
    --instance-type m1.xlarge --instance-count 50

$ elastic-mapreduce -j j-21IMWIA28LRK1 \
    --add-instance-group task \
    --instance-count 10 \
    --instance-type m1.xlarge

Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2


PUBLIC DATA SETS
http://aws.amazon.com/publicdatasets


AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage compute resources

AWS Import / Export
AWS Direct Connect

Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2

Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2

Amazon EC2
Amazon Elastic
MapReduce

AWS Data Pipeline

FROM DATA TO
ACTIONABLE
INFORMATION

Amazon AWS generates big data core component for Ginger
Software

Shlomi Vaknin
Oct 16, 2013

English writing
assistant

An open platform for
personal assistants

Natural language speech
interface for mobile apps

• An end-to-end Speech-to-Action solution
• Users talk naturally with any mobile
application, Ginger understands and
executes their command
• First open platform for creating personal
assistants

Web Corpus

Domain
Corpus

Language model

User
Corpus

Semantic Model

NLP/NLU Algorithms

Writing Assistant

Proofreader

Rephrase

Personal Coach

DB

PA Platform

Speech
Engine

Query
Understanding

Our platform depends on scanning and indexing
all the language we can find on the internet
• A collection of all the language we found on the internet,
accessible and pre-processed
• Has to contain lots and lots of sentences
• Needs to represent “common written language”
• Accessible both for offline (research) and online (service)
uses

1. Crawling [own cluster, EMR+S3]
• Generated about 50 TB of raw data
• Reduced to about 5 TB of text data
2. Post processing [EMR+S3]
• Tokenize
• Normalize
• Split to n-grams
• Generalize
• Count
• Filter
3. Indexing/Serving [EMR+S3]
• Key/Value – has to be super fast
• Full-text-search
4. Archiving (Glacier) [S3+Glacier]
• Keeping data available for later research while minimizing cost

• Mainly an NLP task
• So we picked up
• It’s a Lisp!
• Integrates very well with EMR, S3, etc.
• n-Gram Counting
• How are you, How are, are you, How, are, you
• Lots of grams are repeated
• Generalize contextually similar tokens
• Fits map-reduce paradigm very well
• Most parts can be trivially parallelized
• One part is sequential by grams
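
A minimal single-machine sketch of the n-gram counting described above, using the slide's own example sentence. Whitespace tokenization and the `max_n=3` limit are assumptions; the generalization of contextually similar tokens is not shown, and Python is used only for illustration, not because the talk used it.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every n-gram of length max_n down to 1, longest first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_ngrams(sentences, max_n=3):
    """Count n-grams over all sentences; repeated grams accumulate."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), max_n))
    return counts


print(list(ngrams("How are you".split())))
counts = count_ngrams(["How are you", "How are they"])
print(counts["How are"], counts["you"])
```

Emitting the grams per sentence is the map side and summing the counts per gram is the reduce side, which is why the task fits the paradigm so well.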

• EMR cluster node types
• Master, Task, Core
• Ratio between Core and Task nodes
• We expected a very large output (100TB)
• m2.4xlarge core output 1690GB
• core nodes
• Estimate number of total map tasks
• Final specs:

Node Type   Instance      Count
MASTER      cc2.8xlarge   1
CORE        m2.4xlarge    200
TASK        m2.2xlarge    500
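
One way to reason about the number of core nodes is to divide the expected output by the storage each core node contributes. The sketch below is an illustration under stated assumptions, not the speakers' actual calculation: it uses decimal units, takes 1690 GB per m2.4xlarge from the slide, and treats the replication factor as a hypothetical parameter that the slides do not mention.

```python
import math


def core_nodes_needed(output_tb, node_storage_gb, replication=1):
    """Smallest number of core nodes whose storage holds the expected output."""
    total_gb = output_tb * 1000 * replication   # decimal units: 1 TB = 1000 GB
    return math.ceil(total_gb / node_storage_gb)


# 100 TB of expected output, 1690 GB of storage per m2.4xlarge core node.
print(core_nodes_needed(100, 1690))
# Same output if every block were stored three times (assumed factor).
print(core_nodes_needed(100, 1690, replication=3))
```

Any real sizing would also leave headroom for intermediate data, which this estimate ignores.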

• Job took about 30 hours to complete
• We generated nearly 100TB of output data
• During map phase, the cluster achieved nearly 100%
utilization
• After initial filtration, 20TB remained


• Stay up to date with AMI releases
• Don't stick to an old AMI just because it previously worked
• Use the Job-Tracker
• Use custom progress notification
• Increase mapred.task.timeout
• Limit number of concurrent map tasks
• Use the minimum number that gets you close to 100% CPU
• Beware of spot nodes
• If you ask for too many you might compete against your own price

• Stash the data for later use, to reduce cost
• Glacier offers very cheap storage
• Important things to know about Glacier:
• Restoring the data could be VERY expensive
• The key to reducing restore costs: restore SLOWLY
• There is no built-in mechanism to restore slowly
• Use a 3rd party application
• Or do it manually

• Glacier is very useful if your use case matches its design
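
Restoring slowly by hand amounts to spreading the retrieval requests over time. A minimal sketch of such a schedule, assuming a list of archives with known sizes and an hourly budget you choose yourself; the archive IDs, the budget, and the name `plan_slow_restore` are hypothetical, and issuing the actual retrieval requests is left out.

```python
def plan_slow_restore(archives, max_gb_per_hour):
    """Group (archive_id, size_gb) pairs into hourly batches under a budget."""
    if max_gb_per_hour <= 0:
        raise ValueError("max_gb_per_hour must be positive")
    batches = []
    current, used = [], 0.0
    for archive_id, size_gb in archives:
        # Start a new hour when this archive would push the batch over budget.
        if current and used + size_gb > max_gb_per_hour:
            batches.append(current)
            current, used = [], 0.0
        current.append(archive_id)
        used += size_gb
    if current:
        batches.append(current)
    return batches


schedule = plan_slow_restore(
    [("a", 40), ("b", 40), ("c", 30), ("d", 90)], max_gb_per_hour=100
)
for hour, batch in enumerate(schedule):
    print(f"hour {hour}: request {batch}")
```

An archive larger than the budget still gets a batch of its own, so the budget is a target, not a hard cap.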


• EMR/S3 provides great power and elasticity, to grow and
shrink as required
• Do your homework before running large jobs!


• Our platform depends on scanning and indexing all the
language we can find on the internet
• To achieve this, Ginger Software makes heavy use of
Amazon EMR
• With Amazon EMR, Ginger Software can scale up vast
amounts of computing power and scale back down
when it is not needed
• This gives Ginger Software the ability to create the world’s
most accurate language enhancement technology
without the need to have expensive hardware lying idle
during quiet periods

Thank You!
We are hiring!
shlomiv@gingersoftware.com
