12. DATA VOLUME
Generated data
Available for analysis
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
26. AWS Import / Export
AWS Direct Connect
Amazon Elastic MapReduce
GENERATE STORE ANALYZE SHARE
27. Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
57. AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG
Extra Large Node (HS1.XL)
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL)
Cluster 2-100 Nodes (32 TB – 1.6 PB)
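The capacity ranges above follow from per-node storage multiplied by node count. A minimal sketch of that arithmetic, assuming 2 TB per HS1.XL node and 16 TB per HS1.8XL node (the latter is inferred from "2 nodes = 32 TB"; the function and dictionary names are illustrative, not part of any AWS API):

```python
# Per-node storage in TB. HS1.XL is stated on the slide (single node = 2 TB);
# HS1.8XL is inferred from the slide (2 nodes = 32 TB, so 16 TB each).
NODE_TB = {"HS1.XL": 2, "HS1.8XL": 16}


def cluster_capacity_tb(node_type, node_count):
    """Return total cluster storage in TB for a node type and node count."""
    return NODE_TB[node_type] * node_count


# Smallest and largest cluster for each node type, as listed on the slide.
for node_type, (smallest, largest) in {"HS1.XL": (2, 32), "HS1.8XL": (2, 100)}.items():
    print(node_type,
          cluster_capacity_tb(node_type, smallest), "TB to",
          cluster_capacity_tb(node_type, largest), "TB")
```

The largest HS1.8XL cluster comes out at 1,600 TB, which is the 1.6 PB quoted on the slide.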
60. HS1.XL single node pricing
Pricing option | Price per hour for HS1.XL single node | Effective hourly price per TB | Effective annual price per TB
On-Demand | $0.850 | $0.425 | $3,723
1 Year Reservation | $0.500 | $0.250 | $2,190
3 Year Reservation | $0.228 | $0.114 | $999
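The effective per-TB figures can be reproduced from the hourly node price. A small sketch, assuming the single HS1.XL node holds 2 TB (as stated on the earlier sizing slide) and a year of 8,760 hours; the function name is illustrative:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years
NODE_TB = 2                # storage of a single HS1.XL node, per the sizing slide


def effective_prices(hourly_node_price):
    """Return (hourly price per TB, annual price per TB rounded to whole dollars)."""
    per_tb_hour = hourly_node_price / NODE_TB
    per_tb_year = round(per_tb_hour * HOURS_PER_YEAR)
    return round(per_tb_hour, 3), per_tb_year


for label, price in [("On-Demand", 0.850),
                     ("1 Year Reservation", 0.500),
                     ("3 Year Reservation", 0.228)]:
    print(label, effective_prices(price))
```

Under these assumptions the three-year reservation works out to about $998.64 per TB per year, which rounds to the $999 shown.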
61. DATA WAREHOUSING DONE THE AWS WAY
Easy to provision and scale up massively
No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools
65. Live Archive for (Structured) Big Data
OLTP
Web Apps
DynamoDB
Redshift
Reporting and BI
Direct integration with copy command
High velocity data
Data ages into Redshift
Low cost, high scale option for new apps
66. Cloud ETL for Big Data
S3
Elastic MapReduce
Redshift
Reporting and BI
Maintain online SQL access to historical logs
Transformation and enrichment with EMR
Longer history ensures better insight
67. COPY into Amazon Redshift
create table cf_logs
(
d date,
t char(8),
edge char(4),
bytes int,
cip varchar(15),
verb char(3),
distro varchar(MAX),
object varchar(MAX),
status int,
referer varchar(MAX),
agent varchar(MAX),
qs varchar(MAX)
);
68. COPY into Amazon Redshift
copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials
'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';
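To make the COPY options concrete, here is a rough local equivalent in Python of what they ask the loader to do: decompress gzip input, skip two header lines, split fields on tabs, and parse the first field as a YYYY-MM-DD date. This is only an illustration of the option semantics, not how Redshift implements COPY; the function name and the sample log line are hypothetical.

```python
import csv
import gzip
import io
from datetime import datetime


def read_cf_log(gz_bytes, ignore_header=2):
    """Parse a gzipped, tab-delimited log the way the COPY options describe."""
    rows = []
    # GZIP: the input is gzip-compressed.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as fh:
        # DELIMITER '\t': fields are separated by tabs.
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, fields in enumerate(reader):
            # IGNOREHEADER 2: skip the first two lines of the file.
            if line_no < ignore_header or not fields:
                continue
            # DATEFORMAT 'YYYY-MM-DD': parse the date column.
            fields[0] = datetime.strptime(fields[0], "%Y-%m-%d").date()
            rows.append(fields)
    return rows


# Hypothetical sample: two header lines followed by one data row.
sample = ("#Version: 1.0\n"
          "#Fields: date time x-edge-location sc-bytes c-ip cs-method\n"
          "2012-05-25\t22:01:30\tAMS1\t4448\t10.0.0.1\tGET\n")
print(read_cf_log(gzip.compress(sample.encode("utf-8"))))
```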
110. AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage compute resources
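The ideas of mapping data dependencies and retrying failed steps can be sketched in a few lines. This is a conceptual illustration only, using Python's standard `graphlib` module (Python 3.9+); it does not use or imitate the AWS Data Pipeline API, and the step names and `run_pipeline` function are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, actions, max_attempts=3):
    """Run each step after its dependencies, retrying a failing step.

    dependencies maps a step name to the set of steps it depends on.
    actions maps a step name to a zero-argument callable.
    """
    completed = []
    for step in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                actions[step]()
                completed.append(step)
                break
            except Exception:
                # Give up only after the final attempt has failed.
                if attempt == max_attempts:
                    raise
    return completed


order = run_pipeline(
    {"load": {"transform"}, "transform": {"extract"}},
    {"extract": lambda: None, "transform": lambda: None, "load": lambda: None},
)
print(order)
```

A managed service adds scheduling and resource provisioning on top of this ordering-and-retry core.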
114. AWS Import / Export
AWS Direct Connect
Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2
Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon Elastic MapReduce
AWS Data Pipeline
120. Natural language speech interface for mobile apps
• An end-to-end Speech-to-Action solution
• Users talk naturally with any mobile application; Ginger understands and executes their commands
• First open platform for creating personal assistants
122. Our platform depends on scanning and indexing all the language we can find on the internet
• A collection of all the language we found on the internet, accessible and pre-processed
• Has to contain lots and lots of sentences
• Needs to represent “common written language”
• Accessible for both offline (research) and online (service) uses
123. 1. Crawling [own cluster, EMR+S3]
• Generated about 50 TB of raw data
• Reduced to about 5 TB of text data
2. Post processing [EMR+S3]
• Tokenize
• Normalize
• Split to n-grams
• Generalize
• Count
• Filter
3. Indexing/Serving [EMR+S3]
• Key/Value – has to be super fast
• Full-text-search
4. Archiving (Glacier) [S3+Glacier]
• Keeping data available for later research while minimizing cost
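The post-processing stage can be pictured as a chain of small functions. The following is a single-machine sketch under several assumptions: normalization means lowercasing and collapsing whitespace, tokens are runs of letters, digits, and apostrophes, and filtering keeps n-grams seen at least `min_count` times. None of these choices are stated in the slides, and the Generalize step is left out because the slides do not define it.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def tokenize(text):
    """Split normalized text into word tokens (an assumed token pattern)."""
    return re.findall(r"[a-z0-9']+", text)


def split_ngrams(tokens, max_n=3):
    """Return every n-gram of length 1..max_n as a space-joined string."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def post_process(sentences, max_n=3, min_count=2):
    """Count n-grams across sentences and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_ngrams(tokenize(normalize(sentence)), max_n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


print(post_process(["How are you", "how ARE   they"]))
```

At the scale described in the slides, each function would run as part of a distributed job instead of a local loop.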
124. • Mainly an NLP task
• So we picked up
• It’s a Lisp!
• Integrates very well with EMR, S3, etc.
• n-Gram Counting
• Example: “How are you” yields How are you, How are, are you, How, are, you
• Lots of grams are repeated
• Generalize contextually similar tokens
• Fits map-reduce paradigm very well
• Most parts can be trivially parallelized
• One part is sequential by grams
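N-gram counting maps naturally onto map and reduce steps. The sketch below imitates the pattern on one machine in plain Python: a mapper emits `(gram, 1)` pairs, sorting stands in for the shuffle phase, and a reducer sums each group in key order. It is not the authors' implementation (the slides do not show their code), and the function names are illustrative.

```python
from itertools import groupby


def map_ngrams(sentence, max_n=3):
    """Mapper: emit (gram, 1) for every n-gram, longest grams first."""
    tokens = sentence.split()
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), 1


def reduce_counts(pairs):
    """Reducer: sum counts per gram; sorting plays the role of the shuffle."""
    for gram, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield gram, sum(count for _, count in group)


print([gram for gram, _ in map_ngrams("How are you")])

pairs = [kv for s in ["How are you", "How are they"] for kv in map_ngrams(s)]
print(dict(reduce_counts(pairs)))
```

Each sentence can be mapped independently, which is the trivially parallel part; the reducer has to see all pairs for a given gram together, which corresponds to the part that is sequential by grams.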
125. • EMR cluster node types
• Master, Task, Core
• Ratio between Core and Task nodes
• We expected a very large output (100TB)
• m2.4xlarge core output 1690GB
•
core nodes
• Estimate number of total map tasks
• Final specs:
Node Type | Instance | Count
MASTER | cc2.8xlarge | 1
CORE | m2.4xlarge | 200
TASK | m2.2xlarge | 500
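One plausible way to estimate the core node count from the expected output is sketched below. It assumes the output is stored on the core nodes' local disks with an HDFS replication factor of 3; the replication factor and the function name are assumptions, and the slides do not show the authors' actual calculation.

```python
import math


def core_nodes_needed(output_gb, node_storage_gb, replication=3):
    """Estimate core nodes needed to hold the replicated output on local storage.

    replication=3 is an assumed HDFS replication factor, not a figure from the slides.
    """
    return math.ceil(output_gb * replication / node_storage_gb)


# 100 TB expected output (taken as 100,000 GB) on nodes with 1,690 GB each.
print(core_nodes_needed(100_000, 1690))
```

Under these assumptions the estimate is 178 core nodes, which is in the same range as the 200 core nodes in the final specs and would leave some headroom.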
126. • Job took about 30 hours to complete
• We generated nearly 100TB of output data
• During the map phase, the cluster achieved nearly 100% utilization
• After initial filtration, 20TB remained
127. • Stay up to date with AMI releases
• Don't stick to an old AMI just because it previously worked
• Use the Job-Tracker
• Use custom progress notification
• Increase mapred.task.timeout
• Limit number of concurrent map tasks
• Use the minimum number that gets you close to 100% CPU
• Beware of spot nodes
• If you ask for too many you might compete against your own price
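As an illustration of custom progress notification, the sketch below uses Hadoop Streaming, where a task can report status by writing `reporter:status:<message>` lines to stderr. Periodic status updates tell the framework the task is alive, which complements raising `mapred.task.timeout` for slow records. Hadoop Streaming is a swapped-in example here, since the slides do not say how the authors reported progress; the function name and the reporting interval are hypothetical.

```python
import sys


def run_mapper(lines, out=sys.stdout, err=sys.stderr, report_every=10000):
    """Emit (token, 1) pairs and report progress every report_every input lines."""
    for i, line in enumerate(lines, start=1):
        for token in line.split():
            out.write(f"{token}\t1\n")
        if i % report_every == 0:
            # Hadoop Streaming reads status updates from stderr lines with this prefix.
            err.write(f"reporter:status:processed {i} lines\n")
            err.flush()
```

In a streaming job the script would call `run_mapper(sys.stdin)`, so that input records arrive on standard input and results leave on standard output.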
128. • Stash the data for later use, to reduce cost
• Glacier offers very cheap storage
• Important things to know about Glacier:
• Restoring the data could be VERY expensive
• The key to reducing restore costs: restore SLOWLY
• There is no built-in mechanism to restore slowly
  • Use a 3rd party application
  • Or do it manually
• Glacier is very useful if your use case matches its design
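Restoring slowly amounts to spreading retrieval requests over time. The sketch below only plans the batches: it groups archives into consecutive hourly batches that stay under a size budget, assuming restore cost grows with the peak hourly retrieval rate. It makes no AWS calls, and the function name, archive names, and budget are hypothetical.

```python
def plan_restore(archive_sizes_gb, max_gb_per_hour):
    """Group (name, size_gb) archives into hourly batches under a size budget.

    An archive larger than the budget is placed in a batch of its own.
    """
    batches, current, used = [], [], 0.0
    for name, size in archive_sizes_gb:
        # Start a new hourly batch when adding this archive would exceed the budget.
        if current and used + size > max_gb_per_hour:
            batches.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical archives and an 80 GB per hour budget.
print(plan_restore([("a", 40), ("b", 40), ("c", 40), ("d", 100)], 80))
```

A scheduler or a manual process would then request one batch per hour instead of everything at once.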
129. • EMR/S3 provide great power and elasticity to grow and shrink as required
• Do your homework before running large jobs!
130. • Our platform depends on scanning and indexing all the language we can find on the internet
• To achieve this, Ginger Software makes heavy use of Amazon EMR
• With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed
• This gives Ginger Software the ability to create the world’s most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods