Business intelligence is often described as a set of methodologies and technologies that transform raw data into meaningful and useful information for business purposes. But this simple description hides many technical challenges that IT teams struggle with. This session shows how to build business intelligence applications on AWS, from raw data import, consumption, and storage through to the production of information. We will also cover best practices for services such as Amazon Redshift and Amazon RDS, and how to use applications such as SAP HANA, Jaspersoft, and others.
10. Getting your Data into AWS
From the corporate data center into Amazon S3:
• Console Upload
• FTP
• AWS Import/Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
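As an illustration of the "S3 API" option, here is a minimal Python sketch. It assumes a boto3 S3 client is passed in; the bucket name, the `raw/` key prefix, and the helper names `s3_key_for` and `upload_files` are hypothetical and not part of the original deck.

```python
import os
from typing import List


def s3_key_for(local_path: str, prefix: str = "raw/") -> str:
    """Build an S3 object key from a local path (hypothetical naming scheme)."""
    return prefix + os.path.basename(local_path)


def upload_files(s3_client, bucket: str, paths: List[str], prefix: str = "raw/") -> List[str]:
    """Upload each local file to the bucket and return the keys that were written.

    s3_client is expected to behave like boto3.client("s3").
    """
    keys = []
    for path in paths:
        key = s3_key_for(path, prefix)
        # boto3 S3 client call: upload_file(Filename, Bucket, Key)
        s3_client.upload_file(path, bucket, key)
        keys.append(key)
    return keys
```

With boto3 installed and credentials configured, a call could look like `upload_files(boto3.client("s3"), "example-bi-bucket", ["events.log"])`.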
11. Write directly to a data source
[Diagram: your application (Amazon EC2) writes directly to Amazon S3, DynamoDB, or any other data store]
12. Queue, pre-process and then write
[Diagram: data is queued in Amazon Simple Queue Service (SQS), pre-processed, and then written to Amazon S3, DynamoDB, or any other data store]
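The queue, pre-process, write pattern can be sketched in a few lines of Python. This is only an illustration: the standard-library `queue.Queue` stands in for Amazon SQS, the `write` callback stands in for the target data store, and the JSON message format with a `user_id` field is an assumption.

```python
import json
import queue
from typing import Callable, Optional


def preprocess(raw: str) -> Optional[dict]:
    """Parse one raw message; return None for records that should be dropped."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or "user_id" not in record:
        return None
    # Normalize the (assumed) user_id field before it reaches the store.
    record["user_id"] = str(record["user_id"]).strip()
    return record


def drain(q: "queue.Queue", write: Callable[[dict], None]) -> int:
    """Pull messages until the queue is empty, pre-process, and write the survivors."""
    written = 0
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            return written
        record = preprocess(raw)
        if record is not None:
            write(record)
            written += 1
```

Decoupling the producer from the writer this way lets malformed records be dropped before they reach the store.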
16. EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
17. How does EMR work?
• Put the data into S3
• Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
• Launch the cluster using the EMR console, CLI, SDK, or APIs
• Get the output from S3
• You can also store everything in HDFS
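The choices made when launching a cluster through the SDK can be sketched as a request dictionary for boto3's `run_job_flow` call. All values here are assumptions for illustration: the release label, the instance type, the default role names, and the helper name `build_emr_request` do not come from the deck.

```python
from typing import Dict, Sequence


def build_emr_request(
    name: str,
    log_uri: str,
    node_count: int,
    instance_type: str = "m5.xlarge",             # assumed instance type
    applications: Sequence[str] = ("Hive", "Pig"),
) -> Dict:
    """Assemble keyword arguments for boto3.client("emr").run_job_flow(**request)."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "Name": name,
        "LogUri": log_uri,                         # S3 location for cluster logs
        "ReleaseLabel": "emr-6.15.0",              # assumed Hadoop distribution release
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            # Terminate when the steps finish, so billing stops with the work.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",      # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and the roles in place, the cluster would be started with `boto3.client("emr").run_job_flow(**build_emr_request("bi-demo", "s3://example-bucket/logs/", 10))`.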
23. When you turn off your cloud resources, you actually stop paying for them
24. SQL based processing
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR (pre-processing framework), and Amazon Redshift (petabyte-scale columnar data warehouse)]
25. What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
• Easy to provision and scale
• No upfront costs, pay as you go
• High performance at a low price
• Open and flexible, with support for popular BI tools
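One common way to move pre-processed files from Amazon S3 into Redshift is the SQL `COPY` command. The sketch below only builds the statement text; the table name, bucket, IAM role ARN, and the gzip-compressed CSV format are assumptions, and `build_copy_statement` is a hypothetical helper.

```python
import re

# Accept only plain or schema-qualified identifiers, since the name is interpolated into SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads gzip-compressed CSV files from S3."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV GZIP;"
    )
```

The resulting string can be executed through any SQL client or driver connected to the cluster.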
29. Your choice of BI Tools
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR (pre-processing framework), and Amazon Redshift]
33. Sharing results and visualizations
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR, and Amazon Redshift, plus a web app server and visualization tools]
34. Sharing results and visualizations
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR, and Amazon Redshift, plus business intelligence tools]
35. Geospatial Visualizations
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR, and Amazon Redshift, plus business intelligence tools, GIS tools on Hadoop, GIS tools, and visualization tools]
36. Rinse and Repeat
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR, and Amazon Redshift, plus visualization tools, business intelligence tools, GIS tools on Hadoop, GIS tools, and AWS Data Pipeline]
37. The complete architecture
[Diagram: sources (Amazon SQS, DynamoDB, any SQL or NoSQL store, log aggregation tools), Amazon S3, Amazon EMR, and Amazon Redshift, plus visualization tools, business intelligence tools, GIS tools on Hadoop, GIS tools, and AWS Data Pipeline]
39. Amazon Kinesis
• Real-time processing
• Massive scale
• Integrated
• Use cases:
  • Real-time log analysis
  • Real-time data analytics
  • Social media monitoring
  • Financial transactions
  • Online machine learning
40. Amazon Kinesis Data Flow
[Diagram: multiple data sources send records through the AWS endpoint into a stream made up of shards (Shard 1, Shard 2, through Shard N) spanning three Availability Zones. Applications consume the stream: App.1 (aggregate and de-duplicate), App.2 (metric extraction), App.3 (sliding window analysis), and App.4 (machine learning). Results are written to S3, DynamoDB, and Redshift.]
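To make the "sliding window analysis" consumer more concrete, here is a minimal Python sketch of a time-based sliding window counter. It is independent of the Kinesis client libraries; the class name and the use of plain numeric timestamps are assumptions for illustration.

```python
from collections import deque


class SlidingWindowCounter:
    """Count the events seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record an event and return the number of events still inside the window.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Evict events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A consumer would call `add` once per record read from a shard and could emit the returned count as a rolling metric.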
44. Data Architecture
[Diagram: user actions on the web servers (Join via Facebook, Add a Skill Page, Invite Friends) produce user action trace events, stored as raw events in Amazon S3. Hive scripts on EMR process the content and write aggregated data to Amazon S3 and on to Amazon Redshift. Data analysts get data through Excel, Tableau, and an internal web tool.]
Hive scripts process content:
• Process log files with regular expressions to parse out the information we need.
• Process cookies into useful, searchable data such as Session, UserId, and API security token.
• Filter surplus information such as internal Varnish logging.
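The three processing steps above are implemented as Hive scripts in the architecture shown. The same idea can be sketched in Python for readers who want to experiment locally. The log line layout, the `varnish:` prefix used for filtering, and the function names are all assumptions, not the customer's actual format.

```python
import re
from typing import Optional

# Assumed layout: ip [time] "METHOD path" status "cookie header"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) "(?P<cookie>[^"]*)"$'
)


def parse_cookie(header: str) -> dict:
    """Split a cookie header such as 'Session=abc; UserId=42' into a dict."""
    pairs = (item.strip().split("=", 1) for item in header.split(";") if "=" in item)
    return {name: value for name, value in pairs}


def parse_line(line: str) -> Optional[dict]:
    """Return the searchable fields of one log line, or None if it should be dropped."""
    if line.startswith("varnish:"):   # assumed marker for internal cache logging
        return None
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    cookies = parse_cookie(match["cookie"])
    return {
        "ip": match["ip"],
        "time": match["time"],
        "path": match["path"],
        "status": int(match["status"]),
        "session": cookies.get("Session"),
        "user_id": cookies.get("UserId"),
    }
```

Lines that do not match the pattern are dropped rather than raising, which mirrors the "filter surplus information" step.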
45. "We found that Amazon Redshift offers the performance we needed while freeing us from the licensing costs of our previous solution."
"With Amazon Redshift and Tableau, anyone in the company can set up any queries they like, from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts have had in different areas. It’s very flexible."
Jon Hoffman, Software Engineer, Foursquare
[Charts (Foursquare): distribution by gender (female, male) and by age; "When do people go to a place?" for Gorilla Coffee, Gray's Papaya, and Amorino]
48. Resources
• Hadoop Technology and Use Cases: http://www.powerof60.com/
• http://aws.amazon.com/de
• Start with the Free Tier: http://aws.amazon.com/de/free/
• US$25 in credits for new German customers: http://aws.amazon.com/de/campaigns/account/
• Twitter: @AWS_Aktuell
• Facebook: http://www.facebook.com/awsaktuell
• Webinars: http://aws.amazon.com/de/about-aws/events/
Editor's Notes
EMR supports multiple instance types including the latest HS1 instance types
EMR now supports High Storage Instances (hs1.8xlarge) in US East. These new instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop and they significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
Adding nodes works well with Hadoop, especially in the cloud, since 10 nodes running for 10 hours cost the same as 100 nodes running for 1 hour.
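The node-hour arithmetic behind that statement can be checked with a short Python sketch. It assumes simple per-node-hour billing and a hypothetical price; real pricing varies by instance type and region.

```python
def cluster_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Total cost under simple per-node-hour billing (an assumed pricing model)."""
    return nodes * hours * price_per_node_hour


PRICE = 0.5  # hypothetical price per node-hour, in US$

small_and_slow = cluster_cost(nodes=10, hours=10, price_per_node_hour=PRICE)
large_and_fast = cluster_cost(nodes=100, hours=1, price_per_node_hour=PRICE)

# Both runs consume 100 node-hours, so the bill is identical.
assert small_and_slow == large_and_fast
```

The larger cluster delivers the result ten times sooner for the same bill, provided the job parallelizes across the extra nodes.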
Vertical scaling on commodity hardware. Perfect for Hadoop.