AWS Government, Education, and Nonprofit Symposium
Washington, DC | June 25-26, 2015

2. What is big data?
When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze, and share them
• Velocity – rate of data flowing in
• Latency – high or low
• Volume – high or low
• Variety – diversity of source data
• Item size – KB or MB
• Request rate – access patterns
• Change rate – how much is the data changing?
• Processing requirements – how much computation?
• Durability – preservation of source data?
• Availability – tolerance for downtime?
• Growth rate – rate of data growth?
• Views – the diversity of consumers?
3. Plethora of tools
Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Cassandra, Amazon CloudSearch
4. Multiple stages
[Diagram: Data → Ingest → Store → Analyze → Visualize → Answers, over time]
Storage decoupled from processing
Simplify the data analytics flow
5. Ingest → Store → Process/Analyze → Visualize
[Diagram: devices, app servers, and web servers send data through ingest tools (Amazon Kinesis, Kafka, AWS Data Pipeline) into stores (Amazon S3, Amazon Glacier, DynamoDB, RDS); Spark Streaming, Storm, EMR, Amazon Redshift, Cassandra, and Amazon CloudSearch process and analyze it; Kinesis-enabled apps and the Amazon Kinesis Connector consume the streams]
6. AWS big data portfolio
Collect / Ingest: Amazon Kinesis, AWS Import/Export, AWS Direct Connect, Amazon SQS
Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS
Process / Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, AWS Data Pipeline
Visualize / Report: partner tools
7. Industries using AWS for data analysis
Mobile/cable telecom; oil and gas; industrial manufacturing; retail/consumer; entertainment and hospitality; life sciences; scientific exploration; financial services; publishing, media, and advertising; online media; social networks; gaming
8. Ingest: the act of collecting and storing data
9. Types of data ingest
• Transactional
– Database reads/writes
• File
– Media files; log files
• Stream
– Click-stream logs (sets of events)
[Diagram: apps, devices, and logging frameworks feed database, cloud storage, and stream storage]
10. Amazon Kinesis
Real-time processing of streaming data
High throughput; elastic; easy to use
Connectors for EMR, S3, Amazon Redshift, and DynamoDB
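As a concrete illustration of the write path, here is a minimal sketch using the AWS SDK for JavaScript; the stream name "clickstream" and the event shape are assumptions, not from the deck.

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis({region: 'us-east-1'});

var event = {user: 'u-123', page: '/home', ts: Date.now()};

kinesis.putRecord({
  StreamName: 'clickstream',      // assumed stream name
  PartitionKey: event.user,       // records sharing a key land on one shard
  Data: JSON.stringify(event)     // the payload is an opaque blob
}, function (err, data) {
  if (err) console.error(err);
  else console.log('shard', data.ShardId, 'seq', data.SequenceNumber);
});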
11. Sending and reading data from Amazon Kinesis streams
Sending (write): HTTP POST, AWS SDK, Log4j appender, Flume, Fluentd
Reading (read): Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
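The Get* APIs on the reading side boil down to two calls: fetch a shard iterator, then poll GetRecords. A minimal sketch follows (the stream and shard ID are assumptions; the Kinesis Client Library wraps this loop and adds checkpointing and load balancing):

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis({region: 'us-east-1'});

kinesis.getShardIterator({
  StreamName: 'clickstream',
  ShardId: 'shardId-000000000000',   // assumed single-shard stream
  ShardIteratorType: 'TRIM_HORIZON'  // start at the oldest available record
}, function (err, data) {
  if (err) return console.error(err);
  kinesis.getRecords({ShardIterator: data.ShardIterator}, function (err, batch) {
    if (err) return console.error(err);
    batch.Records.forEach(function (r) {
      console.log(r.Data.toString('utf8'));
    });
    // A real consumer keeps polling with batch.NextShardIterator.
  });
});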
12. AWS Partners for data ingest, load, and transformation
HParser (Big Data Edition); Flume; Sqoop
14. Cloud database and storage tier anti-pattern
[Diagram: client tier → app/web tier → one monolithic database & storage tier]
15. Cloud database and storage tier — use the right tool for the job!
[Diagram: client tier → app/web tier → a data tier that breaks the database & storage tier into search, Hadoop/HDFS, cache, blob store, SQL, and NoSQL components]
16. Cloud database and storage tier — use the right tool for the job!
Database & storage tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, and HDFS on Amazon EMR
17. Amazon S3
Store anything
Object storage
Scalable
Designed for 99.999999999% durability
18. Aggregate all data in S3, surrounded by a collection of the right tools
[Diagram: Amazon S3 at the center, ringed by Amazon EMR, Amazon Kinesis, Amazon Redshift, Amazon DynamoDB, Amazon RDS, AWS Data Pipeline, Spark Streaming, Cassandra, and Storm]
Amazon S3:
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all
systems
• High bandwidth
• 99.999999999% durability
• Versioning; lifecycle policies
• Amazon Glacier integration
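To make the lifecycle and Glacier bullets concrete, here is a minimal sketch (the bucket name, key layout, and 90-day threshold are assumptions) that lands an object and then ages the prefix into Amazon Glacier:

var AWS = require('aws-sdk');
var fs = require('fs');
var s3 = new AWS.S3({region: 'us-east-1'});

s3.putObject({
  Bucket: 'central-data-lake',           // assumed bucket
  Key: 'logs/2015/06/25/app.log',
  Body: fs.createReadStream('app.log')
}, function (err) {
  if (err) return console.error(err);
  s3.putBucketLifecycleConfiguration({
    Bucket: 'central-data-lake',
    LifecycleConfiguration: {
      Rules: [{
        ID: 'archive-old-logs',
        Prefix: 'logs/',
        Status: 'Enabled',
        Transitions: [{Days: 90, StorageClass: 'GLACIER'}]  // Glacier integration
      }]
    }
  }, function (err) {
    if (err) console.error(err);
  });
});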
19. Amazon DynamoDB
Fully managed NoSQL database service
Built on solid-state drives (SSDs)
Consistent, low-latency performance
Any throughput rate; no storage limits
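A minimal sketch of the key-value pattern with the DocumentClient; the table name "Events" and its hash key "EventId" are assumptions:

var AWS = require('aws-sdk');
var doc = new AWS.DynamoDB.DocumentClient({region: 'us-east-1'});

doc.put({
  TableName: 'Events',                   // assumed table
  Item: {EventId: 'e-001', page: '/home', ts: 1435190400}
}, function (err) {
  if (err) return console.error(err);
  doc.get({TableName: 'Events', Key: {EventId: 'e-001'}}, function (err, data) {
    if (err) console.error(err);
    else console.log(data.Item);         // the stored item
  });
});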
20. DynamoDB: managed high availability and durability
• Scaling without downtime
• Automatic sharding
• Security inspections, patches, upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration designed specifically for DynamoDB
• Performance tuning
21. Amazon RDS
Relational databases
Fully managed; zero admin
MySQL, PostgreSQL, Oracle, SQL Server, and Aurora
22. Process and analyze
23. Processing frameworks
• Batch processing
– Take a large amount (>100 TB) of cold data and ask questions
– Takes minutes or hours to get answers back
– Example: generating hourly, daily, or weekly reports
• Stream processing (real time)
– Take a small amount of hot data and ask questions
– Takes a short amount of time to get your answer back
– Example: 1-minute metrics
24. Processing frameworks
• Batch processing/analytics (MPP and Hadoop)
– Amazon Redshift
– Amazon EMR (Hadoop)
– Spark, Hive/Tez, Pig, Impala, Presto, …
• Stream processing
– Amazon Kinesis client and connector library
– Spark Streaming
– Storm (+ Trident)
25. Amazon Redshift
Columnar data warehouse
ANSI SQL compatible
Massively parallel; petabyte scale
Fully managed; very cost-effective
26. Amazon Redshift architecture
• Leader node
– SQL endpoint (JDBC/ODBC)
– Stores metadata
– Coordinates query execution
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, and restore via Amazon S3; parallel load from Amazon DynamoDB
– Interconnected over 10 GigE (HPC)
• Hardware optimized for data processing
• Two hardware platforms
– DS2 (dense storage): HDD; scales to 1.6 PB
– DC1 (dense compute): SSD; scales to 256 TB
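Because Amazon Redshift speaks the PostgreSQL wire protocol, the parallel load from S3 mentioned above is just a SQL COPY issued through the JDBC/ODBC endpoint. A minimal sketch with the node pg client; the endpoint, table, bucket, and IAM role ARN are assumptions:

var pg = require('pg');
var client = new pg.Client({
  host: 'mycluster.abc123.us-east-1.redshift.amazonaws.com',  // assumed endpoint
  port: 5439, database: 'analytics', user: 'admin', password: '...'
});

client.connect(function (err) {
  if (err) return console.error(err);
  // COPY fans the load out across all compute nodes in parallel.
  client.query(
    "COPY pageviews FROM 's3://central-data-lake/pageviews/' " +
    "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy' " +
    "CSV GZIP;",
    function (err) {
      if (err) console.error(err);
      client.end();
    });
});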
27. Amazon Elastic MapReduce (EMR)
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3, DynamoDB, and Amazon Kinesis
28. How does EMR work?
1. Put the data into S3.
2. Choose a Hadoop distribution, the number and types of nodes, and Hadoop apps like Hive/Pig/HBase.
3. Launch the cluster using the EMR console, CLI, SDK, or APIs.
4. Get the output from S3.
[Diagram: EMR cluster reading input from and writing output to S3]
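Step 3 via the SDK amounts to one RunJobFlow call. A minimal sketch; the instance types and count, names, AMI version, and log bucket are illustrative assumptions:

var AWS = require('aws-sdk');
var emr = new AWS.EMR({region: 'us-east-1'});

emr.runJobFlow({
  Name: 'pageview-analysis',
  AmiVersion: '3.8.0',                  // assumed Hadoop distribution
  LogUri: 's3://central-data-lake/emr-logs/',
  Instances: {
    MasterInstanceType: 'm1.large',
    SlaveInstanceType: 'm1.large',
    InstanceCount: 3,
    KeepJobFlowAliveWhenNoSteps: false  // shut down when the steps finish
  },
  Steps: []                             // Hive/Pig/streaming steps go here
}, function (err, data) {
  if (err) console.error(err);
  else console.log('cluster started:', data.JobFlowId);
});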
29. The Hadoop ecosystem works with EMR
30. Partners – advanced analytics
35. Demo
[Diagram: logs and traffic statistics feed Amazon EMR as an ETL grid and analysis layer, which loads Amazon Redshift as the production DWH, which drives visualization]
37. ICAO and Hadoop
Marco Merens
Chief (Acting), Integrated Analysis
International Civil Aviation Organization
39. Cloudability principles at ICAO
1. What comes from the cloud can stay in the cloud.
2. What comes from in-house:
A. should stay in-house if private, or
B. can be synced with the cloud if public.
40. Data sync
[Diagram: an in-house data store with a basic UI for create/read/update/delete syncs its data to a cloud copy, which serves a fancy UI with reads and metrics]
41. EMR example: blended accident list
[Diagram: Collect → Map → Reduce → Publish; records are blended by key, with a priority per source]
42. Input format
XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<ADREP>
<FilingInformation State="XX">
<ReportingOrganization>Ascend</ReportingOrganization>
<StateFileNumber>S1982045</StateFileNumber>
<Headline>MU-2, Collision with high ground, (near) Kelowna</Headline>
</FilingInformation>
…
</root>

CSV
|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhaven|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-off||
…
43. Collect
Runs on Amazon EC2 (use Linux crontab to schedule) and writes to Amazon S3, producing one XML element per line for EMR.

#!/bin/sh
# Fetch the source XML, strip newlines and carriage returns, then start a
# new line before each <Accident> element: one record per line for EMR.
wget "http://somexml" -qO- | tr -d "\n" | tr -d "\r" |
sed "s#<Accident>#\n<Accident>#g" > tmp
# Copy the prepared file to S3.
aws s3 cp tmp s3://accidents/input/source1
…
44. EMR command line
elastic-mapreduce \
--create \
--bootstrap-action s3://elasticmapreduce/samples/node/install-node-bin-x86.sh \
--instance-type m1.small --instance-count 3 \
--json job.json \
--put /home/ec2-user/key/newtest.pem \
--to /home/hadoop \
--enable-debugging

The --put/--to pair copies the SSH key to the Hadoop master, in case you need to shell in remotely.
45. EMR JSON config file
[{
  "Name": "Make accident map",
  "ActionOnFailure": "CANCEL_AND_WAIT",
  "HadoopJarStep": {
    "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
    "Args": [
      "-input", "s3://accidentstats/input/*", …
    ]
  }
}, {
  "Name": "Store in mongo",
  "ActionOnFailure": "CANCEL_AND_WAIT",
  "HadoopJarStep": {
    "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
    "Args": [
      "s3://edmscripts/uploadtomongo.sh",
      "accidentstats/output",
      "NEWACCIDENTLIST"
    ]
  }
}]

The second step uses script-runner to move the results from S3 to somewhere else (here, into MongoDB).
46. Map
[Flow: sourceX input in Amazon S3 → this mapper on Amazon Elastic MapReduce → mapped output in Amazon S3]

#!/usr/bin/env node
// Hadoop streaming mapper: stdin delivers one XML record per line.
function treatline(line) {
  if (line.indexOf("<ADREP>") !== -1) {   // record from source 1?
    source1(line)
  }
  …
}

function source1(line) {
  var data = xml2json(line)   // parse the one-line record (sketch below)
  data.records.forEach(function (v) {
    var el = {
      Date: v.Date,
      Registration: v.Registration,
      Model: v.Model,
      Source: "Source1",
      Priority: 1
    }
    // Emit key<TAB>value; Hadoop sorts on the key before the reduce phase.
    var key = el.Date + "#" + el.Registration
    process.stdout.write(key + "\t" + JSON.stringify(el) + "\n")
  })
}
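The xml2json helper is not shown in the deck; a hypothetical stand-in that pulls a few elements out of a one-line record with regexes might look like this (the element names are assumptions; a real implementation would use an XML parser such as the xml2js package):

function xml2json(line) {
  function grab(tag) {   // first occurrence of <tag>…</tag> in the record
    var m = line.match(new RegExp('<' + tag + '>([^<]*)</' + tag + '>'));
    return m ? m[1] : undefined;
  }
  return {records: [{
    Date: grab('LocalDate'),        // assumed element names
    Registration: grab('Registration'),
    Model: grab('Model')
  }]};
}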
47. Reduce
[Flow: mapped and sorted records → this reducer on Amazon Elastic MapReduce → reduced output in Amazon S3]

#!/usr/bin/env node
// Hadoop streaming reducer: input arrives sorted by key, so all records
// sharing a key (Date#Registration) are adjacent.
var oldkey, key, array = []

function treatline(line) {
  key = line.split("\t")[0]
  var data = JSON.parse(line.split("\t")[1])
  if (key == oldkey || !oldkey) {
    array.push(data)
  } else {
    treat(array)      // key changed: blend the finished group
    array = [data]    // start the next group with the current record
  }
  oldkey = key
  …
}

function treat(array) {
  var el = {}
  array = array.sort(prioritysort)   // highest-priority source wins
  array.forEach(function (v) {
    el = updateresult(el, v)
  })
  process.stdout.write(JSON.stringify(el) + "\n")
}
48. Real-time statistics
[Screenshot: real-time statistics produced with Amazon Elastic MapReduce]
49. Thank You.
This presentation will be posted to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices
Editor's notes
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems.
But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages, and we dig deeper into the AWS services for each of these stages.
AWS Big Data Portfolio
Customers can, of course, use compute, storage, and networking building blocks plus open-source tools. But managed services take care of the undifferentiated heavy lifting of setting up, patching, and scaling, allowing you to focus on the mission.
For visualization we rely on our partners: they are really good at it, and it's what our customers are using.
Amazon Kinesis is a fully managed service for real-time data processing over large, distributed data streams. Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, sensor data, IoT, location-tracking events.
With Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis Applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more. You can also emit data from Amazon Kinesis to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic Map Reduce (Amazon EMR), and AWS Lambda.
A shard is the base throughput unit of an Amazon Kinesis stream. One shard provides a capacity of 1MB/sec data input and 2MB/sec data output. One shard can support up to 1000 PUT records per second. You will specify the number of shards needed when you create a stream.
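As a worked sizing example (the workload numbers are assumptions): 10,000 records/sec of 512 bytes each is about 5 MB/sec of input, which needs 5 shards for bandwidth but 10 shards for the 1,000-PUTs-per-second-per-shard limit, so the stream needs 10 shards:

var AWS = require('aws-sdk');
var kinesis = new AWS.Kinesis({region: 'us-east-1'});
// max(ceil(5 MB/s / 1 MB/s), ceil(10,000 / 1,000)) = 10 shards
kinesis.createStream({StreamName: 'clickstream', ShardCount: 10},
  function (err) { if (err) console.error(err); });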
Amazon Kinesis Client Library (KCL) is a pre-built library that helps you easily build Amazon Kinesis Applications for reading and processing data from an Amazon Kinesis stream.
It handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance.
Amazon Kinesis Connector Library is a pre-built library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools.
The current version of this library provides connectors to Amazon DynamoDB, Amazon Redshift, Amazon S3, and Elasticsearch. The library also includes sample connectors of each type, plus Apache Ant build files for running the samples.
Choosing a store: consider the data structure, the query complexity, and the data characteristics (hot, warm, cold). Think of a 2 x 2 matrix: how structured the data is versus the level of query (from none to complex).
Amazon S3 is an object storage service that is highly scalable, reliable, low-latency, and low-cost.
Designed for 11 9’s of durability.
Amazon S3 stores data as objects within resources called "buckets." You can store as many objects as you want within a bucket, and write, read, and delete objects in your bucket. Objects can be up to 5 terabytes in size.
You can control access to the bucket (who can create, delete, and retrieve objects in the bucket for example), view access logs for the bucket and its objects, and choose the AWS region where a bucket is stored to optimize for latency, minimize costs, or address regulatory requirements.
Integrates well with other AWS services and with a lot of tools from ISVs and the open source community.
Acts as the data lake in a large majority of big data solutions.
Features include versioning and lifecycle management, plus Amazon Glacier integration for archiving data.
It protects your data by offering encryption at rest and in flight, and provides security and access management features for fine-grained control over who can access the data.
Several other features include event notifications, which can be delivered using Amazon SQS or Amazon SNS, or sent directly to AWS Lambda, enabling you to trigger workflows, alerts, or other processing.
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Amazon DynamoDB enables customers to offload the administrative burdens of operating and scaling distributed databases to AWS, so they don’t have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.
DynamoDB supports key-value and document data structures.
A key-value store is a database service that provides support for storing, querying and updating collections of objects that are identified using a key and values that contain the actual content being stored.
A document store provides support for storing, querying and updating items in a document format such as JSON, XML, and HTML.
Table: A table is a collection of data items – just like a table in a relational database is a collection of rows. Each table can have an infinite number of data items. Amazon DynamoDB is schema-less, in that the data items in a table need not have the same attributes or even the same number of attributes. Each table must have a primary key.
Item: An item is composed of a primary or composite key and a flexible number of attributes. There is no explicit limit on the number of attributes associated with an individual item, but the aggregate size of an item, including all the attribute names and attribute values, cannot exceed 400 KB.
Attribute: Each attribute associated with a data item is composed of an attribute name (e.g. “Color”) and a value or set of values (e.g. “Red” or “Red, Yellow, Green”). Individual attributes have no explicit size limit, but the total value of an item (including all attribute names and values) cannot exceed 400KB.
Amazon DynamoDB supports GET/PUT operations using a user-defined primary key. The primary key is the only required attribute for items in a table and it uniquely identifies each item. You specify the primary key when you create a table. In addition to that DynamoDB provides flexible querying by letting query on non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes.
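A minimal sketch of that flexible querying through a Global Secondary Index; the table "Events" and a GSI "page-index" hashed on the "page" attribute are assumptions:

var AWS = require('aws-sdk');
var doc = new AWS.DynamoDB.DocumentClient({region: 'us-east-1'});

doc.query({
  TableName: 'Events',
  IndexName: 'page-index',                   // assumed GSI on "page"
  KeyConditionExpression: '#p = :page',
  ExpressionAttributeNames: {'#p': 'page'},  // alias the attribute defensively
  ExpressionAttributeValues: {':page': '/home'}
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Items);              // all events for that page
});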
Transition Statement – RDBMS is still a viable and important component in Big Data Architecture
Amazon Relational Database Service (Amazon RDS) is a managed service that makes it easy to set up, operate, and scale a relational database in the cloud.
Amazon RDS gives you access to the capabilities of a familiar MySQL, Oracle, SQL Server, or PostgreSQL database. This means that the code, applications, and tools you already use today with your existing databases should work seamlessly with Amazon RDS. Amazon RDS automatically patches the database software and backs up your database, storing the backups for a user-defined retention period.
For optional Multi-AZ deployments, Amazon RDS also manages synchronous data replication across Availability Zones and automatic failover.
Amazon Aurora is a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
Processing frameworks generally come in two major types: batch and streaming.
Examples, by query speed:
Redshift – extremely fast SQL queries
Spark, Impala – extremely fast to fast HiveQL
Hive, Tez – moderately fast to slow HiveQL
Also weigh: data volume, UDFs, manageability.
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
https://amplab.cs.berkeley.edu/benchmark/
Stream-processing considerations: which connectors to add, support for directed acyclic graphs (DAGs), and exactly-once processing (how do you get both?).
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.
Traditional data warehouses require significant time and resource to administer, especially for large datasets. And they are costly.
Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly.
Amazon Redshift uses a variety of innovations to achieve up to ten times higher performance than traditional databases for data warehousing and analytics workloads:
Columnar Data Storage: Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/Os, greatly improving query performance.
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.
Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.
Easy to scale
Amazon Redshift automatically patches and backs up your data warehouse, storing the backups for a user-defined retention period.
You can create a cluster using either Dense Storage (DS) nodes or Dense Compute nodes (DC). Dense Storage nodes allow you to create very large data warehouses using hard disk drives (HDDs) for a very low price point. Dense Compute nodes allow you to create very high performance data warehouses using fast CPUs, large amounts of RAM and solid-state disks (SSDs).
An Amazon Redshift data warehouse cluster can contain from 1-128 compute nodes, depending on the node type
Amazon EMR enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
Amazon EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a programming model named “MapReduce,” where the data is divided into many small fragments of work, each of which may be executed on any node in the cluster.
EMR uses EC2 and S3 to set up hadoop clusters.
Regular Hadoop/HDFS; support for popular add-ons; fully managed and easy to use; on-demand and spot pricing; integrated with other AWS services (S3, DynamoDB, Kinesis).
Bootstrap actions give the most flexibility at the layer above core Hadoop/HDFS.
Popular pattern:
1. Customer puts data into S3.
2. Make some decisions about what to run (type and number of nodes, other technologies to install).
3. Use the CLI, SDK, console, or API to launch.
4. Output is sent to S3.
Easy to resize the cluster; use spot instances to save money. Time to resize is a combination of EC2/AMI boot time plus the bootstrap actions. Task nodes are additional spot nodes added to a running cluster. Use S3DistCp to load/unload from HDFS. Shut down the cluster to stop being charged (except …
Core Hadoop is:
MapReduce – the computational model
HDFS – the Hadoop Distributed File System
Additional tools have entered the ecosystem: tools to help get data into Hadoop, tools to connect to relational systems, monitoring, and machine learning. This slide is a small slice.
Scientific, algorithmic, predictive, etc
Real-time/stream processing: Kinesis and DynamoDB (first two boxes in the first row).
Batch processing: last two boxes, HDFS and S3.
This is a summary of all six design patterns together: all of the available solutions in the context of the temperature of the data and the data-processing latency requirements.
Hive – 1 year's worth of click-stream data
Spark – 1 year of click-stream data; what people are frequently buying together
Redshift – reporting, enterprise reporting tool, SQL-heavy
Impala – same as Redshift
Presto – same league as Impala; interactive SQL analytics; have a Hadoop installed base…
NoSQL – analytics on NoSQL
Contains several months of hourly pageview statistics for all articles in Wikipedia. The data is copied into EMR from S3, and EMR does the transformation using Hive. The raw data doesn't carry date and time, so Hive extracts that information from the file name.