Ankara Cloud Meetup – July 2016
Serkan ÖZAL
BIG DATA ON AWS
AGENDA
• What is Big Data?
• Big Data Concepts
• Big Data on AWS
• Data Storage on AWS
• Data Analytics & Querying on AWS
• Data Processing on AWS
• Data Flow on AWS
• Demo
2
3
WHAT IS BIG DATA???
4
5
4V OF BIG DATA
6
BIG DATA CONCEPTS
7
BIG DATA CONCEPTS
• Scalability
• Cloud Computing
• Data Storage
• Data Analytics & Querying
• Data Processing
• Data Flow
8
BIG DATA ON AWS
9
AWS BIG DATA PORTFOLIO
10
SCALABILITY ON AWS
• Types of scalability:
• Vertical Scalability: More/Less CPU, RAM, ...
• Horizontal Scalability: More/Less machines
• Auto scalable by
• CPU
• Memory
• Network
• Disk
• …
11
CLOUD COMPUTING ON AWS
• Rich service catalog
• «Pay as you go» payment model
• Cloud Computing models:
 IaaS (Infrastructure as a Service)
o EC2, S3
 PaaS (Platform as a Service)
o Elastic Beanstalk
 SaaS (Software as a Service)
o DynamoDB, ElasticSearch
12
DATA STORAGE ON AWS
13
DATA STORAGE ON AWS
• There are many different data storage services
for different purposes
• Key features for storage:
 Durability
 Availability
 Reliability
 Security
 Scalability
 Access Performance
 Cost
• Data storage services on AWS:
 S3
 Glacier
14
S3
• Secure, durable and highly-available storage service
• Bucket: Containers for objects
 Globally unique regardless of the AWS region
• Object: Stored entries
 Uniquely identified within a bucket by a name and a version ID
• Supports versioning
• WORM (Write once, read many)
• Consistency models
 «read-after-write» consistency for puts of new objects
 «eventual consistency» for puts/deletes of existing objects
 Updates to a single object are atomic.
15
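The bucket/object model above can be sketched with a small local stand-in. This is purely an illustration of the semantics (versioning, WORM-style writes, latest-version reads), not the AWS API; the `VersionedBucket` class and its methods are invented for this example.

```python
# Minimal local sketch of S3-style bucket/object semantics with versioning.
# Illustrative only -- not the AWS SDK; all names here are made up.

import itertools

class VersionedBucket:
    """A bucket maps object names to an append-only list of versions."""

    def __init__(self, name):
        self.name = name            # on AWS, bucket names are globally unique
        self._objects = {}          # key -> list of (version_id, data)
        self._version_ids = itertools.count(1)

    def put(self, key, data):
        """WORM-style write: each put creates a new immutable version."""
        version_id = next(self._version_ids)
        self._objects.setdefault(key, []).append((version_id, data))
        return version_id

    def get(self, key, version_id=None):
        """By default return the latest version; older versions stay readable."""
        versions = self._objects[key]
        if version_id is None:
            return versions[-1][1]
        return dict(versions)[version_id]

bucket = VersionedBucket("ankara-meetup-demo")
v1 = bucket.put("tweets/1.json", '{"text": "hello"}')
v2 = bucket.put("tweets/1.json", '{"text": "hello, world"}')
```

A plain `get` returns the newest version, while passing the older version ID still retrieves the original data, mirroring how versioning keeps overwritten objects recoverable.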
GLACIER
• Storage service optimized for infrequently used data
• Archives are stored in containers named «vaults»
• Uploading is a synchronous operation with replicas
• Downloading is an asynchronous operation through a retrieval job
• Immutable
• Encrypted by default
• «S3» data can be archived into «Glacier» by life-cycle rules
• 4x cheaper than «S3»
16
DATA
ANALYTICS & QUERYING
ON AWS
17
DATA ANALYTICS & QUERYING ON AWS
• There are different data analytics & querying services for
different purposes.
 NoSQL
 SQL
 Text search
 Analytics / Visualising
• Data analytics & querying services on AWS:
 DynamoDB
 Redshift
 RDS
 Elasticsearch (+ Kibana)
 CloudSearch
 QuickSight
18
DYNAMODB
• Key-value or document storage based NoSQL database
• Consistency model
• Eventual consistency by default
• Strong consistency as optional
• Supports
 Secondary indexes
 Batch operations
 Conditional writes
 Atomic counters
 «DynamoDB Streaming» for tracking changes in near real time
• Integrated with
 EMR
 Elasticsearch
 Redshift
 Lambda
19
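The conditional-write and atomic-counter features listed above can be illustrated with a dict-backed stand-in. This sketches the semantics only; `MiniTable` and `ConditionalWriteError` are invented names, not the DynamoDB API.

```python
# Local sketch of DynamoDB-style conditional writes and atomic counters.
# Illustrative only -- a dict stands in for the table.

class ConditionalWriteError(Exception):
    pass

class MiniTable:
    def __init__(self):
        self._items = {}  # key -> attribute dict

    def put_item(self, key, item, expected=None):
        """Write an item; if `expected` is given, fail unless the current
        attributes match (the conditional-write contract)."""
        current = self._items.get(key, {})
        if expected is not None:
            for attr, value in expected.items():
                if current.get(attr) != value:
                    raise ConditionalWriteError(attr)
        self._items[key] = item

    def update_counter(self, key, attr, delta):
        """Atomic counter: increment happens table-side in one operation,
        so concurrent writers never lose updates."""
        item = self._items.setdefault(key, {})
        item[attr] = item.get(attr, 0) + delta
        return item[attr]

table = MiniTable()
table.put_item("tweet#1", {"status": "new"})
table.update_counter("tweet#1", "retweets", 1)
```

A conditional put against a stale expectation fails instead of silently overwriting, which is what makes optimistic concurrency patterns possible on top of DynamoDB.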
REDSHIFT
• SQL data warehouse solution
• Fault tolerant by
 Replicating to other nodes in the cluster
 Backing up to «S3» continuously and automatically
• Scalable
 Vertically by switching to a new cluster
 Horizontally by adding new nodes to the cluster
• Integrated with
 S3
 EMR
 DynamoDB
 Kinesis
 Data Pipeline
20
RDS
• Makes it easier to set up, operate and scale a relational
database in the cloud
• Manages backups, software patching, automatic failure
detection and recovery
• Supports
 MySQL
 PostgreSQL
 Oracle
 Microsoft SQL Server
 Aurora (Amazon’s MySQL-compatible RDBMS with up to 5x performance)
21
ELASTICSEARCH
• Makes it easy to deploy, operate and scale «Elasticsearch» in the
AWS cloud
• Supports
 Taking snapshots for backup
 Restoring from backups
 Replicating domains across Availability Zones
• Integrated with
 «Kibana» for data visualization
 «Logstash» to process logs and load them into «ES»
 «S3», «Kinesis» and «DynamoDB» for loading data into «ES»
22
CLOUDSEARCH
• AWS’s managed and high-performance search solution
• Reliable by replicating the search domain across multiple AZs
• Automatic monitoring and recovery
• Auto scalable
• Supports popular search features (in 34 languages) such as
 Free-text search
 Faceted search
 Highlighting
 Autocomplete suggestion
 Geospatial search
 …
23
QUICKSIGHT
• AWS’s new very fast, cloud-powered business intelligence (BI)
service
• Uses a new, Super-fast, Parallel, In-memory Calculation Engine
(«SPICE») to perform advanced calculations and render
visualizations rapidly
• Supports creating analysis stories and sharing them
• Claimed to be 1/10th the cost of traditional BI solutions
• Native access on major mobile platforms
• Integrated/supported data sources:
 EMR
 Kinesis
 S3
 DynamoDB
 Redshift
 RDS
 3rd party (e.g. Salesforce)
 Local file upload
 …
24
DATA PROCESSING ON
AWS
25
DATA PROCESSING ON AWS
• Allows processing of huge amounts of data in a highly scalable,
distributed way
• Supports both
 Batch
 Stream
data processing concepts
• Data processing services on AWS:
 EMR (Elastic Map-Reduce)
 Kinesis
 Lambda
 Machine Learning
26
EMR (ELASTIC MAPREDUCE)
• Big data service/infrastructure to process large amounts of
data efficiently in a highly scalable manner
• Supported Hadoop distributions:
 Amazon
 MapR
• Low cost by «on-demand», «spot» or «reserved» instances
• Reliable by retrying failed tasks and automatically replacing
poorly performing or failed instances
• Elastic by resizing cluster on the fly
• Integrated with
 S3
 DynamoDB
 Data Pipeline
 Redshift
 RDS
 Glacier
 VPC
 …
27
EMR – CLUSTER COMPONENTS
• Master Node
 Tracks, monitors and manages the cluster
 Runs the «YARN Resource Manager» and
the «HDFS Name Node Service»
• Core Node [Slave Node]
 Runs tasks and stores data in the HDFS
 Runs the «YARN Node Manager Service»
and «HDFS Data Node Daemon» service
• Task Node [Slave Node]
 Only runs the tasks
 Runs the «YARN Node Manager Service»
28
EMR – BEST PRACTICES FOR INSTANCE
GROUPS
29
Project                  Master Instance Group  Core Instance Group  Task Instance Group
Long-running clusters    On-Demand              On-Demand            Spot
Cost-driven workloads    Spot                   Spot                 Spot
Data-critical workloads  On-Demand              On-Demand            Spot
Application testing      Spot                   Spot                 Spot
EMR – CLUSTER LIFECYCLE
30
EMR - STORAGE
• The storage layer includes the different file systems for
different purposes as follows:
 HDFS (Hadoop Distributed File System)
 EMRFS (Elastic MapReduce File System)
 Local Storage
31
EMR - HDFS
• Distributed, scalable file system for Hadoop
• Stores multiple copies to prevent data loss on failure
• It is ephemeral storage because it is reclaimed when the cluster
terminates
• Used by the master and core nodes
• Useful for intermediate results during Map-Reduce processing
or for workloads which have significant random I/O
• Prefix: «hdfs://»
32
EMR - EMRFS
• An implementation of HDFS which allows clusters to store data
on «AWS S3»
• Eventually consistent by default on top of «AWS S3»
• Consistent view can be enabled with the following supports:
 Read-after-write consistency
 Delete consistency
 List consistency
• Client/Server side encryption
• Most often, «EMRFS» (S3) is used to store input and output
data and intermediate results are stored in «HDFS»
• Prefix: «s3://»
33
EMR – LOCAL STORAGE
• Refers to an ephemeral volume directly attached to an EC2
instance
• Ideal for storing temporary data that is continually changing,
such as
 Buffers
 Caches
 Scratch data
 And other temporary content
• Prefix: «file://»
34
EMR – DATA PROCESSING FRAMEWORKS
35
• Hadoop
• Spark
• Hive
• Pig
• HBase
• HCatalog
• Zookeeper
• Presto
• Impala
• Sqoop
• Oozie
• Mahout
• Phoenix
• Tez
• Ganglia
• Hue
• Zeppelin
• …
KINESIS
• Collects and processes large streams of data records in real time
• Reliable by synchronously replicating streaming data across
facilities in an AWS Region
• Elastic by adding or removing shards based on the volume of input data
• Has
 S3
 DynamoDB
 ElasticSearch
connectors
• Also integrated with «Storm» and «Spark»
36
KINESIS – HIGH LEVEL ARCHITECTURE
37
KINESIS – KEY CONCEPTS[1]
• Streams
 Ordered sequence of data records
 All data is stored for 24 hours (can be increased to 7 days)
• Shards
 A uniquely identified group of data records in a stream
 Kinesis streams are scaled by adding or removing shards
 Provides 1MB/sec data input and 2MB/sec data output capacity
 Supports up to 1000 TPS (put records per second)
• Producers
 Put records into streams
• Consumers
 Get records from streams and process them
38
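The per-shard limits above (1 MB/s and 1000 records/s in, 2 MB/s out) give a simple capacity-planning rule: provision enough shards to cover the tightest of the three limits. The function below is a rough sketch of that rule, and the workload figures fed into it are made-up examples.

```python
# Sizing a Kinesis stream from the per-shard limits quoted above.
# Rough capacity-planning sketch; the workload numbers are illustrative.

import math

def required_shards(in_mb_per_sec, records_per_sec, out_mb_per_sec):
    """A stream needs enough shards to satisfy all three per-shard limits,
    so take the maximum of the three requirements."""
    return max(
        math.ceil(in_mb_per_sec / 1.0),     # 1 MB/s write per shard
        math.ceil(records_per_sec / 1000),  # 1000 put records/s per shard
        math.ceil(out_mb_per_sec / 2.0),    # 2 MB/s read per shard
    )

# Example workload: 5 MB/s in, 3500 records/s, 12 MB/s of consumer reads
shards = required_shards(5, 3500, 12)
```

Here the read side dominates (12 MB/s needs 6 shards at 2 MB/s each), so the stream would be provisioned with 6 shards even though the write limits alone would need only 5.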
KINESIS – KEY CONCEPTS[2]
• Partition Key
 Used to group data by shard within a stream
 Used to determine which shard a given data record belongs to
• Sequence Number
 Each record has a unique sequence number in the owned shard
 Increases over time
 Specific to a shard within a stream, not across all shards
• Record
 Unit of data stored
 Composed of <partition-key, sequence-number, data-blob>
 Data size can be up to 1 MB
 Immutable
39
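The partition-key mechanics above can be made concrete: Kinesis hashes each partition key with MD5 into a 128-bit integer, and each shard owns a contiguous range of that hash space. The sketch below splits the space evenly across shards for illustration; real streams can have uneven ranges after splits and merges.

```python
# Sketch of partition-key -> shard mapping: MD5 the key into a 128-bit
# hash and find the shard whose range contains it. Even-sized ranges
# are an assumption made for this example.

import hashlib

HASH_SPACE = 2 ** 128

def shard_for(partition_key, shard_count):
    """Return the index of the shard whose hash range contains the key."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = HASH_SPACE // shard_count
    return min(hash_key // range_size, shard_count - 1)

# The same key always hashes to the same shard, which is what keeps
# records for one key ordered by sequence number.
shard_a = shard_for("user-42", 4)
shard_b = shard_for("user-42", 4)
```

This also shows why a skewed partition key (e.g. one hot user ID) can overload a single shard while the others sit idle.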
KINESIS – GETTING DATA IN
• Producers use «PUT» call to store data in
a stream
• A partition key is used to distribute the
«PUT»s across shards
• A unique sequence # is returned to the
producer for each event
• Data can be ingested at 1MB/second or
1000 Transactions/second per shard
40
KINESIS – GETTING DATA OUT
• Kinesis Client Library (KCL) simplifies
reading from the stream by abstracting
you from individual shards
• Automatically starts a worker thread for
each shard
• Increases and decreases thread count as
number of shards changes
• Uses checkpoints to keep track of a
thread’s location in the stream
• Restarts threads & workers if they fail
41
LAMBDA
• Compute service that runs your uploaded code on your behalf
using AWS infrastructure
• Stateless
• Auto-scalable
• Pay per use
• Use cases
 As an event-driven compute service triggered by actions
on «S3», «Kinesis», «DynamoDB», «SNS», «CloudWatch»
 As a compute service to run your code in response to HTTP
requests using «AWS API Gateway»
• Supports «Java», «Python» and «Node.js» languages
42
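The S3-trigger use case above boils down to writing a handler function that Lambda invokes with the event payload. This is a minimal sketch of a Python handler for an S3 "object created" notification; the trimmed-down sample event below only keeps the fields this handler reads, and the bucket/key names are invented.

```python
# Minimal Python Lambda handler for an S3 event. A handler is just a
# function taking (event, context), so it can be exercised locally.

def handler(event, context):
    """Extract (bucket, key) pairs from an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return {"processed": objects}

# Trimmed-down sample of the event shape S3 sends to Lambda:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "tweet-bucket"},
                "object": {"key": "tweets/1.json"}}}
    ]
}
result = handler(sample_event, None)
```

Because the service is stateless, the handler must not rely on anything surviving between invocations; any state belongs in S3, DynamoDB, or another store.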
MACHINE LEARNING
• Machine learning service for building ML models and
generating predictions
• Provides implementations of common ML data transformations
• Supports industry-standard ML algorithms such as
 binary classification
 multi-class classification
 regression
• Supports batch and real-time predictions
• Integrated with
 S3
 RDS
 Redshift
43
DATA FLOW ON AWS
44
DATA FLOW ON AWS
• Allows transferring data between data
storage/processing points
• Data flow services on AWS:
 Firehose
 Data Pipeline
 DMS (Database Migration Service)
 Snowball
45
FIREHOSE
• Makes it easy to capture and load massive volumes of
streaming data into AWS
• Supports endpoints
 S3
 Redshift
 Elasticsearch
• Supports buffering
 By size: Ranges from 1 MB to 128 MB. Default 5 MB
 By interval: Ranges from 60 seconds to 900 seconds.
Default 5 minutes
• Supports encryption and compression
46
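The buffering rule above is "flush on size or interval, whichever comes first." A one-function sketch makes the logic explicit; the defaults mirror the Firehose defaults quoted above (5 MB / 5 minutes), and the function itself is an illustration, not the Firehose API.

```python
# Sketch of Firehose's size-or-interval buffering: the delivery buffer
# flushes as soon as either threshold is reached.

def should_flush(buffered_mb, seconds_since_flush,
                 size_limit_mb=5, interval_sec=300):
    """True when the buffer has hit the size limit (1-128 MB range)
    or the interval limit (60-900 s range), whichever comes first."""
    return buffered_mb >= size_limit_mb or seconds_since_flush >= interval_sec
```

The trade-off is latency versus per-object overhead: a small buffer delivers data sooner but produces many small S3 objects, while a large buffer produces fewer, bigger objects at the cost of delay.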
DATA PIPELINE
• Automates the movement and transformation of data
• Composed of workflows of tasks
• The flow can be scheduled
• Has visual designer with drag-and-drop support
• The definition can be exported/imported
• Supports custom activities and preconditions
• Retries the flow on failure and sends notifications
• Integrated with
 EMR
 S3
 DynamoDB
 Redshift
 RDS
47
DATABASE MIGRATION SERVICE
• Helps to migrate databases into AWS easily and securely
• Supports
 Homogeneous migrations (e.g. Oracle <=> Oracle)
 Heterogeneous migrations (e.g. Oracle <=> PostgreSQL)
• Also used for:
• Database consolidation
• Continuous data replication
• Keeps your applications running during migration:
 Start a replication instance
 Connect source and target databases
 Let «DMS» load data and keep them in sync
 Switch applications over to the target at your convenience
48
SNOWBALL
• Import/Export service that accelerates transferring large
amounts of data into and out of AWS using physical storage
appliances, bypassing the Internet
• Data is encrypted and transported securely
• Data is transferred via 10G connection
• Has 50TB or 80TB storage options
• Steps:
 Job Create
 Processing
 In transit to you
 Delivered to you
 In transit to AWS
 At AWS
 Importing
 Completed
49
DEMO TIME
50
Source code
and
Instructions
are available at
github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo
51
STEPS OF DEMO
• Storing
• Searching
• Batch Processing
• Stream Processing
• Analyzing
52
STORING
53
[STORING]
CREATING DELIVERY STREAM
• Create an «AWS S3» bucket to store tweets
• Create an «AWS Firehose» stream to push tweets into for storage
• Attach created bucket as endpoint to the stream
Then;
• Captured tweet data can be pushed into this stream to be stored
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-storing-firehose
54
[STORING]
CRAWLING TWEETS
• Configure stream name to be used by crawler application
• Build «bigdata-demo-storing-crawler» application
• Deploy into «AWS Elastic Beanstalk»
Then;
• Application listens to tweets via «Twitter Streaming API»
• Application pushes tweet data into «AWS Firehose» stream
• «AWS Firehose» stream stores tweets on «AWS S3» bucket
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-storing-crawler
55
SEARCHING
56
[SEARCHING]
CREATING SEARCH DOMAIN
• Create an «AWS Elasticsearch» domain
• Wait until it becomes active and ready to use
Then
• Tweet data can also be indexed to be searched
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-searching-
elasticsearch
57
[SEARCHING]
INDEXING INTO SEARCH DOMAIN
• Create an «AWS Lambda» function with the given code
• Configure created «AWS S3» bucket as trigger
• Wait until it becomes active and ready to use
Then;
• Stored tweet data will also be indexed to be searched
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-searching-lambda
58
BATCH PROCESSING
59
[BATCH PROCESSING]
CREATING EMR CLUSTER
• Create an «AWS EMR» cluster
• Wait until it becomes active and ready to use
Then;
• Stored tweet data can be processed as batch
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-batchprocessing-emr
60
[BATCH PROCESSING]
CREATING TABLE TO STORE BATCH RESULT
• Create an «AWS DynamoDB» table to save batch results
• Wait until it becomes active and ready to use
Then;
• Analysis results can be saved into this table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-batchprocessing-
dynamodb
61
[BATCH PROCESSING]
PROCESSING TWEETS VIA HADOOP
• Build&upload «bigdata-demo-batchprocessing-hadoop» to «AWS S3»
• Run uploaded jar as «Custom JAR» step on «AWS EMR» cluster
• Wait until step has completed
Then;
• Processed tweet data is on «AWS S3» bucket
• Analysis results are in «AWS DynamoDB» table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-batchprocessing-hadoop
62
[BATCH PROCESSING]
QUERYING TWEETS VIA HIVE
• Upload Hive query file to «AWS S3»
• Run uploaded query as «Hive» step on «AWS EMR» cluster
• Wait until step has completed
Then;
• Query results are dumped to «AWS S3»
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-batchprocessing-hive
63
[BATCH PROCESSING]
SCHEDULING WITH DATA PIPELINE
• Import «AWS Data Pipeline» definition
• Activate pipeline
Then;
• Every day, all batch processing jobs (Hadoop + Hive) will be
executed automatically
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-batchprocessing-
datapipeline
64
STREAM PROCESSING
65
[STREAM PROCESSING]
CREATING REALTIME STREAM
• Create an «AWS Kinesis» stream to push tweets to be analyzed
• Wait until it becomes active and ready to use
Then;
• Captured tweet data can be pushed into this stream to be analyzed
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-streamprocessing-
kinesis
66
[STREAM PROCESSING]
PRODUCING STREAM DATA
• Configure stream name to be used by producer application
• Build «bigdata-demo-streamprocessing-kinesis-producer»
• Deploy into «AWS Elastic Beanstalk»
Then;
• Application listens to tweets via «Twitter Streaming API»
• Captured tweets are pushed into «AWS Kinesis» stream
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-streamprocessing-
kinesis-producer
67
[STREAM PROCESSING]
CREATING TABLE TO SAVE STREAM RESULT
• Create an «AWS DynamoDB» table to save realtime results
• Wait until it becomes active and ready to use
Then;
• Analysis results can be saved into this table
• Instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-streamprocessing-
dynamodb
68
[STREAM PROCESSING]
CONSUMING STREAM DATA
• Configure
• «AWS Kinesis» stream name for consuming tweets from
• «AWS DynamoDB» table name for saving results into
• Build «bigdata-demo-streamprocessing-consumer» application
• Deploy into «AWS Elastic Beanstalk»
Then;
• Application consumes tweets from «AWS Kinesis» stream
• Sentiment analysis is applied to tweets via «Stanford NLP»
• Analysis results are saved into «AWS DynamoDB» table in realtime
• Source code and instructions:
github.com/serkan-ozal/ankaracloudmeetup-bigdata-
demo/tree/master/bigdata-demo-streamprocessing-
kinesis-consumer 69
ANALYZING
70
[ANALYZING]
ANALYTICS VIA QUICKSIGHT
• Create a data source from «AWS S3»
• Create an analysis based on the data source from «AWS S3»
• Wait until data import and analysis have completed
Then;
• You can do your
 Query
 Filter
 Visualization
 Story (flow of visualizations, like a presentation)
stuff on your dataset and share them
71
THE BIG PICTURE
72
73
END OF DEMO
74
THANKS!!!
75

Contenu connexe

Tendances

Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 

Tendances (20)

Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
 

Similaire à Big data on aws

Similaire à Big data on aws (20)

AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
REDSHIFT - Amazon
REDSHIFT - AmazonREDSHIFT - Amazon
REDSHIFT - Amazon
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
Data Warehousing and Analytics on Redshift and EMR
Data Warehousing and Analytics on Redshift and EMRData Warehousing and Analytics on Redshift and EMR
Data Warehousing and Analytics on Redshift and EMR
 
Best of re:Invent
Best of re:InventBest of re:Invent
Best of re:Invent
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
AWS Webcast - Website Hosting in the Cloud
AWS Webcast - Website Hosting in the CloudAWS Webcast - Website Hosting in the Cloud
AWS Webcast - Website Hosting in the Cloud
 
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
AWS 101.pptx
AWS 101.pptxAWS 101.pptx
AWS 101.pptx
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Cloud Service.pptx
Cloud Service.pptxCloud Service.pptx
Cloud Service.pptx
 
Create cloud service on AWS
Create cloud service on AWSCreate cloud service on AWS
Create cloud service on AWS
 
Real-time Analytics with Open-Source
Real-time Analytics with Open-SourceReal-time Analytics with Open-Source
Real-time Analytics with Open-Source
 

Plus de Serkan Özal (7)

Flying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS LambdaFlying Server-less on the Cloud with AWS Lambda
Flying Server-less on the Cloud with AWS Lambda
 
MySafe
MySafeMySafe
MySafe
 
Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...Improving performance of decision support queries in columnar cloud database ...
Improving performance of decision support queries in columnar cloud database ...
 
JVM Under the Hood
JVM Under the HoodJVM Under the Hood
JVM Under the Hood
 
Ankara JUG Big Data Presentation
Ankara JUG Big Data PresentationAnkara JUG Big Data Presentation
Ankara JUG Big Data Presentation
 
AWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map ReduceAWS EMR - Amazon Elastic Map Reduce
AWS EMR - Amazon Elastic Map Reduce
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 

Dernier

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 

Dernier (20)

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...

Big data on aws

  • 1. Ankara Cloud Meetup – July 2016 Serkan ÖZAL BIG DATA ON AWS
  • 2. AGENDA • What is Big Data? • Big Data Concepts • Big Data on AWS • Data Storage on AWS • Data Analytics & Querying on AWS • Data Processing on AWS • Data Flow on AWS • Demo 2
  • 3. 3
  • 4. WHAT IS BIG DATA??? 4
  • 5. 5
  • 6. 4V OF BIG DATA 6
  • 8. BIG DATA CONCEPTS • Scalability • Cloud Computing • Data Storage • Data Analytics & Querying • Data Processing • Data Flow 8
  • 9. BIG DATA ON AWS 9
  • 10. AWS BIG DATA PORTFOLIO 10
  • 11. SCALABILITY ON AWS • Types of scalability: • Vertical Scalability: More/Less CPU, RAM, ... • Horizontal Scalability: More/Less machines • Auto scalable by • CPU • Memory • Network • Disk • … 11
  • 12. CLOUD COMPUTING ON AWS • Rich service catalog • «Pay as you go» payment model • Cloud Computing models:  IaaS (Infrastructure as a Service) o EC2, S3  PaaS (Platform as a Service) o Elastic Beanstalk  SaaS (Software as a Service) o DynamoDB, ElasticSearch 12
  • 13. DATA STORAGE ON AWS 13
  • 14. DATA STORAGE ON AWS • There are many different data storage services for different purposes • Key features for storage: • Data storage services on AWS:  S3  Glacier 14  Durability  Security  Cost  Scalability  Access Performance  Availability  Reliability
  • 15. S3 • Secure, durable and highly available storage service • Bucket: Containers for objects  Globally unique regardless of the AWS region • Object: Stored entries  Uniquely identified within a bucket by a name and a version ID • Supports versioning • WORM (Write once, read many) • Consistency models  «read-after-write» consistency for puts of new objects  «eventual consistency» for puts/deletes of existing objects  Updates to a single object are atomic. 15
  • 16. GLACIER • Storage service optimized for infrequently used data • Archives are stored in containers named «vaults» • Uploading is a synchronous operation with replicas • Downloading is an asynchronous operation through a retrieval job • Immutable • Encrypted by default • «S3» data can be archived into «Glacier» by life-cycle rules • 4x cheaper than «S3» 16
  • 18. DATA ANALYTICS & QUERYING ON AWS • There are different data analytics & querying services for different purposes.  NoSQL  SQL  Text search  Analytics / Visualizing • Data analytics & querying services on AWS: 18  DynamoDB  Redshift  RDS  Elasticsearch (+ Kibana)  CloudSearch  QuickSight
  • 19. DYNAMODB • Key-value or document storage based NoSQL database • Consistency model • Eventual consistency by default • Strong consistency as optional • Supports • «DynamoDB Streams» for tracking changes in near real time • Integrated with 19  EMR  Elasticsearch  Redshift  Lambda  Secondary indexes  Batch operations  Conditional writes  Atomic counters
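The «Conditional writes» and «Atomic counters» features on this slide can be illustrated with a toy in-memory model. This is plain Python, not the real SDK; in the actual DynamoDB API these correspond to `PutItem` with a `ConditionExpression` and `UpdateItem` with an `ADD` action, and the function names below are made up for illustration.

```python
# Toy in-memory model of two DynamoDB features; NOT the real SDK.
table = {}

def put_item(key, item, expect_absent=False):
    # Conditional write: reject if the key already exists.
    if expect_absent and key in table:
        raise ValueError("ConditionalCheckFailed")
    table[key] = dict(item)

def add_counter(key, attr, delta):
    # Atomic counter: a single increment instead of a
    # read-modify-write cycle that could race.
    item = table.setdefault(key, {})
    item[attr] = item.get(attr, 0) + delta
    return item[attr]

put_item("tweet#1", {"text": "hello"}, expect_absent=True)
assert add_counter("tweet#1", "likes", 1) == 1
assert add_counter("tweet#1", "likes", 2) == 3
```

The point of both features is the same: the check-and-write happens in one server-side operation, so concurrent clients cannot interleave between the check and the write.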
  • 20. REDSHIFT • SQL data warehouse solution • Fault tolerant by  Replicating to other nodes in the cluster  Backing up to «S3» continuously and automatically • Scalable  Vertically by allowing switching to new cluster  Horizontally by adding new nodes to cluster • Integrated with 20  S3  EMR  DynamoDB  Kinesis  Data Pipeline
  • 21. RDS • Makes it easier to set up, operate and scale a relational database in the cloud • Manages backups, software patching, automatic failure detection and recovery • Supports  MySQL  PostgreSQL  Oracle  Microsoft SQL Server  Aurora (Amazon’s MySQL-compatible RDBMS with up to 5x performance) 21
  • 22. ELASTICSEARCH • Makes it easy to deploy, operate and scale «Elasticsearch» in the AWS cloud • Supports  Taking snapshots for backup  Restoring from backups  Replicating domains across Availability Zones • Integrated with  «Kibana» for data visualization  «Logstash» to process logs and load them into «ES»  «S3», «Kinesis» and «DynamoDB» for loading data into «ES» 22
  • 23. CLOUDSEARCH • AWS’s managed and high-performance search solution • Reliable by replicating the search domain into multiple AZs • Automatic monitoring and recovery • Auto scalable • Supports popular search features in 34 languages, such as 23  Free-text search  Faceted search  Highlighting  Autocomplete suggestion  Geospatial search  …
  • 24. QUICKSIGHT • AWS’s new very fast, cloud-powered business intelligence (BI) service • Uses a new, Super-fast, Parallel, In-memory Calculation Engine («SPICE») to perform advanced calculations and render visualizations rapidly • Supports creating analysis stories and sharing them • Integrated/Supported data sources: • Claimed to be 1/10th the cost of traditional BI solutions • Native access on major mobile platforms 24  EMR  Kinesis  S3  DynamoDB  Redshift  RDS  3rd party (e.g. Salesforce)  Local file upload  …
  • 26. DATA PROCESSING ON AWS • Allows processing of huge amounts of data in a highly scalable, distributed way • Supports both batch and stream data processing concepts • Data processing services on AWS:  EMR (Elastic Map-Reduce)  Kinesis  Lambda  Machine Learning 26
  • 27. EMR (ELASTIC MAPREDUCE) • Big data service/infrastructure to process large amounts of data efficiently in a highly scalable manner • Supported Hadoop distributions:  Amazon  MapR • Low cost by «on-demand», «spot» or «reserved» instances • Reliable by retrying failed tasks and automatically replacing poorly performing or failed instances • Elastic by resizing the cluster on the fly • Integrated with 27  S3  DynamoDB  Data Pipeline  Redshift  RDS  Glacier  VPC  …
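The MapReduce model that EMR runs at cluster scale can be sketched locally. Below is a single-process word count in the spirit of Hadoop Streaming; on EMR the same map / shuffle / reduce phases would run distributed over the slave nodes against HDFS or EMRFS input splits.

```python
from collections import defaultdict

# Single-process word count in the Hadoop Streaming style.
def mapper(line):
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle phase: group values by key, as the framework does
    # between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the counts for one key.
    return key, sum(values)

lines = ["big data on aws", "big data concepts"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

The demo's Hadoop step applies this same pattern to the stored tweets, only with the phases split into mapper and reducer tasks the cluster schedules independently.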
  • 28. EMR – CLUSTER COMPONENTS • Master Node  Tracks, monitors and manages the cluster  Runs the «YARN Resource Manager» and the «HDFS Name Node Service» • Core Node [Slave Node]  Runs tasks and stores data in the HDFS  Runs the «YARN Node Manager Service» and «HDFS Data Node Daemon» service • Task Node [Slave Node] • Only runs the tasks • Runs the «YARN Node Manager Service» 28
  • 29. EMR – BEST PRACTICES FOR INSTANCE GROUPS 29 • Long-running clusters: Master On-Demand, Core On-Demand, Task Spot • Cost-driven workloads: Master Spot, Core Spot, Task Spot • Data-critical workloads: Master On-Demand, Core On-Demand, Task Spot • Application testing: Master Spot, Core Spot, Task Spot
  • 30. EMR – CLUSTER LIFECYCLE 30
  • 31. EMR - STORAGE • The storage layer includes the different file systems for different purposes as follows:  HDFS (Hadoop Distributed File System)  EMRFS (Elastic MapReduce File System)  Local Storage 31
  • 32. EMR - HDFS • Distributed, scalable file system for Hadoop • Stores multiple copies to prevent data loss on failure • It is ephemeral storage because it is reclaimed when the cluster terminates • Used by the master and core nodes • Useful for intermediate results during Map-Reduce processing or for workloads which have significant random I/O • Prefix: «hdfs://» 32
  • 33. EMR - EMRFS • An implementation of HDFS which allows clusters to store data on «AWS S3» • Eventually consistent by default on top of «AWS S3» • Consistent view can be enabled with the following supports:  Read-after-write consistency  Delete consistency  List consistency • Client/Server side encryption • Most often, «EMRFS» (S3) is used to store input and output data and intermediate results are stored in «HDFS» • Prefix: «s3://» 33
  • 34. EMR – LOCAL STORAGE • Refers to an ephemeral volume directly attached to an EC2 instance • Ideal for storing temporary data that is continually changing, such as  Buffers  Caches  Scratch data  And other temporary content • Prefix: «file://» 34
  • 35. EMR – DATA PROCESSING FRAMEWORKS 35 • Hadoop • Spark • Hive • Pig • HBase • HCatalog • Zookeeper • Presto • Impala • Sqoop • Oozie • Mahout • Phoenix • Tez • Ganglia • Hue • Zeppelin • …
  • 36. KINESIS • Collects and processes large streams of data records in real time • Reliable by synchronously replicating streaming data across facilities in an AWS Region • Elastic by increasing shards based on the volume of input data • Has  S3  DynamoDB  ElasticSearch connectors • Also integrated with «Storm» and «Spark» 36
  • 37. KINESIS – HIGH LEVEL ARCHITECTURE 37
  • 38. KINESIS – KEY CONCEPTS[1] • Streams  Ordered sequence of data records  All data is stored for 24 hours (can be increased to 7 days) • Shards  It is a uniquely identified group of data records in a stream  Kinesis streams are scaled by adding or removing shards  Provides 1MB/sec data input and 2MB/sec data output capacity  Supports up to 1000 TPS (put records per second) • Producers  Put records into streams • Consumers  Get records from streams and process them 38
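The per-shard limits on this slide (1 MB/s in, 2 MB/s out, 1000 put records/s) give a simple sizing rule of thumb. The helper below is my own back-of-envelope formula, not an official AWS calculator: a stream needs enough shards to satisfy the tightest of the three limits.

```python
import math

# Rule-of-thumb shard sizing from the per-shard limits above.
def shards_needed(in_mb_per_s, out_mb_per_s, records_per_s):
    return max(
        math.ceil(in_mb_per_s / 1.0),       # 1 MB/s ingest per shard
        math.ceil(out_mb_per_s / 2.0),      # 2 MB/s egress per shard
        math.ceil(records_per_s / 1000.0),  # 1000 puts/s per shard
    )
```

For example, a workload with 5 MB/s in, 8 MB/s out and 3500 puts/s needs max(5, 4, 4) = 5 shards; ingest is the binding limit.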
  • 39. KINESIS – KEY CONCEPTS[2] • Partition Key  Used to group data by shard within a stream  Used to determine which shard a given data record belongs to • Sequence Number  Each record has a unique sequence number within its shard  Increases over time  Specific to a shard within a stream, not across all shards • Record  Unit of data stored  Composed of <partition-key, sequence-number, data-blob>  Data size can be up to 1 MB  Immutable 39
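How a partition key selects a shard can be sketched from the documented scheme: the MD5 hash of the key is treated as a 128-bit integer and matched against each shard's hash-key range. The even split of the key space and the big-endian byte order below are illustrative assumptions, not guaranteed details of the service.

```python
import hashlib

# Sketch: MD5(partition key) as a 128-bit integer mapped to a shard
# by hash-key range. Ranges here evenly split the 2**128 key space.
def shard_ranges(shard_count):
    step = 2**128 // shard_count
    return [(i * step, (i + 1) * step - 1) for i in range(shard_count)]

def shard_for(partition_key, ranges):
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    for i, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return i
    return len(ranges) - 1  # top of the key space lands on the last shard

ranges = shard_ranges(4)
```

The useful property is determinism: the same partition key always hashes to the same shard, which is what makes per-key ordering by sequence number possible.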
  • 40. KINESIS – GETTING DATA IN • Producers use «PUT» call to store data in a stream • A partition key is used to distribute the «PUT»s across shards • A unique sequence # is returned to the producer for each event • Data can be ingested at 1MB/second or 1000 Transactions/second per shard 40
  • 41. KINESIS – GETTING DATA OUT • Kinesis Client Library (KCL) simplifies reading from the stream by abstracting you from individual shards • Automatically starts a worker thread for each shard • Increases and decreases thread count as number of shards changes • Uses checkpoints to keep track of a thread’s location in the stream • Restarts threads & workers if they fail 41
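The checkpointing idea the KCL uses can be sketched as follows. This is a toy, not the real library (which persists checkpoints in a DynamoDB table and coordinates workers via leases): the worker records how far it got in each shard, so a restarted worker resumes instead of reprocessing from the beginning.

```python
# Toy model of per-shard checkpointing, in the spirit of the KCL.
checkpoints = {}  # shard_id -> last processed sequence number

def consume(shard_id, records):
    # Resume from the position after the last checkpoint.
    start = checkpoints.get(shard_id, -1) + 1
    processed = []
    for seq in range(start, len(records)):
        processed.append(records[seq])
        checkpoints[shard_id] = seq  # checkpoint after each record
    return processed

stream = ["r0", "r1", "r2", "r3"]
assert consume("shard-0", stream[:2]) == ["r0", "r1"]  # first run
assert consume("shard-0", stream) == ["r2", "r3"]      # restart resumes
```

A real consumer typically checkpoints in batches rather than per record, trading a little reprocessing on failure for fewer checkpoint writes.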
  • 42. LAMBDA • Compute service that runs your uploaded code on your behalf using AWS infrastructure • Stateless • Auto-scalable • Pay per use • Use cases  As an event-driven compute service triggered by actions on «S3», «Kinesis», «DynamoDB», «SNS», «CloudWatch»  As a compute service to run your code in response to HTTP requests using «AWS API Gateway» • Supports «Java», «Python» and «Node.js» languages 42
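A Lambda-style handler for an S3 trigger (the first use case above, and the pattern the demo's indexing function uses) can be sketched and invoked locally with a hand-built event. The event shape follows the S3 notification format; the bucket name and key below are made-up sample values.

```python
import urllib.parse

# Minimal Lambda-style handler for an S3 trigger, invoked locally.
def handler(event, context=None):
    processed = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in real S3 notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # A real handler would fetch the object here and, in the demo,
        # index its tweets into the search domain.
        processed.append((bucket, key))
    return {"processed": len(processed), "keys": processed}

sample_event = {"Records": [
    {"s3": {"bucket": {"name": "tweets-bucket"},
            "object": {"key": "2016/07/tweets-0001.json"}}},
]}
result = handler(sample_event)
```

Because the handler is just a function of an event dict, it can be unit-tested this way before it is ever deployed.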
  • 43. MACHINE LEARNING • Machine learning service for building ML models and generating predictions • Provides implementations of common ML data transformations • Supports industry-standard ML algorithms such as  binary classification  multi-class classification  regression • Supports batch and real-time predictions • Integrated with • S3 • RDS • Redshift 43
  • 44. DATA FLOW ON AWS 44
  • 45. DATA FLOW ON AWS • Allows transferring data between data storage/processing points • Data flow services on AWS:  Firehose  Data Pipeline  DMS (Database Migration Service)  Snowball 45
  • 46. FIREHOSE • Makes it easy to capture and load massive volumes of streaming data into AWS • Supported endpoints  S3  Redshift  Elasticsearch • Supports buffering  By size: Ranges from 1 MB to 128 MB. Default 5 MB  By interval: Ranges from 60 seconds to 900 seconds. Default 5 minutes • Supports encryption and compression 46
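The size-or-interval buffering rule above can be modeled with a small toy: flush when the buffer reaches the size hint or when the interval elapses, whichever comes first. The thresholds here are tiny and the clock is injected so the behavior is deterministic; real Firehose buffers megabytes and minutes, as listed on the slide.

```python
# Toy model of Firehose-style buffering: flush on size OR interval.
class Buffer:
    def __init__(self, size_bytes, interval_s, clock):
        self.size_bytes, self.interval_s, self.clock = size_bytes, interval_s, clock
        self.records, self.bytes, self.started = [], 0, clock()
        self.flushed = []  # each entry is one delivered batch

    def put(self, record):
        self.records.append(record)
        self.bytes += len(record)
        if (self.bytes >= self.size_bytes
                or self.clock() - self.started >= self.interval_s):
            self.flush()

    def flush(self):
        if self.records:
            self.flushed.append(list(self.records))
        self.records, self.bytes, self.started = [], 0, self.clock()

now = [0]
buf = Buffer(size_bytes=10, interval_s=300, clock=lambda: now[0])
buf.put(b"aaaa")     # 4 bytes: held
buf.put(b"bbbbbbb")  # 11 bytes total: size-triggered flush
assert len(buf.flushed) == 1
now[0] = 301
buf.put(b"c")        # interval elapsed: time-triggered flush
assert len(buf.flushed) == 2
```

Buffering like this is why Firehose deliveries to S3 arrive as periodic batch objects rather than one object per record.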
  • 47. DATA PIPELINE • Automates the movement and transformation of data • Composed of workflows of tasks • The flow can be scheduled • Has a visual designer with drag-and-drop support • The definition can be exported/imported • Supports custom activities and preconditions • Retries the flow in case of failure and sends notifications • Integrated with 47  EMR  S3  DynamoDB  Redshift  RDS
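The retry-then-notify behavior described above can be sketched generically. The names below are illustrative, not the Data Pipeline API; the real service runs retries per activity with a configurable attempt count and can notify through SNS on final failure.

```python
# Generic sketch of retry-then-notify for a pipeline activity.
def run_with_retries(activity, max_attempts, notify):
    last_error = None
    for _ in range(max_attempts):
        try:
            return activity()
        except Exception as exc:
            last_error = exc
    notify("activity failed after %d attempts: %s" % (max_attempts, last_error))
    return None

calls = {"n": 0}
def flaky():
    # Fails twice, then succeeds, like a transient network error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

messages = []
assert run_with_retries(flaky, 5, messages.append) == "ok"
assert messages == []  # success on retry: no notification sent
```

The key design point is that notification fires only after the retry budget is exhausted, so transient failures stay invisible to operators.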
  • 48. DATABASE MIGRATION SERVICE • Helps to migrate databases into AWS easily and securely • Supports  Homogeneous migrations (e.g. Oracle <=> Oracle)  Heterogeneous migrations (e.g. Oracle <=> PostgreSQL) • Also used for: • Database consolidation • Continuous data replication • Keep your applications running during migration  Start a replication instance  Connect source and target databases  Let «DMS» load data and keep them in sync  Switch applications over to the target at your convenience 48
  • 49. SNOWBALL • Import/Export service that accelerates transferring large amounts of data into and out of AWS using physical storage appliances, bypassing the Internet • Data is transported securely as encrypted • Data is transferred via 10G connection • Has 50TB or 80TB storage options • Steps: 49  Job Create  Processing  In transit to you  Delivered to you  In transit to AWS  At AWS  Importing  Completed
  • 51. Source code and Instructions are available at github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo 51
  • 52. STEPS OF DEMO • Storing • Searching • Batch Processing • Stream Processing • Analyzing 52
  • 54. [STORING] CREATING DELIVERY STREAM • Create an «AWS S3» bucket to store tweets • Create an «AWS Firehose» stream to push tweets into for storage • Attach the created bucket as an endpoint to the stream Then; • Captured tweet data can be pushed into this stream to be stored • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-firehose 54
  • 55. [STORING] CRAWLING TWEETS • Configure the stream name to be used by the crawler application • Build the «bigdata-demo-storing-crawler» application • Deploy into «AWS Elasticbeanstalk» Then; • The application listens to tweets via the «Twitter Streaming API» • The application pushes tweet data into the «AWS Firehose» stream • The «AWS Firehose» stream stores tweets in the «AWS S3» bucket • Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-storing-crawler 55
  • 57. [SEARCHING] CREATING SEARCH DOMAIN • Create an «AWS Elasticsearch» domain • Wait until it becomes active and ready to use Then; • Tweet data can also be indexed to be searched • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-elasticsearch 57
  • 58. [SEARCHING] INDEXING INTO SEARCH DOMAIN • Create an «AWS Lambda» function with the given code • Configure the created «AWS S3» bucket as a trigger • Wait until it becomes active and ready to use Then; • Stored tweet data will also be indexed to be searched • Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-searching-lambda 58
  • 60. [BATCH PROCESSING] CREATING EMR CLUSTER • Create an «AWS EMR» cluster • Wait until it becomes active and ready to use Then; • Stored tweet data can be processed in batch • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-emr 60
  • 61. [BATCH PROCESSING] CREATING TABLE TO STORE BATCH RESULTS • Create an «AWS DynamoDB» table to save batch results • Wait until it becomes active and ready to use Then; • Analysis results can be saved into this table • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-dynamodb 61
  • 62. [BATCH PROCESSING] PROCESSING TWEETS VIA HADOOP • Build & upload «bigdata-demo-batchprocessing-hadoop» to «AWS S3» • Run the uploaded jar as a «Custom JAR» step on the «AWS EMR» cluster • Wait until the step has completed Then; • Processed tweet data is in the «AWS S3» bucket • Analysis results are in the «AWS DynamoDB» table • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hadoop 62
  • 63. [BATCH PROCESSING] QUERYING TWEETS VIA HIVE • Upload the Hive query file to «AWS S3» • Run the uploaded query as a «Hive» step on the «AWS EMR» cluster • Wait until the step has completed Then; • Query results are dumped to «AWS S3» • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-hive 63
  • 64. [BATCH PROCESSING] SCHEDULING WITH DATA PIPELINE • Import the «AWS Data Pipeline» definition • Activate the pipeline Then; • Every day, all batch processing jobs (Hadoop + Hive) will be executed automatically • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-batchprocessing-datapipeline 64
  • 66. [STREAM PROCESSING] CREATING REALTIME STREAM • Create an «AWS Kinesis» stream to push tweets into for analysis • Wait until it becomes active and ready to use Then; • Captured tweet data can be pushed into this stream to be analyzed • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis 66
  • 67. [STREAM PROCESSING] PRODUCING STREAM DATA • Configure the stream name to be used by the producer application • Build the «bigdata-demo-streamprocessing-kinesis-producer» application • Deploy into «AWS Elasticbeanstalk» Then; • The application listens to tweets via the «Twitter Streaming API» • Captured tweets are pushed into the «AWS Kinesis» stream • Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-producer 67
  • 68. [STREAM PROCESSING] CREATING TABLE TO SAVE STREAM RESULTS • Create an «AWS DynamoDB» table to save realtime results • Wait until it becomes active and ready to use Then; • Analysis results can be saved into this table • Instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-dynamodb 68
  • 69. [STREAM PROCESSING] CONSUMING STREAM DATA • Configure • the «AWS Kinesis» stream name to consume tweets from • the «AWS DynamoDB» table name to save results into • Build the «bigdata-demo-streamprocessing-consumer» application • Deploy into «AWS Elasticbeanstalk» Then; • The application consumes tweets from the «AWS Kinesis» stream • Sentiment analysis is applied to tweets via «Stanford NLP» • Analysis results are saved into the «AWS DynamoDB» table in realtime • Source code and instructions: github.com/serkan-ozal/ankaracloudmeetup-bigdata-demo/tree/master/bigdata-demo-streamprocessing-kinesis-consumer 69
  • 71. [ANALYZING] ANALYTICS VIA QUICKSIGHT • Create a data source from «AWS S3» • Create an analysis based on the data source from «AWS S3» • Wait until data import and analysis have completed Then; • You can do your  Query  Filter  Visualization  Story (a flow of visualizations, like a presentation) stuff on your dataset and share them 71
  • 73. 73