Big Data analytics is well known to uncover hidden insights that give an organization an edge over the competition. But data does not need to be big in order to be useful. Smaller companies and startups may lack the volume of data that qualifies as big data, yet the variety of their data can still yield a trove of insights that drive a company's business strategies. Startups may also lack the resources to fund an additional, seemingly expensive development project. The key is simplicity: start small and simple, and architect for scalability and performance. But how do you start? In this presentation, we share our experience in building a cost-effective, AWS serverless data analytics platform that became an invaluable tool for sales, marketing, and operational efficiencies.
Serverless architectures simplify development work: servers and software are managed by a third-party cloud provider. Developers can focus on building just the data wrangling and data analysis logic, while critical aspects like scalability and high availability are guaranteed by the cloud provider. Serverless services also offer a pay-as-you-go model, where you pay only for the resources you use, so costs can be managed based on usage. In this presentation we focus on techniques and best practices for building a big data analytics platform using AWS serverless services like Lambda, DynamoDB, S3, Kinesis, Athena, QuickSight, and Amazon ML. We highlight the strengths of each of these services and the role each plays in the data analytics pipeline. We compare and contrast these services with some of the other popularly used big data technologies like Hadoop, Spark, and Kafka. We also demonstrate using these services to build intelligent components that detect anomalies, yield recommendations, simulate chat bots, and generate predictive analytics.
2. Building Serverless Data
Pipelines in the Cloud
Manisha Sule
Director of Big Data Analytics, Linux Academy.
Board Member on SMU’s Big Data Advisory Board.
linkedin.com/in/manisha-sule
@tweetDataS
3. Agenda
1. What is serverless?
2. Big Data architectures and best practices
3. AWS Serverless services:
Lambda
Kinesis (Streams, Firehose, Analytics)
DynamoDB
S3
Athena
4. Analytics for CloudAssessments.com
4. What is Serverless?
Source: https://www.slideshare.net/CodeOps/serverless-architecture-a-gentle-overview
5. Serverless architectures
Depend on third-party services, known as Backend as a Service (BaaS).
Distributed systems that react to events and triggers.
Dynamically scale based on demand.
Utilize ephemeral (short-lived) containers or computational resources in the cloud.
6. Advantages of Serverless
Fully managed: the cloud provider manages the servers.
Highly available and scalable, with no provisioning needed and zero administration.
Not just compute containers: also includes NoSQL databases, interactive query services, storage services, and messaging services.
Cost efficient: never pay for idle time.
Support for continuous integration / continuous delivery pipelines.
Developers can focus on architecture and code only.
Gartner terms this fPaaS and lists several use cases: utility logic, scheduled processing, event-driven architecture, microservices, and full-blown applications.
7. AWS Serverless Application Model
Template-based mechanism for defining and deploying serverless applications.
Source: AWS Tech Talk Webinar
8. Big Data Lambda architecture
Requirements of Big Data architectures:
1. Processing real-time streams.
2. Processing batch data.
3. Real-time ETL.
4. Enriching real-time data with batch data.
5. Queries must be answerable using both batch data and real-time data.
9. Big Data best practices
1. Build a decoupled architecture: decouple the data->store->process->store steps.
2. Use the right tools: consider latency, throughput, access patterns, and data structures.
3. Be cost effective: big data, not big cost.
10. AWS Managed vs Serverless services
Managed services: you still need to manage servers, their scale, their location, software updates, etc.
• Elastic MapReduce: managed Hadoop framework, includes Apache Spark, Zeppelin, HBase, Flink, etc.
• Elasticsearch Service: for log analytics, full-text search, application monitoring, and more. Fully integrated with Kibana and Logstash.
• Redshift: fully managed data warehouse, to analyze data and integrate with BI tools.
• RDS: database service to set up, operate, and scale a database in the cloud.
Serverless services: automatically available in all availability zones in the region, set at the regional level in the AWS infrastructure. Highly available and fault tolerant.
• Lambda, Kinesis, S3, DynamoDB, Athena, API Gateway, CloudWatch, QuickSight, IoT, Cognito, SQS
11. AWS Lambda
• Heart of serverless architecture patterns.
• Stateless, event driven code. Supports Node.js, Python, Java, C#.
• No infrastructure to manage.
• No risk of over-provisioning or under-provisioning; don't pay for idle time.
• Logging and operation monitoring is in-built.
• Efficient performance at scale. If a thousand requests come in, it scales automatically.
• Lets you skip the boring and the hard parts. Easy to author and deploy, so you can focus on business logic.
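To show how little scaffolding a Lambda function needs, here is a minimal Python handler sketch. The event shape (a `records` list of ids and values) is a hypothetical example payload, not an AWS-defined format; only the `(event, context)` handler signature comes from Lambda itself.

```python
import json

def handler(event, context):
    """Minimal AWS Lambda handler sketch: the runtime calls this
    function once per invocation, passing the triggering event."""
    records = event.get("records", [])
    # Illustrative business logic: square each incoming value.
    processed = [{"id": r["id"], "squared": r["value"] ** 2} for r in records]
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(processed), "items": processed}),
    }
```

Because the handler is a plain function, it can be unit-tested locally by calling it with a sample event before deploying.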
12. AWS Kinesis Streams
What is it? A high-throughput, low-latency service for real-time
data processing over large distributed data streams. Stores
streaming data for 24 hours, during which the data can be read,
processed, and stored in real time.
How to use it? Configure producer data sources to emit data
into the stream. Build consuming applications that read and
process data from that stream in real-time.
Applications: Real-time metrics and reporting. Extracting
metrics and generating KPIs to power reports and dashboards
at real-time speeds. Used for streaming data that needs custom
processing.
Why use it? Amazon Kinesis Streams has simple pay-as-you-go
pricing, with no up-front costs or minimum fees, and you'll
only pay for the resources you consume. Guarantees durability
and availability of data. Also maintains the order of data.
Source: https://www.slideshare.net/frodriguezolivera/aws-kinesis-streams
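A producer sketch for the "configure producer data sources" step above, using boto3's real `put_records` API. The stream name, sensor payload, and helper names are hypothetical; the partition-key behavior (all records with the same key land on the same shard, preserving their order) is how Kinesis provides the ordering guarantee mentioned above.

```python
import json

def make_record(sensor_id, reading):
    # One Kinesis record: the partition key routes all of a
    # sensor's events to the same shard, preserving their order.
    return {
        "Data": json.dumps({"sensor_id": sensor_id, "reading": reading}).encode(),
        "PartitionKey": sensor_id,
    }

def put_readings(stream_name, readings):
    import boto3  # AWS SDK; requires credentials configured at call time
    client = boto3.client("kinesis")
    return client.put_records(
        StreamName=stream_name,
        Records=[make_record(s, r) for s, r in readings],
    )
```

Consumers then read from the stream's shards (for example, a Lambda function with a Kinesis event source mapping) and process records in arrival order per shard.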
13. AWS Kinesis vs Kafka
Both are data ingest frameworks for streaming data with durability, reliability and scalability.
Differences:
1. Kafka is open source. User is responsible for managing, installing clusters.
2. Kinesis is a managed service by AWS and saves cost and effort in managing servers.
3. Kafka’s costs include DevOps engineers and storage and compute servers.
4. Kinesis being serverless, resource and human costs are much lower.
14. AWS Kinesis Firehose
What is it? Fully managed service that offers an easy to use solution to collect and deliver
streaming data to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
How to use it? Configure and use. No code needed.
Applications: Load streaming data into S3, Redshift, ElasticSearch that can connect to BI tools
for real time analysis. Unlike Kinesis streams, Firehose is used when data does not need
custom processing.
Why use it? Seamlessly scales to match data throughput without intervention.
15. AWS Kinesis Analytics
What is it? Fully managed service to process streaming data with SQL.
How to use it? Configure input stream, write queries and configure output stream.
Applications: Perform continual processing on streaming data.
Why use it? Pre-processing; basic analytics like aggregates and filtering; advanced analytics like
anomaly detection, alerting, and triggering.
16. AWS Kinesis: serverless stream processing
Kinesis Streams: With Lambda, allows stateless processing of data. Ingests from multiple
producers and delivers to multiple destinations. Needs management of scale using shards.
Kinesis Firehose: Transform streaming data with Lambda, with guaranteed delivery to S3,
Redshift, or Elasticsearch.
Kinesis Analytics: Stateful processing of streaming data, like aggregations over a time period.
When to use which approach?
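The Firehose-plus-Lambda transform mentioned above follows a fixed contract: Firehose invokes the function with base64-encoded records, and each record must come back with its original `recordId`, a `result` status, and re-encoded `data`. A sketch (the `processed` flag is an illustrative enrichment, not part of the contract):

```python
import base64
import json

def transform(event, context):
    # Firehose passes a batch of base64-encoded records; each must be
    # returned with the same recordId and a result of "Ok", "Dropped",
    # or "ProcessingFailed" so Firehose knows what to deliver.
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["processed"] = True  # illustrative enrichment
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": out}
```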
17. AWS DynamoDB
• Fully managed NoSQL Database that supports both key-value and document store models.
• Other than the primary key, the table is schemaless.
• Supports 32 levels of nested attributes.
• An in-memory cache allows response times to drop to microseconds.
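A sketch of the schemaless point above using DynamoDB's low-level item format, where every attribute is tagged with its type. The table name and key schema (`user_id` partition key, `quest` sort key) are hypothetical; `put_item` is the real boto3 call.

```python
def make_item(user_id, quest, score):
    # Low-level DynamoDB item: attributes are tagged with their type
    # ("S" = string, "N" = number serialized as a string). Only the
    # key attributes are fixed by the table; the rest can vary per item.
    return {
        "user_id": {"S": user_id},   # partition key (hypothetical schema)
        "quest": {"S": quest},       # sort key (hypothetical schema)
        "score": {"N": str(score)},  # numbers travel as strings
    }

def put_score(table_name, user_id, quest, score):
    import boto3  # AWS SDK; requires credentials at call time
    boto3.client("dynamodb").put_item(
        TableName=table_name, Item=make_item(user_id, quest, score)
    )
```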
18. AWS DynamoDB Stream processing
• Durability and high availability
• Managed streams
• Performant
• Native integration with Lambda.
Source: AWS Webinars
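The native Lambda integration means a function receives DynamoDB change records directly. A minimal handler sketch, assuming the standard stream event shape (`Records` with an `eventName` of INSERT, MODIFY, or REMOVE); the counting logic is purely illustrative.

```python
def stream_handler(event, context):
    # A DynamoDB Streams event carries change records; eventName is
    # INSERT, MODIFY, or REMOVE, with the before/after item images
    # available under each record's "dynamodb" key (omitted here).
    counts = {"INSERT": 0, "MODIFY": 0, "REMOVE": 0}
    for rec in event.get("Records", []):
        name = rec.get("eventName")
        if name in counts:
            counts[name] += 1
    return counts
```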
19. AWS S3
Object storage that provides highly reliable, secure, and scalable storage for all your data,
big or small. It is designed to deliver 99.999999999% durability and to scale past trillions of objects.
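In an analytics pipeline, S3 typically stores records in a flat, line-oriented layout that downstream query engines can read. A sketch using the real `put_object` call; the newline-delimited JSON layout and helper names are illustrative choices, though NDJSON is a format Athena can query directly.

```python
import json

def to_ndjson(records):
    # Newline-delimited JSON: one record per line, a flat layout
    # that query engines like Athena can read straight from S3.
    return "".join(json.dumps(r) + "\n" for r in records).encode()

def upload(bucket, key, records):
    import boto3  # AWS SDK; requires credentials at call time
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=to_ndjson(records))
```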
20. AWS Athena
Launched at AWS re:Invent, November 2016.
Interactive query service, to analyze data stored in S3 buckets.
Serverless, no infrastructure setup needed.
Pay only for the queries you run; $5 per terabyte scanned by the queries
Works with a variety of standard data formats, including CSV, JSON, ORC, and Parquet.
Uses Presto with full SQL support.
Ideal for quick ad-hoc querying as well as complex analysis.
Powers real time dashboards.
21. Linux Academy launches Cloud Assessments
(https://www.cloudassessments.com/)
1. Assess: Enroll in Quests (example: AWS CSA) and take assessments that test real-world AWS skills on live cloud environments.
2. Learn: Lean learning; based on your performance, you are presented a tailor-made learning path.
3. Earn: Earn proven skills and the ability to pass certification exams; earn badges and micro-certifications.
22. Linux Academy and AWS Partnership
Give nonprofit teams and individuals unlimited access to our entire library of cloud certification training
content to facilitate cloud building skills for all levels:
• More than 2,500 self-paced video courses
• 209 total hours of AWS course training
• 438 Linux training hours
• 105 OpenStack training hours
• More than 60 hands-on, scenario-based labs for AWS skill building
• Live AWS lab servers for practicing newly-acquired skills
• Quizzes, study guides, flash cards, study groups, and practice exams
23. Analytics for CloudAssessments.com
(https://www.cloudassessments.com/)
1. Descriptive Analytics: Dashboards with charts and graphs
• Historical views
• Real time views
2. Anomaly Detection: detect abuse of system, operational inefficiencies
3. Recommendation Engine: to provide custom tailor-made learning paths
4. Predictive analytics: Predict student performance
5. Chat bots: Virtual assistants for learning guidance.